3 minute read

How is open collaboration in open large language model (LLM) projects actually organized, and why does it matter? These are questions that Cailean Osborne, Jennifer Ding, Ben Burthenshaw, and I map and address in a new paper just released on arXiv. Drawing on interviews with developers from 14 open LLM initiatives across four continents (spanning grassroots, research institutes, startups, and big tech), we find that collaboration in open LLMs extends far beyond the models themselves. It encompasses datasets, benchmarks, open source frameworks, leaderboards, forums, and compute partnerships—revealing a much broader and more interconnected ecosystem than often assumed.

🏛️ Five Distinct Governance Models Our analysis identifies five distinct governance models shaping open LLM projects: single company projects, single research institute projects, multi-organizational research institute projects, non-profit-sponsored grassroots projects, and company-sponsored grassroots projects. These models differ in their approaches to centralization, collaboration, and community engagement, reflecting both the diversity and complexity of today’s open AI ecosystem and the need for organizational structures and processes to manage open collaboration on open LLMs at scale.

💡 What Drives Developer Engagement? Developers are motivated by a range of factors: from democratizing AI access and expanding language representation, to advancing open science and fostering regional innovation. Each governance model influences the motivations and incentives for participation, shaping collaboration patterns and sustainability.

🔄 Collaboration Across the LLM Lifecycle The lifecycle of open LLM development shows shifting collaboration patterns—from selective partnerships and resource pooling during pre-training, through targeted refinements in post-training, to broad community participation after release via platform-mediated interactions and derivative development. These dynamics demonstrate the need to move beyond narrow, model-centric views of “open source AI” and to recognize the layered, artifact-rich nature of true collaborative development.

📝 Recommendations for Stakeholders Based on these findings, we outline actionable recommendations for stakeholders:

  • Researchers and developers: Proactively engage with open communities, experiment with transparent “open lab” processes, focus efforts on underrepresented languages, and share best practices for safe and responsible development.
  • Industry: Invest in collaborative ecosystem development, contribute compute resources, clarify licensing and data provenance, and explore hybrid governance frameworks to balance oversight and community input.
  • Policymakers: Prioritize funding for public AI infrastructure, support inclusive regional initiatives, and require permissive licensing of public AI assets to maximize societal benefit.
  • Foundations and platform providers: Create pathways for non-code contributions, support diverse participation, and facilitate effective governance across multi-organizational projects.

For AI researchers and developers:

  • Proactively engage with open source AI projects and communities, and avoid reinventing the wheel.
  • Experiment with “open lab” approaches that make the entire research process transparent.
  • Build evaluation datasets to support targeted improvements in underrepresented languages.
  • Prioritize data quality improvements, especially for underrepresented languages.
  • Develop open source tools and best practices for AI safety.
  • Develop full-stack AI expertise beyond fine-tuning.
  • Foster synergies between AI infrastructure providers, AI researchers, and industry.
  • Ensure fair compensation for inclusive participation.
  • Default to using licenses for open LLMs that uphold the four freedoms of open source, as proposed by the Open Source AI Definition.

For AI companies:

  • Invest in ecosystems rather than competing alone.
  • Explore hybrid governance frameworks.
  • Invest in and collaborate on data provenance and licensing clarity.
  • Partner with and contribute to open source AI projects through compute donations.

For policymakers:

  • Fund public AI infrastructure, not just models.
  • Support public interest projects via compute donations or subsidies.
  • Support regional language and cultural representation through targeted initiatives.
  • Ensure “public money, public AI” principles with appropriate licensing.
  • Bridge the developer-policy gap on AI safety governance.
  • Promote a unified definition of open source AI.

For platform providers and foundations:

  • Enable multi-modal contribution pathways beyond code, including infrastructure for data annotation, evaluation feedback, documentation, and non-technical contributions.
  • Provide support structures for open LLM projects that span across artifacts, such as software, data, and models.

Each of these steps is critical for building a more open, inclusive, and sustainable future for collaborative AI.
