Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. Current methods, like text embeddings, fail to reliably preserve factual content, and simple paraphrasing may not be legally sound. We urge the community to adopt a new idea: convert scholarly documents into Knowledge Units using LLMs. These units use structured data capturing entities, attributes and relationships without stylistic content. We provide evidence that Knowledge Units (1) form a legally defensible framework for sharing knowledge from copyrighted research texts, based on legal analyses of German copyright law and U.S. Fair Use doctrine, and (2) preserve most (~95%) factual knowledge from original text, measured by MCQ performance on facts from the original copyrighted text across four research domains. Freeing scientific knowledge from copyright promises transformative benefits for scientific research and education by allowing language models to reuse important facts from copyrighted text. To support this, we share open-source tools for converting research documents into Knowledge Units. Overall, our work posits the feasibility of democratizing access to scientific knowledge while respecting copyright.
@article{schuhmann2025projectalexandria,
title={Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs},
author={Christoph Schuhmann and Gollam Rabby and Ameya Prabhu and Tawsif Ahmed and Andreas Hochlehnert and Huu Nguyen and Nick Akinci Heidrich and Ludwig Schmidt and Robert Kaczmarczyk and Sören Auer and Jenia Jitsev and Matthias Bethge},
journal={arXiv preprint arXiv:2502.19413},
year={2025}
}