Grobid (GeneRation Of BIbliographic Data) is a machine learning library for extracting, parsing, and restructuring raw documents such as PDFs into structured TEI-encoded documents, with a particular focus on technical and scientific publications.

Official repository on GitHub.

What you can do from this console:

  • Process — Extract structured data from PDFs (headers, full text, references) or parse raw text (dates, names, affiliations, citations)
  • Annotate PDF — Upload a PDF and get back an annotated version with visual overlays highlighting the extracted structures
  • Patents — Extract and annotate citations from patent documents
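The raw-text parsing listed above can also be driven from a script rather than the console. The following is a minimal sketch, assuming a Grobid server is running locally on port 8070 and that the `processCitation` service accepts a form-encoded `citations` field; the helper names (`endpoint_url`, `build_citation_request`, `parse_citation`) are hypothetical and not part of Grobid itself.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a Grobid server is reachable at this address.
GROBID_URL = "http://localhost:8070"


def endpoint_url(base_url: str, service: str) -> str:
    """Join the server base URL and a service name under /api/."""
    return f"{base_url.rstrip('/')}/api/{service}"


def build_citation_request(raw_citation: str, base_url: str = GROBID_URL) -> Request:
    """Prepare a form-encoded POST carrying one raw citation string."""
    body = urlencode({"citations": raw_citation}).encode("utf-8")
    return Request(
        endpoint_url(base_url, "processCitation"),
        data=body,
        headers={"Accept": "application/xml"},
        method="POST",
    )


def parse_citation(raw_citation: str, base_url: str = GROBID_URL) -> str:
    """Send the request and return the response body (TEI XML) as text."""
    request = build_citation_request(raw_citation, base_url)
    with urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Calling `parse_citation("...")` with a reference string requires a live server. PDF services such as full-text extraction expect a multipart file upload instead, which this sketch does not cover.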

Documentation:

🚀 Getting started: Installation, first steps, and basic usage
⚙️ How Grobid works: Architecture and processing pipeline overview
🔌 Web service API: REST API endpoints and client libraries
📄 Understanding the output (TEI): TEI is the XML format used by Grobid to structure extracted data
🐳 Running with Docker: Container-based deployment for easy setup
FAQ & Troubleshooting: Common questions and solutions
📚 Full documentation: Complete reference on readthedocs
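
Because the output is TEI XML, it can be read with any namespace-aware XML parser. The sketch below shows one way to pull a title and author surnames from a TEI header using Python's standard library. The sample document is hand-written for illustration, not real Grobid output, and the element paths are an assumption based on common TEI header layout; actual output may be organized differently.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only (not real Grobid output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">An Example Title</title></titleStmt>
      <sourceDesc><biblStruct><analytic>
        <author><persName>
          <forename type="first">Ada</forename><surname>Lovelace</surname>
        </persName></author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def header_summary(tei_xml: str) -> dict:
    """Return the title and author surnames found in a TEI header."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    surnames = [
        el.text
        for el in root.iterfind(".//tei:sourceDesc//tei:author//tei:surname", TEI_NS)
        if el.text
    ]
    return {"title": title.strip(), "authors": surnames}
```

Every path needs the `tei:` prefix because the elements are namespaced; an unprefixed search such as `.//title` would find nothing.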

Contact: Patrice Lopez and Luca Foppiano.

Extract structured data (headers, references, citations) from PDFs or raw text

Visualize extracted annotations overlaid on your PDF

Extract and annotate citations from patent documents
