Grobid Web Application

Grobid (GeneRation Of BIbliographic Data) is a machine learning library for extracting, parsing, and restructuring raw documents such as PDF into structured TEI-encoded documents, with a particular focus on technical and scientific publications.

Official repository on GitHub.

What you can do from this console:

Process — Extract structured data from PDFs (headers, full text, references) or parse raw text (dates, names, affiliations, citations)
Annotate PDF — Upload a PDF and get back an annotated version with visual overlays highlighting the extracted structures
Patents — Extract and annotate citations from patent documents
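
The raw-text parsing mentioned under Process can also be driven from a script instead of the console. The following is a minimal sketch, not the project's official client: it assumes a Grobid server reachable at http://localhost:8070, a service named processDate under /api/, and a form field named date. Check all three against the web service API documentation for your version. The helper names build_text_request and parse_text are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a locally running Grobid server; adjust host and port as needed.
GROBID_URL = "http://localhost:8070"


def build_text_request(service, field, text, base_url=GROBID_URL):
    """Build (but do not send) a form-encoded POST for a raw-text service.

    `service` and `field` are assumed names, e.g. "processDate" and "date".
    """
    body = urlencode({field: text}).encode("utf-8")
    return Request(
        f"{base_url}/api/{service}",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_text(service, field, text, base_url=GROBID_URL, timeout=30):
    """Send the request and return the response body as text."""
    request = build_text_request(service, field, text, base_url)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, parse_text("processDate", "date", "November 14 1999") would return whatever the service responds with. Building the request separately from sending it makes the URL and body easy to inspect without network access.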

Documentation:

🚀 Getting started	Installation, first steps, and basic usage
⚙️ How Grobid works	Architecture and processing pipeline overview
🔌 Web service API	REST API endpoints and client libraries
📄 Understanding the output (TEI)	TEI is the XML format used by Grobid to structure extracted data
🐳 Running with Docker	Container-based deployment for easy setup
❓ FAQ & Troubleshooting	Common questions and solutions
📚 Full documentation	Complete reference on readthedocs
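
Since the output is TEI XML, it can be read with any namespace-aware XML parser. The sketch below uses a small hand-written TEI fragment, not real Grobid output, to show how a main title and reference titles might be pulled out. The element paths are assumptions about a typical TEI layout; only the TEI namespace URI is standard. The function name extract_titles is hypothetical.

```python
import xml.etree.ElementTree as ET

# The TEI namespace is standard; the document below is an invented sample.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Title</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <back>
      <div type="references">
        <listBibl>
          <biblStruct xml:id="b0">
            <analytic><title level="a">First cited work</title></analytic>
          </biblStruct>
          <biblStruct xml:id="b1">
            <analytic><title level="a">Second cited work</title></analytic>
          </biblStruct>
        </listBibl>
      </div>
    </back>
  </text>
</TEI>"""


def extract_titles(tei_xml):
    """Return (main_title, [reference_titles]) from a TEI string."""
    root = ET.fromstring(tei_xml)
    # Assumed location of the document title in the TEI header.
    main = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    # Assumed location of article-level titles in the bibliography.
    refs = root.findall(
        ".//tei:listBibl/tei:biblStruct/tei:analytic/tei:title", TEI_NS
    )
    main_title = main.text if main is not None else None
    return main_title, [ref.text for ref in refs]


main_title, reference_titles = extract_titles(SAMPLE_TEI)
print(main_title)
print(reference_titles)
```

Every lookup has to go through the namespace mapping. Unqualified paths such as ".//title" would match nothing in a namespaced TEI document.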

Contact: Patrice Lopez and Luca Foppiano.

Extract structured data (headers, references, citations) from PDFs or raw text

Service to call

Options:

Consolidate header
Consolidate citations
Consolidate funders
Include raw affiliations
Include raw citations
Include raw copyrights
Segment sentences
Add coordinates
Flavor
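
When calling the web service directly, the options above would typically be sent as form fields alongside the uploaded file. The following sketch shows one way to translate them. The field names (consolidateHeader, teiCoordinates, flavor, and so on) and the "1"/"0" values are assumptions modeled on the console labels, not confirmed by this page. Verify them against the web service API documentation. The function name build_form_fields is hypothetical.

```python
# Assumed mapping from console options to request field names.
OPTION_FIELDS = {
    "consolidate_header": "consolidateHeader",
    "consolidate_citations": "consolidateCitations",
    "consolidate_funders": "consolidateFunders",
    "include_raw_affiliations": "includeRawAffiliations",
    "include_raw_citations": "includeRawCitations",
    "include_raw_copyrights": "includeRawCopyrights",
    "segment_sentences": "segmentSentences",
}


def build_form_fields(flavor=None, coordinates=(), **options):
    """Turn console-style options into a list of (field, value) pairs.

    Boolean options become "1" or "0". `coordinates` lists the TEI element
    names for which coordinates are requested (assumed to be a repeated
    field). `flavor` is passed through unchanged when given.
    """
    fields = []
    for name, enabled in options.items():
        if name not in OPTION_FIELDS:
            raise ValueError(f"unknown option: {name}")
        fields.append((OPTION_FIELDS[name], "1" if enabled else "0"))
    for element in coordinates:
        fields.append(("teiCoordinates", element))
    if flavor:
        fields.append(("flavor", flavor))
    return fields


fields = build_form_fields(
    consolidate_header=True,
    segment_sentences=False,
    coordinates=["ref", "biblStruct"],
)
print(fields)
```

A list of pairs is used instead of a dictionary so that a field can appear more than once, as the coordinates option does here.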
