Grobid
Official repository on GitHub.
What you can do from this console:
- Process — Extract structured data from PDFs (headers, full text, references) or parse raw text (dates, names, affiliations, citations)
- Annotate PDF — Upload a PDF and get back an annotated version with visual overlays highlighting the extracted structures
- Patents — Extract and annotate citations from patent documents
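The raw-text parsing listed under Process can also be driven from a script through the web service. The following is a minimal sketch, assuming a Grobid server is running locally on port 8070 and that it exposes the `/api/processDate` endpoint, which takes a form field named `date`. The helper names `build_process_date_request` and `parse_date` are made up for this illustration; check the web service API documentation for the endpoints available in your version.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a Grobid server is reachable here (8070 is the usual default port).
GROBID_URL = "http://localhost:8070"


def build_process_date_request(raw_date: str, base_url: str = GROBID_URL) -> Request:
    """Build a POST request asking Grobid to parse a raw date string."""
    # The service expects a URL-encoded form body with a single "date" field.
    body = urlencode({"date": raw_date}).encode("utf-8")
    return Request(
        f"{base_url.rstrip('/')}/api/processDate",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_date(raw_date: str, base_url: str = GROBID_URL, timeout: float = 30.0) -> str:
    """Send the request and return the response body (an XML fragment) as text."""
    request = build_process_date_request(raw_date, base_url)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, `print(parse_date("September 16th 2001"))` should print the structured date returned by the service. Building the request separately from sending it keeps the first function usable without network access, for example in tests.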
Documentation:
| Topic | Description |
|---|---|
| 🚀 Getting started | Installation, first steps, and basic usage |
| ⚙️ How Grobid works | Architecture and processing pipeline overview |
| 🔌 Web service API | REST API endpoints and client libraries |
| 📄 Understanding the output (TEI) | TEI is the XML format used by Grobid to structure extracted data |
| 🐳 Running with Docker | Container-based deployment for easy setup |
| ❓ FAQ & Troubleshooting | Common questions and solutions |
| 📚 Full documentation | Complete reference on readthedocs |
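Since the extracted data comes back as TEI XML, a short sketch of reading it may help. The example below assumes the document title sits in `titleStmt/title` under the standard TEI namespace. `SAMPLE_TEI` is a hand-written fragment for illustration only, not real Grobid output, and `extract_title` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Standard TEI namespace; TEI documents declare it as the default namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written illustrative fragment (an assumption, not actual Grobid output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Title</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_title(tei_xml: str) -> Optional[str]:
    """Return the text of the first titleStmt/title element, or None if absent."""
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    if title is None:
        return None
    # itertext() also collects text nested in child elements of the title.
    text = "".join(title.itertext()).strip()
    return text or None
```

Calling `extract_title(SAMPLE_TEI)` returns the title string from the fragment. Real output contains many more elements (authors, affiliations, references), which can be queried the same way with namespace-qualified paths.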
Contact: Patrice Lopez and Luca Foppiano.