Skip to main content

Grobid

GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents.

It is designed and expected to be used to parse academic papers, where it works particularly well.

Note: if the articles supplied to Grobid are large documents (e.g. dissertations) exceeding a certain number of elements, they might not be processed.

This page covers how to use the Grobid to parse articles for LangChain.

Installation​

The grobid installation is described in details in https://grobid.readthedocs.io/en/latest/Install-Grobid/. However, it is probably easier and less troublesome to run grobid through a docker container, as documented here.

Use Grobid with LangChain​

Once grobid is installed and up and running (you can check by accessing it http://localhost:8070), you're ready to go.

You can now use the GrobidParser to produce documents

from langchain.document_loaders.parsers import GrobidParser
from langchain.document_loaders.generic import GenericLoader

#Produce chunks from article paragraphs
loader = GenericLoader.from_filesystem(
"/Users/31treehaus/Desktop/Papers/",
glob="*",
suffixes=[".pdf"],
parser= GrobidParser(segment_sentences=False)
)
docs = loader.load()

#Produce chunks from article sentences
loader = GenericLoader.from_filesystem(
"/Users/31treehaus/Desktop/Papers/",
glob="*",
suffixes=[".pdf"],
parser= GrobidParser(segment_sentences=True)
)
docs = loader.load()

Chunk metadata will include Bounding Boxes. Although these are a bit funky to parse, they are explained in https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/