Unstructured
The unstructured package from Unstructured.IO extracts clean text from raw source documents like PDFs and Word documents. This page covers how to use the unstructured ecosystem within LangChain.
Installation and Setup
If you are using a loader that runs locally, use the following steps to get unstructured and its dependencies running locally.
- Install the Python SDK with pip install unstructured.
  - You can install document-specific dependencies with extras, e.g. pip install "unstructured[docx]".
  - To install the dependencies for all document types, use pip install "unstructured[all-docs]".
- Install the following system dependencies if they are not already available on your system.
Depending on what document types you're parsing, you may not need all of these.
  - libmagic-dev (filetype detection)
  - poppler-utils (images and PDFs)
  - tesseract-ocr (images and PDFs)
  - libreoffice (MS Office docs)
  - pandoc (EPUBs)
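As a quick way to see which of the system dependencies listed above are already present, the following sketch probes for them using only the Python standard library. The executable names (pdftotext or pdftoppm for poppler-utils, tesseract, soffice or libreoffice, pandoc) and the shared-library name magic are assumptions about how these packages are typically installed; they may differ on your platform.

```python
import shutil
from ctypes.util import find_library

# Assumed executable names for each system package; adjust for your platform.
EXECUTABLES = {
    "poppler-utils": ["pdftotext", "pdftoppm"],
    "tesseract-ocr": ["tesseract"],
    "libreoffice": ["soffice", "libreoffice"],
    "pandoc": ["pandoc"],
}


def check_system_dependencies():
    """Return a dict mapping each package name to True if it looks installed."""
    found = {
        name: any(shutil.which(exe) is not None for exe in exes)
        for name, exes in EXECUTABLES.items()
    }
    # libmagic is a shared library rather than an executable.
    found["libmagic-dev"] = find_library("magic") is not None
    return found


if __name__ == "__main__":
    for package, present in check_system_dependencies().items():
        print(f"{package}: {'found' if present else 'missing'}")
```

A package reported as missing is only a problem if you parse the document types that need it.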
If you want to get up and running with less setup, you can simply run pip install unstructured and use UnstructuredAPIFileLoader or UnstructuredAPIFileIOLoader. That will process your document using the hosted Unstructured API.
The Unstructured API requires an API key to make requests. You can generate a free API key here and start using it today! Check out the README here to get started making API calls. We'd love to hear your feedback; let us know how it goes in our community Slack. And stay tuned for improvements to both quality and performance! Check out the instructions here if you'd like to self-host the Unstructured API or run it locally.
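A minimal sketch of wiring an API key into the hosted loader might look like the following. The helper name build_api_loader and the environment variable UNSTRUCTURED_API_KEY are hypothetical choices made for this example, not names defined by LangChain or Unstructured; the loader itself is the UnstructuredAPIFileLoader mentioned above.

```python
import os


def build_api_loader(file_path, api_key=None):
    """Create an UnstructuredAPIFileLoader, reading the key from the environment if needed."""
    # UNSTRUCTURED_API_KEY is a hypothetical variable name chosen for this sketch.
    key = api_key or os.environ.get("UNSTRUCTURED_API_KEY")
    if not key:
        raise ValueError("No API key given and UNSTRUCTURED_API_KEY is not set")
    # Imported lazily so the key check works even before langchain is installed.
    from langchain.document_loaders import UnstructuredAPIFileLoader

    return UnstructuredAPIFileLoader(file_path=file_path, api_key=key)
```

With a key in place, build_api_loader("state_of_the_union.txt").load() should send the file to the hosted API and return the parsed documents. Reading the key from the environment keeps it out of source control.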
Wrappers
Data Loaders
The primary unstructured wrappers within langchain are data loaders. The following shows how to use the most basic unstructured data loader. There are other file-specific data loaders available in the langchain.document_loaders module.
from langchain.document_loaders import UnstructuredFileLoader

# Parse the file with unstructured and return a list of Document objects.
loader = UnstructuredFileLoader("state_of_the_union.txt")
docs = loader.load()
If you instantiate the loader with mode="elements", for example UnstructuredFileLoader("state_of_the_union.txt", mode="elements"), the loader will track additional metadata like the page number and text type (e.g. title, narrative text) when that information is available.
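To illustrate what the extra metadata in elements mode can be used for, here is a small sketch that groups the loaded text by element type. It assumes each returned document exposes the text type under a metadata key named category; that key name and the helper names load_elements and group_by_category are assumptions made for this example, so inspect doc.metadata on your own files before relying on them.

```python
from collections import defaultdict


def load_elements(path):
    """Load a file in "elements" mode; requires langchain and unstructured."""
    from langchain.document_loaders import UnstructuredFileLoader

    return UnstructuredFileLoader(path, mode="elements").load()


def group_by_category(docs):
    """Group document texts by the "category" metadata key, when present."""
    groups = defaultdict(list)
    for doc in docs:
        # Metadata keys depend on the file type, so fall back to a placeholder.
        category = doc.metadata.get("category", "Unknown")
        groups[category].append(doc.page_content)
    return dict(groups)
```

For example, group_by_category(load_elements("state_of_the_union.txt")) would return a dict keyed by element type, which makes it easy to keep only titles or only narrative text.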