
XML parser

This output parser lets you obtain results from an LLM in the popular XML format.

Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate well-formed XML.
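Since a model is not guaranteed to return well-formed XML, it can help to check a response before relying on it. The snippet below is a minimal sketch using only the Python standard library; the helper name is_well_formed is hypothetical and is not part of LangChain.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<movie>Big (1988)</movie>"))  # True
print(is_well_formed("<movie>Big (1988)"))          # False: the closing tag is missing
```

A check like this only confirms that the text parses; it says nothing about whether the tags are the ones you asked for.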

In the following example we use the Claude model (https://docs.anthropic.com/claude/docs), which works really well with XML tags.

from langchain.llms import Anthropic
from langchain.output_parsers import XMLOutputParser
from langchain.prompts import PromptTemplate
model = Anthropic(model="claude-2", max_tokens_to_sample=512, temperature=0.1)
/Users/harrisonchase/workplace/langchain/libs/langchain/langchain/llms/anthropic.py:171: UserWarning: This Anthropic LLM is deprecated. Please use `from langchain.chat_models import ChatAnthropic` instead
warnings.warn(

Let's start with a simple request to the model.

actor_query = "Generate the shortened filmography for Tom Hanks."
output = model(
f"""

Human:
{actor_query}
Please enclose the movies in <movie></movie> tags
Assistant:
"""
)
print(output)
 Here is the shortened filmography for Tom Hanks enclosed in <movie> tags:

<movie>Splash (1984)</movie>
<movie>Big (1988)</movie>
<movie>A League of Their Own (1992)</movie>
<movie>Sleepless in Seattle (1993)</movie>
<movie>Forrest Gump (1994)</movie>
<movie>Apollo 13 (1995)</movie>
<movie>Toy Story (1995)</movie>
<movie>Saving Private Ryan (1998)</movie>
<movie>Cast Away (2000)</movie>
<movie>The Da Vinci Code (2006)</movie>
<movie>Toy Story 3 (2010)</movie>
<movie>Captain Phillips (2013)</movie>
<movie>Bridge of Spies (2015)</movie>
<movie>Toy Story 4 (2019)</movie>
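The raw response above is plain text with some prose around the tags, so extracting the movies by hand takes a little work. As a rough illustration of what that looks like without an output parser, the sketch below pulls titles and years out with a regular expression. It assumes every entry has the form "Title (YYYY)" as in the output above; the names MOVIE_RE and extract_movies are hypothetical, and the sample text is a shortened copy of that output.

```python
import re
from typing import List, Tuple

# Shortened copy of the model response shown above.
raw_output = """ Here is the shortened filmography for Tom Hanks enclosed in <movie> tags:

<movie>Splash (1984)</movie>
<movie>Big (1988)</movie>
<movie>Toy Story 3 (2010)</movie>
"""

# Capture the title and the four-digit year inside each <movie>...</movie> pair.
MOVIE_RE = re.compile(r"<movie>(.*?)\s*\((\d{4})\)</movie>")


def extract_movies(text: str) -> List[Tuple[str, int]]:
    """Return (title, year) pairs for every <movie> entry found in text."""
    return [(title, int(year)) for title, year in MOVIE_RE.findall(text)]


print(extract_movies(raw_output))
# [('Splash', 1984), ('Big', 1988), ('Toy Story 3', 2010)]
```

A pattern like this breaks as soon as the model changes its formatting, which is the motivation for using a dedicated parser.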

Now we will use the XMLOutputParser to get structured output.

parser = XMLOutputParser()

prompt = PromptTemplate(
template="""

Human:
{query}
{format_instructions}
Assistant:""",
input_variables=["query"],
partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

output = chain.invoke({"query": actor_query})
print(output)
{'filmography': [{'movie': [{'title': 'Splash'}, {'year': '1984'}]}, {'movie': [{'title': 'Big'}, {'year': '1988'}]}, {'movie': [{'title': 'A League of Their Own'}, {'year': '1992'}]}, {'movie': [{'title': 'Sleepless in Seattle'}, {'year': '1993'}]}, {'movie': [{'title': 'Forrest Gump'}, {'year': '1994'}]}, {'movie': [{'title': 'Toy Story'}, {'year': '1995'}]}, {'movie': [{'title': 'Apollo 13'}, {'year': '1995'}]}, {'movie': [{'title': 'Saving Private Ryan'}, {'year': '1998'}]}, {'movie': [{'title': 'Cast Away'}, {'year': '2000'}]}, {'movie': [{'title': 'Catch Me If You Can'}, {'year': '2002'}]}, {'movie': [{'title': 'The Polar Express'}, {'year': '2004'}]}, {'movie': [{'title': 'Bridge of Spies'}, {'year': '2015'}]}]}
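The parsed result is a nested structure: an element with children becomes a one-key dict whose value is a list of its children, and a leaf element becomes a one-key dict holding its text. The sketch below reproduces that shape with the standard library so the mapping is easier to see. It is an illustration written for this page, not necessarily how XMLOutputParser is implemented, and element_to_dict is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert an element into the nested one-key-dict shape shown above."""
    children = list(element)
    if not children:
        # Leaf element: map the tag to its text content.
        return {element.tag: element.text}
    # Element with children: map the tag to a list of converted children.
    return {element.tag: [element_to_dict(child) for child in children]}


xml_text = """
<filmography>
  <movie><title>Splash</title><year>1984</year></movie>
  <movie><title>Big</title><year>1988</year></movie>
</filmography>
"""

parsed = element_to_dict(ET.fromstring(xml_text.strip()))
print(parsed)
# {'filmography': [{'movie': [{'title': 'Splash'}, {'year': '1984'}]}, {'movie': [{'title': 'Big'}, {'year': '1988'}]}]}
```

Note that values such as the year stay strings; any type conversion is left to the caller.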

Finally, let's add some tags to tailor the output to our needs.

parser = XMLOutputParser(tags=["movies", "actor", "film", "name", "genre"])
prompt = PromptTemplate(
template="""

Human:
{query}
{format_instructions}
Assistant:""",
input_variables=["query"],
partial_variables={"format_instructions": parser.get_format_instructions()},
)


chain = prompt | model | parser

output = chain.invoke({"query": actor_query})

print(output)
{'movies': [{'actor': [{'name': 'Tom Hanks'}, {'film': [{'name': 'Splash'}, {'genre': 'Comedy'}]}, {'film': [{'name': 'Big'}, {'genre': 'Comedy'}]}, {'film': [{'name': 'A League of Their Own'}, {'genre': 'Comedy'}]}, {'film': [{'name': 'Sleepless in Seattle'}, {'genre': 'Romance'}]}, {'film': [{'name': 'Forrest Gump'}, {'genre': 'Drama'}]}, {'film': [{'name': 'Toy Story'}, {'genre': 'Animation'}]}, {'film': [{'name': 'Apollo 13'}, {'genre': 'Drama'}]}, {'film': [{'name': 'Saving Private Ryan'}, {'genre': 'War'}]}, {'film': [{'name': 'Cast Away'}, {'genre': 'Adventure'}]}, {'film': [{'name': 'The Green Mile'}, {'genre': 'Drama'}]}]}]}
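Because each field arrives as its own one-key dict, downstream code usually wants to merge those into flat records. The sketch below does that for a shortened copy of the output above. It assumes the model used exactly the movies, actor, film, name and genre tags in this arrangement, which can vary between runs; merge_fields is a hypothetical helper.

```python
def merge_fields(pairs):
    """Merge a list of one-key dicts, e.g. [{'name': 'Big'}, {'genre': 'Comedy'}]."""
    merged = {}
    for pair in pairs:
        merged.update(pair)
    return merged


# Shortened copy of the parsed output shown above.
output = {
    "movies": [
        {
            "actor": [
                {"name": "Tom Hanks"},
                {"film": [{"name": "Splash"}, {"genre": "Comedy"}]},
                {"film": [{"name": "Forrest Gump"}, {"genre": "Drama"}]},
            ]
        }
    ]
}

actor_items = output["movies"][0]["actor"]
# The actor's own name is the item keyed by 'name'; the films are the items keyed by 'film'.
actor_name = next(item["name"] for item in actor_items if "name" in item)
films = [merge_fields(item["film"]) for item in actor_items if "film" in item]

print(actor_name)  # Tom Hanks
print(films)
# [{'name': 'Splash', 'genre': 'Comedy'}, {'name': 'Forrest Gump', 'genre': 'Drama'}]
```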