XML — Getting Started Notes
📅 Fri. 2026-03-27 🕐 Current 🤖 Claude Sonnet 4.6 👉 #XML #AI #DataFormats #Python #WebDev 📎 W3C XML 1.0 Spec 📎 Python ElementTree Docs 📎 lxml Documentation 📎 OWASP XXE Prevention Cheat Sheet 📎 defusedxml on PyPI 📎 W3C XML Standards Index
1. Overview
1.1. What Is XML and Why It Exists
XML — Extensible Markup Language — is a W3C-standardized text format for encoding structured, hierarchical data in a way that is both human-readable and machine-parseable. It became a W3C Recommendation on February 10, 1998, and its Fifth Edition (XML 1.0) remains the active normative standard today. As of early 2026, no organization has announced a formal XML 2.0 effort, though the MicroXML Community Group published a significantly simplified subset specification.
(1) Design Intent
The W3C Working Group that created XML articulated its goals explicitly in the specification. XML was designed to be:
- Straightforwardly usable over the Internet
- Supportive of a wide variety of applications
- Compatible with SGML (its parent standard, ISO 8879)
- Easy to write programs that process XML documents
- Minimally optional in features — ideally zero optional features
- Human-legible and reasonably clear
- Formally and concisely specified
- Easy to author
These goals explain the angle-bracket syntax, the strict well-formedness rules, and the separation between structure (elements/attributes) and content (text nodes). XML deliberately chose verbosity over terseness, which is why it has been widely criticized but also why it remains self-documenting and universally parseable.
(2) Pain Points XML Solves
Before XML, data interchange between heterogeneous systems required negotiating proprietary binary formats or attempting to parse non-standardized plain text. XML solved several concrete problems:
- Structural ambiguity: Every element has explicit open/close tags and a clear parent-child hierarchy.
- Encoding diversity: XML mandates Unicode (UTF-8 by default) and carries its encoding declaration inline, eliminating charset mismatch bugs.
- Schema enforcement: DTD (Document Type Definition) and later XSD (XML Schema Definition) provide formal grammar-based validation, enabling automated contract checking between producer and consumer.
- Cross-language tooling: Because XML is a text-based, standardized format, parsers exist in virtually every programming language and platform, eliminating integration friction.
- Namespace collision: XML Namespaces (a separate W3C recommendation) allow elements from multiple vocabularies to coexist in a single document without name clashes — critical for mashup formats like SOAP envelopes containing XHTML payloads.
(3) Key Features
- Well-formedness rules that any conformant parser enforces automatically
- Hierarchical (tree) data model — every document has exactly one root element
- Unicode support with explicit encoding declarations
- Processing instructions for parser directives outside document content
- Comments, CDATA sections, and entities for special-case content
- CDATA sections: a block of text meant to be interpreted literally by the parser, prevents parsing errors
- Namespace support via prefixed element/attribute names
- Validation via DTD, XSD (XML Schema), or RELAX NG
- XPath — a query language for addressing nodes in the tree
- XSLT — a transformation language for producing new documents from XML
- XQuery — a full query language for XML databases
(4) Primary Use Cases
- Configuration files: Maven
pom.xml, Spring beans, Ant build files, AndroidManifest.xml, .NET project files - Document markup: DITA, DocBook, OOXML (
.docx,.xlsx), ODF - Data interchange / APIs: SOAP web services, RSS/Atom feeds, Open API specs (historically), Salesforce metadata
- Database persistence: Native XML databases (eXist-db, BaseX), SQL Server's XML columns
- Office automation: All modern Office files (Word, Excel, PowerPoint) are ZIP archives containing XML
- SVG graphics: Scalable Vector Graphics is an XML vocabulary
- AI/ML data annotation: Clinical trial datasets, legal contract markup, NLP corpora annotated in XML-based formats (TEI, JATS)
1.2. Competitors & Alternatives
XML does not exist in a vacuum. Understanding where it wins and where alternatives dominate is essential before deciding to use or process it.
(1) Market Perspective
| Format | Dominant Market | Notes | ||
|---|---|---|---|---|
| XML | Enterprise integration, government, healthcare, publishing | Deep install base; SOAP, HL7, DITA, Maven | ||
| JSON | Web APIs, JavaScript ecosystems, NoSQL databases | Displaced XML as the default REST payload format from ~2010 onward | ||
| YAML | Configuration files, CI/CD pipelines, Kubernetes manifests | Human-friendlier than XML/JSON for config; anchors enable reuse | ||
| Protocol Buffers (protobuf) | High-performance microservices, gRPC | Binary, schema-first, Google-developed; 3–10x smaller payloads | ||
| MessagePack | IoT, gaming, embedded systems | Binary JSON alternative; compact and fast | ||
| Apache Avro | Big data, Kafka, Hadoop ecosystems | Schema-evolution-friendly; binary with schema-in-header | ||
| TOML | Developer-facing config files | Minimal, explicit; Rust's Cargo.toml, Python's pyproject.toml | ||
| CSV | Tabular data exchange | Lowest common denominator; no hierarchy, no types | ||
| Format | Definition | Problem Solved | Dominant Market | Notes |
| XML | A tag-based markup language that defines rules for encoding documents in a format that is both human-readable and machine-readable. | Need for a strictly structured, self-describing, and hierarchical way to exchange complex data across incompatible systems. | Enterprise integration, government, healthcare, publishing | Deep install base; SOAP, HL7, DITA, Maven |
| JSON | A lightweight data-interchange format based on a subset of the JavaScript Programming Language syntax. | XML was too verbose and computationally expensive for web browsers; JSON provides a minimal, faster alternative for web traffic. | Web APIs, JavaScript ecosystems, NoSQL databases | Displaced XML as the default REST payload format from ~2010 onward |
| YAML | A human-friendly data serialization standard that uses indentation to indicate nesting and structure. | JSON and XML are difficult for humans to read and write manually (e.g., for settings); YAML maximizes legibility for configuration. | Configuration files, CI/CD pipelines, Kubernetes manifests | Human-friendlier than XML/JSON; anchors enable reuse |
| Protobuf | A binary, language-neutral, platform-neutral, extensible mechanism for serializing structured data. | Text-based formats (JSON) are too large and slow for high-frequency internal microservice communication. | High-performance microservices, gRPC | Binary, schema-first, Google-developed; 3–10x smaller payloads |
| MessagePack | An efficient binary serialization format that lets you exchange data like JSON but much faster and smaller. | Standard JSON is inefficient for resource-constrained environments where bandwidth and storage are at a premium. | IoT, gaming, embedded systems | Binary JSON alternative; compact and fast |
| Apache Avro | A remote procedure call and data serialization framework developed within Apache's Hadoop project. | Big data pipelines require a format that supports "schema evolution," allowing data structures to change over time without breaking old data. | Big data, Kafka, Hadoop ecosystems | Schema-evolution-friendly; binary with schema-in-header |
| TOML | A configuration file format that is easy to read due to obvious semantics and a focus on minimal complexity. | YAML’s "significant whitespace" can be ambiguous and error-prone; TOML offers a more explicit, "flat" syntax for developer settings. | Developer-facing config files | Minimal, explicit; Rust's Cargo.toml, Python's pyproject.toml |
| CSV | A simple text format where each line is a data record and each record consists of one or more fields, separated by commas. | The need for a universal, "lowest common denominator" format for importing and exporting flat, tabular datasets. | Tabular data exchange | Lowest common denominator; no hierarchy, no types |
| ###### (2) User Segment Perspective |
- Enterprise architects designing system integrations still frequently encounter SOAP/WSDL, WS-Security, EDI over XML, and XML-based regulatory filings. XML is unavoidable here.
- Web developers working on REST APIs default to JSON for new development but must parse XML when integrating with legacy services, RSS feeds, sitemaps, or SVG.
- Data engineers in healthcare, pharma, and legal work with XML-annotated corpora (HL7 FHIR R4 still offers XML serialization alongside JSON; CDA documents are pure XML).
- DevOps engineers encounter XML in Maven/Gradle, SonarQube configurations, and legacy CI systems, but have mostly migrated to YAML for new config.
- AI/ML engineers increasingly use XML annotation schemas for LLM fine-tuning datasets (TEI-XML for historical corpora, BioC for biomedical NLP) and must parse OOXML files to extract training data.
(3) Technical Domain Perspective
| Dimension | XML | JSON | YAML | Protobuf |
|---|---|---|---|---|
| Schema/Validation | XSD, DTD, RELAX NG (rich) | JSON Schema (limited) | None standard | .proto file |
| Namespaces | Native | None | None | Package namespacing |
| Mixed content (text + child elements) | Native | Awkward | Awkward | Not applicable |
| Transformation | XSLT (powerful) | jq (limited) | None standard | Code generation |
| Query | XPath, XQuery | JSONPath (not standardized until 2024) | None | None |
| Human-readability | Medium | High | Very high | Low (binary) |
| Parse performance | Slower than binary | Faster than XML | Slower than JSON | Fastest |
| Streaming large files | SAX/iterparse | YAJL, streaming JSON | None standard | Streaming protobuf |
Verdict for AI agents: You will encounter XML most often when integrating with enterprise APIs, parsing OOXML/ODF files, consuming RSS/Atom feeds, processing SVG, or working with annotated NLP corpora. JSON is your default for new system design; XML is your "must-know" for legacy and document-centric workloads.
2. Concept, Component, & Architecture
2.1. Key Concepts
The following concepts are introduced from simplest to most complex, respecting their logical dependencies.
(1) Well-Formed Document
An XML document is well-formed if it follows all syntactic rules required by the XML specification. These rules are:
- It has exactly one root element that contains all other elements.
- All elements are properly nested (no overlapping).
- Every open tag has a matching close tag (or is a self-closing empty element tag
<br/>). - Attribute values are always quoted (single or double quotes).
- The special characters
<,>, and&are always escaped as<,>, and&when they appear in content. - Element and attribute names are case-sensitive.
A well-formed document can be parsed by any conformant parser. Well-formedness is the minimum bar — it does not guarantee the document structure is correct for a given application.
(2) Valid Document
A valid XML document is well-formed AND conforms to a declared schema (DTD, XSD, or RELAX NG). Validity checking is optional and requires a validating parser. An XML document can be well-formed but invalid (structure violations against the schema) or valid but semantically incorrect (correct structure, wrong data values).
(3) Elements, Attributes, and Text Nodes
The three fundamental information-carrying constructs in XML:
- Elements are the primary containers:
<book id="001"><title>Clean Code</title></book>. They form the tree hierarchy.- child element: element nested with another element (the parent)
- Attributes are key-value metadata on elements:
id="001"in the example above. They cannot be repeated on the same element, have no ordering guarantee across parsers, and cannot contain child content. - Text nodes are the character data within element tags. In the example,
Clean Codeis the text node content of<title>.
The design choice between encoding data as an attribute or a child element is a common source of debate. The general guidance is: use attributes for metadata that describes the element itself (like an identifier), and use child elements for data that is part of the document content.
(4) Character Data (CDATA) Sections
A CDATA section wraps literal text that should not be parsed for markup. It is the mechanism for embedding content that contains < or & without requiring entity escaping:
<script>
<![CDATA[
if (x < 10 && y > 5) { doSomething(); }
]]>
</script>
The parser passes the raw content inside <![CDATA[ ... ]]> directly to the application as character data without interpretation.
(5) Entities
XML defines five built-in character entities for escaping special characters:
| Entity | Character |
|---|---|
< |
< |
> |
> |
& |
& |
' |
' |
" |
" |
Additionally, documents can define custom entities in a DTD (internal entity declarations) or reference external entity files (external entity declarations). External entity processing is the root cause of XXE (XML External Entity) vulnerabilities — see Section 3.3 on security.
(6) Processing Instructions
Processing instructions (PIs) carry application-specific directives to the parser or post-processing application. They are not considered document content:
<?xml-stylesheet type="text/xsl" href="transform.xsl"?>
<?target instruction-data?>
The most important PI is the XML declaration at the very top of a document:
<?xml version="1.0" encoding="UTF-8"?>
(7) Comments
XML comments have the same syntax as HTML comments and are not part of the document's information content:
<!-- This is a comment. It cannot contain double-dashes: -- -->
(8) Namespaces
XML Namespaces allow element and attribute names from different vocabularies to coexist in one document without collision. A namespace is identified by a URI (which does not need to resolve to anything real — it is just a unique identifier):
<root xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:svg="http://www.w3.org/2000/svg">
<xhtml:p>Paragraph</xhtml:p>
<svg:circle cx="50" cy="50" r="25"/>
</root>
The default namespace (no prefix) is declared with xmlns="URI" and applies to all unprefixed elements in scope. Understanding namespaces is critical when using ElementTree in Python because namespace URIs are baked into tag names: {http://www.w3.org/1999/xhtml}p.
(9) XPath
XPath is a query language for selecting nodes from an XML document. It models the document as a tree of nodes and provides path expressions analogous to filesystem paths:
//book— all<book>elements anywhere in the document/library/book[@category='fiction']—<book>elements in<library>with acategoryattribute equal tofiction//book/title/text()— text content of all<title>elements that are children of<book>elementscount(//book)— count of all book elements
XPath 1.0 is the version supported by ElementTree. lxml supports XPath 1.0 fully; true XPath 2.0/3.0 support requires Saxon or similar.
(10) XSD (XML Schema Definition)
XSD is a W3C standard for formally describing the structure and data types permissible in an XML document. An XSD schema is itself an XML document. It provides:
- Element and attribute declarations with data types (xs:string, xs:integer, xs:date, etc.)
- Complex type definitions with ordering constraints (xs:sequence, xs:choice, xs:all)
- Cardinality constraints (minOccurs, maxOccurs)
- Inheritance and extension mechanisms
XSD validation in Python is done via lxml:
from lxml import etree
with open("schema.xsd") as f:
schema_doc = etree.parse(f)
schema = etree.XMLSchema(schema_doc)
doc = etree.parse("document.xml")
is_valid = schema.validate(doc)
print(schema.error_log) # details on failures
(11) XSLT
XSLT (Extensible Stylesheet Language Transformations) is a functional, template-matching language for transforming one XML document into another document (XML, HTML, or plain text). It is declared as an XML document itself and is applied by an XSLT processor. lxml ships with an XSLT 1.0 processor. XSLT 2.0/3.0 requires Saxon or Xalan:
from lxml import etree
transform = etree.XSLT(etree.parse("stylesheet.xsl"))
result = transform(etree.parse("input.xml"))
print(str(result))
2.2. Core Components
(1) The XML Document Itself
An XML document is a text file with a tree structure. Every well-formed document has exactly one root element. The typical anatomy is:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Optional: DOCTYPE declaration linking to DTD -->
<!DOCTYPE library SYSTEM "library.dtd">
<!-- Root element -->
<library xmlns:dc="http://purl.org/dc/elements/1.1/">
<!-- Child elements -->
<book id="001" category="programming">
<dc:title>Clean Code</dc:title>
<author>Robert C. Martin</author>
<year>2008</year>
<price currency="USD">35.99</price>
</book>
<book id="002" category="fiction">
<dc:title>1984</dc:title>
<author>George Orwell</author>
<year>1949</year>
<price currency="USD">8.99</price>
</book>
</library>
(2) DOM Parser (Document Object Model)
A DOM parser reads the entire XML document into memory and builds a complete tree of node objects. The application can then navigate, query, and modify any part of the tree. DOM is the oldest and most feature-rich parsing model.
- Pros: Random access to any node; supports modification; well-suited for small-to-medium documents.
- Cons: Entire document loaded into RAM — prohibitive for large files (>100MB).
- Python:
xml.dom.minidom(stdlib, verbose API);lxml.etree(DOM-like, much faster).
(3) SAX Parser (Simple API for XML)
A SAX parser streams through the document and fires events (startElement, endElement, characters) without building a tree in memory. The application registers handler callbacks and processes data as it arrives.
- Pros: Constant memory footprint regardless of document size; fastest for extraction of small amounts of data from large files.
- Cons: Read-only; stateless — managing context between events requires application-level state tracking; harder to code.
- Python:
xml.sax(stdlib).
(4) ElementTree API (Python-specific)
ElementTree is a middle ground between DOM and SAX. It builds a tree in memory (like DOM) but uses a simpler, Pythonic API. It is the recommended starting point for most Python XML work:
import xml.etree.ElementTree as ET
tree = ET.parse("library.xml")
root = tree.getroot()
for book in root.findall("book"):
title = book.find("title").text
book_id = book.get("id")
print(f"{book_id}: {title}")
The standard library's ElementTree (xml.etree.ElementTree) is written in C (since Python 3.3) and is fast for typical use cases. It handles 90% of real-world XML tasks. Key limitation: does not support XSD validation, XSLT, full XPath 2.0, or parent-node traversal.
(5) lxml
lxml is a Python binding to the C libraries libxml2 (XML/HTML parsing) and libxslt (XSLT transformation). It is the recommended library for production XML work requiring performance, full XPath, XSLT, or schema validation:
from lxml import etree
tree = etree.parse("library.xml")
root = tree.getroot()
# Full XPath support
books = root.xpath("//book[@category='programming']/title/text()")
print(books) # ['Clean Code']
# XSD validation
schema = etree.XMLSchema(etree.parse("library.xsd"))
print(schema.validate(tree))
lxml is API-compatible with ElementTree for most operations, so migrating from stdlib to lxml requires only changing the import:
try:
from lxml import etree
except ImportError:
import xml.etree.ElementTree as etree
(6) iterparse for Streaming
When documents are too large to fit in memory, iterparse allows incremental, event-driven processing without the complexity of SAX callback registration. It is the recommended approach for large-file scenarios:
import xml.etree.ElementTree as ET
def process_large_file(xml_file: str) -> None:
for event, elem in ET.iterparse(xml_file, events=("end",)):
if elem.tag == "book":
title = elem.findtext("title", default="")
print(title)
elem.clear() # CRITICAL: release memory after processing
process_large_file("large_catalog.xml")
The elem.clear() call is mandatory — without it, ElementTree retains references to processed elements and the memory advantage of iterparse is lost entirely.
(7) defusedxml
defusedxml is a security-hardened drop-in replacement for Python's stdlib XML parsers. It disables all entity expansion, external entity resolution, and other attack vectors by default. It is the recommended parser for any XML received from untrusted sources:
# Replace: import xml.etree.ElementTree as ET
# With:
from defusedxml import ElementTree as ET
tree = ET.parse("untrusted_input.xml") # Raises if XXE or Billion Laughs detected
2.3. Architecture & Design
(1) XML Information Set (Infoset)
The XML Information Set specification (W3C Recommendation) defines an abstract data model for the information in a well-formed XML document. It separates what an XML document means (the information) from how it is serialized (the bytes on disk). All XML-related standards (XPath, XQuery, XSLT, XSD) operate on the Infoset, not the raw text, which is why they are interoperable across different parser implementations.
(2) Document Tree Model
The fundamental architectural pattern of XML is the ordered, labeled tree. This model has been extremely influential and underpins HTML's DOM, JSON's object tree, and YAML's document model. The tree has these node types:
Document
├── ProcessingInstruction (<?xml-stylesheet ...?>)
├── Comment (<!-- ... -->)
└── Element (root)
├── Attribute (on the element itself)
├── Text node
├── CDATA section
└── Element (child)
├── Attribute
└── Text node
(3) Architecture Diagram — XML Processing Pipeline
flowchart LR
A[XML Source\n.xml file / HTTP / stream] --> B[Parser\nElementTree / lxml / SAX]
B --> C{Validate?}
C -->|Yes| D[XSD / DTD\nValidation]
C -->|No| E[In-memory Tree\nor Events]
D --> E
E --> F{Transform?}
F -->|XSLT| G[XSLT Processor\nlxml.etree.XSLT]
F -->|XPath| H[XPath Query\nroot.xpath]
F -->|Python| I[Custom Logic\nElementTree API]
G --> J[Output\nXML / HTML / Text]
H --> J
I --> J
(4) SAX vs DOM vs ElementTree Design Trade-offs
The three parsing paradigms reflect different trade-offs along the axes of memory, speed, and API convenience:
- SAX is designed for forward-only, read-only streaming. It is the lowest-level API and mirrors the push-parser model — the parser drives execution by firing events.
- DOM is designed for random-access modification. The entire document graph lives in memory. It mirrors the object-graph model — the application drives execution by navigating the tree.
- ElementTree is designed for the common case — parse once, query a subset, serialize or discard. It uses pull-parsing internally and exposes a simplified tree-like API.
lxml unifies all three: it uses libxml2's streaming parser internally, builds a tree compatible with ElementTree, and exposes SAX-like event APIs through lxml.etree.iterparse and lxml.etree.XMLPullParser.
(5) Evolution of XML Standards
The XML standards family has grown iteratively over time:
| Year | Standard |
|---|---|
| 1998 | XML 1.0 becomes W3C Recommendation |
| 1998 | XML Namespaces |
| 1999 | XPath 1.0, XSLT 1.0 |
| 2001 | XSD 1.0 (XML Schema) |
| 2003 | XPath 2.0 (draft), XML 1.1 |
| 2004 | RELAX NG |
| 2007 | XPath 2.0 / XSLT 2.0 / XQuery 1.0 (Recommendation) |
| 2008 | XML 1.0 Fifth Edition (current normative version) |
| 2011 | Efficient XML Interchange (EXI) — W3C Recommendation (binary XML) |
| 2017 | XPath 3.1, XSLT 3.0, XQuery 3.1 |
| 2012 | MicroXML Community Group formed |
| 2025-2026 | XML standards maintained by W3C Internationalization Working Group; no XML 2.0 announced |
2.4. Ecosystem
(1) Python XML Ecosystem
| Library | Role | Install |
|---|---|---|
xml.etree.ElementTree |
Stdlib parser/builder; C-accelerated since Python 3.3 | Built-in |
xml.dom.minidom |
Stdlib DOM parser (verbose API) | Built-in |
xml.sax |
Stdlib SAX event parser | Built-in |
lxml |
Full-featured: libxml2/libxslt binding; XPath, XSLT, XSD | pip install lxml |
defusedxml |
Security-safe drop-in replacement for stdlib parsers | pip install defusedxml |
xmltodict |
Convert XML to/from Python dict (JSON-like access) | pip install xmltodict |
BeautifulSoup (bs4) |
Lenient HTML/XML parser; handles malformed input | pip install beautifulsoup4 lxml |
xmlschema |
Pure-Python XSD 1.0/1.1 validator with data binding | pip install xmlschema |
(2) Integration with External Systems
- REST APIs that return XML: RSS/Atom feeds, Open Graph metadata, Salesforce SOAP API, SAP BAPI calls, HL7 FHIR XML serialization.
- OOXML (Office files): Word (
.docx), Excel (.xlsx), PowerPoint (.pptx) are ZIP archives containing XML. Python librariespython-docx,openpyxl, andpython-pptxabstract the raw XML; direct XML manipulation is used for advanced cases. - SVG: Python's
lxml,svglib, andreportlabgenerate/parse SVG (an XML vocabulary) for programmatic graphics. - Databases: PostgreSQL has an
xmldata type with XPath functions; SQL Server has native XML columns; eXist-db and BaseX are native XML databases with XQuery. - CI/CD and Build Tools: Maven (
pom.xml), Ant (build.xml), Gradle (optional XML DSL), SonarQube configuration, JUnit XML test results format. - AI/ML Pipelines: XML-annotated corpora (TEI, JATS, BioC) are parsed with lxml/ElementTree and fed into LLM preprocessing pipelines. OOXML metadata extraction with ElementTree feeds Salesforce CRM automation and RAG (Retrieval-Augmented Generation) document ingestion.
3. Install, Configure, Secure, & Cheatsheet
3.1. Install
(1) Python Environment Setup (macOS with Homebrew)
# Install Python (if not already present)
brew install python@3.12
# Create and activate a virtual environment
python3.12 -m venv .venv
source .venv/bin/activate
# Core XML libraries
pip install lxml defusedxml xmltodict xmlschema beautifulsoup4
# Verify installations
python -c "from lxml import etree; print(etree.__version__)"
python -c "import defusedxml; print(defusedxml.__version__)"
(2) Linux / Shell Setup
# Ubuntu / Debian — system-level libxml2 dependency for lxml
sudo apt-get update && sudo apt-get install -y libxml2-dev libxslt1-dev
# Install Python packages
pip install lxml defusedxml xmltodict xmlschema --break-system-packages
# Optional: xmllint CLI for validation and formatting (ships with libxml2)
sudo apt-get install -y libxml2-utils
# xmllint is also available on macOS via:
brew install libxml2
(3) Node.js / TypeScript (Secondary Stack)
# Install fast-xml-parser — the dominant XML library for Node.js
npm install fast-xml-parser
# For XSLT in Node.js (Saxon-JS)
npm install saxon-js
# TypeScript types
npm install --save-dev @types/node
(4) CLI Tools
# xmllint — validate and pretty-print XML
xmllint --format input.xml # pretty-print
xmllint --schema schema.xsd input.xml # XSD validation
xmllint --noout --valid input.xml # DTD validation (quiet)
# xsltproc — apply XSLT 1.0 transforms (ships with libxslt)
xsltproc stylesheet.xsl input.xml > output.html
# xmlstarlet — XPath queries and transforms from the shell
brew install xmlstarlet
xmlstarlet sel -t -v "//book[@category='fiction']/title" catalog.xml
3.2. Configure
(1) Namespace-Aware Parsing (ElementTree)
When parsing XML with namespaces using stdlib ElementTree, tag names include the full namespace URI:
import xml.etree.ElementTree as ET
xml_str = """
<root xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>My Document</dc:title>
</root>
"""
root = ET.fromstring(xml_str)
# Tag name includes namespace URI:
for child in root:
print(child.tag) # {http://purl.org/dc/elements/1.1/}title
# Use namespace map for cleaner XPath:
ns = {"dc": "http://purl.org/dc/elements/1.1/"}
title = root.find("dc:title", ns)
print(title.text) # My Document
(2) Namespace-Aware Parsing (lxml — Recommended for Production)
lxml provides the same namespace handling but with full XPath and the nsmap attribute:
from lxml import etree
xml_str = b"""
<root xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>My Document</dc:title>
</root>
"""
root = etree.fromstring(xml_str)
# Access namespace map on any element
print(root.nsmap) # {'dc': 'http://purl.org/dc/elements/1.1/'}
# XPath with namespace prefix
titles = root.xpath("//dc:title/text()", namespaces={"dc": "http://purl.org/dc/elements/1.1/"})
print(titles) # ['My Document']
(3) Iterparse for Large Files (Memory-Efficient)
import xml.etree.ElementTree as ET
from typing import Iterator, Generator
def stream_books(xml_path: str) -> Generator[dict, None, None]:
"""Stream book records from a large catalog XML without loading all into RAM."""
context = ET.iterparse(xml_path, events=("start", "end"))
_, root = next(context) # get root element reference
for event, elem in context:
if event == "end" and elem.tag == "book":
yield {
"id": elem.get("id"),
"title": elem.findtext("title", default=""),
"author": elem.findtext("author", default=""),
"year": elem.findtext("year", default=""),
}
elem.clear() # free memory
root.clear() # also clear root's children list
for book in stream_books("large_catalog.xml"):
print(book)
(4) XSD Validation with lxml
from lxml import etree
from pathlib import Path
def validate_xml(xml_path: str, xsd_path: str) -> bool:
"""Validate an XML document against an XSD schema. Returns True if valid."""
schema_doc = etree.parse(xsd_path)
schema = etree.XMLSchema(schema_doc)
doc = etree.parse(xml_path)
if not schema.validate(doc):
for error in schema.error_log:
print(f"Line {error.line}: {error.message}")
return False
return True
result = validate_xml("library.xml", "library.xsd")
print(f"Valid: {result}")
(5) XSLT Transform with lxml
from lxml import etree
def apply_xslt(xml_path: str, xsl_path: str, output_path: str) -> None:
"""Apply an XSLT 1.0 stylesheet to an XML document and write output."""
dom = etree.parse(xml_path)
xslt = etree.XSLT(etree.parse(xsl_path))
result = xslt(dom)
with open(output_path, "wb") as f:
f.write(bytes(result))
apply_xslt("library.xml", "html_view.xsl", "library.html")
(6) xmltodict for JSON-Like Access
import xmltodict
import json
from pathlib import Path
# Parse XML to dict
with open("library.xml", "rb") as f:
data = xmltodict.parse(f)
# Navigate like a dict/JSON
books = data["library"]["book"]
print(json.dumps(books, indent=2))
# Serialize back to XML
xml_output = xmltodict.unparse(data, pretty=True)
print(xml_output)
(7) Node.js / TypeScript — fast-xml-parser
import { XMLParser, XMLBuilder } from "fast-xml-parser";
import { readFileSync, writeFileSync } from "fs";
// Parse XML
const parser = new XMLParser({
ignoreAttributes: false, // include attributes
attributeNamePrefix: "@_", // prefix attributes with @_
parseAttributeValue: true, // convert numeric attribute values
trimValues: true,
});
const xmlContent = readFileSync("library.xml", "utf-8");
const data = parser.parse(xmlContent);
console.log(data.library.book);
// Build XML from object
const builder = new XMLBuilder({
ignoreAttributes: false,
attributeNamePrefix: "@_",
format: true,
indentBy: " ",
});
const xmlOutput = builder.build(data);
writeFileSync("output.xml", xmlOutput);
3.3. Secure
(1) Threat Model Overview
XML parsers are vulnerable to three primary classes of attacks when processing untrusted input:
- XXE (XML External Entity) Injection: A malicious DTD references an external entity (
SYSTEM "file:///etc/passwd") which the parser resolves and injects into the output, exposing sensitive files or enabling SSRF. Critical CVEs in 2024-2025 (CVE-2024-1455 in LangChain, CVE-2025-3225 in sitemap parsers, CVE-2025-30220 in GeoServer) demonstrate this remains actively exploited. - Billion Laughs (XML Bomb / XEE): Nested entity definitions that expand exponentially during parsing, consuming all available RAM and causing DoS. A 1KB payload can expand to gigabytes.
- Quadratic Blowup: Entities that expand to large strings referenced many times cause O(n²) memory growth.
(2) Python — Use defusedxml for ALL Untrusted Input
# WRONG — vulnerable to XXE and Billion Laughs
import xml.etree.ElementTree as ET
tree = ET.parse(untrusted_file) # DO NOT DO THIS
# CORRECT — defusedxml is a drop-in replacement
from defusedxml import ElementTree as ET
tree = ET.parse(untrusted_file) # Raises EntitiesForbidden if XXE/Billion Laughs detected
# Also available:
from defusedxml import minidom, sax, expatbuilder
(3) Python — Harden lxml Parser Explicitly
When lxml is required for performance or features, harden the parser explicitly:
from lxml import etree
def create_safe_parser() -> etree.XMLParser:
"""Return a hardened lxml parser safe for untrusted XML input."""
return etree.XMLParser(
resolve_entities=False, # Block entity resolution (XXE)
no_network=True, # Block network fetches from DTDs
dtd_validation=False, # Disable DTD-based validation
load_dtd=False, # Do not load external DTD files
huge_tree=False, # Prevent deeply nested DoS attacks
)
parser = create_safe_parser()
tree = etree.parse("untrusted_input.xml", parser=parser)
(4) Find Vulnerable Parsers in Your Codebase
# Scan for stdlib XML usage that should use defusedxml
grep -rn "xml.etree\|xml.dom\|xml.sax\|minidom" --include="*.py" .
# Run Bandit security linter — flags XXE-vulnerable XML usage
pip install bandit
bandit -r . -t B317,B318,B319,B320,B405,B406,B407,B408,B409,B410,B411
(5) Node.js — Harden fast-xml-parser
import { XMLParser } from "fast-xml-parser";
// Disable external entities and DTD processing
const parser = new XMLParser({
allowBooleanAttributes: false,
processEntities: false, // Do not expand entities
htmlEntities: false,
stopNodes: [],
});
(6) Network-Level Mitigations
- Run XML-processing services in network-isolated containers (no outbound HTTP/file:// from the parsing process).
- Set file size limits before parsing: reject requests with
Content-Lengthexceeding your threshold (e.g., 10MB for API endpoints). - Log and alert on
DOCTYPEdeclarations in incoming XML at the WAF or API gateway level. - Use Semgrep rules for CI/CD static analysis (
semgrep --config "p/owasp-top-ten"includes XXE rules).
(7) Security Summary Table
| Attack | Stdlib ET | defusedxml | lxml (unhardened) | lxml (hardened) |
|---|---|---|---|---|
| XXE | Vulnerable | Safe | Partially safe | Safe |
| Billion Laughs | Vulnerable | Safe | Vulnerable | Safe |
| Quadratic Blowup | Vulnerable | Safe | Vulnerable | Mitigated |
| DoS via huge tree | Vulnerable | Safe | Configurable | Safe |
3.4. Cheatsheet
(1) XML Syntax Quick Reference
<?xml version="1.0" encoding="UTF-8"?> <!-- XML declaration -->
<!-- Comment --> <!-- Comment syntax -->
<!DOCTYPE root SYSTEM "schema.dtd"> <!-- DTD reference -->
<root xmlns="http://example.com/ns" <!-- Default namespace -->
xmlns:xlink="http://www.w3.org/1999/xlink"> <!-- Prefixed namespace -->
<element attribute="value">Text content</element> <!-- Element + attribute + text -->
<self-closing-element/> <!-- Empty element -->
<element><![CDATA[<raw> & content]]></element> <!-- CDATA section -->
< > & ' " <!-- Built-in entities -->
</root>
(2) Python ElementTree Cheatsheet
import xml.etree.ElementTree as ET
# --- PARSE ---
tree = ET.parse("file.xml") # from file
root = tree.getroot()
root = ET.fromstring("<root/>") # from string
# --- NAVIGATE ---
root.tag # element tag name
root.attrib # dict of all attributes
root.attrib.get("id", "default") # single attribute (safe)
root.text # text content
root.tail # text after closing tag
list(root) # direct child elements
# --- SEARCH ---
root.find("book") # first matching child (shallow)
root.findall("book") # all matching children (shallow)
root.findall(".//book") # all in subtree
root.findall(".//book[@category='fiction']") # with attribute filter
root.findtext("title", default="") # text of first match
# --- ITERATE ---
for child in root: # iterate direct children
print(child.tag, child.attrib)
for elem in root.iter("book"): # iterate all descendants by tag
print(elem.get("id"))
# --- MODIFY ---
elem = ET.SubElement(root, "book") # add new child
elem.set("id", "003") # set attribute
elem.text = "New book title" # set text
root.remove(elem) # remove child
elem.attrib.pop("id", None) # remove attribute
# --- SERIALIZE ---
tree.write("output.xml", encoding="unicode", xml_declaration=True)
print(ET.tostring(root, encoding="unicode"))
# --- NAMESPACES ---
ns = {"dc": "http://purl.org/dc/elements/1.1/"}
title = root.find("dc:title", ns)
ET.register_namespace("dc", "http://purl.org/dc/elements/1.1/") # pretty output
(3) lxml XPath Cheatsheet
from lxml import etree
root = etree.parse("library.xml").getroot()
ns = {"dc": "http://purl.org/dc/elements/1.1/"}
root.xpath("//book") # all book elements
root.xpath("//book/@id") # list of id attributes
root.xpath("//book/title/text()") # text nodes
root.xpath("//book[@category='fiction']") # attribute predicate
root.xpath("count(//book)") # numeric result
root.xpath("//dc:title", namespaces=ns) # with namespace
root.xpath("//book[position()=1]") # first book
root.xpath("//book[last()]") # last book
root.xpath("//book[contains(title, 'Code')]") # contains() function
(4) xmllint CLI Cheatsheet
# Pretty-print
xmllint --format input.xml
# Validate against DTD declared in DOCTYPE
xmllint --valid --noout input.xml
# Validate against external XSD
xmllint --schema schema.xsd --noout input.xml
# XPath query from the command line
xmllint --xpath "//book[@category='fiction']/title/text()" library.xml
# Check well-formedness only
xmllint --noout input.xml && echo "Well-formed"
(5) xmlstarlet CLI Cheatsheet
# Select node values (XPath)
xmlstarlet sel -t -v "//book/title" -n library.xml
# Count elements
xmlstarlet sel -t -v "count(//book)" library.xml
# Edit: add attribute to all book elements
xmlstarlet ed -a "//book" -t attr -n "reviewed" -v "no" library.xml
# Format / indent
xmlstarlet fo library.xml
# Validate against XSD
xmlstarlet val -e -s schema.xsd library.xml
(6) Build XML from Scratch (Python lxml)
from lxml import etree
def build_library_xml() -> bytes:
"""Build a library XML document programmatically."""
root = etree.Element("library")
book = etree.SubElement(root, "book", id="001", category="programming")
title_el = etree.SubElement(book, "title")
title_el.text = "Clean Code"
author_el = etree.SubElement(book, "author")
author_el.text = "Robert C. Martin"
# Pretty-print with declaration
return etree.tostring(
root,
pretty_print=True,
xml_declaration=True,
encoding="UTF-8",
)
print(build_library_xml().decode("utf-8"))
(7) XML ↔ JSON Conversion (Python)
import xmltodict
import json
# XML → JSON-like dict → JSON string
with open("library.xml", "rb") as f:
data = xmltodict.parse(f, force_list={"book"}) # always a list even for one book
json_str = json.dumps(data, indent=2, ensure_ascii=False)
print(json_str)
# JSON dict → XML string
json_data = json.loads(json_str)
xml_str = xmltodict.unparse(json_data, pretty=True, indent=" ")
print(xml_str)
4. Bootcamp & Workshops
4.1. Official and Popular Training Resources
(1) W3Schools XML Tutorial
- URL: https://www.w3schools.com/xml/
- Learning Objectives: Core XML syntax, DTD, XML Schema (XSD), XPath, XSLT, XQuery, DOM, and SAX. Interactive "Try It Yourself" editor makes it beginner-friendly.
- Target Audience: Absolute beginners; reference lookups for experienced developers.
- Format: Web-based tutorials with live editor.
(2) Python Official Documentation — xml Package
- URL: https://docs.python.org/3/library/xml.html
- Learning Objectives: Complete reference for
xml.etree.ElementTree,xml.dom.minidom,xml.sax, and the XML security vulnerability overview. - Target Audience: Python developers; authoritative for API details.
- Format: Reference documentation with annotated code examples.
(3) lxml Official Tutorial
- URL: https://lxml.de/tutorial.html
- Learning Objectives: ElementTree API, lxml-specific extensions (parent traversal,
nsmap,getparent()), XPath, XSLT, XSD validation, HTML parsing, iterparse streaming. - Target Audience: Python developers who need production-grade XML processing.
- Format: Long-form narrative tutorial with code examples.
(4) Real Python — XML Parsing Roadmap
- URL: https://realpython.com/python-xml-parser/
- Learning Objectives: Comparison of all Python XML parsing strategies (ElementTree, minidom, SAX, lxml, BeautifulSoup, xmltodict), data binding, performance considerations.
- Target Audience: Intermediate Python developers choosing between parser options.
- Format: Long-form tutorial article.
(5) OWASP XXE Prevention Cheat Sheet
- URL: https://cheatsheetseries.owasp.org/cheatsheets/XML_External_Entity_Prevention_Cheat_Sheet.html
- Learning Objectives: XXE attack mechanics, per-language prevention guidance (Python, Java, .NET, PHP), testing techniques.
- Target Audience: Security engineers and any developer accepting XML from external sources.
- Format: Reference cheat sheet.
(6) DataCamp — Python XML Tutorial
- URL: https://www.datacamp.com/tutorial/python-xml-elementtree
- Learning Objectives: ElementTree for loops, XPath expressions, modifying XML, populating XML files from data, practical data-science contexts.
- Target Audience: Data scientists and analysts new to XML.
- Format: Tutorial with runnable notebook cells.
(7) PortSwigger Web Security Academy — XXE
- URL: https://portswigger.net/web-security/xxe
- Learning Objectives: Hands-on labs for exploiting and defending XXE in web applications; covers in-band, blind, and SSRF-via-XXE attack patterns.
- Target Audience: Security engineers, penetration testers, developers who accept XML input.
- Format: Free interactive labs with guided exploitation.
4.2. Troubleshooting — Rapid Root Cause Analysis
(1) ParseError: Not Well-Formed
Symptoms: xml.etree.ElementTree.ParseError: not well-formed (invalid token) or lxml.etree.XMLSyntaxError.
Root Causes and Fixes:
- Unescaped
<,>, or&in element text: replace with<,>,&. - Mismatched tags: open tag
<book>without matching</book>(or misspelling). - Multiple root elements — XML allows only one root.
- Byte-order mark (BOM) in a file declared as UTF-8 without BOM.
# Quick diagnostic: print the raw bytes around the error location
with open("broken.xml", "rb") as f:
content = f.read()
print(content[max(0, error_offset - 100):error_offset + 100])
# Use lxml for better error messages
from lxml import etree
try:
etree.parse("broken.xml")
except etree.XMLSyntaxError as e:
print(f"Line {e.lineno}, Column {e.offset}: {e.msg}")
(2) Namespace-Related find() / findall() Returns None
Symptom: root.find("title") returns None even though <title> clearly exists.
Root Cause: The element is in a namespace. ElementTree requires the Clark notation {URI}localname or a namespace map dict argument.
# WRONG — ignores namespace
title = root.find("title") # returns None if title is in a namespace
# CORRECT — use Clark notation
title = root.find("{http://example.com/ns}title")
# CORRECT — use namespace map (cleaner)
ns = {"ns": "http://example.com/ns"}
title = root.find("ns:title", ns)
(3) iterparse Memory Leak
Symptom: Memory usage grows continuously while processing a large file with iterparse.
Root Cause: elem.clear() is not being called after processing each element, so ElementTree retains references in the tree.
# WRONG — memory leak
for event, elem in ET.iterparse(xml_file, events=("end",)):
if elem.tag == "record":
process(elem)
# Missing elem.clear() — tree grows indefinitely
# CORRECT — clear after processing
for event, elem in ET.iterparse(xml_file, events=("end",)):
if elem.tag == "record":
process(elem)
elem.clear() # release element memory
(4) EntitiesForbidden Exception with defusedxml
Symptom: defusedxml.common.EntitiesForbidden raised when parsing valid-looking XML.
Root Cause: The XML contains entity declarations (either legitimate or malicious). defusedxml blocks all entity expansion by design.
Fix Options:
- If the entities are benign and the source is trusted, switch to lxml with explicit safe parser flags (see Section 3.3.3).
- If the source is untrusted, this exception is the correct behavior — do not suppress it.
- If entities need to be allowed selectively, use defusedxml.ElementTree.parse(forbid_entities=False) only for trusted, well-understood inputs.
(5) UnicodeDecodeError When Parsing
Symptom: UnicodeDecodeError: 'utf-8' codec can't decode byte when calling ET.parse() or ET.fromstring().
Root Causes:
- The file is not UTF-8 (it may be ISO-8859-1 / Latin-1 or Windows-1252) but declares encoding="UTF-8" in the XML declaration — or declares the wrong encoding.
- The XML declaration says encoding="ISO-8859-1" but you opened the file in text mode (Python's open() applies platform encoding before passing to the parser).
# CORRECT — always open XML files in binary mode for parsing
import xml.etree.ElementTree as ET
tree = ET.parse("data.xml") # ET handles encoding via XML declaration
# If ET fails, use lxml with explicit encoding recovery
from lxml import etree
parser = etree.XMLParser(recover=True, encoding="iso-8859-1")
tree = etree.parse("data.xml", parser=parser)
(6) XSD Validation Errors That Are Hard to Read
Symptom: schema.validate(doc) returns False, but schema.error_log messages reference line numbers you cannot find or schema types you do not recognize.
Fix: Iterate the error log with full details and cross-reference the line in the document:
from lxml import etree
schema = etree.XMLSchema(etree.parse("schema.xsd"))
doc = etree.parse("invalid.xml")
if not schema.validate(doc):
for error in schema.error_log:
print(f"[{error.level_name}] Line {error.line}: {error.message}")
# Also print the offending element from the document
lines = open("invalid.xml").readlines()
print(f" Context: {lines[error.line - 1].strip()}")
(7) XSLT Transform Produces Empty Output or Wrong Result
Root Causes:
- The XPath expressions in the XSLT template match rules use namespace prefixes that are not declared in the stylesheet.
- The template match="book" does not fire because elements are in a default namespace and the XSLT does not account for it.
- The XSLT version is 2.0 or 3.0 but lxml only supports XSLT 1.0 — use Saxon-C or Saxon-JS for higher versions.
# Debugging XSLT with lxml — print the error log
from lxml import etree
transform = etree.XSLT(etree.parse("stylesheet.xsl"))
result = transform(etree.parse("input.xml"))
# Check for transform-time messages
for error in transform.error_log:
print(error.message)
# Print raw output
print(bytes(result).decode("utf-8"))
4.3. Q&A — Common Community and Forum Questions
(1) Q: When should I choose XML over JSON for a new project?
A: Choose XML when you need any of the following: mixed content (text interleaved with markup elements, like in DITA or DocBook), rich schema validation with data typing (XSD provides far more type granularity than JSON Schema), XSLT-based transformation pipelines, namespacing to combine multiple vocabularies in one document, or when integrating with existing enterprise systems (SOAP, EDI, OOXML, government standards like HL7 CDA). For straightforward REST APIs, configuration files, or JavaScript-heavy frontends, JSON is simpler and lighter. YAML is better than either for human-authored configuration files.
(2) Q: What is the difference between find() and findall() in ElementTree?
A: find() returns the first matching element or None if no match. findall() returns a (possibly empty) list of all matching elements. Both accept the same XPath subset expressions. For attribute-safe access, prefer findtext() which returns the text content of the first match (or a default string) without raising AttributeError if the element is missing:
# find() — may return None
title = root.find("book/title")
if title is not None:
print(title.text)
# findtext() — returns default if not found (safer)
title_text = root.findtext("book/title", default="Unknown")
print(title_text)
# findall() — always returns a list (may be empty)
books = root.findall(".//book")
for book in books:
print(book.get("id"))
(3) Q: How do I add an XML declaration (<?xml version="1.0" ...?>) to output?
A: Use xml_declaration=True in ElementTree's write() method. With tostring(), add it manually if needed:
import xml.etree.ElementTree as ET
root = ET.Element("library")
tree = ET.ElementTree(root)
# Write to file with XML declaration
tree.write("output.xml", encoding="UTF-8", xml_declaration=True)
# Write to string with declaration (note: encoding must be bytes-compatible)
output = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
print(output) # b"<?xml version='1.0' encoding='UTF-8'?>\n<library />"
(4) Q: How do I pretty-print XML in Python?
# Option 1: ElementTree (Python 3.9+)
import xml.etree.ElementTree as ET
ET.indent(root, space=" ") # modifies tree in-place
print(ET.tostring(root, encoding="unicode"))
# Option 2: lxml (any version)
from lxml import etree
print(etree.tostring(root, pretty_print=True).decode("utf-8"))
# Option 3: xmllint from shell
# xmllint --format input.xml
# Option 4: minidom (older approach — adds extra whitespace text nodes)
import xml.dom.minidom
dom = xml.dom.minidom.parseString(ET.tostring(root))
print(dom.toprettyxml(indent=" "))
(5) Q: How do I handle encoding when serializing XML to a string vs. a file?
A: This is one of the most common sources of confusion. ET.tostring() with encoding="unicode" returns a Python str (no BOM, no XML declaration). With a byte encoding like encoding="UTF-8", it returns bytes with an XML declaration. For file writing, always use tree.write() with the encoding parameter — it handles the BOM and declaration correctly:
import xml.etree.ElementTree as ET
root = ET.Element("root")
ET.SubElement(root, "item").text = "Héllo"
# For in-memory string (no declaration)
text_str: str = ET.tostring(root, encoding="unicode")
# For bytes (with declaration)
byte_str: bytes = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
# For file output (recommended)
tree = ET.ElementTree(root)
tree.write("output.xml", encoding="UTF-8", xml_declaration=True)
(6) Q: Can I use lxml and still be compatible with code written for stdlib ElementTree?
A: Yes. lxml's ElementTree API is intentionally compatible with the stdlib API. The recommended pattern is a try/except import that falls back gracefully:
try:
from lxml import etree as ET
print("Using lxml")
except ImportError:
import xml.etree.ElementTree as ET
print("Using stdlib ElementTree")
# Your code below uses ET.parse(), ET.Element(), etc. unchanged
tree = ET.parse("library.xml")
root = tree.getroot()
The only incompatibilities arise when using lxml-exclusive features: getparent(), getprevious(), getnext(), nsmap, iterancestors(), iterchildren(), full XPath, and XSLT.
(7) Q: How do I convert an XML document into a Python dataclass or Pydantic model?
A: The cleanest approach uses xmlschema or manual parsing into a dataclass:
from dataclasses import dataclass
from typing import List
import xml.etree.ElementTree as ET
@dataclass
class Book:
id: str
title: str
author: str
year: int
price: float
currency: str
def parse_books(xml_path: str) -> List[Book]:
root = ET.parse(xml_path).getroot()
books = []
for book_el in root.findall("book"):
price_el = book_el.find("price")
books.append(Book(
id=book_el.get("id", ""),
title=book_el.findtext("title", ""),
author=book_el.findtext("author", ""),
year=int(book_el.findtext("year", "0")),
price=float(price_el.text) if price_el is not None else 0.0,
currency=price_el.get("currency", "USD") if price_el is not None else "USD",
))
return books
# For XSD-based data binding, use xmlschema:
# import xmlschema
# xs = xmlschema.XMLSchema("library.xsd")
# data = xs.to_dict("library.xml") # returns a dict matching the schema structure
(8) Q: What is the difference between elem.text and elem.tail?
A: This is a common source of confusion in ElementTree's data model. text is the character data immediately inside the element's opening tag. tail is the character data that follows the element's closing tag, before the next sibling or parent's closing tag:
<parent>
Leading text ← parent.text
<child>Child text</child> ← child.text = "Child text"
Trailing text ← child.tail = "\n Trailing text\n"
</parent>
When building XML programmatically for documents with mixed content, you must set both text and tail on elements to preserve whitespace and prose flow correctly.
(9) Q: How do I merge or concatenate two XML documents?
from lxml import etree
def merge_libraries(file_a: str, file_b: str) -> bytes:
"""Merge all book elements from file_b into the library root of file_a."""
tree_a = etree.parse(file_a)
tree_b = etree.parse(file_b)
root_a = tree_a.getroot()
root_b = tree_b.getroot()
for element in root_b:
root_a.append(element) # lxml appends a deep copy by default
return etree.tostring(root_a, pretty_print=True, xml_declaration=True, encoding="UTF-8")
merged = merge_libraries("library_a.xml", "library_b.xml")
print(merged.decode("utf-8"))
(10) Q: How do I use XML in an AI agent context — specifically for Claude's structured output?
A: Anthropic's Claude model uses XML-like tags natively for structured outputs in its chain-of-thought and tool-use responses. When building an AI agent, you can instruct Claude to return structured data in XML tags and then parse them with Python's XML tools. The pattern is:
from defusedxml import ElementTree as ET
system_prompt = """
When extracting data, return it in this exact format:
<extraction>
<entity name="ENTITY_NAME" type="ENTITY_TYPE">
<value>extracted value</value>
<confidence>0.95</confidence>
</entity>
</extraction>
"""
# After receiving Claude's response text, extract the XML block
import re
def extract_xml_block(response: str, tag: str) -> str | None:
"""Extract the content of a specific XML block from a response."""
pattern = rf"<{tag}>(.*?)</{tag}>"
match = re.search(pattern, response, re.DOTALL)
return match.group(0) if match else None
xml_block = extract_xml_block(claude_response, "extraction")
if xml_block:
root = ET.fromstring(xml_block) # safe parse with defusedxml
for entity in root.findall("entity"):
print(entity.get("name"), entity.findtext("value"), entity.findtext("confidence"))
This pattern is the basis of structured output pipelines in Claude-based agents, where XML tags serve as reliable delimiters for programmatic parsing without requiring JSON mode.
End of XML Getting Started Notes — v1.0 | 2026-03-27
Appendix: JSON and YAML — Sister Formats to XML
Merged from the original
XML_JSON.md(2026-03-29) andXML.md(2026-03-27) during the 2026-05 knowledge-base reorganization.
A.1. The Three Data-Serialization Formats Compared
The "why" behind these formats is universal data exchange across disparate systems. They solve the impedance mismatch between in-memory data structures (objects, arrays) and persistent storage / network transmission.
- XML (eXtensible Markup Language): designed for document integrity and complex metadata; solves strict schema validation and hierarchical document representation. (See the rest of this file for a deep dive.)
- JSON (JavaScript Object Notation): designed for speed and ease in web environments; solves the verbosity of XML, providing a lightweight map-like structure that maps directly to programming-language primitives.
- YAML (YAML Ain't Markup Language): designed for human readability; solves the visual noise of braces and tags, making it the industry standard for configuration files (CI/CD, Kubernetes).
A.2. Quick Format Snippets
JSON example:
{
"model_config": {
"name": "gemini-3-flash",
"parameters": {
"temperature": 0.7,
"max_tokens": 2048,
"stop_sequences": ["\n", "User:"]
}
}
}
YAML — equivalent of the JSON above:
model_config:
name: "gemini-3-flash"
parameters:
temperature: 0.7
max_tokens: 2048
stop_sequences: ["\n", "User:"]
XML — equivalent of the same:
<model_config>
<name>gemini-3-flash</name>
<parameters>
<temperature>0.7</temperature>
<max_tokens>2048</max_tokens>
<stop_sequences>
<stop>\n</stop>
<stop>User:</stop>
</stop_sequences>
</parameters>
</model_config>
A.3. When to Use Which
| Concern | XML | JSON | YAML |
|---|---|---|---|
| Document fidelity | Best (mixed content, metadata) | Limited | Limited |
| Schema validation | Mature (XSD, Schematron) | JSON Schema | JSON Schema (via converters) |
| Human readability | Verbose | Decent | Best |
| Programming-language match | Awkward (DOM trees) | Native (objects/arrays) | Native (objects/arrays) |
| Parse speed | Slow | Fast | Slow |
| Common ecosystem | SOAP, HL7, Office docs, RSS | REST APIs, NoSQL, web frontends | Configs (Kubernetes, GitHub Actions, Compose) |
| Recommended for | Documents, regulated industries, legacy systems | API payloads, structured data, web apps | Configuration files, IaC, pipelines |
A.4. JSON in Python
import json
# Parse
data = json.loads('{"key": "value"}') # str -> dict
with open("data.json") as f:
data = json.load(f) # file -> dict
# Serialize
text = json.dumps(data, indent=2) # dict -> str
with open("out.json", "w") as f:
json.dump(data, f, indent=2) # dict -> file
# JSON Schema validation (via jsonschema library)
from jsonschema import validate
schema = {"type": "object", "properties": {"key": {"type": "string"}}}
validate(instance=data, schema=schema)
A.5. YAML in Python
import yaml
# Parse
data = yaml.safe_load(open("config.yaml")) # always use safe_load — never load()
# Serialize
yaml.safe_dump(data, open("out.yaml", "w"), default_flow_style=False)
A.6. Shell Tools
# JSON: jq
cat data.json | jq . # pretty-print
cat data.json | jq '.users[] | select(.age > 30) | .name' # query
# YAML: yq
yq eval '.model_config.name' config.yaml # query
yq eval -o=json config.yaml # convert YAML to JSON
# XML: see the main XML toolchain section above (xmllint, xsltproc, xmlstarlet)
A.7. Security Reminders Across All Three
- Schema-validate input from untrusted sources before processing — for JSON, use JSON Schema; for XML, use XSD; for YAML, validate after converting to a typed structure.
- Never hardcode credentials in YAML configs. Use environment variables or a secrets manager.
- Sanitize user input before interpolating it into AI prompts that may be wrapped in JSON/XML/YAML.
- YAML safety: in Python, always use
yaml.safe_load(), notyaml.load()— the latter can execute arbitrary code via!python/objecttags. - XML safety: see the XXE / Billion-Laughs sections in the main XML chapters above. Disable external entity resolution unless you specifically need it.
A.8. AI-Agent Use Cases for Each Format
| Use case | Recommended format | Why |
|---|---|---|
| LLM API request/response | JSON | Native fit; most LLM APIs (OpenAI, Anthropic) speak JSON |
| Function-calling tool schemas | JSON | OpenAI/Anthropic standard |
| Structured Output / Strict mode | JSON | Backed by JSON Schema validators |
| System prompts with sections | XML tags inside text | LLMs (especially Claude) follow XML tags well |
| Project configuration | YAML | Human-friendly; what every CI/CD and IaC tool uses |
| Complex documents (HL7, .docx) | XML | Industry-standard formats already in XML |
| Vector embeddings transport | JSON or Protobuf | JSON for readability; Protobuf for performance |
A.9. Common Q & A
- Q: Why is XML still relevant in 2026?
- A: Industry standards like HL7 (healthcare), DOCX/XLSX (Office), SOAP (older enterprise services), and certain financial messaging (FIX, FpML) rely on XML for document fidelity that JSON cannot match cleanly.
- Q: Can I use YAML for huge datasets?
- A: No. YAML is slow to parse. For large data, use JSON, Parquet, or Protobuf.
- Q: When should I use JSON Schema vs. XSD?
- A: JSON Schema for JSON; XSD for XML. They're not interchangeable. JSON Schema is simpler and JSON-native; XSD is more powerful (with imports, derived types) and XML-native.
- Q: Are there typed YAML alternatives?
- A: TOML (used by
pyproject.toml,Cargo.toml) and HCL (Terraform) are stricter and avoid YAML's whitespace-sensitivity bugs. Both are gaining traction for config files.