Skip to content

XML — Getting Started Notes

📅 Fri. 2026-03-27 🕐 Current 🤖 Claude Sonnet 4.6 👉 #XML #AI #DataFormats #Python #WebDev 📎 W3C XML 1.0 Spec 📎 Python ElementTree Docs 📎 lxml Documentation 📎 OWASP XXE Prevention Cheat Sheet 📎 defusedxml on PyPI 📎 W3C XML Standards Index


1. Overview

1.1. What Is XML and Why It Exists

XML — Extensible Markup Language — is a W3C-standardized text format for encoding structured, hierarchical data in a way that is both human-readable and machine-parseable. It became a W3C Recommendation on February 10, 1998, and its Fifth Edition (XML 1.0) remains the active normative standard today. As of early 2026, no organization has announced a formal XML 2.0 effort, though the MicroXML Community Group published a significantly simplified subset specification.

(1) Design Intent

The W3C Working Group that created XML articulated its goals explicitly in the specification. XML was designed to be:

  • Straightforwardly usable over the Internet
  • Supportive of a wide variety of applications
  • Compatible with SGML (its parent standard, ISO 8879)
  • Easy to write programs that process XML documents
  • Minimally optional in features — ideally zero optional features
  • Human-legible and reasonably clear
  • Formally and concisely specified
  • Easy to author

These goals explain the angle-bracket syntax, the strict well-formedness rules, and the separation between structure (elements/attributes) and content (text nodes). XML deliberately chose verbosity over terseness, which is why it has been widely criticized but also why it remains self-documenting and universally parseable.

(2) Pain Points XML Solves

Before XML, data interchange between heterogeneous systems required negotiating proprietary binary formats or attempting to parse non-standardized plain text. XML solved several concrete problems:

  • Structural ambiguity: Every element has explicit open/close tags and a clear parent-child hierarchy.
  • Encoding diversity: XML mandates Unicode (UTF-8 by default) and carries its encoding declaration inline, eliminating charset mismatch bugs.
  • Schema enforcement: DTD (Document Type Definition) and later XSD (XML Schema Definition) provide formal grammar-based validation, enabling automated contract checking between producer and consumer.
  • Cross-language tooling: Because XML is a text-based, standardized format, parsers exist in virtually every programming language and platform, eliminating integration friction.
  • Namespace collision: XML Namespaces (a separate W3C recommendation) allow elements from multiple vocabularies to coexist in a single document without name clashes — critical for mashup formats like SOAP envelopes containing XHTML payloads.
(3) Key Features
  • Well-formedness rules that any conformant parser enforces automatically
  • Hierarchical (tree) data model — every document has exactly one root element
  • Unicode support with explicit encoding declarations
  • Processing instructions for parser directives outside document content
  • Comments, CDATA sections, and entities for special-case content
    • CDATA sections: a block of text meant to be interpreted literally by the parser, prevents parsing errors
  • Namespace support via prefixed element/attribute names
  • Validation via DTD, XSD (XML Schema), or RELAX NG
  • XPath — a query language for addressing nodes in the tree
  • XSLT — a transformation language for producing new documents from XML
  • XQuery — a full query language for XML databases
(4) Primary Use Cases
  • Configuration files: Maven pom.xml, Spring beans, Ant build files, AndroidManifest.xml, .NET project files
  • Document markup: DITA, DocBook, OOXML (.docx, .xlsx), ODF
  • Data interchange / APIs: SOAP web services, RSS/Atom feeds, Open API specs (historically), Salesforce metadata
  • Database persistence: Native XML databases (eXist-db, BaseX), SQL Server's XML columns
  • Office automation: All modern Office files (Word, Excel, PowerPoint) are ZIP archives containing XML
  • SVG graphics: Scalable Vector Graphics is an XML vocabulary
  • AI/ML data annotation: Clinical trial datasets, legal contract markup, NLP corpora annotated in XML-based formats (TEI, JATS)
1.2. Competitors & Alternatives

XML does not exist in a vacuum. Understanding where it wins and where alternatives dominate is essential before deciding to use or process it.

(1) Market Perspective
Format Dominant Market Notes
XML Enterprise integration, government, healthcare, publishing Deep install base; SOAP, HL7, DITA, Maven
JSON Web APIs, JavaScript ecosystems, NoSQL databases Displaced XML as the default REST payload format from ~2010 onward
YAML Configuration files, CI/CD pipelines, Kubernetes manifests Human-friendlier than XML/JSON for config; anchors enable reuse
Protocol Buffers (protobuf) High-performance microservices, gRPC Binary, schema-first, Google-developed; 3–10x smaller payloads
MessagePack IoT, gaming, embedded systems Binary JSON alternative; compact and fast
Apache Avro Big data, Kafka, Hadoop ecosystems Schema-evolution-friendly; binary with schema-in-header
TOML Developer-facing config files Minimal, explicit; Rust's Cargo.toml, Python's pyproject.toml
CSV Tabular data exchange Lowest common denominator; no hierarchy, no types
Format Definition Problem Solved Dominant Market Notes
XML A tag-based markup language that defines rules for encoding documents in a format that is both human-readable and machine-readable. Need for a strictly structured, self-describing, and hierarchical way to exchange complex data across incompatible systems. Enterprise integration, government, healthcare, publishing Deep install base; SOAP, HL7, DITA, Maven
JSON A lightweight data-interchange format based on a subset of the JavaScript Programming Language syntax. XML was too verbose and computationally expensive for web browsers; JSON provides a minimal, faster alternative for web traffic. Web APIs, JavaScript ecosystems, NoSQL databases Displaced XML as the default REST payload format from ~2010 onward
YAML A human-friendly data serialization standard that uses indentation to indicate nesting and structure. JSON and XML are difficult for humans to read and write manually (e.g., for settings); YAML maximizes legibility for configuration. Configuration files, CI/CD pipelines, Kubernetes manifests Human-friendlier than XML/JSON; anchors enable reuse
Protobuf A binary, language-neutral, platform-neutral, extensible mechanism for serializing structured data. Text-based formats (JSON) are too large and slow for high-frequency internal microservice communication. High-performance microservices, gRPC Binary, schema-first, Google-developed; 3–10x smaller payloads
MessagePack An efficient binary serialization format that lets you exchange data like JSON but much faster and smaller. Standard JSON is inefficient for resource-constrained environments where bandwidth and storage are at a premium. IoT, gaming, embedded systems Binary JSON alternative; compact and fast
Apache Avro A remote procedure call and data serialization framework developed within Apache's Hadoop project. Big data pipelines require a format that supports "schema evolution," allowing data structures to change over time without breaking old data. Big data, Kafka, Hadoop ecosystems Schema-evolution-friendly; binary with schema-in-header
TOML A configuration file format that is easy to read due to obvious semantics and a focus on minimal complexity. YAML’s "significant whitespace" can be ambiguous and error-prone; TOML offers a more explicit, "flat" syntax for developer settings. Developer-facing config files Minimal, explicit; Rust's Cargo.toml, Python's pyproject.toml
CSV A simple text format where each line is a data record and each record consists of one or more fields, separated by commas. The need for a universal, "lowest common denominator" format for importing and exporting flat, tabular datasets. Tabular data exchange Lowest common denominator; no hierarchy, no types
###### (2) User Segment Perspective
  • Enterprise architects designing system integrations still frequently encounter SOAP/WSDL, WS-Security, EDI over XML, and XML-based regulatory filings. XML is unavoidable here.
  • Web developers working on REST APIs default to JSON for new development but must parse XML when integrating with legacy services, RSS feeds, sitemaps, or SVG.
  • Data engineers in healthcare, pharma, and legal work with XML-annotated corpora (HL7 FHIR R4 still offers XML serialization alongside JSON; CDA documents are pure XML).
  • DevOps engineers encounter XML in Maven/Gradle, SonarQube configurations, and legacy CI systems, but have mostly migrated to YAML for new config.
  • AI/ML engineers increasingly use XML annotation schemas for LLM fine-tuning datasets (TEI-XML for historical corpora, BioC for biomedical NLP) and must parse OOXML files to extract training data.
(3) Technical Domain Perspective
Dimension XML JSON YAML Protobuf
Schema/Validation XSD, DTD, RELAX NG (rich) JSON Schema (limited) None standard .proto file
Namespaces Native None None Package namespacing
Mixed content (text + child elements) Native Awkward Awkward Not applicable
Transformation XSLT (powerful) jq (limited) None standard Code generation
Query XPath, XQuery JSONPath (not standardized until 2024) None None
Human-readability Medium High Very high Low (binary)
Parse performance Slower than binary Faster than XML Slower than JSON Fastest
Streaming large files SAX/iterparse YAJL, streaming JSON None standard Streaming protobuf

Verdict for AI agents: You will encounter XML most often when integrating with enterprise APIs, parsing OOXML/ODF files, consuming RSS/Atom feeds, processing SVG, or working with annotated NLP corpora. JSON is your default for new system design; XML is your "must-know" for legacy and document-centric workloads.


2. Concept, Component, & Architecture

2.1. Key Concepts

The following concepts are introduced from simplest to most complex, respecting their logical dependencies.

(1) Well-Formed Document

An XML document is well-formed if it follows all syntactic rules required by the XML specification. These rules are:

  • It has exactly one root element that contains all other elements.
  • All elements are properly nested (no overlapping).
  • Every open tag has a matching close tag (or is a self-closing empty element tag <br/>).
  • Attribute values are always quoted (single or double quotes).
  • The special characters <, >, and & are always escaped as &lt;, &gt;, and &amp; when they appear in content.
  • Element and attribute names are case-sensitive.

A well-formed document can be parsed by any conformant parser. Well-formedness is the minimum bar — it does not guarantee the document structure is correct for a given application.

(2) Valid Document

A valid XML document is well-formed AND conforms to a declared schema (DTD, XSD, or RELAX NG). Validity checking is optional and requires a validating parser. An XML document can be well-formed but invalid (structure violations against the schema) or valid but semantically incorrect (correct structure, wrong data values).

(3) Elements, Attributes, and Text Nodes

The three fundamental information-carrying constructs in XML:

  • Elements are the primary containers: <book id="001"><title>Clean Code</title></book>. They form the tree hierarchy.
    • child element: element nested with another element (the parent)
  • Attributes are key-value metadata on elements: id="001" in the example above. They cannot be repeated on the same element, have no ordering guarantee across parsers, and cannot contain child content.
  • Text nodes are the character data within element tags. In the example, Clean Code is the text node content of <title>.

The design choice between encoding data as an attribute or a child element is a common source of debate. The general guidance is: use attributes for metadata that describes the element itself (like an identifier), and use child elements for data that is part of the document content.

(4) Character Data (CDATA) Sections

A CDATA section wraps literal text that should not be parsed for markup. It is the mechanism for embedding content that contains < or & without requiring entity escaping:

<script>
  <![CDATA[
    if (x < 10 && y > 5) { doSomething(); }
  ]]>
</script>

The parser passes the raw content inside <![CDATA[ ... ]]> directly to the application as character data without interpretation.

(5) Entities

XML defines five built-in character entities for escaping special characters:

Entity Character
&lt; <
&gt; >
&amp; &
&apos; '
&quot; "

Additionally, documents can define custom entities in a DTD (internal entity declarations) or reference external entity files (external entity declarations). External entity processing is the root cause of XXE (XML External Entity) vulnerabilities — see Section 3.3 on security.

(6) Processing Instructions

Processing instructions (PIs) carry application-specific directives to the parser or post-processing application. They are not considered document content:

<?xml-stylesheet type="text/xsl" href="transform.xsl"?>
<?target instruction-data?>

The most important PI is the XML declaration at the very top of a document:

<?xml version="1.0" encoding="UTF-8"?>
(7) Comments

XML comments have the same syntax as HTML comments and are not part of the document's information content:

<!-- This is a comment. It cannot contain double-dashes: -- -->
(8) Namespaces

XML Namespaces allow element and attribute names from different vocabularies to coexist in one document without collision. A namespace is identified by a URI (which does not need to resolve to anything real — it is just a unique identifier):

<root xmlns:xhtml="http://www.w3.org/1999/xhtml"
      xmlns:svg="http://www.w3.org/2000/svg">
  <xhtml:p>Paragraph</xhtml:p>
  <svg:circle cx="50" cy="50" r="25"/>
</root>

The default namespace (no prefix) is declared with xmlns="URI" and applies to all unprefixed elements in scope. Understanding namespaces is critical when using ElementTree in Python because namespace URIs are baked into tag names: {http://www.w3.org/1999/xhtml}p.

(9) XPath

XPath is a query language for selecting nodes from an XML document. It models the document as a tree of nodes and provides path expressions analogous to filesystem paths:

  • //book — all <book> elements anywhere in the document
  • /library/book[@category='fiction']<book> elements in <library> with a category attribute equal to fiction
  • //book/title/text() — text content of all <title> elements that are children of <book> elements
  • count(//book) — count of all book elements

XPath 1.0 is the version supported by ElementTree. lxml supports XPath 1.0 fully; true XPath 2.0/3.0 support requires Saxon or similar.

(10) XSD (XML Schema Definition)

XSD is a W3C standard for formally describing the structure and data types permissible in an XML document. An XSD schema is itself an XML document. It provides:

  • Element and attribute declarations with data types (xs:string, xs:integer, xs:date, etc.)
  • Complex type definitions with ordering constraints (xs:sequence, xs:choice, xs:all)
  • Cardinality constraints (minOccurs, maxOccurs)
  • Inheritance and extension mechanisms

XSD validation in Python is done via lxml:

from lxml import etree

with open("schema.xsd") as f:
  schema_doc = etree.parse(f)

schema = etree.XMLSchema(schema_doc)
doc = etree.parse("document.xml")
is_valid = schema.validate(doc)
print(schema.error_log)  # details on failures
(11) XSLT

XSLT (Extensible Stylesheet Language Transformations) is a functional, template-matching language for transforming one XML document into another document (XML, HTML, or plain text). It is declared as an XML document itself and is applied by an XSLT processor. lxml ships with an XSLT 1.0 processor. XSLT 2.0/3.0 requires Saxon or Xalan:

from lxml import etree

transform = etree.XSLT(etree.parse("stylesheet.xsl"))
result = transform(etree.parse("input.xml"))
print(str(result))
2.2. Core Components
(1) The XML Document Itself

An XML document is a text file with a tree structure. Every well-formed document has exactly one root element. The typical anatomy is:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Optional: DOCTYPE declaration linking to DTD -->
<!DOCTYPE library SYSTEM "library.dtd">
<!-- Root element -->
<library xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Child elements -->
  <book id="001" category="programming">
    <dc:title>Clean Code</dc:title>
    <author>Robert C. Martin</author>
    <year>2008</year>
    <price currency="USD">35.99</price>
  </book>
  <book id="002" category="fiction">
    <dc:title>1984</dc:title>
    <author>George Orwell</author>
    <year>1949</year>
    <price currency="USD">8.99</price>
  </book>
</library>
(2) DOM Parser (Document Object Model)

A DOM parser reads the entire XML document into memory and builds a complete tree of node objects. The application can then navigate, query, and modify any part of the tree. DOM is the oldest and most feature-rich parsing model.

  • Pros: Random access to any node; supports modification; well-suited for small-to-medium documents.
  • Cons: Entire document loaded into RAM — prohibitive for large files (>100MB).
  • Python: xml.dom.minidom (stdlib, verbose API); lxml.etree (DOM-like, much faster).
(3) SAX Parser (Simple API for XML)

A SAX parser streams through the document and fires events (startElement, endElement, characters) without building a tree in memory. The application registers handler callbacks and processes data as it arrives.

  • Pros: Constant memory footprint regardless of document size; fastest for extraction of small amounts of data from large files.
  • Cons: Read-only; stateless — managing context between events requires application-level state tracking; harder to code.
  • Python: xml.sax (stdlib).
(4) ElementTree API (Python-specific)

ElementTree is a middle ground between DOM and SAX. It builds a tree in memory (like DOM) but uses a simpler, Pythonic API. It is the recommended starting point for most Python XML work:

import xml.etree.ElementTree as ET

tree = ET.parse("library.xml")
root = tree.getroot()

for book in root.findall("book"):
  title = book.find("title").text
  book_id = book.get("id")
  print(f"{book_id}: {title}")

The standard library's ElementTree (xml.etree.ElementTree) is written in C (since Python 3.3) and is fast for typical use cases. It handles 90% of real-world XML tasks. Key limitation: does not support XSD validation, XSLT, full XPath 2.0, or parent-node traversal.

(5) lxml

lxml is a Python binding to the C libraries libxml2 (XML/HTML parsing) and libxslt (XSLT transformation). It is the recommended library for production XML work requiring performance, full XPath, XSLT, or schema validation:

from lxml import etree

tree = etree.parse("library.xml")
root = tree.getroot()

# Full XPath support
books = root.xpath("//book[@category='programming']/title/text()")
print(books)  # ['Clean Code']

# XSD validation
schema = etree.XMLSchema(etree.parse("library.xsd"))
print(schema.validate(tree))

lxml is API-compatible with ElementTree for most operations, so migrating from stdlib to lxml requires only changing the import:

try:
  from lxml import etree
except ImportError:
  import xml.etree.ElementTree as etree
(6) iterparse for Streaming

When documents are too large to fit in memory, iterparse allows incremental, event-driven processing without the complexity of SAX callback registration. It is the recommended approach for large-file scenarios:

import xml.etree.ElementTree as ET

def process_large_file(xml_file: str) -> None:
  for event, elem in ET.iterparse(xml_file, events=("end",)):
    if elem.tag == "book":
      title = elem.findtext("title", default="")
      print(title)
      elem.clear()  # CRITICAL: release memory after processing

process_large_file("large_catalog.xml")

The elem.clear() call is mandatory — without it, ElementTree retains references to processed elements and the memory advantage of iterparse is lost entirely.

(7) defusedxml

defusedxml is a security-hardened drop-in replacement for Python's stdlib XML parsers. It disables all entity expansion, external entity resolution, and other attack vectors by default. It is the recommended parser for any XML received from untrusted sources:

# Replace: import xml.etree.ElementTree as ET
# With:
from defusedxml import ElementTree as ET

tree = ET.parse("untrusted_input.xml")  # Raises if XXE or Billion Laughs detected
2.3. Architecture & Design
(1) XML Information Set (Infoset)

The XML Information Set specification (W3C Recommendation) defines an abstract data model for the information in a well-formed XML document. It separates what an XML document means (the information) from how it is serialized (the bytes on disk). All XML-related standards (XPath, XQuery, XSLT, XSD) operate on the Infoset, not the raw text, which is why they are interoperable across different parser implementations.

(2) Document Tree Model

The fundamental architectural pattern of XML is the ordered, labeled tree. This model has been extremely influential and underpins HTML's DOM, JSON's object tree, and YAML's document model. The tree has these node types:

Document
├── ProcessingInstruction (<?xml-stylesheet ...?>)
├── Comment (<!-- ... -->)
└── Element (root)
    ├── Attribute (on the element itself)
    ├── Text node
    ├── CDATA section
    └── Element (child)
        ├── Attribute
        └── Text node
(3) Architecture Diagram — XML Processing Pipeline
flowchart LR
  A[XML Source\n.xml file / HTTP / stream] --> B[Parser\nElementTree / lxml / SAX]
  B --> C{Validate?}
  C -->|Yes| D[XSD / DTD\nValidation]
  C -->|No| E[In-memory Tree\nor Events]
  D --> E
  E --> F{Transform?}
  F -->|XSLT| G[XSLT Processor\nlxml.etree.XSLT]
  F -->|XPath| H[XPath Query\nroot.xpath]
  F -->|Python| I[Custom Logic\nElementTree API]
  G --> J[Output\nXML / HTML / Text]
  H --> J
  I --> J
(4) SAX vs DOM vs ElementTree Design Trade-offs

The three parsing paradigms reflect different trade-offs along the axes of memory, speed, and API convenience:

  • SAX is designed for forward-only, read-only streaming. It is the lowest-level API and mirrors the push-parser model — the parser drives execution by firing events.
  • DOM is designed for random-access modification. The entire document graph lives in memory. It mirrors the object-graph model — the application drives execution by navigating the tree.
  • ElementTree is designed for the common case — parse once, query a subset, serialize or discard. It uses pull-parsing internally and exposes a simplified tree-like API.

lxml unifies all three: it uses libxml2's streaming parser internally, builds a tree compatible with ElementTree, and exposes SAX-like event APIs through lxml.etree.iterparse and lxml.etree.XMLPullParser.

(5) Evolution of XML Standards

The XML standards family has grown iteratively over time:

Year Standard
1998 XML 1.0 becomes W3C Recommendation
1998 XML Namespaces
1999 XPath 1.0, XSLT 1.0
2001 XSD 1.0 (XML Schema)
2003 XPath 2.0 (draft), XML 1.1
2004 RELAX NG
2007 XPath 2.0 / XSLT 2.0 / XQuery 1.0 (Recommendation)
2008 XML 1.0 Fifth Edition (current normative version)
2011 Efficient XML Interchange (EXI) — W3C Recommendation (binary XML)
2017 XPath 3.1, XSLT 3.0, XQuery 3.1
2012 MicroXML Community Group formed
2025-2026 XML standards maintained by W3C Internationalization Working Group; no XML 2.0 announced
2.4. Ecosystem
(1) Python XML Ecosystem
Library Role Install
xml.etree.ElementTree Stdlib parser/builder; C-accelerated since Python 3.3 Built-in
xml.dom.minidom Stdlib DOM parser (verbose API) Built-in
xml.sax Stdlib SAX event parser Built-in
lxml Full-featured: libxml2/libxslt binding; XPath, XSLT, XSD pip install lxml
defusedxml Security-safe drop-in replacement for stdlib parsers pip install defusedxml
xmltodict Convert XML to/from Python dict (JSON-like access) pip install xmltodict
BeautifulSoup (bs4) Lenient HTML/XML parser; handles malformed input pip install beautifulsoup4 lxml
xmlschema Pure-Python XSD 1.0/1.1 validator with data binding pip install xmlschema
(2) Integration with External Systems
  • REST APIs that return XML: RSS/Atom feeds, Open Graph metadata, Salesforce SOAP API, SAP BAPI calls, HL7 FHIR XML serialization.
  • OOXML (Office files): Word (.docx), Excel (.xlsx), PowerPoint (.pptx) are ZIP archives containing XML. Python libraries python-docx, openpyxl, and python-pptx abstract the raw XML; direct XML manipulation is used for advanced cases.
  • SVG: Python's lxml, svglib, and reportlab generate/parse SVG (an XML vocabulary) for programmatic graphics.
  • Databases: PostgreSQL has an xml data type with XPath functions; SQL Server has native XML columns; eXist-db and BaseX are native XML databases with XQuery.
  • CI/CD and Build Tools: Maven (pom.xml), Ant (build.xml), Gradle (optional XML DSL), SonarQube configuration, JUnit XML test results format.
  • AI/ML Pipelines: XML-annotated corpora (TEI, JATS, BioC) are parsed with lxml/ElementTree and fed into LLM preprocessing pipelines. OOXML metadata extraction with ElementTree feeds Salesforce CRM automation and RAG (Retrieval-Augmented Generation) document ingestion.

3. Install, Configure, Secure, & Cheatsheet

3.1. Install
(1) Python Environment Setup (macOS with Homebrew)
# Install Python (if not already present)
brew install python@3.12

# Create and activate a virtual environment
python3.12 -m venv .venv
source .venv/bin/activate

# Core XML libraries
pip install lxml defusedxml xmltodict xmlschema beautifulsoup4

# Verify installations
python -c "from lxml import etree; print(etree.__version__)"
python -c "import defusedxml; print(defusedxml.__version__)"
(2) Linux / Shell Setup
# Ubuntu / Debian — system-level libxml2 dependency for lxml
sudo apt-get update && sudo apt-get install -y libxml2-dev libxslt1-dev

# Install Python packages
pip install lxml defusedxml xmltodict xmlschema --break-system-packages

# Optional: xmllint CLI for validation and formatting (ships with libxml2)
sudo apt-get install -y libxml2-utils
# xmllint is also available on macOS via:
brew install libxml2
(3) Node.js / TypeScript (Secondary Stack)
# Install fast-xml-parser — the dominant XML library for Node.js
npm install fast-xml-parser

# For XSLT in Node.js (Saxon-JS)
npm install saxon-js

# TypeScript types
npm install --save-dev @types/node
(4) CLI Tools
# xmllint — validate and pretty-print XML
xmllint --format input.xml               # pretty-print
xmllint --schema schema.xsd input.xml    # XSD validation
xmllint --noout --valid input.xml        # DTD validation (quiet)

# xsltproc — apply XSLT 1.0 transforms (ships with libxslt)
xsltproc stylesheet.xsl input.xml > output.html

# xmlstarlet — XPath queries and transforms from the shell
brew install xmlstarlet
xmlstarlet sel -t -v "//book[@category='fiction']/title" catalog.xml
3.2. Configure
(1) Namespace-Aware Parsing (ElementTree)

When parsing XML with namespaces using stdlib ElementTree, tag names include the full namespace URI:

import xml.etree.ElementTree as ET

xml_str = """
<root xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>My Document</dc:title>
</root>
"""

root = ET.fromstring(xml_str)

# Tag name includes namespace URI:
for child in root:
  print(child.tag)  # {http://purl.org/dc/elements/1.1/}title

# Use namespace map for cleaner XPath:
ns = {"dc": "http://purl.org/dc/elements/1.1/"}
title = root.find("dc:title", ns)
print(title.text)  # My Document

lxml provides the same namespace handling but with full XPath and the nsmap attribute:

from lxml import etree

xml_str = b"""
<root xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>My Document</dc:title>
</root>
"""

root = etree.fromstring(xml_str)

# Access namespace map on any element
print(root.nsmap)  # {'dc': 'http://purl.org/dc/elements/1.1/'}

# XPath with namespace prefix
titles = root.xpath("//dc:title/text()", namespaces={"dc": "http://purl.org/dc/elements/1.1/"})
print(titles)  # ['My Document']
(3) Iterparse for Large Files (Memory-Efficient)
import xml.etree.ElementTree as ET
from typing import Iterator, Generator

def stream_books(xml_path: str) -> Generator[dict, None, None]:
  """Stream book records from a large catalog XML without loading all into RAM."""
  context = ET.iterparse(xml_path, events=("start", "end"))
  _, root = next(context)  # get root element reference

  for event, elem in context:
    if event == "end" and elem.tag == "book":
      yield {
        "id": elem.get("id"),
        "title": elem.findtext("title", default=""),
        "author": elem.findtext("author", default=""),
        "year": elem.findtext("year", default=""),
      }
      elem.clear()         # free memory
      root.clear()         # also clear root's children list

for book in stream_books("large_catalog.xml"):
  print(book)
(4) XSD Validation with lxml
from lxml import etree
from pathlib import Path

def validate_xml(xml_path: str, xsd_path: str) -> bool:
  """Validate an XML document against an XSD schema. Returns True if valid."""
  schema_doc = etree.parse(xsd_path)
  schema = etree.XMLSchema(schema_doc)
  doc = etree.parse(xml_path)

  if not schema.validate(doc):
    for error in schema.error_log:
      print(f"Line {error.line}: {error.message}")
    return False
  return True

result = validate_xml("library.xml", "library.xsd")
print(f"Valid: {result}")
(5) XSLT Transform with lxml
from lxml import etree

def apply_xslt(xml_path: str, xsl_path: str, output_path: str) -> None:
  """Apply an XSLT 1.0 stylesheet to an XML document and write output."""
  dom = etree.parse(xml_path)
  xslt = etree.XSLT(etree.parse(xsl_path))
  result = xslt(dom)

  with open(output_path, "wb") as f:
    f.write(bytes(result))

apply_xslt("library.xml", "html_view.xsl", "library.html")
(6) xmltodict for JSON-Like Access
import xmltodict
import json
from pathlib import Path

# Parse XML to dict
with open("library.xml", "rb") as f:
  data = xmltodict.parse(f)

# Navigate like a dict/JSON
books = data["library"]["book"]
print(json.dumps(books, indent=2))

# Serialize back to XML
xml_output = xmltodict.unparse(data, pretty=True)
print(xml_output)
(7) Node.js / TypeScript — fast-xml-parser
import { XMLParser, XMLBuilder } from "fast-xml-parser";
import { readFileSync, writeFileSync } from "fs";

// Parse XML
const parser = new XMLParser({
  ignoreAttributes: false,        // include attributes
  attributeNamePrefix: "@_",      // prefix attributes with @_
  parseAttributeValue: true,      // convert numeric attribute values
  trimValues: true,
});

const xmlContent = readFileSync("library.xml", "utf-8");
const data = parser.parse(xmlContent);
console.log(data.library.book);

// Build XML from object
const builder = new XMLBuilder({
  ignoreAttributes: false,
  attributeNamePrefix: "@_",
  format: true,
  indentBy: "  ",
});

const xmlOutput = builder.build(data);
writeFileSync("output.xml", xmlOutput);
3.3. Secure
(1) Threat Model Overview

XML parsers are vulnerable to three primary classes of attacks when processing untrusted input:

  • XXE (XML External Entity) Injection: A malicious DTD references an external entity (SYSTEM "file:///etc/passwd") which the parser resolves and injects into the output, exposing sensitive files or enabling SSRF. Critical CVEs in 2024-2025 (CVE-2024-1455 in LangChain, CVE-2025-3225 in sitemap parsers, CVE-2025-30220 in GeoServer) demonstrate this remains actively exploited.
  • Billion Laughs (XML Bomb / XEE): Nested entity definitions that expand exponentially during parsing, consuming all available RAM and causing DoS. A 1KB payload can expand to gigabytes.
  • Quadratic Blowup: Entities that expand to large strings referenced many times cause O(n²) memory growth.
(2) Python — Use defusedxml for ALL Untrusted Input
# WRONG — vulnerable to XXE and Billion Laughs
import xml.etree.ElementTree as ET
tree = ET.parse(untrusted_file)   # DO NOT DO THIS

# CORRECT — defusedxml is a drop-in replacement
from defusedxml import ElementTree as ET
tree = ET.parse(untrusted_file)  # Raises EntitiesForbidden if XXE/Billion Laughs detected

# Also available:
from defusedxml import minidom, sax, expatbuilder
(3) Python — Harden lxml Parser Explicitly

When lxml is required for performance or features, harden the parser explicitly:

from lxml import etree

def create_safe_parser() -> etree.XMLParser:
  """Return a hardened lxml parser safe for untrusted XML input."""
  return etree.XMLParser(
    resolve_entities=False,   # Block entity resolution (XXE)
    no_network=True,          # Block network fetches from DTDs
    dtd_validation=False,     # Disable DTD-based validation
    load_dtd=False,           # Do not load external DTD files
    huge_tree=False,          # Prevent deeply nested DoS attacks
  )

parser = create_safe_parser()
tree = etree.parse("untrusted_input.xml", parser=parser)
(4) Find Vulnerable Parsers in Your Codebase
# Scan for stdlib XML usage that should use defusedxml
grep -rn "xml.etree\|xml.dom\|xml.sax\|minidom" --include="*.py" .

# Run Bandit security linter — flags XXE-vulnerable XML usage
pip install bandit
bandit -r . -t B317,B318,B319,B320,B405,B406,B407,B408,B409,B410,B411
(5) Node.js — Harden fast-xml-parser
import { XMLParser } from "fast-xml-parser";

// Disable external entities and DTD processing
const parser = new XMLParser({
  allowBooleanAttributes: false,
  processEntities: false,       // Do not expand entities
  htmlEntities: false,
  stopNodes: [],
});
(6) Network-Level Mitigations
  • Run XML-processing services in network-isolated containers (no outbound HTTP/file:// from the parsing process).
  • Set file size limits before parsing: reject requests with Content-Length exceeding your threshold (e.g., 10MB for API endpoints).
  • Log and alert on DOCTYPE declarations in incoming XML at the WAF or API gateway level.
  • Use Semgrep rules for CI/CD static analysis (semgrep --config "p/owasp-top-ten" includes XXE rules).
(7) Security Summary Table
Attack Stdlib ET defusedxml lxml (unhardened) lxml (hardened)
XXE Vulnerable Safe Partially safe Safe
Billion Laughs Vulnerable Safe Vulnerable Safe
Quadratic Blowup Vulnerable Safe Vulnerable Mitigated
DoS via huge tree Vulnerable Safe Configurable Safe
3.4. Cheatsheet
(1) XML Syntax Quick Reference
<?xml version="1.0" encoding="UTF-8"?>           <!-- XML declaration -->
<!-- Comment -->                                   <!-- Comment syntax -->
<!DOCTYPE root SYSTEM "schema.dtd">               <!-- DTD reference -->
<root xmlns="http://example.com/ns"               <!-- Default namespace -->
      xmlns:xlink="http://www.w3.org/1999/xlink"> <!-- Prefixed namespace -->
  <element attribute="value">Text content</element>  <!-- Element + attribute + text -->
  <self-closing-element/>                            <!-- Empty element -->
  <element><![CDATA[<raw> & content]]></element>     <!-- CDATA section -->
  &lt; &gt; &amp; &apos; &quot;                      <!-- Built-in entities -->
</root>
(2) Python ElementTree Cheatsheet
import xml.etree.ElementTree as ET

# --- PARSE ---
tree = ET.parse("file.xml")          # from file
root = tree.getroot()
root = ET.fromstring("<root/>")      # from string

# --- NAVIGATE ---
root.tag                             # element tag name
root.attrib                          # dict of all attributes
root.attrib.get("id", "default")     # single attribute (safe)
root.text                            # text content
root.tail                            # text after closing tag
list(root)                           # direct child elements

# --- SEARCH ---
root.find("book")                    # first matching child (shallow)
root.findall("book")                 # all matching children (shallow)
root.findall(".//book")              # all in subtree
root.findall(".//book[@category='fiction']")   # with attribute filter
root.findtext("title", default="")   # text of first match

# --- ITERATE ---
for child in root:                   # iterate direct children
  print(child.tag, child.attrib)

for elem in root.iter("book"):       # iterate all descendants by tag
  print(elem.get("id"))

# --- MODIFY ---
elem = ET.SubElement(root, "book")   # add new child
elem.set("id", "003")                # set attribute
elem.text = "New book title"         # set text
root.remove(elem)                    # remove child
elem.attrib.pop("id", None)          # remove attribute

# --- SERIALIZE ---
tree.write("output.xml", encoding="unicode", xml_declaration=True)
print(ET.tostring(root, encoding="unicode"))

# --- NAMESPACES ---
ns = {"dc": "http://purl.org/dc/elements/1.1/"}
title = root.find("dc:title", ns)
ET.register_namespace("dc", "http://purl.org/dc/elements/1.1/")  # pretty output
(3) lxml XPath Cheatsheet
from lxml import etree

root = etree.parse("library.xml").getroot()
ns = {"dc": "http://purl.org/dc/elements/1.1/"}

root.xpath("//book")                                   # all book elements
root.xpath("//book/@id")                               # list of id attributes
root.xpath("//book/title/text()")                      # text nodes
root.xpath("//book[@category='fiction']")              # attribute predicate
root.xpath("count(//book)")                            # numeric result
root.xpath("//dc:title", namespaces=ns)                # with namespace
root.xpath("//book[position()=1]")                     # first book
root.xpath("//book[last()]")                           # last book
root.xpath("//book[contains(title, 'Code')]")          # contains() function
(4) xmllint CLI Cheatsheet
# Pretty-print
xmllint --format input.xml

# Validate against DTD declared in DOCTYPE
xmllint --valid --noout input.xml

# Validate against external XSD
xmllint --schema schema.xsd --noout input.xml

# XPath query from the command line
xmllint --xpath "//book[@category='fiction']/title/text()" library.xml

# Check well-formedness only
xmllint --noout input.xml && echo "Well-formed"
(5) xmlstarlet CLI Cheatsheet
# Select node values (XPath)
xmlstarlet sel -t -v "//book/title" -n library.xml

# Count elements
xmlstarlet sel -t -v "count(//book)" library.xml

# Edit: add attribute to all book elements
xmlstarlet ed -a "//book" -t attr -n "reviewed" -v "no" library.xml

# Format / indent
xmlstarlet fo library.xml

# Validate against XSD
xmlstarlet val -e -s schema.xsd library.xml
(6) Build XML from Scratch (Python lxml)
from lxml import etree

def build_library_xml() -> bytes:
  """Build a library XML document programmatically."""
  root = etree.Element("library")

  book = etree.SubElement(root, "book", id="001", category="programming")
  title_el = etree.SubElement(book, "title")
  title_el.text = "Clean Code"
  author_el = etree.SubElement(book, "author")
  author_el.text = "Robert C. Martin"

  # Pretty-print with declaration
  return etree.tostring(
    root,
    pretty_print=True,
    xml_declaration=True,
    encoding="UTF-8",
  )

print(build_library_xml().decode("utf-8"))
(7) XML ↔ JSON Conversion (Python)
import xmltodict
import json

# XML → JSON-like dict → JSON string
with open("library.xml", "rb") as f:
  data = xmltodict.parse(f, force_list={"book"})  # always a list even for one book

json_str = json.dumps(data, indent=2, ensure_ascii=False)
print(json_str)

# JSON dict → XML string
json_data = json.loads(json_str)
xml_str = xmltodict.unparse(json_data, pretty=True, indent="  ")
print(xml_str)

4. Bootcamp & Workshops

(1) W3Schools XML Tutorial
  • URL: https://www.w3schools.com/xml/
  • Learning Objectives: Core XML syntax, DTD, XML Schema (XSD), XPath, XSLT, XQuery, DOM, and SAX. Interactive "Try It Yourself" editor makes it beginner-friendly.
  • Target Audience: Absolute beginners; reference lookups for experienced developers.
  • Format: Web-based tutorials with live editor.
(2) Python Official Documentation — xml Package
  • URL: https://docs.python.org/3/library/xml.html
  • Learning Objectives: Complete reference for xml.etree.ElementTree, xml.dom.minidom, xml.sax, and the XML security vulnerability overview.
  • Target Audience: Python developers; authoritative for API details.
  • Format: Reference documentation with annotated code examples.
(3) lxml Official Tutorial
  • URL: https://lxml.de/tutorial.html
  • Learning Objectives: ElementTree API, lxml-specific extensions (parent traversal, nsmap, getparent()), XPath, XSLT, XSD validation, HTML parsing, iterparse streaming.
  • Target Audience: Python developers who need production-grade XML processing.
  • Format: Long-form narrative tutorial with code examples.
(4) Real Python — XML Parsing Roadmap
  • URL: https://realpython.com/python-xml-parser/
  • Learning Objectives: Comparison of all Python XML parsing strategies (ElementTree, minidom, SAX, lxml, BeautifulSoup, xmltodict), data binding, performance considerations.
  • Target Audience: Intermediate Python developers choosing between parser options.
  • Format: Long-form tutorial article.
(5) OWASP XXE Prevention Cheat Sheet
(6) DataCamp — Python XML Tutorial
  • URL: https://www.datacamp.com/tutorial/python-xml-elementtree
  • Learning Objectives: ElementTree for loops, XPath expressions, modifying XML, populating XML files from data, practical data-science contexts.
  • Target Audience: Data scientists and analysts new to XML.
  • Format: Tutorial with runnable notebook cells.
(7) PortSwigger Web Security Academy — XXE
  • URL: https://portswigger.net/web-security/xxe
  • Learning Objectives: Hands-on labs for exploiting and defending XXE in web applications; covers in-band, blind, and SSRF-via-XXE attack patterns.
  • Target Audience: Security engineers, penetration testers, developers who accept XML input.
  • Format: Free interactive labs with guided exploitation.
4.2. Troubleshooting — Rapid Root Cause Analysis
(1) ParseError: Not Well-Formed

Symptoms: xml.etree.ElementTree.ParseError: not well-formed (invalid token) or lxml.etree.XMLSyntaxError.

Root Causes and Fixes:

  • Unescaped <, >, or & in element text: replace with &lt;, &gt;, &amp;.
  • Mismatched tags: open tag <book> without matching </book> (or misspelling).
  • Multiple root elements — XML allows only one root.
  • Byte-order mark (BOM) in a file declared as UTF-8 without BOM.
# Quick diagnostic: print the raw bytes around the error location
with open("broken.xml", "rb") as f:
  content = f.read()
  print(content[max(0, error_offset - 100):error_offset + 100])

# Use lxml for better error messages
from lxml import etree
try:
  etree.parse("broken.xml")
except etree.XMLSyntaxError as e:
  print(f"Line {e.lineno}, Column {e.offset}: {e.msg}")

Symptom: root.find("title") returns None even though <title> clearly exists.

Root Cause: The element is in a namespace. ElementTree requires the Clark notation {URI}localname or a namespace map dict argument.

# WRONG — ignores namespace
title = root.find("title")        # returns None if title is in a namespace

# CORRECT — use Clark notation
title = root.find("{http://example.com/ns}title")

# CORRECT — use namespace map (cleaner)
ns = {"ns": "http://example.com/ns"}
title = root.find("ns:title", ns)
(3) iterparse Memory Leak

Symptom: Memory usage grows continuously while processing a large file with iterparse.

Root Cause: elem.clear() is not being called after processing each element, so ElementTree retains references in the tree.

# WRONG — memory leak
for event, elem in ET.iterparse(xml_file, events=("end",)):
  if elem.tag == "record":
    process(elem)
    # Missing elem.clear() — tree grows indefinitely

# CORRECT — clear after processing
for event, elem in ET.iterparse(xml_file, events=("end",)):
  if elem.tag == "record":
    process(elem)
    elem.clear()   # release element memory
(4) EntitiesForbidden Exception with defusedxml

Symptom: defusedxml.common.EntitiesForbidden raised when parsing valid-looking XML.

Root Cause: The XML contains entity declarations (either legitimate or malicious). defusedxml blocks all entity expansion by design.

Fix Options: - If the entities are benign and the source is trusted, switch to lxml with explicit safe parser flags (see Section 3.3.3). - If the source is untrusted, this exception is the correct behavior — do not suppress it. - If entities need to be allowed selectively, use defusedxml.ElementTree.parse(forbid_entities=False) only for trusted, well-understood inputs.

(5) UnicodeDecodeError When Parsing

Symptom: UnicodeDecodeError: 'utf-8' codec can't decode byte when calling ET.parse() or ET.fromstring().

Root Causes: - The file is not UTF-8 (it may be ISO-8859-1 / Latin-1 or Windows-1252) but declares encoding="UTF-8" in the XML declaration — or declares the wrong encoding. - The XML declaration says encoding="ISO-8859-1" but you opened the file in text mode (Python's open() applies platform encoding before passing to the parser).

# CORRECT — always open XML files in binary mode for parsing
import xml.etree.ElementTree as ET

tree = ET.parse("data.xml")                       # ET handles encoding via XML declaration

# If ET fails, use lxml with explicit encoding recovery
from lxml import etree
parser = etree.XMLParser(recover=True, encoding="iso-8859-1")
tree = etree.parse("data.xml", parser=parser)
(6) XSD Validation Errors That Are Hard to Read

Symptom: schema.validate(doc) returns False, but schema.error_log messages reference line numbers you cannot find or schema types you do not recognize.

Fix: Iterate the error log with full details and cross-reference the line in the document:

from lxml import etree

schema = etree.XMLSchema(etree.parse("schema.xsd"))
doc = etree.parse("invalid.xml")

if not schema.validate(doc):
  for error in schema.error_log:
    print(f"[{error.level_name}] Line {error.line}: {error.message}")
    # Also print the offending element from the document
    lines = open("invalid.xml").readlines()
    print(f"  Context: {lines[error.line - 1].strip()}")
(7) XSLT Transform Produces Empty Output or Wrong Result

Root Causes: - The XPath expressions in the XSLT template match rules use namespace prefixes that are not declared in the stylesheet. - The template match="book" does not fire because elements are in a default namespace and the XSLT does not account for it. - The XSLT version is 2.0 or 3.0 but lxml only supports XSLT 1.0 — use Saxon-C or Saxon-JS for higher versions.

# Debugging XSLT with lxml — print the error log
from lxml import etree

transform = etree.XSLT(etree.parse("stylesheet.xsl"))
result = transform(etree.parse("input.xml"))

# Check for transform-time messages
for error in transform.error_log:
  print(error.message)

# Print raw output
print(bytes(result).decode("utf-8"))
4.3. Q&A — Common Community and Forum Questions
(1) Q: When should I choose XML over JSON for a new project?

A: Choose XML when you need any of the following: mixed content (text interleaved with markup elements, like in DITA or DocBook), rich schema validation with data typing (XSD provides far more type granularity than JSON Schema), XSLT-based transformation pipelines, namespacing to combine multiple vocabularies in one document, or when integrating with existing enterprise systems (SOAP, EDI, OOXML, government standards like HL7 CDA). For straightforward REST APIs, configuration files, or JavaScript-heavy frontends, JSON is simpler and lighter. YAML is better than either for human-authored configuration files.

(2) Q: What is the difference between find() and findall() in ElementTree?

A: find() returns the first matching element or None if no match. findall() returns a (possibly empty) list of all matching elements. Both accept the same XPath subset expressions. For attribute-safe access, prefer findtext() which returns the text content of the first match (or a default string) without raising AttributeError if the element is missing:

# find() — may return None
title = root.find("book/title")
if title is not None:
  print(title.text)

# findtext() — returns default if not found (safer)
title_text = root.findtext("book/title", default="Unknown")
print(title_text)

# findall() — always returns a list (may be empty)
books = root.findall(".//book")
for book in books:
  print(book.get("id"))
(3) Q: How do I add an XML declaration (<?xml version="1.0" ...?>) to output?

A: Use xml_declaration=True in ElementTree's write() method. With tostring(), add it manually if needed:

import xml.etree.ElementTree as ET

root = ET.Element("library")
tree = ET.ElementTree(root)

# Write to file with XML declaration
tree.write("output.xml", encoding="UTF-8", xml_declaration=True)

# Write to string with declaration (note: encoding must be bytes-compatible)
output = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
print(output)  # b"<?xml version='1.0' encoding='UTF-8'?>\n<library />"
(4) Q: How do I pretty-print XML in Python?
# Option 1: ElementTree (Python 3.9+)
import xml.etree.ElementTree as ET

ET.indent(root, space="  ")   # modifies tree in-place
print(ET.tostring(root, encoding="unicode"))

# Option 2: lxml (any version)
from lxml import etree
print(etree.tostring(root, pretty_print=True).decode("utf-8"))

# Option 3: xmllint from shell
# xmllint --format input.xml

# Option 4: minidom (older approach — adds extra whitespace text nodes)
import xml.dom.minidom
dom = xml.dom.minidom.parseString(ET.tostring(root))
print(dom.toprettyxml(indent="  "))
(5) Q: How do I handle encoding when serializing XML to a string vs. a file?

A: This is one of the most common sources of confusion. ET.tostring() with encoding="unicode" returns a Python str (no BOM, no XML declaration). With a byte encoding like encoding="UTF-8", it returns bytes with an XML declaration. For file writing, always use tree.write() with the encoding parameter — it handles the BOM and declaration correctly:

import xml.etree.ElementTree as ET

root = ET.Element("root")
ET.SubElement(root, "item").text = "Héllo"

# For in-memory string (no declaration)
text_str: str = ET.tostring(root, encoding="unicode")

# For bytes (with declaration)
byte_str: bytes = ET.tostring(root, encoding="UTF-8", xml_declaration=True)

# For file output (recommended)
tree = ET.ElementTree(root)
tree.write("output.xml", encoding="UTF-8", xml_declaration=True)
(6) Q: Can I use lxml and still be compatible with code written for stdlib ElementTree?

A: Yes. lxml's ElementTree API is intentionally compatible with the stdlib API. The recommended pattern is a try/except import that falls back gracefully:

try:
  from lxml import etree as ET
  print("Using lxml")
except ImportError:
  import xml.etree.ElementTree as ET
  print("Using stdlib ElementTree")

# Your code below uses ET.parse(), ET.Element(), etc. unchanged
tree = ET.parse("library.xml")
root = tree.getroot()

The only incompatibilities arise when using lxml-exclusive features: getparent(), getprevious(), getnext(), nsmap, iterancestors(), iterchildren(), full XPath, and XSLT.

(7) Q: How do I convert an XML document into a Python dataclass or Pydantic model?

A: The cleanest approach uses xmlschema or manual parsing into a dataclass:

from dataclasses import dataclass
from typing import List
import xml.etree.ElementTree as ET

@dataclass
class Book:
  id: str
  title: str
  author: str
  year: int
  price: float
  currency: str

def parse_books(xml_path: str) -> List[Book]:
  root = ET.parse(xml_path).getroot()
  books = []
  for book_el in root.findall("book"):
    price_el = book_el.find("price")
    books.append(Book(
      id=book_el.get("id", ""),
      title=book_el.findtext("title", ""),
      author=book_el.findtext("author", ""),
      year=int(book_el.findtext("year", "0")),
      price=float(price_el.text) if price_el is not None else 0.0,
      currency=price_el.get("currency", "USD") if price_el is not None else "USD",
    ))
  return books

# For XSD-based data binding, use xmlschema:
# import xmlschema
# xs = xmlschema.XMLSchema("library.xsd")
# data = xs.to_dict("library.xml")  # returns a dict matching the schema structure
(8) Q: What is the difference between elem.text and elem.tail?

A: This is a common source of confusion in ElementTree's data model. text is the character data immediately inside the element's opening tag. tail is the character data that follows the element's closing tag, before the next sibling or parent's closing tag:

<parent>
  Leading text           ← parent.text
  <child>Child text</child>   ← child.text = "Child text"
  Trailing text          ← child.tail = "\n  Trailing text\n"
</parent>

When building XML programmatically for documents with mixed content, you must set both text and tail on elements to preserve whitespace and prose flow correctly.

(9) Q: How do I merge or concatenate two XML documents?
from lxml import etree

def merge_libraries(file_a: str, file_b: str) -> bytes:
  """Merge all book elements from file_b into the library root of file_a."""
  tree_a = etree.parse(file_a)
  tree_b = etree.parse(file_b)

  root_a = tree_a.getroot()
  root_b = tree_b.getroot()

  for element in root_b:
    root_a.append(element)   # lxml appends a deep copy by default

  return etree.tostring(root_a, pretty_print=True, xml_declaration=True, encoding="UTF-8")

merged = merge_libraries("library_a.xml", "library_b.xml")
print(merged.decode("utf-8"))
(10) Q: How do I use XML in an AI agent context — specifically for Claude's structured output?

A: Anthropic's Claude model uses XML-like tags natively for structured outputs in its chain-of-thought and tool-use responses. When building an AI agent, you can instruct Claude to return structured data in XML tags and then parse them with Python's XML tools. The pattern is:

from defusedxml import ElementTree as ET

system_prompt = """
When extracting data, return it in this exact format:
<extraction>
  <entity name="ENTITY_NAME" type="ENTITY_TYPE">
    <value>extracted value</value>
    <confidence>0.95</confidence>
  </entity>
</extraction>
"""

# After receiving Claude's response text, extract the XML block
import re

def extract_xml_block(response: str, tag: str) -> str | None:
  """Extract the content of a specific XML block from a response."""
  pattern = rf"<{tag}>(.*?)</{tag}>"
  match = re.search(pattern, response, re.DOTALL)
  return match.group(0) if match else None

xml_block = extract_xml_block(claude_response, "extraction")
if xml_block:
  root = ET.fromstring(xml_block)   # safe parse with defusedxml
  for entity in root.findall("entity"):
    print(entity.get("name"), entity.findtext("value"), entity.findtext("confidence"))

This pattern is the basis of structured output pipelines in Claude-based agents, where XML tags serve as reliable delimiters for programmatic parsing without requiring JSON mode.


End of XML Getting Started Notes — v1.0 | 2026-03-27


Appendix: JSON and YAML — Sister Formats to XML

Merged from the original XML_JSON.md (2026-03-29) and XML.md (2026-03-27) during the 2026-05 knowledge-base reorganization.

A.1. The Three Data-Serialization Formats Compared

The "why" behind these formats is universal data exchange across disparate systems. They solve the impedance mismatch between in-memory data structures (objects, arrays) and persistent storage / network transmission.

  • XML (eXtensible Markup Language): designed for document integrity and complex metadata; solves strict schema validation and hierarchical document representation. (See the rest of this file for a deep dive.)
  • JSON (JavaScript Object Notation): designed for speed and ease in web environments; solves the verbosity of XML, providing a lightweight map-like structure that maps directly to programming-language primitives.
  • YAML (YAML Ain't Markup Language): designed for human readability; solves the visual noise of braces and tags, making it the industry standard for configuration files (CI/CD, Kubernetes).

A.2. Quick Format Snippets

JSON example:

{
  "model_config": {
    "name": "gemini-3-flash",
    "parameters": {
      "temperature": 0.7,
      "max_tokens": 2048,
      "stop_sequences": ["\n", "User:"]
    }
  }
}

YAML — equivalent of the JSON above:

model_config:
  name: "gemini-3-flash"
  parameters:
    temperature: 0.7
    max_tokens: 2048
    stop_sequences: ["\n", "User:"]

XML — equivalent of the same:

<model_config>
  <name>gemini-3-flash</name>
  <parameters>
    <temperature>0.7</temperature>
    <max_tokens>2048</max_tokens>
    <stop_sequences>
      <stop>\n</stop>
      <stop>User:</stop>
    </stop_sequences>
  </parameters>
</model_config>

A.3. When to Use Which

Concern XML JSON YAML
Document fidelity Best (mixed content, metadata) Limited Limited
Schema validation Mature (XSD, Schematron) JSON Schema JSON Schema (via converters)
Human readability Verbose Decent Best
Programming-language match Awkward (DOM trees) Native (objects/arrays) Native (objects/arrays)
Parse speed Slow Fast Slow
Common ecosystem SOAP, HL7, Office docs, RSS REST APIs, NoSQL, web frontends Configs (Kubernetes, GitHub Actions, Compose)
Recommended for Documents, regulated industries, legacy systems API payloads, structured data, web apps Configuration files, IaC, pipelines

A.4. JSON in Python

import json

# Parse
data = json.loads('{"key": "value"}')          # str  -> dict
with open("data.json") as f:
    data = json.load(f)                         # file -> dict

# Serialize
text = json.dumps(data, indent=2)              # dict -> str
with open("out.json", "w") as f:
    json.dump(data, f, indent=2)               # dict -> file

# JSON Schema validation (via jsonschema library)
from jsonschema import validate
schema = {"type": "object", "properties": {"key": {"type": "string"}}}
validate(instance=data, schema=schema)

A.5. YAML in Python

import yaml

# Parse
data = yaml.safe_load(open("config.yaml"))     # always use safe_load — never load()

# Serialize
yaml.safe_dump(data, open("out.yaml", "w"), default_flow_style=False)

A.6. Shell Tools

# JSON: jq
cat data.json | jq .                                       # pretty-print
cat data.json | jq '.users[] | select(.age > 30) | .name' # query

# YAML: yq
yq eval '.model_config.name' config.yaml                  # query
yq eval -o=json config.yaml                                # convert YAML to JSON

# XML: see the main XML toolchain section above (xmllint, xsltproc, xmlstarlet)

A.7. Security Reminders Across All Three

  • Schema-validate input from untrusted sources before processing — for JSON, use JSON Schema; for XML, use XSD; for YAML, validate after converting to a typed structure.
  • Never hardcode credentials in YAML configs. Use environment variables or a secrets manager.
  • Sanitize user input before interpolating it into AI prompts that may be wrapped in JSON/XML/YAML.
  • YAML safety: in Python, always use yaml.safe_load(), not yaml.load() — the latter can execute arbitrary code via !python/object tags.
  • XML safety: see the XXE / Billion-Laughs sections in the main XML chapters above. Disable external entity resolution unless you specifically need it.

A.8. AI-Agent Use Cases for Each Format

Use case Recommended format Why
LLM API request/response JSON Native fit; most LLM APIs (OpenAI, Anthropic) speak JSON
Function-calling tool schemas JSON OpenAI/Anthropic standard
Structured Output / Strict mode JSON Backed by JSON Schema validators
System prompts with sections XML tags inside text LLMs (especially Claude) follow XML tags well
Project configuration YAML Human-friendly; what every CI/CD and IaC tool uses
Complex documents (HL7, .docx) XML Industry-standard formats already in XML
Vector embeddings transport JSON or Protobuf JSON for readability; Protobuf for performance

A.9. Common Q & A

  • Q: Why is XML still relevant in 2026?
  • A: Industry standards like HL7 (healthcare), DOCX/XLSX (Office), SOAP (older enterprise services), and certain financial messaging (FIX, FpML) rely on XML for document fidelity that JSON cannot match cleanly.
  • Q: Can I use YAML for huge datasets?
  • A: No. YAML is slow to parse. For large data, use JSON, Parquet, or Protobuf.
  • Q: When should I use JSON Schema vs. XSD?
  • A: JSON Schema for JSON; XSD for XML. They're not interchangeable. JSON Schema is simpler and JSON-native; XSD is more powerful (with imports, derived types) and XML-native.
  • Q: Are there typed YAML alternatives?
  • A: TOML (used by pyproject.toml, Cargo.toml) and HCL (Terraform) are stricter and avoid YAML's whitespace-sensitivity bugs. Both are gaining traction for config files.