11.YAML
📅 Sat. 2026-04-11 🕐 21:39 from Gemini 3 Flash 👉 #YAML #DevOps #DataSerialization #Python 📎 YAML Spec 1.2 | PyYAML Documentation
1. Overview.
(1) Design Intent
YAML ("YAML Ain't Markup Language", originally "Yet Another Markup Language") is a data serialization language designed for human readability. Where JSON and XML favor machine parsing and explicit delimiters, YAML favors visual scannability and infers structure from indentation. It maps directly onto the common data structures of high-level languages (dictionaries, lists, scalars) without an intermediate abstraction layer.
- Data serialization is the process of translating in-memory data structures (objects, lists, and variables in your code) into a standard text format that can be stored in a file, sent over a network, and reconstructed by another program later.
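The serialization round trip described above can be sketched with PyYAML (assumed installed via `pip install pyyaml`). The `config` dictionary is a made-up example, not something from a real project.

```python
import yaml  # PyYAML, assumed installed

# Hypothetical in-memory configuration, used only for illustration.
config = {
    "service": "billing",
    "replicas": 3,
    "debug": False,
    "regions": ["eu-west-1", "us-east-1"],
}

# Serialize: native Python objects -> YAML text.
text = yaml.safe_dump(config, sort_keys=False)
print(text)

# Deserialize: YAML text -> native Python objects.
restored = yaml.safe_load(text)
print(restored == config)
```

The final comparison is the point of the sketch: what comes back from the text form is equal to what went in.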
(2) Pain Points Solved
- Cognitive Load in Configurations: Replaces rigid syntax boundaries (braces, brackets, closing tags) with indentation scoping, mirroring Python's design philosophy.
- Redundant Data Structures: Solves repetition in large configurations via its internal relational graph mechanism (Anchors and Aliases), a feature absent in JSON.
- Complex String Escaping: Eliminates the need to escape characters in large blocks of text (like embedded scripts or certificates) through block scalars.
(3) Features
- Graph Representation: Unlike JSON's strict tree structure, YAML can represent cyclical and relational graphs natively.
- Type Resolution: Employs implicit type resolution (automatically deducing integers, floats, booleans) and explicit typing via tags (`!!str`).
- Multi-document Streams: Supports concatenating multiple distinct documents in a single file stream separated by `---` (e.g., Kubernetes manifests).
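A small sketch of type resolution and multi-document streams, assuming PyYAML. The keys and values in `stream` are invented for the example.

```python
import yaml

# Two documents in one stream, separated by '---'.
stream = """\
port: 8080          # implicit resolution -> int
ratio: 0.75         # implicit resolution -> float
enabled: true       # implicit resolution -> bool
version: !!str 1.20 # explicit tag forces a string
---
kind: Service       # second document in the same stream
"""

# safe_load_all yields one Python object per document.
first, second = list(yaml.safe_load_all(stream))

for key, value in first.items():
    print(key, type(value).__name__, repr(value))
print(second)
```

Without the `!!str` tag, `1.20` would resolve to a float and lose its trailing zero, which is why version strings are a common place for explicit tags or quotes.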
(4) Use Cases
- Infrastructure as Code (IaC): Kubernetes, Ansible, AWS CloudFormation. YAML acts as the declarative state definition.
- CI/CD Orchestration: GitHub Actions, GitLab CI. YAML defines the directed acyclic graph (DAG) of execution pipelines.
- Complex Configuration Management: Spring Boot, Ruby on Rails environments.
(5) Competitors
| Specification | Primary Domain | Structural Paradigm | Data Model | Native Relationality |
|---|---|---|---|---|
| YAML | DevOps, Configs | Whitespace | Graph / Tree | Yes (Anchors) |
| JSON | APIs, Data Transit | Braces/Brackets | Tree | No |
| XML | Enterprise SOAP, Legacy | Tagged Markup | Document Tree | Yes (ID/IDREF) |
| TOML | Package Configs (Rust/Python) | INI-style Key/Value | Table/Dict | No |
2. Concept, Component, & Architecture.
2.1. Key Concepts
(1) The Data Types
![[Pasted image 20260411220946.png]]
1. Numeric: represents numbers, used for mathematical or scientific computation
    1. Integer: 42 (discrete data that cannot be subdivided, like a number of laptops)
    2. Complex: a + bj, where a and b are real numbers and j is the imaginary unit (a Python type with a real and an imaginary part; YAML has no built-in complex scalar)
    3. Float: 3.14 (continuous data that can be subdivided)
2. Mapping: stores data in a key-value format
    1. Dictionary: unordered collection of items where each item is a key-value pair
    2. The colon followed by a space is the "Mapping Operator." It separates the identifier from the data.
        - The Key: The label before the colon (e.g., username).
        - The Value: The scalar, sequence, or another mapping after the colon (e.g., janedoe).
author: # Start of parent mapping
  first_name: Jane # Child mapping pair 1
  last_name: Doe # Child mapping pair 2
- Sequence: ordered collection of items; items can be accessed by index
    - String: in Python, an immutable sequence of Unicode characters, meaning once a string is created it cannot be changed (YAML itself treats a string as a scalar).
    - List: versatile, mutable sequences, meaning their elements can be changed after creation.
    - Tuple: similar to lists but immutable, typically used for items of different types that belong together.
- Set: unordered collection of unique items, used for eliminating duplicate entries
    - Set: mutable, meaning you can add or remove items from it
- Boolean: fundamental data type, used for logical operations
    - Bool: `true`, `false`, `yes`, `no` (Note: YAML 1.2 recognizes only `true`/`false`; `yes`/`no` are YAML 1.1 booleans).
- Null: `null`, `~` (the absence of a value)
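To see which Python type each YAML value lands on, a quick sketch (assuming PyYAML and its `safe_load`; the document content is invented):

```python
import yaml

doc = """\
count: 42
pi: 3.14
name: Jane
active: true
nothing: ~
also_nothing: null
tags: [a, b]
owner: {first_name: Jane, last_name: Doe}
"""

data = yaml.safe_load(doc)

# Map every top-level key to the name of the Python type it was loaded as.
kinds = {key: type(value).__name__ for key, value in data.items()}
for key, kind in kinds.items():
    print(f"{key}: {kind}")
```

Tuples and complex numbers do not show up here because plain YAML has no notation for them; they are Python-side types.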
(2) The Information Model
YAML data is not just text; it is an abstract mathematical graph.
- Nodes: The fundamental atomic unit. A node is either a scalar (single value), a sequence (ordered list of nodes), or a mapping (unordered collection of node pairs).
- Identity: Nodes have identity. Two nodes with the same value are not the same node unless explicitly linked. This enables graph structures over simple trees.
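Node identity can be observed before any Python objects are built by using `yaml.compose`, which returns the node graph. A minimal sketch, assuming PyYAML; the anchor name `shared` and the keys are made up.

```python
import yaml

text = """\
first: &shared {timeout: 30}
second: *shared
third: {timeout: 30}
"""

# compose() stops after the Composer stage and returns a MappingNode.
root = yaml.compose(text, Loader=yaml.SafeLoader)

# root.value is a list of (key_node, value_node) pairs.
nodes = {key_node.value: value_node for key_node, value_node in root.value}

same_node = nodes["first"] is nodes["second"]      # linked through the alias
equal_but_distinct = nodes["first"] is nodes["third"]  # same content, separate node

print(same_node, equal_but_distinct)
```

`first` and `third` carry identical content, yet only `first` and `second` are one node, because only they are explicitly linked.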
(3) The Three Node Types
Every YAML file is composed of only three types of structures:
1. Scalars: Single values (Strings, Integers, Booleans, Nulls).
2. Mappings: Key-Value pairs (Dictionaries).
3. Sequences: Ordered lists (Arrays).
A Sequence is an ordered list of nodes. While a Mapping is about labeling data with keys, a Sequence is about arranging data in a specific, numbered order. The hyphen followed by a space is the "Sequence Entry" marker. Each hyphen introduces the next index in the list.
# A sequence of simple string scalars
favorite_fruits:
- Apple
- Banana
- Cherry
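Loaded with PyYAML (assumed), the sequence above becomes a Python list whose positions match the order of the hyphens:

```python
import yaml

doc = """\
favorite_fruits:
  - Apple
  - Banana
  - Cherry
"""

fruits = yaml.safe_load(doc)["favorite_fruits"]

print(fruits[0])    # Apple
print(fruits[-1])   # Cherry
print(len(fruits))  # 3
```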
(4) Graph Linkage (Anchors & Aliases)
This is the mechanism that converts a YAML tree into a graph.
- Anchor (&): Marks a specific node in memory and assigns it an identifier.
- Alias (*): References the previously anchored node. Because an alias points at the same node rather than a copy, a change to the anchored data is conceptually visible through every alias.
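In PyYAML (assumed here), this shows up as shared Python objects: every alias of a mapping loads as the very same `dict`. The key names below are invented for the sketch.

```python
import yaml

doc = """\
defaults: &defaults
  retries: 3
job_a: *defaults
job_b: *defaults
"""

data = yaml.safe_load(doc)

# Anchor and aliases resolve to one shared object, not three copies.
shared = data["job_a"] is data["defaults"] and data["job_b"] is data["defaults"]

# A change made through one name is therefore visible through the others.
data["defaults"]["retries"] = 5
print(shared, data["job_a"]["retries"], data["job_b"]["retries"])
```

This sharing is worth remembering when mutating loaded configuration in place; copy the structure first (for example with `copy.deepcopy`) if the entries should diverge.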
2.2. Core Components
The YAML architecture is a sequential, three-stage processing pipeline. Understanding this pipeline is critical for RCA (Root Cause Analysis) when parsing fails.
(1) Scanner/Parser
- Role: Reads the raw character stream and breaks it into Tokens (like "Key", "Value", "List Start").
- Output: A stream of parsing events.
(2) Composer
- Role: Assembles the events into a Node Graph. This is where the "Tree" is built and references (Anchors/Aliases) are resolved.
- Output: An in-memory Representation Graph.
(3) Constructor
- Role: Converts the Node Graph into Native Objects specific to the programming language (e.g., a Python `dict` or a plain object in TypeScript).
- Output: The final usable data.
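PyYAML exposes each stage of this pipeline as a separate function, which makes it possible to look at the intermediate output. A rough sketch, assuming PyYAML; the input text is invented.

```python
import yaml

text = "name: demo\nports: [80, 443]\n"

# Stage 1a: the scanner turns characters into tokens.
tokens = [type(t).__name__ for t in yaml.scan(text, Loader=yaml.SafeLoader)]

# Stage 1b: the parser turns tokens into events.
events = [type(e).__name__ for e in yaml.parse(text, Loader=yaml.SafeLoader)]

# Stage 2: the composer turns events into a node graph.
node = yaml.compose(text, Loader=yaml.SafeLoader)

# Stage 3: the constructor turns nodes into native Python objects.
data = yaml.safe_load(text)

print(tokens[:4])
print(events[:4])
print(type(node).__name__)
print(data)
```

When a load fails, running the stages one by one like this is a quick way to find out which stage rejects the input.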
2.3. Eco-system & Dependencies
- Linters: Tools like `yamllint` that verify syntax before execution.
- Schemas: JSON Schema is often used to validate that a YAML file contains the correct keys and data types required for a specific application.
2.4. Architecture & Design
(1) Structural Topologies
YAML enforces structural dependencies from the top down.
- Mappings require key-value pairs. Keys can technically be any node, including sequences or mappings (something JSON cannot express, though rarely used).
- Sequences act as arrays. They can contain mappings, scalars, or nested sequences.
- Block vs. Flow: Block style (indentation) is designed for human authoring. Flow style (JSON-like `{}` and `[]`) is convenient for programmatic generation. YAML allows mixing both styles, with one limit: a block collection cannot be nested inside a flow collection.
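Block and flow style are two spellings of the same data. A short sketch, assuming PyYAML; the `server` structure is a made-up example.

```python
import yaml

block = """\
server:
  host: localhost
  ports:
    - 80
    - 443
"""
flow = "server: {host: localhost, ports: [80, 443]}"

# Both notations load to the same Python structure.
same = yaml.safe_load(block) == yaml.safe_load(flow)
print(same)

# The dumper can emit either style from the same data.
data = yaml.safe_load(block)
as_block = yaml.safe_dump(data, default_flow_style=False)
as_flow = yaml.safe_dump(data, default_flow_style=True)
print(as_block)
print(as_flow)
```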
2.5. Eco-system
(1) Schema and Validation (The Missing Link)
YAML natively lacks schema validation. It does not know what your data means.
- How it connects: External tools like JSON Schema (which can validate the JSON-compatible data a YAML document loads into) or Yamale are layered on top of the parser to enforce business logic (e.g., "port must be an integer between 1-65535").
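The layering can be illustrated with a hand-written check that runs after parsing. This is only a stand-in for a real validator such as JSON Schema or Yamale; the `validate` function and the `port` key are hypothetical.

```python
import yaml

def validate(config):
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    port = config.get("port")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(port, int) or isinstance(port, bool):
        problems.append("port must be an integer")
    elif not 1 <= port <= 65535:
        problems.append("port must be between 1 and 65535")
    return problems

good = validate(yaml.safe_load("port: 8080"))
wrong_type = validate(yaml.safe_load('port: "8080"'))
out_of_range = validate(yaml.safe_load("port: 70000"))

print(good, wrong_type, out_of_range)
```

All three inputs are valid YAML; only the layer on top of the parser can tell that two of them are wrong for the application.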
(2) Security Dependencies
The parsing pipeline is inherently dangerous if the Constructor layer is allowed to instantiate arbitrary language objects.
- How it connects: "Safe Loaders" swap the default Constructor for a restricted one that refuses to instantiate arbitrary classes (e.g., blocking `!!python/object/apply:os.system`) and builds only standard data types.
3. Install, Configure, Secure, & Cheatsheet.
3.1. Install
# macOS - Yamllint is the standard for syntax enforcement in CI pipelines
brew install yamllint
# Python Implementation (PyYAML)
pip install pyyaml
3.2. Configure
(1) Advanced Python Implementation
Demonstrating the difference between the Representation Graph and Native Objects.
import yaml
yaml_stream = """
base_config: &base
  timeout: 30
  retry: true
prod_config:
  <<: *base # Merge key
  env: "production"
"""
# The Composer resolves the alias; the Constructor applies the merge key and builds the final Python dictionary
# Use yaml.safe_load for any input you do not fully control
native_dict = yaml.safe_load(yaml_stream)
print(native_dict['prod_config']['timeout']) # Outputs: 30
3.3. Secure
(1) The Instantiation Vulnerability
Rule: Never parse untrusted input with an unsafe loader. In Python that means yaml.load() with yaml.Loader or yaml.UnsafeLoader (or yaml.unsafe_load()); the same applies to equivalent loaders in other languages.
Mechanism: YAML's explicit tags (for example the language-specific `!!python/...` tags) allow a document to name the constructor to invoke. An attacker can use this to execute system commands during the parsing phase, before your application logic even runs.
Mitigation: yaml.safe_load() constructs only standard data types (dicts, lists, strings, numbers, booleans, null).
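A sketch of the mitigation in action, assuming PyYAML. The payload is the classic illustrative one; with `safe_load` it is rejected and nothing is executed.

```python
import yaml

# A document that asks the loader to call os.system during construction.
malicious = "!!python/object/apply:os.system ['echo pwned']"

try:
    yaml.safe_load(malicious)
    outcome = "loaded"
except yaml.constructor.ConstructorError as exc:
    # safe_load has no constructor registered for python/* tags.
    outcome = "rejected"
    print(exc.problem)

print(outcome)
```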
(2) Secrets Injection
YAML files stored in version control must never contain hardcoded secrets.
Architecture: Defer secret injection to runtime environment variables using templating engines (e.g., Helm for Kubernetes, Jinja2 for Python, or native docker-compose variable interpolation like ${DB_PASS}).
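As a rough illustration of runtime injection without a templating engine, placeholders can be expanded after parsing using only the standard library. The `expand` helper, the `DB_PASS` fallback value, and the document are all hypothetical; real deployments would rely on Helm, Jinja2, or docker-compose interpolation as noted above.

```python
import os
from string import Template

import yaml

doc = """\
database:
  user: app
  password: ${DB_PASS}
"""

def expand(value, env):
    """Recursively replace ${NAME} placeholders in every string value."""
    if isinstance(value, dict):
        return {key: expand(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand(item, env) for item in value]
    if isinstance(value, str):
        return Template(value).substitute(env)  # raises KeyError if a name is missing
    return value

# Copy the environment; the fallback exists only so the sketch runs standalone.
env = dict(os.environ)
env.setdefault("DB_PASS", "example-only")

config = expand(yaml.safe_load(doc), env)
print(config["database"]["user"])
```

Expanding after parsing, rather than substituting into the raw text, keeps a secret that happens to contain YAML syntax from changing the document structure.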
3.4. Cheatsheet
(1) Block Scalars (Multi-line Strings)
- `|` (Literal): Preserves newlines and trailing spaces. Used for embedded scripts or TLS certificates.
- `>` (Folded): Converts single newlines to spaces, keeping blank lines as paragraph breaks. Used for long prose or descriptions.
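The difference between the two block scalar styles is easiest to see in the loaded strings. A small sketch, assuming PyYAML; the keys are invented.

```python
import yaml

doc = """\
script: |
  echo one
  echo two
summary: >
  first line
  second line
"""

data = yaml.safe_load(doc)

print(repr(data["script"]))   # 'echo one\necho two\n'
print(repr(data["summary"]))  # 'first line second line\n'
```

Both styles keep a single trailing newline by default; appending `-` to the indicator (`|-` or `>-`) strips it.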
(2) The Merge Key (<<)
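A YAML 1.1 convention, still honored by many parsers including PyYAML, that unpacks the contents of a referenced mapping into the current mapping. Keys written explicitly in the current mapping take precedence over merged ones.
common_labels: &common
  app: backend
  tier: api
service_a:
  <<: *common
  port: 8080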
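How a merge interacts with locally defined keys can be checked with PyYAML (assumed). The `tier: worker` override is added here for illustration and is not part of the snippet above.

```python
import yaml

doc = """\
common_labels: &common
  app: backend
  tier: api
service_a:
  <<: *common
  tier: worker
  port: 8080
"""

data = yaml.safe_load(doc)

# Merged keys are copied in; the explicit 'tier' wins over the merged one.
print(data["service_a"])

# The anchored mapping itself is left untouched.
print(data["common_labels"])
```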
4. Bootcamp & Workshops.
4.1. Resources
- YAML Spec 1.2.2: The authoritative source for the Information Model and Pipeline.
- JSON Schema for YAML: For enforcing structural integrity in YAML pipelines.
4.2. Troubleshooting (RCA)
| Issue | Pipeline Stage | Root Cause Analysis | Remediation |
|---|---|---|---|
| `expected '<document start>', but found '<block mapping start>'` | Scanner/Parser | Indentation mismatch or mixing tabs/spaces. | Ensure strict 2-space indentation. Run yamllint. |
| `found undefined alias 'X'` | Composer | Attempted to alias (`*X`) a node that was not anchored (`&X`) prior in the stream. | Define the anchor `&X` above the alias reference. |
| `could not determine a constructor for the tag '!!python/object...'` | Constructor | Using safe_load on a document attempting code execution. | Security functioning as intended. Sanitize input. |
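PyYAML raises a different exception class per pipeline stage, so the stage column above can be derived programmatically. A sketch, assuming PyYAML; the `failing_stage` helper and the sample inputs are hypothetical.

```python
import yaml

def failing_stage(text):
    """Return the name of the pipeline stage that rejects the text, or 'ok'."""
    try:
        yaml.safe_load(text)
    except yaml.scanner.ScannerError:
        return "scanner"
    except yaml.parser.ParserError:
        return "parser"
    except yaml.composer.ComposerError:
        return "composer"
    except yaml.constructor.ConstructorError:
        return "constructor"
    return "ok"

samples = {
    "tab indentation": "a:\n\tb: 1\n",
    "indent mismatch": " a: 1\nb: 2\n",
    "undefined alias": "a: *missing\n",
    "unsafe tag": "a: !!python/object/apply:os.system ['id']\n",
    "valid": "a: 1\n",
}

stages = {name: failing_stage(text) for name, text in samples.items()}
for name, stage in stages.items():
    print(f"{name}: {stage}")
```

All four exception classes derive from `yaml.YAMLError`, so application code that only needs to know "the load failed" can catch that single base class.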
4.3. Q&A
(1) Why did my unquoted "no" evaluate to a boolean?
YAML 1.1 automatically typed yes, no, on, off as booleans. YAML 1.2 removed this in its core schema, restricting booleans to true and false. However, many parsers (like PyYAML) still follow the 1.1 rules for backwards compatibility. Expert Rule: Always quote strings that look like booleans or numbers.
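The effect is easy to reproduce with PyYAML, which applies the YAML 1.1 rules; a parser that follows the 1.2 core schema would return strings for the unquoted `no` and `NO` below. The keys are invented for the sketch.

```python
import yaml

doc = """\
unquoted: no
quoted: "no"
country: NO
version: 1.10
version_quoted: "1.10"
"""

data = yaml.safe_load(doc)

for key, value in data.items():
    print(f"{key}: {value!r}")
```

The `country: NO` case (Norway's country code turning into `False`) and the `1.10` version losing its trailing zero are the two surprises quoting prevents.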
(2) How does YAML handle extremely large datasets?
Poorly. Because YAML relies on whitespace and graph resolution, parsing is significantly slower than JSON. For data transit or massive datasets (logging, data dumps), JSON or binary formats (Protobuf) are architecturally superior. YAML should be strictly reserved for human-interfacing configurations.
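A rough way to measure the gap on your own machine, assuming PyYAML with its default pure-Python loader. The record layout is invented, and timings vary by machine and library build, so no numbers are claimed here.

```python
import json
import time

import yaml

# Synthetic dataset, for illustration only.
records = [{"id": i, "name": f"item-{i}", "tags": ["a", "b"]} for i in range(500)]

json_text = json.dumps(records)
yaml_text = yaml.safe_dump(records)

def timed(parse, text):
    """Run one parse and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = parse(text)
    return result, time.perf_counter() - start

from_json, json_seconds = timed(json.loads, json_text)
from_yaml, yaml_seconds = timed(yaml.safe_load, yaml_text)

print(f"json: {json_seconds:.4f}s  yaml: {yaml_seconds:.4f}s")
```

Both parsers return the same data; only the time taken differs.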