Pydantic and Sparkdantic Schema Integration
Summary
This cluster defines the technical stack for schema management and data validation within the Sokrates project, focusing on Pydantic v2, datamodel-code-generator, and Sparkdantic. These tools form a “schema once, deploy everywhere” pipeline that bridges standard Python type hints with JSON Schema and Apache Spark’s StructType definitions.
Details
The project utilizes a specific toolchain to ensure type safety and schema consistency across different execution environments (standard Python services and Spark-based data pipelines).
Pydantic (v2)
Pydantic serves as the foundational library for data modeling. The project relies on Pydantic v2 features for validation and serialization:
- `BaseModel` & `Field`: Used to define structured data with constraints (e.g., `gt`, `lt`, `description`).
- JSON Schema Generation: Models can export their definitions using `model_json_schema()`, supporting both `validation` and `serialization` modes.
- `TypeAdapter`: Used for validating and serializing types that are not full `BaseModel` subclasses, often combined with `Annotated` for custom logic like `AfterValidator` or `PlainSerializer`.
- Field Validators: Implementation of `@field_validator` with `json_schema_input_type` to document validators that accept types differing from the primary annotation.
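A minimal sketch of these features (the `Measurement` model and `ensure_even` validator are hypothetical examples, not project code):

```python
from typing import Annotated

from pydantic import AfterValidator, BaseModel, Field, TypeAdapter


# Hypothetical model: Field constraints feed the generated JSON Schema.
class Measurement(BaseModel):
    value: float = Field(gt=0, lt=100, description="Reading in (0, 100)")
    unit: str = "celsius"


# Export the schema in both modes; gt/lt become exclusiveMinimum/Maximum.
validation = Measurement.model_json_schema(mode="validation")
serialization = Measurement.model_json_schema(mode="serialization")


# TypeAdapter + Annotated validate plain types without a BaseModel subclass.
def ensure_even(v: int) -> int:
    if v % 2:
        raise ValueError("must be even")
    return v


EvenInt = TypeAdapter(Annotated[int, AfterValidator(ensure_even)])
print(EvenInt.validate_python(4))  # -> 4
```

`EvenInt.validate_python("5")` would coerce the string to an int first, then fail the `AfterValidator` check, illustrating how `Annotated` metadata layers custom logic on top of standard coercion.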
datamodel-code-generator
To maintain a single source of truth, the project uses datamodel-code-generator to transform external schema definitions (JSON Schema, OpenAPI, or GraphQL) into Pydantic models.
- CLI Usage: The tool is invoked via `datamodel-codegen` with flags such as `--output-model-type pydantic_v2.BaseModel` and `--target-python-version 3.11`.
- Programmatic API: It is also used within Python scripts via the `generate` function, allowing dynamic model creation from schema strings.
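A sketch of the programmatic path (the `Event` schema string is illustrative; `generate` writes the rendered model to the given output path):

```python
from pathlib import Path
from tempfile import TemporaryDirectory

from datamodel_code_generator import DataModelType, InputFileType, generate

# Illustrative JSON Schema string acting as the single source of truth.
SCHEMA = """
{
  "title": "Event",
  "type": "object",
  "properties": {
    "id": {"type": "integer"},
    "name": {"type": "string"}
  },
  "required": ["id"]
}
"""

with TemporaryDirectory() as tmp:
    out = Path(tmp) / "event_model.py"
    generate(
        SCHEMA,
        input_file_type=InputFileType.JsonSchema,
        output=out,
        output_model_type=DataModelType.PydanticV2BaseModel,
    )
    code = out.read_text()

print("class Event" in code)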
Sparkdantic
Sparkdantic (pinned around v2.6.0) bridges the gap between Pydantic models and PySpark. It automatically generates Spark schemas from Python classes, avoiding hand-written StructType definitions.
- `SparkModel`: A base class that extends Pydantic’s `BaseModel` to include methods like `model_spark_schema()` and `model_json_spark_schema()`.
- Standalone Conversion: For existing models, `create_spark_schema(Model)` provides a way to generate Spark-compatible schemas without modifying the original class inheritance.
- Type Coercion: The `SparkField` utility allows developers to explicitly map Python types to specific Spark types (e.g., mapping an `int` to `bigint`).
- Testing: The library’s `ColumnGenerationSpec` and `generate_data` helpers are used to create fake Spark DataFrames for testing pipelines based on the Pydantic model definitions.