Pydantic and Sparkdantic Schema Integration

Summary

This cluster defines the technical stack for schema management and data validation within the Sokrates project, focusing on Pydantic v2, datamodel-code-generator, and Sparkdantic. These tools form a “schema once, deploy everywhere” pipeline that bridges standard Python type hints with JSON Schema and Apache Spark’s StructType definitions.

Details

The project utilizes a specific toolchain to ensure type safety and schema consistency across different execution environments (standard Python services and Spark-based data pipelines).

Pydantic (v2)

Pydantic serves as the foundational library for data modeling. The project relies on the following Pydantic v2 features for validation and serialization:

  • BaseModel & Field: Used to define structured data with constraints (e.g., gt, lt) and metadata such as description.
  • JSON Schema Generation: Models can export their definitions using model_json_schema(), supporting both validation and serialization modes.
  • TypeAdapter: Used for validating and serializing types that are not full BaseModel subclasses, often combined with Annotated for custom logic like AfterValidator or PlainSerializer.
  • Field Validators: @field_validator is used with json_schema_input_type to document validators whose accepted input type differs from the field's primary annotation.
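A minimal sketch of these features follows; the Measurement model and strip_whitespace validator are hypothetical examples, not models from the project:

```python
from typing import Annotated

from pydantic import AfterValidator, BaseModel, Field, TypeAdapter


class Measurement(BaseModel):
    """Hypothetical model showing Field constraints and metadata."""

    sensor_id: str = Field(description="Unique sensor identifier")
    value: float = Field(gt=0, lt=1000, description="Reading in the open range (0, 1000)")


# Export the JSON Schema; "serialization" mode can differ from "validation"
# when custom serializers change the output type.
schema = Measurement.model_json_schema(mode="validation")


def strip_whitespace(v: str) -> str:
    return v.strip()


# TypeAdapter validates an Annotated type without defining a full BaseModel.
TrimmedStr = Annotated[str, AfterValidator(strip_whitespace)]
adapter = TypeAdapter(TrimmedStr)
cleaned = adapter.validate_python("  hello  ")  # → "hello"
```

The same Annotated pattern accepts PlainSerializer (or other functional validators) in the metadata list, which is how custom serialization logic attaches to types that are not BaseModel subclasses.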

datamodel-code-generator

To maintain a single source of truth, the project uses datamodel-code-generator to transform external schema definitions (JSON Schema, OpenAPI, or GraphQL) into Pydantic models.

  • CLI Usage: The tool is invoked via datamodel-codegen with flags such as --output-model-type pydantic_v2.BaseModel and --target-python-version 3.11.
  • Programmatic API: It is also used within Python scripts via the generate function, allowing for dynamic model creation from schema strings.

Sparkdantic

Sparkdantic (v2.6.x in this project) bridges Pydantic models and PySpark. It generates Spark schemas automatically from Python classes, avoiding hand-written StructType definitions.

  • SparkModel: A base class that extends Pydantic’s BaseModel to include methods like model_spark_schema() and model_json_spark_schema().
  • Standalone Conversion: For existing models, create_spark_schema(Model) provides a way to generate Spark-compatible schemas without modifying the original class inheritance.
  • Type Coercion: The SparkField utility allows developers to explicitly map Python types to specific Spark types (e.g., mapping an int to bigint).
  • Testing: The library’s ColumnGenerationSpec class and generate_data method are used to create fake Spark DataFrames for testing pipelines based on the Pydantic model definitions.