Ontologically Aligned Diffusion Pipeline
Summary
The Ontologically Aligned Diffusion Pipeline is a high-performance generative video system developed for the Sokrates project. It is a heavily modified “Ship of Theseus” architecture derived from Wan 2.2 and Flux, featuring a rewritten transformer core (36 of 40 blocks), a novel single-pass guidance mechanism called Normalized Attention Guidance (NAG), and a factorized conditioning system grounded in Riemannian manifold geometry.
Details
Base Architecture and Modifications
The pipeline operates in BF16 precision, optimized for NVIDIA Blackwell (RTX Pro 6000) and Hopper (H200) hardware. While its lineage traces back to Wan 2.2 and Flux, the computational graph has been fundamentally altered to support autoregressive causal flow. This modification gives the model a natural temporal directionality, allowing it to reach high-quality results in only 6 denoising steps, versus the ~30 steps required by the base model. The system also incorporates flow matching for vision tensors and an aesthetic finetune trained on dual H200s for approximately 30 hours.
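The autoregressive causal flow described above implies a causal temporal mask over the latent frames: each latent may attend only to itself and to earlier latents, which is what gives the denoiser its arrow of time. A minimal illustrative sketch (the function name and mask shape are assumptions, not the production kernel):

```python
import numpy as np

def causal_temporal_mask(num_latents: int) -> np.ndarray:
    """Lower-triangular boolean mask: True where attention is permitted,
    i.e. latent i may attend to latent j only when j <= i."""
    return np.tril(np.ones((num_latents, num_latents), dtype=bool))

mask = causal_temporal_mask(4)
# mask[i, j] is True only for j <= i, so frame 0 never sees the future.
```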
Sliding Window Attention
Temporal coherence is maintained through overlapping latent-space temporal windows. Rather than frame-level stitching, the system performs attention over compressed latent representations, where each latent encodes four output frames.
- Pyramid Scheduling: Within each window, attention weights follow a pyramid schedule that peaks in the center and tapers at the edges.
- Temporal Blending: Adjacent windows overlap by eight strides (two seconds at 16fps). The tapering weights allow for smooth blending in latent space, preventing the visual “seams” common in tiled video generation.
- Pipe Delimiters: Prompt segments are demarcated by pipe characters (|), which define attention segment boundaries in the text encoder, representation space, and vision side, enabling strategic noise injection and error correction.
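The pyramid scheduling and temporal blending above can be sketched as a weighted average of overlapping latent windows. This is a minimal illustration, assuming window length and overlap are measured in latent strides (one latent = four frames) and that each window was denoised in a separate pass; the helper names are illustrative:

```python
import numpy as np

def pyramid_weights(window_len: int) -> np.ndarray:
    """Triangular weights that peak at the window center and taper at the edges."""
    ramp = np.minimum(np.arange(window_len) + 1, window_len - np.arange(window_len))
    return ramp / ramp.max()

def blend_windows(windows, overlap: int = 8) -> np.ndarray:
    """Blend overlapping latent windows along the temporal axis.

    Each window is a (window_len, C) latent array; adjacent windows overlap
    by `overlap` strides (8 strides = 32 frames = 2 s at 16 fps). The tapering
    weights cross-fade the overlap, avoiding visible seams."""
    window_len = windows[0].shape[0]
    stride = window_len - overlap
    total = stride * (len(windows) - 1) + window_len
    acc = np.zeros((total,) + windows[0].shape[1:])
    norm = np.zeros(total)
    w = pyramid_weights(window_len)
    for i, win in enumerate(windows):
        start = i * stride
        acc[start:start + window_len] += w[:, None] * win
        norm[start:start + window_len] += w
    return acc / norm[:, None]
```

Because the weights taper to near zero at the edges, each output latent in the overlap region is dominated by whichever window has it closest to center, which is the seam-suppression effect described above.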
Normalized Attention Guidance (NAG)
NAG is a proprietary replacement for Classifier-Free Guidance (CFG). It operates on the principle that CFG is a computationally expensive “band-aid” for topological holes in a VAE’s manifold caused by censored training data. NAG replaces the dual-pass CFG with a single forward pass in attention space:
- Feature Expansion: Extrapolates in attention space from negative toward positive conditioning.
- L1 Regularization: Acts as “manifold proprioception,” computing norm ratios to ensure activations do not drift into non-coherent regions of the latent space (stability threshold is typically < 1.6).
- Blending: Mixes the bounded extrapolation back toward positive features for semantic stability.
NAG is hypermodular, allowing independent guidance scales for different ontological categories (Entity, Lighting, Geometry, Action), and provides a ~10x compute saving over traditional CFG.
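The three NAG stages can be sketched in a few lines. This is a simplified numpy illustration under stated assumptions: the production mechanism operates on attention logits inside the transformer, and the function name, scale, and blend factor here are illustrative, with the 1.6 stability threshold taken from the description above:

```python
import numpy as np

def nag(z_pos, z_neg, scale=4.0, tau=1.6, alpha=0.5):
    """Single-pass Normalized Attention Guidance on attention-space features.

    1. Feature expansion: extrapolate from negative toward positive conditioning.
    2. L1 regularization: rescale by a norm ratio so activations stay on-manifold.
    3. Blending: mix the bounded extrapolation back toward the positive features.
    """
    # Feature expansion: push past the positive branch, away from the negative.
    z_ext = z_pos + scale * (z_pos - z_neg)
    # L1 "manifold proprioception": extrapolated vs. positive norm ratio.
    ratio = np.abs(z_ext).sum() / (np.abs(z_pos).sum() + 1e-8)
    if ratio > tau:
        z_ext = z_ext * (tau / ratio)  # clamp drift into non-coherent regions
    # Blend back toward positive features for semantic stability.
    return alpha * z_ext + (1.0 - alpha) * z_pos
```

In a hypermodular setup, a routine like this would be invoked once per ontological stream (Entity, Lighting, Geometry, Action) with an independent `scale` for each, rather than once globally.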
Ontological Alignment Framework
The pipeline is the primary implementation of the Ontological Alignment Framework, which seeks to repair the visual manifold through three pillars:
- Data-Level Alignment: Populating underrepresented factor combinations using synthetic data from a Unity Perception pipeline.
- Representation-Level Alignment: Replacing unstructured captions with a typed ontology (Entity, Action, Lighting, Geometry). Each category has a dedicated cross-attention block. A deliberately under-provisioned “Residual” category forces the model to use the typed streams correctly, as “dumping” information into the Residual stream causes loss to explode.
- Guidance Alignment: Moving interpolation to the attention logit level.
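The representation-level pillar, with one cross-attention block per typed category plus an under-provisioned Residual stream, can be sketched as follows. All dimensions, the plain scaled dot-product helper, and the function names are illustrative assumptions, not the production block layout:

```python
import numpy as np

CATEGORIES = ("entity", "action", "lighting", "geometry")

def cross_attention(q, k, v):
    """Plain scaled dot-product cross-attention over one conditioning stream."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

def factorized_read(q, streams, residual, gains=None):
    """Sum per-category cross-attention reads.

    `streams` maps each typed category to its (keys, values) pair; `residual`
    is a deliberately narrow (few-token) stream, so the model cannot dump
    content into it and must route information through the typed streams."""
    gains = gains or {c: 1.0 for c in CATEGORIES}
    out = sum(gains[c] * cross_attention(q, *streams[c]) for c in CATEGORIES)
    out = out + cross_attention(q, *residual)  # narrow escape hatch only
    return out
```

The per-category `gains` mirror the guidance-alignment pillar: because each category has its own stream, its contribution can be scaled independently at the attention level.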
Factorized and Identity Adapters
- Lighting Foundation Adapter: A rank-256 adapter trained on 20,000 synthetic frames with ground-truth physical labels (lux, Kelvin, azimuth). It serves as an augmentation engine, allowing identity training to see the same subject under thousands of physically accurate lighting conditions.
- Identity LoRAs: High-fidelity adapters (e.g., Hákon’s personal LoRA) trained on 500+ images and 80 4K videos. These capture biomechanical invariants, such as gait patterns and specific anatomical details, and are portable across multiple architectures (Qwen, Flux, Wan, LTX Video).
- Dragon Attention: A specialized technique for maintaining coherence in sparse latent neighborhoods (e.g., mythical creatures) by triangulating on concept-specific tokens in attention space.
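The Lighting Foundation Adapter's ground-truth physical labels (lux, Kelvin, azimuth) have to be mapped into a conditioning signal before they can drive an adapter. A minimal sketch of such an encoding, where the normalization ranges and the sin/cos azimuth encoding are illustrative choices rather than the project's actual scheme:

```python
import numpy as np

def encode_lighting(lux: float, kelvin: float, azimuth_deg: float) -> np.ndarray:
    """Map physical lighting labels to a small, roughly unit-scale vector."""
    # Log-scale illuminance: ~1 lux (moonlight) up to ~100k lux (direct sun).
    lux_n = np.log10(max(lux, 1.0)) / 5.0
    # Color temperature normalized over a ~1,800K-10,000K working range.
    kelvin_n = (kelvin - 1800.0) / (10000.0 - 1800.0)
    # Azimuth as sin/cos so 0 and 360 degrees encode identically.
    az = np.deg2rad(azimuth_deg)
    return np.array([lux_n, kelvin_n, np.sin(az), np.cos(az)])

# Example: bright daylight at 5600K from the south-east.
cond = encode_lighting(lux=50_000, kelvin=5600, azimuth_deg=135.0)
```

Conditioning on continuous physical quantities like these, rather than on caption text, is what lets identity training sweep the same subject through thousands of physically accurate lighting states.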