Debugging Postal Code Type Coercion in Bulk Prediction Pipelines
Summary
This page documents the investigation and resolution of a data type mismatch in the Sokrates prediction pipeline. During bulk DataFrame processing, postal_code values were unintentionally coerced from integers to floats, which caused categorical encoding failures in the L2 model.
Details
During the development and testing of bulk prediction features, specifically when merging transaction data with the kaupskrá (Icelandic property registry) dataset, a discrepancy was observed between isolated single-row predictions and bulk processing results. In bulk mode, the L2 Spatial model scores dropped significantly (e.g., from +0.039 to -0.199).
Root Cause Analysis
The investigation traced the issue to how pandas handles missing values in integer columns and during concatenation. A standard int64 column cannot represent NaN, so when the postal_code column was processed in a bulk DataFrame that included missing values (NaN or pd.NA), the column was coerced to float64.
The specific failure point occurred during the .map() operation used for categorical encoding. The model’s encoder was trained on string representations of integers (e.g., "270"). However, because the column had become float-based, the mapping function received 270.0, which converted to the string "270.0". This string did not exist in the encoder’s vocabulary, resulting in a “miss” and assigning the value -1 (unknown) to all postal codes in the batch.
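The failure can be reproduced in isolation. The sketch below is illustrative only: the vocab dictionary and the encode helper are hypothetical stand-ins for the real encoder, which is not shown in this document.

```python
import numpy as np
import pandas as pd

# Hypothetical encoder vocabulary: string postal codes -> category index.
vocab = {"101": 0, "200": 1, "270": 2}

def encode(series: pd.Series) -> pd.Series:
    """Naive encoding: stringify each value, look it up, -1 if unknown."""
    return series.map(lambda v: vocab.get(str(v), -1))

single = pd.Series([270])               # integer dtype, no missing values
bulk = pd.Series([270, 101, np.nan])    # the NaN forces float64

print(encode(single).tolist())          # "270" is found in the vocabulary
print(bulk.dtype)                       # float64
print(encode(bulk).tolist())            # "270.0", "101.0", "nan" all miss
```

In the single-row case the lookup key is "270" and the code is found. In the bulk case every key carries a ".0" suffix (or is "nan"), so every row falls back to -1.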
Technical Findings
- Pandas Quirk: Even when using the nullable Int64 type, certain operations like .map() in the presence of pd.NA can trigger a conversion to float during the lambda execution.
- Categorical Encoding: The L2 model relies on exact string matches for categorical features. The difference between "270" and "270.0" is sufficient to break the feature engineering pipeline.
- Feature-Specific Behavior: While postal_code is a categorical feature and could be cast to an object/string type to avoid this, other features like is_complete are numeric and must remain as floats or integers for LightGBM to process them correctly.
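Because the element type that a .map() lambda receives from an Int64 column may differ between pandas versions (a Python int, a float, or pd.NA), one defensive option is to normalize each value before the lookup. The following is a minimal sketch under that assumption; normalize_code and vocab are hypothetical names, not part of the actual pipeline.

```python
import pandas as pd

# Hypothetical encoder vocabulary: string postal codes -> category index.
vocab = {"101": 0, "200": 1, "270": 2}

def normalize_code(value):
    """Return the integer-string form of a postal code, or None if missing.

    Accepts 270, 270.0, "270" and "270.0" and returns "270" for all of them.
    """
    if pd.isna(value):
        return None
    return str(int(float(value)))

codes = pd.Series(pd.array([270, None, 101], dtype="Int64"))

# The lookup no longer depends on whether the lambda sees 270 or 270.0.
encoded = codes.map(lambda v: vocab.get(normalize_code(v), -1))
print(encoded.tolist())
```

This keeps the exact-string-match contract of the encoder intact while tolerating either numeric representation. It applies only to categorical columns such as postal_code; numeric features like is_complete should not be passed through it.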
Resolution
The fix involved a more robust type-handling strategy within the predict_on_new_data function and the precompute pipeline:
- Explicit Casting: Ensuring postal_code is explicitly cast to a string or object type before it reaches the mapping stage.
- Handling NaNs: Filling NaN values with a sentinel value before casting to integer-strings to prevent the float coercion that occurs when NaN is present in a standard integer series.
- Dtype Preservation: Verifying that the Int64 (nullable integer) fix is preserved through filtering operations (like .between()) and concatenation.
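The three steps above can be sketched at the column level as follows. This is not the actual predict_on_new_data code: the SENTINEL value, the helper names, and the sample frames are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

SENTINEL = -1  # assumed sentinel for missing postal codes (not from the source)

def postal_code_as_str(col: pd.Series) -> pd.Series:
    """Return postal codes as integer-strings, e.g. 270.0 -> "270"."""
    numeric = pd.to_numeric(col, errors="coerce")  # unparseable values become missing
    filled = numeric.fillna(SENTINEL)              # no missing values remain
    return filled.astype("int64").astype(str)      # 270.0 -> 270 -> "270"

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Stringify the categorical column, keep the numeric one numeric."""
    out = df.copy()
    out["postal_code"] = postal_code_as_str(out["postal_code"])
    out["is_complete"] = out["is_complete"].astype("float64")
    return out

bulk = pd.DataFrame({
    "postal_code": [270, np.nan, 101],   # float64 because of the NaN
    "is_complete": [1, 0, 1],
})
prepared = prepare(bulk)
print(prepared["postal_code"].tolist())

# Dtype preservation check: Int64 should survive filtering and concatenation.
nullable = pd.DataFrame({"postal_code": pd.array([270, None, 101], dtype="Int64")})
in_range = nullable[nullable["postal_code"].between(100, 300).fillna(False)]
combined = pd.concat([in_range, nullable], ignore_index=True)
print(in_range["postal_code"].dtype, combined["postal_code"].dtype)
```

The sentinel stringifies to "-1", which is absent from the encoder vocabulary, so missing postal codes still map to the unknown category while valid codes match exactly.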
The final implementation ensures that postal_code stays in a string-compatible format throughout the pipeline, while numeric features like is_complete keep their required numeric dtypes.
Related
- Eidos
- Kaupskrá Dataset
- L2 Model Architecture
- Hermes Agent (for reporting prediction errors)