Debugging Postal Code Type Coercion in Bulk Prediction Pipelines

Summary

This page documents the investigation and resolution of a data type mismatch within the Sokrates prediction pipeline. During bulk DataFrame processing, postal_code values were unintentionally coerced from integers to floats, which caused categorical encoding failures in the L2 model.

Details

During the development and testing of bulk prediction features, specifically when merging transaction data with the kaupskrá (Icelandic property registry) dataset, isolated single-row predictions and bulk processing results disagreed. In bulk mode, the L2 Spatial model scores dropped significantly (e.g., from +0.039 to -0.199).

Root Cause Analysis

The investigation traced the issue to how pandas handles missing values in integer columns, including during concatenation. A standard integer column cannot hold a missing value, so when postal_code was processed in a bulk DataFrame that included missing values (NaN or pd.NA), the whole column was coerced to the float64 data type.
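The coercion can be reproduced in isolation. The following is a minimal sketch, not the pipeline's code: the frames, the property_id key, and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; property_id and the values are assumptions.
transactions = pd.DataFrame({"property_id": [1, 2, 3]})
registry = pd.DataFrame({"property_id": [1, 2], "postal_code": [270, 101]})

# With no missing values the column has a plain integer dtype.
int_dtype = registry["postal_code"].dtype

# A left merge leaves property 3 without a match, so pandas inserts NaN.
# A NumPy integer column cannot represent NaN, so the column becomes float64.
merged = transactions.merge(registry, on="property_id", how="left")
merged_dtype = merged["postal_code"].dtype

# Concatenating an integer column with one that contains NaN has the same effect.
with_gap = pd.DataFrame({"postal_code": [np.nan]})
combined = pd.concat([registry[["postal_code"]], with_gap], ignore_index=True)
combined_dtype = combined["postal_code"].dtype

print(int_dtype, merged_dtype, combined_dtype)
```

Printing the dtype of postal_code after each merge or concat step is a quick way to find where a bulk frame first stops being integer-typed.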

The specific failure point occurred during the .map() operation used for categorical encoding. The model’s encoder was trained on string representations of integers (e.g., "270"). However, because the column had become float-based, the mapping function received 270.0, which converted to the string "270.0". This string did not exist in the encoder’s vocabulary, resulting in a “miss” and assigning the value -1 (unknown) to all postal codes in the batch.
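The lookup miss described above can be sketched as follows. The encoder dictionary and the encode helper are hypothetical stand-ins for the model's real encoder, which is not shown on this page.

```python
import pandas as pd

# Hypothetical encoder vocabulary: string postal codes mapped to category ids.
encoder = {"101": 0, "270": 1}

def encode(series: pd.Series) -> pd.Series:
    # Look up str(value); anything outside the vocabulary becomes -1 (unknown).
    return series.map(lambda v: encoder.get(str(v), -1))

as_int = pd.Series([270, 101])        # integer column, as in single-row mode
as_float = pd.Series([270.0, 101.0])  # same codes after float coercion

print(encode(as_int).tolist())    # every code is found in the vocabulary
print(encode(as_float).tolist())  # "270.0" and "101.0" are not keys, so all miss
```

Because every value in a float column stringifies with a trailing ".0", the miss affects the whole batch rather than only the rows that had missing values.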

Technical Findings

Pandas Quirk: Even when using the nullable Int64 type, certain operations like .map() in the presence of pd.NA can trigger a conversion to float during the lambda execution.
Categorical Encoding: The L2 model relies on exact string matches for categorical features. The difference between "270" and "270.0" is sufficient to break the feature engineering pipeline.
Feature-Specific Behavior: While postal_code is a categorical feature and could be cast to an object/string type to avoid this, other features like is_complete are numeric and must remain as floats or integers for LightGBM to process them correctly.
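One way to surface both kinds of problem early is a small dtype audit run before encoding. This is a sketch under assumptions: the feature lists and the audit_feature_dtypes helper are hypothetical and not part of the documented pipeline.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_numeric_dtype

# Assumed feature lists; the real pipeline's configuration is not shown here.
CATEGORICAL_FEATURES = ["postal_code"]
NUMERIC_FEATURES = ["is_complete"]

def audit_feature_dtypes(df: pd.DataFrame) -> list:
    """Return human-readable descriptions of dtype problems found in df."""
    problems = []
    for col in CATEGORICAL_FEATURES:
        # A float categorical would stringify as "270.0" and miss the encoder.
        if is_float_dtype(df[col]):
            problems.append(f"{col}: float dtype, str() would yield '270.0'-style keys")
    for col in NUMERIC_FEATURES:
        # Numeric features must not be turned into object/string columns.
        if not is_numeric_dtype(df[col]):
            problems.append(f"{col}: dtype {df[col].dtype} is not numeric")
    return problems

bad = pd.DataFrame({"postal_code": [270.0, None], "is_complete": ["1", "0"]})
good = pd.DataFrame({"postal_code": ["270", "101"], "is_complete": [1.0, 0.0]})

print(audit_feature_dtypes(bad))
print(audit_feature_dtypes(good))
```

Keeping the categorical and numeric checks separate reflects the point above: a blanket cast to string would fix postal_code but break numeric features such as is_complete.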

Resolution

The fix involved a more robust type-handling strategy within the predict_on_new_data function and the precompute pipeline:

Explicit Casting: Ensuring postal_code is explicitly cast to a string or object type before it reaches the mapping stage.
Handling NaNs: Filling NaN values with a sentinel value before casting to integer-strings to prevent the float coercion that occurs when NaN is present in a standard integer series.
Dtype Preservation: Verifying that the Int64 (nullable integer) fix is preserved through filtering operations (like .between()) and concatenation.
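The first two steps can be sketched as a single helper. The function name and the sentinel value of -1 are assumptions; this page does not state which sentinel the pipeline uses.

```python
import numpy as np
import pandas as pd

# Assumed sentinel; the actual value used by the pipeline is not documented here.
MISSING_SENTINEL = -1

def postal_code_to_str(series: pd.Series, sentinel: int = MISSING_SENTINEL) -> pd.Series:
    # Fill gaps first so the integer cast cannot fail on NaN or pd.NA,
    # cast to int64 to drop any ".0", and only then convert to string.
    return series.fillna(sentinel).astype("int64").astype(str)

from_float = pd.Series([270.0, np.nan, 101.0])             # float-coerced column
from_nullable = pd.Series([270, pd.NA, 101], dtype="Int64")  # nullable integer column

print(postal_code_to_str(from_float).tolist())
print(postal_code_to_str(from_nullable).tolist())
```

With this ordering, a float-coerced column and a nullable Int64 column produce the same strings, and the sentinel string falls outside the encoder vocabulary, so only the rows that were actually missing are encoded as unknown.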

The final implementation ensured that postal_code remains a string-compatible format throughout the pipeline, while numeric features like is_complete are maintained in their required numeric dtypes.
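A dtype-preservation check of the kind mentioned above might look like the following sketch. The sample frame and the 100 to 999 range are assumptions chosen for illustration, not values from the pipeline.

```python
import pandas as pd

# Hypothetical frame using the nullable Int64 dtype for postal_code.
df = pd.DataFrame({
    "postal_code": pd.array([270, None, 101, 5000], dtype="Int64"),
    "is_complete": [1.0, 0.0, 1.0, 1.0],
})

# between() on a nullable column yields <NA> where the value is missing;
# treat those rows as not matching so the mask is a plain boolean.
mask = df["postal_code"].between(100, 999).fillna(False).astype(bool)
filtered = df[mask]

# Concatenating frames that both use Int64 keeps the nullable dtype.
combined = pd.concat([filtered, df[~mask]], ignore_index=True)

for name, frame in [("filtered", filtered), ("combined", combined)]:
    # postal_code must still be nullable integer, is_complete still float.
    assert str(frame["postal_code"].dtype) == "Int64", name
    assert frame["is_complete"].dtype.kind == "f", name
```

Assertions like these can sit after each filtering or concatenation step so that a regression to float64 fails loudly instead of silently degrading model scores.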

Related

Eidos
Kaupskrá Dataset
L2 Model Architecture
Hermes Agent (for reporting prediction errors)
