Kaupskrá Data Cleaning and Outlier Remediation

Summary

This note describes how systemic statistical distortions in the Icelandic property register (kaupskrá) data were identified and fixed, specifically “bundle sales” and placeholder property assessments. The remediation closed the artificial gap between mean and median residuals in the property valuation models, so that bulk transactions no longer skew market analysis.

Details

During the development of the property valuation engine, an investigation by the outlier-investigator identified a massive spread between mean and median residuals in the fjölbýli (multi-family home) segment. In the 2024 dataset, the raw mean residual was +267.6% while the median was only +10.7%. This distortion was traced to two primary root causes within the source data.
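
To illustrate how a small number of extreme rows can pull the mean far from the median, here is a minimal sketch. The residual definition (actual over predicted, minus one, in percent) and the sample values are assumptions for illustration; they are not the engine's confirmed formula or figures from the register.

```python
from statistics import mean, median

def residual_pct(actual: float, predicted: float) -> float:
    """Percentage residual. ASSUMED definition, for illustration only."""
    return (actual / predicted - 1.0) * 100.0

# Synthetic (actual, predicted) pairs: nine ordinary sales plus one row that
# carries a whole-building price against a single-unit prediction.
pairs = [(55.0, 50.0)] * 9 + [(2850.0, 50.0)]
residuals = [residual_pct(a, p) for a, p in pairs]

mean_residual = mean(residuals)
median_residual = median(residuals)

print(f"mean   = {mean_residual:+.1f}%")
print(f"median = {median_residual:+.1f}%")
```

One distorted row in ten is enough to move the mean by hundreds of percentage points while the median stays at the level of the ordinary sales, which is the pattern described above.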

1. Bundle Sales (Bulk Purchases)

The primary cause of distortion (accounting for ~90% of the spread) is the way the property register records bulk purchases. When an entire apartment building is purchased (e.g., by a real estate investment firm), the register creates a separate row for every individual unit. However, each row incorrectly lists the total purchase price for the entire building in the KAUPVERD column, while the FASTEIGNAMAT (official assessment) reflects only that specific unit.

Key Examples:

  • Laugavegur 168C/D: 19 units, each showing a KAUPVERD of 1,023,484,000 ISK against individual unit assessments. This created an apparent 174x price-to-assessment ratio.
  • Stefnisvogur 30/32/34: 50 units, each showing a KAUPVERD of 4,808,210,000 ISK. The apparent ratio was 57x, while the corrected building-level ratio was 1.14.
  • Vatnsholt 1/3: 49 units, corrected ratio of 1.53.

This issue affected approximately 15.6% of all fjölbýli rows in the 2024-2026 period. In contrast, einbýli (single-family homes) only saw a 1-2% bundle rate, explaining why their mean and median were already closely aligned.
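
The difference between the apparent and corrected ratios comes from dividing one building-level price by a single unit's assessment instead of by the sum over all units. A minimal sketch follows; the helper names are hypothetical, and the equal per-unit assessments are made-up values chosen only so that the result lands near the Stefnisvogur ratios quoted above (the real per-unit assessments are not given in this note).

```python
def apparent_ratio(kaupverd: float, unit_fasteignamat: float) -> float:
    """Price-to-assessment ratio as it appears on a single register row."""
    return kaupverd / unit_fasteignamat

def building_ratio(kaupverd: float, unit_fasteignamats: list[float]) -> float:
    """Building price divided by the summed assessments of all its units."""
    return kaupverd / sum(unit_fasteignamats)

kaupverd = 4_808_210_000       # building price repeated on every row (quoted above)
units = [84_350_000.0] * 50    # ASSUMED equal per-unit assessments, illustration only

row_level = apparent_ratio(kaupverd, units[0])
corrected = building_ratio(kaupverd, units)

print(f"apparent ratio per row: {row_level:.1f}x")
print(f"building-level ratio:   {corrected:.2f}")
```

With equal assessments, the corrected ratio is simply the apparent ratio divided by the number of units, which is why 57x across 50 units collapses to roughly 1.14.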

2. Pre-completion Fasteignamat

The secondary cause involves new construction (2022 onwards), particularly in postnr 260 (Reykjanesbær). These properties often have placeholder assessments (fasteignamat) that have not been updated to reflect the finished building. These placeholders are typically around 2,600–3,800 thousand ISK (FASTEIGNAMAT is recorded in thousands of ISK) for apartments ranging from 63–118 sqm, resulting in ~40 thousand ISK/sqm compared to the market norm of ~700 thousand ISK/sqm. This creates artificial 15-20x ratios for legitimate sales.
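
The arithmetic behind these figures can be checked with a short sketch. The function name is hypothetical, all values are in the same units as the figures quoted above, and pairing the low end of the assessment range with the smallest apartment (and the high end with the largest) is an assumption made only for illustration.

```python
MARKET_NORM_PER_SQM = 700   # approximate market norm quoted above
PLACEHOLDER_PER_SQM = 40    # approximate placeholder level quoted above

def per_sqm(fasteignamat: float, einflm: float) -> float:
    """Assessment per square metre, in the same units as FASTEIGNAMAT."""
    return fasteignamat / einflm

# ASSUMED pairing of the range endpoints quoted above.
low_end = per_sqm(2_600, 63)
high_end = per_sqm(3_800, 118)

# A sale priced at the market norm looks this many times "too expensive"
# when it is compared against a placeholder assessment.
inflation = MARKET_NORM_PER_SQM / PLACEHOLDER_PER_SQM

print(f"{low_end:.1f} and {high_end:.1f} per sqm; ratios inflated about {inflation:.1f}x")
```

A sale that would show a ratio near 1 against a finished-building assessment therefore appears at roughly the 15-20x level purely because the denominator is understated.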

Remediation Implementation

The ETL pipeline was updated to include a specific cleaning stage with the following logic:

  1. Bundle Deduplication: The system now groups rows by KAUPVERD and UTGDAG (issue date). If multiple FASTNUM (property numbers) exist for the same price and date, the system either keeps a single representative row or distributes the price across the units. In the initial cleanup, 11,709 rows were corrected across 5,522 bundles.
  2. Placeholder Filtering: A filter was implemented to exclude or flag rows where the ratio of assessment to size (FASTEIGNAMAT / EINFLM) is less than 100. This removed 3,533 rows of low-signal placeholder data.
  3. Ratio Capping: As a secondary safety measure, transactions with a price-to-assessment ratio greater than 5.0 are excluded from training sets unless manually verified.
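
The three steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, the sample rows are synthetic, bundle prices are allocated in proportion to each unit's FASTEIGNAMAT (the text only says the price is distributed), placeholder rows are dropped rather than flagged, and the manual-verification override in step 3 is omitted.

```python
from collections import defaultdict

MIN_ASSESSMENT_PER_SQM = 100   # threshold from step 2
MAX_PRICE_RATIO = 5.0          # threshold from step 3

def split_bundles(rows):
    """Step 1: spread a shared KAUPVERD across the units of a bundle.

    ASSUMPTION: allocation is proportional to each unit's FASTEIGNAMAT.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["KAUPVERD"], row["UTGDAG"])].append(row)

    cleaned = []
    for (price, _date), members in groups.items():
        total = sum(m["FASTEIGNAMAT"] for m in members)
        is_bundle = len({m["FASTNUM"] for m in members}) > 1
        if not is_bundle or total <= 0:
            cleaned.extend(dict(m) for m in members)   # left untouched
            continue
        for m in members:
            cleaned.append({**m, "KAUPVERD": price * m["FASTEIGNAMAT"] / total})
    return cleaned

def drop_placeholders(rows):
    """Step 2: drop rows whose assessment per sqm is below the threshold."""
    return [r for r in rows
            if r["EINFLM"] > 0
            and r["FASTEIGNAMAT"] / r["EINFLM"] >= MIN_ASSESSMENT_PER_SQM]

def cap_ratio(rows):
    """Step 3: drop rows whose price-to-assessment ratio exceeds the cap."""
    return [r for r in rows
            if r["FASTEIGNAMAT"] > 0
            and r["KAUPVERD"] / r["FASTEIGNAMAT"] <= MAX_PRICE_RATIO]

def clean(rows):
    return cap_ratio(drop_placeholders(split_bundles(rows)))

# Synthetic rows, illustration only.
rows = [
    {"FASTNUM": 1, "KAUPVERD": 60_000, "UTGDAG": "2024-03-01", "FASTEIGNAMAT": 55_000, "EINFLM": 80.0},
    {"FASTNUM": 2, "KAUPVERD": 200_000, "UTGDAG": "2024-05-10", "FASTEIGNAMAT": 40_000, "EINFLM": 60.0},
    {"FASTNUM": 3, "KAUPVERD": 200_000, "UTGDAG": "2024-05-10", "FASTEIGNAMAT": 60_000, "EINFLM": 90.0},
    {"FASTNUM": 4, "KAUPVERD": 58_000, "UTGDAG": "2024-06-02", "FASTEIGNAMAT": 3_000, "EINFLM": 75.0},
    {"FASTNUM": 5, "KAUPVERD": 900_000, "UTGDAG": "2024-07-15", "FASTEIGNAMAT": 50_000, "EINFLM": 70.0},
]

result = clean(rows)
for r in result:
    print(r["FASTNUM"], r["KAUPVERD"])
```

In this sample, rows 2 and 3 form a bundle and receive a proportional share of the shared price, row 4 is removed as a placeholder, and row 5 is removed by the ratio cap. Proportional allocation gives every unit in a bundle the building-level ratio; keeping a single representative row, the other option mentioned in step 1, would instead reduce the row count.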

Impact on Model Accuracy

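Following the implementation of these filters, mean residuals fell for every property type. The gap between mean and median narrowed most in the fjölbýli segment, where bundle sales were concentrated: from 18.5 percentage points to under 1. For sérbýli and einbýli, the mean now sits a few points below the median.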

Property Type    Before (Mean / Median)    After (Mean / Median)
Fjölbýli         +29.2% / +10.7%           +6.5% / +7.4%
Sérbýli          +15.7% / +10.4%           +5.2% / +9.6%
Einbýli          +14.0% / +13.0%           +7.8% / +11.5%

The removal of these outliers allows the Eidos valuation models to train on a much cleaner representation of the Icelandic real estate market, preventing bulk commercial transactions from inflating residential price estimates.