Icelandic Real Estate NLP and Presentation Modeling

Summary

A specialized NLP pipeline and predictive model that analyzes, scores, and optimizes Icelandic real estate listing descriptions. It uses IceBERT for semantic embeddings and Gemini 3.1 Flash Lite to generate actionable improvements grounded in empirical price-lift correlations.

Details

The Icelandic Real Estate NLP system is a core component of the Sokrates/Residual-Modeling stack, focused on the “Presentation” aspect of property listings. It addresses the challenge of processing Icelandic, a morphologically complex, low-resource language, in the context of the domestic real estate market.

Embedding Architecture

The system uses IceBERT (mideind/IceBERT), a RoBERTa-based model trained on the Icelandic Gigaword Corpus, rather than general-purpose multilingual APIs. In-project benchmarking on Icelandic listing data found that Voyage AI (voyage-large) and IceBERT embeddings agree only weakly: a Spearman correlation (ρ) of 0.185 and a median k=5 nearest-neighbor overlap of 0%. The two models therefore organize the same listings very differently, and the project took this as support for the view that IceBERT captures the domain-specific semantic structure of Icelandic real estate better than multilingual models.
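
The two agreement metrics can be sketched as follows. This is a minimal illustration, not the project's benchmark code: it assumes the Spearman correlation was computed over pairwise cosine similarities of the same listings in both embedding spaces, and that neighbor overlap is the share of each listing's k nearest neighbors that both spaces have in common. All function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_of_similarities(emb_a, emb_b):
    """Spearman rho between the pairwise similarities of two embedding sets.

    emb_a[i] and emb_b[i] must describe the same listing.
    """
    pairs = list(combinations(range(len(emb_a)), 2))
    sims_a = [cosine(emb_a[i], emb_a[j]) for i, j in pairs]
    sims_b = [cosine(emb_b[i], emb_b[j]) for i, j in pairs]
    # Spearman is Pearson applied to ranks.
    return pearson(ranks(sims_a), ranks(sims_b))


def top_k(emb, i, k):
    """Indices of the k listings most similar to listing i (excluding i)."""
    others = [j for j in range(len(emb)) if j != i]
    others.sort(key=lambda j: cosine(emb[i], emb[j]), reverse=True)
    return set(others[:k])


def median_neighbor_overlap(emb_a, emb_b, k=5):
    """Median fraction of k-nearest neighbors shared by the two spaces."""
    overlaps = [
        len(top_k(emb_a, i, k) & top_k(emb_b, i, k)) / k
        for i in range(len(emb_a))
    ]
    return median(overlaps)
```

A median overlap of 0% under this definition means that, for at least half of the listings, the two models share none of the five nearest neighbors.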

The embedding pipeline follows this flow:

  1. Inference: IceBERT (768-dimensional output) runs serverless on RunPod (AMPERE_48 GPU) to manage costs.
  2. Pooling: Mean pooling is applied to the model output.
  3. Dimensionality Reduction: Vectors are reduced to 50 dimensions using a pre-fitted PCA (Principal Component Analysis) model, which is cached as a pickle file.
  4. Storage: Historical vectors are stored in Parquet format or within pgvector in Postgres.
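
Steps 2 and 3 can be illustrated with a small dependency-free sketch. It assumes that padding tokens are excluded from the mean via the attention mask and that the PCA is a plain (non-whitening) projection defined by a mean vector and a component matrix. The function names and toy values are hypothetical; the real pipeline works with 768-dimensional vectors and 50 components.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def pca_project(vector, mean, components):
    """Apply a pre-fitted PCA: center the vector, then take its dot
    product with each principal component."""
    centered = [x - m for x, m in zip(vector, mean)]
    return [sum(c * x for c, x in zip(comp, centered)) for comp in components]


# Toy example: 3 tokens of dimension 4, the last token is padding.
tokens = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [100.0, 100.0, 100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)

# Hypothetical pre-fitted PCA with 2 components (stand-in for the cached model).
pca_mean = [1.0, 1.0, 1.0, 1.0]
pca_components = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0]]
reduced = pca_project(pooled, pca_mean, pca_components)
```

If the cached pickle holds a scikit-learn PCA fitted without whitening, its transform method performs the same centering and projection, so a manual projection like this would only be needed outside that library.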

The Presentation Model

The “Presentation Model” is a LightGBM regressor that predicts the price residual (the difference between actual sale price and structural/spatial predictions) based on features extracted from the listing.

Key findings from model interpretation include:

  • Positive Price Drivers: High image count (the strongest signal), mentions of specific materials (“granít”, “eik”, “innfelld”), precise measurements, and premium amenities.
  • Negative Price Drivers: Generic boilerplate (“hafðu samband”, “kostnaðarlausu”) and overused clichés (“björt og rúmgóð”).
  • Centering: So that feedback is meaningful, predictions are centered by subtracting the training-set mean. An “average” description then maps to a 0% price delta, superior descriptions show a positive lift (up to ~9.6% in the top quartile), and poor ones show a negative impact.
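
The centering step can be sketched in a few lines. This example assumes the regressor outputs the residual as a fraction of price (0.03 meaning 3% above the structural/spatial prediction). The source does not state the unit, so treat that, along with the function names and sample numbers, as illustrative.

```python
from statistics import mean


def fit_center(train_predictions):
    """Mean raw prediction over the training set; stored once after training."""
    return mean(train_predictions)


def centered_delta_pct(raw_prediction, center):
    """Price delta in percent, relative to an average description."""
    return (raw_prediction - center) * 100.0


# Hypothetical raw model outputs on the training set (fractional residuals).
train_preds = [0.01, 0.02, 0.03, 0.06]
center = fit_center(train_preds)

average_listing = centered_delta_pct(0.03, center)   # close to 0: an average description
strong_listing = centered_delta_pct(0.05, center)    # positive lift
weak_listing = centered_delta_pct(0.01, center)      # negative impact
```

Without this shift, a model whose raw outputs are all slightly positive would report a lift for every description, including mediocre ones.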

Generative Optimization

For the user-facing “Presentation Tab,” the system uses gemini-3.1-flash-lite-preview to generate rewrite suggestions. This process is “grounded” by the empirical findings of the LightGBM model. Instead of generic creative writing, the agent is prompted to:

  1. Identify specific negative patterns (e.g., boilerplate phrases) found in the user’s text.
  2. Suggest specific additions based on high-correlation keywords (e.g., naming specific kitchen appliance brands or floor materials).
  3. Rewrite sections of the description while maintaining factual integrity (avoiding “hallucinated” measurements).
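
One way the grounding could work is to detect known patterns in the text first and pass them to the model as explicit instructions. The sketch below is an assumption about the approach, not the project's actual prompt: the phrase lists reuse the examples quoted under the Presentation Model findings, the prompt wording and function names are invented for illustration, and the call to the Gemini API is deliberately left out.

```python
# Example phrases taken from the findings above; a real list would be longer.
BOILERPLATE = ("hafðu samband", "kostnaðarlausu", "björt og rúmgóð")
POSITIVE_TERMS = ("granít", "eik", "innfelld")


def find_phrases(text, phrases):
    """Return the phrases that occur in the text (case-insensitive substring match)."""
    lowered = text.casefold()
    return [phrase for phrase in phrases if phrase in lowered]


def build_prompt(description):
    """Assemble a rewrite prompt grounded in the detected patterns."""
    flagged = find_phrases(description, BOILERPLATE)
    present = find_phrases(description, POSITIVE_TERMS)
    missing = [term for term in POSITIVE_TERMS if term not in present]
    lines = [
        "Rewrite the Icelandic real estate listing below.",
        "Remove or replace these generic phrases: " + (", ".join(flagged) or "none found"),
        "Ask whether these details apply before adding them: " + (", ".join(missing) or "none"),
        "Do not invent measurements, materials, or amenities that are not in the original text.",
        "",
        "Listing:",
        description,
    ]
    return "\n".join(lines)


listing = "Björt og rúmgóð íbúð með eik á gólfum. Hafðu samband!"
prompt = build_prompt(listing)
```

Plain substring matching is a simplification. Icelandic inflection means a phrase can appear in forms this check misses, and a short term such as “eik” also matches inside unrelated words like “leikvöllur”, so a production version would likely need lemmatization or token-level matching.
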

Related Topics

  • IceBERT
  • Gemini
  • Voyage AI
  • Residual Modeling
  • RunPod
  • Icelandic NLP
  • Eidos
  • sokrates-box