Mustafa Erbay

Listing Price and Real Rent Are Not the Same: The Reality of Data…

Why scraped listing data doesn't reflect the real market, plus the technical challenges of data cleaning — from my own experience.

In early March, to make sense of the absurdity in the rental market, I dropped a small scraper script onto my own VPS. The goal was to build a more realistic rent multiplier engine for hesapciyiz.com, but by the end of the first 48 hours I realized the data I had was basically a giant lie.

There is sometimes a chasm between the 45,000 TL listing I see on screen for Beşiktaş and the price actually written on the rental contract. Writing a selector and dumping JSON isn’t enough; cleaning the “negotiation margin” and “fake listing” noise out of the field is the actual job.

The Noise in the Dataset: Not “Selling” Price, but “Wishing” Price

When I dumped the first 10,000 rows of data into SQLite, I noticed something strange: some listings had been live for 3 months, and their price kept rising by a steady 5%. Nobody rents them, yet the price keeps climbing. That’s not the market price; it’s the property owner’s “I wish it would go for this” fantasy.

This scraper is one of the 13 containers running on my VPS, and early on it even made sshd unresponsive. Just as in the famous disk-fill incident I had on April 28, I was dumping logs straight under /var/log instead of tmpfs, with no limits at all. As the data grew, I saw once again how an unchecked dataset can drain system resources.

When I tried to model rentals taking listing prices as “real,” the system told me rents in Istanbul had dropped 15% in the last 2 months. But out on the street the situation was the opposite. Turns out real estate agents, to stay at the top of search results, were deleting old listings and reposting “new” listings 2,000 TL cheaper, then on the phone saying “that one’s gone, but check this one out.”
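One way to keep a delete-and-repost cycle from looking like a price drop is to fingerprint each listing by attributes that should survive the repost, then treat a new listing ID with a known fingerprint as the same flat. The sketch below is a rough illustration in Python, not the pipeline from this post; every field name (district, rooms, area_m2, agent_id) is an assumption about what the scraper captures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    # All field names are hypothetical; adapt them to the scraped schema.
    listing_id: str
    district: str
    rooms: str
    area_m2: int
    agent_id: str
    price: int


def fingerprint(listing: Listing) -> tuple:
    """Key built from attributes that should survive a delete-and-repost."""
    return (
        listing.district.strip().lower(),
        listing.rooms,
        listing.area_m2,
        listing.agent_id,
    )


def find_relists(listings: list[Listing]) -> dict[tuple, list[Listing]]:
    """Group by fingerprint; keep groups that contain more than one listing ID."""
    groups: dict[tuple, list[Listing]] = {}
    for item in listings:
        groups.setdefault(fingerprint(item), []).append(item)
    return {
        key: group
        for key, group in groups.items()
        if len({item.listing_id for item in group}) > 1
    }


sample = [
    Listing("a1", "Beşiktaş", "2+1", 90, "agent-7", 45000),
    Listing("a2", "Beşiktaş", "2+1", 90, "agent-7", 43000),  # reposted 2,000 TL cheaper
    Listing("b1", "Kadıköy", "1+1", 60, "agent-3", 30000),
]
relists = find_relists(sample)
```

A fingerprint this coarse will also merge two genuinely different flats of the same size from the same agent, so it is better suited to raising a flag for review than to deleting rows automatically.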

Cleaning Data in the Pipeline: SQLite and Heuristic Filters

To solve this pollution, I built a simple preflight resource guard mechanism on the Node.js side. Each incoming data row is checked against the “spam patterns” I’ve previously seen. If a listing has been live for more than 30 days and its price has changed more than 3 times, I move it from “market signal” to “noise.”
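The rule described above can be sketched in a few lines. My actual guard runs on the Node.js side; this is a Python illustration only, and the names first_seen, price_change_count, and classify are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date

MAX_DAYS_LIVE = 30      # threshold from the rule above
MAX_PRICE_CHANGES = 3   # threshold from the rule above


@dataclass
class ListingStats:
    first_seen: date          # hypothetical: date the scraper first saw the listing
    price_change_count: int   # hypothetical: number of observed price edits


def classify(stats: ListingStats, today: date) -> str:
    """Return 'noise' for long-lived listings with frequent price edits."""
    days_live = (today - stats.first_seen).days
    if days_live > MAX_DAYS_LIVE and stats.price_change_count > MAX_PRICE_CHANGES:
        return "noise"
    return "market_signal"
```

Both conditions must hold: a listing that is old but stable, or new but frequently edited, stays in the “market signal” bucket.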

At one point I wrecked the SQLite indexes while running these queries. Because I had forgotten to run VACUUM, my 2 GB database was taking up 5 GB on disk.

-- Simple logic I use to clean noisy data:
-- drop listings last updated more than 30 days ago whose price changed more than 3 times
DELETE FROM listings
WHERE updated_at < date('now', '-30 days')
  AND price_change_count > 3;

-- For a realistic price estimate, subtract a 15% 'negotiation/fake' margin
UPDATE listings
SET estimated_real_price = price * 0.85
WHERE source = 'web_scraper';

After this cleanup, what I was left with stopped being “listing price” and became “likely transaction price.” To protect resources on my VPS, I run these jobs at 3 a.m. via systemd timers without occupying the GitHub Actions runners.
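The VACUUM problem mentioned earlier is easy to reproduce: with SQLite's default settings, deleted rows leave free pages inside the file, and the file only shrinks once VACUUM rebuilds it. The sketch below uses Python's standard sqlite3 module on a throwaway database; the table layout and the helper name vacuum_and_measure are made up for the demonstration.

```python
import os
import sqlite3
import tempfile


def vacuum_and_measure(db_path: str) -> tuple[int, int]:
    """Run VACUUM and return (size_before, size_after) in bytes."""
    before = os.path.getsize(db_path)
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("VACUUM")  # must run outside any open transaction
    finally:
        conn.close()
    return before, os.path.getsize(db_path)


# Build a throwaway database, fill it, then delete everything.
db_path = os.path.join(tempfile.mkdtemp(), "listings_demo.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO listings (payload) VALUES (?)",
    [("x" * 1000,) for _ in range(2000)],
)
conn.commit()
conn.execute("DELETE FROM listings")
conn.commit()
conn.close()

# The file is still large here; VACUUM returns the free pages to the OS.
size_before, size_after = vacuum_and_measure(db_path)
```

VACUUM rewrites the whole database into a temporary copy, so it needs spare disk space while it runs; on a box that is already nearly full, that is worth checking first.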

The Bitter Realities of Doing Data Operations on a VPS

If you’re trying to fit everything into one box like I am, a disk fire is inevitable. At one point the Docker build cache had reached 33 GB, with another 23 GB of unused images next to it. When the server hit 100% disk usage, the scraper started writing data as null.
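A cheap defence against the null rows described above is to check disk usage before each write and fail loudly instead. This is a minimal Python sketch, assuming a 90% limit picked for illustration; the function names are hypothetical and not part of my actual scraper.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def guarded_write(rows: list, sink: list, used_percent: float,
                  limit_percent: float = 90.0) -> int:
    """Append rows to sink only when disk usage is below the limit.

    Raises instead of silently degrading, so a full disk shows up as a
    failed job rather than as null rows in the dataset.
    """
    if used_percent >= limit_percent:
        raise RuntimeError(
            f"disk at {used_percent:.0f}%, refusing to write {len(rows)} rows"
        )
    sink.extend(rows)
    return len(rows)
```

In a real job the call would look like guarded_write(batch, buffer, disk_used_percent("/")), with the in-memory sink replaced by the actual database insert.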

That day I learned that collecting data isn’t only about writing code; it’s also about managing the disk and swap space the data lives in. Doing bulk INSERTs into SQLite while an Astro build is trying to come up in 7.6 GB of system memory is a full-on suicide attempt.

Problem     | Symptom                        | Solution
Disk full   | Data arriving as null          | docker system prune -af
OOM-killed  | Containers crashing randomly   | Swap file increase + polling wait
SSH timeout | sshd failing to accept packets | CPU limit (cgroups)

In my own projects (e.g. spamkalkani.com or islistesi.com) I now solve these data anomaly situations with auto-fix patterns. If a job runs longer than 10 seconds, the pipeline automatically throttles resource usage and fires a dedup-alert at me.
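The 10-second rule can be approximated with a small wrapper that times a job, raises an alert when it runs over budget, and shrinks the next batch. This is a Python sketch under assumptions: halving the batch size stands in for “throttling”, and the alert callback stands in for whatever channel delivers the real alert.

```python
import time
from typing import Callable

SLOW_JOB_SECONDS = 10.0  # budget taken from the rule above


def run_with_budget(
    job: Callable[[int], None],
    batch_size: int,
    alert: Callable[[str], None],
    clock: Callable[[], float] = time.monotonic,
    budget: float = SLOW_JOB_SECONDS,
) -> int:
    """Run job(batch_size) and return the batch size to use next time."""
    started = clock()
    job(batch_size)
    elapsed = clock() - started
    if elapsed > budget:
        alert(f"job took {elapsed:.1f}s (budget {budget:.0f}s)")
        return max(1, batch_size // 2)  # throttle: halve the next batch
    return batch_size
```

The clock is injectable so the behaviour can be tested without actually waiting ten seconds.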

Pipeline Reliability: Why Automation Is a Must

In data collection, your biggest enemy isn’t “changing web structures,” it’s “filling disks and caches.” Running my own self-hosted GitHub Actions runner, I’ve lost count of how many times the pipeline blew up because directories under _work/_temp weren’t deleted.
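Leftover temp directories like these can be swept by age. The sketch below is one possible approach in Python, not the cleanup I actually run; the one-day threshold and the helper name remove_stale_dirs are assumptions, and the demo uses a throwaway directory rather than a real runner path.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path
from typing import Optional


def remove_stale_dirs(root: Path, max_age_seconds: float,
                      now: Optional[float] = None) -> list[str]:
    """Delete direct subdirectories of root not modified within max_age_seconds."""
    now = time.time() if now is None else now
    removed: list[str] = []
    if not root.is_dir():
        return removed
    for entry in sorted(root.iterdir()):
        if entry.is_symlink() or not entry.is_dir():
            continue  # never follow symlinks out of the temp area
        if now - entry.stat().st_mtime > max_age_seconds:
            shutil.rmtree(entry, ignore_errors=True)
            removed.append(entry.name)
    return removed


# Demo in a throwaway directory: one directory aged two days, one fresh.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "old_job").mkdir()
(demo_root / "fresh_job").mkdir()
two_days_ago = time.time() - 2 * 86400
os.utime(demo_root / "old_job", (two_days_ago, two_days_ago))
removed = remove_stale_dirs(demo_root, max_age_seconds=86400)
```

Filtering by modification time rather than deleting everything avoids pulling a directory out from under a job that is still running.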

On the Cloudflare side, a separate pain was that Astro returned max-age=0, so all traffic hit the VPS directly. By overriding those headers in Nginx, I was able to lighten the load on the server a bit. Otherwise, while pulling data, I’d be forcing visitors to my site to stare at “502 Bad Gateway.”

In the end, no row of data you scrape from the internet is “pure truth.” It is a digital trace tampered with by a person (usually an agent or seller). To reach real data, you have to scrape that trace, clean it, and, most importantly, optimize your own infrastructure so it doesn’t collapse under the load.

These days I’m feeding the cleaned data into the backend of hesapciyiz.com. In a future post I’ll talk about how to store this much data more cheaply on Cloudflare R2. If you have a similar scraper experience or a “disk full, site crashed” moment, let’s meet in the comments (or by email).

Frequently Asked Questions

Common questions readers have about this article.

How can I filter out the “wishful” listing prices that inflate the scraped rent data?
I started by adding a timestamp column to every row and then computed the month-over-month change for each property ID. If the price increased more than 3% in a single month without any status change (e.g., still “active”), I flagged it as a “wishful” entry. Next, I cross-referenced the price with the average of the last three valid contracts for the same neighbourhood; any outlier beyond two standard deviations was dropped. I also built a simple heuristic that removes listings older than 60 days that never changed status. This two-step filter (trend detection plus statistical outlier removal) cut the noise by roughly 70%.
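The two checks in this answer can be sketched with Python's statistics module. Treat it as an illustration under assumptions: the function names are invented, and the reference contract rents in the usage note are made-up numbers, not real data.

```python
from statistics import mean, stdev


def month_over_month_change(previous: float, current: float) -> float:
    """Relative price change between two consecutive monthly observations."""
    return (current - previous) / previous


def is_wishful(previous: float, current: float, status_changed: bool,
               threshold: float = 0.03) -> bool:
    """Flag a price rise above the threshold on a listing whose status never changed."""
    return (not status_changed) and month_over_month_change(previous, current) > threshold


def drop_outliers(prices: list[float], reference: list[float],
                  max_sigma: float = 2.0) -> list[float]:
    """Keep prices within max_sigma standard deviations of the reference mean.

    `reference` is the list of recent contract rents for the neighbourhood;
    it needs at least two values for a sample standard deviation.
    """
    mu = mean(reference)
    sigma = stdev(reference)
    return [p for p in prices if abs(p - mu) <= max_sigma * sigma]
```

For example, with reference rents of 30,000, 32,000 and 34,000 TL, the mean is 32,000 and the sample standard deviation is 2,000, so a 45,000 TL listing falls outside the 2-sigma band while 31,000 and 35,000 stay in. With only three reference points the standard deviation is a very rough estimate, so the band is best treated as a flag rather than a verdict.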
What tools and storage configuration helped you avoid OOM crashes on an 8 GB VPS while running multiple scrapers?
I switched from dumping raw JSON files to streaming directly into a SQLite database opened in WAL mode, which lets reads continue while a write is in progress. All temporary files, including logs, were redirected to a tmpfs mount (size 1 GB) so they never filled the root partition. I also capped each Docker container’s memory at 1 GB and used `--restart=on-failure` to automatically recycle a container that hit its limit. For log rotation I employed `logrotate` with a daily rotation and a 7-day retention policy. Finally, I monitored RAM with `htop` and set a cron job to restart the Astro build if usage exceeded 75%.
Is it true that a property’s listing price always matches the rent written in the contract?
In my experience, that belief is a myth. During the first 48 hours of scraping Istanbul rentals, I found dozens of listings that stayed at 45,000 TL for months while the actual contracts I later inspected were 10–15% lower. Owners often post an aspirational price to test the market or to give themselves room for negotiation. The discrepancy becomes even clearer when you compare the “live” price with the final signed rent; the latter usually settles near the median of comparable units, not the headline figure. So, treating the listing price as the ground truth will inevitably skew any model you build.
What’s the most reliable way to validate scraped rent data against real‑world market values?
I built a validation pipeline that mixes automated checks with spot-checking. First, I pull the latest official rental index from the Turkish Statistical Institute (TurkStat, TÜİK) and calculate the expected price range for each district. Then I compare each scraped entry against that range; anything outside ±20% is flagged. For the flagged rows I pull the actual lease agreements from my own `hesapciyiz.com` users (who voluntarily share PDFs) and run OCR to extract the contract rent. Finally, I keep a rolling sample of 200 manually verified listings each month to recalibrate the thresholds. This hybrid approach kept my model’s error margin under 5% after three months.
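The ±20% band check from this answer is simple enough to sketch. In the Python illustration below, the function names and the expected-rent mapping are hypothetical, and fetching the official index is deliberately left out.

```python
def is_within_band(price: float, expected: float, tolerance: float = 0.20) -> bool:
    """True when price lies within ±tolerance of the expected rent."""
    return abs(price - expected) <= tolerance * expected


def flag_entries(
    entries: list[tuple[str, float]],
    expected_by_district: dict[str, float],
    tolerance: float = 0.20,
) -> list[tuple[str, float]]:
    """Return (district, price) pairs that fall outside the band.

    Entries for districts with no expected value are flagged too, so they
    go to manual review instead of passing silently.
    """
    flagged = []
    for district, price in entries:
        expected = expected_by_district.get(district)
        if expected is None or not is_within_band(price, expected, tolerance):
            flagged.append((district, price))
    return flagged
```

With an assumed expected rent of 40,000 TL for a district, a 45,000 TL listing passes (within 8,000 TL of the expected value) and a 50,000 TL listing is flagged for the contract cross-check.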
Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security and Software

Since 2006 I have been working on system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that holds up in the field.
