In early March, to make sense of the absurdity in the rental market, I deployed a small scraper script on my own VPS. The goal was to build a more realistic rent multiplier engine for hesapciyiz.com, but by the end of the first 48 hours I realized the data I had collected was basically a giant lie.
The 45,000 TL listing I see on screen for Beşiktaş and the price actually written on the rental contract are sometimes separated by a chasm. Writing a selector and dumping JSON isn’t enough; cleaning the “negotiation margin” and “fake listing” noise out of the field data is the actual job.
The Noise in the Dataset: Not “Selling” Price, but “Wishing” Price
When I loaded the first 10,000 rows into SQLite, I noticed something strange: some listings had been live for 3 months while their price kept climbing by a steady 5%. Nobody rents them, yet the price keeps going up. That’s not the market price; it’s the property owner’s “I wish it would go for this” fantasy.
This scraper is one of the 13 containers running on my VPS, and at first it even made sshd unresponsive. Just as in the infamous disk-fill incident I had on April 28, I was dumping logs straight under /var/log, with no limits at all, instead of onto tmpfs. As the data grew, I saw once again how an unchecked dataset can drain system resources.
When I modeled rents by treating listing prices as “real,” the system told me rents in Istanbul had dropped 15% in the last 2 months. Out on the street, the situation was the opposite. It turned out that real estate agents, to stay at the top of search results, were deleting old listings and reposting them as “new” ones 2,000 TL cheaper, then saying on the phone, “that one’s gone, but check this one out.”
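One way to neutralize the repost trick is to key listings on the flat itself rather than on the listing ID, which changes with every repost. The following is a minimal sketch in Python, used here only for illustration (the pipeline itself runs on Node.js). The `Listing` fields, the fingerprint choice (agent, district, rooms, area) and the sample rows are all assumptions, not the real schema or real data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    listing_id: str  # hypothetical: ID assigned by the listing site
    agent: str
    district: str
    rooms: str       # e.g. "2+1"
    area_m2: int
    price: int       # TL
    first_seen: str  # ISO date, so string order equals chronological order


def fingerprint(listing):
    """Identify the flat itself, not the listing ID that changes on repost."""
    return (
        listing.agent.strip().lower(),
        listing.district.strip().lower(),
        listing.rooms,
        listing.area_m2,
    )


def repost_chains(listings):
    """Group listings by fingerprint and keep only groups with more than one entry."""
    groups = defaultdict(list)
    for listing in listings:
        groups[fingerprint(listing)].append(listing)
    chains = {}
    for key, chain in groups.items():
        if len(chain) > 1:
            # Oldest first, so chain[0] is the original listing.
            chains[key] = sorted(chain, key=lambda l: l.first_seen)
    return chains


# Made-up sample rows for illustration only.
sample = [
    Listing("a1", "Acme Emlak", "Beşiktaş", "2+1", 90, 45000, "2024-03-01"),
    Listing("a7", "acme emlak ", "Beşiktaş", "2+1", 90, 43000, "2024-03-20"),
    Listing("b2", "Other Agent", "Kadıköy", "1+1", 60, 30000, "2024-03-05"),
]
chains = repost_chains(sample)
```

Counting each chain as a single listing keeps a 2,000 TL "drop" that is really a repost from being read as a market move. A fingerprint this coarse will also merge genuinely different flats from the same agent, so it is a starting point rather than a finished dedup rule.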
Cleaning Data in the Pipeline: SQLite and Heuristic Filters
To solve this pollution, I built a simple preflight resource guard mechanism on the Node.js side. Each incoming data row is checked against the “spam patterns” I’ve previously seen. If a listing has been live for more than 30 days and its price has changed more than 3 times, I move it from “market signal” to “noise.”
At one point I wrecked the SQLite indexes while running these queries. And because I forgot to run VACUUM after the bulk deletes, a database holding 2 GB of data was taking up 5 GB on disk.
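The VACUUM effect is easy to reproduce on a throwaway database. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the table layout, row count and file name are assumptions, not the real `listings` schema. It measures the file size before a bulk delete, after it, and after VACUUM.

```python
import os
import sqlite3
import tempfile

# Throwaway database file in a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "listings_demo.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO listings (payload) VALUES (?)",
    (("x" * 500,) for _ in range(20000)),
)
conn.commit()
size_full = os.path.getsize(db_path)

# Deleting rows marks pages as free inside the file; with SQLite's default
# settings those pages are not handed back to the filesystem.
conn.execute("DELETE FROM listings WHERE id > 100")
conn.commit()
size_after_delete = os.path.getsize(db_path)

# VACUUM rebuilds the database file without the free pages.
# It cannot run inside an open transaction, hence the commit above.
conn.execute("VACUUM")
size_after_vacuum = os.path.getsize(db_path)
conn.close()

print(size_full, size_after_delete, size_after_vacuum)
```

VACUUM works by rebuilding the database into a temporary copy, so it needs spare disk space while it runs. On a box that is already close to full, that is worth checking before scheduling it.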
-- Simple logic I use to clean noisy data:
-- drop rows older than 30 days whose price changed more than 3 times
DELETE FROM listings
WHERE updated_at < date('now', '-30 days')
  AND price_change_count > 3;
-- Subtract a 15% 'negotiation/fake' margin for realistic price estimates
UPDATE listings
SET estimated_real_price = price * 0.85
WHERE source = 'web_scraper';
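The SQL above deletes noisy rows outright. If the goal is to move them from “market signal” to “noise” while keeping them around, the same rule can be applied per row before insertion. This is a minimal sketch in Python for illustration only (the real guard lives on the Node.js side). The `Row` type and its `first_seen` field are hypothetical; the thresholds and the 15% margin come from the text.

```python
from dataclasses import dataclass
from datetime import date

MAX_DAYS_LIVE = 30         # "live for more than 30 days"
MAX_PRICE_CHANGES = 3      # "price has changed more than 3 times"
NEGOTIATION_MARGIN = 0.15  # the 15% margin from the SQL


@dataclass
class Row:
    price: int               # listed price in TL
    first_seen: date         # hypothetical field: when the scraper first saw it
    price_change_count: int


def classify(row, today):
    """Return 'noise' for long-lived listings with many price changes, else 'signal'."""
    days_live = (today - row.first_seen).days
    if days_live > MAX_DAYS_LIVE and row.price_change_count > MAX_PRICE_CHANGES:
        return "noise"
    return "signal"


def estimated_real_price(row):
    """Apply the flat negotiation margin to the listed price."""
    return round(row.price * (1 - NEGOTIATION_MARGIN))


# Made-up rows for illustration only.
today = date(2024, 6, 1)
stale = Row(price=45000, first_seen=date(2024, 3, 1), price_change_count=5)
fresh = Row(price=45000, first_seen=date(2024, 5, 20), price_change_count=5)
labels = [classify(stale, today), classify(fresh, today)]
```

Labeling instead of deleting keeps the noisy rows available for later checks, for example to see whether the 30-day and 3-change thresholds are too aggressive.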
After this cleanup, what I was left with stopped being “listing price” and became “likely transaction price.” To protect resources on my VPS, I run these jobs at 3 a.m. via systemd timers without occupying the GitHub Actions runners.
The Bitter Realities of Doing Data Operations on a VPS
If you’re trying to fit everything into one box like I am, a disk fire is inevitable. At one point the Docker build cache had reached 33 GB, with another 23 GB of unused images sitting next to it. When the server hit 100% disk usage, the scraper started writing data as null.
That day I learned that collecting data isn’t only about writing code; it’s also about managing the disk and swap space the data lives in. Doing bulk INSERTs into SQLite while an Astro build is trying to come up in 7.6 GB of system memory is a full-on suicide mission.
| Problem | Symptom | Solution |
|---|---|---|
| Disk full | Data arriving as null | docker system prune -af |
| OOM-killed | Containers crashing randomly | Larger swap file + delay between polls |
| SSH timeout | sshd failing to accept packets | CPU limit (cgroups) |
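The “disk full, data arriving as null” row is the kind of failure a preflight check can stop before a batch starts. The sketch below is in Python for illustration only. The 10% threshold and the function names are assumptions; the one real API used is `shutil.disk_usage`.

```python
import shutil

MIN_FREE_RATIO = 0.10  # assumed threshold: refuse writes below 10% free disk


def free_ratio(path="/"):
    """Fraction of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def preflight_ok(path="/", min_free_ratio=MIN_FREE_RATIO, probe=free_ratio):
    """True when there is enough free disk to start a scrape batch."""
    return probe(path) >= min_free_ratio


def run_batch(write_rows, path="/", probe=free_ratio):
    """Run `write_rows` only if the disk check passes; otherwise fail loudly."""
    if not preflight_ok(path, probe=probe):
        raise RuntimeError("disk almost full, skipping batch instead of writing nulls")
    return write_rows()
```

The `probe` parameter exists only so the check can be exercised without filling a disk. In a real job, calling `run_batch(insert_fn)` with the default probe would be enough.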
In my own projects (e.g. spamkalkani.com or islistesi.com) I now solve these data anomaly situations with auto-fix patterns. If a job runs longer than 10 seconds, the pipeline automatically throttles resource usage and fires a dedup-alert at me.
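The 10-second rule can be sketched as a small wrapper that times a job and calls a hook when the budget is exceeded. It is written in Python for illustration only. The hook `on_over_budget` is a hypothetical placeholder for whatever throttling and alerting the real pipeline does, and the wrapper only notices the overrun after the job returns; it does not interrupt it.

```python
import time

JOB_BUDGET_SECONDS = 10.0  # from the text above


def run_with_budget(job, on_over_budget, budget=JOB_BUDGET_SECONDS, clock=time.monotonic):
    """Run `job`, then call `on_over_budget(elapsed)` if it took longer than `budget`.

    Returns a (result, elapsed_seconds) tuple.
    """
    start = clock()
    result = job()
    elapsed = clock() - start
    if elapsed > budget:
        on_over_budget(elapsed)
    return result, elapsed
```

A call such as `run_with_budget(scrape_page, send_alert)` would be the typical shape, where both names are placeholders. The `clock` parameter defaults to `time.monotonic`, so system clock adjustments do not distort the measurement.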
Pipeline Reliability: Why Automation Is a Must
In data collection your biggest enemy isn’t “changing web structures,” it’s “filling disks and caches.” Running my own self-hosted runner on GitHub Actions, I can’t even count the number of times the pipeline blew up because directories under _work/_temp weren’t deleted.
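Leftover working directories can be swept by age. The following is a minimal sketch in Python for illustration only. The function name and the age threshold are assumptions, and the root path is whatever directory holds the leftovers (for example the runner's _work/_temp).

```python
import shutil
import time
from pathlib import Path


def remove_stale_dirs(root, max_age_seconds, now=None):
    """Delete direct subdirectories of `root` not modified within `max_age_seconds`.

    Returns the names of the removed directories, sorted alphabetically.
    """
    now = time.time() if now is None else now
    removed = []
    root = Path(root)
    if not root.is_dir():
        return removed
    for entry in sorted(root.iterdir()):
        # Skip plain files and symlinks; only real directories are removed.
        if entry.is_dir() and not entry.is_symlink():
            if now - entry.stat().st_mtime > max_age_seconds:
                shutil.rmtree(entry, ignore_errors=True)
                removed.append(entry.name)
    return removed
```

Age is judged by the directory's own modification time, which changes when entries are added or removed directly inside it, not when files deeper down are edited. A generous threshold, or running the sweep only when no job is active, avoids deleting a directory that is still in use.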
On the Cloudflare side, Astro was returning max-age=0, so nothing was cached and all traffic hit the VPS directly, which was a separate pain. By overriding those headers in Nginx I was able to lighten the load on the server a bit. Otherwise, while pulling data, I’d be forcing visitors to my site to stare at “502 Bad Gateway.”
In the end, no row of data you scrape from the internet is “pure truth.” Each one is a digital trace that a person (usually an agent or seller) has tampered with. To reach real data, you have to scrape that trace, clean it, and most importantly tune your own infrastructure so it doesn’t collapse under the load.
These days I’m feeding the cleaned data into the backend of hesapciyiz.com. In a future post I’ll talk about how to store this much data more cheaply on Cloudflare R2. If you have a similar scraper experience or a “disk full, site crashed” moment, let’s meet in the comments (or by email).