Last month, when I added a new financial calculator to hesapciyiz.com, I started collecting user interaction data. My goal was to understand which calculators got more attention and which features were used. At first everything looked fine. I was streaming the collected data into a simple dashboard.
Until one morning when I checked my reports. The graphs were meaningless, some fields were missing, and some values were unbelievable. I saw users typing “abcdef” into the “Age” field and entering negative numbers in “Income.” The data I was collecting had started turning into trash. That’s when I realized once again: collecting raw data is easy, but collecting reliable, useful data is pure hell. In this post I’ll explain how I climbed out of that hell and the steps I apply in my own pipelines.
The Problem: Pitfalls of Raw Data and My Own Mistakes
When I jump into data collection, my first instinct is usually “log everything.” Although that looks like a pragmatic approach, it gets me into trouble fast. Especially while managing more than 13 Docker containers on my own VPS, parsing the logs each container produces and turning them into something meaningful is already a job in itself. If on top of that the log quality drops, debugging becomes impossible.
The most common problems I ran into were:
- Missing data: Some fields aren’t filled as expected, arriving as `null` or empty strings.
- Wrong format: Text where I expected a number, or an unrelated string where I expected a date.
- Outliers and anomalies: Values that lie far outside the norm, pushing the limits of logic. Things like “Age: 999.”
- Data pollution: Duplicate records, the same information stored in different ways.
- Privacy violations: Sensitive user information being collected by accident or unnecessarily. Especially critical in an app like spamkalkani.com.
Because of these problems, my reports became unreliable, my AI models received bad input, and the business decisions I made ended up being wrong. So I had to tie my data collection process to solid rules end-to-end, instead of settling for “good enough.”
Steps for Reliable Data Collection: My Own Playbook
In my own projects — especially hesapciyiz.com and my AI generation pipeline — I improved data quality by applying these steps. The process became a sort of personal “runbook.”
Step 1: Define the Data Schema Up Front (Schema-First Approach)
The first thing I do when designing a data flow is to clearly state what data I expect. That’s vital, especially when exchanging data between different services or sending data to an API endpoint.
- What I did: In my JavaScript projects, I define the expected data structure and types using JSON Schema or a library like Zod.
  - For example, I specify that the “age” field from a form must be an `integer` between `18` and `120`.
  - I make `required` fields explicit.
- Why it matters: With this, I clarify what I expect even before the data arrives. It’s the first step in preventing developers (or me) from sending the wrong data.
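To make the schema-first idea concrete, here is a minimal sketch in plain JavaScript. It is a stand-in for what Zod or JSON Schema would provide, not their API. The schema object, the `income` and `note` fields, and the `checkAgainstSchema` helper are hypothetical names used only for illustration.

```javascript
// Hypothetical schema for a calculator form; field names are illustrative.
const calculatorInputSchema = {
  age: { type: "integer", min: 18, max: 120, required: true },
  income: { type: "number", min: 0, required: true },
  note: { type: "string", required: false },
};

// Returns a list of human-readable problems; an empty list means the input matches.
function checkAgainstSchema(schema, input) {
  const problems = [];
  for (const [field, rule] of Object.entries(schema)) {
    const value = input[field];
    if (value === undefined || value === null || value === "") {
      if (rule.required) problems.push(`${field}: required`);
      continue;
    }
    if (rule.type === "integer" && !Number.isInteger(value)) {
      problems.push(`${field}: expected an integer`);
    } else if (rule.type === "number" && !(typeof value === "number" && Number.isFinite(value))) {
      problems.push(`${field}: expected a number`);
    } else if (rule.type === "string" && typeof value !== "string") {
      problems.push(`${field}: expected a string`);
    } else if (typeof value === "number") {
      // Range checks only apply once the type is known to be numeric.
      if (rule.min !== undefined && value < rule.min) problems.push(`${field}: below ${rule.min}`);
      if (rule.max !== undefined && value > rule.max) problems.push(`${field}: above ${rule.max}`);
    }
  }
  return problems;
}

console.log(checkAgainstSchema(calculatorInputSchema, { age: 34, income: 52000 }));
// → []
console.log(checkAgainstSchema(calculatorInputSchema, { age: "abcdef", income: -100 }));
// → [ 'age: expected an integer', 'income: below 0' ]
```

Keeping the rules in one declarative object means the same definition can drive both documentation and validation, which is the main appeal of a schema library.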
Step 2: Validate Input Data Immediately (Server-Side Validation)
Defining a schema is a start, but you actually have to validate incoming data against that schema. Client-side validation is a well-meaning first step, but server-side validation is mandatory for stopping malicious or faulty requests.
- What I did: In my Node.js services I validate every incoming API call against the schema I defined. If the data doesn’t match, I reject the request and return an error message.
  - In `Express` apps, I use `zod` or `joi` as middleware for this.
- Why it matters: This is the most critical step in preventing dirty data from reaching the database or downstream services. Just as my Astro build once died with an OOM, badly formatted data can break the processing engine or chew through resources, and validation stops it at the door.
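As a rough illustration of server-side validation, the sketch below builds an Express-style middleware with the `(req, res, next)` signature. It does not import Express or Zod. The `validateBody` factory, the `validateCalculatorInput` function, and the route path in the comment are assumptions made for this example.

```javascript
// Builds an Express-style middleware (req, res, next) from a validator function.
// `validate` is a hypothetical function that returns an array of error strings.
function validateBody(validate) {
  return function (req, res, next) {
    const errors = validate(req.body ?? {});
    if (errors.length > 0) {
      // Reject before anything reaches the database or downstream services.
      res.status(400).json({ error: "invalid_input", details: errors });
      return;
    }
    next();
  };
}

// Illustrative validator for the "age" and "income" fields.
function validateCalculatorInput(body) {
  const errors = [];
  if (!Number.isInteger(body.age) || body.age < 18 || body.age > 120) {
    errors.push("age must be an integer between 18 and 120");
  }
  if (typeof body.income !== "number" || !Number.isFinite(body.income) || body.income < 0) {
    errors.push("income must be a non-negative number");
  }
  return errors;
}

const checkCalculatorInput = validateBody(validateCalculatorInput);
// With Express this would be wired roughly as:
//   app.post("/api/calculate", express.json(), checkCalculatorInput, handler);
```

In a real service the hand-written validator would be replaced by a call into the schema library, while the reject-early shape of the middleware stays the same.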
Step 3: Anonymization and Privacy (Privacy by Design)
Caring about privacy when collecting user data isn’t just for KVKK or GDPR compliance — it’s an ethical responsibility. I aim to collect as little data as possible and as anonymously as possible.
- What I did:
  - I don’t store IP addresses in full. I zero out the last octet (e.g. `192.168.1.0` instead of `192.168.1.1`).
  - I store user IDs not directly but as one-way hashes. That way I can’t directly identify the user, but I can correlate the same user’s different interactions.
  - In the spamkalkani.com app, I anonymize incoming call data and use it only for statistical analysis. I never store phone numbers or personal information directly.
- Why it matters: It reduces the risk of data security breaches and earns user trust. It also prevents future legal issues around data usage.
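The two anonymization techniques above could be sketched as follows, using only Node’s built-in `crypto` module. The post says only “one-way hashes”. This sketch assumes a keyed hash (HMAC-SHA-256), so IDs cannot be recovered by hashing guesses without the key. The `ID_HASH_KEY` environment variable and both function names are hypothetical.

```javascript
const crypto = require("crypto");

// Zero out the last octet of an IPv4 address: "192.168.1.1" -> "192.168.1.0".
// Returns null for anything that is not a dotted-quad IPv4 address.
function truncateIpv4(ip) {
  const parts = String(ip).split(".");
  const valid =
    parts.length === 4 && parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  if (!valid) return null;
  parts[3] = "0";
  return parts.join(".");
}

// Keyed one-way hash of a user ID (assumption: HMAC-SHA-256 with a secret key).
// The same ID and key always give the same output, so events can still be correlated.
function pseudonymizeUserId(userId, secretKey) {
  return crypto.createHmac("sha256", secretKey).update(String(userId)).digest("hex");
}

// Hypothetical environment variable; the fallback is for local experiments only.
const key = process.env.ID_HASH_KEY || "dev-only-key";

console.log(truncateIpv4("192.168.1.1")); // → 192.168.1.0
console.log(pseudonymizeUserId("user-42", key) === pseudonymizeUserId("user-42", key)); // → true
```

IPv6 addresses are deliberately not handled here and come back as `null`, so the caller has to decide whether to drop or separately truncate them.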
Step 4: Outlier and Anomaly Detection
I often see data that fits the schema but is still nonsensical. For example, a user’s age being between 18 and 120 fits the schema, but a 115-year-old user making 5,000 financial transactions a day might be an anomaly.
- What I did:
  - Simple thresholds: In my AI-driven content pipeline, I use simple thresholds to detect sudden jumps in generated text length or word frequency. If a piece of content is 5x longer than expected, it might be an anomaly and I send it for manual review.
  - Statistical methods: In very basic situations, I use methods like IQR (Interquartile Range) or Z-score. They show how far a data point is from the rest of the population.
- Why it matters: These anomalies usually point to system errors, bot traffic or serious mistakes in data entry. Detecting them early lets me catch problems that would otherwise lead to bad analyses or system slowdowns. When I ran into `kcompactd at 92% CPU` on my VPS, I saw one of the underlying causes was an unexpected workload; abnormal data input can trigger this kind of load too.
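Both approaches from this step fit in a few lines. The sketch below shows an IQR-based outlier filter with the conventional 1.5 multiplier and a simple length threshold. The function names, the sample ages, and the linear-interpolation quantile are assumptions for illustration, not the exact code from my pipeline.

```javascript
// Linear-interpolated quantile of an ascending-sorted array (q between 0 and 1).
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (pos - lower);
}

// Flags values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the common convention.
function iqrOutliers(values, k = 1.5) {
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  return values.filter((v) => v < q1 - k * iqr || v > q3 + k * iqr);
}

// Simple threshold: flag content that is more than `factor` times the expected length.
function isTooLong(length, expectedLength, factor = 5) {
  return length > expectedLength * factor;
}

const ages = [23, 31, 35, 38, 41, 44, 52, 999];
console.log(iqrOutliers(ages)); // → [ 999 ]
console.log(isTooLong(6200, 1000)); // → true
```

IQR is used here rather than Z-score because a single extreme value like 999 inflates the mean and standard deviation, which can hide the very outlier being looked for in small samples.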
Step 5: Data Enrichment and Cleanup
Sometimes raw data is incomplete or insufficient. At that point I enrich the data from external sources or with internal logic. I also clean duplicate records and inconsistencies.
- What I did:
  - Geographic enrichment: From the incoming IP I add country or city info using an external GeoIP service.
  - User-Agent parsing: I parse the `User-Agent` from incoming HTTP requests to extract details like the browser and OS.
  - Deduplication: Sometimes I see the same event logged multiple times. I clean duplicates by generating a unique `eventId` or by combining specific fields.
- Why it matters: Richer data lets me run deeper analyses. Cleanup increases data integrity and accuracy. When I had GitHub Actions runner state corruption, I had to delete directories under `_work/_temp`; that showed how destructive a faulty “cleanup” job can be. Data cleanup operations need to be done with similar care.
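The deduplication idea can be sketched like this: use `eventId` when an event has one, otherwise derive a key by hashing a combination of fields. Which fields to combine depends on the data. The `sessionId`, `type`, and `timestamp` fields below are hypothetical, as are the function names.

```javascript
const crypto = require("crypto");

// Derive a stable key for an event: prefer its eventId, otherwise hash a
// combination of fields (sessionId, type, timestamp are illustrative choices).
function eventKey(event) {
  if (event.eventId) return `id:${event.eventId}`;
  const composite = [event.sessionId, event.type, event.timestamp].join("|");
  return "h:" + crypto.createHash("sha256").update(composite).digest("hex");
}

// Keeps the first occurrence of each key and preserves the original order.
function deduplicate(events) {
  const seen = new Set();
  const unique = [];
  for (const event of events) {
    const key = eventKey(event);
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(event);
  }
  return unique;
}

const events = [
  { sessionId: "s1", type: "calc_submit", timestamp: "2024-05-01T10:00:00Z" },
  { sessionId: "s1", type: "calc_submit", timestamp: "2024-05-01T10:00:00Z" },
  { eventId: "e-7", sessionId: "s2", type: "page_view", timestamp: "2024-05-01T10:00:05Z" },
];
console.log(deduplicate(events).length); // → 2
```

Because the function returns a new array instead of deleting in place, the raw input stays available if the key choice turns out to be too aggressive.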
Step 6: Reliable Storage and Backup
Finally, without a reliable storage solution and a solid backup strategy, all this effort can be wasted.
- What I did:
  - Relational database (PostgreSQL): I prefer PostgreSQL for most structured data. Its transactional guarantees are critical for protecting data integrity. On my own VPS, I keep this blog and my other services’ data in PostgreSQL.
  - Daily backups: I take daily backups with `pg_dump` or `mysqldump` and send them to a different storage area (an S3-compatible service).
  - Snapshots: I regularly use the disk snapshot features my VPS provider offers.
- Why it matters: When I had the disk fire (Docker disk filled with 33 GB build cache + 23 GB unused images), reaching 100% disk usage put me at risk of data loss. Reliable backup and storage strategies save lives in scenarios like that.
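A daily `pg_dump` job could be driven from Node roughly as below. This is a sketch under assumptions: the database name, the backup directory, and the file naming scheme are made up, and the upload to S3-compatible storage is left out because it depends on the provider’s client. The `--format`, `--file`, and `--dbname` options are standard `pg_dump` flags.

```javascript
const { execFile } = require("child_process");
const path = require("path");

// Builds a dated file name such as "blog-2024-05-01.dump" (UTC date; naming is illustrative).
function backupFileName(dbName, date = new Date()) {
  return `${dbName}-${date.toISOString().slice(0, 10)}.dump`;
}

// Arguments for pg_dump's custom archive format, which pg_restore can read.
function pgDumpArgs(dbName, outFile) {
  return ["--format=custom", `--file=${outFile}`, `--dbname=${dbName}`];
}

// Runs pg_dump and resolves with the path of the created dump file.
// Connection settings are assumed to come from the usual PG* environment variables.
function runBackup(dbName, backupDir) {
  const outFile = path.join(backupDir, backupFileName(dbName));
  return new Promise((resolve, reject) => {
    execFile("pg_dump", pgDumpArgs(dbName, outFile), (err) =>
      err ? reject(err) : resolve(outFile)
    );
  });
}

// Example call (hypothetical names), typically triggered by a daily scheduler:
//   runBackup("blog", "/var/backups/postgres").then(console.log, console.error);
```

Using `execFile` with an argument array avoids passing the database name through a shell, so a strange name cannot turn into a command injection.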
Operational Challenges and Lessons
Reliable data collection isn’t only technical steps; it also requires constant operational discipline.
- Monitoring and alerting: You need to constantly watch the quality of the data being collected. If the rate of empty fields increases or anomalous values explode, I need an alert immediately. My own “Pipeline-health monitor” system emails me a “DEGRADED” notice in those cases. That way I can spot a problem before it grows.
- Trade-offs: Sometimes you have to balance data quality with collection performance or cost. Validating every piece of data down to the smallest detail can slow the system or increase resource usage. So it’s important to identify which data is “critical” and apply stricter rules to that.
- Don’t fear making mistakes: Last month, when I wrote `sleep 360` in a data processing script and got OOM-killed, I noticed the data collection service had stopped and I had missed critical data. These mistakes teach me to make the system tougher. I now use more robust mechanisms like `polling-wait`.
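For the monitoring point above, a data-quality check can be as small as measuring the share of empty values per field and comparing it with a limit. The sketch below is an illustration, not the actual “Pipeline-health monitor”. The function names, the 10% limits, and the sample batch are assumptions, and sending the email is left to whatever alerting channel is in place.

```javascript
// Share of records where `field` is missing, null, or an empty string.
function emptyRate(records, field) {
  if (records.length === 0) return 0;
  const empty = records.filter(
    (r) => r[field] === undefined || r[field] === null || r[field] === ""
  ).length;
  return empty / records.length;
}

// Returns { status, breaches }: status is "DEGRADED" if any watched field
// exceeds its allowed empty rate, otherwise "OK".
function pipelineHealth(records, maxEmptyRates) {
  const breaches = Object.entries(maxEmptyRates)
    .map(([field, limit]) => ({ field, limit, rate: emptyRate(records, field) }))
    .filter((b) => b.rate > b.limit);
  return { status: breaches.length > 0 ? "DEGRADED" : "OK", breaches };
}

const batch = [
  { age: 34, income: 52000 },
  { age: null, income: 48000 },
  { age: "", income: 61000 },
  { age: 29, income: 39000 },
];
console.log(pipelineHealth(batch, { age: 0.1, income: 0.1 }).status); // → DEGRADED
```

Per-field limits also cover the trade-off mentioned above: critical fields can get a strict limit while optional ones get a loose limit or none at all.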
Conclusion
Reliable data collection is an ongoing battle, especially for someone like me who manages everything on their own server. Adding new features while preserving existing data quality can be tough. But by applying these steps, I’ve seen that the data I collect becomes far more reliable and that I can put my business decisions on more solid ground.
Have you faced similar challenges or applied different strategies on this? I’d love to hear them in the comments. In a future post, maybe I’ll explain how I turn this reliable collected data into a Knowledge Graph.