SQLite and Concurrency: The Lockout Experienced at islistesi.com
A few weeks ago, the SQLite database I use to ensure data consistency in a production ERP system gave me serious trouble with an unexpected lockout (lock) issue. Specifically, I encountered situations where certain operations would get stuck, and the system would become unresponsive, especially during high-volume transactions and concurrent access. Although this situation resembled performance problems I’d previously experienced with a large e-commerce site, the root cause was different: the limitations of SQLite’s own concurrency model.
In this post, I will share my personal experience with the SQLite concurrency issue I faced in the islistesi.com project, detailing the lockouts, the debugging process, and the solutions I ultimately found. While I generally prefer more advanced databases like PostgreSQL for my own projects, SQLite’s simplicity and lightweight nature can sometimes be appealing in certain scenarios. However, the concurrency management behind this appeal is a topic that requires careful attention.
The Emergence of the Problem: Unexpected Lockouts
It all started with an intense reporting request hitting my system. When users tried to fetch order data for a specific date range, some requests would hang for a long time, or even get completely stuck. Initially, I thought the problem stemmed from heavy queries, but upon examining the journald logs, I realized the situation was more complex.
May 12 03:14:15 servername systemd[1]: islistesi-backend.service: Main process exited, code=killed, status=9/n/a
May 12 03:14:15 servername systemd[1]: islistesi-backend.service: Failed with result 'signal'.
May 12 03:14:16 servername systemd[1]: islistesi-backend.service: Scheduled restart job, restart counter is: 1
May 12 03:14:16 servername systemd[1]: Stopped islistesi backend service.
May 12 03:14:16 servername systemd[1]: Started islistesi backend service.
These logs showed that the application’s main process was terminating unexpectedly. But this was a symptom, not the problem itself. To find the source of the issue, I had to investigate more deeply. When I checked the running processes with ps aux, I saw that some python processes were using far more CPU and memory than normal.
SQLite’s Concurrency Model and Lock Types
By default, SQLite uses the “Single Writer, Multiple Reader” (SWMR) model. This means that only one write operation can take place at a time, while multiple read operations can run concurrently. During a write operation, the database file is locked with a “write lock.” Read operations, on the other hand, have to wait during this lock.
This model is quite effective for simple applications and single-user scenarios. In my case, however, both read and write operations were happening intensively at the same time. Reporting queries in particular kept the database busy with a “read lock” for long periods, while other services running in the background (for example, data updates or new order insertions) waited to acquire a “write lock.”
What made the problem even more complex were SQLite’s different locking modes:
- IMPLICIT LOCKS: By default, every read and write operation automatically acquires a lock.
- EXPLICIT LOCKS: Locks can be acquired manually with commands like
BEGIN EXCLUSIVEandBEGIN IMMEDIATE. - WAL (Write-Ahead Logging): WAL mode improves the performance of the SWMR model. In this mode, write operations write to a separate WAL file instead of locking the main database file. Read operations read the main database file while write operations update the WAL file. This allows read and write operations to be far more concurrent.
WAL mode was active in my project. In theory, this should have improved performance. However, even with WAL mode active, when multiple write operations tried to happen at the same time, or when a single write operation took a long time, it was inevitable that the other operations would wait. The incorrect use of more aggressive locking mechanisms like BEGIN EXCLUSIVE in particular could trigger the problem.
The Debugging Process: Finding the Root Cause
To solve the problem, I first needed to identify which operations were causing the lockups. I tried several methods for this:
- Database Monitoring: SQLite’s
PRAGMA lock_status;command let me see the current lock state. However, this command gave only a snapshot in time, and I had to wait for the problem to recur. - Application Logs: In my FastAPI backend’s logs, I recorded in detail how long database queries took and when each operation started. This helped me pinpoint long-running queries and potential lock points.
- System Tools: Using system tools like
strace, I traced the SQLite process’s file accesses and locking calls. This helped me understand which system calls were getting stuck.
After a long analysis, I realized that the source of the problem wasn’t actually a single “heavy” query, but rather multiple write operations happening concurrently and the locking strategy these operations followed in WAL mode. In particular, the services handling order updates and stock tracking kept each other waiting when they tried to update multiple rows at the same time.
Solutions and Applied Strategies
I tried several different approaches to solve this problem, and ultimately stabilized the system with the strategies I applied:
-
Transaction Management Optimization:
- I reviewed each service’s interaction with the database. I optimized operations that ran unnecessarily long or updated large numbers of rows.
- Instead of using
BEGIN EXCLUSIVE, I focused on keeping transactions as short as possible. In SQLite,BEGIN IMMEDIATEis usually a better option because it requests a lock only for the write operation, rather than locking the database entirely. - I rewrote my queries to solve N+1 query problems.
-
Improving WAL Mode Settings:
PRAGMA journal_mode=WAL;was already in use. However, by adding settings likePRAGMA busy_timeout = 5000;, I defined how long to wait in the event of a lock. This kept the operation from failing immediately.- I had been using the
PRAGMA synchronous = NORMAL;setting. By switching to the saferPRAGMA synchronous = FULL;setting, I ensured that data was secured on disk after each operation. While this reduces the risk of data loss, it can lower performance somewhat.
-
Evaluating Alternative Database Models:
- As a last resort, I considered migrating to a more advanced database like PostgreSQL. However, due to project requirements and infrastructure constraints, this wasn’t a suitable solution at that moment.
- I researched whether more advanced concurrency models could be added with SQLite’s
load_extension()function, but this generally required more complex solutions.
The most effective solution turned out to be optimizing transaction management and using the busy_timeout setting. I had experienced similar concurrency issues in an Android spam-blocker app I developed, and the busy_timeout setting had worked there too. This was a reminder that simple solutions can sometimes be the most effective.
import sqlite3
import time
# Set the busy timeout (in milliseconds)
timeout_ms = 5000
conn = sqlite3.connect("my_database.db", timeout=timeout_ms)
cursor = conn.cursor()
# Enable WAL mode (if not already active)
cursor.execute("PRAGMA journal_mode=WAL;")
# Set synchronous to FULL (safer)
cursor.execute("PRAGMA synchronous=FULL;")
try:
# A potentially long-running write operation
cursor.execute("BEGIN EXCLUSIVE;")
# ... complex UPDATE or INSERT operations ...
time.sleep(2) # Simulated delay
conn.commit()
except sqlite3.OperationalError as e:
if "database is locked" in str(e):
print("Database is locked, busy_timeout expired.")
conn.rollback()
else:
raise
finally:
conn.close()
Lessons Learned and Conclusion
This SQLite concurrency issue in the islistesi.com project taught me several important lessons. First, that even databases which appear simple can have complex concurrency models, and that these models need to be well understood. Second, that although WAL mode improves performance, you have to be careful during intensive write operations. Finally, that even simple settings like busy_timeout can solve serious problems.
Encountering problems like this is a reality of the tech world. What matters is diagnosing the problem correctly, trying different solution paths, and finding the most suitable one. Living through these kinds of experiences in my own systems is exactly what lets me produce the concrete, realistic content I share on my blog.
I hope my experience proves useful when you run into similar problems. If you have your own concurrency issues and solutions with SQLite or other databases, don’t hesitate to share them in the comments. Perhaps in a future post, I’ll tell the story of a similar lockup in a different database system.