Fixed a critical silent data-loss bug where OpenAlgo's Strategy Manager permanently loses pending orders on app restart by replacing ephemeral in-memory queues with a SQLite-backed persistent queue
## About the Partner Project

OpenAlgo is India's leading open-source algorithmic trading platform, supporting 24+ brokers
including Zerodha, Angel One, Dhan, Fyers, and Upstox. It provides a unified REST API,
real-time WebSocket streaming, a visual strategy builder, AI trading integration, and a
Python Strategy Manager — enabling retail traders and developers to build, deploy, and
automate trading strategies without writing broker-specific code.
The platform is actively maintained, production-deployed by hundreds of Indian traders,
and positions itself as "enterprise-grade" with advanced monitoring and reliability tools.
## The Problem I Identified
During a deep code audit of the Python Strategy Manager, I identified a critical
architectural reliability gap that had never been reported.
The Strategy Manager (`blueprints/strategy.py`) uses Python's built-in `queue.Queue`
to buffer all algo orders before sending them to the broker API:
```python
regular_order_queue = queue.Queue()  # lives in process RAM only
smart_order_queue = queue.Queue()    # lives in process RAM only
```

These queues exist purely in process memory. The background worker thread that drains them is started with `daemon=True`, meaning it is instantly killed when the main process exits, without flushing.
The consequence: any orders pending in the queue at the moment of an app restart (Docker restart, systemd service restart, OOM kill, or even a simple Ctrl+C) are permanently and silently discarded. No error is raised. No log entry is written. No user notification is triggered. The orders simply cease to exist.
In live algorithmic trading, this produces:

- **Silent position mismatch** — the strategy believes it's long, the broker shows flat
- **Missing stop-losses** — hedge orders that were queued but never sent
- **Strategy state corruption** — subsequent signals act on a position that doesn't exist
- **No audit trail** — the lost orders don't appear anywhere in the orderbook or tradebook
This is more dangerous than a visible error, because the user has no way to know it happened until they manually cross-check their broker's orderbook.
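The failure mode can be reproduced in miniature. The sketch below is illustrative, not taken from the OpenAlgo source: the worker's startup delay and the order payloads are stand-ins.

```python
import queue
import threading
import time

order_queue = queue.Queue()

def worker():
    time.sleep(1.0)                      # simulate a slow broker connection
    while True:
        order = order_queue.get()
        print("sent", order)

# daemon=True is the original pattern: the thread is killed instantly at
# interpreter exit, with no chance to drain the queue.
threading.Thread(target=worker, daemon=True).start()

for i in range(5):
    order_queue.put({"id": i, "action": "BUY"})

# If the process exits now, all five orders vanish: queue.Queue lives only
# in process RAM, so nothing is flushed, logged, or recoverable.
```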
## The Fix I Implemented

Filed a detailed architectural issue with root cause analysis, a reproducible scenario, an impact matrix, and three concrete solution options ranked by implementation effort.

Replaced the volatile `queue.Queue` with a **SQLite-backed persistent order queue** using OpenAlgo's existing SQLAlchemy stack — requiring zero new dependencies.
**New file:** `database/order_queue_db.py`

A self-contained persistence layer with full order lifecycle tracking:

- `enqueue_order()` — persists each order to SQLite with status `pending` before the worker even sees it
- `mark_processing()` — atomically marks an order as in-flight before the API call, so a crash mid-flight is detectable on next startup
- `mark_sent()` / `mark_failed()` — record delivery outcome with timestamps
- `recover_stale_processing_orders()` — on startup, resets any orders stuck in `processing` state back to `pending` for automatic retry
- `queue_depth()` — per-status order counts for monitoring dashboards
- `get_failed_orders()` — exposes dead-letter orders for inspection
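A simplified sketch of that lifecycle API, using the stdlib `sqlite3` module for brevity (the actual module is built on SQLAlchemy, and the table columns and defaults here are assumptions):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # the real table lives in OpenAlgo's DB file
conn.execute("""
    CREATE TABLE IF NOT EXISTS order_queue (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        endpoint   TEXT NOT NULL,
        payload    TEXT NOT NULL,
        status     TEXT NOT NULL DEFAULT 'pending',
        retries    INTEGER NOT NULL DEFAULT 0,
        updated_at REAL NOT NULL
    )""")

def enqueue_order(endpoint, payload):
    # Durable write-ahead: the order is on disk before any worker sees it.
    cur = conn.execute(
        "INSERT INTO order_queue (endpoint, payload, updated_at) VALUES (?, ?, ?)",
        (endpoint, payload, time.time()))
    conn.commit()
    return cur.lastrowid

def mark_processing(order_id):
    # Atomic claim: the WHERE clause allows only one pending -> processing flip.
    cur = conn.execute(
        "UPDATE order_queue SET status = 'processing', updated_at = ? "
        "WHERE id = ? AND status = 'pending'", (time.time(), order_id))
    conn.commit()
    return cur.rowcount == 1

def mark_sent(order_id):
    conn.execute("UPDATE order_queue SET status = 'sent', updated_at = ? "
                 "WHERE id = ?", (time.time(), order_id))
    conn.commit()

def mark_failed(order_id):
    conn.execute("UPDATE order_queue SET status = 'failed', updated_at = ? "
                 "WHERE id = ?", (time.time(), order_id))
    conn.commit()

def recover_stale_processing_orders():
    # Anything still 'processing' at startup crashed mid-flight: retry it.
    cur = conn.execute(
        "UPDATE order_queue SET status = 'pending' WHERE status = 'processing'")
    conn.commit()
    return cur.rowcount

def queue_depth():
    return dict(conn.execute(
        "SELECT status, COUNT(*) FROM order_queue GROUP BY status"))
```

The atomic claim in `mark_processing()` is what makes a mid-flight crash detectable: a row stuck in `processing` on startup can only mean the process died between the claim and the outcome write.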
**Modified:** `blueprints/strategy.py`

- `process_orders()` rewritten to read from SQLite instead of RAM
- Worker thread changed from `daemon=True` to `daemon=False`, allowing it to complete its current order before the process exits
- SIGTERM and SIGINT signal handlers added: they set a `threading.Event` flag that lets the worker loop exit cleanly rather than being killed mid-order
- Startup recovery call added: stale `processing` orders from the previous session are automatically re-queued on first worker start
- The public `queue_order(endpoint, payload)` interface is completely unchanged — all existing callers (webhook handler, squareoff scheduler, etc.) work without modification
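The shutdown mechanics can be sketched as follows; `process_orders` here is a stand-in for the real worker loop, and the poll interval and timings are illustrative:

```python
import signal
import threading
import time

stop_event = threading.Event()

def _handle_shutdown(signum, frame):
    # Cooperative shutdown: ask the worker to stop after its current order.
    stop_event.set()

# Docker/systemd send SIGTERM on stop; Ctrl+C sends SIGINT.
signal.signal(signal.SIGTERM, _handle_shutdown)
signal.signal(signal.SIGINT, _handle_shutdown)

sent = []

def process_orders():
    while not stop_event.is_set():
        sent.append("order")      # stand-in for drain-one-order-and-send
        stop_event.wait(0.05)     # wakes promptly when shutdown is requested

# daemon=False: the interpreter waits for this thread at exit, so an
# in-flight order is never killed halfway through.
worker = threading.Thread(target=process_orders, daemon=False)
worker.start()

time.sleep(0.12)
stop_event.set()                  # what the signal handler would do
worker.join(timeout=2)
```

Using `stop_event.wait()` instead of `time.sleep()` between orders is the detail that makes shutdown prompt: the worker wakes the moment the flag is set rather than at the end of its sleep.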
**Modified:**

Two lines added to initialise the `order_queue` DB table in the existing parallel database initialisation block, alongside all other OpenAlgo databases.
**Before (at-most-once, often zero):**
Signal → `queue.put()` → [process killed] → orders lost forever

**After (at-least-once):**
Signal → `enqueue_order()` [SQLite] → `mark_processing()` → API call → `mark_sent()` [success] OR `mark_failed()` [retry up to N times] → recover on next startup if crashed mid-flight
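The retry path can be sketched as a single-order state machine. The function name and dict shapes are illustrative; only the `ORDER_QUEUE_MAX_RETRIES` variable comes from the PR, and its default of 3 is an assumption:

```python
import os

# ORDER_QUEUE_MAX_RETRIES is the env var the PR exposes; 3 is an assumed default.
MAX_RETRIES = int(os.environ.get("ORDER_QUEUE_MAX_RETRIES", "3"))

def process_one(order, send):
    """Drive one persisted order to a terminal status (at-least-once).

    `order` mimics a row from the order_queue table; `send` is the broker
    API call. Each status change would be a mark_*() write in the real code.
    """
    order["status"] = "processing"        # crash here -> recovered at startup
    while order["retries"] < MAX_RETRIES:
        try:
            send(order)
            order["status"] = "sent"      # mark_sent()
            return "sent"
        except Exception:
            order["retries"] += 1         # transient failure: try again
    order["status"] = "failed"            # mark_failed(): dead-letter
    return "failed"
```

Note the semantics this buys: a crash after `send()` but before the `sent` write means the order is retried on restart, so duplicates are possible but silent loss is not — exactly the at-least-once trade-off described above.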
- **Distributed systems** — implemented at-least-once delivery semantics using a durable write-ahead approach (order persisted before processing begins)
- **Python concurrency** — correct use of `threading.Event` for cooperative shutdown vs the incorrect `daemon=True` pattern in the original code
- **Signal handling** — POSIX signal handlers (`SIGTERM`, `SIGINT`) for graceful shutdown in Docker and systemd environments
- **SQLAlchemy patterns** — followed OpenAlgo's exact engine and session conventions (`NullPool` for SQLite, `scoped_session`, `declarative_base`) to ensure seamless integration with the existing codebase
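Those engine and session conventions look roughly like this — a sketch of the general pattern, not OpenAlgo's actual module, with placeholder model columns and a temp-file database path:

```python
import tempfile

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, scoped_session, sessionmaker
from sqlalchemy.pool import NullPool

# NullPool opens a fresh connection per checkout: no pooled connections
# holding SQLite file locks across threads.
db_path = tempfile.NamedTemporaryFile(suffix=".db", delete=False).name
engine = create_engine(f"sqlite:///{db_path}", poolclass=NullPool)

Base = declarative_base()

class OrderQueue(Base):
    __tablename__ = "order_queue"
    id = Column(Integer, primary_key=True)
    endpoint = Column(String, nullable=False)
    status = Column(String, nullable=False, default="pending")

Base.metadata.create_all(engine)

# scoped_session hands each thread its own Session transparently, which is
# what lets a worker thread and Flask request handlers share one module.
Session = scoped_session(sessionmaker(bind=engine))
```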
- Affects all 24+ broker integrations
- Fixes a silent failure in live trading scenarios (real financial stakes)
- Introduces crash recovery with zero manual intervention required
- Zero new dependencies — uses the existing SQLite/SQLAlchemy stack
- Fully configurable via the `ORDER_QUEUE_MAX_RETRIES` environment variable
- Closes the gap between OpenAlgo's "enterprise-grade" marketing and its actual reliability guarantees
Issue: https://github.com/marketcalls/openalgo/issues/[YOUR_ISSUE_NUMBER]
Pull Request: https://github.com/marketcalls/openalgo/pull/950