Persistent Order Queue for OpenAlgo Strategy Manager

Fixed a critical silent data-loss bug in which OpenAlgo's Strategy Manager permanently lost pending orders on app restart, by replacing its ephemeral in-memory queues with a SQLite-backed persistent queue.

## About the Partner Project

OpenAlgo is India's leading open-source algorithmic trading platform, supporting 24+ brokers including Zerodha, Angel One, Dhan, Fyers, and Upstox. It provides a unified REST API, real-time WebSocket streaming, a visual strategy builder, AI trading integration, and a Python Strategy Manager — enabling retail traders and developers to build, deploy, and automate trading strategies without writing broker-specific code.

The platform is actively maintained, production-deployed by hundreds of Indian traders, and positions itself as "enterprise-grade" with advanced monitoring and reliability tools.

## The Problem I Identified

During a deep code audit of the Python Strategy Manager, I identified a critical architectural reliability gap that had never been reported.

The Strategy Manager (`blueprints/strategy.py`) uses Python's built-in `queue.Queue` to buffer all algo orders before sending them to the broker API:

```python
regular_order_queue = queue.Queue()  # lives in process RAM only
smart_order_queue = queue.Queue()    # lives in process RAM only
```

These queues exist purely in process memory. The background worker thread that drains them is started with `daemon=True` — meaning it is instantly killed when the main process exits, without flushing.

The consequence: any orders pending in the queue at the moment of an app restart (Docker restart, systemd service restart, OOM kill, or even a simple Ctrl+C) are permanently and silently discarded. No error is raised. No log entry is written. No user notification is triggered. The orders simply cease to exist.

In live algorithmic trading, this produces:

- Silent position mismatch — the strategy believes it's long, the broker shows flat
- Missing stop-losses — hedge orders that were queued but never sent
- Strategy state corruption — subsequent signals act on a position that doesn't exist
- No audit trail — the lost orders don't appear anywhere in the orderbook or tradebook

This is more dangerous than a visible error, because the user has no way to know it happened until they manually cross-check their broker's orderbook.
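The daemon-thread data loss is easy to reproduce in isolation. The sketch below is my own minimal repro, not OpenAlgo code: a child Python process queues three orders into a `queue.Queue` drained by a slow daemon worker, then lets the main thread exit; the parent counts how many orders were actually "sent".

```python
import subprocess
import sys
import textwrap

# Child process: mimics the Strategy Manager's in-memory queue + daemon worker.
child = textwrap.dedent("""
    import queue, threading, time

    q = queue.Queue()

    def worker():
        while True:
            order = q.get()
            time.sleep(0.2)          # simulate broker API latency
            print(f"sent {order}", flush=True)

    threading.Thread(target=worker, daemon=True).start()
    for i in range(3):
        q.put(i)                     # three orders buffered in RAM only
    # Main thread ends here: the daemon worker is killed immediately,
    # silently discarding whatever is still sitting in the queue.
""")

out = subprocess.run([sys.executable, "-c", child],
                     capture_output=True, text=True, timeout=30).stdout
orders_sent = out.count("sent")
print(f"orders queued: 3, orders sent: {orders_sent}")  # typically fewer than 3
```

No exception is raised in the child — the loss is visible only by comparing what was queued against what was printed, which mirrors the orderbook cross-check a trader would have to do.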

## My Contribution

### Issue Raised

Filed a detailed architectural issue with root cause analysis, a reproducible scenario, an impact matrix, and three concrete solution options ranked by implementation effort.

### Fix Implemented — PR #950

Replaced the volatile `queue.Queue` with a SQLite-backed persistent order queue using OpenAlgo's existing SQLAlchemy stack — requiring zero new dependencies.

New file: `database/order_queue_db.py` — a self-contained persistence layer with full order lifecycle tracking:

- `enqueue_order()` — persists each order to SQLite with status `pending` before the worker even sees it
- `mark_processing()` — atomically marks an order as in-flight before the API call, so a crash mid-flight is detectable on next startup
- `mark_sent()` / `mark_failed()` — record the delivery outcome with timestamps
- `recover_stale_processing_orders()` — on startup, resets any orders stuck in the `processing` state back to `pending` for automatic retry
- `queue_depth()` — per-status order counts for monitoring dashboards
- `get_failed_orders()` — exposes dead-letter orders for inspection
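The lifecycle above can be sketched with the standard-library `sqlite3` module. This is an illustrative sketch only — the actual PR builds on OpenAlgo's SQLAlchemy stack, and the table and column names here are my assumptions, not the PR's exact schema:

```python
import json
import sqlite3
import time

def init_db(conn):
    # Illustrative schema; the real table lives in database/order_queue_db.py.
    conn.execute("""CREATE TABLE IF NOT EXISTS order_queue (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        endpoint TEXT NOT NULL,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',
        retries INTEGER NOT NULL DEFAULT 0,
        updated_at REAL NOT NULL)""")
    conn.commit()

def enqueue_order(conn, endpoint, payload):
    # Durable write-ahead: the order hits disk before any worker sees it.
    cur = conn.execute(
        "INSERT INTO order_queue (endpoint, payload, updated_at) VALUES (?, ?, ?)",
        (endpoint, json.dumps(payload), time.time()))
    conn.commit()
    return cur.lastrowid

def mark_processing(conn, order_id):
    # Atomic claim: succeeds only if the order is still pending.
    cur = conn.execute(
        "UPDATE order_queue SET status='processing', updated_at=? "
        "WHERE id=? AND status='pending'", (time.time(), order_id))
    conn.commit()
    return cur.rowcount == 1

def mark_sent(conn, order_id):
    conn.execute("UPDATE order_queue SET status='sent', updated_at=? WHERE id=?",
                 (time.time(), order_id))
    conn.commit()

def recover_stale_processing_orders(conn):
    # Startup recovery: anything left in-flight by a crash is retried.
    cur = conn.execute(
        "UPDATE order_queue SET status='pending', retries=retries+1, updated_at=? "
        "WHERE status='processing'", (time.time(),))
    conn.commit()
    return cur.rowcount

def queue_depth(conn):
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM order_queue GROUP BY status").fetchall()
    return dict(rows)

# Demo: enqueue, claim, then "crash" before mark_sent() and recover.
conn = sqlite3.connect(":memory:")
init_db(conn)
oid = enqueue_order(conn, "placeorder", {"symbol": "SBIN", "qty": 1})
mark_processing(conn, oid)          # order is now in-flight
# ...simulated crash here: process dies before mark_sent()...
recovered = recover_stale_processing_orders(conn)
print(recovered, queue_depth(conn))  # 1 {'pending': 1}
```

The key property is that an order is never in RAM only: at every step its state is on disk, so the worst a crash can do is leave it in `processing`, which the recovery pass converts back into a retry.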

Modified: `blueprints/strategy.py`

- `process_orders()` rewritten to read from SQLite instead of RAM
- Worker thread changed from `daemon=True` to `daemon=False`, allowing it to complete its current order before the process exits
- `SIGTERM` and `SIGINT` signal handlers added: they set a `threading.Event` flag that lets the worker loop exit cleanly rather than being killed mid-order
- Startup recovery call added: stale `processing` orders from the previous session are automatically re-queued on first worker start
- Public `queue_order(endpoint, payload)` interface is completely unchanged — all existing callers (webhook handler, squareoff scheduler, etc.) work without any modification
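The shutdown mechanics described above can be sketched as follows. Function and variable names here are my own stand-ins, not the PR's exact code, and the SQLite fetch is replaced by an in-memory list for the demo:

```python
import signal
import threading
import time

stop_event = threading.Event()

def _handle_shutdown(signum, frame):
    # Cooperative shutdown: flag the worker instead of killing it mid-order.
    stop_event.set()

# SIGTERM covers Docker/systemd stops; SIGINT covers Ctrl+C.
signal.signal(signal.SIGTERM, _handle_shutdown)
signal.signal(signal.SIGINT, _handle_shutdown)

def worker_loop(fetch_pending, send_one):
    """Drain pending orders until asked to stop; never aborts mid-order."""
    while not stop_event.is_set():
        order = fetch_pending()       # e.g. next 'pending' row from SQLite
        if order is None:
            stop_event.wait(0.1)      # idle poll that wakes early on shutdown
            continue
        send_one(order)               # always runs to completion

# Demo with in-memory stand-ins for the SQLite layer:
pending, sent = ["order-1", "order-2"], []
fetch = lambda: pending.pop(0) if pending else None

worker = threading.Thread(target=worker_loop, args=(fetch, sent.append),
                          daemon=False)   # non-daemon: joined before exit
worker.start()
while len(sent) < 2:
    time.sleep(0.01)                      # wait for both orders to drain
stop_event.set()
worker.join()
print(sent)                               # ['order-1', 'order-2']
```

Because the event is checked only between orders, a shutdown request can never interrupt an in-flight API call — the worker finishes the current order and then exits.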

Modified: `app.py`

Two lines added to initialise the `order_queue` DB table in the existing parallel database initialisation block, alongside all other OpenAlgo databases.

### Order Lifecycle (Before vs After)

Before (at-most-once, often zero): Signal → `queue.put()` → [process killed] → orders lost forever

After (at-least-once): Signal → `enqueue_order()` [SQLite] → `mark_processing()` → API call → `mark_sent()` [success] OR `mark_failed()` [retry up to N times] → recover on next startup if crashed mid-flight

## Technical Depth

- Distributed systems: implemented at-least-once delivery semantics using a durable write-ahead approach (the order is persisted before processing begins)
- Python concurrency: correct use of `threading.Event` for cooperative shutdown versus the incorrect `daemon=True` pattern in the original code
- Signal handling: POSIX signal handlers (`SIGTERM`, `SIGINT`) for graceful shutdown in Docker and systemd environments
- SQLAlchemy patterns: followed OpenAlgo's exact engine and session conventions (`NullPool` for SQLite, `scoped_session`, `declarative_base`) to ensure seamless integration with the existing codebase
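Those SQLAlchemy conventions look roughly like this. This is an illustrative sketch assuming SQLAlchemy 1.4+; the model and column names are placeholders of my own, not the PR's exact schema:

```python
import os
import tempfile

from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, scoped_session, sessionmaker
from sqlalchemy.pool import NullPool

# NullPool avoids holding SQLite connections open across requests;
# scoped_session gives each thread its own session, per OpenAlgo convention.
db_path = os.path.join(tempfile.mkdtemp(), "order_queue_demo.db")
engine = create_engine(f"sqlite:///{db_path}", poolclass=NullPool)
Session = scoped_session(sessionmaker(bind=engine))
Base = declarative_base()

class OrderQueue(Base):
    __tablename__ = "order_queue"
    id = Column(Integer, primary_key=True)
    endpoint = Column(String, nullable=False)
    payload = Column(String, nullable=False)
    status = Column(String, nullable=False, default="pending")
    retries = Column(Integer, nullable=False, default=0)
    updated_at = Column(Float)

Base.metadata.create_all(engine)

session = Session()
session.add(OrderQueue(endpoint="placeorder", payload="{}"))
session.commit()
print(session.query(OrderQueue).filter_by(status="pending").count())  # 1
```

Using a file-backed SQLite URL (rather than in-memory) matters with `NullPool`, since every checkout opens a fresh connection and an in-memory database would vanish between them.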

## Impact

- Affects all 24+ broker integrations
- Fixes a silent failure mode in live trading scenarios (real financial stakes)
- Introduces crash recovery with zero manual intervention required
- Zero new dependencies — uses the existing SQLite/SQLAlchemy stack
- Fully configurable via the `ORDER_QUEUE_MAX_RETRIES` environment variable
- Closes the gap between OpenAlgo's "enterprise-grade" marketing and its actual reliability guarantees
