Job queues are an essential element of internet application design for executing long-running tasks. This stems from the fact that web servers, and consequently web applications, are best suited to interactive workloads. If a particular operation fits the description below, it makes a good case for the use of a job queue:

  • For the most part, these tasks need to be carried out in a near-instantaneous fashion. Specifically, execution of the task should commence at the earliest possible time. The expectation of completion, however, depends on the nature of the task.
  • If the task cannot be executed immediately for some reason, it should be queued up and processed later.
  • Executing the task is resource intensive in some form or another.
  • Once submitted, a task must not be dropped, to the extent possible.

The desirable features of a job queue are as follows:

  1. The job queue should have durability.
  2. The queue should support multiple producers and consumers, with queue operations performed across hosts.
  3. The draining of the queue should happen as quickly as possible. If a new task gets added to the system and there are idle consumers, consumption should commence immediately.

The first two points are well addressed by an RDBMS-based solution. However, it struggles with the third point since there is no inherent notification mechanism; aggressive polling is the closest substitute, but it does not scale very well. A message queue is good at supporting the last two points, but trying to maintain a credible state in one is exceptionally hard. Interestingly, the popular job queue solutions out there choose either an RDBMS (example: gearman) or an MQ (example: celery). A mix of both, however, seems to be the right answer. I shall briefly describe what this looks like.
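
To make this concrete, here is a minimal sketch of the durable half of such a hybrid, assuming a SQLite store; the table and column names are illustrative rather than prescriptive. The index on the status column anticipates the periodic sweep described later.

    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS tasks (
        id         TEXT PRIMARY KEY,  -- unique id generated by the producer
        task_type  TEXT NOT NULL,
        details    TEXT NOT NULL,     -- serialized task payload
        status     TEXT NOT NULL,     -- pending / running / done / failed
        created_at REAL NOT NULL,
        updated_at REAL NOT NULL
    );
    CREATE INDEX IF NOT EXISTS idx_tasks_status ON tasks (status);
    """

    def open_store(path="jobs.db"):
        conn = sqlite3.connect(path)
        conn.executescript(SCHEMA)
        return conn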

Adding a new task

  • Add a new element to your data store. This element should represent every aspect of the task: the task type, the task details, and task-management data such as execution status. A unique id must also be generated by the producer before adding the task to the store. Failure to make this entry is considered a failure to accept the job.
  • A notification event is sent out on a message queue. The notification contains the task type and task id (see the sketch after this list).
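
A minimal producer sketch, building on the store above. The in-process queue.Queue here is only a stand-in for a real message queue; swap the put call for the publish primitive of whichever broker you use.

    import json
    import queue
    import time
    import uuid

    notifications = queue.Queue()  # stand-in for a real message queue

    def submit_task(conn, task_type, details):
        task_id = str(uuid.uuid4())  # producer-generated unique id
        now = time.time()
        # Step 1: the durable insert. If this fails, the job was not accepted.
        conn.execute(
            "INSERT INTO tasks (id, task_type, details, status, created_at, updated_at) "
            "VALUES (?, ?, ?, 'pending', ?, ?)",
            (task_id, task_type, json.dumps(details), now, now),
        )
        conn.commit()
        # Step 2: best-effort notification; the periodic sweep covers lost messages.
        notifications.put({"task_id": task_id, "task_type": task_type})
        return task_id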

Processing a task, the normal case

  • A pool of consumers actively waits for notification of a new task and starts working the moment one arrives. The delivery of the notification can be configured to reach either exactly one or at least one consumer, based on what looks like the right trade-off.
  • The consumer checks with the data store and updates it accordingly to indicate that it has volunteered to perform the task (see the claim sketch after this list).
  • When it is done processing the task (either successfully or unsuccessfully), it updates the store with the outcome.
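
A sketch of the consumer side, continuing from the producer above. Claiming a task is an atomic compare-and-set on the status column, so the task runs at most once even if the notification is delivered to several consumers (the at-least-one configuration).

    def claim_task(conn, task_id):
        # Compare-and-set: only one consumer's UPDATE can match the
        # 'pending' predicate, so at most one claim succeeds.
        cur = conn.execute(
            "UPDATE tasks SET status = 'running', updated_at = ? "
            "WHERE id = ? AND status = 'pending'",
            (time.time(), task_id),
        )
        conn.commit()
        return cur.rowcount == 1

    def finish_task(conn, task_id, ok):
        conn.execute(
            "UPDATE tasks SET status = ?, updated_at = ? WHERE id = ?",
            ("done" if ok else "failed", time.time(), task_id),
        )
        conn.commit()

    def consumer_loop(conn, handler):
        while True:
            note = notifications.get()  # blocks until a notification arrives
            if claim_task(conn, note["task_id"]):
                finish_task(conn, note["task_id"], handler(note["task_id"]))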

Processing a task, the abnormal cases

  • The notification message could have gotten lost and never reached a consumer, for a variety of reasons. It is therefore necessary to sweep the job queue periodically for unprocessed tasks and trigger their execution (a sketch follows this list). The latencies associated with this are comparable to a pure RDBMS-based queue. Specifically, the need to scan by the value of a field (task status) in addition to the normal access pattern by id is what makes an RDBMS a convenient choice.
  • Semi-completed and failed tasks may have to be retried, depending on the semantics of the task at hand. This might require a back-off mechanism, which effectively needs a scheduler. In such situations, the scheduler should be held outside of the job queue to achieve a clear separation of responsibilities.
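
A sweeper sketch under the same assumptions; the staleness threshold is illustrative. Re-notifying is safe because the compare-and-set claim above makes duplicate notifications harmless.

    def sweep(conn, stale_after=60.0):
        # Re-notify tasks whose notification appears lost: still 'pending'
        # and untouched for longer than the staleness threshold.
        cutoff = time.time() - stale_after
        rows = conn.execute(
            "SELECT id, task_type FROM tasks "
            "WHERE status = 'pending' AND updated_at < ?",
            (cutoff,),
        ).fetchall()
        for task_id, task_type in rows:
            notifications.put({"task_id": task_id, "task_type": task_type})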

So far, I have not been able to find any open source solution that follows the above approach. If you know of one, do let me know. Otherwise, I shall get down to implementing one.