
As the previous post suggested, we decided to go with SQS as our Celery broker, for a few reasons:
Compliance - we have offline jobs that are directly tied to our business and its regulations. This means we cannot afford to lose jobs (see Redis evictions)
Our project is fully based on Celery, and migrating away from it (i.e. polling SQS ourselves) would take a long time. That said, it still took us quite some time to make it work with SQS
Redrive policy - SQS offers, out of the box, a feature that gives system reliability a serious boost. If a consumer disappears without acknowledging a message, SQS requeues it, regardless of the state of the system. For instance, if there was a power shutdown and our containers died, the jobs would be restored (a minimal sketch follows this list).
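To make the redrive part concrete, here is a minimal sketch of attaching a dead-letter queue to a work queue with boto3. The queue names, region and maxReceiveCount are illustrative assumptions, not our actual values:

```python
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Hypothetical queues, for illustration only.
dlq_url = sqs.create_queue(QueueName="celery-jobs-dlq")["QueueUrl"]
queue_url = sqs.create_queue(QueueName="celery-jobs")["QueueUrl"]

dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# After 5 failed receives the message lands in the dead-letter queue
# instead of being lost; until then SQS keeps redelivering it.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```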
The good
Out-of-the-box integration, meaning it's just a configuration replacement (see the sketch after this list)
Pay as you go - cost depends on throughput, with no ElastiCache / RabbitMQ instances sitting there waiting to be used
Visibility and monitoring - using the AWS console we can monitor queue length, messages in flight, and dead-letter messages. Using CloudWatch it's possible to configure alerts on queue length, etc.
A big community supporting the Celery project (compared with the alternative of polling the queues with our own implementation)
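As a rough illustration of the "configuration replacement" point, this is about all it takes to point an existing Celery app at SQS. The app name, region and queue name here are made up for the example; credentials are resolved from the environment or an instance role:

```python
from celery import Celery

app = Celery("jobs")

app.conf.update(
    broker_url="sqs://",                # no credentials embedded in the URL
    broker_transport_options={
        "region": "us-east-1",
    },
    task_default_queue="celery-jobs",   # the SQS queue the workers consume
)
```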
The bad
We had a problem with our Celery config that gave us quite a headache:
acks_late=True - ack the message only when the task finishes (default: False)
acks_on_failure_or_timeout=False - don't ack the message if the task raised an exception.
prefetch_multiplier=1 - how many tasks Celery should fetch from SQS in a single round trip to the service.
Why was this setup so bad? It caused the workers to stop executing any work after a single exception. Why? Because it turned out that when you don't ack the message (on an exception, in this case), it's not removed from the local prefetch queue. Giving back to the community, we contributed a fix for this issue.
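For reference, this is roughly the combination described above, written as new-style Celery configuration (the app object is just for the example):

```python
from celery import Celery

app = Celery("jobs")

app.conf.update(
    task_acks_late=True,                    # ack only after the task finishes
    task_acks_on_failure_or_timeout=False,  # keep the message on failure so SQS redelivers it
    worker_prefetch_multiplier=1,           # one message per round trip to SQS
)
# With this combination, a task that raised left its un-acked message sitting
# in the worker's local prefetch buffer, and the worker stopped pulling new work.
```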
The ugly
Plenty of IAM / SQS permissions are needed, for instance ListQueues and CreateQueue. You may ask - why does it need to create queues? The answer is - to let the workers ping each other. So we removed the Celery health cron, assuming the queue length metric would be enough. As for ListQueues, we disabled it by monkey patching the Kombu code.
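If you would rather avoid the monkey patch, newer Kombu releases support a predefined_queues transport option, which tells the worker to never list, create or delete queues. A hedged sketch (queue name, URL and credentials are placeholders):

```python
from celery import Celery

app = Celery("jobs")

app.conf.broker_transport_options = {
    "region": "us-east-1",
    # Map every queue the app uses to an existing SQS URL, so the worker
    # never needs sqs:ListQueues or sqs:CreateQueue.
    "predefined_queues": {
        "celery-jobs": {
            "url": "https://sqs.us-east-1.amazonaws.com/123456789012/celery-jobs",
            "access_key_id": "xxx",
            "secret_access_key": "xxx",
        }
    },
}
```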
As it turned out, we have different jobs with different timing behaviors, which means our setup needs to be tuned to work well with the visibility timeout mechanism (one of the reasons we wanted SQS in the first place)
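The tuning itself mostly boils down to one broker-wide number: the visibility timeout has to exceed the runtime (and ETA/countdown) of the slowest job, otherwise SQS redelivers messages that are still being processed. The value below is an illustrative assumption, not our real setting:

```python
from celery import Celery

app = Celery("jobs")

app.conf.broker_transport_options = {
    "region": "us-east-1",
    # Must be longer than the slowest job (and any ETA/countdown), or SQS
    # will hand the same message to another worker while it is still running.
    "visibility_timeout": 3 * 60 * 60,  # seconds
}
```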
In retrospect, was it a good move? Yes, it was. Was it as easy as I expected it to be? No. Is management happy? Bottom line, we delivered, just not on a great timeline.