Preface
When I started a new job at Autodesk, I came across a chaotic Django / Celery project. In a way everything was standard - Django REST Framework for the API and ORM, with Celery and Redis as the broker. But for reasons I can't explain, it was the most unreliable, unscalable project I have ever seen. The developers decided to persist the messages by writing them both to Redis and to a Postgres table, and used Celery Beat to periodically query that table for tasks that were never executed. Not to mention the horrible project structure - a containerized two-headed monolith: two different products in one Docker image, using the same ECS tasks (and CI/CD) for both products, with different databases (migrations and all).
Organisational circumstances
As in every blog post I have written and every lecture I have given, I need to explain the organisational context. Autodesk went through a re-org, which for me meant changing both team and boss after only two months under one team lead, while bringing the project with me. My perspective (which is not the common one among my colleagues) - this was not the right time for paying down technical debt such as this, mainly because you don't want to fall into technical rabbit holes without shipping a single feature after forming the new team.
The catalyst
Two weeks after ramping up in the new team, I saw that we were going to have problems with our ElastiCache instance (1 primary node, 2 read replicas), and a day after reporting it, we had a failover. A few tasks went missing. User experience was not harmed, but it was still a bad situation. By the way, this did not happen because of Celery - it happened because of Redis keys without a TTL. While taking care of it, we decided it was time to move from Redis as the broker to something more reliable and persistent.
Why SQS?
1. Built-in Celery integration (see the configuration sketch after this list)
2. Fully managed by Amazon
3. Reliable visibility timeout and dead-letter queue mechanisms
4. Experience shared across the development team
5. Not much infrastructure work (a Terraform module already existed)
6. Fully transparent via the AWS Console
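Here is a minimal sketch of what pointing Celery at SQS can look like, assuming Celery 4.x with the kombu SQS transport; the region, timeout values and queue prefix below are hypothetical placeholders, not our production values:

```python
# celeryconfig.py - minimal SQS broker setup (values are placeholders).
broker_url = "sqs://"  # AWS credentials are resolved from the environment / IAM role

broker_transport_options = {
    "region": "us-east-1",
    "visibility_timeout": 3600,     # seconds before an unacked message is redelivered
    "polling_interval": 1,          # seconds between polls when the queue is idle
    "queue_name_prefix": "myapp-",  # namespaces the queues created in SQS
}
```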
Problems raised
1. We realized Redis and SQS have different message size limits (SQS caps a message at 256 KB)
2. SQS, contrary to Redis, can not act as a results back-end (which is not mandatory unless you are using Celery's blocking get method, shown in the sketch after this list)
3. By default Celery acks the message on consumption, before the task even runs
4. By default Celery acks the message once max_retries is exceeded, even if the task never succeeded
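To illustrate problem 2, here is a sketch of the blocking call that forces you to have a results back-end; the task name is hypothetical:

```python
# AsyncResult.get() blocks until the worker stores a result, which
# requires a results back-end - something SQS cannot provide.
from myapp.tasks import some_task  # hypothetical task

result = some_task.delay(42)    # returns immediately with an AsyncResult
value = result.get(timeout=10)  # blocks; raises if no back-end is configured
```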
Solutions
1. We overcame the message size problem by using the existing Postgres table holding the messages, passing through SQS only the unique ID of the DB record holding all the data passed to the workers (see the first sketch after this list)
2. We removed the usage of the blocking get method, though it is still possible by keeping Redis as the results back-end, as demonstrated in the GitHub project linked below.
3. We passed all the Celery tasks the flag acks_late=True, meaning the message is acked only when the task finishes (see the second sketch after this list).
4. In May 2019, a new feature was released that lets the message stay unacked on error, meaning the visibility timeout will restore the message up to the number of times configured in SQS, and move it to a dead letter queue once that number is exceeded
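Here is a sketch of the ID-passing pattern from solution 1; the model and helper names (TaskPayload, do_work) are hypothetical:

```python
# Solution 1 sketch: persist the full payload in Postgres, send only the
# record's primary key through SQS to stay far below the 256 KB limit.
from celery import shared_task

from myapp.models import TaskPayload  # the existing Postgres messages table


@shared_task(acks_late=True)
def process_message(payload_id):
    # SQS carried only the ID; the full data lives in the DB record.
    record = TaskPayload.objects.get(pk=payload_id)
    do_work(record.data)


# Producer side: write the payload first, enqueue only the record's ID.
record = TaskPayload.objects.create(data={"user_id": 7, "action": "export"})
process_message.delay(record.pk)
```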
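And a sketch of solutions 3 and 4 together, assuming the May 2019 feature refers to the acks_on_failure_or_timeout option introduced in Celery 4.3:

```python
# Solutions 3 + 4 sketch: ack only on finish, and leave failed messages
# unacked so SQS redelivers them after the visibility timeout; the queue's
# redrive policy (maxReceiveCount) then moves repeat offenders to the DLQ.
from celery import shared_task


@shared_task(acks_late=True, acks_on_failure_or_timeout=False)
def risky_task(payload_id):
    ...


# Or globally, in the Celery configuration:
task_acks_late = True
task_acks_on_failure_or_timeout = False
```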
Please share this post!
Thanks for reading