Solving Timeout Issues in Python Django on Kubernetes

The cloud world is constantly evolving, and migrating applications from virtual machines (VMs) to platforms like Kubernetes offers scalability, portability, and ease of management.
However, the migration process is not always straightforward. Our Python Django application, which had been running flawlessly on a VM, suddenly turned sluggish and unresponsive after the move to Kubernetes.
Timeouts became a frustratingly common occurrence, and the overall performance of the application deteriorated significantly.
This unexpected slowdown was a major concern, as it impacted the user experience and could potentially lead to lost revenue and customer dissatisfaction.
In this blog post, we take you through the steps we followed to track down the performance issues and identify the root cause of our application’s slowdown in the Kubernetes environment.

Steps to Resolve Timeout Issues in Python Django on Kubernetes

Even after adjusting configurations and scaling our application, the problem persisted, leading us to delve deeper into the underlying infrastructure. Here are the steps that we followed to identify and fix the issues:

Fine-Tuning Kubernetes Resource Allocation: We reviewed the CPU and memory requests and limits allocated to the application and compared them against the minimum resources the application needs to run.
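As a quick illustration, here is a minimal sketch of how the current allocation can be inspected programmatically, assuming the official kubernetes Python client and a Deployment named django-app in the default namespace (both placeholder names):

# Sketch: read the Deployment's resource requests/limits so they can be compared
# against the application's known minimum requirements.
# Assumes the official `kubernetes` client; names below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

deploy = apps.read_namespaced_deployment(name="django-app", namespace="default")
for container in deploy.spec.template.spec.containers:
    resources = container.resources
    print(container.name, "requests:", resources.requests, "limits:", resources.limits)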

Readiness & Liveness Probes: After tuning resource usage, we extended the liveness and readiness probe timeouts so that the probes could respond before the deadline expired.

Research on Stack Overflow highlighted that under heavy request loads, the probes might struggle to respond promptly.

Therefore, we increased the probe timeout, which noticeably reduced the frequency of timeouts in our application: doubling the timeout setting cut application timeouts by roughly 25%.
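For reference, here is a hedged sketch of how such a probe-timeout increase might be applied with the kubernetes Python client; the Deployment and container names, namespace, and the exact timeout and period values are illustrative assumptions rather than our production settings.

# Sketch: patch the Deployment to increase the probe timeouts (illustrative values).
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "django-app",  # placeholder container name
                        "readinessProbe": {"timeoutSeconds": 10, "periodSeconds": 15},
                        "livenessProbe": {"timeoutSeconds": 10, "periodSeconds": 15},
                    }
                ]
            }
        }
    }
}
# Strategic-merge patch: existing probe handlers (httpGet, etc.) are kept;
# only the timeout/period fields are updated.
apps.patch_namespaced_deployment(name="django-app", namespace="default", body=patch)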

Readiness & Liveness Probe Timeout

Gunicorn Configuration: Even after doubling the liveness and readiness probe timeouts, the problem was still there. So we served our Django app through Gunicorn, whose worker processes handle concurrent requests more efficiently, addressing load problems that the probe changes alone could not fix and helping prevent timeouts. We used the following settings:

Number of workers = (2 * #cores) + 1
Worker class: gthread
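A minimal gunicorn.conf.py following this formula might look like the sketch below; the bind address, thread count, and timeout are illustrative assumptions, and the server is started with something like gunicorn -c gunicorn.conf.py myproject.wsgi:application (myproject being a placeholder).

# gunicorn.conf.py -- sketch of the worker formula above; values are illustrative.
import multiprocessing

bind = "0.0.0.0:8000"                            # placeholder bind address
workers = (2 * multiprocessing.cpu_count()) + 1  # (2 * #cores) + 1
worker_class = "gthread"                         # threaded workers
threads = 4                                      # threads per worker, tune per workload
timeout = 60                                     # seconds before a silent worker is restarted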

Changing the Gunicorn worker class and number of threads: Even with Gunicorn running with these standard settings and the probe timeouts extended, the problem persisted.

So we discussed it with our Python developer and decided to switch Gunicorn’s worker class to “gevent”. Its greenlet-based workers can handle many requests concurrently without tying up a thread per request.
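The switch itself is a small change to the Gunicorn configuration; a sketch is shown below, assuming gevent is installed (for example via pip install gunicorn[gevent]) and using illustrative values. Note that C extensions such as psycopg2 are not made cooperative by gevent’s monkey-patching alone, so database-heavy views may need an additional gevent-compatible driver setup.

# gunicorn.conf.py -- sketch of the gevent-based configuration; values are illustrative.
import multiprocessing

bind = "0.0.0.0:8000"
workers = (2 * multiprocessing.cpu_count()) + 1
worker_class = "gevent"        # greenlet-based workers handle many concurrent requests
worker_connections = 1000      # max simultaneous connections per worker
timeout = 60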

Upgrading the Postgres Master Server Configuration: After making all these application-level changes, we checked how heavily the PostgreSQL master was using its node’s resources.

CPU utilization was consistently high, which could have been contributing to the timeouts, so we moved the PostgreSQL master to a larger node. Even after doing that, the problem persisted.

Postgres Master Server CPU Usage

Setting up monitoring for Postgres and the Ingress Controller: Even after all these changes, the application still timed out. So we set up monitoring for the Nginx ingress controller and for our Postgres database using the Postgres exporter. Once monitoring was in place, we noticed that during bursts of concurrent requests, Postgres tables were also getting locked.

Nginx Ingress Controller & Postgres Table Lock

The monitoring made the correlation clear: whenever the application timed out, the database tables were locked at the same time.

Based on this analysis, we concluded that the database locks were causing the timeouts and the increase in HTTP 499 errors, a status code Nginx logs when the client closes the connection before receiving a response.
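To double-check this kind of contention directly on the database, a query against pg_stat_activity can list sessions that are currently blocked and the sessions blocking them. The sketch below uses psycopg2 with placeholder connection details; pg_blocking_pids() requires PostgreSQL 9.6 or newer.

# Sketch: list blocked queries and the PIDs blocking them (connection details are placeholders).
import psycopg2

conn = psycopg2.connect(host="postgres-master", dbname="appdb", user="app", password="change-me")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT pid,
               pg_blocking_pids(pid) AS blocked_by,
               wait_event_type,
               state,
               query
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0;
    """)
    for pid, blocked_by, wait_event_type, state, query in cur.fetchall():
        print(f"pid={pid} blocked by {blocked_by} ({wait_event_type}, {state}): {query}")
conn.close()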

Conclusion:

Fixing our app’s slowdowns took patience, know-how, and the right tools. By keeping a close eye on the system with proper monitoring and observability tooling, we found out why the app was timing out.

It turned out that the database was intermittently locking up, slowing everything down. After addressing that, the app now runs smoothly in Kubernetes without any issues.

We are currently optimizing the application further by introducing Celery to manage the exponential growth in traffic. We’ll share that work in the upcoming second part of this blog.

Opstree is an end-to-end DevOps solution provider.

