{"id":18492,"date":"2024-05-28T16:29:15","date_gmt":"2024-05-28T10:59:15","guid":{"rendered":"https:\/\/opstree.com\/blog\/?p=18492"},"modified":"2024-06-11T18:24:39","modified_gmt":"2024-06-11T12:54:39","slug":"solving-timeout-issues-in-python-django-on-kubernetes","status":"publish","type":"post","link":"https:\/\/opstree.com\/blog\/2024\/05\/28\/solving-timeout-issues-in-python-django-on-kubernetes\/","title":{"rendered":"Solving Timeout Issues in Python Django on Kubernetes"},"content":{"rendered":"<div class=\"er es et eu ev l\">\n<article>\n<div class=\"l\">\n<div class=\"l\">\n<section>\n<div class=\"fk fl fm fn fo\">\n<div class=\"ab ca\">\n<div class=\"ch bg ew ex ey ez\">The cloud world is constantly evolving, and migrating applications from virtual machines (VMs) to platforms like Kubernetes offers scalability, portability, and ease of management.<\/div>\n<div><\/div>\n<div class=\"ch bg ew ex ey ez\">However, the migration process is not always straightforward, and sometimes the journey doesn&#8217;t go as smoothly as expected. 
Our Python Django application, which had been running flawlessly on a VM, suddenly turned sluggish and unresponsive after the migration.<\/div>\n<div><\/div>\n<div class=\"ch bg ew ex ey ez\">Timeouts became a frustratingly common occurrence, and the overall performance of the application deteriorated significantly.<\/div>\n<div><\/div>\n<div class=\"ch bg ew ex ey ez\">This unexpected slowdown was a major concern, as it impacted the user experience and could potentially lead to lost revenue and customer dissatisfaction.<\/div>\n<div><\/div>\n<div class=\"ch bg ew ex ey ez\">In this blog post, we take you through the steps we followed to track down the performance issues and identify the root cause of our application&#8217;s slowdown in the Kubernetes environment.<\/div>\n<\/div>\n<\/div>\n<\/section>\n<\/div>\n<\/div>\n<\/article>\n<\/div>\n<p><!--more--><\/p>\n<div class=\"er es et eu ev l\">\n<article>\n<div class=\"l\">\n<div class=\"l\">\n<section>\n<div class=\"fk fl fm fn fo\">\n<div class=\"ab ca\">\n<div class=\"ch bg ew ex ey ez\">\n<h1 id=\"6355\" class=\"ne nf fr be ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob bj\" data-selectable-paragraph=\"\">Steps to Resolve Timeout Issues in Python Django on Kubernetes<\/h1>\n<p>Even after adjusting configurations and scaling our application, the problem persisted, leading us to delve deeper into the underlying infrastructure. 
Here are the steps that we followed to identify and fix the issues:<\/p>\n<p id=\"10c6\" class=\"pw-post-body-paragraph ln lo fr lp b lq oc ls lt lu od lw lx ly oe ma mb mc of me mf mg og mi mj mk fk bj\" data-selectable-paragraph=\"\"><strong class=\"lp fs\">Fine-Tuning Kubernetes Resource Allocation:<\/strong> We reviewed the resource allocation for the application and checked it against the minimum resources the application needs to run.<\/p>\n<p id=\"dd70\" class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\"><strong class=\"lp fs\">Readiness &amp; Liveness Probe:<\/strong> After optimizing resource usage, we extended the liveness and readiness probe timeouts to ensure that the probes could respond before the timeout was exceeded.<\/p>\n<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">Research on Stack Overflow highlighted that under heavy request loads, the probes might struggle to respond promptly.<\/p>\n<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">Therefore, we increased the probe timeout. This adjustment significantly reduced the frequency of timeout issues in our application. 
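<\/p>\n<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">As a rough sketch, the relaxed probe settings (alongside explicit resource requests) looked something like the fragment below. The health endpoint path, port, and exact numbers are illustrative assumptions, not values taken from our actual deployment:<\/p>

```yaml
# Hypothetical Deployment fragment — endpoint path, port, and values are illustrative.
containers:
  - name: django-app
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "1"
        memory: "1Gi"
    readinessProbe:
      httpGet:
        path: /healthz      # assumed health endpoint
        port: 8000
      periodSeconds: 10
      timeoutSeconds: 10    # doubled from an assumed 5s
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8000
      initialDelaySeconds: 30
      periodSeconds: 15
      timeoutSeconds: 10    # doubled as well
```

<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">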
Moreover, by doubling the timeout setting, we <strong>observed a 25% decrease in application timeouts<\/strong>.<\/p>\n<figure class=\"mo mp mq mr ms mt ml mm paragraph-image\">\n<div class=\"mu mv ee mw bg mx\" role=\"button\">\n<div class=\"ml mm mn\"><img loading=\"lazy\" decoding=\"async\" class=\"bg ku my c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*07HWkE3MkZnxHbuX66xEug.png\" alt=\"\" width=\"700\" height=\"183\" \/><\/div>\n<\/div><figcaption class=\"mz na nb ml mm nc nd be b bf z dw\" data-selectable-paragraph=\"\">Readiness &amp; Liveness Probe Timeout<\/figcaption><\/figure>\n<p id=\"f5c8\" class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\"><strong class=\"lp fs\">Gunicorn Configuration:<\/strong> Even after doubling the liveness and readiness timeouts, the problem was still there. So we added <strong>Gunicorn<\/strong> to our Django app. Its worker model handles concurrent requests more efficiently, helping the server absorb load beyond what the probe changes alone could fix. 
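<\/p>\n<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">The setup can be sketched as a small <code>gunicorn.conf.py<\/code>. Only the worker formula and the gthread worker class come from our configuration; the thread count and timeout below are illustrative assumptions:<\/p>

```python
# Hypothetical gunicorn.conf.py — only the worker formula and the gthread
# worker class come from the post; threads and timeout are assumed values.
import multiprocessing

cores = multiprocessing.cpu_count()
workers = (2 * cores) + 1   # Number of workers = (2 * #cores) + 1
worker_class = "gthread"    # later switched to "gevent" for heavy concurrency
threads = 4                 # threads per gthread worker (assumed)
timeout = 60                # seconds before an unresponsive worker is killed (assumed)
```

<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">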
This makes request handling smoother and helps prevent timeouts.<\/p>\n<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\"><strong class=\"lp fs\">Number of Workers = (2 * #cores) + 1; Worker Class: gthread<\/strong><\/p>\n<p id=\"224d\" class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\"><strong class=\"lp fs\">Changing Gunicorn worker class and number of threads:\u00a0<\/strong>Even with Gunicorn configured with these usual settings and the longer liveness and readiness timeouts, the problem stayed.<\/p>\n<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">So we consulted our Python developer and decided to switch Gunicorn\u2019s worker class to \u201c<a class=\"af oh\" href=\"https:\/\/dev.to\/lsena\/gunicorn-worker-types-how-to-choose-the-right-one-4n2c#:~:text=your%20usage%20patterns.-,eventlet\/gevent,-Eventlet%20and%20gevent\" target=\"_blank\" rel=\"noopener ugc nofollow\">gevent<\/a>\u201d. This change let the application handle many concurrent requests without blocking.<\/p>\n<p id=\"bb77\" class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\"><strong class=\"lp fs\">Upgrading the Postgres Master Server Configuration:<\/strong>\u00a0After making all these application changes, we checked how heavily the PostgreSQL master was using the node\u2019s resources.<\/p>\n<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">We saw that CPU usage was consistently high, which could be causing the timeouts, so we decided to increase the node size for the PostgreSQL master. 
But even after doing that, the problem still persisted.<\/p>\n<figure class=\"mo mp mq mr ms mt ml mm paragraph-image\">\n<div class=\"mu mv ee mw bg mx\" role=\"button\">\n<div class=\"ml mm oi\"><img loading=\"lazy\" decoding=\"async\" class=\"bg ku my c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*uEH8JA9HvP-TB03JKUitaA.png\" alt=\"\" width=\"700\" height=\"335\" \/><\/div>\n<\/div><figcaption class=\"mz na nb ml mm nc nd be b bf z dw\" data-selectable-paragraph=\"\">Postgres Master Server CPU Usage<\/figcaption><\/figure>\n<p id=\"16c1\" class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\"><strong class=\"lp fs\">Setting up monitoring for Postgres and Ingress Controller:<\/strong>\u00a0Even after all these changes, the application still showed the same problem. So we decided to monitor the\u00a0<strong class=\"lp fs\">Nginx ingress controller<\/strong>\u00a0and our\u00a0<strong class=\"lp fs\">Postgres database<\/strong>\u00a0using the\u00a0<strong class=\"lp fs\">Postgres exporter<\/strong>. 
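<\/p>\n<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">Beyond the exporter dashboards, a lock pile-up like the one we hit can also be inspected directly in Postgres. The query below is an illustrative sketch (not taken from our runbook) that lists backends currently waiting on locks:<\/p>

```sql
-- Illustrative: list sessions waiting on locks, with the statement they are running.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```

<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">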
Once we started monitoring the ingress controller and the Postgres database, we noticed that during bursts of concurrent requests, the Postgres tables were getting locked.<\/p>\n<figure class=\"mo mp mq mr ms mt ml mm paragraph-image\">\n<div class=\"mu mv ee mw bg mx\" role=\"button\">\n<div class=\"ml mm oj\"><img loading=\"lazy\" decoding=\"async\" class=\"bg ku my c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*vE8ZRK_0b-WyI18vabev7w.png\" alt=\"\" width=\"700\" height=\"354\" \/><\/div>\n<\/div><figcaption class=\"mz na nb ml mm nc nd be b bf z dw\" data-selectable-paragraph=\"\">Nginx Ingress Controller &amp; Postgres Table Lock<\/figcaption><\/figure>\n<p id=\"9996\" class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">The correlation was clear: whenever the application timed out, the <strong>database tables were also locked<\/strong>.<\/p>\n<p class=\"pw-post-body-paragraph ln lo fr lp b lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk fk bj\" data-selectable-paragraph=\"\">Based on this analysis, we concluded that the database locks were causing the timeouts and the increase in HTTP status code <strong>499<\/strong> errors, which usually mean that the client closed the connection before receiving a response.<\/p>\n<h1 id=\"6bfb\" class=\"ne nf fr be ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx ny nz oa ob bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Conclusion<\/strong><\/h1>\n<p id=\"10d2\" class=\"pw-post-body-paragraph ln lo fr lp b lq oc ls lt lu od lw lx ly oe ma mb mc of me mf mg og mi mj mk fk bj\" data-selectable-paragraph=\"\">Fixing our app\u2019s slowdowns needed patience, know-how, and the right tools. 
By keeping a close eye on our system with the right observability tools, we found out why our app was timing out.<\/p>\n<p class=\"pw-post-body-paragraph ln lo fr lp b lq oc ls lt lu od lw lx ly oe ma mb mc of me mf mg og mi mj mk fk bj\" data-selectable-paragraph=\"\">It turns out that our database was intermittently locking up, slowing everything down. After fixing that, our app is now running smoothly in Kubernetes, without any issues.<\/p>\n<p>Currently, we are further optimizing our application by introducing Celery to manage the exponential traffic growth. We&#8217;ll share that solution in the upcoming second part of this blog.<\/p>\n<\/div>\n<div><\/div>\n<\/div>\n<\/div>\n<\/section>\n<\/div>\n<\/div>\n<\/article>\n<div class=\"ab ca\">\n<div class=\"ch bg ew ex ey ez\"><strong>Blog Pundit:<\/strong> <strong><a href=\"https:\/\/opstree.com\/blog\/author\/sandeep7c51ad81ba\/\">Sandeep Rawat<\/a> &amp; <a href=\"https:\/\/opstree.com\/blog\/author\/deepakgupta97\/\">Deepak Gupta<\/a><\/strong><\/div>\n<div><\/div>\n<div class=\"ch bg ew ex ey ez\"><a href=\"https:\/\/www.opstree.com\/contact-us?utm_source=wordpress&amp;utm_campaign=AWS-Gateway-LoadBalancer-A-Load-Balancer-that-we-deserve&amp;utm_id=Blog\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Opstree<\/strong><\/a> is an end-to-end DevOps solution provider<\/div>\n<div class=\"ch bg ew ex ey ez\">\n<div class=\"wp-block-buttons\">\n<div><\/div>\n<div class=\"wp-block-button is-style-fill\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/www.opstree.com\/contact-us\" target=\"_blank\" rel=\"noreferrer noopener\">CONTACT US<\/a><\/div>\n<\/div>\n<p class=\"has-text-align-center\"><strong>Connect Us<\/strong><\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>The cloud world is constantly evolving, and migrating applications from virtual machines (VMs) to platforms like Kubernetes offers scalability, portability, and 
ease of management. However, the migration process is not always straightforward, and sometimes the journey doesn&#8217;t go as smoothly as expected. Our Python Django application, which had been running flawlessly on a VM, suddenly &hellip; <a href=\"https:\/\/opstree.com\/blog\/2024\/05\/28\/solving-timeout-issues-in-python-django-on-kubernetes\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Solving Timeout Issues in Python Django on Kubernetes&#8221;<\/span><\/a><\/p>\n","protected":false},"author":237666321,"featured_media":18525,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_coblocks_attr":"","_coblocks_dimensions":"","_coblocks_responsive_height":"","_coblocks_accordion_ie_support":"","jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false},"version":2}},"categories":[768739351],"tags":[768739352],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2024\/05\/Solving-Timeout-Issues-in-Python-Django-on-Kubernetes-3.png","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pfDBOm-4Og","jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts\/18492"}],"collection":[{"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/users\/237666321"}],"repl
ies":[{"embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/comments?post=18492"}],"version-history":[{"count":18,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts\/18492\/revisions"}],"predecessor-version":[{"id":18551,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts\/18492\/revisions\/18551"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/media\/18525"}],"wp:attachment":[{"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/media?parent=18492"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/categories?post=18492"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/tags?post=18492"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}