{"id":3872,"date":"2020-08-11T19:58:56","date_gmt":"2020-08-11T14:28:56","guid":{"rendered":"https:\/\/opstree.com\/blog\/\/?p=3872"},"modified":"2020-08-18T16:28:33","modified_gmt":"2020-08-18T10:58:33","slug":"elasticsearch-garbage-collector-frequent-execution-issue","status":"publish","type":"post","link":"https:\/\/opstree.com\/blog\/2020\/08\/11\/elasticsearch-garbage-collector-frequent-execution-issue\/","title":{"rendered":"Elasticsearch Garbage Collector Frequent Execution Issue"},"content":{"rendered":"\r\n<p id=\"23b1\">Have you noticed an unexpected unallocation of shards happening roughly every hour, causing the cluster state to switch from Green &gt; Yellow &gt; Red &gt; Yellow &gt; Green? During this transition, Elasticsearch becomes unreachable and API calls start responding with non-200 status codes.<\/p>\r\n\r\n\r\n\r\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/1650\/0*ADqhoJ0YBjEfYKHg\" alt=\"Image for post\" \/><\/figure>\r\n\r\n\r\n\r\n<p id=\"705c\"><strong>Environment<\/strong><!--more--><\/p>\r\n\r\n\r\n\r\n<p id=\"3323\"><strong>3 master nodes with 3 worker nodes.<\/strong><\/p>\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\" id=\"285b\">Analysis of the Error<\/h2>\r\n\r\n\r\n\r\n<h6 class=\"wp-block-heading\" id=\"7385\"><strong>Garbage Collector Sawtooth Pattern<\/strong><\/h6>\r\n\r\n\r\n\r\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"https:\/\/miro.medium.com\/max\/1650\/0*FpG1YGfdWXhAMrGc\" alt=\"Image for post\" width=\"503\" height=\"99\" \/><\/figure>\r\n\r\n\r\n\r\n<p id=\"534d\">The reason for this sawtooth pattern is that the JVM continuously allocates memory on the heap, as new objects are created very frequently by Elasticsearch\u2019s own operations such as search queries, write queries, and flush and refresh operations. 
Most of these objects, however, are short-lived and quickly become available for collection by the garbage collector in the young region of the heap. When the garbage collector finishes, you\u2019ll see a drop in the memory-usage graph.\u00a0<a href=\"https:\/\/www.elastic.co\/blog\/found-understanding-memory-pressure-indicator\" target=\"_blank\" rel=\"noreferrer noopener\">Reference<\/a><\/p>\r\n\r\n\r\n\r\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\r\n<p>Note: Most Elasticsearch objects are short-lived and are collected by the Garbage Collector in the young region.<\/p>\r\n<\/blockquote>\r\n\r\n\r\n\r\n<h6 class=\"wp-block-heading\" id=\"bed5\">High Allocation Rates Of Objects Cause Performance Issues<\/h6>\r\n\r\n\r\n\r\n<p id=\"00ec\">The GC logs provide a way to capture how frequently your app is allocating objects. While high allocation rates aren\u2019t necessarily a problem, they can lead to performance issues. To see whether this is affecting your app, you can compare the size of the young generation after a collection with its size before the next one.<\/p>\r\n\r\n\r\n\r\n<p id=\"2ccd\">For example, the following three GC log entries show the app allocating objects at about 12.48GB\/sec.<\/p>\r\n\r\n\r\n\r\n<pre class=\"wp-block-code\"><code>[31.087s][info ][gc ] GC(153) Pause Young (Normal) (G1 Evacuation Pause) 3105M-&gt;794M(3946M) 55.590ms\r\n[31.268s][info ][gc ] GC(154) Pause Young (Normal) (G1 Evacuation Pause) 3108M-&gt;798M(3946M) 55.425ms \r\n[31.453s][info ][gc ] GC(155) Pause Young (Normal) (G1 Evacuation Pause) 3113M-&gt;802M(3946M) 55.571ms<\/code><\/pre>\r\n\r\n\r\n\r\n<p>&nbsp;<\/p>\r\n\r\n\r\n\r\n<p id=\"f125\">Between 31.087s and 31.268s, 2314M was allocated by the app (3108M-794M), and between 31.268s and 31.453s another 2315M was allocated (3113M-798M). That works out to about 2.3GB every ~180ms, or roughly 12.48GB\/sec. 
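The arithmetic above can be reproduced directly from the log lines. Below is a minimal sketch in Python; the regex is a hand-rolled assumption about the unified GC log format shown here, not an official parser:

```python
import re

# The three G1 young-collection log entries quoted above.
log_lines = [
    "[31.087s][info ][gc ] GC(153) Pause Young (Normal) (G1 Evacuation Pause) 3105M->794M(3946M) 55.590ms",
    "[31.268s][info ][gc ] GC(154) Pause Young (Normal) (G1 Evacuation Pause) 3108M->798M(3946M) 55.425ms",
    "[31.453s][info ][gc ] GC(155) Pause Young (Normal) (G1 Evacuation Pause) 3113M->802M(3946M) 55.571ms",
]

# Capture the timestamp and the heap occupancy before/after the collection.
pattern = re.compile(r"\[(?P<ts>[\d.]+)s\].*?(?P<before>\d+)M->(?P<after>\d+)M\(\d+M\)")

def allocation_rates(lines):
    """Yield (interval_seconds, allocated_mb, rate_gb_per_sec) for consecutive GC entries."""
    parsed = [pattern.search(line).groupdict() for line in lines]
    for prev, cur in zip(parsed, parsed[1:]):
        interval = float(cur["ts"]) - float(prev["ts"])
        # The heap grew from the previous entry's post-GC size to this entry's pre-GC size.
        allocated = int(cur["before"]) - int(prev["after"])
        yield interval, allocated, allocated / 1024 / interval

for interval, allocated, rate in allocation_rates(log_lines):
    print(f"{allocated}M allocated in {interval:.3f}s -> {rate:.2f}GB/sec")
```

For the entries above this reports roughly 12.5GB/sec per interval, matching the hand calculation.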
Depending on your app\u2019s behavior, allocating objects at that rate could negatively affect its performance.\u00a0<a href=\"https:\/\/www.papertrail.com\/solution\/tips\/7-problems-to-look-out-for-when-analyzing-garbage-collection-logs\/\" target=\"_blank\" rel=\"noreferrer noopener\">Reference<\/a><\/p>\r\n\r\n\r\n\r\n<h6 class=\"wp-block-heading\" id=\"649b\"><strong>High Allocation Of Objects Causes High Frequency Of Garbage Collection<\/strong><\/h6>\r\n\r\n\r\n\r\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"https:\/\/miro.medium.com\/max\/1650\/0*zf3HLp2m8caa3No4\" alt=\"Image for post\" width=\"504\" height=\"137\" \/><\/figure>\r\n\r\n\r\n\r\n<p id=\"d3b9\">We observed that the Garbage Collector was executing at an interval of about one minute, because of the high rate of object allocation by Elasticsearch\u2019s own operations such as search queries across shards; with a large number of shards on only 3 worker nodes, objects were being created at a very high frequency. Also, when the Garbage Collector executes it causes a \u201cStop the World\u201d state, meaning the worker&#8217;s main Elasticsearch thread stops. 
When the main Elasticsearch thread is unresponsive for a long duration, the Elasticsearch master assumes that the worker node has left the cluster and reallocates its shards among the other nodes.<\/p>\r\n\r\n\r\n\r\n<p id=\"fd96\">Below is an example error obtained from the Elasticsearch logs:<\/p>\r\n\r\n\r\n\r\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"https:\/\/miro.medium.com\/max\/1650\/0*fqYoYVKEtpeBjALK\" alt=\"Image for post\" width=\"503\" height=\"93\" \/><\/figure>\r\n\r\n\r\n\r\n<h2 class=\"wp-block-heading\" id=\"d3e8\">Implemented Solution<\/h2>\r\n\r\n\r\n\r\n<p id=\"868e\">Previously we had set \u201c2 primaries with 6 replicas\u201d for each index, which produced a lot of shards; more shards mean more parallel read operations across each shard, which results in more objects being created more frequently. Elasticsearch suggests at most around 600 shards per node (the rule of thumb being fewer than 20 shards per GB of heap, for a 30GB heap).<\/p>\r\n\r\n\r\n\r\n<p id=\"e340\"><strong>So we decided on the following changes:<br \/><\/strong>&#8211; 2 primaries with 1 replica<br \/>&#8211; Increase the worker nodes from 3 to 5.<\/p>\r\n\r\n\r\n\r\n<h6 class=\"wp-block-heading\" id=\"59f2\"><strong>Reasons to increase the worker nodes<\/strong><\/h6>\r\n\r\n\r\n\r\n<p id=\"0bdd\">First, with the increase in the number of worker nodes, we are able to maintain a minimal shard count on each node.<\/p>\r\n\r\n\r\n\r\n<p id=\"6dc5\">Second, previously 3 Garbage Collectors (one on each worker node) had to handle all of the garbage collection; with 5 Garbage Collectors, that work is divided.<\/p>\r\n\r\n\r\n\r\n<p id=\"1935\">Third, with shards divided among 5 nodes, the objects created by search queries are also divided among 5 nodes. 
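As a back-of-the-envelope check on why the new settings reduce per-node load, here is a small sketch; the index count of 100 is a made-up figure for illustration, since the post does not state how many indices the cluster held:

```python
def total_shards(indices, primaries, replicas):
    """Total shard copies across the cluster: primaries plus all their replica copies."""
    return indices * primaries * (1 + replicas)

def shards_per_node(indices, primaries, replicas, nodes):
    """Average number of shard copies each worker node must host."""
    return total_shards(indices, primaries, replicas) / nodes

INDICES = 100  # hypothetical index count, for illustration only

before = shards_per_node(INDICES, primaries=2, replicas=6, nodes=3)  # old: 2 primaries, 6 replicas, 3 workers
after = shards_per_node(INDICES, primaries=2, replicas=1, nodes=5)   # new: 2 primaries, 1 replica, 5 workers

print(f"shards per node before: {before:.0f}")  # 100 * 2 * 7 / 3, about 467
print(f"shards per node after:  {after:.0f}")   # 100 * 2 * 2 / 5 = 80
```

Under these assumptions the per-node shard count drops by roughly a factor of six, which is what spreads the object-allocation load (and hence GC work) across the cluster.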
So the number of objects created on each node decreases, which in turn decreases the frequency of Garbage Collector execution.<\/p>\r\n<p>Opstree is an end-to-end DevOps solution provider.<\/p>\r\n<p>\r\n\r\n<\/p>\r\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\r\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link\" title=\"https:\/\/www.opstree.com\/contact-us\" href=\"https:\/\/www.opstree.com\/contact-us\" target=\"_blank\" rel=\"noopener\">contact us<\/a><\/div>\r\n<\/div>\r\n","protected":false},"excerpt":{"rendered":"<p>Have you noticed an unexpected unallocation of shards happening roughly every hour, causing the cluster state to switch from Green &gt; Yellow &gt; Red &gt; Yellow &gt; Green? During this transition, Elasticsearch becomes unreachable and API calls start responding with non-200 status codes. Environment<\/p>\n","protected":false},"author":159459904,"featured_media":29900,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_coblocks_attr":"","_coblocks_dimensions":"","_coblocks_responsive_height":"","_coblocks_accordion_ie_support":"","jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false},"version":2}},"categories":[28070474],"tags":[768739310,1418072,140226],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/opstree.com\/blog\/wp-content\/uploads\/2025\/11\/DevSecOps-1.jpg","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pfDBOm-10s","jetpack-related-posts":[],"_links":{"self":[{"h
ref":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts\/3872"}],"collection":[{"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/users\/159459904"}],"replies":[{"embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/comments?post=3872"}],"version-history":[{"count":10,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts\/3872\/revisions"}],"predecessor-version":[{"id":4243,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/posts\/3872\/revisions\/4243"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/media\/29900"}],"wp:attachment":[{"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/media?parent=3872"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/categories?post=3872"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opstree.com\/blog\/wp-json\/wp\/v2\/tags?post=3872"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}