Elasticsearch Backup and Restore in Production

ES backup and restore using AWS S3

We were fortunate enough to get an opportunity to perform an Elasticsearch cluster snapshot and restore on a highly active production cluster. The indices we needed to restore were around 2–3 TB in size.

Our task was to take a snapshot from an old cluster (v6.4.2), which had several huge indices, and restore a few of them to a new cluster (v7.9.2). This endeavour was meant to bring the load down on the old cluster.

The old cluster was facing a lot of performance issues: read/write operations were too heavy to handle, CPU and memory utilization were high most of the time, and segment merging was much slower than expected.

For this reason, it became necessary to move some of the indices to a new cluster. We anticipated that this activity would bring speed and stability to the applications using the clusters. Before starting the activity, naturally, we scoured the internet for everything related to Elasticsearch backup and restore.

While searching, we found a lot of blogs, videos, and documents, which we went through word by word. The research was helpful, but some things can be learned only by doing. Therefore, we decided to write a blog based on our own experience and add to the existing pool of resources on the topic.

There are three ways, according to the documentation, through which we could have migrated our indices:

  1. Index 
  2. Reindex
  3. Snapshot and restore

This called for an elimination meeting where we decided which way to go. The first option is the simplest but not very efficient, as it involves using another tool just for log harvesting. The second option was also not attractive, because reindexing can be a quite resource-intensive process and would pose a risk we were not in a position to afford. Consequently, we went with the third one: snapshot and restore. Snapshot and restore in Elasticsearch is divided into three processes:

  1. Register a snapshot repository
  2. Create your first snapshot and subsequent incremental snapshots
  3. Restore (or incremental restore) to new location

REGISTER A SNAPSHOT REPOSITORY

A snapshot repository, as the name suggests, is a location that stores all our indices and related metadata. It could be anything from a local filesystem to remote cloud object storage. There are multiple repository types available, such as fs, url, and s3, as stated in the official docs.

We went with S3 because it was a convenient option for us. Using S3 as a repository is quite simple: we need to install an Elasticsearch plugin called repository-s3 on each node, as it is a node-level setting, and then use the _snapshot API to register the repository in a bucket.

Let’s install the plugin:

cd ~ 
wget https://artifacts.elastic.co/downloads/elasticsearch-plugins/repository-s3/repository-s3-<version>.zip 
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install file:///home/<user>/repository-s3-<version>.zip

Since plugins are loaded at startup, each node needs a restart after the installation. We can confirm that the plugin is installed with the command below:

sudo /usr/share/elasticsearch/bin/elasticsearch-plugin list

It’s time for the last pair of commands that we need to execute on all nodes:

sudo /usr/share/elasticsearch/bin/elasticsearch-keystore add s3.client.default.access_key 
sudo /usr/share/elasticsearch/bin/elasticsearch-keystore add s3.client.default.secret_key 

Note: ‘default’ in s3.client.default.access_key is the name of the S3 client, not of the repository being registered. A repository references its client through the “client” setting in the API request; when that setting is omitted, the client named ‘default’ is used. We can give a client any name we want, as we will see further in this blog.
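For instance, to prepare credentials for a second client named new-repo (a hypothetical client name used for illustration), we would add a separate pair of keystore entries on every node:

```shell
# Keystore entries for an additional S3 client named "new-repo"
sudo /usr/share/elasticsearch/bin/elasticsearch-keystore add s3.client.new-repo.access_key
sudo /usr/share/elasticsearch/bin/elasticsearch-keystore add s3.client.new-repo.secret_key
```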

Here, we’ll have to enter our AWS ACCESS_KEY and SECRET_KEY created via IAM for S3 access. The required permissions for the IAM role can be found in the official docs.

Now that we have all the prerequisites settled, let’s register our repository. For this, we need to make a curl request to the Elasticsearch _snapshot API as shown below:

curl -X PUT "<hostname/IP>:9200/_snapshot/repo_name" -H 'Content-Type: application/json' -d'
{
    "type": "s3",
    "settings": {
        "bucket": "my-S3-bucket",
        "region": "ap-south-1",
        "base_path": "path/to/respective/directory/"
    }
}
'

If we need to add more repositories, we can register each one with its own “client” name, adding the respective secrets to the elasticsearch-keystore. These keystore settings are secure and reloadable, so we can add or update them without restarting the service or the cluster.

curl -X PUT "<hostname/IP>:9200/_snapshot/another_repo_name" -H 'Content-Type: application/json' -d'
{
    "type": "s3",
    "settings": {
        "client": "new-repo",
        "bucket": "my-S3-bucket",
        "region": "ap-south-1",
        "base_path": "path/to/respective/directory/"
    }
}
'

The above steps need to be performed on both the source cluster and the destination cluster to register the snapshot repository. To view our registered repositories, we can use the API request below:

curl -XGET <hostname/IP>:9200/_cat/repositories
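We can also ask Elasticsearch to confirm that every node can actually read from and write to the repository, using the repository verification API (repo_name here is the repository registered above):

```shell
# Verify that all nodes can access the snapshot repository
curl -X POST "<hostname/IP>:9200/_snapshot/repo_name/_verify?pretty"
```

A successful response lists the nodes that were able to access the bucket; an error here usually points to missing IAM permissions or keystore entries.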

CREATE SNAPSHOTS

Having registered the repository, we can proceed with taking our first and subsequent incremental snapshots:

curl -XPUT "<hostname/IP>:9200/_snapshot/repo_name/my_snapshot_2020-12-30?wait_for_completion=true&pretty" -H 'Content-Type: application/json' -d' 
{ 
    "indices": "comma,separated,indices", 
    "ignore_unavailable": true, 
    "include_global_state": false 
} 
' 

To view all the snapshots of a repository, we can use the API request below:

curl -XGET <hostname/IP>:9200/_cat/snapshots/repo_name
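For a large snapshot started without wait_for_completion, we can monitor its progress with the snapshot status API (using the snapshot name from the example above):

```shell
# Check per-shard progress of a running (or completed) snapshot
curl -XGET "<hostname/IP>:9200/_snapshot/repo_name/my_snapshot_2020-12-30/_status?pretty"
```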

RESTORE SNAPSHOTS 

We can view our snapshots from the destination cluster with the same request as above, since both clusters have registered the same repository location. Now that our incremental snapshots have been taken, it’s time to restore them to the new cluster. The first restore is quite simple; we need to make a POST request like the one below:

curl -X POST "<hostname/IP>:9200/_snapshot/repo_name/my_snapshot_2020-12-30/_restore?pretty" -H 'Content-Type: application/json' -d'
{
    "indices": "comma,separated,indices",
    "ignore_unavailable": true,
    "include_global_state": false,
    "index_settings": {
        "index.number_of_replicas": 1
    }
}
'
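Restoring multi-terabyte indices takes a while; one way to watch shard recovery progress on the destination cluster is the _cat/recovery API:

```shell
# Show only shards that are still actively recovering
curl -XGET "<hostname/IP>:9200/_cat/recovery?v&active_only=true"
```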

For incremental restores, we need to ensure that the target indices are not open, to avoid data conflicts or inconsistency. Elasticsearch ensures safety here and does not allow restore operations on open indices. Therefore, to restore onto pre-existing indices, we need to close them first:

curl -X POST <hostname/IP>:9200/comma,separated,indices/_close

Following this, we can restore onto these indices:

curl -X POST "<hostname/IP>:9200/_snapshot/repo_name/my_snapshot_2020-12-30/_restore?pretty" -H 'Content-Type: application/json' -d'
{
    "indices": "comma,separated,indices",
    "ignore_unavailable": true,
    "include_global_state": false
}
'

Don’t worry about opening the indices again; the _restore API will reopen closed indices after a successful incremental restore.
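Once a restore finishes, a quick sanity check (the index names below are placeholders, as in the examples above) confirms that the indices are open and healthy:

```shell
# List restored indices with their open/closed status, health, and document count
curl -XGET "<hostname/IP>:9200/_cat/indices/comma,separated,indices?v&h=index,status,health,docs.count"
```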
For detailed information on all the mentioned Elasticsearch API settings and other options, refer to the official documentation.

After we had snapshot and restore all figured out, all that was left was to keep taking incremental snapshots until the planned migration time, then switch traffic over to the new cluster once the latest incremental restore was done. There was a small issue in handling the data generated during the activity, but that was taken care of with the help of Kafka. Maybe we’ll write a new blog to talk about that in detail.

Co-author: Adeel Ahmad


Author: Sanket Gupta

DevOps Specialist
