Maintenance mode on Elasticsearch clusters
Running Elasticsearch means that you'll need to perform maintenance on the nodes from time to time. Because Elasticsearch is a multi-node database, it tries to recover whenever a node goes down and quickly reassigns shards so that the cluster stays resilient against further failures. This guide describes the steps to take when performing maintenance operations on an Elasticsearch cluster.
Throughout this guide we'll use PowerShell, simply because it's easy to use. The commands shown can also be run from any other HTTP client, such as curl.
Why "maintenance mode"?
Elasticsearch in general has no simple "maintenance mode" switch. So what happens if a node goes down?
When a node that holds shards of an index goes missing, ES waits for a timeout (index.unassigned.node_left.delayed_timeout, one minute by default) and then starts to reallocate the affected shards to the remaining nodes. This is correct and useful behavior: if a node dies, the cluster needs to recover quickly from such a failure.
To bypass this automatic behavior during maintenance jobs, we could either increase the timeout or adjust the allocation of shards. As we don't want to guess timeout values (a maintenance job might take 5 minutes or 5 hours), this guide shows how to adjust the allocation of shards.
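For comparison, the timeout route that this guide does not take could look roughly like the sketch below. It assumes the cluster is reachable at https://localhost:9200 (a placeholder URL) with a certificate the client trusts, and it uses "5m" purely as an example value applied to all existing indices.

```powershell
# Assumption: placeholder URL, adjust to your environment.
$esUrl = "https://localhost:9200"

# Prompt for credentials (assumes security is enabled on the cluster).
$cred = Get-Credential

# Raise the delayed allocation timeout to an example value of five minutes.
$body = @{
    settings = @{
        "index.unassigned.node_left.delayed_timeout" = "5m"
    }
} | ConvertTo-Json

# _all targets every existing index; newly created indices keep the default.
Invoke-RestMethod -Method Put -Uri "$esUrl/_all/_settings" `
    -Credential $cred -ContentType "application/json" -Body $body
```

The drawback is visible right in the request: the value has to be chosen before you know how long the maintenance will actually take.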
Starting maintenance
Before actually doing your maintenance tasks, you should configure Elasticsearch so that it doesn't reassign shards that go offline because their nodes are down for maintenance.
This script will prompt for credentials (assuming you've enabled security on your ES installation) and send a cluster settings update to Elasticsearch. The update effectively disables the reallocation of shards by setting cluster.routing.allocation.enable to primaries, which limits allocation to primary shards only.
You're now set to do your maintenance tasks and reboot nodes as required without putting pressure on your overall system due to recovery jobs.
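A minimal sketch of such a script follows. It assumes the cluster is reachable at https://localhost:9200 (a placeholder URL) with a certificate the client trusts, and that the setting should be applied as a persistent cluster setting; adjust both to your setup.

```powershell
# Assumption: placeholder URL, adjust to your environment.
$esUrl = "https://localhost:9200"

# Prompt for credentials (assumes security is enabled on the cluster).
$cred = Get-Credential

# Allow allocation of primary shards only, so replicas that lived on
# offline nodes are not rebuilt elsewhere during maintenance.
# Assumption: applied as a persistent setting so it survives restarts.
$body = @{
    persistent = @{
        "cluster.routing.allocation.enable" = "primaries"
    }
} | ConvertTo-Json

Invoke-RestMethod -Method Put -Uri "$esUrl/_cluster/settings" `
    -Credential $cred -ContentType "application/json" -Body $body
```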
Leaving maintenance
Having completed your tasks, you'll want Elasticsearch to return to a state where it handles failures itself and keeps the data replicated as required by your configuration.
We're clearing the configuration value here (setting it to null) rather than writing an explicit value, so that the property is not configured at all unless your setup requires it. Going back to the system default is best here, as it avoids slow configuration drift over time.
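A minimal sketch of the corresponding reset, under the same assumptions as before (placeholder URL https://localhost:9200, trusted certificate, persistent cluster setting):

```powershell
# Assumption: placeholder URL, adjust to your environment.
$esUrl = "https://localhost:9200"

# Prompt for credentials (assumes security is enabled on the cluster).
$cred = Get-Credential

# A null value removes the setting, so the cluster falls back to its
# default and allocates all kinds of shards again.
$body = @{
    persistent = @{
        "cluster.routing.allocation.enable" = $null
    }
} | ConvertTo-Json

Invoke-RestMethod -Method Put -Uri "$esUrl/_cluster/settings" `
    -Credential $cred -ContentType "application/json" -Body $body
```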
So, that's it - quite simple.