Recovering a broken shard of an Elasticsearch index

Kirill A. Korinsky
Dec 14, 2018

Sometimes, when you run Elasticsearch, you end up with a broken shard.

There can be different root causes, for example:

  • the JVM was killed by the OOM killer for some reason;
  • a hardware issue that made you lose something on disk;
  • a power failure;
  • Elasticsearch bugs...

Anyway, it may happen. And it will happen.

The good news is that Elasticsearch is very tolerant of this sort of issue and can often fix itself. But it can't if there is no replica of the shard.

You increase the probability of this by setting index.translog.durability to async, although in some cases that setting improves write speed. I don't think you really need it, but if you do… ;)
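For reference, here is a minimal sketch of switching an index back to the safer per-request durability. It assumes a node reachable at http://localhost:9200; the function names are made up for this example, and nothing is sent until you call set_translog_durability yourself.

```shell
# Assumption: where the cluster listens; override with ES_URL=...
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the settings body for a given durability mode.
# "request" (the default) fsyncs the translog before a write is acknowledged;
# "async" fsyncs in the background and can lose the latest writes on a crash.
translog_body() {
  case "$1" in
    request|async)
      printf '{"index":{"translog":{"durability":"%s"}}}\n' "$1" ;;
    *)
      echo "durability must be 'request' or 'async'" >&2
      return 1 ;;
  esac
}

# Apply the setting to one index (hypothetical helper, not called here).
# $1 = index name, $2 = durability mode
set_translog_durability() {
  body="$(translog_body "$2")" || return 1
  curl -sS -X PUT "$ES_URL/$1/_settings" \
    -H 'Content-Type: application/json' -d "$body"
}
```

With a placeholder index name, set_translog_durability my-index request would put the index back on the default mode.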

Anyway, one day you will see red status, and if you run GET _cluster/allocation/explain you may find that one of the shards has a problem and the cluster has no other copy of it…
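From a terminal, that diagnosis could look roughly like this. It is a sketch under assumptions: a node on http://localhost:9200, and helper names invented for this post. The explain call is given an index and shard number so that it reports on the shard you care about instead of picking an arbitrary unassigned one.

```shell
# Assumption: where the cluster listens; override with ES_URL=...
ES_URL="${ES_URL:-http://localhost:9200}"

# Body for the allocation explain API: which shard copy to explain.
# $1 = index name, $2 = shard number, $3 = true for the primary (default)
explain_body() {
  printf '{"index":"%s","shard":%s,"primary":%s}\n' "$1" "$2" "${3:-true}"
}

# List shards that are not started, with the reason they are unassigned.
list_problem_shards() {
  curl -sS "$ES_URL/_cat/shards?h=index,shard,prirep,state,unassigned.reason" \
    | grep -v STARTED
}

# Ask the cluster why a given shard is not allocated.
explain_shard() {
  curl -sS -X GET "$ES_URL/_cluster/allocation/explain?pretty" \
    -H 'Content-Type: application/json' -d "$(explain_body "$@")"
}
```

Run list_problem_shards first to get the index name and shard number, then pass them to explain_shard.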

OK, what should you do?

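You can ask the cluster to allocate an empty primary for this shard, using the allocate_empty_primary command of POST _cluster/reroute. After that, the shard is recreated. Yes, you will have green status again, but you lose your data. You can use allocate_stale_primary instead if a node that left the cluster may still hold the missing shard.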
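The snippet that originally sat here did not survive, so what follows is my reconstruction of such a reroute call, not the exact original. It assumes a node on http://localhost:9200; the index, shard, and node names you pass in are yours, and the helper names are invented. Both commands require accept_data_loss to be set explicitly, which tells you something.

```shell
# Assumption: where the cluster listens; override with ES_URL=...
ES_URL="${ES_URL:-http://localhost:9200}"

# Build a reroute body.
# $1 = allocate_empty_primary | allocate_stale_primary
# $2 = index name, $3 = shard number, $4 = name of the target node
reroute_body() {
  case "$1" in
    allocate_empty_primary|allocate_stale_primary) ;;
    *) echo "unsupported command: $1" >&2; return 1 ;;
  esac
  printf '{"commands":[{"%s":{"index":"%s","shard":%s,"node":"%s","accept_data_loss":true}}]}\n' \
    "$1" "$2" "$3" "$4"
}

# Send the command to the cluster (hypothetical helper, not called here).
reroute() {
  body="$(reroute_body "$@")" || return 1
  curl -sS -X POST "$ES_URL/_cluster/reroute?pretty" \
    -H 'Content-Type: application/json' -d "$body"
}
```

For example, reroute allocate_stale_primary my-index 66 some-node tries the stale copy on that node, while allocate_empty_primary throws the shard's data away for good.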

But if no node has left the cluster, you had only one copy of the shard, and it was corrupted… you still have an option before you lose your data.

In the explain output you can find the path to the corrupted Lucene index. Something like SimpleFSDirectory@/var/lib/elasticsearch/some-node/nodes/0/indices/BYurcTitskeSexyYOBjiu1/66/index

If this folder contains more than just a few files of a dozen bytes each, you still have a chance. How?
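A small sketch for getting from the explain output to a look at the folder. The helper names are invented, and the parsing assumes the path contains no spaces and that anything after the first space (some versions append lock details there) can be dropped.

```shell
# Turn a value like "SimpleFSDirectory@/path/to/index" into the plain path.
# Assumption: the path itself contains no spaces.
lucene_path() {
  p="${1#*@}"               # drop everything up to and including the first "@"
  printf '%s\n' "${p%% *}"  # drop anything after the first space
}

# Show the files in the index folder with their sizes, plus the total.
inspect_index_dir() {
  ls -l "$1" && du -sh "$1"
}
```

If ls shows real segment files and not just a couple of tiny ones, it is worth continuing.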

Try to run:

sudo java -cp /usr/share/elasticsearch/lib/elasticsearch-*.jar:/usr/share/elasticsearch/lib/* org.apache.lucene.index.CheckIndex [path]

If it finds any errors, you can try to fix them. How? First of all, shut down the node on that machine. You can try to fix the index on the fly, but asking Elasticsearch to reread a corrupted index is really tricky and unsafe, so I suggest you don't do that and just stop the node.

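After you've stopped it, re-run the same command with the -exorcise option added after the path. Keep in mind that -exorcise repairs the index by dropping the segments it cannot read, so the documents in those segments are lost.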

If it can fix it, it will. And if it fixed something, you can start your node… and get green status.
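Put together, the stop, exorcise, start sequence could be scripted roughly like this. It is a sketch that assumes a package install under /usr/share/elasticsearch and a systemd unit called elasticsearch; both are assumptions about your setup, and the function names are invented.

```shell
# Assumption: location of the Elasticsearch jars; override with ES_LIB=...
ES_LIB="${ES_LIB:-/usr/share/elasticsearch/lib}"

# Run Lucene's CheckIndex on a shard's index folder.
# $1 = path to the index folder; any further arguments (e.g. -exorcise)
# are passed through after the path.
check_index() {
  dir="$1"; shift
  # The quotes keep the shell from expanding "*"; java expands it itself.
  java -cp "$ES_LIB/*" org.apache.lucene.index.CheckIndex "$dir" "$@"
}

# Stop the node, let CheckIndex drop the unreadable segments, start the node.
# Files written by -exorcise belong to the user who runs this, so make sure
# the user Elasticsearch runs as can still read them afterwards.
repair_shard() {
  systemctl stop elasticsearch || return 1
  check_index "$1" -exorcise
  # The node is started again whatever CheckIndex reported; read its output.
  systemctl start elasticsearch
}
```

Calling check_index with only the path gives the read-only check; repair_shard with the same path is the destructive variant.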

If it can't? Well… You can still write some code to try to fix your specific case… But I suggest copying the broken index to a safe place, starting everything back up, and removing this shard. Because you have a copy of the index, you can extract all the surviving parts and re-index them by hand.
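The "copy to a safe place" step can be as simple as a tar archive taken while the node is still stopped. A minimal sketch; the function name and the archive naming scheme are made up for this example.

```shell
# Archive a shard's index folder into a destination folder.
# $1 = index folder, $2 = destination folder (must already exist)
# Prints the path of the archive it wrote.
backup_index_dir() {
  src="${1%/}"   # tolerate a trailing slash
  archive="$2/shard-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
  # -C makes the archive contain just the folder name, not the full path.
  tar -C "$(dirname "$src")" -czf "$archive" "$(basename "$src")" || return 1
  printf '%s\n' "$archive"
}
```

Take the archive before anything else modifies the folder, so you always have the untouched files to go back to.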

If you think that your data is very important and you have no idea how to get it back… You can still drop me an email and I can try to help ;)

Kirill A. Korinsky

IT geek who loves to play with data. Would you like to contact me? Just drop an email to kirill@korins.ky