Incident Report – July 16th, 2021

Summary

On July 16th, an incident affected some of our validators. Fewer than 5% of all validators with StakeMyHNT were affected.

The first validator was affected at 2:30 PM; the remaining affected validators failed between 2:30 and 5:30 PM.

What was the cause?

Some validators were incorrectly updated to v1.0.10. This was a broken release from the Helium team that caused endless boot loops.

We use canary release groups, as well as manual updates. We don’t automatically roll out a new version when it’s released until we’ve manually confirmed it to be stable. Between Helium’s quick announcement that v1.0.10 was flawed and our own testing, our validators should never have been running this version.

After we confirmed v1.0.11 was stable, we issued the command to our backplane to update all systems to the latest release. Here’s where something strange happened. Most systems correctly identified the latest release as v1.0.11. However, the systems affected by this incident incorrectly identified the latest release as v1.0.10. This is what caused the validators to fail.

After reviewing the raw network logs, it’s clear the affected validators pulled the wrong version. Why they found v1.0.10 and not v1.0.11 is still not clear. We believe a cache sitting somewhere between us and quay.io is to blame.
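We can’t reproduce the stale resolution on demand, but a cheap sanity check is to compare what “latest” actually resolves to against the release we explicitly reviewed before rolling anything out. Below is a minimal sketch of that check, assuming a standard Docker workflow; the image path and tag values are placeholders, not the exact references our backplane uses.

```python
import subprocess

# Placeholder image path and tag; the real repository and tag format may differ.
IMAGE = "quay.io/team-helium/validator"
EXPECTED_TAG = "1.0.11"

def image_id(ref: str) -> str:
    """Pull a reference and return the ID of the image the registry actually served."""
    subprocess.run(["docker", "pull", ref], check=True, capture_output=True)
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.Id}}", ref],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

# If a cache between us and quay.io serves a stale manifest for "latest",
# these two IDs will disagree and the rollout can be halted before it starts.
if image_id(f"{IMAGE}:latest") != image_id(f"{IMAGE}:{EXPECTED_TAG}"):
    raise RuntimeError("'latest' did not resolve to the reviewed release")
```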

Why didn’t your monitoring system detect this sooner?

Our backplane has several mechanisms in place to automatically restart a validator if it gets stuck. It will attempt to restart the validator twice. If that doesn’t solve the problem, it terminates the entire instance and spins up a brand-new machine on AWS. We get alerted if these automatic steps are not successful.

The problem is that v1.0.10 caused a boot loop. The backplane thought it was still on its first restart attempt and was patiently waiting for the machine to become active again.

What are you doing about this moving forward?

We’ve made several changes to our backplane. These include adding timeouts to restart attempts and being more aggressive about terminating an instance and spinning up a replacement (which would have resolved this incident).
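For illustration, here is a minimal sketch of what that escalation looks like with a timeout in place. The `validator` handle and its methods are hypothetical stand-ins for our backplane’s internals, and the timeout values are illustrative.

```python
import time

RESTART_TIMEOUT = 300      # illustrative: seconds to wait for a restart to come back healthy
MAX_RESTART_ATTEMPTS = 2   # after this, stop restarting and replace the instance

def recover(validator) -> None:
    """Escalate from in-place restarts to full instance replacement.

    `validator` is a hypothetical handle exposing restart(), is_healthy(),
    replace_instance(), and alert(); the real backplane interface differs.
    """
    for _ in range(MAX_RESTART_ATTEMPTS):
        validator.restart()
        deadline = time.monotonic() + RESTART_TIMEOUT
        while time.monotonic() < deadline:
            if validator.is_healthy():
                return                      # recovered, nothing more to do
            time.sleep(10)
        # The timeout is the key change: a boot-looping node never reports
        # healthy, so without a deadline this wait would never end.
    # Restarts didn't help: terminate the instance and provision a new one.
    validator.replace_instance()
    if not validator.is_healthy():
        validator.alert("automatic recovery failed")
```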

When issuing an update, we no longer rely on the “latest” tag. Instead, we always pin a specific version for download.
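As a rough illustration, the reference we hand to each node is now built from an explicit, reviewed version string, and anything floating is rejected outright. The repository path below is a placeholder.

```python
# Placeholder repository path; the real image location may differ.
IMAGE = "quay.io/team-helium/validator"

def image_ref(version: str) -> str:
    """Build the exact reference the backplane tells every node to pull."""
    if version in ("", "latest"):
        raise ValueError("floating tags are not allowed; pin a reviewed release")
    return f"{IMAGE}:{version}"

# image_ref("1.0.11") -> "quay.io/team-helium/validator:1.0.11"
```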

We now receive alerts immediately at the first sign a machine is having trouble.

We’re also rewriting the logic that decides when a validator is safe to update. Updates have brought significant performance upgrades and important features, but updating is also one of the most vulnerable moments for a validator. We found that a validator sometimes incorrectly identified itself as being out of consensus when it was actually in consensus. The false reading happens seemingly at random, but never twice in a row. We now require 10 consecutive system calls reporting out of consensus before moving forward with an update.
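Sketched in Python, that gate looks roughly like the following. The `is_in_consensus` callable is a hypothetical wrapper around the node’s own status check, and the poll interval is illustrative; only the 10-in-a-row requirement reflects the actual policy.

```python
import time

REQUIRED_CONSECUTIVE = 10   # matches the policy described above
POLL_INTERVAL = 30          # illustrative pause between status calls, in seconds

def safe_to_update(is_in_consensus) -> bool:
    """Return True only after 10 back-to-back 'out of consensus' reports."""
    for call in range(REQUIRED_CONSECUTIVE):
        if is_in_consensus():
            # Because the spurious reading never occurs twice in a row,
            # a single in-consensus result is enough to block the update.
            return False
        if call < REQUIRED_CONSECUTIVE - 1:
            time.sleep(POLL_INTERVAL)
    return True
```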

Downtime for the affected validators ranged from 1 to 8 hours. We have added a free day to the subscription of every affected validator.

We appreciate you selecting us and understand incidents like this can negatively impact your trust in us. We will continue to do all we can moving forward to provide the best possible quality of service.

If you have any comments, concerns, or suggestions, we encourage you to post a comment or email us at [email protected].
