Post-Mortem on v4.1.9 release

Leo Chen
Jul 23, 2021

Date/Time: July 20, 2021, 12:00pm to 9:30pm

User Impact: Block time on shards 1, 2, and 3 increased to as much as 8 seconds, against a normal block time of 2 seconds.

Author: Leo Chen

Context:

On Jul 20, 2021, we decided to deploy the v4.1.9 release to internal nodes on shards 1, 2, and 3. The v4.1.9 release contains a potential fix for the out-of-memory (OOM) issue at the libp2p layer. It was not intended to be a public release and was assumed to be low impact, which is why we didn’t announce it on any public channels. The code had been pushed to testnet first, where it didn’t cause any block latency issues.

We upgraded the internal nodes on shard 2 first. After that upgrade, I didn’t see any issue on shard 2 in watchdog or the explorer, so shard 3 and shard 1 were upgraded next. Validators then reported that block latency on shard 3 had increased to as much as 3 seconds. I guessed this was due to the view change that is necessarily triggered during an upgrade: the new leader was in a different AWS region, which may have increased the block latency. We expected that after the epoch change the original leader would resume the role and the block time would return to normal, so we didn’t roll back the release immediately after the report.

However, after the epoch change, validators reported even higher block latency on shard 1 and shard 2, up to 8 seconds. We then decided to roll back to the v4.1.8 release on those shards. After the rollback, all shards returned to the normal block time of 2 seconds.

Lessons learned:

  • Roll back faster. If one shard doesn’t exhibit the problem, that doesn’t mean the release has no issues.
  • A leader in a different AWS region shouldn’t affect the block time.
  • Always start with one or two shards, neither of which should be shard 0, to validate the release.
  • Always make announcements to the validator community, even for internal node upgrades, to prepare for any potential impact on the network.
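
The first and third lessons can be sketched as a staged rollout gate. This is a minimal illustration, not our actual deployment tooling. The function names, the callbacks (`upgrade`, `rollback`, `observe`), and the 3-second threshold are all hypothetical. Only the 2-second normal block time and the "not shard 0" rule come from this post-mortem.

```python
from statistics import median
from typing import Callable, Iterable, List

NORMAL_BLOCK_TIME_S = 2.0  # normal block time stated in this post-mortem
MAX_BLOCK_TIME_S = 3.0     # hypothetical rollback threshold (an assumption)


def should_roll_back(block_times_s: Iterable[float],
                     threshold_s: float = MAX_BLOCK_TIME_S) -> bool:
    """Return True when the median observed block time exceeds the threshold."""
    samples = list(block_times_s)
    if not samples:
        # No data means the release cannot be validated; treat it as a failure.
        return True
    return median(samples) > threshold_s


def staged_rollout(shards: List[int],
                   upgrade: Callable[[int], None],
                   rollback: Callable[[int], None],
                   observe: Callable[[int], Iterable[float]]) -> List[int]:
    """Upgrade shards one at a time and roll everything back on a regression.

    Returns the shards left on the new release (empty after a rollback).
    """
    if 0 in shards:
        raise ValueError("shard 0 must not be used to validate a release")
    upgraded: List[int] = []
    for shard in shards:
        upgrade(shard)
        upgraded.append(shard)
        # Re-check every upgraded shard, not only the newest one: a shard
        # that looked healthy earlier does not prove the release is safe.
        for candidate in upgraded:
            if should_roll_back(observe(candidate)):
                for done in reversed(upgraded):
                    rollback(done)
                return []
    return upgraded
```

In practice `observe` would read recent block times from whatever monitoring is available, and it would be called only after waiting long enough for a view change or epoch change to take effect. The sketch leaves that wait out.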

Todo:

  • Deep dive into the p2p issue introduced by the v4.1.9 release (Alex)
  • Roll out changes in smaller steps to minimize impact and make triage easier (Leo)
  • Standardize the internal node upgrade process with announcements (Leo)
