100% Uptime on Harmony MainNet

Leo Chen
2 min read · Oct 25, 2019


Introduction

100% uptime is Harmony’s top engineering goal in Q4/2019. 100% uptime means the blockchain network stays up without any downtime, where downtime means the consensus of any shard stops making progress. It doesn’t mean that no single node ever goes down, as the blockchain network is designed to be resilient to single points of failure. Note that uptime should also be measured from the end user’s perspective, which means the usage of the blockchain shouldn’t be impacted by any node failure.

As Harmony is an FBFT-based blockchain, a leader change is inevitable whenever the leader goes offline or turns malicious. We have implemented a complete view change algorithm, and it has been running on MainNet for 100+ days. During a leader change, consensus may stall and it may take some time to reach consensus again. Thus, uptime allows up to 10 minutes to reach consensus; taking longer than 10 minutes to reach consensus is deemed network/shard downtime.
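
To make the 10-minute rule concrete, here is a minimal monitoring sketch in Go, not the team’s production monitor. It polls a shard’s JSON-RPC endpoint for the latest block height and flags downtime when the height has not advanced for more than 10 minutes, which matches the end-user view of liveness. The endpoint URL, the 9500 port, and the hmy_blockNumber method name are assumptions for illustration; adjust them to the actual node configuration.

```go
// Minimal sketch of the 10-minute downtime rule (not the team's production
// monitor). Assumptions: the shard answers JSON-RPC on rpcURL and exposes an
// eth_blockNumber-style method named "hmy_blockNumber" returning a hex string.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

const downtimeThreshold = 10 * time.Minute

// latestBlockHeight asks the shard's RPC endpoint for its current block height.
func latestBlockHeight(rpcURL string) (uint64, error) {
	req := []byte(`{"jsonrpc":"2.0","method":"hmy_blockNumber","params":[],"id":1}`)
	resp, err := http.Post(rpcURL, "application/json", bytes.NewReader(req))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var out struct {
		Result string `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimPrefix(out.Result, "0x"), 16, 64)
}

// monitorShard flags downtime when the block height stops advancing for
// longer than the 10-minute threshold.
func monitorShard(rpcURL string) {
	var lastHeight uint64
	lastProgress := time.Now()

	for range time.Tick(30 * time.Second) {
		if h, err := latestBlockHeight(rpcURL); err == nil && h > lastHeight {
			lastHeight = h
			lastProgress = time.Now()
		}
		if stalled := time.Since(lastProgress); stalled > downtimeThreshold {
			fmt.Printf("DOWNTIME: no new block for %v (last height %d)\n", stalled, lastHeight)
		}
	}
}

func main() {
	monitorShard("http://localhost:9500") // assumed local RPC port
}
```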

Stability Factors

Based on our post-mortems in Q3, the following stability factors/issues have to be considered when we discuss 100% uptime.

Protocol stability

  • Consensus analysis on all error logs
  • View change has to be able to converge on time
  • State syncing should be resilient to out-of-sync peers
  • Beacon chain syncing
  • Bad block issues

Network layer stability

  • Bootnode network has to be resilient and elastic
  • Libp2p bugfixes shall be applied actively
  • Libp2p pubsub scalability issue
  • Peer discovery stability

Client program stability

  • OOM (out-of-memory) crashes
  • High CPU usage
  • Idle threads
  • Node.sh out of sync with node binary

Network architecture stability

  • RPC endpoints resilience (an illustrative failover sketch follows this list)
  • DNS single-point-of-failure
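
As an illustration of the two factors above, and not Harmony’s actual infrastructure, the sketch below shows client-side failover across several RPC endpoints so that no single endpoint or DNS name becomes a single point of failure. The endpoint URLs and the RPC method are placeholders.

```go
// Illustrative client-side failover across multiple RPC endpoints, so a
// single endpoint or DNS record is not a single point of failure.
// The endpoint list is a placeholder, not a real Harmony deployment.
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"
)

// failoverCall posts a JSON-RPC request to each endpoint until one answers.
func failoverCall(endpoints []string, reqBody []byte) ([]byte, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	for _, url := range endpoints {
		resp, err := client.Post(url, "application/json", bytes.NewReader(reqBody))
		if err != nil {
			continue // endpoint unreachable: try the next one
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if readErr == nil && resp.StatusCode == http.StatusOK {
			return body, nil
		}
	}
	return nil, errors.New("all RPC endpoints failed")
}

func main() {
	endpoints := []string{ // hypothetical endpoints on different hosts
		"http://rpc-a.example.com:9500",
		"http://rpc-b.example.com:9500",
	}
	req := []byte(`{"jsonrpc":"2.0","method":"hmy_blockNumber","params":[],"id":1}`)
	if body, err := failoverCall(endpoints, req); err == nil {
		fmt.Println(string(body))
	} else {
		fmt.Println(err)
	}
}
```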

Node host stability

  • Out of Disk Space
  • Network jitter
  • Single CPU usage
  • Fast node recovery

Deployment stability

  • Staged update on all nodes
  • Decentralized rolling update

Test and Monitoring

To ensure 100% uptime, we plan the following initiatives on tooling and testing. For each individual stability factor, we will run a specific project to address the concern.

  • Log collection and analysis
  • Automated stress tests on all layers
  • Manual tests on hard-to-automate test cases
  • Monitoring of the nodes and auto-recovery (a minimal sketch follows this list)
  • Open source tools and training materials provided to node runners via Pangaea Academy
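
The following is a minimal sketch of the node monitoring and auto-recovery idea, not the team’s actual tooling. It assumes a Linux host where the node answers JSON-RPC on localhost:9500, keeps its data under /home/harmony, and runs as a systemd unit named harmony; node runners using node.sh would restart the node differently.

```go
// A minimal monitoring/auto-recovery sketch, not Harmony's actual tooling.
// Assumptions: Linux host, node RPC on localhost:9500, data under
// /home/harmony, node managed by a systemd unit named "harmony".
package main

import (
	"fmt"
	"net/http"
	"os/exec"
	"strings"
	"syscall"
	"time"
)

// rpcAlive reports whether the local node still answers its RPC port.
func rpcAlive() bool {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Post("http://localhost:9500",
		"application/json",
		strings.NewReader(`{"jsonrpc":"2.0","method":"hmy_blockNumber","params":[],"id":1}`))
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// diskFreeFraction returns free/total space of the node's data volume.
func diskFreeFraction(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return float64(st.Bavail) / float64(st.Blocks), nil
}

func main() {
	for range time.Tick(time.Minute) {
		if free, err := diskFreeFraction("/home/harmony"); err == nil && free < 0.05 {
			fmt.Println("ALERT: less than 5% disk space left")
		}
		if !rpcAlive() {
			fmt.Println("node unresponsive, restarting")
			// Auto-recovery: restart the node service (assumed systemd unit name).
			_ = exec.Command("systemctl", "restart", "harmony").Run()
		}
	}
}
```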

Engineering Plan

The engineering plan is still a work in progress and will be updated.

Phase 1 — the collection of issues

  • Auto log collection from testnet
  • Auto log collection from mainnet
  • Auto log analysis: Warning, Error, Bad blocks (a minimal scanner is sketched after this list)
  • Coredump collection
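
A minimal version of the log-analysis step above could look like the sketch below: scan a node’s log file and count warning, error, and bad-block lines. The log path and the exact message patterns are assumptions; Harmony’s structured log output may use different field names.

```go
// A minimal sketch of the "auto log analysis" step: scan a node log file and
// count Warning, Error, and bad-block lines. The log path and the message
// patterns are assumptions; real node logs may be structured differently.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("latest/zerolog.log") // assumed log location
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.Contains(line, `"level":"error"`) || strings.Contains(line, "ERROR"):
			counts["error"]++
		case strings.Contains(line, `"level":"warn"`) || strings.Contains(line, "WARN"):
			counts["warning"]++
		}
		if strings.Contains(strings.ToLower(line), "bad block") {
			counts["bad block"]++
		}
	}
	fmt.Printf("warnings=%d errors=%d bad-blocks=%d\n",
		counts["warning"], counts["error"], counts["bad block"])
}
```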

Phase 2 — stress test to discover more issues

  • Tx Benchmark on testnet
  • Release Test Automation

Phase 3 — prioritization

  • Issue analysis
  • Review old issues in SSS and COE projects
