Introduction
100% uptime is Harmony’s top engineering goal in Q4/2019. 100% uptime means the blockchain network stays up without any downtime, where downtime means the consensus of any shard has stopped. It doesn’t mean that no single node ever goes down, since the blockchain network is designed to be resilient to single points of failure. Note that uptime should also be measured from the end user’s perspective: the use of the blockchain shouldn’t be impacted by any node failure.
Since Harmony is an FBFT-based blockchain, a leader change is inevitable whenever the leader goes offline or turns malicious. We have implemented a complete view change algorithm, which has been running on MainNet for 100+ days. During a leader change, consensus may stall and take some time to be reached again. Thus, uptime allows up to 10 minutes to reach consensus; taking longer than 10 minutes to reach consensus is deemed network/shard downtime.
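As a rough illustration of that 10-minute rule, here is a minimal monitoring sketch in Go. The latestBlockTime helper is hypothetical (standing in for an RPC query); this is not Harmony’s actual monitoring code.

```go
package main

import (
	"log"
	"time"
)

// downtimeThreshold mirrors the definition above: taking longer than
// 10 minutes to reach consensus counts as shard downtime.
const downtimeThreshold = 10 * time.Minute

// latestBlockTime is a hypothetical helper that would query a shard's RPC
// endpoint for the timestamp of its most recent block.
func latestBlockTime(shardID uint32) (time.Time, error) {
	// ... query the shard's RPC endpoint here ...
	return time.Now(), nil
}

func monitorShard(shardID uint32) {
	for range time.Tick(30 * time.Second) {
		ts, err := latestBlockTime(shardID)
		if err != nil {
			log.Printf("shard %d: cannot fetch latest block: %v", shardID, err)
			continue
		}
		if elapsed := time.Since(ts); elapsed > downtimeThreshold {
			log.Printf("shard %d: no consensus for %v, counted as downtime", shardID, elapsed)
		}
	}
}

func main() {
	monitorShard(0)
}
```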

Stability Factors
Based on our post-mortems in Q3, the following stability factors/issues have to be considered when discussing 100% uptime.
Protocol stability
- Consensus analysis on all error logs
- View change has to be able to converge on time
- State syncing should be resilient to out-of-sync peers (see the sketch after this list)
- Beacon chain syncing
- Bad block issues
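For the state-syncing item above, a minimal sketch of the intended resilience, assuming a hypothetical Peer interface rather than Harmony’s real sync API: peers that are behind the target height or that fail mid-download are simply skipped.

```go
package statesync

import (
	"errors"
	"log"
)

// Peer is a hypothetical view of a sync peer: it reports its current block
// height and can serve a range of blocks.
type Peer interface {
	Height() (uint64, error)
	FetchBlocks(from, to uint64) ([][]byte, error)
}

// SyncFrom tries each peer in turn and skips peers that are behind the target
// height or that fail mid-download, so a single out-of-sync peer cannot stall
// state syncing.
func SyncFrom(peers []Peer, ourHeight, target uint64) ([][]byte, error) {
	for _, p := range peers {
		h, err := p.Height()
		if err != nil || h < target {
			continue // unreachable or out-of-sync peer: skip it
		}
		blocks, err := p.FetchBlocks(ourHeight+1, target)
		if err != nil {
			log.Printf("peer failed mid-download, trying next: %v", err)
			continue
		}
		return blocks, nil
	}
	return nil, errors.New("no in-sync peer available")
}
```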
Network layer stability
- Bootnode network has to be resilient and elastic (example after this list)
- Libp2p bugfixes shall be applied actively
- Libp2p pubsub scalability issue
- Peer discovery stability
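For the bootnode item above, a simple outside-in reachability probe is one way to keep an eye on the bootnode network. The addresses below are placeholders, not the real bootnode endpoints.

```go
package main

import (
	"log"
	"net"
	"time"
)

// bootnodes is a placeholder list of host:port pairs; the real addresses live
// in the node configuration.
var bootnodes = []string{
	"bootnode1.example.org:9876",
	"bootnode2.example.org:9876",
}

// checkBootnodes dials every bootnode with a short timeout and logs the ones
// that are unreachable, a cheap outside-in view of bootnode resilience.
func checkBootnodes() {
	for _, addr := range bootnodes {
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			log.Printf("bootnode %s unreachable: %v", addr, err)
			continue
		}
		conn.Close()
		log.Printf("bootnode %s reachable", addr)
	}
}

func main() {
	for range time.Tick(time.Minute) {
		checkBootnodes()
	}
}
```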
Client program stability
- Out-of-memory (OOM) crashes (sketched after this list)
- High CPU usage
- Idle threads
- Node.sh out of sync with node binary
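The OOM and idle-thread items above can be caught early with in-process metrics. A minimal sketch using only Go’s runtime package; the 4GB heap limit is an arbitrary example, not a Harmony setting.

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// watchResources periodically logs heap usage and goroutine count so that OOM
// conditions and thread leaks show up in the logs before the process dies.
func watchResources(heapLimitMB uint64) {
	var m runtime.MemStats
	for range time.Tick(30 * time.Second) {
		runtime.ReadMemStats(&m)
		heapMB := m.HeapAlloc / (1 << 20)
		log.Printf("heap=%dMB goroutines=%d", heapMB, runtime.NumGoroutine())
		if heapMB > heapLimitMB {
			log.Printf("WARNING: heap above %dMB, possible OOM risk", heapLimitMB)
		}
	}
}

func main() {
	watchResources(4096) // 4GB soft limit, an arbitrary example
}
```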
Network architecture stability
- RPC endpoint resilience (example after this list)
- DNS single-point-of-failure
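For RPC resilience and the DNS single point of failure, one client-side approach is to spread requests over several endpoints and fail over on error; the endpoint URLs below are placeholders.

```go
package failover

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// endpoints is a placeholder list of API endpoints; querying more than one
// host avoids depending on a single DNS name.
var endpoints = []string{
	"https://api1.example.org",
	"https://api2.example.org",
}

var client = &http.Client{Timeout: 5 * time.Second}

// Call posts a JSON-RPC body to the first healthy endpoint and falls back to
// the next one on failure.
func Call(body []byte) (*http.Response, error) {
	var lastErr error
	for _, ep := range endpoints {
		resp, err := client.Post(ep, "application/json", bytes.NewReader(body))
		if err == nil && resp.StatusCode == http.StatusOK {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			err = fmt.Errorf("unexpected status %s", resp.Status)
		}
		lastErr = err
	}
	return nil, fmt.Errorf("all endpoints failed, last error: %v", lastErr)
}
```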
Node host stability
- Out of disk space (see the sketch after this list)
- Network jitter
- Single CPU usage
- Fast node recovery
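A disk-space check along these lines could feed the node-host monitoring; the data directory path is only an example, and syscall.Statfs assumes a Linux host.

```go
package main

import (
	"log"
	"syscall"
	"time"
)

// checkDiskSpace warns when free space under the node's data directory drops
// below minFreeGB; syscall.Statfs assumes a Linux host.
func checkDiskSpace(path string, minFreeGB uint64) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		log.Printf("statfs %s: %v", path, err)
		return
	}
	freeGB := st.Bavail * uint64(st.Bsize) / (1 << 30)
	if freeGB < minFreeGB {
		log.Printf("WARNING: only %dGB free under %s", freeGB, path)
	}
}

func main() {
	for range time.Tick(time.Minute) {
		checkDiskSpace("/data/harmony", 10) // path is only an example
	}
}
```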
Deployment stability
- Staged update on all nodes (sketched after this list)
- Decentralized rolling update
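A sketch of a staged rollout, assuming hypothetical updateNode and isHealthy hooks: nodes are upgraded in small batches and the rollout halts if a batch does not come back healthy.

```go
package rollout

import (
	"fmt"
	"time"
)

// updateNode and isHealthy are hypothetical hooks: the first pushes the new
// release to a host, the second checks that the node is back in consensus.
func updateNode(host string) error { return nil }
func isHealthy(host string) bool   { return true }

// StagedUpdate upgrades nodes in small batches and halts as soon as a batch
// fails to come back healthy, so a bad release never reaches the whole fleet.
func StagedUpdate(hosts []string, batchSize int) error {
	for i := 0; i < len(hosts); i += batchSize {
		end := i + batchSize
		if end > len(hosts) {
			end = len(hosts)
		}
		for _, h := range hosts[i:end] {
			if err := updateNode(h); err != nil {
				return fmt.Errorf("update %s: %v", h, err)
			}
		}
		time.Sleep(2 * time.Minute) // give the batch time to rejoin consensus
		for _, h := range hosts[i:end] {
			if !isHealthy(h) {
				return fmt.Errorf("node %s unhealthy after update, halting rollout", h)
			}
		}
	}
	return nil
}
```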
Testing and Monitoring
To ensure 100% uptime, we plan the following initiatives on tooling and testing. For each individual stability factor, we will run a specific project to address the concern.
- Log collection and analysis
- Automated stress tests on all layers
- Manual tests on hard-to-automate test cases
- Monitoring of the nodes and auto-recovery (see the sketch after this list)
- Open source tools and training materials provided to node runners via Pangaea Academy
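A minimal auto-recovery watchdog, assuming the node exposes a local HTTP RPC port and is managed by a systemd unit named harmony (both assumptions, not the documented setup):

```go
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

// nodeAlive is a minimal liveness probe; the local RPC port is an assumption.
func nodeAlive() bool {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get("http://127.0.0.1:9500")
	if err != nil {
		return false
	}
	resp.Body.Close()
	return true
}

// watchdog restarts the node process when the liveness probe fails; the
// systemd unit name is illustrative, a real host might use node.sh instead.
func watchdog() {
	for range time.Tick(time.Minute) {
		if nodeAlive() {
			continue
		}
		log.Println("node unresponsive, restarting")
		if err := exec.Command("systemctl", "restart", "harmony").Run(); err != nil {
			log.Printf("restart failed: %v", err)
		}
	}
}

func main() { watchdog() }
```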
Engineering Plan
The engineering plan is still a work in progress and will be updated.
Phase 1 — the collection of issues
- Auto log collection from testnet
- Auto log collection from mainnet
- Auto log analysis: Warning, Error, Bad blocks (example after this list)
- Coredump collection
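A first cut of the log analysis could be as simple as counting the interesting entry types. The [WARN]/[ERROR] markers and the "bad block" string below are assumed log formats, not necessarily the exact strings the node emits.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Scan a node log given on the command line and count the entry types that
// phase 1 cares about: warnings, errors, and bad-block reports.
func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	var warnings, errors, badBlocks int
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "[WARN]") {
			warnings++
		}
		if strings.Contains(line, "[ERROR]") {
			errors++
		}
		if strings.Contains(line, "bad block") {
			badBlocks++
		}
	}
	fmt.Printf("warnings=%d errors=%d bad blocks=%d\n", warnings, errors, badBlocks)
}
```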
Phase 2 — stress test to discover more issues
- Tx Benchmark on testnet (sketched after this list)
- Release Test Automation
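For the transaction benchmark, the concurrency pattern matters more than the payload. The sketch below hammers a single JSON-RPC endpoint with concurrent workers; the endpoint, port, and method name are assumptions, and a real benchmark would submit signed transactions instead.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// endpoint and payload are placeholders: a real benchmark would submit signed
// transactions, but the concurrency pattern is the same for any JSON-RPC call.
const endpoint = "http://localhost:9500"

var payload = []byte(`{"jsonrpc":"2.0","id":1,"method":"hmy_blockNumber","params":[]}`)

func worker(n int, wg *sync.WaitGroup, client *http.Client) {
	defer wg.Done()
	for i := 0; i < n; i++ {
		resp, err := client.Post(endpoint, "application/json", bytes.NewReader(payload))
		if err != nil {
			continue
		}
		resp.Body.Close()
	}
}

func main() {
	const workers, reqsPerWorker = 16, 100
	client := &http.Client{Timeout: 10 * time.Second}

	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go worker(reqsPerWorker, &wg, client)
	}
	wg.Wait()

	total := workers * reqsPerWorker
	fmt.Printf("%d requests in %v\n", total, time.Since(start))
}
```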