Introduction
100% uptime is Harmony’s top engineering goal in Q4/2019. 100% uptime means the blockchain network stays up without any downtime, where downtime means the consensus of any shard has stopped. It doesn’t mean that no single node ever goes down, since the blockchain network is designed to be resilient to single points of failure. Note that uptime should also be measured from the end user’s perspective: the use of the blockchain shouldn’t be impacted by any node failure.
Since Harmony is an FBFT-based blockchain, a leader change is inevitable whenever the leader goes offline or turns malicious. We have implemented a complete view change algorithm, which has been running on MainNet for 100+ days. During a leader change, consensus may stall and take some time to be reached again. Thus, uptime allows up to 10 minutes to reach consensus; taking longer than 10 minutes to reach consensus is deemed network/shard downtime.
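As a rough illustration of that 10-minute rule, here is a minimal monitoring sketch in Go. The latestBlockTime helper is hypothetical (standing in for an RPC query); this is not Harmony’s actual monitoring code.

```go
package main

import (
	"log"
	"time"
)

// downtimeThreshold mirrors the definition above: taking longer than
// 10 minutes to reach consensus counts as shard downtime.
const downtimeThreshold = 10 * time.Minute

// latestBlockTime is a hypothetical helper that would query a shard's RPC
// endpoint for the timestamp of its most recent block.
func latestBlockTime(shardID uint32) (time.Time, error) {
	// ... query the shard's RPC endpoint here ...
	return time.Now(), nil
}

func monitorShard(shardID uint32) {
	for range time.Tick(30 * time.Second) {
		ts, err := latestBlockTime(shardID)
		if err != nil {
			log.Printf("shard %d: cannot fetch latest block: %v", shardID, err)
			continue
		}
		if elapsed := time.Since(ts); elapsed > downtimeThreshold {
			log.Printf("shard %d: no consensus for %v, counted as downtime", shardID, elapsed)
		}
	}
}

func main() {
	monitorShard(0)
}
```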

Stability Factors
Based on our post-mortems in Q3, the following stability factors/issues have to be considered when discussing 100% uptime.
Protocol stability
- Consensus analysis on all error logs
- View change has to be able to converge on time
- State syncing should be resilient to out-of-sync peers (see the sketch after this list)
- Beacon chain syncing
- Bad block issues
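For the state-syncing item above, a minimal sketch of the intended resilience, assuming a hypothetical Peer interface rather than Harmony’s real sync API: peers that are behind the target height or that fail mid-download are simply skipped.

```go
package statesync

import (
	"errors"
	"log"
)

// Peer is a hypothetical view of a sync peer: it reports its current block
// height and can serve a range of blocks.
type Peer interface {
	Height() (uint64, error)
	FetchBlocks(from, to uint64) ([][]byte, error)
}

// SyncFrom tries each peer in turn and skips peers that are behind the target
// height or that fail mid-download, so a single out-of-sync peer cannot stall
// state syncing.
func SyncFrom(peers []Peer, ourHeight, target uint64) ([][]byte, error) {
	for _, p := range peers {
		h, err := p.Height()
		if err != nil || h < target {
			continue // unreachable or out-of-sync peer: skip it
		}
		blocks, err := p.FetchBlocks(ourHeight+1, target)
		if err != nil {
			log.Printf("peer failed mid-download, trying next: %v", err)
			continue
		}
		return blocks, nil
	}
	return nil, errors.New("no in-sync peer available")
}
```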
Network layer stability
- Bootnode network has to be resilient and elastic (example after this list)
- Libp2p bugfixes shall be applied actively
- Libp2p pubsub scalability issue
- Peer discovery stability
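For the bootnode item above, a simple outside-in reachability probe is one way to keep an eye on the bootnode network. The addresses below are placeholders, not the real bootnode endpoints.

```go
package main

import (
	"log"
	"net"
	"time"
)

// bootnodes is a placeholder list of host:port pairs; the real addresses live
// in the node configuration.
var bootnodes = []string{
	"bootnode1.example.org:9876",
	"bootnode2.example.org:9876",
}

// checkBootnodes dials every bootnode with a short timeout and logs the ones
// that are unreachable, a cheap outside-in view of bootnode resilience.
func checkBootnodes() {
	for _, addr := range bootnodes {
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			log.Printf("bootnode %s unreachable: %v", addr, err)
			continue
		}
		conn.Close()
		log.Printf("bootnode %s reachable", addr)
	}
}

func main() {
	for range time.Tick(time.Minute) {
		checkBootnodes()
	}
}
```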
Client program stability
- Out-of-memory (OOM) crashes (sketched after this list)
- High CPU usage
- Idle threads
- Node.sh out of sync with node binary
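The OOM and idle-thread items above can be caught early with in-process metrics. A minimal sketch using only Go’s runtime package; the 4GB heap limit is an arbitrary example, not a Harmony setting.

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// watchResources periodically logs heap usage and goroutine count so that OOM
// conditions and thread leaks show up in the logs before the process dies.
func watchResources(heapLimitMB uint64) {
	var m runtime.MemStats
	for range time.Tick(30 * time.Second) {
		runtime.ReadMemStats(&m)
		heapMB := m.HeapAlloc / (1 << 20)
		log.Printf("heap=%dMB goroutines=%d", heapMB, runtime.NumGoroutine())
		if heapMB > heapLimitMB {
			log.Printf("WARNING: heap above %dMB, possible OOM risk", heapLimitMB)
		}
	}
}

func main() {
	watchResources(4096) // 4GB soft limit, an arbitrary example
}
```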
Network architecture stability
- RPC endpoint resilience (example after this list)
- DNS single-point-of-failure
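For RPC resilience and the DNS single point of failure, one client-side approach is to spread requests over several endpoints and fail over on error; the endpoint URLs below are placeholders.

```go
package failover

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// endpoints is a placeholder list of API endpoints; querying more than one
// host avoids depending on a single DNS name.
var endpoints = []string{
	"https://api1.example.org",
	"https://api2.example.org",
}

var client = &http.Client{Timeout: 5 * time.Second}

// Call posts a JSON-RPC body to the first healthy endpoint and falls back to
// the next one on failure.
func Call(body []byte) (*http.Response, error) {
	var lastErr error
	for _, ep := range endpoints {
		resp, err := client.Post(ep, "application/json", bytes.NewReader(body))
		if err == nil && resp.StatusCode == http.StatusOK {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			err = fmt.Errorf("unexpected status %s", resp.Status)
		}
		lastErr = err
	}
	return nil, fmt.Errorf("all endpoints failed, last error: %v", lastErr)
}
```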
Node host stability
- Out of disk space (see the sketch after this list)
- Network jitter
- Single CPU usage
- Fast node recovery
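A disk-space check along these lines could feed the node-host monitoring; the data directory path is only an example, and syscall.Statfs assumes a Linux host.

```go
package main

import (
	"log"
	"syscall"
	"time"
)

// checkDiskSpace warns when free space under the node's data directory drops
// below minFreeGB; syscall.Statfs assumes a Linux host.
func checkDiskSpace(path string, minFreeGB uint64) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		log.Printf("statfs %s: %v", path, err)
		return
	}
	freeGB := st.Bavail * uint64(st.Bsize) / (1 << 30)
	if freeGB < minFreeGB {
		log.Printf("WARNING: only %dGB free under %s", freeGB, path)
	}
}

func main() {
	for range time.Tick(time.Minute) {
		checkDiskSpace("/data/harmony", 10) // path is only an example
	}
}
```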
Deployment stability
- Staged update on all nodes (sketched after this list)
- Decentralized rolling update
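A sketch of a staged rollout, assuming hypothetical updateNode and isHealthy hooks: nodes are upgraded in small batches and the rollout halts if a batch does not come back healthy.

```go
package rollout

import (
	"fmt"
	"time"
)

// updateNode and isHealthy are hypothetical hooks: the first pushes the new
// release to a host, the second checks that the node is back in consensus.
func updateNode(host string) error { return nil }
func isHealthy(host string) bool   { return true }

// StagedUpdate upgrades nodes in small batches and halts as soon as a batch
// fails to come back healthy, so a bad release never reaches the whole fleet.
func StagedUpdate(hosts []string, batchSize int) error {
	for i := 0; i < len(hosts); i += batchSize {
		end := i + batchSize
		if end > len(hosts) {
			end = len(hosts)
		}
		for _, h := range hosts[i:end] {
			if err := updateNode(h); err != nil {
				return fmt.Errorf("update %s: %v", h, err)
			}
		}
		time.Sleep(2 * time.Minute) // give the batch time to rejoin consensus
		for _, h := range hosts[i:end] {
			if !isHealthy(h) {
				return fmt.Errorf("node %s unhealthy after update, halting rollout", h)
			}
		}
	}
	return nil
}
```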
Testing and Monitoring
To ensure 100% uptime, we plan the following initiatives on tooling and testing. For each individual stability factor, we will run a specific project to address the concern.
- Log collection and analysis
- Automated stress tests on all layers
- Manual tests on hard-to-automate test cases
- Monitoring of the nodes and auto-recovery (see the sketch after this list)
- Open source tools and training materials provided to node runners via Pangaea Academy
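A minimal auto-recovery watchdog, assuming the node exposes a local HTTP RPC port and is managed by a systemd unit named harmony (both assumptions, not the documented setup):

```go
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

// nodeAlive is a minimal liveness probe; the local RPC port is an assumption.
func nodeAlive() bool {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get("http://127.0.0.1:9500")
	if err != nil {
		return false
	}
	resp.Body.Close()
	return true
}

// watchdog restarts the node process when the liveness probe fails; the
// systemd unit name is illustrative, a real host might use node.sh instead.
func watchdog() {
	for range time.Tick(time.Minute) {
		if nodeAlive() {
			continue
		}
		log.Println("node unresponsive, restarting")
		if err := exec.Command("systemctl", "restart", "harmony").Run(); err != nil {
			log.Printf("restart failed: %v", err)
		}
	}
}

func main() { watchdog() }
```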
Engineering Plan
The engineering plan is still a work in progress and will be updated.
Phase 1 — the collection of issues
- Auto log collection from testnet
- Auto log collection from mainnet
- Auto log analysis: Warning, Error, Bad blocks (example after this list)
- Coredump collection
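A first cut of the log analysis could be as simple as counting the interesting entry types. The [WARN]/[ERROR] markers and the "bad block" string below are assumed log formats, not necessarily the exact strings the node emits.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Scan a node log given on the command line and count the entry types that
// phase 1 cares about: warnings, errors, and bad-block reports.
func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	var warnings, errors, badBlocks int
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "[WARN]") {
			warnings++
		}
		if strings.Contains(line, "[ERROR]") {
			errors++
		}
		if strings.Contains(line, "bad block") {
			badBlocks++
		}
	}
	fmt.Printf("warnings=%d errors=%d bad blocks=%d\n", warnings, errors, badBlocks)
}
```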
Phase 2 — stress test to discover more issues
- Tx Benchmark on testnet (sketched after this list)
- Release Test Automation
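For the transaction benchmark, the concurrency pattern matters more than the payload. The sketch below hammers a single JSON-RPC endpoint with concurrent workers; the endpoint, port, and method name are assumptions, and a real benchmark would submit signed transactions instead.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// endpoint and payload are placeholders: a real benchmark would submit signed
// transactions, but the concurrency pattern is the same for any JSON-RPC call.
const endpoint = "http://localhost:9500"

var payload = []byte(`{"jsonrpc":"2.0","id":1,"method":"hmy_blockNumber","params":[]}`)

func worker(n int, wg *sync.WaitGroup, client *http.Client) {
	defer wg.Done()
	for i := 0; i < n; i++ {
		resp, err := client.Post(endpoint, "application/json", bytes.NewReader(payload))
		if err != nil {
			continue
		}
		resp.Body.Close()
	}
}

func main() {
	const workers, reqsPerWorker = 16, 100
	client := &http.Client{Timeout: 10 * time.Second}

	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go worker(reqsPerWorker, &wg, client)
	}
	wg.Wait()

	total := workers * reqsPerWorker
	fmt.Printf("%d requests in %v\n", total, time.Since(start))
}
```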