Service Launch Checklist and ORR

Jun 25, 2021

Introduction

Beyond the development work, launching any service, application, or website requires serious consideration of security, scalability, stability, operation, and maintenance. Public-facing services raise the bar on those requirements even higher. From my years of experience launching services at Amazon, I have found it quite useful to draw up a checklist during the product planning phase and to sign off on it during the launch process. AWS is well known for its DevOps culture and its 24/7 on-call practice. The benefit of operational excellence is clear: it provides highly available, highly secure, and highly efficient service experiences to customers. Many lessons have been learned, and many post-mortems written, from years of failures and mistakes. Nowadays, any AWS team launching a new service needs to go through an “ORR” (Operational Readiness Review) process with principal engineers to ensure a smooth launch and to confirm that the team is operationally ready.

At Harmony, even though we focus on developing the core protocol of the public blockchain, we still need to provide and maintain a number of service infrastructures to serve users and dApps until the community can fully take them over. Those services include, but are not limited to, the blockchain explorer, the staking dashboard, and the Horizon bridge.

In this post, we’d like to share the launch checklist and best practices we followed to ensure the security, scalability, and operational excellence of those services. In short, this is an internal process document for planning and product launch. It consists of a launch checklist and a list for operational readiness review.

Any feedback or comments are welcome.

Launch checklist

The launch checklist mainly focuses on source code security, network security, and compute security.

code security check

The source code security check is the first step to follow during the development stage.

audit

internal audit of the service source code: nil pointers, crash handling, memory leaks, etc.
external audit of the service source code, with a security audit report
smart-contract audit, both internal and external

dependency

dependency security check, dependency package and library updates

credentials

secure credential handling, KMS support for private keys
no credentials stored in plaintext, strong encryption algorithm for the passcode
do not check any credentials into GitHub (monitor with GitGuardian)
trusted setup: use different credentials for different accounts to avoid a massive leak
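
As an illustration of the "no plaintext credentials" item, here is a minimal sketch of a pre-commit style scan. The three patterns are hypothetical examples chosen for this post, not the rules GitGuardian uses, and a real scanner covers far more cases. The 64-hex-digit pattern will also flag harmless values such as SHA-256 hashes, so treat hits as candidates for review.

```python
import re

# Hypothetical patterns, for illustration only.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    # 64 hex digits, optionally prefixed with 0x (the shape of a raw 256-bit key).
    "hex_private_key": re.compile(r"\b(?:0x)?[0-9a-fA-F]{64}\b"),
}


def find_secrets(text):
    """Return (line_number, pattern_name) for every line that looks like a secret."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

Running `find_secrets` over the staged files before each commit, and failing the commit when the list is non-empty, gives a cheap first line of defense in addition to repository monitoring.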

network security check

The network security checks assume AWS is used to host the service. For other cloud providers, please use the equivalent services.

load balancer

ELB load balancer setup for the backend
SSL certificate auto-renewal
DDoS protection on the load balancer, such as WAF
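
Auto-renewal can fail silently, so it may be worth checking the certificate that the endpoint actually serves. The sketch below is one possible check using only the Python standard library. The 14-day threshold and the function names are assumptions made for this example.

```python
import socket
import ssl
import time


def fetch_not_after(hostname, port=443, timeout=5.0):
    """Connect to the endpoint and return the notAfter field of its certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days left before a notAfter timestamp such as 'Jan  1 00:00:00 2030 GMT'."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def needs_renewal(not_after, threshold_days=14, now=None):
    """True when the certificate expires in fewer than threshold_days (assumed value)."""
    return days_until_expiry(not_after, now) < threshold_days
```

A scheduled job could call `needs_renewal(fetch_not_after(host))` for each public hostname and raise an alert when it returns `True`.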

frontend

host static frontend files on Netlify/IPFS/S3

KMS

KMS setup to manage credentials

backend

security group hardening for backend servers
restrict SSH access to backend servers to the jump host only
expose service port to trusted network only
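
The hardening items above can be audited mechanically. The following sketch takes inbound rules in the shape of the `IpPermissions` list returned by the EC2 DescribeSecurityGroups API and reports port ranges that are open to the whole internet. Treating 443 as the only port that may be public is an assumption of this example, and ICMP rules are not handled specially.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def world_open_ranges(ip_permissions, allowed_public_ports=frozenset({443})):
    """Return (from_port, to_port) ranges reachable from anywhere but not allowed to be."""
    findings = []
    for perm in ip_permissions:
        cidrs = {r["CidrIp"] for r in perm.get("IpRanges", [])}
        cidrs |= {r["CidrIpv6"] for r in perm.get("Ipv6Ranges", [])}
        if not cidrs & OPEN_CIDRS:
            continue  # rule is limited to specific networks
        if perm.get("IpProtocol") == "-1":
            low, high = 0, 65535  # "-1" means all protocols and all ports
        else:
            low, high = perm["FromPort"], perm["ToPort"]
        if any(port not in allowed_public_ports for port in range(low, high + 1)):
            findings.append((low, high))
    return findings
```

For example, a rule that opens port 22 to `0.0.0.0/0` is reported as `(22, 22)`, while the same port restricted to the jump host address is not.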

compute security

Compute security applies to the VPS used to host the service.

image

up-to-date LTS OS image with security packages updated
cron job set up for regular security package updates
up-to-date Docker base image

VPS

sufficient machine spec for the service
CloudWatch monitoring of the load on the VPS
user account protection: use separate accounts for testing and production environments

scalability consideration

Service request projection

the product owner shall project the scale of upcoming requests
projected number of requests per day
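
A daily projection is easier to size against once it is converted to requests per second. The sketch below does that arithmetic. The peak factor of 3 is an assumed rule of thumb for this example, and the right value depends on the traffic pattern of the service.

```python
SECONDS_PER_DAY = 86_400


def projected_rps(requests_per_day, peak_factor=3.0):
    """Return (average_rps, peak_rps) for a projected daily request count.

    peak_factor is an assumed ratio of peak traffic to average traffic.
    """
    average = requests_per_day / SECONDS_PER_DAY
    return average, average * peak_factor
```

With a projection of 8,640,000 requests per day, the average is 100 requests per second, and the backend should be sized for about 300 requests per second under the assumed peak factor.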

Load balancer setup

dedicated endpoint for the backend

Scale-up plan

how to monitor the scalability requirements of the service
the plan for handling scale-up issues
mock-up stress test on the services
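
A mock-up stress test does not need heavy tooling to be useful. The sketch below is one way to drive concurrent calls and summarize the result. It is not a replacement for a dedicated load-testing tool. The `call` argument is any zero-argument function that performs one request and raises on failure, and the default request count and concurrency are arbitrary values for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def stress_test(call, total_requests=200, concurrency=20):
    """Invoke call() total_requests times across a thread pool and summarize the run."""

    def timed_call(_):
        start = time.perf_counter()
        try:
            call()
            succeeded = True
        except Exception:
            succeeded = False
        return succeeded, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    failures = sum(1 for succeeded, _ in results if not succeeded)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    p95_index = (95 * len(latencies) + 99) // 100 - 1
    return {
        "requests": total_requests,
        "error_rate": failures / total_requests,
        "p95_seconds": latencies[p95_index],
    }
```

Running it against a staging endpoint at the projected peak rate, and comparing the error rate and the 95th percentile latency with the launch targets, gives an early signal on whether the scale-up plan is needed on day one.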

Operational Readiness Review (ORR)

The ORR is used to make sure the service is launched with operational excellence in mind. The effort spent on the ORR reduces the team's future operational load and provides better service to end-users.

deployment

automation

use Ansible to deploy the service
automatically remove any credentials from the remote host after deployment
use systemd to launch the service to avoid service interruption

docker

Docker deployment with Docker Hub set up under a team account
docker build / image / process automation

database

database setup and security rules with restricted write permission
database backup rules
no leakage of the service account on GitHub

vault

ansible-vault setup to protect credential files
encrypt all credentials and config files
private repo set up to store the encrypted files

infura

Infura project ID protection
check whether a high-limit account is needed for the frontend
use a different Infura project ID for each service

CI/CD

continuous integration/deployment
canary: set up an automated testnet deployment using the testnet Docker image

release

release test
release tag/sign-off process
release canary

Github integration

Travis build and automated test
integration with GitHub release action

monitoring

paging

UptimeRobot monitoring setup
PagerDuty integration
on-call rotation setup

service availability

monitor the availability of the service
monitor the availability of the frontend
monitor the availability of the backend
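
As a rough illustration of the availability items above, the sketch below probes a set of named endpoints and reports which ones respond. The endpoint names and URLs in the usage note are hypothetical, and a hosted monitor such as UptimeRobot would normally run checks like this from outside the service's own network.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, DNS failures, refused connections, and timeouts.
        return False


def check_endpoints(endpoints, probe=http_probe):
    """Map each endpoint name to True (available) or False (unavailable)."""
    return {name: probe(url) for name, url in endpoints.items()}
```

A call such as `check_endpoints({"frontend": "https://app.example.org", "backend": "https://api.example.org/health"})` returns one boolean per endpoint, which can then be forwarded to the paging system.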

account balance

monitor the service account balance on every blockchain it operates on
PagerDuty integration for account balance checks
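
A balance check can be as small as one JSON-RPC call and a comparison. The sketch below assumes the chain exposes an Ethereum-compatible JSON-RPC endpoint that supports `eth_getBalance`. The RPC URL, the address, and the threshold are placeholders to be supplied by the operator, and the paging step is left out.

```python
import json
import urllib.request


def balance_request(address):
    """Build the JSON-RPC payload for eth_getBalance at the latest block."""
    return {"jsonrpc": "2.0", "id": 1, "method": "eth_getBalance", "params": [address, "latest"]}


def parse_balance(response):
    """Convert the hex-encoded result of eth_getBalance into an integer (wei)."""
    return int(response["result"], 16)


def fetch_balance(rpc_url, address, timeout=5.0):
    """Query the node at rpc_url (operator-supplied) for the balance of address."""
    body = json.dumps(balance_request(address)).encode()
    request = urllib.request.Request(
        rpc_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_balance(json.load(response))


def is_balance_low(balance_wei, threshold_wei):
    """True when the account should be topped up."""
    return balance_wei < threshold_wei
```

A scheduled job could call `fetch_balance` for each service account and trigger a page when `is_balance_low` returns `True`.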

dashboard

use Grafana dashboard to monitor services
install node_exporter to monitor system resources
export Prometheus metrics from the service
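
To make the last item concrete, the sketch below renders metrics in the Prometheus text exposition format by hand. In practice, a client library such as `prometheus_client` is the usual choice. The metric names in the usage note are hypothetical, and serving the text over HTTP for Prometheus to scrape is left out.

```python
def render_metrics(metrics):
    """Render (name, type, help_text, value) tuples as Prometheus exposition text."""
    lines = []
    for name, metric_type, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n" if lines else ""
```

For example, `render_metrics([("service_up", "gauge", "Whether the service is healthy.", 1)])` produces a HELP line, a TYPE line, and the sample line `service_up 1`, which Grafana can chart once Prometheus has scraped it.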

operation and maintenance

runbook

where is the runbook?
who will update the runbook regularly?

end-user support

support email account setup
discord/telegram support channel setup
published usage FAQ
pop-up or inline help on the frontend

restart process

how to restart the frontend?
how to restart the backend?

rollback/revert process

rollback/revert criteria
what is the rollback/revert process?
two-person rule on critical rollbacks

emergency plan

how to pause the service in case of a security breach?
how to take down the backend?
contingency script to increase the threshold

refund process

what is the manual refund process?
what are the refund criteria?