
Introduction
Beyond development work, launching any service application or website requires serious attention to security, scalability, stability, operation, and maintenance. Public-facing services raise the bar on those requirements even higher. From my years of experience launching services within Amazon, I have found it quite useful to draw up a checklist during the product planning phase and sign off on it during the launch process. AWS is well known for its DevOps culture and its 24/7 on-call practice. The benefit of operational excellence is apparent: it provides highly available, highly secure, and highly efficient service experiences to customers. Many lessons have been learned, and many post-mortems written, from years of failures and mistakes. Nowadays, any AWS team launching a new service needs to go through an “ORR” (Operational Readiness Review) process with principal engineers to ensure a smooth launch and that the team is operationally ready.
At Harmony, even though we focus on developing the core protocol of the public blockchain, we still need to provide and maintain numerous service infrastructures to serve users and dApps until the community can fully take them over. Those services include, but are not limited to, the blockchain explorer, the staking dashboard, and the Horizon bridge.
In this post, we’d like to share the launch checklist and best practices we follow to ensure the security, scalability, and operational excellence of those services. In short, this is an internal process document for planning and product launch. It consists of a launch checklist and an operational readiness review checklist.
Any feedback or comments are welcome.
Launch checklist
The launch checklist focuses mainly on source code security, network security, and compute security.
code security check
The source code security check is the first step, carried out during the development stage.
audit
- internal audit of the service source code: nil pointers, crash handling, memory leaks, etc.
- external audit of the service source code, with a security audit report
- smart-contract audit, both internal and external
dependency
- dependency security check and dependency package/library updates
credentials
- secure credential handling, with KMS support for private keys
- no credentials stored in plaintext; use a strong encryption algorithm for passcodes
- do not check any credentials in to GitHub (monitor with GitGuardian)
- trusted setup: use different credentials for different accounts to avoid a massive leak
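As an illustration of the "no plaintext credentials" items above, here is a minimal sketch of a local check that could run before a commit. It is not a substitute for a monitoring service such as GitGuardian; the pattern list and the name `find_secrets` are hypothetical and chosen only for this example, and a real scanner covers far more cases.

```python
import re

# Illustrative patterns only (an assumption of this sketch, not a complete list).
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----"),
    "hardcoded_password": re.compile(
        r"(?i)\b(?:password|passwd|secret)\s*[:=]\s*['\"][^'\"]{4,}['\"]"
    ),
}


def find_secrets(text):
    """Return (line_number, pattern_name) for every suspicious line in text."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings
```

A pre-commit hook could call `find_secrets` on each staged file and refuse the commit when the returned list is not empty.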
network security check
This section assumes AWS is used to set up the service. For other cloud providers, please use the equivalent services.
load balancer
- load balancer (ELB) set up in front of the backend
- SSL certificate auto-renewal
- DDoS protection on the load balancer, such as WAF
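Even with auto-renewal in place, it can be worth checking independently that the certificate is not about to expire. The sketch below uses only the Python standard library; the function names are hypothetical, and the alerting threshold is left to the caller.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Whole days between `now` (epoch seconds) and a certificate's notAfter string."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires_at - now) // 86400)


def fetch_not_after(hostname, port=443, timeout=5.0):
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after(host))` for each public endpoint and page the on-call engineer when the value falls below a chosen number of days.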
frontend
- host static frontend files on Netlify/IPFS/S3
KMS
- KMS setup to manage credentials
backend
- security group hardening for backend servers
- restrict SSH access to backend servers to the jump host only (allowlist)
- expose service port to trusted network only
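The security group items above can be spot-checked with a small script. The following sketch assumes the rules are given as a list of dictionaries shaped like the `IpPermissions` entries that the EC2 API returns for a security group (`IpProtocol`, `FromPort`, `ToPort`, `IpRanges`, `Ipv6Ranges`); the function name `open_to_world` is hypothetical, and fetching the rules from AWS is left out.

```python
def open_to_world(ip_permissions, port=22):
    """Return the rules that expose `port` to any IPv4 or IPv6 address."""
    risky = []
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            covers_port = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            covers_port = rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)
        else:
            covers_port = False
        cidrs = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
        cidrs += [r.get("CidrIpv6") for r in rule.get("Ipv6Ranges", [])]
        if covers_port and any(c in ("0.0.0.0/0", "::/0") for c in cidrs):
            risky.append(rule)
    return risky
```

An empty result for port 22 is consistent with SSH being reachable only from specific addresses such as the jump host; it does not prove the remaining rules are tight enough, so a manual review is still needed.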
compute security
Compute security applies to the VPS used to host the service.
image
- up-to-date LTS OS image with security packages updated
- cron job set up for regular security package updates
- updated docker base image
VPS
- sufficient machine spec for the service
- CloudWatch monitoring on the load of the VPS
- user account protection: use separate accounts for the testing and production environments
scalability consideration
Service request projection
- the product owner shall project the scale of upcoming requests
- projected number of requests per day
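A daily request projection becomes more useful once it is converted into a per-second rate and a rough instance count. The arithmetic can be sketched as follows; the peak factor, the headroom, and the function names are assumptions made for illustration and should be replaced with measured values.

```python
import math


def projected_rps(requests_per_day, peak_factor=3.0):
    """Convert a daily request projection into average and peak requests per second."""
    average = requests_per_day / 86400  # seconds in a day
    return average, average * peak_factor


def instances_needed(peak_rps, rps_per_instance, headroom=0.3):
    """Number of instances so that peak load uses at most (1 - headroom) of capacity."""
    usable = rps_per_instance * (1.0 - headroom)
    return max(1, math.ceil(peak_rps / usable))
```

For example, 8,640,000 requests per day is an average of 100 requests per second; with a peak factor of 3, instances that each handle 100 requests per second, and 25% headroom, four instances would be needed.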
Load balancer setup
- dedicated endpoint for the backend
Scale-up plan
- how to monitor the scalability requirement of the service
- the plan for handling scale-up when it is needed
- mock-up stress test on the services
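A mock-up stress test does not need heavy tooling to get started. Here is a minimal sketch, assuming the caller supplies a `send_request` callable that performs one request and raises an exception on failure; the function name and the report fields are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def stress_test(send_request, total_requests=200, concurrency=20):
    """Call `send_request` concurrently and summarise failures and latency."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    def timed_call(_):
        start = time.perf_counter()
        try:
            send_request()
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    p95_index = max(0, int(len(latencies) * 0.95) - 1)
    return {
        "total": len(results),
        "failed": sum(1 for ok, _ in results if not ok),
        "p95_seconds": latencies[p95_index],
    }
```

In practice `send_request` could wrap an HTTP call to a staging endpoint. Run it only against services you own, and start with a low concurrency.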
Operational Readiness Review (ORR)
The ORR is used to make sure the service is launched with operational excellence in mind. The effort spent on the ORR reduces the team's future operational load and provides better service to end users.
deployment
automation
- use Ansible to deploy the service
- automatically remove any credentials from the remote host after deployment
- use systemd to launch the service to avoid service interruption
docker
- docker deployment with Docker Hub set up under a team account
- docker build / image / process automation
database
- database setup and security rule setup with write permission restriction
- database backup rules
- no leakage of the service account in GitHub
vault
- ansible-vault setup to protect credential files
- encrypt all credentials and config files
- private repo set up to save the encrypted files
infura
- Infura project ID protection
- check whether a high-limit account is needed for the frontend
- use a different Infura project ID for each service
CI/CD
- continuous integration/deployment
- canary: set up an automated testnet deployment using the testnet docker image
release
- release test
- release tag/sign-off process
- release canary
GitHub integration
- Travis build and automated test
- integration with GitHub release action
monitoring
paging
- UptimeRobot monitoring setup
- PagerDuty integration
- on-call rotation setup
service availability
- monitor the availability of the service
- monitor the availability of the frontend
- monitor the availability of the backend
account balance
- monitor service account balance on any blockchain
- PagerDuty integration for account balance checking
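Balance monitoring can be sketched as a small poller. The example below assumes the chain exposes an Ethereum-compatible JSON-RPC endpoint that supports `eth_getBalance` and that the token uses 18 decimals; the RPC URL, the address, the threshold, and the function names are all placeholders that the caller supplies, and the paging integration is left out.

```python
import json
import urllib.request

WEI_PER_TOKEN = 10 ** 18  # assumption: the token uses 18 decimals


def parse_balance(hex_wei):
    """Convert a hex-encoded balance in the smallest unit into whole tokens."""
    return int(hex_wei, 16) / WEI_PER_TOKEN


def needs_top_up(balance, minimum):
    """True when the balance has dropped below the alerting threshold."""
    return balance < minimum


def fetch_balance(rpc_url, address):
    """Query an Ethereum-compatible JSON-RPC endpoint with eth_getBalance."""
    payload = json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getBalance",
        "params": [address, "latest"],
    }).encode()
    request = urllib.request.Request(
        rpc_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_balance(json.load(response)["result"])
```

A cron job could call `fetch_balance` for each service account and trigger an alert whenever `needs_top_up` returns true.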
dashboard
- use Grafana dashboard to monitor services
- install node_exporter to monitor system resources
- export Prometheus metrics from the service
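To show what "export Prometheus metrics from the service" can look like, here is a minimal sketch that renders metrics in the Prometheus text exposition format using only the standard library. In a real service the official client library for your language is usually the better choice; the metric names and the `METRICS` dictionary here are hypothetical.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical metrics: name -> (type, help text, current value).
METRICS = {
    "service_requests_total": ("counter", "Requests handled since start.", 0),
    "service_up": ("gauge", "1 if the service is healthy.", 1),
}


def render_metrics(metrics):
    """Render {name: (type, help, value)} in the Prometheus text exposition format."""
    lines = []
    for name, (metric_type, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current metrics on GET /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Passing `MetricsHandler` to `http.server.HTTPServer` on a port of your choice exposes the endpoint, which Prometheus can then scrape and Grafana can chart.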
operation and maintenance
runbook
- where is the runbook?
- who will update the runbook regularly?
end-user support
- support email account setup
- discord/telegram support channel setup
- published usage FAQ
- pop-up or inline help on the frontend
restart process
- how to restart the front end?
- how to restart the backend?
rollback/revert process
- rollback/revert criteria
- what is the rollback/revert process?
- two-person rule on critical rollbacks
emergency plan
- how to pause the service in case of a security breach?
- how to take down the backend?
- contingency script to increase the threshold
refund process
- what is the manual refund process?
- what are the refund criteria?