Service Launch Checklist and ORR

Leo Chen
4 min read · Jun 25, 2021

Introduction

Beyond the development work, launching any service, application, or website requires serious consideration of security, scalability, stability, operation, and maintenance. Public-facing services raise the bar on those requirements even higher. From my years of experience launching services at Amazon, I have found it quite useful to draw up a checklist during the product planning phase and sign off on it during the launch process. AWS is well known for its DevOps culture and its 24/7 on-call practice. The benefit of operational excellence is apparent: it provides highly available, highly secure, and highly efficient service experiences to customers. Many lessons have been learned, and many post-mortems written, over years of failures and mistakes. Nowadays, any AWS team launching a new service needs to go through an “ORR” (Operational Readiness Review) process with principal engineers to ensure a smooth launch and that the team is operationally ready.

At Harmony, even though we focus on developing the core protocol of the public blockchain, we still need to provide and maintain a number of services for users and dApps until the community can fully take them over. Those services include, but are not limited to, the blockchain explorer, the staking dashboard, and the Horizon bridge.

In this post, we’d like to share the launch checklist and best practices we followed to ensure the security, scalability, and operational excellence of those services. In short, this is an internal process document for planning and product launch. It consists of a launch checklist and a list for operational readiness review.

Any feedback or comments are welcome.

Launch checklist

The launch checklist mainly focuses on source code security, network security, and compute security.

code security check

The source code security check is the first step to follow during the development stage.

audit

  • internal audit of the service source code: nil pointers, crash handling, memory leaks, etc.
  • external audit of the service source code, with a security audit report
  • smart-contract audit, both internal and external

dependency

  • dependency security check; keep dependency packages and libraries up to date

credentials

  • secure credential handling, with KMS support for private keys
  • no plaintext credentials anywhere; strong encryption for passcodes
  • do not check any credentials into GitHub (use GitGuardian to monitor the repositories)
  • trusted setup: use different credentials for different accounts so that one leak does not expose everything
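
As a rough illustration of the "no plaintext credentials" rule, here is a minimal Python sketch of a pre-commit style scan. The pattern list and the `find_secrets` helper are our own illustrative assumptions; they are not how GitGuardian works, and a dedicated scanner covers far more cases. The 64-character hex pattern is meant to catch raw blockchain private keys, and it will also flag harmless values such as transaction hashes.

```python
import re

# Illustrative patterns only (assumption): a real scanner covers many more formats.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    # Raw 32-byte hex strings; expect false positives such as transaction hashes.
    "hex_private_key": re.compile(r"\b(?:0x)?[0-9a-fA-F]{64}\b"),
}


def find_secrets(text):
    """Return a (line_number, pattern_name) pair for every suspicious line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
    sample = "region = us-east-1\naws_key = AKIAIOSFODNN7EXAMPLE\n"
    for lineno, name in find_secrets(sample):
        print(f"line {lineno}: possible {name}")
```

A check like this can run locally before a commit, with the hosted monitor acting as the second line of defense.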

network security check

The network security check assumes AWS is used to host the service. For other cloud providers, please use the equivalent services.

load balancer

  • ELB load balancer setup for the backend
  • SSL certificate auto-renewal
  • DDoS protection on the load balancer, such as WAF

frontend

  • host static frontend files on Netlify/IPFS/S3

KMS

  • KMS setup to manage credentials

backend

  • security group hardening for backend servers
  • restrict SSH access to backend servers to the jump host only
  • expose the service port to trusted networks only
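
The two backend rules above can be checked mechanically. The sketch below assumes the ingress rules have already been fetched and have the same shape as the `IpPermissions` entries returned by the EC2 `DescribeSecurityGroups` API. The `open_to_world` helper and the port numbers are hypothetical.

```python
# CIDR blocks that mean "anyone on the internet".
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def open_to_world(permissions, sensitive_ports):
    """Return the sensitive ports that are reachable from any address."""
    exposed = set()
    for perm in permissions:
        cidrs = [r["CidrIp"] for r in perm.get("IpRanges", [])]
        cidrs += [r["CidrIpv6"] for r in perm.get("Ipv6Ranges", [])]
        if not OPEN_CIDRS.intersection(cidrs):
            continue
        if perm.get("IpProtocol") == "-1":
            # "-1" means all protocols and all ports.
            exposed.update(sensitive_ports)
            continue
        low, high = perm.get("FromPort"), perm.get("ToPort")
        if low is None or high is None:
            continue
        exposed.update(p for p in sensitive_ports if low <= p <= high)
    return sorted(exposed)


if __name__ == "__main__":
    rules = [
        # SSH open to everyone: this should be flagged.
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        # Service port (8080 is a made-up example) limited to the VPC range.
        {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    ]
    print(open_to_world(rules, {22, 8080}))
```

An empty result means neither SSH nor the service port is exposed beyond the trusted ranges.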

compute security

Compute security applies to the VPS used to host the service.

image

  • up-to-date LTS OS image with security packages updated
  • cron job set up for regular security package updates
  • updated docker base image

VPS

  • sufficient machine spec for the service
  • CloudWatch monitoring of the load on the VPS
  • user account protection: use separate accounts for testing and production environments

scalability consideration

Service request projection

  • the product owner shall project the scale of upcoming requests
  • projected number of requests per day
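
To turn a daily projection into something a capacity plan can use, a back-of-the-envelope conversion is usually enough. In this sketch, the peak factor and the per-instance throughput are placeholder assumptions that each team should replace with measured values.

```python
import math

SECONDS_PER_DAY = 86_400


def project_load(requests_per_day, peak_factor=3.0, per_instance_rps=50.0):
    """Convert a daily request projection into average RPS, peak RPS, and instance count.

    peak_factor and per_instance_rps are assumptions, not measured values.
    """
    avg_rps = requests_per_day / SECONDS_PER_DAY
    peak_rps = avg_rps * peak_factor
    instances = max(1, math.ceil(peak_rps / per_instance_rps))
    return avg_rps, peak_rps, instances


if __name__ == "__main__":
    avg, peak, instances = project_load(8_640_000)
    print(f"average {avg:.0f} rps, peak {peak:.0f} rps, {instances} instances")
```

With 8,640,000 requests per day and the default assumptions, this works out to an average of 100 requests per second, a peak of 300, and 6 instances.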

Load balancer setup

  • dedicated endpoint for the backend

Scale-up plan

  • how to monitor the scalability requirement of the service
  • the plan for scaling up when it becomes necessary
  • mock stress test of the services
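
A mock stress test does not need heavy tooling to be useful. The sketch below uses only the Python standard library. The `stress` helper is hypothetical, and `call` stands for whatever function issues a single request against a test deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def stress(call, total_requests, concurrency):
    """Run `call` total_requests times on `concurrency` threads and summarize the run."""
    if total_requests < 1 or concurrency < 1:
        raise ValueError("total_requests and concurrency must be at least 1")

    def timed(_index):
        start = time.perf_counter()
        try:
            call()
            ok = True
        except Exception:
            # Any exception raised by the call counts as a failed request.
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    failures = sum(1 for ok, _ in results if not ok)
    # Nearest-rank style 95th percentile, clamped to the last sample.
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return {"requests": total_requests, "failures": failures, "p95_seconds": p95}


if __name__ == "__main__":
    # Stand-in workload; replace with a function that calls the test endpoint.
    print(stress(lambda: time.sleep(0.01), total_requests=50, concurrency=10))
```

Raising `concurrency` step by step while watching the failure count and the 95th percentile latency gives a first estimate of where the service starts to degrade.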

Operational Readiness Review (ORR)

The ORR is used to make sure the service is launched with operational excellence in mind. The effort spent on the ORR will reduce the team's future operational load and provide better service to end users.

deployment

automation

  • use Ansible to deploy the service
  • automatically remove any credentials from the remote host after deployment
  • use systemd to launch the service to avoid service interruption

docker

  • Docker deployment with Docker Hub set up under a team account
  • docker build / image / process automation

database

  • database setup and security rules, with write permissions restricted
  • database backup rules
  • no leakage of the service account in GitHub

vault

  • ansible-vault setup to protect credential files
  • encrypt all credentials and config files
  • private repo set up to save the encrypted files

Infura

  • Infura project ID protection
  • check whether a higher-limit account is needed for the frontend
  • use a different Infura project ID for each service

CI/CD

  • continuous integration/deployment
  • canary: set up an automated testnet deployment using the testnet Docker image

release

  • release test
  • release tag/sign-off process
  • release canary

GitHub integration

  • Travis build and automated test
  • integration with GitHub release action

monitoring

paging

  • UptimeRobot monitoring setup
  • PagerDuty integration
  • on-call rotation setup

service availability

  • monitor the availability of the service
  • monitor the availability of the frontend
  • monitor the availability of the backend
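
For teams that want a self-hosted probe alongside a hosted monitor, the availability checks above can be sketched in a few lines of standard-library Python. The endpoint names and URLs are placeholders, and the `opener` parameter exists only so the probe can be exercised without network access.

```python
import urllib.error
import urllib.request


def is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, DNS failures, refused connections, and timeouts.
        return False


def check_all(endpoints, probe=is_up):
    """Map each endpoint name to its current availability."""
    return {name: probe(url) for name, url in endpoints.items()}


if __name__ == "__main__":
    # Placeholder URLs (assumption): substitute the real frontend and backend.
    endpoints = {
        "frontend": "https://frontend.example.invalid/",
        "backend": "https://api.example.invalid/health",
    }
    for name, up in check_all(endpoints).items():
        print(f"{name}: {'up' if up else 'DOWN'}")
```

Any endpoint reported as down would then be handed to the paging integration.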

account balance

  • monitor service account balance on any blockchain
  • PagerDuty integration for account balance checks
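
On Ethereum-compatible chains, a balance check comes down to one `eth_getBalance` JSON-RPC call and a threshold comparison. The sketch below only builds the request body and parses the response. The transport, the endpoint, the 18-decimal denomination, and the threshold are assumptions to adapt per chain.

```python
import json

# Assumption: the token uses 18 decimals, as ETH does.
WEI_PER_TOKEN = 10**18


def balance_request(address, request_id=1):
    """Build the JSON-RPC body for an eth_getBalance call at the latest block."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "eth_getBalance",
        "params": [address, "latest"],
        "id": request_id,
    })


def parse_balance(response_body):
    """Extract the balance in the smallest unit from a JSON-RPC response."""
    return int(json.loads(response_body)["result"], 16)


def needs_page(balance_wei, min_tokens):
    """Return True when the balance has dropped below the alert threshold."""
    return balance_wei < min_tokens * WEI_PER_TOKEN


if __name__ == "__main__":
    # Canned response standing in for a real node reply: exactly 1 token.
    body = '{"jsonrpc": "2.0", "id": 1, "result": "0xde0b6b3a7640000"}'
    balance = parse_balance(body)
    print(balance, needs_page(balance, min_tokens=5))
```

When `needs_page` returns true, the monitor would raise an incident through the paging integration so the account can be topped up before the service runs dry.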

dashboard

operation and maintenance

runbook

  • where is the runbook?
  • who will update the runbook regularly?

end-user support

  • support email account setup
  • discord/telegram support channel setup
  • published usage FAQ
  • pop-up or inline help on the frontend

restart process

  • how to restart the frontend?
  • how to restart the backend?

rollback/revert process

  • rollback/revert criteria
  • what is the rollback/revert process?
  • two-person rule on critical rollbacks
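
The two-person rule is easy to state and easy to get subtly wrong, for example by letting the requester approve their own rollback. A minimal sketch of the check, with hypothetical names, could look like this:

```python
def rollback_approved(requester, approvers, authorized):
    """Two-person rule: an authorized requester plus at least one other authorized approver."""
    independent = {a for a in approvers if a != requester and a in authorized}
    return requester in authorized and len(independent) >= 1


if __name__ == "__main__":
    team = {"alice", "bob"}
    print(rollback_approved("alice", ["bob"], team))    # a second person signed off
    print(rollback_approved("alice", ["alice"], team))  # self-approval does not count
```

A rollback script can call a check like this before it touches production.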

emergency plan

  • how to pause the service in case of a security breach?
  • how to take down the backend?
  • contingency script to increase the threshold

refund process

  • what is the manual refund process?
  • what are the refund criteria?
