Before 2013, almost all aspects of production operations were
manual: building production-ready artifacts, deploying code,
managing servers, inspecting logs, and more. Deployments were rare,
risky, and time-intensive, and dev/test/production environments were
inconsistent. Minor fixes could take days or weeks to deliver.
Two operations engineers and I improved every area of this process,
adopting a number of tools and training developers to use them.
We spent about a month working with Puppet to automate server
configuration and application deployment. After running into a
number of obstacles, we switched to Chef and have used it to
consistently provision servers with identical configurations. Using
community cookbooks, business-specific wrappers, and custom code, we
can now bring a vanilla machine into any state we need: database
server, application server, load balancer, and more. Chef runs in
both test and production environments to ensure consistency
throughout the application lifecycle. Manual provisioning is a
thing of the past.
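As a rough illustration, a wrapper cookbook recipe layers our
settings on top of a community cookbook through node attributes.
This is a minimal sketch, not our actual code; the attribute keys,
template, and file names are assumptions:

    # recipes/default.rb of a hypothetical wrapper cookbook.
    # Set attributes before pulling in the community cookbook so its
    # templates render with our values.
    node.default['nginx']['worker_processes'] = 4
    node.default['nginx']['default_site_enabled'] = false

    # The community 'nginx' cookbook does the heavy lifting.
    include_recipe 'nginx'

    # Business-specific configuration layered on top.
    template '/etc/nginx/conf.d/sites.conf' do
      source   'sites.conf.erb'
      owner    'root'
      mode     '0644'
      notifies :reload, 'service[nginx]'
    end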
Continuous Integration and Delivery
I deployed Bamboo and created build pipelines for all existing
and upcoming projects. Commits to source control automatically
trigger a build that runs static analysis tools, executes unit
tests, and creates an artifact which can be shipped to a test or
production environment. Bamboo also controls deployments to the test
environment, which any developer can trigger at any time by pushing a
button; these deployments run any integration or functional tests.
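Bamboo plans are configured through its UI, but each stage
ultimately shells out to the project's own build tasks. Below is a
minimal Rakefile sketch of the kind of steps a plan runs; the task
names and tool choices are illustrative assumptions, not our exact
setup:

    # Rakefile (sketch): the Bamboo plan runs `rake ci` on each commit.
    task :lint do
      sh 'rubocop'    # static analysis; tool choice is an assumption
    end

    task :test do
      sh 'rspec'      # unit tests
    end

    task :package do
      mkdir_p 'pkg'
      # Bamboo exposes build metadata as bamboo_* environment variables.
      sh "tar czf pkg/app-#{ENV.fetch('bamboo_buildNumber', 'dev')}.tar.gz app lib config"
    end

    desc 'Full CI run: static analysis, unit tests, shippable artifact'
    task :ci => [:lint, :test, :package]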
Simplified Deployments using Capistrano
I combined community Capistrano recipes with our own to make complex
multi-server deployments easy to run from the command line, as well
as to perform any ad hoc system management that hadn't yet been
automated. Capistrano allowed us to deploy to multiple servers in
parallel and to coordinate different types of servers when
needed. For example, we often deployed new code to a single server
not handling public traffic for a quick smoke test before rolling it
out. With one Capistrano command we could pull that server from the
public pool and deploy the code. Once validated, a second command
would put it back into the pool and then methodically perform the
same sequence on all other production servers.
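In Capistrano 2 terms (the version current at the time), the canary
flow looks roughly like the sketch below. It assumes a :canary role
is defined elsewhere, and the lb-pool command is a hypothetical
stand-in for whatever control the load balancer actually exposes:

    # config/deploy.rb (sketch): canary-style rolling deploy.
    namespace :rolling do
      desc 'Deploy to the out-of-pool canary server for a smoke test'
      task :canary, :roles => :canary do
        # $CAPISTRANO:HOST$ expands to each target host at run time.
        run 'lb-pool remove $CAPISTRANO:HOST$'  # stop public traffic
        top.deploy.update                       # push and symlink new code
      end

      desc 'Return the canary to the pool after the smoke test passes'
      task :promote, :roles => :canary do
        run 'lb-pool add $CAPISTRANO:HOST$'
      end
    end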
We brought in many new services and technologies to ensure that
production was properly monitored and that critical problems would
alert someone responsible within minutes of detection. Pingdom
monitors basic up/down status of sites, New Relic makes response
times and errors visible, PagerDuty manages on-call schedules and
alerting, and several open source projects like statsd, collectd,
and Graphite are used to inspect current and historical system
state. The combination of these technologies has enabled developers
to manage most production applications without requiring a deep
background in server administration. The net effect is that we can
deploy more quickly with fewer problems and react to new issues
incredibly fast.
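Part of what makes statsd so easy to adopt is its wire format:
metrics are fire-and-forget UDP packets in a one-line text protocol,
so instrumenting an app needs nothing beyond the standard library.
A sketch, with assumed host, port, and metric names:

    require 'socket'

    # statsd line protocol: <metric>:<value>|<type>
    # 'c' is a counter, 'ms' is a timer; packets are fire-and-forget UDP.
    statsd = UDPSocket.new
    statsd.send('app.logins:1|c', 0, 'statsd.internal', 8125)
    statsd.send('app.render_time:42|ms', 0, 'statsd.internal', 8125)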
Many other improvements were made during this process:
- Migrated to Nginx from F5 load balancers to reduce cost.
- Implemented consistent backup schedules and tools.
- Isolated sites from one another to improve security and reduce risk.
- Utilized Splunk to aggregate all logs and provide insight into
production systems.
- Created tools to assist developers with complex Perforce workflows.
- Deployed Hubot and wrote custom scripts to inform the team of
production events (and pug bombs).