WelpConf: Striving for Stability
- Greg Poirer
On May 12th, 2016, Heavybit member company Opsee hosted their first ever WelpConf. The event featured guest speaker Andy Smith from Wercker, and a panel moderated by Dan Turchin of BigPanda, featuring Andy Smith & Greg Poirer from Opsee. Videos of the talk and panel are below, along with further thoughts from Greg.
Have you ever been on-call? Has your phone beeped and buzzed at 2 a.m.? Have you ever looked at an alert and thought to yourself, “Welp?”
Striving for Stability
Systems are a representation of reality that strive, and fail, toward determinism. They are only perfect in our mind, and while we make every effort to translate our desires into reality, we often fail to consider every possible input or address every possible failure scenario during implementation, and this leads to instability.
Every action between building locally on a developer’s laptop and deploying to production should be reproducible.
Only that which is reproducible can be stable. The unknown may never be fully known, but 99.99% availability is a laudable achievement. By working toward reproducibility in our actions, we can inform our decision making with assurance that the system will behave in a predictable fashion. Docker is a powerful tool that is a step in the direction toward reproducibility, but there are right ways to do things and there are wrong ways to do things.
The IT Ops Journey: Then, Now, and Next
Computer operators, systems technicians, systems administrators, operations engineers, systems engineers, developers. Throughout the history of computers, the only constant has been operations. Someone is responsible for making sure things are doing what they should be doing. When the fire went out, homo erectus would rekindle the flame.
Playbooks, battleplans, cron job failure e-mails, and scheduled restarts of services have been the tried and true method of reacting to operational incidents for time immemorial. The sysadmin’s handbook twenty years ago is still very much the monitoring of today.
Monitoring has been the Big Five for as long as most of us can remember: ping tests, process checks, CPU utilization, memory utilization, and disk utilization.
In the past few years, intrepid developers have made the things under observation and the ways that we observe them increasingly sophisticated. We’ve gotten exceedingly good at being able to answer the question, “What is happening and where is it happening?” We have not, however, taken to going a step further.
Our distrust of the systems we build has left us in a position of constant manual intervention and ad hoc automation of remediation.
We need goals for IT operations and monitoring. We need to be able to build trust in the systems we build. Anything emitted by a system should be recorded, catalogued, and understood. Relationships between components should be well-defined and discoverable. The people building and deploying systems should have a comprehensive understanding of their failure domains, and the systems themselves should help inform that comprehension.
Sign up here to attend the next WelpConf, and check out our events page for a schedule of upcoming developer focused events at our San Francisco Clubhouse.
Subscribe to Heavybit Updates
You don’t have to build on your own. We help you stay ahead with the hottest resources, latest product updates, and top job opportunities from the community. Don’t miss out—subscribe now.
Content from the Library
Data Council 2025: The Data Science & Algorithms Track with Sean Taylor and Jesse Robbins
Heavybit is thrilled to be sponsoring Data Council 2025, and we invite you to join us in Oakland from Apr 22-24 to experience 3...
How to Create Data Pipelines
How to Create Data Pipelines Introduction to Data Pipelines In today’s data-driven world developers and product managers rely...
Data Council 2025: The Databases Track with Sai Krishna Srirampur and Craig Kerstiens
Heavybit is thrilled to be sponsoring Data Council 2025, and we invite you to join us in Oakland from Apr 22-24 to experience 3...