Look, we've all been there: there's a term, you've heard it one hundred times. You've nodded as others said it in meetings. And now, you've started to say it. The only tiny insignificant problem is that you're not 100% sure what it actually means or how it's different from another similar term. I feel you. So I wrote this DevOps glossary with my highly opinionated definitions of common DevOps industry terms.

Core DevOps Practices

  • Continuous Integration (CI): A practice where developers frequently merge code changes into a central repository, followed by automated builds and tests. Continuous Integration generally does not include the task of releasing that code to production.
  • Continuous Deployment (CD): Automatically deploying code changes to a production environment once they pass a series of automated tests. This may even include canary deployments and rollbacks.
  • DevSecOps: Integrating security practices within the DevOps process, ensuring continuous security consideration throughout the application development and deployment lifecycle. Much like DevOps, DevSecOps is a design goal and not a destination. We want to have better DevSecOps; it’s not just a box we can tick before moving on.
  • FinOps: The optimization of infrastructure and practices to reduce overall costs. Because so much of operations is focused on preventing downtime and ensuring performance, FinOps is called out as a distinct practice. Hopefully, FinOps is not anyone’s full-time job but rather a periodic process of evaluating our systems to find cheaper ways to get the same results.
  • Shift Left: A practice that involves integrating security and testing early in the software development lifecycle, aiming to identify and fix issues sooner. It also has the advantage of letting the same developers who wrote a feature be the first ones to see when that feature isn’t passing certain tests.
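
To make the CI versus CD distinction concrete, here is a minimal sketch of a pipeline runner. Everything in it (the `run_pipeline` function, the step names, the lambda stand-ins) is hypothetical and for illustration only; a real pipeline would call out to a build tool and a test runner.

```python
from typing import Callable, Optional, Sequence, Tuple

Step = Tuple[str, Callable[[], bool]]

def run_pipeline(steps: Sequence[Step],
                 deploy: Optional[Callable[[], None]] = None) -> str:
    """Run each step in order and stop at the first failure.

    With deploy=None this models CI only: code is built and tested but
    never released. Passing a deploy callable models continuous
    deployment: production is updated only if every step passed.
    """
    for name, step in steps:
        if not step():
            return f"failed: {name}"
    if deploy is None:
        return "verified"
    deploy()
    return "deployed"

# Hypothetical steps that always pass, standing in for a real build and test run.
steps = [("build", lambda: True), ("unit tests", lambda: True)]
released = []

print(run_pipeline(steps))                                        # CI only
print(run_pipeline(steps, deploy=lambda: released.append("v1")))  # CI + CD
```

The only difference between the two calls is whether a deploy step is wired in, which is roughly the difference between the two definitions above.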

Tooling and Configuration

  • Infrastructure as Code (IaC): Managing and provisioning computing infrastructure through machine-readable definition files, rather than piecemeal through manual setup and scattered configuration files. If you started building cloud services recently, this can seem self-evident, as tools like Vercel are configured entirely through files that are handled as part of your code repository. However, when you log in to a public cloud like AWS or Azure and click through menus to add resources to your virtual machine, you’re definitely not doing infrastructure as code!
  • Monitoring as Code (MaC): Defining and managing monitoring settings and alerts through version-controlled code, allowing for automation and scalable monitoring configurations. Just as testing is best managed directly next to the codebase it checks, monitoring as code says the best way to monitor a service reliably is to manage the monitoring configuration right next to the codebase.
  • CLI (Command Line Interface): A tool that enables users to interact with computers and perform tasks through text-based commands.
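
As a rough illustration of monitoring as code, the sketch below keeps check definitions in a source file where they can be reviewed, validated, and versioned like any other code. The `ApiCheck` schema, the field names, and the example.com URLs are all assumptions made up for this example; they are not the format of any particular monitoring tool.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class ApiCheck:
    """One monitoring check (hypothetical schema, for illustration)."""
    name: str
    url: str
    frequency_minutes: int
    max_response_ms: int
    alert_channels: tuple

# The checks live in the repository, next to the service they watch.
CHECKS = [
    ApiCheck("homepage", "https://example.com/", 5, 2000, ("email",)),
    ApiCheck("orders-api", "https://example.com/api/orders", 1, 500, ("email", "pager")),
]

def validate(checks):
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    seen = set()
    for check in checks:
        if check.name in seen:
            problems.append(f"duplicate check name: {check.name}")
        seen.add(check.name)
        if check.frequency_minutes < 1:
            problems.append(f"{check.name}: frequency must be at least 1 minute")
        if not check.alert_channels:
            problems.append(f"{check.name}: no alert channel configured")
    return problems

def render(checks) -> str:
    """Serialize the checks to JSON, e.g. for a deploy step to upload."""
    return json.dumps([asdict(c) for c in checks], indent=2, sort_keys=True)

print(validate(CHECKS))
print(render(CHECKS))
```

Because the definitions are plain code, a pull request that changes an endpoint can change its check in the same commit, and `validate` can run in CI before anything is deployed.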

Monitoring and Observability

  • Observability: An attribute of a system that lets teams understand its internal state from its external outputs. Often understood to mean ‘how well we can figure out how the system is broken during an incident.’
  • API Monitoring: The continuous process of checking API endpoints for uptime, latency, and correct responses to ensure they meet predefined performance benchmarks and SLAs.
  • Dashboards: Visual interfaces that display key metrics and data points in real-time, allowing teams to monitor system performance, health, and activity. Dashboards are meant to imply an improved system of exploring our monitoring data over manually scanning and filtering logs and metrics.
  • “Single Pane of Glass”: Some buzzwords are so omnipresent they’re worth defining. A Single Pane of Glass references a design goal for an observability dashboard that presents all the needed data to analyze an incident. Ideally with such a tool, all your observability data goes to one place and is accessible in one interface. It is up for debate whether sending all your data to a single SaaS provider who is charging you exorbitantly for data storage means you are pursuing your own operational goals or just helping them meet their sales goals.
  • Synthetic Monitoring: A technique that simulates user interactions with applications or websites to monitor performance and availability with fixed input behaviors and expected outputs. Synthetic monitoring can simulate complex user behaviors like searching, navigating, and even performing CRUD operations.
  • Heartbeat Monitoring: Sending periodic signals to verify the operational status of systems, applications, or scheduled jobs. Heartbeat monitoring implies much simpler checks than those of synthetic monitoring. For example, while synthetic monitoring might simulate an update action on a user dashboard, heartbeat monitoring would merely check that this dashboard page returns a 200 OK status code.
  • Alerting: Configuring notifications to inform stakeholders of system issues, anomalies, or performance degradation. It should include features like multi-channel alerts and fallback notifications.
  • OpenTelemetry: An open source observability framework for cloud-native software that collects, processes, and exports telemetry data (metrics, logs, traces). Intended as an improved tool for observing requests as they travel between microservices.
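
The difference between heartbeat and synthetic monitoring is easier to see side by side. The sketch below uses an in-memory `FakeDashboard` class as a stand-in for a real web app, so every name in it is hypothetical; a real check would make HTTP requests or drive a browser instead.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FakeDashboard:
    """In-memory stand-in for a real web app (hypothetical, for illustration)."""
    healthy: bool = True      # does the page respond at all?
    saves_work: bool = True   # do updates actually persist?
    names: dict = field(default_factory=dict)

    def get_page(self) -> int:
        return 200 if self.healthy else 500

    def update_name(self, user: str, name: str) -> int:
        if not self.healthy:
            return 500
        if self.saves_work:
            self.names[user] = name
        return 200

    def read_name(self, user: str) -> Optional[str]:
        return self.names.get(user)

def heartbeat_check(app: FakeDashboard) -> bool:
    """Heartbeat: is the dashboard page answering with 200 OK?"""
    return app.get_page() == 200

def synthetic_check(app: FakeDashboard) -> bool:
    """Synthetic: run a fixed user flow and compare against the expected output."""
    if app.update_name("probe-user", "Probe") != 200:
        return False
    return app.read_name("probe-user") == "Probe"

# A dashboard that loads fine but silently drops updates.
broken_save = FakeDashboard(saves_work=False)
print("heartbeat ok:", heartbeat_check(broken_save))
print("synthetic ok:", synthetic_check(broken_save))
```

In this scenario the heartbeat check passes because the page still loads, while the synthetic check catches that the update flow is broken.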

Quality Assurance and Testing

  • Unit Testing: Tests written to be run against your service in isolation. Often run at high frequency to provide fast feedback.
  • Contract Testing: Testing your service against simulated versions of its dependencies to ensure that your service acts as expected. The process of defining contract tests can double as the process of defining the contracts between services.
  • Integration Testing: Checking how well an updated service integrates with the other services within your architecture. Eschewing mocks, stubs, or other attempts to simulate the other services that your service relies on, integration testing sends requests through your complete architecture. Integration testing is generally understood to exclude testing against third-party dependencies like other SaaS tools’ APIs.
  • End-to-End Testing (E2E Testing): Testing that verifies the complete workflow of an application from start to finish against specified requirements. Often best performed by sending an automated browser to complete real service flows against a Staging or Production environment.
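
Here is a small sketch of what contract testing can look like in practice. The tax service, the `CONTRACT` dictionary, and the function names are all invented for this example. The idea is that the stub answers the way the contract promises, so the service can be tested without the real dependency.

```python
# Hypothetical contract for a tax service response: field name -> expected type.
CONTRACT = {"rate": float}

def satisfies_contract(response: dict) -> bool:
    """Check that a response has every promised field with the promised type."""
    return all(isinstance(response.get(key), expected)
               for key, expected in CONTRACT.items())

class StubTaxService:
    """Simulated dependency that behaves as the contract describes."""
    def get_rate(self, region: str) -> dict:
        return {"rate": 0.25}

class BrokenTaxService:
    """Simulated dependency that violates the contract (rate is a string)."""
    def get_rate(self, region: str) -> dict:
        return {"rate": "25%"}

def order_total(subtotal: float, region: str, tax_service) -> float:
    """The service under test: it relies on the tax service's contract."""
    response = tax_service.get_rate(region)
    if not satisfies_contract(response):
        raise ValueError("tax service response broke the contract")
    return round(subtotal * (1 + response["rate"]), 2)

print(order_total(100.0, "EU", StubTaxService()))
```

An integration test, by contrast, would drop both stubs and call the real tax service running alongside the rest of the architecture.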

Reliability and Resilience

  • Incident Management: The process of identifying, analyzing, and correcting hazards to prevent recurrence. Alongside fixing the problem, incident management may include communicating with users once degraded service is identified.
  • Reliability: The measure of a system's ability to perform a required function under stated conditions for a specified period.
  • Resilience: The ability of a system to handle and recover from failures, maintaining operational performance under adverse conditions.
  • Service Level Agreement (SLA): A formal agreement between a service provider and its customers that defines the expected level of service. In enterprise services, a violation of SLA can automatically trigger refunds to a customer.
  • “Nines”: Service Level Agreements are often described by “how many nines” they promise, meaning the percentage of time the service is expected to be available. “Four nines” means the service is available 99.99% of the time, or all but roughly 4 minutes 20 seconds out of every month; the exact figure depends on how long a month you assume. See uptime.is for an SLA and nines calculator.
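
The arithmetic behind “nines” is simple enough to sketch. The `downtime_budget` function below is a hypothetical helper, and the 30-day month is an assumption; other month lengths shift the result by a few seconds.

```python
from datetime import timedelta

def downtime_budget(availability_percent: float, period: timedelta) -> timedelta:
    """Return how much downtime fits in `period` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)

month = timedelta(days=30)  # assumption: a 30-day month

for label, percent in [("two nines", 99.0),
                       ("three nines", 99.9),
                       ("four nines", 99.99)]:
    budget = downtime_budget(percent, month)
    print(f"{label} ({percent}%): about {round(budget.total_seconds())} seconds of downtime per month")
```

With a 30-day month, four nines leaves about 259 seconds, or a little over 4 minutes 19 seconds. Each extra nine divides the budget by ten, which is why moving from three nines to four is so much harder than it sounds.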

Software Development and Deployment

  • Code All Your Resources: The practice of defining all infrastructure, monitoring, and deployment configurations as code.
  • Code-first Workflow: An approach prioritizing the use of code to define environments, configurations, and operations.
  • Run From Your Repository: Executing scripts or tools directly from a version control system repository.
  • TypeScript First: Prioritizing TypeScript for script and application development, utilizing its strong typing system. Naturally, this is most applicable to developers who are already working in JavaScript.

Roles

  • Operations Engineering (Ops): The team dedicated to making all our software run in production. The work can run the gamut from database administration to scheduling CPU time to build new AI models. Once software is running in production for users, any aspect of how and where it’s running is the concern of Operations. Developers only get involved when a bug needs to be fixed or a feature changed. Two notes of disambiguation: operations is sometimes archaically referred to as IT, and some work more associated with traditional IT, like supplying and managing employee laptops, is referred to as operations.
  • DevOps Engineer: This is not a real job title; all those listings you see on Indeed are fake. Seriously, DevOps engineers do one of two things: either they do Ops, or they act as a liaison between the product and operations teams. Sadly, it seems the first definition is more prevalent: an operations engineer whose title has shifted to reflect that many of the concerns of operations are now expected to be shared by developers.
  • Site Reliability Engineering (SRE): A discipline that incorporates aspects of software engineering into infrastructure and operations problems. Pioneered at Google, the SRE role is an evolution of the older sysadmin role, reflecting that less of the work is focused on managing systems directly and more on coordinating running software on multiple systems. SREs are tasked with defining procedures for incident response, coordinating with developers, and tracking how well the system is maintaining its SLA.


    Do you have more suggestions for the DevOps Glossary? Disagree with any of my definitions? Join the Checkly Slack and let me know!
