
Case Study: How We Reduced One Client's Toil by 60%

· 6 min read

One of the engineers on the team I’m going to describe was spending roughly three hours a day not engineering. He was SSHing into servers to run deploys, digging through log files via tunnels, triaging alerts that didn’t mean anything, and resetting staging environments that never stayed stable. He was good at his job. He was also quietly looking at job listings.

This is a devops toil reduction case study. Here’s what we found in week one, what we fixed over six weeks, and what the numbers looked like on the other side.

What We Found in Week One

The client context

The client ran a B2B SaaS product with a team of twelve, three of whom were engineers. There was no dedicated ops person. Infrastructure duties had accumulated organically: a script here, a manual process there, a "we'll automate this later" that never got scheduled.

The DORA 2024 report found that toil rose to 30% of engineering time that year, the first increase in five years. This team was tracking well above that.

Two of the three engineers were senior. They knew what good looked like. The gap between what they were doing and what they wanted to be doing was visible and demoralizing.

Running the toil audit

I had them log every operational task they touched for five days. Not estimates: actual logs, timestamped, with durations. This surfaces things people have stopped noticing because they've normalized them.

We ended up with 23 distinct tasks. Seven of them accounted for roughly 80% of the time. The process we used is documented in detail in our toil reduction roadmap.

The Toil That Was Killing Them

Here’s what those seven tasks looked like across a typical week:

| Task | Time Per Occurrence | Frequency | Weekly Hours |
| --- | --- | --- | --- |
| Manual deploy via SSH | 45 min | 3x/week | 2.25 hrs |
| Database backup verification | 30 min | daily | 2.5 hrs |
| Cert renewal + nginx restart | 90 min | monthly | ~0.35 hrs avg |
| Access request tickets | 20 min | 8x/week | 2.7 hrs |
| Log access via SSH tunnel | 25 min | 10x/week | 4.2 hrs |
| Staging environment rebuild | 2 hrs | 2x/month | ~1 hr avg |
| Alert triage (false positives) | 15 min | 12x/week | 3 hrs |
| **Total** | | | **~16 hrs/week** |

Sixteen hours a week, spread across three engineers, with the lead engineer carrying the heaviest load. That’s two full working days of time that produced nothing the product needed.

If you want to convert those hours into dollar figures, our toil cost guide walks through the math. The short version: at a blended $100/hr engineering rate, this team was burning $80,000+ per year on work that shouldn’t exist.
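Back-of-the-envelope, assuming roughly 50 working weeks a year:

```shell
# 16 hrs/week of toil at a $100/hr blended rate, ~50 working weeks a year
weekly_toil_hours=16
hourly_rate=100
working_weeks=50
annual_cost=$((weekly_toil_hours * hourly_rate * working_weeks))
echo "annual toil cost: \$$annual_cost"   # -> annual toil cost: $80000
```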

What We Fixed and How

I want to be specific here because vague case studies aren’t useful. These aren’t theoretical fixes. This is what we actually shipped.

Quick wins, weeks one through three

CI/CD deploys. The manual SSH deploy process was eliminated by wiring up GitHub Actions to their existing deployment scripts. Nothing exotic. The workflow triggers on merge to main, runs the deploy, posts the result to Slack. Deploy time dropped from 45 minutes to 8 minutes, and, more importantly, the engineer time spent on it dropped to zero because the machine does it now.
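The workflow was roughly this shape. The script path and secret name are illustrative, not the client's actual config:

```yaml
# .github/workflows/deploy.yml -- illustrative sketch
name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run existing deploy script
        run: ./scripts/deploy.sh   # hypothetical path to their existing script
      - name: Notify Slack
        if: always()               # post the result whether it passed or failed
        run: |
          curl -s -X POST -H 'Content-type: application/json' \
            --data "{\"text\": \"deploy finished: ${{ job.status }}\"}" \
            "${{ secrets.SLACK_WEBHOOK_URL }}"
```

The point is the shape, not the specifics: trigger on merge, reuse the scripts they already trusted, report somewhere visible.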

Certbot automation. Cert renewal was a manual process on a calendar reminder. We replaced it with a Certbot systemd timer. The 90-minute monthly task became a scheduled job that runs itself. The engineer never touches it again.
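For reference, the setup looks roughly like this. Note that most distro Certbot packages already ship an equivalent `certbot.timer`, so check for that first; the units below are a sketch, and the deploy hook assumes nginx is managed by systemd:

```ini
# /etc/systemd/system/certbot-renew.service (sketch)
[Unit]
Description=Renew Let's Encrypt certificates

[Service]
Type=oneshot
ExecStart=/usr/bin/certbot renew --quiet --deploy-hook "systemctl reload nginx"

# /etc/systemd/system/certbot-renew.timer (sketch)
[Unit]
Description=Run certbot renewal twice daily

[Timer]
OnCalendar=*-*-* 00,12:00:00
RandomizedDelaySec=1h
Persistent=true

[Install]
WantedBy=timers.target
```

The deploy hook only runs when a certificate actually renews, so nginx isn't reloaded on every timer tick.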

Backup verification. They were manually spot-checking backup files daily to confirm they existed and weren’t zero bytes. We wrote a small script that checks size, age, and runs a restore smoke test on a sample, then posts a daily green/red status to Slack. Thirty minutes a day, eliminated.
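The check itself is small. This is a minimal sketch, assuming backups land in a known directory as gzipped dumps; the Slack webhook wiring is a placeholder, and the restore smoke test is omitted here for brevity:

```shell
#!/bin/sh
# check_latest_backup DIR -- return 0 (GREEN) if the newest *.gz in DIR
# exists, is non-empty, and is younger than 26 hours; 1 (RED) otherwise.
check_latest_backup() {
    dir=$1
    latest=$(ls -t "$dir"/*.gz 2>/dev/null | head -n 1)
    [ -n "$latest" ] || return 1    # no backup file at all
    [ -s "$latest" ] || return 1    # zero-byte file
    # modified within the last 26 hours (26 * 60 = 1560 minutes)
    [ -n "$(find "$latest" -mmin -1560 2>/dev/null)" ] || return 1
    return 0
}

# Example wiring (webhook URL is a placeholder, not the client's real one):
# if check_latest_backup /var/backups/db; then status=GREEN; else status=RED; fi
# curl -s -X POST -H 'Content-type: application/json' \
#      --data "{\"text\": \"backup check: $status\"}" "$SLACK_WEBHOOK_URL"
```

Run it from a daily timer, post the result to Slack, and the human spot-check disappears.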

Structural fixes, weeks four through six

Self-service access. Eight access request tickets per week means someone is constantly being interrupted to grant permissions. We built a small internal CLI backed by their existing IAM setup. Engineers request access, it gets approved via Slack reaction, access is granted automatically. No ticket queue. The interruption overhead collapsed.

Log access tool. SSHing into a server to tail logs through a tunnel is painful and doesn’t scale. We deployed a lightweight log aggregation setup, gave the team a browser-based query interface, and closed the SSH tunnel as an access method for logs. Ten requests a week that each took 25 minutes dropped to near zero engineering time.

Alert tuning. Twelve false-positive alerts per week is an alert problem, not an infrastructure problem. We audited every firing rule, set minimum thresholds based on actual baselines, and added page-only routing for the small subset of alerts that actually warranted waking someone up. Alert fatigue dropped sharply.
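I can't reproduce the client's rules here, but in Prometheus terms (an assumption for illustration; the names and thresholds are invented), the fix is a real threshold plus a `for:` duration so transient blips never page:

```yaml
# Illustrative Prometheus-style alert rule
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # fire only if the 5-minute error rate stays above 5% for 10 minutes,
        # instead of paging on every transient spike
        expr: rate(http_requests_errors_total[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 10m
        labels:
          severity: page   # only this routing tier wakes someone up
        annotations:
          summary: "Sustained error rate above 5%"
```

Everything that doesn't meet that bar goes to a channel someone reads in the morning, not a pager.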

The staging environment problem

Staging was a snowflake. It had been set up manually years ago and nobody fully knew what was on it. When it broke, which was often, someone spent two hours rebuilding it from memory and Slack history.

We replaced it with a Terraform configuration that provisions a clean environment in four minutes. The spec lives in version control. Anyone on the team can run it. The two-hour rebuild became a four-minute command.
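The config itself is unremarkable, which is the point. This is the shape of it only; the resource, variable names, and cloud provider are invented for illustration, not the client's actual setup:

```hcl
variable "environment" {
  type    = string
  default = "staging"
}

# assuming AWS here -- the client's provider isn't named in this post
resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0"  # placeholder image id
  instance_type = "t3.medium"

  tags = {
    Environment = var.environment
  }
}
```

With the spec in version control, `terraform apply` rebuilds staging from scratch, and "what's on that box" stops being tribal knowledge.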

The After-State: Six-Week Results

We ran the same five-day task log six weeks after the engagement started. Here’s what changed:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Manual ops hours/week | 16 hrs | 6.5 hrs | -59% |
| Deploy time | 45 min | 8 min | -82% |
| False-positive alerts/week | 12 | 3-4 | -70% |
| Staging env rebuild time | 2 hrs | 4 min | -97% |
| Lead engineer weekly toil | ~18 hrs | ~7 hrs | -61% |

The 60% reduction is a measured outcome, not an estimate. It comes from comparing the week-one audit to the week-six re-audit using the same logging methodology.

The lead engineer who had been quietly job-hunting decided to stay. In week seven, the team shipped a feature that had been sitting on the backlog for two months because nobody had had the headspace to start it. That’s the outcome that matters to the CTO.

What Made This Work

A few things that aren’t obvious from the numbers:

Logging before fixing. Most teams jump straight to tooling. The five-day log forces you to see where time actually goes, not where people think it goes. The answers are usually surprising.

Fixing the top seven, not all twenty-three. The 80/20 rule held. We ignored the long tail of small tasks in the first pass. Scope control matters.

Using tools they already had. We didn’t introduce a new platform. Everything here used GitHub Actions, Terraform, basic shell scripting, and their existing cloud IAM. The friction of adoption was low because the tools weren’t new.

The full methodology is in our toil reduction roadmap.

Could Your Team Get Here?

This was a relatively contained environment. More complex infrastructure takes longer to audit and longer to fix. But the pattern holds: every small team we’ve audited has found that a handful of tasks accounts for the majority of their toil, and most of those tasks are automatable with tools they already have.

For a deeper look at what makes work toil versus engineering, start with our complete guide to DevOps toil.

This is a typical fractional DevOps engagement. Six weeks, targeted scope, measurable outcome. No full-time hire required.

Want to see what this looks like for your stack? We offer a free async audit. We review your setup, record a Loom walkthrough, and send a written report with a prioritized list. No call required. Get Your Free Infrastructure Audit →
