Field report — July 2026

I Can't Code. My Company Is Run by 100+ AI Agents Anyway.

A field report from the operator's seat — including the failures.

system nominal · last incident: 4 fixes shipped, 0 rollbacks

Here's what a bad night looks like now. Somewhere past midnight, a background job hangs. A watchdog notices within the hour. The repair loop wakes up, checks whether this is provably safe to fix (a hung job restart is), restarts it, probes that it actually came back healthy, and writes one line into tomorrow's digest. My phone stays dark. That's the whole point. I find out over coffee, in a one-line summary, that something broke and got fixed while I slept. I couldn't tell you what the code that did this looks like. I've never read it. I can't read code at all.

One person, one real company

I'm the managing director of a small e-commerce company in Basel, Switzerland. Our main brand sells kitchen accessories on Amazon in 8 EU marketplaces. No IT department. No developer. No technical co-founder hiding in the back. What the company does have: 101 scheduled background jobs, 66 custom agent skills, around 440 helper scripts, ~10 API integrations, and more than 3,000 automated tests — all of it built by AI agents, mostly Claude Code, with a second AI from a competing vendor as reviewer. At one snapshot this week, 7 Claude sessions were working in parallel. Later the same night: 15. This isn't a demo repo. It's the machine my actual company runs on:

Email: every inbound mail across three mailboxes gets classified; roughly 70% is noise and gets silently filed. What matters lands on my phone as a card with buttons. Tap "draft," and a pipeline writes the reply and pushes it through four independent quality gates before I send it with one tap. Legal and finance mail is deliberately hands-off. Agents never send email on their own, no exceptions, ever.
Advertising: Amazon PPC runs in-house. Agents collect data, keep baselines, flag anomalies, write the weekly report, and prepare campaign changes. Anything that moves money passes an approval sheet plus an adversarial "advisor" gate. Go-lives get verified programmatically across all four levels of Amazon's campaign hierarchy.
IP enforcement: monitors scan marketplaces for products infringing our patents and registered designs, using perceptual hashing, embeddings, and a vision model. Evidence is scraped and archived automatically, and infringement reports come out as court-ready PDFs. Filing always needs my explicit go.
Everything else: invoice filing, FBA inbound shipments end-to-end, listing audits, automated review requests, a vacation rental with agent-drafted guest communication, and two knowledge bases the system re-crawls monthly so agents answer from current documentation instead of model memory.

The fair question is not "how much does it do." It's: why hasn't this blown up in your face?

The machine, by the numbers

Scheduled jobs: 101, across 8 EU marketplaces
Agent skills: 66 custom Claude Code skills
Helper scripts: around 440, shell/Python automation
Automated tests: more than 3,000, green before every deploy

Section 02 — The operations layer

Autonomy is earned, not granted.

Every subsystem climbs a ladder: shadow mode (proposes, executes nothing) → pilot on real data → gated mode (every action needs my tap) → auto mode, for a narrow, defined class of actions only. The newest autonomous subsystem — an investigator that root-causes alarms — spent a supervised shadow week before earning auto mode.

36 cases · 17 false alarms dismissed · 6 real fixes shipped · 0 rollbacks
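
As a rough illustration only, the ladder can be sketched as a small state machine. This is not the code the system runs. The names `Stage`, `decide`, and `promote` are hypothetical, and because the text does not say how pilot differs from gated at execution time, this sketch assumes both need a human approval.

```python
from enum import IntEnum


class Stage(IntEnum):
    SHADOW = 0  # proposes actions, executes nothing
    PILOT = 1   # runs on real data (assumed here: still needs approval)
    GATED = 2   # every action needs a human tap
    AUTO = 3    # may act alone, but only on allowlisted action classes


def decide(stage: Stage, action: str, auto_allowlist: frozenset[str]) -> str:
    """Return what a subsystem may do with one proposed action."""
    if stage is Stage.SHADOW:
        return "log_only"
    if stage is Stage.AUTO and action in auto_allowlist:
        return "execute"
    # Everything else, including AUTO outside its narrow class, goes to a human.
    return "ask_human"


def promote(stage: Stage) -> Stage:
    """Climb exactly one rung; AUTO is the top of the ladder."""
    return Stage(min(stage + 1, Stage.AUTO))
```

The design choice worth noting is that auto mode is not a blanket permission: an action outside the allowlist still falls back to a human tap.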

The maker is never the checker.

Nothing important is verified by the model that produced it — and where it counts, not even by the same vendor. Claude's fixes get reviewed by OpenAI's Codex. Codex's code gets verified by Claude.

Heal first, page never.

Background failures don't interrupt me. The system fixes what's provably safe, files the rest to a dashboard, sends one digest a day. Only a short allowlist of true emergencies may ping my phone.
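
A minimal sketch of that routing rule, under stated assumptions: the failure kinds in `PAGE_ALLOWLIST` and `SAFE_FIXES` are invented for illustration (the text names only a hung job as provably safe), and `Failure`, `route`, and `digest` are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical values: the text only says a short allowlist of emergencies exists.
PAGE_ALLOWLIST = frozenset({"payment_outage"})
# Failure kinds assumed to be provably safe to fix without a human.
SAFE_FIXES = frozenset({"hung_job"})


@dataclass(frozen=True)
class Failure:
    kind: str
    job: str


def route(failure: Failure) -> str:
    """Decide where a background failure goes."""
    if failure.kind in PAGE_ALLOWLIST:
        return "page"
    if failure.kind in SAFE_FIXES:
        return "auto_fix"
    return "dashboard"


def digest(failures: list[Failure]) -> list[str]:
    """One line per non-paging failure for the daily digest."""
    return [
        f"{f.job}: {f.kind} -> {route(f)}"
        for f in failures
        if route(f) != "page"
    ]
```

Paging is opt-in by allowlist, so a new, unknown failure kind defaults to the dashboard rather than to the phone.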

Everything has an off switch.

Every autonomous loop has a one-line kill switch, snapshots before changes, automatic rollback if the audit fails, and a hard cap of two fix attempts before escalation. No agent may ever autonomously touch the protected zone: the agents' own guardrails and configs.
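
The four safeguards can be sketched as one loop. This is an illustration under assumptions, not the template from the repository: the kill switch is modelled as a file whose presence disables the loop, the protected paths are made-up names, and `fix`, `audit`, `snapshot`, and `rollback` are caller-supplied callables.

```python
from pathlib import Path
from typing import Callable

MAX_ATTEMPTS = 2  # the hard cap named in the text
# Hypothetical protected zone: directories holding guardrails and configs.
PROTECTED = (Path("guardrails"), Path("agent_config"))


def is_protected(target: Path) -> bool:
    """True if target is a protected directory or lives inside one."""
    return any(target == zone or zone in target.parents for zone in PROTECTED)


def heal(
    target: Path,
    fix: Callable[[Path], None],
    audit: Callable[[Path], bool],
    snapshot: Callable[[Path], object],
    rollback: Callable[[Path, object], None],
    kill_switch: Path,
) -> str:
    if kill_switch.exists():  # assumed: creating this file stops the loop
        return "disabled"
    if is_protected(target):
        return "escalated"
    for _ in range(MAX_ATTEMPTS):
        saved = snapshot(target)  # snapshot before every change
        fix(target)
        if audit(target):
            return "fixed"
        rollback(target, saved)  # audit failed: restore the snapshot
    return "escalated"
```

After two failed attempts the target is back in its snapshotted state and the problem goes to a human instead of a third try.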

Incidents become law.

Every real failure ends as a written rule, loaded into every future session. A model once wrote "Friday" next to a date that was a Saturday — since then every weekday is verified programmatically. Every number in outbound text must trace to a source file.
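
The weekday rule is easy to check mechanically. A minimal sketch, assuming dates appear in outbound text as an English weekday name followed by an ISO date (for example "Friday, 2026-07-04"); the real text format and the function name `weekday_errors` are assumptions.

```python
import re
from datetime import date

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")
# Matches e.g. "Friday, 2026-07-04" or "Friday 2026-07-04".
PATTERN = re.compile(r"\b(" + "|".join(WEEKDAYS) + r"),? (\d{4})-(\d{2})-(\d{2})\b")


def weekday_errors(text: str) -> list[str]:
    """Return one message per weekday that does not match its date."""
    errors = []
    for match in PATTERN.finditer(text):
        claimed, year, month, day = match.groups()
        try:
            # date.weekday() returns 0 for Monday, matching WEEKDAYS order.
            actual = WEEKDAYS[date(int(year), int(month), int(day)).weekday()]
        except ValueError:
            errors.append(f"{year}-{month}-{day} is not a valid date")
            continue
        if claimed != actual:
            errors.append(f"{year}-{month}-{day} is a {actual}, not a {claimed}")
    return errors
```

For the text "Shipping on Friday, 2026-07-04." this returns one error, because that date is a Saturday.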

A canary for context rot.

Long AI sessions degrade quietly. The canary: the assistant must address me by name, Kevin, in every reply. The day the name disappears, the context is rotting.
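
The canary can also be checked by a script rather than by eye. A small sketch, assuming the session's replies are available as a list of strings; `first_reply_without_name` is a hypothetical helper.

```python
import re
from typing import Optional


def first_reply_without_name(replies: list[str], name: str = "Kevin") -> Optional[int]:
    """Return the index of the first reply that drops the name, or None."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    for index, reply in enumerate(replies):
        if not pattern.search(reply):
            return index
    return None
```

A non-None result is the signal to start a fresh session rather than to keep trusting the current one.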

Measure, don't vibe.

A 7-day token audit found one repair loop whose runs were ~95% no-ops: ~€231/week of API-equivalent spend for zero changes. An LLM-free pre-check gate was built the same day. A separate benchmark moved email drafting to a mid-tier model at equal quality and ~1.8x lower cost.
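
One way such a gate might work, sketched under assumptions: a cheap deterministic check collects open findings, and the expensive model-driven loop wakes only when there are findings it has not already seen. The class name `PreCheckGate` and the fingerprinting approach are illustrative, not taken from the system.

```python
import hashlib
from typing import Optional


class PreCheckGate:
    """Decides, without calling any model, whether the repair loop should run."""

    def __init__(self) -> None:
        self._last_seen: Optional[str] = None

    def should_run(self, findings: list[str]) -> bool:
        if not findings:
            return False  # nothing is broken: skip the model entirely
        # Order-independent fingerprint of the current set of findings.
        joined = "\n".join(sorted(findings)).encode()
        fingerprint = hashlib.sha256(joined).hexdigest()
        if fingerprint == self._last_seen:
            return False  # same findings as last time: no new work
        self._last_seen = fingerprint
        return True
```

The gate costs a hash per run, so a loop that is mostly no-ops stops spending tokens on the runs where nothing changed.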

Section 03 — The scar tissue

None of this is an argument against the system. It is the system.

INCIDENT_01

The campaigns that lied

Three ad campaigns reported "live." Underneath, every ad group, keyword, and product ad was still paused, serving nothing. I caught it myself, using domain knowledge.

→ A four-level programmatic go-live check is now mandatory.
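
A sketch of what a four-level check could look like, taking the levels from the incident itself: campaign, ad group, keyword, product ad. It runs on a plain dictionary rather than on Amazon's API, and both the data shape and the rule "at least one enabled entity per level" are assumptions.

```python
def go_live_problems(campaign: dict) -> list[str]:
    """Return every reason a campaign marked live would serve nothing."""
    problems = []
    # Level 1: the campaign itself.
    if campaign["state"] != "enabled":
        problems.append("campaign is not enabled")
    # Level 2: at least one ad group must be enabled.
    if not any(g["state"] == "enabled" for g in campaign["ad_groups"]):
        problems.append("no enabled ad group")
    # Levels 3 and 4: each ad group needs a live keyword and a live product ad.
    for group in campaign["ad_groups"]:
        for level in ("keywords", "product_ads"):
            if not any(item["state"] == "enabled" for item in group[level]):
                problems.append(f"ad group {group['name']}: no enabled {level}")
    return problems
```

An empty list means every level has something live; a campaign that is enabled only at the top, as in this incident, fails at the three levels beneath it.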

INCIDENT_02

The backup that quietly rotted

A shell-script flaw mis-rotated backups for four weeks before anyone noticed.

→ Every file-touching helper script now needs a fixture-based smoke test before it may run on schedule.
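
To show the idea of a fixture-based smoke test, here is a sketch in Python (the original script was shell). The rotation rule, the file naming scheme, and both function names are assumptions; the point is that the test runs against throwaway files in a temporary directory, never against real backups.

```python
import tempfile
from pathlib import Path


def rotate_backups(directory: Path, keep: int) -> list[str]:
    """Delete all but the `keep` newest backups; names sort by date."""
    backups = sorted(directory.glob("backup-*.tar.gz"))
    # Guard keep=0 explicitly: backups[:-0] would select nothing.
    doomed = backups[:-keep] if keep > 0 else backups
    for path in doomed:
        path.unlink()
    return [path.name for path in doomed]


def smoke_test_rotation() -> None:
    """Run the rotation on fixture files and check what survives."""
    with tempfile.TemporaryDirectory() as tmp:
        fixture = Path(tmp)
        for day in range(1, 6):
            (fixture / f"backup-2026-07-0{day}.tar.gz").write_bytes(b"x")
        removed = rotate_backups(fixture, keep=3)
        assert removed == ["backup-2026-07-01.tar.gz", "backup-2026-07-02.tar.gz"]
        survivors = sorted(path.name for path in fixture.iterdir())
        assert survivors == [
            "backup-2026-07-03.tar.gz",
            "backup-2026-07-04.tar.gz",
            "backup-2026-07-05.tar.gz",
        ]
```

A scheduler could refuse to register the script until `smoke_test_rotation()` passes.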

INCIDENT_03

The test suite that cried wolf

A migration test hit the real alert channel, and six fake emergencies paged my phone.

→ Tests are now hermetically sealed from production alerting.
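
One possible shape for that seal, as a sketch: the alert channel is told which environment it runs in and refuses to deliver anything outside production. `AlertChannel` and the environment labels are hypothetical.

```python
from typing import Callable


class AlertChannel:
    """Delivers pages only in production; elsewhere it records and drops them."""

    def __init__(self, send: Callable[[str], None], environment: str) -> None:
        self._send = send
        self._environment = environment
        self.suppressed: list[str] = []

    def page(self, message: str) -> bool:
        if self._environment != "production":
            # Fail closed: anything not explicitly production never pages.
            self.suppressed.append(message)
            return False
        self._send(message)
        return True
```

Tests can still assert on `suppressed` to prove an alert would have fired, without a single message leaving the machine.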

INCIDENT_04

The drafts that vanished

Reply drafts were silently discarded by an over-strict verifier.

→ Fixed with regression tests — "the pipeline says OK" is never the final word.

Failures get caught, contained, reversed — and turned into law.

I still can't write a for-loop. Running this thing forged a role I'd call agent operations: specify outcomes precisely, decide what may run autonomously and what never will, design verification that doesn't rest on trusting any single model, budget attention and tokens, and harden the machine incident by incident.

It isn't a coding skill. It's an operating one, closer to running a plant than shipping a feature: how a plant manager thinks about machines, how a CFO thinks about controls.

github.com/kevintheo-ai/agent-ops-kit

The watchdog pattern, the LLM-free pre-check gate, a heal-loop template with kill switch, shadow mode, protected zones and rollback — plus the staged-autonomy ladder. MIT license.

git clone https://github.com/kevintheo-ai/agent-ops-kit

