Monitoring
Purpose: Give the Engagement Lead a small, stable set of efficiency metrics — and a cadence for reviewing them — to track Swifter adoption and delivery health once the team is delivering real work through the platform.
Scope: ongoing delivery only
Monitoring attaches to the delivery part of the engagement — once Onboard is closed and the team is running real work items through Swifter on their own. It does not run during Explore, Setup, or Onboard. The reason is simple: the metrics measure what happens when real work items flow through agents, and those workflows are only active once delivery is under way. Outside delivery there is no signal worth reviewing.
The cadence and the metrics described below assume the team is actively delivering through the platform. If the team pauses delivery for any reason, Monitoring is paused, not run on stale data.
Why monitor
A Swifter engagement is judged on two outcomes: the team actually uses the agent (adoption), and the agent produces clean work without constant steering (efficiency). Without instrumentation, both questions get answered by anecdote — usually too late to course-correct. Monitoring pins the Engagement Lead to four headline metrics that the analytics pipeline already computes per work item, and to a weekly review cadence that turns those numbers into delivery decisions rather than dashboard wallpaper.
The four metrics are deliberately narrow. automation_ratio answers "is delivery actually agent-driven?". manual_intervention, verification_pass_rate, and code_churn_ratio answer "is the agent's output any good?". Other questions an Engagement Lead might reasonably ask — time-to-PR, cost per work item, acceptance-test pass rate, cycle-time, the adoption funnel — are not surfaced by the analytics pipeline today; some are derivable from raw fields, others live in operational telemetry. The What is not measured today section is explicit about that gap so the Engagement Lead does not promise stakeholders numbers the platform cannot produce.
How the four metrics fit together
| Question the EL asks | Metric | Section below |
|---|---|---|
| How much of delivery is genuinely agent-driven? | automation_ratio | Swifter Adoption |
| How much human steering did each work item need? | manual_intervention | Manual intervention |
| How clean was the agent's first-try output? | verification_pass_rate | Verification pass rate |
| Were code changes proportionate to scope? | code_churn_ratio | Code churn ratio |
All four are stored 0-100 % in the per-work-item extended view. The upstream aggregated view stores some of the same rates as 0-1 ratios; consumers must not mix scales when quoting numbers. Averages exclude work items with no linked PR, so trend lines lag in-flight work.
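The scale rule above can be made mechanical by tagging every quoted value with the frame it came from. The following is a minimal sketch only: the `Rate` type and the frame labels `extended` and `aggregated` are illustrative names, not identifiers from the analytics codebase.

```python
from dataclasses import dataclass

# Hypothetical frame labels; the pipeline's real frame names may differ.
EXTENDED = "extended"      # values stored as 0-100 percentages
AGGREGATED = "aggregated"  # values stored as 0-1 ratios


@dataclass(frozen=True)
class Rate:
    """A metric value together with the frame it was read from."""
    value: float
    frame: str

    def as_percent(self) -> float:
        """Return the rate on the 0-100 scale regardless of source frame."""
        if self.frame == EXTENDED:
            return self.value
        if self.frame == AGGREGATED:
            return self.value * 100.0
        raise ValueError(f"unknown frame: {self.frame!r}")


def same_rate(a: Rate, b: Rate, tolerance: float = 1e-9) -> bool:
    """Compare two rates only after putting both on the 0-100 scale."""
    return abs(a.as_percent() - b.as_percent()) <= tolerance
```

Carrying the frame label with the number means a 0.5 from the aggregated view and a 50 from the extended view compare as equal instead of looking like a hundredfold discrepancy.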
Swifter Adoption
| Metric | Range | What it measures | Source data | Defined in |
|---|---|---|---|---|
| automation_ratio | 0-100 % | Share of a PR's code authored by the agent (vs. by a human); 100 = fully agent-written, 0 = fully human. | Linked PR's agent_ratio (commit-level authored_by_agent flag aggregated to PR, then mapped to work item via prs_id_vals → linked_pr_id). | swifter-ai-data-analytics: models/dataframe_models/swifter_data.py:44; pipelines/stats_calculation/nodes.py:19 |
What automation_ratio measures
automation_ratio is the per-work-item percentage of PR commits authored by the Swifter agent. It is the direct adoption signal: opening sessions, drafting work items, or even reading the dashboard do not count — only commits whose author is the agent, merged into a PR linked to the work item, move the number. That distinction matters because every other "adoption" proxy (logins, sessions per week, work items opened) over-reports usage in the early weeks of an engagement, when the team is exploring rather than delivering.
The metric is per work item by design. Each work item carries a list of linked PRs (prs_id_vals), and analytics resolves that list to the PR-level agent_ratio, then surfaces the value back on the work item row. The Engagement Lead then aggregates those rows to phase, project, or org level depending on the audience. Per-work-item values are noisy and best used for spot-checks; project-level rolling averages are the number that goes on the dashboard.
How the metric is computed
The analytics Kedro pipeline runs four stages in series:
- extract_data: pulls raw Swifter (work items, work sessions) and Git (PRs, commits, modified files) events.
- preprocess_swifter_data: normalises work-item and session shapes into a typed SwifterWorkItemsAggregated frame.
- preprocess_git_data: normalises PR, commit, and file-modification shapes, and computes the per-PR agent_ratio by aggregating the commit-level authored_by_agent boolean.
- stats_calculation: joins work items to their linked PR, copies agent_ratio onto the work item, scales 0-1 → 0-100, and emits the extended frame the dashboard consumes.
Two rules in that flow shape every chart downstream. First, the join key is linked_pr_id derived from prs_id_vals; if a work item has no PR yet (linked_pr_id == '') it is excluded from averages — meaning in-progress work is silently dropped and merged work is over-represented. Second, the value is stored 0-100 in the extended frame but 0-1 in upstream aggregations; never mix scales when quoting figures across pages or dashboards.
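The join and the exclusion rule can be illustrated with a small stand-alone sketch. It is not the pipeline's code: it assumes agent_ratio is the unweighted share of commits flagged authored_by_agent, takes linked_pr_id as already resolved from prs_id_vals, and uses plain dictionaries in place of the typed frames. The function names and sample IDs are hypothetical.

```python
from statistics import mean


def pr_agent_ratio(commits):
    """Share of a PR's commits flagged authored_by_agent, as a 0-1 ratio."""
    if not commits:
        return None
    return sum(1 for c in commits if c["authored_by_agent"]) / len(commits)


def extend_work_items(work_items, commits_by_pr):
    """Attach automation_ratio (0-100) to each work item via linked_pr_id."""
    rows = []
    for wi in work_items:
        pr_id = wi["linked_pr_id"]
        # An empty linked_pr_id means no PR yet, so there is no ratio to copy.
        ratio = pr_agent_ratio(commits_by_pr.get(pr_id, [])) if pr_id else None
        scaled = None if ratio is None else ratio * 100.0  # 0-1 -> 0-100
        rows.append({**wi, "automation_ratio": scaled})
    return rows


def project_average(rows):
    """Average automation_ratio over work items that have a value."""
    values = [r["automation_ratio"] for r in rows if r["automation_ratio"] is not None]
    return mean(values) if values else None


# Illustrative data only.
commits_by_pr = {
    "pr-1": [{"authored_by_agent": True}, {"authored_by_agent": True},
             {"authored_by_agent": False}, {"authored_by_agent": True}],
    "pr-2": [{"authored_by_agent": False}, {"authored_by_agent": True}],
}
work_items = [
    {"id": "wi-1", "linked_pr_id": "pr-1"},
    {"id": "wi-2", "linked_pr_id": "pr-2"},
    {"id": "wi-3", "linked_pr_id": ""},  # no PR yet: dropped from the average
]
rows = extend_work_items(work_items, commits_by_pr)
average = project_average(rows)  # averages wi-1 and wi-2 only
```

In this sample the third work item contributes nothing to the project average, which is exactly the silent drop the first rule describes.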
How to read it during an engagement
Read automation_ratio as a trend, not a snapshot. Early-delivery values are dominated by exploration commits, partial-team adoption, and a sparse PR set, so a single-week reading carries little information on its own. The signal is direction: is the project-level rolling average climbing as the team gets through the Cookbook material, or is it stuck while session counts rise? The latter pattern (sessions up, automation flat) usually means the team is using Swifter for analysis and editing by hand, which is a coaching opportunity rather than a platform issue.
Indicative trajectories:
| Engagement shape | Week 1–2 | Week 4–6 | Steady state |
|---|---|---|---|
| Swifter-led delivery (Swifter team writes the software on the platform) | 60–80 % | 75–90 % | Stable in 80–95 % band |
| Customer-led delivery (client developers own delivery on the platform) | 20–40 % | 40–60 % | Rising; 60 %+ once the team is independent |
| Swifter delivery-centre team (a dedicated Swifter delivery team owns delivery) | 50–70 % | 70–85 % | Stable in 80 %+ band |
These are reference shapes from observed delivery, not hard targets. The Engagement Lead should set explicit per-engagement targets at kick-off and revisit them at each milestone.
Caveats and exclusions
- No-PR work items are excluded. Work items still in spec, in QA, or abandoned mid-session never enter the average. Project-level numbers will over-represent finished work and under-represent backlog drag.
- Scale mismatch. Extended frame is 0-100, aggregated frame stores agent_ratio and adjacent rates as 0-1. Always note which frame a number came from.
- Commit attribution depends on Git identity. If a developer commits agent-generated code under their own identity, authored_by_agent is false and the work item looks human-driven. The metric measures attribution, not authorship intent.
- PR-level granularity. A WI linked to multiple PRs takes the join's resolved agent_ratio; mixed-authorship PRs are blended at the PR level before the WI sees them.
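One way to keep the no-PR exclusion visible is to quote a coverage figure next to any project-level average. The helper below is a sketch under that assumption; `pr_coverage` is a hypothetical name and the analytics pipeline does not expose such a column.

```python
def pr_coverage(work_items):
    """Percentage of work items that have a linked PR and so enter the averages."""
    if not work_items:
        return None
    # An empty or missing linked_pr_id counts as "no PR yet".
    linked = sum(1 for wi in work_items if wi.get("linked_pr_id"))
    return 100.0 * linked / len(work_items)
```

A project-level automation_ratio quoted with low coverage describes only the finished minority of the backlog and should be read with that in mind.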
Manual intervention
| Metric | Range | What it measures | Source data | Defined in |
|---|---|---|---|---|
| manual_intervention | 0-100 % | Share of user messages in a work session that are "extra prompts": corrections, clarifications, and manual nudges beyond the initial WI description. | user_extra_prompts_count / user_message_count from SwifterWorkItemsAggregated. | models/dataframe_models/swifter_data.py:46; pipelines/stats_calculation/nodes.py:21 |
manual_intervention is the headline "how much human nudging did this work item need?" metric. The denominator is total user messages across the work item's sessions; the numerator is the subset classified as extra prompts — corrections, clarifications, do-overs, and manual nudges beyond the initial description. High values mean the team had to steer the agent message by message; low values mean the work item description carried the agent to a clean result.
The reference baseline from a recent production pilot is a 17.3 % macro-average across 53 work items, or 21.3 % when weighted by message volume (696 of 3,261 user messages). The most intervention-heavy WI in that dataset reached 33.3 %, and the cluster of QA acceptance-criteria items sat in the high twenties despite scoring 8-9 of 9 on description quality. That is strong evidence that, above a certain description-quality floor, residual intervention is a platform or skill issue rather than a description issue.
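The two baseline figures differ because they average differently: the macro-average gives every work item equal weight, while the volume-weighted figure pools all messages first. The sketch below shows both calculations on made-up counts; only the 696 and 3,261 totals come from the pilot, and the function name is illustrative.

```python
def intervention_rates(work_items):
    """Return (macro_average, volume_weighted_average) as 0-100 percentages.

    Each item is a (user_extra_prompts_count, user_message_count) pair.
    Items with no user messages are skipped to avoid dividing by zero.
    """
    pairs = [(extra, total) for extra, total in work_items if total > 0]
    if not pairs:
        return None, None
    # Macro: average the per-work-item rates, one vote per work item.
    macro = 100.0 * sum(extra / total for extra, total in pairs) / len(pairs)
    # Weighted: pool the counts, so chatty work items dominate.
    weighted = 100.0 * sum(e for e, _ in pairs) / sum(t for _, t in pairs)
    return macro, weighted


# Hypothetical counts: one short clean item, one long heavily steered item.
macro, weighted = intervention_rates([(1, 10), (30, 90)])

# The pilot's weighted figure follows directly from its pooled totals.
pilot_weighted = 100.0 * 696 / 3261
```

When the weighted figure sits above the macro-average, as in the pilot, the longest conversations are also the most heavily steered ones.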
Read the metric two ways. As a project trend, rising intervention is the earliest visible signal of either a platform regression or a team-skill drift. As a per-work-item flag, top-of-distribution values are the right place to start a spot-check before escalating anything. The interpretation playbook is in How these metrics combine.
Verification pass rate
| Metric | Range | What it measures | Source data | Defined in |
|---|---|---|---|---|
| verification_pass_rate | 0-100 % | Share of verification gates that passed in the work session on first try. | Aggregated from session-level verification events; stored 0-1 at aggregation, scaled to 0-100 at extension. | models/dataframe_models/swifter_data.py:33,47; pipelines/stats_calculation/nodes.py:22 |
verification_pass_rate measures how clean the agent's output is on the first run, before human escalation or auto-fix loops. A verification gate is any session-level check the platform runs against the agent's output — type checks, build gates, lint passes, the skill's own self-verification step. A pass on first try means the gate cleared without re-prompting; a fail means at least one retry was needed.
Two things to remember when quoting the number. First, it is stored 0-1 at the aggregation stage and scaled to 0-100 in the extended frame — never mix the scales when comparing values across pages. Second, the metric is silent about gates that never ran: a work item where no verification fired will show a clean rate but no quality signal. Treat verification_event_number as the denominator-of-record when judging whether a verification_pass_rate reading is meaningful.
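The denominator-of-record rule can be expressed as a guard that refuses to report a rate when no gate ran. This is a reading convention sketched for illustration, not the pipeline's behaviour; `passed_first_try` is a hypothetical input name, while verification_event_number is the field named above.

```python
from typing import Optional


def verification_pass_rate(passed_first_try: int,
                           verification_event_number: int) -> Optional[float]:
    """Return the first-try pass rate on the 0-100 scale, or None if no gate ran."""
    if verification_event_number == 0:
        # No verification fired: there is no quality signal to report.
        return None
    if not 0 <= passed_first_try <= verification_event_number:
        raise ValueError("passed_first_try must be between 0 and the event count")
    return 100.0 * passed_first_try / verification_event_number
```

Returning None rather than a number keeps a work item with zero gates from being read as a clean one.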
In practice, low values cluster around two patterns: skills that emit code without re-reading their own framework knowledge (common cause of convention-violation failures), and work items where description scope expanded silently between spec and implementation, forcing re-runs. The first is a skill-engineering fix; the second shows up jointly with high code_churn_ratio and is a scoping problem.
Code churn ratio
| Metric | Range | What it measures | Source data | Defined in |
|---|---|---|---|---|
| code_churn_ratio | 0-100 % | Code churn associated with the PR: proportion of changes vs. scope. | Linked PR's code_churn_ratio from Git data (insertions / deletions / files / lines). | models/dataframe_models/swifter_data.py:45; pipelines/stats_calculation/nodes.py:20 |
code_churn_ratio flags PRs where change volume looks disproportionate to the work item's stated scope. The aggregation pulls insertions, deletions, files, and total lines from the PR's commits and resolves them to a 0-1 ratio at the PR level, which the extended frame scales to 0-100. The precise formula is computed inside the preprocess stage and is not exposed at the metric's declaration site, so quote it as a relative indicator rather than an absolute physical quantity.
Read it alongside verification_pass_rate and the description-quality score. High churn on a narrowly scoped WI usually means one of three things: the WI implicitly grew (the description said one screen, the PR touched five), the agent rewrote the same artefact multiple times in-session, or a sibling reference dragged in unintended files. None of those are fixed by re-running the agent; they are fixed by re-splitting the WI or tightening the scope-out clause.
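Because the metric is a relative indicator, a practical way to shortlist candidates is to compare each work item against the project's own distribution. The sketch below flags work items that sit in the top decile and also exceed a multiple of the project median. The factor of 2 is an assumption chosen for illustration, and `churn_outliers` is a hypothetical helper, not part of the pipeline.

```python
from statistics import median, quantiles


def churn_outliers(churn_by_wi, factor=2.0):
    """Return IDs of top-decile work items whose churn exceeds factor x median.

    churn_by_wi maps a work-item ID to its code_churn_ratio (0-100).
    """
    values = list(churn_by_wi.values())
    if len(values) < 2:
        return []  # too few points to talk about a distribution
    p90 = quantiles(values, n=10)[-1]  # last cut point = 90th percentile
    mid = median(values)
    return sorted(
        wi for wi, value in churn_by_wi.items()
        if value >= p90 and value > factor * mid
    )
```

The double condition matters: in a uniformly low-churn project the top decile still exists, but nothing in it is disproportionate.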
Weekly review cadence
The Engagement Lead pulls the four headline metrics from the analytics dashboard once per week and compares them to the prior week's snapshot. The discipline is trend-watching, not point-in-time reading: single-week spikes are noise unless they persist for two consecutive readings or coincide with a known platform or team change. The whole review is meant to take under thirty minutes — pull, compare, decide, log — and is on the Engagement Lead's calendar as a recurring slot, not an ad-hoc task.
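The persistence rule (a spike counts only once it holds for two consecutive readings) is simple enough to encode. A minimal sketch, with a hypothetical function name and an illustrative threshold in the usage:

```python
def persistent_breach(readings, threshold, consecutive=2):
    """True when the last `consecutive` weekly readings all exceed `threshold`.

    readings is a list of weekly values, oldest first.
    """
    if len(readings) < consecutive:
        return False
    return all(value > threshold for value in readings[-consecutive:])
```

For example, `persistent_breach([17.0, 26.0], 25.0)` is false because only one reading is above the line, whereas `persistent_breach([17.0, 26.0, 27.5], 25.0)` is true.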
The pull is project-level by default. Per-work-item rows are loaded only when one of the project-level numbers trips the thresholds in the next section. This ordering keeps the weekly review cheap and reserves the per-WI drill-down for cases where it actually pays off.
What to look at
Read the four metrics together rather than one at a time. automation_ratio is the adoption signal; the other three are the execution-quality signals. The lookup table below is the working "metric → threshold → action" reference for the weekly review:
| Metric | Healthy range (project-level) | Trigger to drill down | First action when triggered |
|---|---|---|---|
| automation_ratio | Trending up week-over-week toward the engagement target | Flat for 2+ weeks while sessions rise | Coaching: re-walk the team through the Onboard Cookbook. |
| manual_intervention | Near or below the 17.3 % macro-average baseline | Crosses 25 %, or rises 5+ points week-over-week | Spot-check the top intervention-heavy WIs (see drill-down section below). |
| verification_pass_rate | Stable; trending up as the team learns the skills | Drops 10+ points week-over-week | Inspect the failing gates; classify as skill issue vs. scope issue. |
| code_churn_ratio | Steady; no systematic outliers | Top-decile WIs running well above project median | Read those WIs against the cookbook; check for missing scope-out. |
Thresholds above are starting points the Engagement Lead tightens per engagement after the first two milestones.
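The three project-level triggers in the table can be checked mechanically from the weekly snapshots. The sketch below is one possible encoding, under stated assumptions: "flat" is read as a spread of at most one percentage point across three readings, "crosses 25 %" as strictly above 25, and the snapshot keys (including `sessions`) are illustrative names. The code_churn_ratio trigger needs per-WI rows and is left out.

```python
def drill_down_triggers(history, flat_tolerance=1.0):
    """Return the metrics whose drill-down trigger fired in the latest week.

    history is a list of weekly snapshots, oldest first. Each snapshot is a dict
    with automation_ratio, manual_intervention, verification_pass_rate (0-100)
    and sessions (a count).
    """
    triggers = []
    if len(history) < 2:
        return triggers  # nothing to compare against yet
    prev, curr = history[-2], history[-1]

    # Crosses 25 %, or rises 5+ points week-over-week.
    rise = curr["manual_intervention"] - prev["manual_intervention"]
    if curr["manual_intervention"] > 25.0 or rise >= 5.0:
        triggers.append("manual_intervention")

    # Drops 10+ points week-over-week.
    if prev["verification_pass_rate"] - curr["verification_pass_rate"] >= 10.0:
        triggers.append("verification_pass_rate")

    # Flat for 2+ weeks while sessions rise: needs three readings.
    if len(history) >= 3:
        window = history[-3:]
        ratios = [w["automation_ratio"] for w in window]
        sessions = [w["sessions"] for w in window]
        flat = max(ratios) - min(ratios) <= flat_tolerance
        rising = sessions[0] < sessions[1] < sessions[2]
        if flat and rising:
            triggers.append("automation_ratio")
    return triggers
```

The constants are written inline so that tightening a threshold after the first two milestones is a one-line change per engagement.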
When to drill down
Drill from project-level to per-work-item when a metric moves materially against its own trend or against the established baseline. Start with the top intervention-heavy work items — observed pilot data shows the highest single-WI intervention rate at 33.3 %, with several others in the high 20s, and those outliers are where actionable causes live.
Before assuming a platform issue, run the spot-check list against the offending work item:
- Description quality. Is the WI scored Excellent (7-9) on the Work-Item Composition rubric, or is it sitting in the Poor bucket? A Poor WI explains intervention without further investigation.
- Scope-out clause. Does the description say what is out of scope ("X is out of scope; use mocked data")? Missing scope-out is the highest-leverage authoring fix.
- Sibling reference. Does the WI name a sibling page or component ("Use approach and patterns of <Sibling>")? Missing siblings produce churn.
- Routing correctness. Defects routed through autonomous triage rather than the analyst? Backend "Create API" routed to backend-analyst, not backend-architect?
- Re-creation pattern. Is this an "attempt-2"/"attempt-3" respawn of a prior WI? If so, the description was amended-by-spawning instead of amended-in-place — the analytics will look bad even when the platform is fine.
Only after that checklist comes up clean is escalation to Swifter the right move.
Milestone roll-ups
At each delivery milestone the Engagement Lead rolls weekly snapshots into a project-level summary for client-facing stakeholders. The roll-up has two jobs: it replaces ad-hoc status reporting, and it records the engagement's trajectory so the next milestone starts with a baseline rather than a blank page.
The format is deliberately minimal — three sections, no decoration:
- Headline numbers. Project-level automation_ratio, manual_intervention, verification_pass_rate, code_churn_ratio for the milestone window, with prior-milestone comparison.
- Top intervention-heavy WIs. Up to five, with one-line cause classification (description / routing / platform / scope creep).
- Decisions taken. Coaching sessions held, platform escalations raised, WIs re-split, cookbook updates pushed.
The roll-up is the only artefact the Engagement Lead shares outside the delivery team unless a specific question requires a deeper cut.
How these metrics combine
The four metrics are not independent; their joint pattern is what tells the Engagement Lead what to do. The decision table below is the working playbook:
| Pattern | Likely cause | Recommended action |
|---|---|---|
| automation_ratio rising, others stable | Healthy onboarding | Stay the course; log the trajectory. |
| automation_ratio flat, sessions up | Team using Swifter for analysis only, hand-editing the code | Coaching: re-walk the team through the Onboard Cookbook. |
| manual_intervention rising, description scores stable | Platform bottleneck (skill regression, model drift, infra) | Escalate to Swifter with the top intervention-heavy WIs as evidence. |
| manual_intervention rising, description scores declining | Team-authoring drift | Book a cookbook refresher with the team; focus on scope-out and sibling references. |
| verification_pass_rate low, code_churn_ratio high | Scope creep inside the WI | Re-split the WI before re-running the agent. |
| verification_pass_rate low, code_churn_ratio normal | Skill-level quality issue (convention drift, missing post-generation review) | Skill-engineering fix; spot-check sessions for recurring failure mode. |
| Description quality high, manual_intervention still high | Skill bottleneck, not description bottleneck | Invest in the underlying skill; do not waste effort re-writing already-good WIs. |
These rules apply at project level. Per-WI signals are noisier and should be treated as case studies that inform the project-level decision, not as actionable in isolation.
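The decision table reads naturally as a lookup from observed signals to a cause and an action. The sketch below encodes it that way. The signal labels (`automation_rising`, `churn_high`, and so on) are hypothetical names invented for this example; deciding whether a signal is present remains the Engagement Lead's judgement.

```python
# Each rule: (signals that must all be observed, likely cause, recommended action).
RULES = [
    ({"automation_rising", "others_stable"},
     "Healthy onboarding",
     "Stay the course; log the trajectory."),
    ({"automation_flat", "sessions_up"},
     "Team using Swifter for analysis only, hand-editing the code",
     "Coaching: re-walk the team through the Onboard Cookbook."),
    ({"intervention_rising", "description_stable"},
     "Platform bottleneck (skill regression, model drift, infra)",
     "Escalate to Swifter with the top intervention-heavy WIs as evidence."),
    ({"intervention_rising", "description_declining"},
     "Team-authoring drift",
     "Book a cookbook refresher with the team; focus on scope-out and sibling references."),
    ({"verification_low", "churn_high"},
     "Scope creep inside the WI",
     "Re-split the WI before re-running the agent."),
    ({"verification_low", "churn_normal"},
     "Skill-level quality issue (convention drift, missing post-generation review)",
     "Skill-engineering fix; spot-check sessions for recurring failure mode."),
    ({"description_high", "intervention_high"},
     "Skill bottleneck, not description bottleneck",
     "Invest in the underlying skill; do not waste effort re-writing already-good WIs."),
]


def diagnose(signals):
    """Return every (cause, action) pair whose required signals are all observed."""
    observed = set(signals)
    return [(cause, action) for required, cause, action in RULES
            if required <= observed]
```

More than one rule can match in the same week, which is why the function returns a list rather than a single verdict.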
Baseline metrics — captured in Explore, compared in delivery
The four platform metrics above tell the EL how delivery is running inside Swifter. A second, smaller set of baseline business metrics is captured during the Explore Interview (Block 1) and compared against actuals at the 2-week post-onboarding check-in. The point is to anchor success in the team's own numbers rather than a generic claim.
| Baseline metric | Captured at | Compared against |
|---|---|---|
| Time from work-item creation to PR merged | Explore Block 1 (rough estimate) | Post-onboarding observation at the 2-week check-in |
| PR review cycles per feature | Explore Block 1 (rough estimate) | Post-onboarding observation at the 2-week check-in |
| Sprint capacity consumed by rework | Explore Block 1 (rough estimate) | Post-onboarding observation at the 2-week check-in |
Outcome vocabulary for the EL. When asked "what results do teams see?", the answer is "we measure it together — we establish your baseline at Explore and compare at the 2-week mark. We don't quote percentages before we have your numbers." This framing is deliberate. Pre-engagement quoting of generic percentages is the leading cause of misaligned expectations after the first demo.
What is not measured today
The analytics pipeline does not currently surface several metrics an Engagement Lead might reasonably ask for. Some are derivable from existing fields; others live in operational telemetry on a different stack. Be explicit with stakeholders about the gap rather than approximating:
- Time-to-PR: wall-clock from work item open to PR merge. Derivable from createdAt + PR close date, but not exposed as a column.
- Cost per work item: LLM and sandbox spend. Lives in operational cost-tracking, not the analytics pipeline.
- Acceptance-test pass rate: test runtime sits in a separate Swifter module; integration with stats_calculation is not present.
- Cycle-time distribution / first-time-DONE rate: derivable from work-item status history, not surfaced as a column.
- Adoption funnel (organisations → projects → users → sessions): operational metric, lives in a different observability stack.
Treat the four headline metrics as the engagement-level signal; everything in the list above is a separate request to Swifter or a one-off analysis, not a promise the dashboard fulfils today.
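As an example of what such a one-off analysis looks like, time-to-PR can be derived from the two timestamps mentioned in the list. This is a sketch that assumes both values are available as ISO 8601 strings with explicit UTC offsets; `pr_closed_at` is a hypothetical field name standing in for the PR close date.

```python
from datetime import datetime


def time_to_pr_hours(created_at: str, pr_closed_at: str) -> float:
    """Wall-clock hours between work-item creation and PR close.

    Both arguments are ISO 8601 strings, e.g. "2024-03-01T09:00:00+00:00".
    """
    start = datetime.fromisoformat(created_at)
    end = datetime.fromisoformat(pr_closed_at)
    if end < start:
        raise ValueError("PR close date precedes work-item creation")
    return (end - start).total_seconds() / 3600.0
```

A work item created at 09:00 on one day and closed at 15:30 the next yields 30.5 hours, weekends and idle time included, so the result is an elapsed-time figure rather than an effort figure.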
How this page relates to the rest of the Cookbook
The Monitoring loop hands off in one direction. Persistently low verification_pass_rate, or rising manual_intervention with stable description quality, is a signal to revisit the rest of the Cookbook: the team may need a refresher on work-item authoring (see Work-Item Composition) or on routing (see Technical Execution). The Dashboard interface surfaces these metrics; the operating definitions live here.