
Monitoring

Purpose: Give the Engagement Lead a small, stable set of efficiency metrics — and a cadence for reviewing them — to track Swifter adoption and delivery health once the team is delivering real work through the platform.

Scope: ongoing delivery only

Monitoring attaches to the delivery part of the engagement — once Onboard is closed and the team is running real work items through Swifter on their own. It does not run during Explore, Setup, or Onboard. The reason is simple: the metrics measure what happens when real work items flow through agents, and those workflows are only active once delivery is under way. Outside delivery there is no signal worth reviewing.

The cadence and the metrics described below assume the team is actively delivering through the platform. If the team pauses delivery for any reason, Monitoring is paused, not run on stale data.

Why monitor

A Swifter engagement is judged on two outcomes: the team actually uses the agent (adoption), and the agent produces clean work without constant steering (efficiency). Without instrumentation, both questions get answered by anecdote — usually too late to course-correct. Monitoring pins the Engagement Lead to four headline metrics that the analytics pipeline already computes per work item, and to a weekly review cadence that turns those numbers into delivery decisions rather than dashboard wallpaper.

The four metrics are deliberately narrow. automation_ratio answers "is delivery actually agent-driven?". manual_intervention, verification_pass_rate, and code_churn_ratio answer "is the agent's output any good?". Other questions an Engagement Lead might reasonably ask — time-to-PR, cost per work item, acceptance-test pass rate, cycle-time, the adoption funnel — are not surfaced by the analytics pipeline today; some are derivable from raw fields, others live in operational telemetry. The What is not measured today section is explicit about that gap so the Engagement Lead does not promise stakeholders numbers the platform cannot produce.

How the four metrics fit together

| Question the EL asks | Metric | Section below |
| --- | --- | --- |
| How much of delivery is genuinely agent-driven? | automation_ratio | Swifter Adoption |
| How much human steering did each work item need? | manual_intervention | Manual intervention |
| How clean was the agent's first-try output? | verification_pass_rate | Verification pass rate |
| Were code changes proportionate to scope? | code_churn_ratio | Code churn ratio |

All four are stored 0-100 % in the per-work-item extended view. The upstream aggregated view stores some of the same rates as 0-1 ratios; consumers must not mix scales when quoting numbers. Averages exclude work items with no linked PR, so trend lines lag in-flight work.

Swifter Adoption

| Metric | Range | What it measures | Source | Defined in |
| --- | --- | --- | --- | --- |
| automation_ratio | 0-100 % | Share of a PR's code authored by the agent (vs. by a human); 100 = fully agent-written, 0 = fully human. | Linked PR's agent_ratio (commit-level authored_by_agent flag aggregated to PR, then mapped to work item via prs_id_vals → linked_pr_id). | swifter-ai-data-analytics: models/dataframe_models/swifter_data.py:44; pipelines/stats_calculation/nodes.py:19 |

What automation_ratio measures

automation_ratio is the per-work-item percentage of PR commits authored by the Swifter agent. It is the direct adoption signal: opening sessions, drafting work items, or even reading the dashboard do not count — only commits whose author is the agent, merged into a PR linked to the work item, move the number. That distinction matters because every other "adoption" proxy (logins, sessions per week, work items opened) over-reports usage in the early weeks of an engagement, when the team is exploring rather than delivering.

The metric is per work item by design. Each work item carries a list of linked PRs (prs_id_vals), and analytics resolves that list to the PR-level agent_ratio, then surfaces the value back on the work item row. The Engagement Lead then aggregates those rows to phase, project, or org level depending on the audience. Per-work-item values are noisy and best used for spot-checks; project-level rolling averages are the number that goes on the dashboard.
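As a rough illustration of the rolling average mentioned above, the sketch below smooths a series of weekly project-level values with a trailing window. The function name and the window size are assumptions made for this example; the dashboard's actual smoothing is not specified on this page.

```python
def rolling_average(weekly_values, window=4):
    """Trailing mean over the last `window` weekly readings (hypothetical helper).

    Early weeks use however many readings exist so far.
    """
    smoothed = []
    for i in range(len(weekly_values)):
        chunk = weekly_values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Illustrative weekly project-level automation_ratio values (0-100 scale), not real data.
weekly = [20.0, 40.0, 60.0, 80.0]
trend = rolling_average(weekly, window=2)
print(trend)
```

A short window reacts quickly but keeps more of the per-week noise; a longer one is steadier but lags real changes.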

How the metric is computed

The analytics Kedro pipeline runs four stages in series:

  1. extract_data — pulls raw Swifter (work items, work sessions) and Git (PRs, commits, modified files) events.
  2. preprocess_swifter_data — normalises work-item and session shapes into a typed SwifterWorkItemsAggregated frame.
  3. preprocess_git_data — normalises PR, commit, and file-modification shapes, and computes the per-PR agent_ratio by aggregating the commit-level authored_by_agent boolean.
  4. stats_calculation — joins work items to their linked PR, copies agent_ratio onto the work item, scales 0-1 → 0-100, and emits the extended frame the dashboard consumes.
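The per-PR aggregation in stage 3 can be pictured with a minimal sketch. It assumes agent_ratio is the unweighted share of commits flagged authored_by_agent; the real pipeline may weight commits differently (for example by lines changed), and the Commit type and function name here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    pr_id: str
    authored_by_agent: bool


def agent_ratio_by_pr(commits):
    """Return {pr_id: share of commits flagged authored_by_agent} on a 0-1 scale.

    Assumption: every commit counts equally.
    """
    totals = {}
    agent_counts = {}
    for commit in commits:
        totals[commit.pr_id] = totals.get(commit.pr_id, 0) + 1
        agent_counts[commit.pr_id] = agent_counts.get(commit.pr_id, 0) + int(commit.authored_by_agent)
    return {pr_id: agent_counts[pr_id] / totals[pr_id] for pr_id in totals}


# Illustrative commits, not real data.
commits = [
    Commit("pr-1", True),
    Commit("pr-1", True),
    Commit("pr-1", False),
    Commit("pr-1", True),
    Commit("pr-2", False),
]
ratios = agent_ratio_by_pr(commits)
print(ratios)
```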

Two rules in that flow shape every chart downstream. First, the join key is linked_pr_id derived from prs_id_vals; if a work item has no PR yet (linked_pr_id == '') it is excluded from averages — meaning in-progress work is silently dropped and merged work is over-represented. Second, the value is stored 0-100 in the extended frame but 0-1 in upstream aggregations; never mix scales when quoting figures across pages or dashboards.
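Both rules can be made concrete with a small sketch. It assumes work items arrive as dictionaries that already carry linked_pr_id, and that per-PR agent_ratio values are on the 0-1 scale; the function names are hypothetical and the real stats_calculation node may be structured differently.

```python
def attach_automation_ratio(work_items, pr_agent_ratio):
    """Copy the linked PR's agent_ratio onto each work item, scaled 0-1 -> 0-100.

    Work items with no linked PR (linked_pr_id == '') get None.
    """
    rows = []
    for wi in work_items:
        ratio = pr_agent_ratio.get(wi["linked_pr_id"]) if wi["linked_pr_id"] != "" else None
        rows.append({**wi, "automation_ratio": None if ratio is None else ratio * 100})
    return rows


def project_average(rows):
    """Average automation_ratio over rows that have a value; no-PR rows are skipped."""
    values = [r["automation_ratio"] for r in rows if r["automation_ratio"] is not None]
    return sum(values) / len(values) if values else None


# Illustrative inputs, not real data.
work_items = [
    {"id": "WI-1", "linked_pr_id": "pr-1"},
    {"id": "WI-2", "linked_pr_id": ""},  # still in progress, no PR yet
    {"id": "WI-3", "linked_pr_id": "pr-2"},
]
pr_agent_ratio = {"pr-1": 0.75, "pr-2": 0.25}

rows = attach_automation_ratio(work_items, pr_agent_ratio)
average = project_average(rows)
print(average)
```

In this example WI-2 does not pull the average down; it simply disappears from it, which is why project-level numbers describe finished work only.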

How to read it during an engagement

Read automation_ratio as a trend, not a snapshot. Early-delivery values are dominated by exploration commits, partial-team adoption, and a sparse PR set, so a single-week reading carries little information on its own. The signal is direction: is the project-level rolling average climbing as the team gets through the Cookbook material, or is it stuck while session counts rise? The latter pattern (sessions up, automation flat) usually means the team is using Swifter for analysis and editing by hand, which is a coaching opportunity rather than a platform issue.

Indicative trajectories:

| Engagement shape | Week 1–2 | Week 4–6 | Steady state |
| --- | --- | --- | --- |
| Swifter-led delivery (Swifter team writes the software on the platform) | 60–80 % | 75–90 % | Stable in 80–95 % band |
| Customer-led delivery (client developers own delivery on the platform) | 20–40 % | 40–60 % | Rising; 60 %+ once the team is independent |
| Swifter delivery-centre team (a dedicated Swifter delivery team owns delivery) | 50–70 % | 70–85 % | Stable in 80 %+ band |

These are reference shapes from observed delivery, not hard targets. The Engagement Lead should set explicit per-engagement targets at kick-off and revisit them at each milestone.

Caveats and exclusions

  • No-PR work items are excluded. Work items still in spec, in QA, or abandoned mid-session never enter the average. Project-level numbers will over-represent finished work and under-represent backlog drag.
  • Scale mismatch. Extended frame is 0-100, aggregated frame stores agent_ratio and adjacent rates as 0-1. Always note which frame a number came from.
  • Commit attribution depends on Git identity. If a developer commits agent-generated code under their own identity, authored_by_agent is false and the work item looks human-driven. The metric measures attribution, not authorship intent.
  • PR-level granularity. A WI linked to multiple PRs takes the join's resolved agent_ratio; mixed-authorship PRs are blended at the PR level before the WI sees them.

Manual intervention

| Metric | Range | What it measures | Source data | Defined in |
| --- | --- | --- | --- | --- |
| manual_intervention | 0-100 % | Share of user messages in a work session that are "extra prompts": corrections, clarifications, manual nudges beyond the initial WI description. | user_extra_prompts_count / user_message_count from SwifterWorkItemsAggregated. | models/dataframe_models/swifter_data.py:46; pipelines/stats_calculation/nodes.py:21 |

manual_intervention is the headline "how much human nudging did this work item need?" metric. The denominator is total user messages across the work item's sessions; the numerator is the subset classified as extra prompts — corrections, clarifications, do-overs, and manual nudges beyond the initial description. High values mean the team had to steer the agent message by message; low values mean the work item description carried the agent to a clean result.

The reference baseline from a recent production pilot is a 17.3 % macro-average across 53 work items, or 21.3 % when weighted by message volume (696 of 3,261 user messages). The most intervention-heavy items in that dataset ran at about 33 % (the highest reached 33.3 %), and the cluster of QA acceptance-criteria items sat in the high twenties despite scoring 8-9 of 9 on description quality. That is strong evidence that, above a certain description-quality floor, residual intervention is a platform or skill issue rather than a description issue.
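The gap between the macro-average and the message-weighted figure comes from how the two are computed, which a short sketch makes visible. The per-item counts below are invented for illustration; only the pilot totals (696 of 3,261) come from the text above, and the function names are hypothetical.

```python
def macro_average(items):
    """Mean of per-work-item rates: every work item counts equally."""
    rates = [100 * extra / total for extra, total in items if total > 0]
    return sum(rates) / len(rates) if rates else None


def weighted_average(items):
    """Pooled rate: every user message counts equally, so chatty work items dominate."""
    extra_sum = sum(extra for extra, _ in items)
    total_sum = sum(total for _, total in items)
    return 100 * extra_sum / total_sum if total_sum > 0 else None


# (user_extra_prompts_count, user_message_count) per work item; illustrative only.
items = [(1, 10), (30, 90)]
macro = macro_average(items)        # (10.0 + 33.3...) / 2
weighted = weighted_average(items)  # 31 extra prompts out of 100 messages

# The pilot's weighted figure follows directly from the totals quoted above.
pilot_weighted = round(100 * 696 / 3261, 1)
print(round(macro, 1), weighted, pilot_weighted)
```

When the weighted figure sits above the macro-average, as in the pilot, the longer sessions are the ones carrying more extra prompts.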

Read the metric two ways. As a project trend, rising intervention is the earliest visible signal of either a platform regression or a team-skill drift. As a per-work-item flag, top-of-distribution values are the right place to start a spot-check before escalating anything. The interpretation playbook is in How these metrics combine.

Verification pass rate

| Metric | Range | What it measures | Source data | Defined in |
| --- | --- | --- | --- | --- |
| verification_pass_rate | 0-100 % | Share of verification gates that passed in the work session on first try. | Aggregated from session-level verification events; stored 0-1 at aggregation, scaled to 0-100 at extension. | models/dataframe_models/swifter_data.py:33,47; pipelines/stats_calculation/nodes.py:22 |

verification_pass_rate measures how clean the agent's output is on the first run, before human escalation or auto-fix loops. A verification gate is any session-level check the platform runs against the agent's output — type checks, build gates, lint passes, the skill's own self-verification step. A pass on first try means the gate cleared without re-prompting; a fail means at least one retry was needed.

Two things to remember when quoting the number. First, it is stored 0-1 at the aggregation stage and scaled to 0-100 in the extended frame — never mix the scales when comparing values across pages. Second, the metric is silent about gates that never ran: a work item where no verification fired will show a clean rate but no quality signal. Treat verification_event_number as the denominator-of-record when judging whether a verification_pass_rate reading is meaningful.
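One way to apply both reminders is to refuse to report a rate when no gates ran. The sketch below assumes the aggregated value is on the 0-1 scale and that verification_event_number is available alongside it; the function name and the min_events parameter are hypothetical.

```python
def verification_reading(pass_rate_0_1, verification_event_number, min_events=1):
    """Return the pass rate on the 0-100 scale, or None when too few gates ran.

    min_events is an illustrative threshold, not a platform setting.
    """
    if verification_event_number < min_events:
        return None  # no quality signal: do not report a "clean" rate
    return pass_rate_0_1 * 100


no_gates = verification_reading(1.0, 0)     # clean-looking rate, but nothing ran
four_gates = verification_reading(0.5, 4)   # two of four gates passed first try
print(no_gates, four_gates)
```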

In practice, low values cluster around two patterns: skills that emit code without re-reading their own framework knowledge (a common cause of convention-violation failures), and work items where description scope expanded silently between spec and implementation, forcing re-runs. The first is a skill-engineering fix; the second shows up jointly with high code_churn_ratio and is a scoping problem.

Code churn ratio

| Metric | Range | What it measures | Source data | Defined in |
| --- | --- | --- | --- | --- |
| code_churn_ratio | 0-100 % | Code churn associated with the PR: proportion of changes vs. scope. | Linked PR's code_churn_ratio from Git data (insertions / deletions / files / lines). | models/dataframe_models/swifter_data.py:45; pipelines/stats_calculation/nodes.py:20 |

code_churn_ratio flags PRs where change volume looks disproportionate to the work item's stated scope. The aggregation pulls insertions, deletions, files, and total lines from the PR's commits and resolves them to a 0-1 ratio at the PR level, which the extended frame scales to 0-100. The precise formula is computed inside the preprocess stage and is not exposed at the metric's declaration site, so quote it as a relative indicator rather than an absolute physical quantity.

Read it alongside verification_pass_rate and the description-quality score. High churn on a narrowly scoped WI usually means one of three things: the WI implicitly grew (the description said one screen, the PR touched five), the agent rewrote the same artefact multiple times in-session, or a sibling reference dragged in unintended files. None of those are fixed by re-running the agent; they are fixed by re-splitting the WI or tightening the scope-out clause.

Weekly review cadence

The Engagement Lead pulls the four headline metrics from the analytics dashboard once per week and compares them to the prior week's snapshot. The discipline is trend-watching, not point-in-time reading: single-week spikes are noise unless they persist for two consecutive readings or coincide with a known platform or team change. The whole review is meant to take under thirty minutes — pull, compare, decide, log — and is on the Engagement Lead's calendar as a recurring slot, not an ad-hoc task.

The pull is project-level by default. Per-work-item rows are loaded only when one of the project-level numbers trips the thresholds in the next section. This ordering keeps the weekly review cheap and reserves the per-WI drill-down for cases where it actually pays off.

What to look at

Read the four metrics together rather than one at a time. automation_ratio is the adoption signal; the other three are the execution-quality signals. The lookup table below is the working "metric → threshold → action" reference for the weekly review:

| Metric | Healthy range (project-level) | Trigger to drill down | First action when triggered |
| --- | --- | --- | --- |
| automation_ratio | Trending up week-over-week toward the engagement target | Flat for 2+ weeks while sessions rise | Coaching: re-walk the team through the Onboard Cookbook. |
| manual_intervention | Near or below the 17.3 % macro-average baseline | Crosses 25 %, or rises 5+ points week-over-week | Spot-check the top intervention-heavy WIs (see drill-down section below). |
| verification_pass_rate | Stable; trending up as the team learns the skills | Drops 10+ points week-over-week | Inspect the failing gates; classify as skill issue vs. scope issue. |
| code_churn_ratio | Steady; no systematic outliers | Top-decile WIs running well above project median | Read those WIs against the cookbook; check for missing scope-out. |

Thresholds above are starting points the Engagement Lead tightens per engagement after the first two milestones.
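The first three triggers in the table can be checked mechanically from the weekly snapshots. The sketch below is one possible encoding under stated assumptions: snapshots are dictionaries on the 0-100 scale with a sessions count, "flat" means week-over-week moves within a tolerance (flat_tolerance is invented here), and the code_churn_ratio trigger is left out because it needs per-WI rows.

```python
def weekly_triggers(history, flat_tolerance=1.0):
    """Return the metrics whose drill-down trigger fired.

    history: weekly project-level snapshots, oldest first.
    flat_tolerance: hypothetical definition of "flat", in percentage points.
    """
    triggers = []
    if len(history) < 2:
        return triggers
    prev, curr = history[-2], history[-1]

    # Crosses 25 %, or rises 5+ points week-over-week.
    if curr["manual_intervention"] > 25 or curr["manual_intervention"] - prev["manual_intervention"] >= 5:
        triggers.append("manual_intervention")

    # Drops 10+ points week-over-week.
    if prev["verification_pass_rate"] - curr["verification_pass_rate"] >= 10:
        triggers.append("verification_pass_rate")

    # Flat for 2+ weeks while sessions rise (needs three snapshots).
    if len(history) >= 3:
        last3 = history[-3:]
        flat = all(
            abs(later["automation_ratio"] - earlier["automation_ratio"]) <= flat_tolerance
            for earlier, later in zip(last3, last3[1:])
        )
        if flat and last3[-1]["sessions"] > last3[0]["sessions"]:
            triggers.append("automation_ratio")
    return triggers


# Illustrative snapshots, not real data.
history = [
    {"automation_ratio": 42.0, "manual_intervention": 16.0, "verification_pass_rate": 80.0, "sessions": 30},
    {"automation_ratio": 42.5, "manual_intervention": 18.0, "verification_pass_rate": 78.0, "sessions": 38},
    {"automation_ratio": 42.2, "manual_intervention": 24.0, "verification_pass_rate": 66.0, "sessions": 45},
]
fired = weekly_triggers(history)
print(fired)
```

A fired trigger is a prompt to load per-work-item rows, not a conclusion; the spot-check still decides what the cause is.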

When to drill down

Drill from project-level to per-work-item when a metric moves materially against its own trend or against the established baseline. Start with the top intervention-heavy work items — observed pilot data shows the highest single-WI intervention rate at 33.3 %, with several others in the high 20s, and those outliers are where actionable causes live.

Before assuming a platform issue, run the spot-check list against the offending work item:

  • Description quality. Is the WI scored Excellent (7-9) on the Work-Item Composition rubric, or is it sitting in the Poor bucket? A Poor WI explains intervention without further investigation.
  • Scope-out clause. Does the description say what is out of scope ("X is out of scope; use mocked data")? Missing scope-out is the highest-leverage authoring fix.
  • Sibling reference. Does the WI name a sibling page or component (Use approach and patterns of <Sibling>)? Missing siblings produce churn.
  • Routing correctness. Were defects routed through autonomous triage rather than the analyst? Was a backend "Create API" item routed to backend-analyst, not backend-architect?
  • Re-creation pattern. Is this an "attempt-2"/"attempt-3" respawn of a prior WI? If so, the description was amended-by-spawning instead of amended-in-place — the analytics will look bad even when the platform is fine.

Only after that checklist comes up clean is escalation to Swifter the right move.

Milestone roll-ups

At each delivery milestone the Engagement Lead rolls weekly snapshots into a project-level summary for client-facing stakeholders. The roll-up has two jobs: it replaces ad-hoc status reporting, and it records the engagement's trajectory so the next milestone starts with a baseline rather than a blank page.

The format is deliberately minimal — three sections, no decoration:

  1. Headline numbers. Project-level automation_ratio, manual_intervention, verification_pass_rate, code_churn_ratio for the milestone window, with prior-milestone comparison.
  2. Top intervention-heavy WIs. Up to five, with one-line cause classification (description / routing / platform / scope creep).
  3. Decisions taken. Coaching sessions held, platform escalations raised, WIs re-split, cookbook updates pushed.

The roll-up is the only artefact the Engagement Lead shares outside the delivery team unless a specific question requires a deeper cut.

How these metrics combine

The four metrics are not independent; their joint pattern is what tells the Engagement Lead what to do. The decision table below is the working playbook:

| Pattern | Likely cause | Recommended action |
| --- | --- | --- |
| automation_ratio rising, others stable | Healthy onboarding | Stay the course; log the trajectory. |
| automation_ratio flat, sessions up | Team using Swifter for analysis only, hand-editing the code | Coaching: re-walk the team through the Onboard Cookbook. |
| manual_intervention rising, description scores stable | Platform bottleneck (skill regression, model drift, infra) | Escalate to Swifter with the top intervention-heavy WIs as evidence. |
| manual_intervention rising, description scores declining | Team-authoring drift | Book a cookbook refresher with the team; focus on scope-out and sibling references. |
| verification_pass_rate low, code_churn_ratio high | Scope creep inside the WI | Re-split the WI before re-running the agent. |
| verification_pass_rate low, code_churn_ratio normal | Skill-level quality issue (convention drift, missing post-generation review) | Skill-engineering fix; spot-check sessions for recurring failure mode. |
| Description quality high, manual_intervention still high | Skill bottleneck, not description bottleneck | Invest in the underlying skill; do not waste effort re-writing already-good WIs. |

These rules apply at project level. Per-WI signals are noisier and should be treated as case studies that inform the project-level decision, not as actionable in isolation.
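For teams that log the weekly decision in a script or notebook, the playbook can be encoded as a plain lookup. This is only a sketch: the signal labels ("rising", "flat", "low" and so on) are invented for the example, and deciding which label a project-level reading deserves remains the Engagement Lead's judgement.

```python
# Each entry: (pattern of signal labels, likely cause, recommended action).
PLAYBOOK = [
    ({"automation_ratio": "rising", "manual_intervention": "stable",
      "verification_pass_rate": "stable", "code_churn_ratio": "stable"},
     "Healthy onboarding", "Stay the course; log the trajectory."),
    ({"automation_ratio": "flat", "sessions": "up"},
     "Team using Swifter for analysis only, hand-editing the code",
     "Coaching: re-walk the team through the Onboard Cookbook."),
    ({"manual_intervention": "rising", "description_scores": "stable"},
     "Platform bottleneck",
     "Escalate to Swifter with the top intervention-heavy WIs as evidence."),
    ({"manual_intervention": "rising", "description_scores": "declining"},
     "Team-authoring drift",
     "Book a cookbook refresher with the team; focus on scope-out and sibling references."),
    ({"verification_pass_rate": "low", "code_churn_ratio": "high"},
     "Scope creep inside the WI", "Re-split the WI before re-running the agent."),
    ({"verification_pass_rate": "low", "code_churn_ratio": "normal"},
     "Skill-level quality issue",
     "Skill-engineering fix; spot-check sessions for recurring failure mode."),
    ({"description_quality": "high", "manual_intervention": "high"},
     "Skill bottleneck, not description bottleneck",
     "Invest in the underlying skill; do not waste effort re-writing already-good WIs."),
]


def recommend(signals):
    """Return (cause, action) for every playbook pattern fully matched by `signals`."""
    return [
        (cause, action)
        for pattern, cause, action in PLAYBOOK
        if all(signals.get(name) == label for name, label in pattern.items())
    ]


matches = recommend({"verification_pass_rate": "low", "code_churn_ratio": "high"})
print(matches)
```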

Baseline metrics — captured in Explore, compared in delivery

The four platform metrics above tell the EL how delivery is running inside Swifter. A second, smaller set of baseline business metrics is captured during the Explore Interview (Block 1) and compared against actuals at the 2-week post-onboarding check-in. The point is to anchor success in the team's own numbers rather than a generic claim.

| Baseline metric | Captured at | Compared against |
| --- | --- | --- |
| Time from work-item creation to PR merged | Explore Block 1 (rough estimate) | Post-onboarding observation at the 2-week check-in |
| PR review cycles per feature | Explore Block 1 (rough estimate) | Post-onboarding observation at the 2-week check-in |
| Sprint capacity consumed by rework | Explore Block 1 (rough estimate) | Post-onboarding observation at the 2-week check-in |

Outcome vocabulary for the EL. When asked "what results do teams see?", the answer is "we measure it together — we establish your baseline at Explore and compare at the 2-week mark. We don't quote percentages before we have your numbers." This framing is deliberate. Pre-engagement quoting of generic percentages is the leading cause of misaligned expectations after the first demo.

What is not measured today

The analytics pipeline does not currently surface several metrics an Engagement Lead might reasonably ask for. Some are derivable from existing fields; others live in operational telemetry on a different stack. Be explicit with stakeholders about the gap rather than approximating:

  • Time-to-PR — wall-clock from work item open to PR merge. Derivable from createdAt + PR close date, but not exposed as a column.
  • Cost per work item — LLM and sandbox spend. Lives in operational cost-tracking, not the analytics pipeline.
  • Acceptance-test pass rate — test runtime sits in a separate Swifter module; integration with stats_calculation is not present.
  • Cycle-time distribution / first-time-DONE rate — derivable from work-item status history, not surfaced as a column.
  • Adoption funnel (organisations → projects → users → sessions) — operational metric, lives in a different observability stack.

Treat the four headline metrics as the engagement-level signal; everything in the list above is a separate request to Swifter or a one-off analysis, not a promise the dashboard fulfils today.
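As an example of what a one-off analysis for the first item in the list might look like, the sketch below derives time-to-PR from two timestamps. It assumes both values are ISO 8601 strings with an explicit UTC offset; createdAt is the field named above, while the PR close timestamp argument and the function name are hypothetical.

```python
from datetime import datetime


def time_to_pr_hours(created_at, pr_closed_at):
    """Wall-clock hours from work item creation to PR close (hypothetical helper).

    Both arguments are ISO 8601 strings with an explicit offset, e.g. '+00:00'.
    """
    start = datetime.fromisoformat(created_at)
    end = datetime.fromisoformat(pr_closed_at)
    return (end - start).total_seconds() / 3600


# Illustrative timestamps, not real data.
hours = time_to_pr_hours("2024-03-04T09:00:00+00:00", "2024-03-06T15:30:00+00:00")
print(hours)
```

Wall-clock time includes nights and weekends, so any comparison with the Explore baseline should use the same convention.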

How this page relates to the rest of the Cookbook

The Monitoring loop hands off in one direction. Persistently low verification_pass_rate, or rising manual_intervention alongside declining description scores, is a signal to revisit the rest of the Cookbook: the team may need a refresher on work-item authoring (see Work-Item Composition) or on routing (see Technical Execution). The Dashboard interface surfaces these metrics; the operating definitions live here.