Skip to content

Story: Review-E catches a bug

On 2026-04-21, Dev-E opened PR rig-conductor#162 — a change to the StaleHeartbeatService that added a configurable escalation path. The PR had green CI. It had the Closes #N link. It looked clean.

Review-E’s verdict: CHANGES_REQUESTED.

She was right.


Issue rig-conductor#158 asked: “Make the StaleHeartbeatService escalation threshold configurable per-agent. Some agents (iBuild-E, slow iOS builds) should have a longer threshold before being marked stuck.”

Dev-E’s implementation added a per_agent_thresholds dictionary to the service config:

public class HeartbeatConfig
{
public int DefaultThresholdMinutes { get; set; } = 30;
public Dictionary<string, int> PerAgentThresholds { get; set; } = new();
}

And the check:

private int GetThreshold(string agentId)
{
return _config.PerAgentThresholds.TryGetValue(agentId, out var t)
? t
: _config.DefaultThresholdMinutes;
}

On the surface: reasonable. The logic reads from config, falls back to default, threshold applied per-agent.


Review-E left three comments. The critical one:

[Line 47 — StaleHeartbeatService.cs]

GetThreshold returns the configured value directly without validating the range. A misconfigured PerAgentThresholds: { "dev-e": 0 } would cause every heartbeat check to immediately mark Dev-E as stuck (0-minute threshold). A negative value would never mark anything as stuck.

This is a config-poisoning footgun. Add a guard:

if (t <= 0) throw new InvalidOperationException(
$"HeartbeatConfig: threshold for {agentId} must be > 0, got {t}");

Or clamp silently with a warning log. Either is acceptable; unguarded is not.

The second comment:

[Line 83 — HeartbeatConfig.cs]

PerAgentThresholds is initialized as new() — but if deserialized from JSON/YAML config with no key present, it may arrive as null depending on the serializer settings. The TryGetValue call on line 47 will NullReferenceException. Add ?? new() at the call site or initialize in the setter.

The third comment:

[Missing test]

No test for the zero-threshold or negative-threshold cases. The config-poisoning scenario is high-severity (could mark all agents stuck on bad config deploy). Please add a unit test that verifies the guard throws on threshold: 0.


Dev-E addressed all three in a single follow-up commit:

private int GetThreshold(string agentId)
{
var thresholds = _config.PerAgentThresholds ?? new();
if (!thresholds.TryGetValue(agentId, out var t))
return _config.DefaultThresholdMinutes;
if (t <= 0)
throw new InvalidOperationException(
$"HeartbeatConfig: threshold for '{agentId}' must be > 0, got {t}. " +
$"Check appsettings.yaml PerAgentThresholds.");
return t;
}

And a test:

[Fact]
public void GetThreshold_ZeroValue_Throws()
{
var config = new HeartbeatConfig
{
PerAgentThresholds = new() { ["dev-e"] = 0 }
};
var svc = new StaleHeartbeatService(config, ...);
Assert.Throws<InvalidOperationException>(() => svc.GetThresholdPublic("dev-e"));
}

Review-E resolved all three threads and posted: APPROVED.


The bug Review-E caught was not a syntax error or a test failure — it was a semantic correctness problem that CI would never catch. A zero-threshold config value would have shipped to production and silently marked all agents stuck on the next deploy that included a misconfigured threshold.

This is the case for independent review: Dev-E’s implementation was internally consistent. She implemented what was asked for. Review-E’s job was to ask: “what happens when this is used wrong?” That’s a different question from “does this work?” and it requires a reader with no attachment to the implementation.

The rig’s separation — Dev-E writes, Review-E reviews — isn’t a formality. It’s the same reason human engineering teams have code review. The author is the worst reviewer of their own work. In the rig, that principle is architectural, not cultural.

Next: when the rig decommissions a site it built — and how teardown is as deliberate as bootstrap.