The Eval · v1 · June 2026

Take the model you already run.
Add Rocky.
It improved the answer on 27 of 30 detection-engineering questions.

Current. Cited. Grounded. We asked the same model 30 real questions with and without Rocky, then graded the answers blind. Here are the receipts, with the misses left in.

What we tested

30 real questions a detection engineer actually asks: live CVE triage, ATT&CK technique coverage, Sigma authoring, LOLBin abuse, credential access, persistence, cloud baselines, CI/CD, and EDR rule conversion.

Each question went to the same model twice, once with Rocky and once without. Models: Claude Haiku 4.5 and Claude Opus 4.8.

Graded blind by an LLM judge on four axes (accuracy, freshness, citation, and specificity), scored 0–20 per answer.
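One way a blind, four-axis grade like this can be wired up is sketched below. It is an illustration under assumptions, not the harness behind these numbers: `judge` is a hypothetical stand-in for the LLM call, and splitting the 20 points evenly (0 to 5 per axis) is an assumption this page does not state.

```python
import random
from typing import Callable, Dict, Tuple

AXES = ("accuracy", "freshness", "citation", "specificity")

# Hypothetical judge signature: (question, answer) -> one integer score per axis.
Judge = Callable[[str, str], Dict[str, int]]

def total(scores: Dict[str, int]) -> int:
    """Sum the four axis scores. Assumes 0-5 per axis, so 0-20 overall."""
    for axis in AXES:
        if not 0 <= scores[axis] <= 5:
            raise ValueError(f"{axis} out of range: {scores[axis]}")
    return sum(scores[axis] for axis in AXES)

def grade_blind(question: str, bare: str, with_rocky: str,
                judge: Judge, rng: random.Random) -> Tuple[int, int]:
    """Score both answers and return (bare_total, rocky_total)."""
    pair = [("bare", bare), ("rocky", with_rocky)]
    rng.shuffle(pair)  # random order, so position carries no cue
    # The condition label stays on this side; the judge only sees the text.
    results = {label: total(judge(question, answer)) for label, answer in pair}
    return results["bare"], results["rocky"]
```

The label is kept out of the judge call entirely, which is the property "blind" refers to here; the shuffle only removes ordering as a side channel.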

The set deliberately includes 6 questions we expected to lose, where freshness doesn't matter and the bigger model should win. The spread is the point: we didn't cherry-pick.

The lift

Adding Rocky improved the answer on 27 of 30 questions, for both models. The three small dips all fall on questions where the bare model was already strong, and each is smaller than the judge's noise floor. Hold the model constant and the gain is consistent:

Claude Haiku 4.5: 9.8 without Rocky, 15.1 with Rocky (+5.3)

Claude Opus 4.8: 14.7 without Rocky, 18.6 with Rocky (+3.9)

Scores are out of 20, averaged over 30 questions.
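The lift figures are simple differences of the two means per model. A minimal sketch of that arithmetic, using only the numbers quoted above (the dictionary layout and function name are illustrative):

```python
# Mean scores out of 20, copied from the figures above.
MEANS = {
    "Claude Haiku 4.5": {"bare": 9.8, "with_rocky": 15.1},
    "Claude Opus 4.8": {"bare": 14.7, "with_rocky": 18.6},
}

def lift(bare: float, with_rocky: float) -> float:
    """Same-model lift: mean with Rocky minus mean without, rounded to 0.1."""
    return round(with_rocky - bare, 1)

lifts = {model: lift(**scores) for model, scores in MEANS.items()}
for model, delta in lifts.items():
    print(f"{model}: +{delta}")
```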

Where the lift lands

Same-model lift by category. The win is carried by time-sensitive questions a frozen model can't answer, and stays positive everywhere else.

Live CVE / KEV triage (6 questions): +11.4

Behavioral triage (7 questions): +3.9

Technique → detection & authoring (8 questions): +2.8

Prioritization & tuning (2 questions): +3.5

Cloud, pipeline & EDR conversion (3 questions): +2.0

Credential access & lateral movement (4 questions): +1.8
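As a sanity check, the category figures can be recombined into an overall lift by weighting each one by its question count. The sketch below assumes the category lifts are pooled across both models, which this page does not state outright; under that assumption the weighted mean should sit close to the average of the two headline lifts.

```python
# (category, question count, same-model lift) from the breakdown above.
CATEGORIES = [
    ("Live CVE / KEV triage", 6, 11.4),
    ("Behavioral triage", 7, 3.9),
    ("Technique → detection & authoring", 8, 2.8),
    ("Prioritization & tuning", 2, 3.5),
    ("Cloud, pipeline & EDR conversion", 3, 2.0),
    ("Credential access & lateral movement", 4, 1.8),
]

total_questions = sum(n for _, n, _ in CATEGORIES)
total_points = sum(n * d for _, n, d in CATEGORIES)
weighted_lift = total_points / total_questions

# Assumption: pooled across both models, so compare with the mean headline lift.
headline_mean = (5.3 + 3.9) / 2

# Share of the total lift contributed by the live CVE / KEV category.
cve_share = (6 * 11.4) / total_points

print(f"{total_questions} questions, weighted lift {weighted_lift:.2f}, "
      f"headline mean {headline_mean:.2f}, CVE share {cve_share:.0%}")
```

The counts sum to 30 and the weighted lift lands within rounding of the headline average, with the live CVE / KEV questions contributing roughly half of the total.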

The cost angle

Pair a cheaper model you already run with Rocky and you land alongside the frontier on average, at a fraction of the token cost. Haiku + Rocky averaged 15.1 against bare Opus's 14.7. The profile behind that average differs, though: Haiku + Rocky dominates time-sensitive questions and trails slightly on static knowledge the bigger model already has.

~$0.006 per answer, Haiku + Rocky

~$0.027 per answer, bare Opus

Roughly a fifth of the cost. The extra input Rocky adds is query-specific retrieved context: the current, cited detection knowledge a frozen model can't have.
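The "roughly a fifth" figure follows directly from the two per-answer costs. A small sketch of that calculation, plus score per dollar as one illustrative way to put cost and quality on the same line (that metric is our addition here, not one the eval reports):

```python
# Approximate USD per answer and mean score out of 20, from the figures above.
COST_PER_ANSWER = {"Haiku + Rocky": 0.006, "bare Opus": 0.027}
MEAN_SCORE = {"Haiku + Rocky": 15.1, "bare Opus": 14.7}

cost_ratio = COST_PER_ANSWER["Haiku + Rocky"] / COST_PER_ANSWER["bare Opus"]

# Illustrative metric: mean score points bought per dollar spent.
points_per_dollar = {
    name: MEAN_SCORE[name] / COST_PER_ANSWER[name] for name in COST_PER_ANSWER
}

print(f"cost ratio: {cost_ratio:.0%}")
for name, value in points_per_dollar.items():
    print(f"{name}: {value:,.0f} points per dollar")
```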

Receipts

Four exchanges, same model (Haiku 4.5), with and without Rocky. These illustrate the 27-of-30 result; they aren't the dataset.

One of 6 live-CVE / KEV wins

“Is CVE-2026-20262 on CISA's known-exploited list? What is it, and what should I be hunting for?”

Without Rocky · 7/20

I don't have information about CVE-2026-20262 in my training data. The number suggests a 2026 date, beyond my knowledge cutoff. Check CISA's KEV catalog and the NVD directly, then hunt for the affected software and exploitation patterns once you confirm the details.

With Rocky · 18/20

Yes, CVE-2026-20262 is on CISA’s KEV catalog. It’s a directory/path-traversal vulnerability (CWE-22) in Cisco Catalyst SD-WAN Manager that lets an authenticated remote attacker create or overwrite arbitrary files. Added to KEV 2026-06-15; federal remediation due 2026-06-29. Hunt for unusual file writes from SD-WAN Manager processes, path-traversal sequences in HTTP requests to Manager instances, and processes spawned from unexpected locations.

cited: CISA KEV: CVE-2026-20262 (Cisco Catalyst SD-WAN Manager)
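The "path-traversal sequences in HTTP requests" hunt in that answer could start from something like the sketch below. It is a generic heuristic, not a signature for this CVE: the regex, the decode depth, and the example paths in the usage note are all illustrative assumptions.

```python
import re
from urllib.parse import unquote

# Matches ../ or ..\ ; a generic dot-dot-slash heuristic only.
TRAVERSAL = re.compile(r"\.\.[/\\]")

def has_traversal(request_path: str, max_decodes: int = 2) -> bool:
    """Return True if the raw or URL-decoded path contains a dot-dot-slash sequence."""
    candidate = request_path
    for _ in range(max_decodes + 1):
        if TRAVERSAL.search(candidate):
            return True
        decoded = unquote(candidate)
        if decoded == candidate:  # nothing left to decode
            break
        candidate = decoded
    return False
```

Run over web or proxy logs, `has_traversal("/upload?name=%2e%2e%2fetc%2fpasswd")` flags the request while a plain path such as `/status` does not. Decoding more than once matters because attackers often double-encode to slip past a single decode.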

A typical SOC triage question

“Is powershell.exe spawned by excel.exe suspicious?”

Without Rocky · 12/20

Context-dependent, but warrants investigation. There are legitimate reasons (macros, add-ins, automation) and red flags (hidden window, encoded commands, network callbacks). Examine the command line and user interaction to determine threat level.

With Rocky · 18/20

Yes, treat it as a high-confidence indicator of malicious activity. A SigmaHQ rule flags PowerShell invoked from script-engine parents, and Excel is a documented LOLBin. In the telemetry, excel.exe spawned powershell.exe only 8 times out of 9,261,865 powershell observations (0.00%), and it doesn’t appear among normal parents at all. Investigate the script content, command line, and the document’s source.

cited: Sigma: Suspicious PowerShell From Script Engines · LOLBAS: Excel.exe
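The prevalence figure in that answer is a plain ratio of parent-child observations to all observations of the child process. A minimal sketch using the two counts quoted above (the function name is illustrative) shows how small the number is before it gets rounded to two decimals:

```python
def parent_prevalence(pair_count: int, child_total: int) -> float:
    """Fraction of all observations of a child process that came from one parent."""
    if child_total <= 0:
        raise ValueError("child_total must be positive")
    return pair_count / child_total

# excel.exe -> powershell.exe, counts as quoted in the answer above.
rate = parent_prevalence(8, 9_261_865)

print(f"{rate:.2%}")                      # two decimals, as in the answer
print(f"{rate:.7%}")                      # more precision
print(f"about 1 in {round(1 / rate):,}")  # same figure as odds
```

At two decimals the rate prints as 0.00%; expressed as odds it is on the order of one in a million, which is why the pairing reads as rare rather than merely uncommon.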

Detection-validation / coverage

“I'm building detection for PowerShell execution (T1059.001). What adversary-emulation tests can I run to validate coverage, and what should I detect?”

Without Rocky · 9/20

Run Atomic Red Team T1059.001 with its test IDs, plus other frameworks. Monitor process creation (parent/child, command-line flags like -enc and -ep bypass), and enable Script Block Logging (Event ID 4104). Watch for encoded commands and execution-policy bypass.

With Rocky · 17/20

Names the specific Atomic Red Team tests to run (Mimikatz credential dumping, PowerUp Invoke-AllChecks, and the -EncodedCommand / -Command parameter variations), each with the exact command and what to detect. Ties the high-confidence signals to a SigmaHQ rule listing 200+ malicious commandlets, and calls out the gaps (ScriptBlock logging, constrained-language-mode bypasses, parent-process context) using real parent-prevalence counts.

cited: Atomic Red Team: T1059.001 · Sigma: Malicious PowerShell Commandlets
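Both answers point at `-EncodedCommand` as something to detect. PowerShell expects that argument as Base64 over UTF-16LE text, so a triage helper to recover the script from a logged command line can be very short. The sketch below is illustrative; the function names are ours, and it handles only the argument value, not command-line parsing.

```python
import base64

def decode_encoded_command(b64: str) -> str:
    """Decode a PowerShell -EncodedCommand value (Base64 over UTF-16LE text)."""
    return base64.b64decode(b64, validate=True).decode("utf-16-le")

def encode_command(script: str) -> str:
    """Inverse helper, handy for building test fixtures for a detection."""
    return base64.b64encode(script.encode("utf-16-le")).decode("ascii")

print(decode_encoded_command(encode_command("Get-Process | Select-Object -First 1")))
```

Decoding before matching lets a rule inspect the actual script content instead of keying only on the presence of the flag.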

A second live-CVE win

“Are there any actively-exploited UniFi vulnerabilities I should patch?”

Without Rocky · 7/20

As of my last update (April 2024), check CVE-2024-21892 and CVE-2023-46805, but my information may be outdated. Review Ubiquiti security advisories and the CISA KEV catalog for current active exploits.

With Rocky · 19/20

Yes. CVE-2026-34910, a UniFi OS improper-input-validation vulnerability allowing command injection, was added to CISA KEV on 2026-06-23 (federal remediation due 2026-06-26). It maps to T1190 (Exploit Public-Facing Application). Apply vendor mitigations per CISA’s BOD 26-04 and prioritize by internet exposure.

cited: CISA KEV: CVE-2026-34910 (Ubiquiti UniFi OS)

What we learned, and what we're fixing

Of the 30, adding Rocky helped 27 and was flat-to-slightly-down on 3. The most instructive miss was BYOVD (bring your own vulnerable driver). On that one, Rocky nudged the answer down: the bare model cited LOLDrivers, while the answer with Rocky didn't and hedged on the ATT&CK IDs.

That's a grounding gap, not a reasoning gap. LOLDrivers is exactly the kind of living dataset that should fire on a driver question, and it wasn't wired in yet. The fix is a data wedge: add LOLDrivers grounding, re-run, and show the lift. Every miss files a gap; this is the one we're closing first.

What this is, and isn't

This is one run: 30 questions across the detection-engineering categories above, each asked to the same model with and without Rocky, graded blind by an LLM judge on four axes. We version it v1 · June 2026 because we re-run it as we ship new data.

It isn't a full scientific benchmark, and it isn't a head-to-head superiority claim against any model. What it shows is narrow and real: on detection-engineering questions, adding Rocky to the model you already run made the answer better 27 times out of 30, and Haiku + Rocky matched bare Opus on average at roughly a fifth of the cost. As we grow the question set and wire in more sources, v2 will say more.

Try it on your own questions.

Start chatting with Rocky · How Rocky works