Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Sourcing tasks from live workflow signals.

Task topics are sourced from ClawHub marketplace signals (the current release is instantiated from a high-coverage snapshot of the top skills by downloads): not synthetic prompts, but the real enterprise workflow patterns the market demands. 105 tasks across 17 families, scored with deterministic checks plus structured judging, and re-evaluated quarterly to stay aligned with evolving market needs.

Current: April 2026 · v1.0
105 Tasks · 17 Families · 13 Frontier Models · Quarterly Refresh

Release Roadmap

v1.0 (Apr 2026) · 105 tasks · 13 models · CURRENT
v1.1 (next quarter) · re-ingested ClawHub signals · refreshed task set · NEXT

Future versions follow a quarterly cadence as marketplace signals evolve. Specific dates are intentionally not pre-committed.

Leaderboard

Columns: # · Model · Pass Rate · Overall Completion Score · Total Time

Ranked by Pass Rate (a task passes when its score is ≥ 0.80). Ties are broken by Overall Completion Score (raw mean across all 105 tasks, no discount). Total Time is summed across all 105 tasks.

Overall Completion Score by Family

Task Browser

105 real-world enterprise workflow tasks. Click a row to see full prompt, tools, and grading details.

Columns: # · Task · Family · Difficulty · Grader · Tools

Why a Live Benchmark?

Existing benchmarks freeze at publication and drift from real-world needs. We found task allocation in popular benchmarks diverges significantly from what enterprise users actually demand.

3.5× under-represented: Workspace-Repair (27% of market demand vs ~8% in existing benchmarks)
2.8× over-represented: Research (12% of market demand vs ~33% in existing benchmarks)
Quarterly refresh cycle: re-ingest real-world signals, regenerate and reselect tasks

Market Demand vs Existing Benchmark Allocation

Family               Market Weight (ClawHub)   Existing Benchmarks
Workspace-Repair     27.4%                     ~8%
Document-Transform   22.3%                     ~17%
Cross-Tool           19.6%                     ~15%
Research             11.8%                     ~33%
Data-Analysis        10.5%                     ~23%
Communication        7.0%                      ~4%
Key finding: Claw-Eval-Live corrects this mismatch by aligning task distribution with real-world demand signals, re-calibrated every quarter.

Automated Signal-to-Task Pipeline

Real-world marketplace signals are automatically converted into benchmark tasks in 5 steps. The entire pipeline is scripted and reproducible — no manual task curation, no subjective selection. Same scripts + same signal snapshot → same task set.
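To make "same scripts + same signal snapshot → same task set" concrete, a release could pin its inputs with a checksum manifest. A minimal sketch in Python; the file names and manifest fields here are illustrative, not the pipeline's actual format:

```python
# Minimal sketch of pinning a signal snapshot for reproducible builds:
# anyone holding a CSV with this checksum can regenerate the same tasks.
# File names and manifest fields are illustrative, not the real format.
import hashlib
import json
from pathlib import Path

def snapshot_manifest(snapshot_csv: Path, pipeline_version: str) -> dict:
    digest = hashlib.sha256(snapshot_csv.read_bytes()).hexdigest()
    return {
        "snapshot": snapshot_csv.name,
        "sha256": digest,              # identical input => identical task set
        "pipeline": pipeline_version,
    }

if __name__ == "__main__":
    manifest = snapshot_manifest(Path("clawhub_top500_2026-04.csv"), "v1.0")
    Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```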

1. Signal Collection

Automatically ingest the top marketplace skills ranked by downloads on ClawHub. The current release uses a top-500 snapshot: 572K+ downloads for Shell & Terminal alone, with 19 skill families identified.
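A minimal sketch of this step, assuming the snapshot is a CSV export with name, family, and downloads columns (the real export format isn't published):

```python
# Illustrative ingestion of a ClawHub download snapshot.
# Column names ("name", "family", "downloads") are assumptions.
import csv

def top_skills(snapshot_path: str, n: int = 500) -> list[dict]:
    with open(snapshot_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    rows.sort(key=lambda r: int(r["downloads"]), reverse=True)
    return rows[:n]

skills = top_skills("clawhub_top500_2026-04.csv")
families = {r["family"] for r in skills}   # e.g. the 19 skill families above
```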

2. Pattern Clustering

Semantic matching on skill names and summaries groups skills into workflow patterns. This iteration: 33 distinct patterns across 6 benchmark-relevant families.
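The embedding model and clustering method behind this step aren't specified here; as one plausible stand-in, TF-IDF vectors plus k-means over the skills from the ingestion sketch above:

```python
# Stand-in for the semantic matching step: TF-IDF + k-means.
# `skills` comes from the ingestion sketch in step 1; the "summary"
# column is another assumption about the snapshot format.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [s["name"] + " " + s["summary"] for s in skills]
X = TfidfVectorizer(stop_words="english").fit_transform(texts)
labels = KMeans(n_clusters=33, n_init=10, random_state=0).fit_predict(X)
```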

3. Family Weighting

Benchmark slots allocated proportionally to download volume: Workspace-Repair 27%, Document-Transform 22%, Cross-Tool 18%, Research 12%, Data-Analysis 11%, Communication 7%.
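Turning percentage weights into integer task slots is a classic apportionment problem. A sketch using the largest-remainder method, with the published weights as inputs (the real pipeline allocates from raw download counts):

```python
# Largest-remainder apportionment: 105 slots split proportionally,
# guaranteed to sum exactly to the total.
from math import floor

def allocate_slots(weights: dict[str, float], total: int = 105) -> dict[str, int]:
    norm = sum(weights.values())
    quotas = {f: total * w / norm for f, w in weights.items()}
    slots = {f: floor(q) for f, q in quotas.items()}
    leftover = total - sum(slots.values())
    # hand remaining slots to the largest fractional remainders
    for f in sorted(quotas, key=lambda f: quotas[f] - slots[f], reverse=True)[:leftover]:
        slots[f] += 1
    return slots

print(allocate_slots({
    "Workspace-Repair": 27, "Document-Transform": 22, "Cross-Tool": 18,
    "Research": 12, "Data-Analysis": 11, "Communication": 7,
}))
```

The method guarantees the integer counts sum to exactly 105, no matter how the percentages round.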

4. Task Seed Expansion

24 task seeds → 178 candidates via LLM-assisted generation. Each candidate is auto-packaged with a prompt, tool list, mock service config, and deterministic grader script.
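For illustration only, one auto-packaged candidate's task.yaml might look like this; every field name and value below is invented, not the real schema:

```yaml
# Hypothetical task.yaml: all field names and values are illustrative.
id: workspace-repair-017
family: Workspace-Repair
difficulty: medium
prompt: >
  The shared project workspace has three broken symlinks and a corrupted
  config file. Diagnose and repair it without touching user data.
tools: [shell, file_read, file_write]
mock_services:
  - name: filesystem
    fixtures: fixtures/workspace_snapshot.json
grader: graders/workspace_repair_017.py   # deterministic checks
limits: {max_turns: 24, timeout_s: 300}
```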

5. Discrimination-Optimized Selection

Final 105 tasks are selected via tier-alignment scoring, which maximizes model differentiation. Tasks that all models ace or all models fail are automatically dropped.
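A minimal sketch of the drop rule, using pass/fail variance as a stand-in discrimination score (the actual tier-alignment scoring is richer than this):

```python
# Stand-in discrimination score: p(1-p) over per-model pass results is
# 0 when every model passes or every model fails, maximal at a 50/50 split.
def discrimination(passes: list[bool]) -> float:
    p = sum(passes) / len(passes)
    return p * (1 - p)

def select_tasks(results: dict[str, list[bool]], n: int = 105) -> list[str]:
    scored = {task: discrimination(p) for task, p in results.items()}
    keep = [t for t, s in scored.items() if s > 0]      # drop unanimous tasks
    return sorted(keep, key=scored.get, reverse=True)[:n]
```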

Fully Automated

From raw download CSVs to final task.yaml — every step is a script. Human review is optional, not required.

Reproducible

Anyone can re-run the 5-stage pipeline with the same signal snapshot and independently verify the output.

Data-Backed

Every family weight traces to ClawHub download volumes — no hand-waving, no intuition, no committee vote.


Quarterly Evolution

Market demand shifts every quarter — the benchmark shifts with it. We re-run the full pipeline on fresh ClawHub signals each quarter. As new skill categories surge and others plateau, the task distribution automatically re-calibrates.

Family Demand Shift (Illustrative)

Family               Q3 2025   Q4 2025   Q1 2026   Q2 2026 (proj.)   Trend
Workspace-Repair     22.1%     24.8%     27.4%     29.0%             ▲ +7pp
Document-Transform   24.5%     23.2%     22.3%     21.5%             ▼ −3pp
Cross-Tool           16.8%     18.0%     19.6%     20.5%             ▲ +4pp
Research             14.2%     12.9%     11.8%     11.0%             ▼ −3pp
Data-Analysis        12.8%     11.5%     10.5%     10.0%             ▼ −3pp
Communication        5.2%      6.0%      7.0%      7.5%              ▲ +2pp

* Q3/Q4 2025 derived from earlier ClawHub snapshots; Q2 2026 projected from current growth trajectory. Each quarterly refresh re-runs the full automated pipeline on the latest signals.

What this means: Workspace-Repair and Cross-Tool orchestration are gaining share every quarter as enterprises adopt more agent-driven workflows. A benchmark frozen in Q3 2025 would already under-represent these categories by ~7 percentage points. Claw-Eval-Live re-calibrates every quarter to stay aligned.

Release Timeline

1. Q2 2026 — v1.0 (Current): 105 tasks, 17 families, 13 models evaluated. Signal source: ClawHub skill-download snapshot as of April 2026.
2. Q3 2026 — v1.1: updated tasks from real-world signals.
3. Q4 2026 — v1.2: updated tasks from real-world signals.
4. Q1 2027 — v1.3: updated tasks from real-world signals.

How Evaluation Works

Rule-based extraction plus structured LLM judging yields fully explainable scores. Each task ships precise rules that extract structured representations from agent outputs. Where LLM judges are used, they operate under explicit rubrics with grounded evidence: every point deducted is traceable to a specific check.

Task Definition: task.yaml + fixtures/ (prompt, tools, mock data)
→ Mock Services: 8 services with fixed JSON fixtures
→ Agent Loop: prompt → tool call → response → repeat (max 24 turns · 300 s)
→ JSONL Trace: tool calls, responses, tokens, wall time
→ Grader: Python script per task, deterministic checks
→ Score: 0.0–1.0, pass ≥ 0.80
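A rough sketch of the loop above; call_model and dispatch_tool are placeholders for the real harness components:

```python
# Harness loop sketch: up to 24 turns inside a 300-second wall-clock
# budget, appending one JSONL record per turn to the trace file.
import json
import time

def run_episode(task, call_model, dispatch_tool, trace_path):
    start = time.monotonic()
    messages = [{"role": "user", "content": task["prompt"]}]
    with open(trace_path, "a", encoding="utf-8") as trace:
        for turn in range(24):                        # max 24 turns
            if time.monotonic() - start > 300:        # 300 s budget
                break
            reply = call_model(messages)              # placeholder
            trace.write(json.dumps({
                "turn": turn,
                "reply": reply,
                "elapsed_s": time.monotonic() - start,
            }) + "\n")
            if reply.get("tool_call") is None:        # agent finished
                break
            result = dispatch_tool(reply["tool_call"])  # placeholder
            messages += [reply, {"role": "tool", "content": result}]
```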

Grading Evidence Sources

15-20% Data Retrieval

Correct API calls made — verified via dispatch logs recorded by mock services.

40-60% Data Accuracy

Correct numbers, entities, and conclusions in output — verified against ground truth fixtures.

10-20% Action Verification

Required mutations completed — verified via service audit data (created/updated records).
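To show how these three sources could roll up into a single 0.0–1.0 score, here is a sketch that fixes one point inside each published range (20% / 60% / 20%); actual per-task weights vary:

```python
# Weighted roll-up of the three evidence sources. The 0.20/0.60/0.20
# split is one point inside the published ranges, not the real config.
def task_score(retrieval: float, accuracy: float, actions: float) -> float:
    return round(0.20 * retrieval + 0.60 * accuracy + 0.20 * actions, 3)

# Perfect API calls, one wrong number, all mutations applied:
print(task_score(retrieval=1.0, accuracy=0.8, actions=1.0))  # 0.88 -> PASS (>= 0.80)
```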

Scoring & Ranking

Pass Threshold: ≥ 0.80 (score ≥ 0.80 = PASS, otherwise FAIL)
Overall Completion Score: raw mean of all 105 task scores, no discount
Ranking: primary = Pass Rate · tiebreak = Overall Completion Score
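The ranking rule is simple enough to state as code. A sketch, assuming scores[model] holds that model's 105 per-task scores:

```python
# Leaderboard sort: Pass Rate primary, Overall Completion Score tiebreak.
def leaderboard(scores: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    rows = []
    for model, s in scores.items():
        pass_rate = sum(x >= 0.80 for x in s) / len(s)   # pass threshold
        completion = sum(s) / len(s)                      # raw mean, no discount
        rows.append((model, pass_rate, completion))
    rows.sort(key=lambda r: (r[1], r[2]), reverse=True)
    return rows
```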

Related Work

Claw-Eval
Toward Trustworthy Evaluation of Autonomous Agents
The foundational benchmark that Claw-Eval-Live extends. Claw-Eval-Live is the live, signal-driven sequel to the Claw-Eval line of work, refreshing its task set quarterly as enterprise workflow demand evolves.
Visit Claw-Eval  →

Blog


Coming Soon

Benchmark analysis, methodology deep-dives, and quarterly update notes will be posted here.

Frequently Asked Questions

What is Claw-Eval-Live?

Claw-Eval-Live is a living benchmark for evaluating AI agents on real-world enterprise workflows. It contains 105 tasks across 17 families, grounded in ClawHub marketplace signals, and unlike static benchmarks it is re-calibrated quarterly as those signals evolve.

How is it different from other benchmarks?

Most benchmarks freeze at publication. Claw-Eval-Live uses an automated signal-to-task pipeline that re-ingests ClawHub marketplace signals every quarter, ensuring task distribution always reflects current market demand. For example, Workspace-Repair represents 27% of our benchmark vs ~8% in traditional benchmarks.

How often is it updated?

Every quarter (every 3 months). Each update re-runs the full automated pipeline on the latest ClawHub marketplace signals.

How are tasks scored?

Each task is scored 0.0–1.0 using rule-based extraction combined with structured LLM judging under explicit rubrics. Scores ≥ 0.80 count as PASS. Overall Completion Score is the raw mean across all 105 tasks. Ranking: primary = Pass Rate, tiebreak = Overall Completion Score.

What signals is Claw-Eval-Live built from?

Claw-Eval-Live is built from real-world ClawHub marketplace signals, primarily top skills ranked by downloads. The current release derives task families and weights from a top-500 snapshot, but the benchmark is not tied to that fixed number; future refreshes may draw on different snapshot sizes.