Sourcing benchmark tasks from live workflow signals.
Task topics are sourced from ClawHub marketplace signals, currently instantiated from a high-coverage snapshot of top skills by downloads. These are not synthetic prompts but real enterprise workflow patterns that the market demands: 105 tasks across 17 families, scored with deterministic checks plus structured judging, and re-evaluated quarterly to stay aligned with evolving market needs.
Future versions follow a quarterly cadence as marketplace signals evolve. Specific dates are intentionally not pre-committed.
| # | Model | Pass Rate ▼ | Overall Completion Score ▲ | Total Time |
|---|---|---|---|---|
Ranked by Pass Rate (a task passes at a score ≥ 0.80). Ties are broken by Overall Completion Score, the raw mean across all 105 tasks with no discounting. Total Time is summed across all 105 tasks.
105 real-world enterprise workflow tasks. Click a row to see full prompt, tools, and grading details.
| # | Task | Family | Difficulty | Grader | Tools |
|---|---|---|---|---|---|
Existing benchmarks freeze at publication and drift from real-world needs. We found task allocation in popular benchmarks diverges significantly from what enterprise users actually demand.
Real-world marketplace signals are automatically converted into benchmark tasks in 5 steps. The entire pipeline is scripted and reproducible — no manual task curation, no subjective selection. Same scripts + same signal snapshot → same task set.
Automatically ingest the top marketplace skills ranked by downloads on ClawHub. The current release uses a top-500 snapshot: 572K+ downloads for Shell & Terminal alone, with 19 skill families identified.
Semantic matching on skill names and summaries groups skills into workflow patterns. This iteration yields 33 distinct patterns across 6 benchmark-relevant families.
Benchmark slots are allocated proportionally to download volume (see the sketch after step 5): Workspace-Repair 27%, Document-Transform 22%, Cross-Tool 18%, Research 12%, Data-Analysis 11%, Communication 7%.
24 task seeds expand to 178 candidates via LLM-assisted generation. Each candidate is auto-packaged with a prompt, tool list, mock service config, and deterministic grader script.
The final 105 tasks are selected via tier-alignment scoring, which maximizes model differentiation: tasks that all models ace or that all models fail are automatically dropped.
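For concreteness, here is a minimal Python sketch of stages 3 and 5. Everything in it is an illustrative assumption: the function names (`allocate_slots`, `select_tasks`), the candidate fields (`family`, `model_scores`), and the largest-remainder rounding are not taken from the released pipeline scripts.

```python
# Minimal sketch of stages 3 and 5. All names here are illustrative
# assumptions, not the released pipeline scripts.

TOTAL_SLOTS = 105  # final benchmark size

def allocate_slots(downloads: dict[str, int], total: int = TOTAL_SLOTS) -> dict[str, int]:
    """Stage 3: split `total` slots across families in proportion to
    download volume, with largest-remainder rounding so they sum exactly."""
    volume = sum(downloads.values())
    raw = {fam: total * n / volume for fam, n in downloads.items()}
    slots = {fam: int(share) for fam, share in raw.items()}
    leftover = total - sum(slots.values())
    # hand remaining slots to the largest fractional remainders
    for fam in sorted(raw, key=lambda f: raw[f] - slots[f], reverse=True)[:leftover]:
        slots[fam] += 1
    return slots

def select_tasks(candidates: list[dict], slots: dict[str, int]) -> list[dict]:
    """Stage 5: tier-alignment filtering. Drop candidates that every
    model aces (all scores >= 0.80) or every model fails, then fill each
    family quota with the most model-differentiating survivors."""
    kept = [
        c for c in candidates
        if not all(s >= 0.80 for s in c["model_scores"])
        and not all(s < 0.80 for s in c["model_scores"])
    ]
    spread = lambda c: max(c["model_scores"]) - min(c["model_scores"])
    selected = []
    for family, quota in slots.items():
        pool = [c for c in kept if c["family"] == family]
        selected += sorted(pool, key=spread, reverse=True)[:quota]
    return selected
```

Largest-remainder rounding and the score-spread proxy are simple stand-ins; the point is that every quota and every kept task traces back to the signal snapshot.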
From raw download CSVs to final task.yaml — every step is a script. Human review is optional, not required.
Anyone can re-run the 5-stage pipeline with the same signal snapshot and independently verify the output.
Every family weight traces to ClawHub download volumes — no hand-waving, no intuition, no committee vote.
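A minimal sketch of what independent verification could look like, assuming a hypothetical `run_pipeline` entry point and file layout (neither is the published interface):

```python
# Sketch of an independent reproducibility check. `run_pipeline` and the
# file paths are hypothetical stand-ins for the released scripts.
import hashlib
import json
import pathlib

def digest(path: pathlib.Path) -> str:
    """Stable content hash of a generated task set."""
    tasks = sorted(json.loads(path.read_text()), key=lambda t: t["id"])
    return hashlib.sha256(json.dumps(tasks, sort_keys=True).encode()).hexdigest()

from pipeline import run_pipeline  # hypothetical entry point

# Same scripts + same pinned snapshot should yield identical task sets.
run_pipeline(snapshot="signals/2026-04.csv", out="run_a/tasks.json")
run_pipeline(snapshot="signals/2026-04.csv", out="run_b/tasks.json")
assert digest(pathlib.Path("run_a/tasks.json")) == digest(pathlib.Path("run_b/tasks.json"))
```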
Market demand shifts every quarter — the benchmark shifts with it. We re-run the full pipeline on fresh ClawHub signals each quarter. As new skill categories surge and others plateau, the task distribution automatically re-calibrates.
| Family | Q3 2025 | Q4 2025 | Q1 2026 | Q2 2026 (proj.) | Trend |
|---|---|---|---|---|---|
| Workspace-Repair | 22.1% | 24.8% | 27.4% | 29.0% | ▲ +7pp |
| Document-Transform | 24.5% | 23.2% | 22.3% | 21.5% | ▼ −3pp |
| Cross-Tool | 16.8% | 18.0% | 19.6% | 20.5% | ▲ +4pp |
| Research | 14.2% | 12.9% | 11.8% | 11.0% | ▼ −3pp |
| Data-Analysis | 12.8% | 11.5% | 10.5% | 10.0% | ▼ −3pp |
| Communication | 5.2% | 6.0% | 7.0% | 7.5% | ▲ +2pp |
* Q3/Q4 2025 derived from earlier ClawHub snapshots; Q2 2026 projected from current growth trajectory. Each quarterly refresh re-runs the full automated pipeline on the latest signals.
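The Trend column is the projected Q2 2026 share minus the Q3 2025 share, rounded to whole percentage points; a quick check against the figures in the table above:

```python
# Recompute the Trend column from the table's own numbers:
# projected Q2 2026 share minus Q3 2025 share, rounded to whole pp.
weights = {
    "Workspace-Repair": (22.1, 29.0),
    "Document-Transform": (24.5, 21.5),
    "Cross-Tool": (16.8, 20.5),
    "Research": (14.2, 11.0),
    "Data-Analysis": (12.8, 10.0),
    "Communication": (5.2, 7.5),
}
for family, (q3_2025, q2_2026_proj) in weights.items():
    delta = round(q2_2026_proj - q3_2025)
    print(f"{family}: {'▲' if delta > 0 else '▼'} {delta:+d}pp")
```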
105 tasks, 17 families, 13 models evaluated. Signal source: ClawHub skill-download snapshot as of April 2026.
Updated tasks from real-world signals.
Rule-based extraction plus structured LLM judging produces fully explainable scores. Each task has precise rules that extract structured representations from agent outputs. Where LLM judges are used, they operate under explicit rubrics with grounded evidence: every point deducted is traceable to a specific check. The checks fall into three types (a grader sketch follows them):
Correct API calls made — verified via dispatch logs recorded by mock services.
Correct numbers, entities, and conclusions in output — verified against ground truth fixtures.
Required mutations completed — verified via service audit data (created/updated records).
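A minimal grader sketch combining those three check types with a rubric-scored judge. The field names (`dispatch_log`, `required_calls`, `ground_truth`, `audit_records`), the `llm_judge` stub, and the 0.7/0.3 blend are assumptions for illustration, not the actual grader API:

```python
# Minimal grader sketch. Field names, the judge stub, and the weighting
# are assumptions for illustration only.

def llm_judge(transcript: str, rubric: list[str]) -> float:
    """Stand-in for the structured judge: score each rubric item 0/1
    with quoted evidence, return the satisfied fraction."""
    raise NotImplementedError  # hypothetical; not the actual judge

def grade(task: dict, run: dict) -> float:
    checks = [
        # 1. Tool use: every required API call appears in the mock
        #    services' dispatch logs.
        all(call in run["dispatch_log"] for call in task["required_calls"]),
        # 2. Output accuracy: extracted numbers/entities/conclusions
        #    match the ground-truth fixture.
        run["extracted_facts"] == task["ground_truth"],
        # 3. State change: required created/updated records appear in
        #    the mock services' audit data (both sides are sets).
        task["required_mutations"] <= set(run["audit_records"]),
    ]
    rule_score = sum(checks) / len(checks)  # deterministic portion
    judge_score = llm_judge(run["transcript"], task["rubric"])
    # Every deducted point traces to a failed check or a rubric item.
    return 0.7 * rule_score + 0.3 * judge_score
```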
Coming Soon
Benchmark analysis, methodology deep-dives, and quarterly update notes will be posted here.
Claw-Eval-Live is a living benchmark for evaluating AI agents on real-world enterprise workflows. It contains 105 tasks across 17 families, grounded in ClawHub marketplace signals, and unlike static benchmarks it is re-calibrated quarterly as those signals evolve.
Most benchmarks freeze at publication. Claw-Eval-Live uses an automated signal-to-task pipeline that re-ingests ClawHub marketplace signals every quarter, so the task distribution always reflects current market demand. For example, Workspace-Repair makes up 27% of our benchmark versus roughly 8% in traditional benchmarks.
Quarterly. Each update re-runs the full automated pipeline on the latest ClawHub marketplace signals.
Each task is scored 0.0–1.0 using rule-based extraction combined with structured LLM judging under explicit rubrics. Scores ≥ 0.80 count as PASS. Overall Completion Score is the raw mean across all 105 tasks. Ranking: primary = Pass Rate, tiebreak = Overall Completion Score.
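The whole computation fits in a few lines; this sketch assumes per-task scores are floats in [0.0, 1.0], and the names are illustrative:

```python
# Sketch of the leaderboard computation described above.
PASS_THRESHOLD = 0.80

def leaderboard_row(scores: list[float]) -> tuple[float, float]:
    """(Pass Rate, Overall Completion Score) for one model's 105 tasks."""
    pass_rate = sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
    overall = sum(scores) / len(scores)  # raw mean, no discount
    return pass_rate, overall

def rank(models: dict[str, list[float]]) -> list[str]:
    """Primary key: Pass Rate (desc); tiebreak: Overall Completion Score."""
    return sorted(models, key=lambda m: leaderboard_row(models[m]), reverse=True)
```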
Claw-Eval-Live is built from real-world ClawHub marketplace signals, primarily top skills ranked by downloads. The current release uses a top-500 snapshot to derive task families and weights, but the benchmark name itself is not tied to a fixed number.