Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Sourcing tasks from live workflow signals.

Task topics are sourced from ClawHub marketplace signals (the current release is instantiated from a high-coverage snapshot of the top skills by downloads): not synthetic prompts, but the real enterprise workflow patterns the market demands. 105 tasks across 17 families, scored with deterministic checks plus structured judging, and re-evaluated quarterly to stay aligned with evolving market needs.

Current: April 2026 · v1.0
105 Tasks · 17 Families · 13 Frontier Models · Quarterly Refresh

Release Roadmap

v1.0 (Apr 2026) · 105 tasks · 13 models · CURRENT
v1.1 (next quarter) · re-ingested ClawHub signals · refreshed task set · NEXT

Future versions follow a quarterly cadence as marketplace signals evolve. Specific dates are intentionally not pre-committed.

Leaderboard

Columns: # · Model · Pass Rate · Overall Completion Score · Total Time

Ranked by Pass Rate (a task passes when its score is ≥ 0.80). Ties are broken by Overall Completion Score (raw mean across all 105 tasks, no discount). Total Time is summed across all 105 tasks.

Overall Completion Score by Family

Task Browser

105 real-world enterprise workflow tasks. Click a row to see full prompt, tools, and grading details.

Columns: # · Task · Family · Difficulty · Grader · Tools

Why a Live Benchmark?

Existing benchmarks freeze at publication and drift from real-world needs. We found task allocation in popular benchmarks diverges significantly from what enterprise users actually demand.

3.5× under-represented: Workspace-Repair (27% of market demand vs ~8% in existing benchmarks)
2.8× over-represented: Research (12% of market demand vs ~33% in existing benchmarks)
Quarterly refresh cycle: re-ingest real-world signals, regenerate and reselect tasks

Market Demand vs Existing Benchmark Allocation

Family               Market Weight (ClawHub)   Existing Benchmarks
Workspace-Repair     27.4%                     ~8%
Document-Transform   22.3%                     ~17%
Cross-Tool           19.6%                     ~15%
Research             11.8%                     ~33%
Data-Analysis        10.5%                     ~23%
Communication        7.0%                      ~4%
Key finding: Claw-Eval-Live corrects this mismatch by aligning task distribution with real-world demand signals, re-calibrated every quarter.

Automated Signal-to-Task Pipeline

Real-world marketplace signals are automatically converted into benchmark tasks in 5 steps. The entire pipeline is scripted and reproducible — no manual task curation, no subjective selection. Same scripts + same signal snapshot → same task set.
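To make "same scripts + same signal snapshot → same task set" concrete, a release could pin its inputs with a checksum manifest. A minimal sketch in Python; the file names and manifest fields here are illustrative, not the pipeline's actual format:

```python
# Minimal sketch of pinning a signal snapshot for reproducible builds:
# anyone holding a CSV with this checksum can regenerate the same tasks.
# File names and manifest fields are illustrative, not the real format.
import hashlib
import json
from pathlib import Path

def snapshot_manifest(snapshot_csv: Path, pipeline_version: str) -> dict:
    digest = hashlib.sha256(snapshot_csv.read_bytes()).hexdigest()
    return {
        "snapshot": snapshot_csv.name,
        "sha256": digest,              # identical input => identical task set
        "pipeline": pipeline_version,
    }

if __name__ == "__main__":
    manifest = snapshot_manifest(Path("clawhub_top500_2026-04.csv"), "v1.0")
    Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```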

1. Signal Collection

Automatically ingest the top marketplace skills ranked by downloads on ClawHub. The current release uses a top-500 snapshot: 572K+ downloads for Shell & Terminal alone, with 19 skill families identified.
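A minimal sketch of this step, assuming the snapshot is a CSV export with name, family, and downloads columns (the real export format isn't published):

```python
# Illustrative ingestion of a ClawHub download snapshot.
# Column names ("name", "family", "downloads") are assumptions.
import csv

def top_skills(snapshot_path: str, n: int = 500) -> list[dict]:
    with open(snapshot_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    rows.sort(key=lambda r: int(r["downloads"]), reverse=True)
    return rows[:n]

skills = top_skills("clawhub_top500_2026-04.csv")
families = {r["family"] for r in skills}   # e.g. the 19 skill families above
```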

2. Pattern Clustering

Semantic matching on skill names and summaries groups skills into workflow patterns. This iteration: 33 distinct patterns across 6 benchmark-relevant families.
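The embedding model and clustering method behind this step aren't specified here; as one plausible stand-in, TF-IDF vectors plus k-means over the skills from the ingestion sketch above:

```python
# Stand-in for the semantic matching step: TF-IDF + k-means.
# `skills` comes from the ingestion sketch in step 1; the "summary"
# column is another assumption about the snapshot format.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [s["name"] + " " + s["summary"] for s in skills]
X = TfidfVectorizer(stop_words="english").fit_transform(texts)
labels = KMeans(n_clusters=33, n_init=10, random_state=0).fit_predict(X)
```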

3. Family Weighting

Benchmark slots allocated proportionally to download volume: Workspace-Repair 27%, Document-Transform 22%, Cross-Tool 18%, Research 12%, Data-Analysis 11%, Communication 7%.
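Turning percentage weights into integer task slots is a classic apportionment problem. A sketch using the largest-remainder method, with the published weights as inputs (the real pipeline allocates from raw download counts):

```python
# Largest-remainder apportionment: 105 slots split proportionally,
# guaranteed to sum exactly to the total.
from math import floor

def allocate_slots(weights: dict[str, float], total: int = 105) -> dict[str, int]:
    norm = sum(weights.values())
    quotas = {f: total * w / norm for f, w in weights.items()}
    slots = {f: floor(q) for f, q in quotas.items()}
    leftover = total - sum(slots.values())
    # hand remaining slots to the largest fractional remainders
    for f in sorted(quotas, key=lambda f: quotas[f] - slots[f], reverse=True)[:leftover]:
        slots[f] += 1
    return slots

print(allocate_slots({
    "Workspace-Repair": 27, "Document-Transform": 22, "Cross-Tool": 18,
    "Research": 12, "Data-Analysis": 11, "Communication": 7,
}))
```

The method guarantees the integer counts sum to exactly 105, no matter how the percentages round.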

4. Task Seed Expansion

24 task seeds → 178 candidates via LLM-assisted generation. Each candidate is auto-packaged with a prompt, tool list, mock service config, and deterministic grader script.
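For illustration only, one auto-packaged candidate's task.yaml might look like this; every field name and value below is invented, not the real schema:

```yaml
# Hypothetical task.yaml: all field names and values are illustrative.
id: workspace-repair-017
family: Workspace-Repair
difficulty: medium
prompt: >
  The shared project workspace has three broken symlinks and a corrupted
  config file. Diagnose and repair it without touching user data.
tools: [shell, file_read, file_write]
mock_services:
  - name: filesystem
    fixtures: fixtures/workspace_snapshot.json
grader: graders/workspace_repair_017.py   # deterministic checks
limits: {max_turns: 24, timeout_s: 300}
```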

5. Discrimination-Optimized Selection

Final 105 tasks are selected via tier-alignment scoring, which maximizes model differentiation. Tasks that all models ace or all models fail are automatically dropped.
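A minimal sketch of the drop rule, using pass/fail variance as a stand-in discrimination score (the actual tier-alignment scoring is richer than this):

```python
# Stand-in discrimination score: p(1-p) over per-model pass results is
# 0 when every model passes or every model fails, maximal at a 50/50 split.
def discrimination(passes: list[bool]) -> float:
    p = sum(passes) / len(passes)
    return p * (1 - p)

def select_tasks(results: dict[str, list[bool]], n: int = 105) -> list[str]:
    scored = {task: discrimination(p) for task, p in results.items()}
    keep = [t for t, s in scored.items() if s > 0]      # drop unanimous tasks
    return sorted(keep, key=scored.get, reverse=True)[:n]
```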

Fully Automated

From raw download CSVs to final task.yaml — every step is a script. Human review is optional, not required.

Reproducible

Anyone can re-run the 5-stage pipeline with the same signal snapshot and independently verify the output.

Data-Backed

Every family weight traces to ClawHub download volumes — no hand-waving, no intuition, no committee vote.


Quarterly Evolution

Market demand shifts every quarter — the benchmark shifts with it. We re-run the full pipeline on fresh ClawHub signals each quarter. As new skill categories surge and others plateau, the task distribution automatically re-calibrates.

Family Demand Shift (Illustrative)

Family               Q3 2025   Q4 2025   Q1 2026   Q2 2026 (proj.)   Trend
Workspace-Repair     22.1%     24.8%     27.4%     29.0%             ▲ +7pp
Document-Transform   24.5%     23.2%     22.3%     21.5%             ▼ −3pp
Cross-Tool           16.8%     18.0%     19.6%     20.5%             ▲ +4pp
Research             14.2%     12.9%     11.8%     11.0%             ▼ −3pp
Data-Analysis        12.8%     11.5%     10.5%     10.0%             ▼ −3pp
Communication        5.2%      6.0%      7.0%      7.5%              ▲ +2pp

* Q3/Q4 2025 derived from earlier ClawHub snapshots; Q2 2026 projected from current growth trajectory. Each quarterly refresh re-runs the full automated pipeline on the latest signals.

What this means: Workspace-Repair and Cross-Tool orchestration are gaining share every quarter as enterprises adopt more agent-driven workflows. A benchmark frozen in Q3 2025 would already under-represent these categories by ~7 percentage points. Claw-Eval-Live re-calibrates every quarter to stay aligned.

Release Timeline

1. Q2 2026 — v1.0 (Current): 105 tasks, 17 families, 13 models evaluated. Signal source: ClawHub skill-download snapshot as of April 2026.
2. Q3 2026 — v1.1: updated tasks from real-world signals.
3. Q4 2026 — v1.2: updated tasks from real-world signals.
4. Q1 2027 — v1.3: updated tasks from real-world signals.

How Evaluation Works

Rule-based extraction plus structured LLM judging yields fully explainable scores. Each task ships precise rules that extract structured representations from agent outputs. Where LLM judges are used, they operate under explicit rubrics with grounded evidence: every point deducted is traceable to a specific check.

Task Definition: task.yaml + fixtures/ (prompt, tools, mock data)
→ Mock Services: 8 services with fixed JSON fixtures
→ Agent Loop: prompt → tool call → response → repeat (max 24 turns · 300 s)
→ JSONL Trace: tool calls, responses, tokens, wall time
→ Grader: Python script per task, deterministic checks
→ Score: 0.0–1.0, pass ≥ 0.80
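A rough sketch of the loop above; call_model and dispatch_tool are placeholders for the real harness components:

```python
# Harness loop sketch: up to 24 turns inside a 300-second wall-clock
# budget, appending one JSONL record per turn to the trace file.
import json
import time

def run_episode(task, call_model, dispatch_tool, trace_path):
    start = time.monotonic()
    messages = [{"role": "user", "content": task["prompt"]}]
    with open(trace_path, "a", encoding="utf-8") as trace:
        for turn in range(24):                        # max 24 turns
            if time.monotonic() - start > 300:        # 300 s budget
                break
            reply = call_model(messages)              # placeholder
            trace.write(json.dumps({
                "turn": turn,
                "reply": reply,
                "elapsed_s": time.monotonic() - start,
            }) + "\n")
            if reply.get("tool_call") is None:        # agent finished
                break
            result = dispatch_tool(reply["tool_call"])  # placeholder
            messages += [reply, {"role": "tool", "content": result}]
```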

Grading Evidence Sources

15-20% Data Retrieval

Correct API calls made — verified via dispatch logs recorded by mock services.

40-60% Data Accuracy

Correct numbers, entities, and conclusions in output — verified against ground truth fixtures.

10-20% Action Verification

Required mutations completed — verified via service audit data (created/updated records).
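To show how these three sources could roll up into a single 0.0–1.0 score, here is a sketch that fixes one point inside each published range (20% / 60% / 20%); actual per-task weights vary:

```python
# Weighted roll-up of the three evidence sources. The 0.20/0.60/0.20
# split is one point inside the published ranges, not the real config.
def task_score(retrieval: float, accuracy: float, actions: float) -> float:
    return round(0.20 * retrieval + 0.60 * accuracy + 0.20 * actions, 3)

# Perfect API calls, one wrong number, all mutations applied:
print(task_score(retrieval=1.0, accuracy=0.8, actions=1.0))  # 0.88 -> PASS (>= 0.80)
```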

Scoring & Ranking

Pass Threshold: ≥ 0.80 (score ≥ 0.80 = PASS, otherwise FAIL)
Overall Completion Score: raw mean of all 105 task scores, no discount
Ranking: primary = Pass Rate · tiebreak = Overall Completion Score
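The ranking rule is simple enough to state as code. A sketch, assuming scores[model] holds that model's 105 per-task scores:

```python
# Leaderboard sort: Pass Rate primary, Overall Completion Score tiebreak.
def leaderboard(scores: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    rows = []
    for model, s in scores.items():
        pass_rate = sum(x >= 0.80 for x in s) / len(s)   # pass threshold
        completion = sum(s) / len(s)                      # raw mean, no discount
        rows.append((model, pass_rate, completion))
    rows.sort(key=lambda r: (r[1], r[2]), reverse=True)
    return rows
```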

Related Work

Claw-Eval
Toward Trustworthy Evaluation of Autonomous Agents
The foundational benchmark that Claw-Eval-Live extends. Claw-Eval-Live is the live, signal-driven sequel to the Claw-Eval line of work, refreshing its task set quarterly as enterprise workflow demand evolves.
Visit Claw-Eval  →

Blog


Coming Soon

Benchmark analysis, methodology deep-dives, and quarterly update notes will be posted here.

Frequently Asked Questions

What is Claw-Eval-Live?

Claw-Eval-Live is a living benchmark for evaluating AI agents on real-world enterprise workflows. It contains 105 tasks across 17 families, grounded in ClawHub marketplace signals, and unlike static benchmarks it is re-calibrated quarterly as those signals evolve.

How is it different from other benchmarks?

Most benchmarks freeze at publication. Claw-Eval-Live uses an automated signal-to-task pipeline that re-ingests ClawHub marketplace signals every quarter, ensuring task distribution always reflects current market demand. For example, Workspace-Repair represents 27% of our benchmark vs ~8% in traditional benchmarks.

How often is it updated?

Every quarter (every 3 months). Each update re-runs the full automated pipeline on the latest ClawHub marketplace signals.

How are tasks scored?

Each task is scored 0.0–1.0 using rule-based extraction combined with structured LLM judging under explicit rubrics. Scores ≥ 0.80 count as PASS. Overall Completion Score is the raw mean across all 105 tasks. Ranking: primary = Pass Rate, tiebreak = Overall Completion Score.

What signals is Claw-Eval-Live built from?

Claw-Eval-Live is built from real-world ClawHub marketplace signals, primarily top skills ranked by downloads. The current release derives task families and weights from a top-500 snapshot, but the benchmark is not tied to that fixed number; future refreshes may draw on different snapshot sizes.