# 🦀 PinchBench

**Real-world benchmarks for AI coding agents**

[![Leaderboard](https://img.shields.io/badge/leaderboard-pinchbench.com-blue)](https://pinchbench.com)
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)

PinchBench measures how well LLMs perform as the brain of an [OpenClaw](https://github.com/openclaw/openclaw) agent. Instead of synthetic tests, we throw real tasks at agents: scheduling meetings, writing code, triaging email, researching topics, and managing files. Results are collected on a public leaderboard at **[pinchbench.com](https://pinchbench.com)**.

![PinchBench](pinchbench.png)

## Why PinchBench?

Most LLM benchmarks test isolated capabilities. PinchBench tests what actually matters for coding agents:

- **Tool usage** — Can the model call the right tools with the right parameters?
- **Multi-step reasoning** — Can it chain together actions to complete complex tasks?
- **Real-world messiness** — Can it handle ambiguous instructions and incomplete information?
- **Practical outcomes** — Did it actually create the file, send the email, or schedule the meeting?

## Quick Start

```bash
# Clone the skill
git clone https://github.com/pinchbench/skill.git
cd skill

# Run benchmarks with your model of choice
./scripts/run.sh --model anthropic/claude-sonnet-4

# Or run specific tasks
./scripts/run.sh --model openai/gpt-4o --suite task_01_calendar,task_02_stock
```

**Requirements:**

- Python 3.10+
- [uv](https://docs.astral.sh/uv/) package manager
- A running OpenClaw instance

## What Gets Tested

PinchBench includes 23 tasks across real-world categories:

| Category | Tasks | What's tested |
|----------|-------|---------------|
| **Productivity** | Calendar, daily summaries | Event creation, time parsing, scheduling |
| **Research** | Stock prices, conferences, markets | Web search, data extraction, synthesis |
| **Writing** | Blog posts, emails, humanization | Content generation, tone, formatting |
| **Coding** | Weather scripts, file structures | Code generation, file operations |
| **Analysis** | Spreadsheets, PDFs, documents | Data processing, summarization |
| **Email** | Triage, search | Inbox management, filtering |
| **Memory** | Context retrieval, knowledge management | Long-term memory, recall |
| **Skills** | ClawHub, skill discovery | OpenClaw ecosystem integration |

Each task is graded automatically, by an LLM judge, or both — ensuring both objective and nuanced evaluation.

## Submitting Results

To get your results on the leaderboard:

```bash
# Register for an API token (one-time)
./scripts/run.sh --register

# Run benchmark — results auto-upload with your token
./scripts/run.sh --model anthropic/claude-sonnet-4
```

Skip uploading with `--no-upload` if you just want local results.
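For example, a minimal sketch of a local-first workflow, assuming the run writes a results JSON into the chosen output directory (the filename below is hypothetical; all flags are documented in the Command Reference below):

```bash
# Run locally: keep results on disk and skip the leaderboard upload
./scripts/run.sh --model anthropic/claude-sonnet-4 --no-upload --output-dir my-results/

# Later, submit a saved results JSON to the leaderboard
# (the filename is hypothetical; check your output directory for the actual one)
./scripts/run.sh --upload my-results/results.json
```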
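Flags can also be combined for more robust numbers; a sketch, assuming a slower model that needs extra timeout headroom (the model ID is illustrative):

```bash
# Average each automated task over 3 runs, with doubled timeouts
./scripts/run.sh --model openai/gpt-4o --suite automated-only --runs 3 --timeout-multiplier 2
```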
## Command Reference

| Flag | Description |
|------|-------------|
| `--model MODEL` | Model to test (e.g., `anthropic/claude-sonnet-4`) |
| `--suite SUITE` | `all`, `automated-only`, or comma-separated task IDs |
| `--runs N` | Number of runs per task for averaging |
| `--timeout-multiplier N` | Scale timeouts for slower models |
| `--output-dir DIR` | Where to save results (default: `results/`) |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request an API token for submissions |
| `--upload FILE` | Upload a previous results JSON |

## Contributing Tasks

We welcome new tasks! Check out [`tasks/TASK_TEMPLATE.md`](tasks/TASK_TEMPLATE.md) for the format. Good tasks are:

- **Real-world** — Something an actual user would ask an agent to do
- **Measurable** — Clear success criteria that can be graded
- **Reproducible** — Same task should produce consistent grading
- **Challenging** — Tests agent capabilities, not just LLM knowledge

## Links

- **Leaderboard:** [pinchbench.com](https://pinchbench.com)
- **OpenClaw:** [github.com/openclaw/openclaw](https://github.com/openclaw/openclaw)
- **Issues:** [github.com/pinchbench/skill/issues](https://github.com/pinchbench/skill/issues)

## License

MIT — see [LICENSE](LICENSE) for details.

---

*Claw-some AI agent testing* 🦞