# doc_extract

**Repository Path**: origin-zh/doc_extract

## Basic Information

- **Project Name**: doc_extract
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-27
- **Last Updated**: 2026-04-27

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

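# Document Parser v4.1

## Overview

A multi-format document parser optimized for large language models. A single endpoint parses every supported format and returns Markdown plus images.

## Key Features

- **Single endpoint**: `POST /api/v1/parse`, which takes `file` and `use_ocr`, plus an optional `compress_images` flag
- **Smart routing**: PDF, Word, PPT, and image files use OCR by default; all other formats always use standard parsing
- **Dual engines**: standard parsing (pdfplumber, python-docx, and others) plus OCR (PP-StructureV3 / OCR-VL 1.5)
- **Unified output**: every format returns `markdown` + `image_mapping`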

## Supported Formats

| Format | Extension | Default mode |
|--------|-----------|--------------|
| PDF | .pdf | OCR |
| Word | .docx | OCR |
| Word (legacy) | .doc | OCR |
| PowerPoint | .pptx | OCR |
| PowerPoint (legacy) | .ppt | OCR |
| Excel | .xlsx | Standard |
| CSV | .csv | Standard |
| Markdown | .md/.markdown | Standard |
| Plain text | .txt | Standard |
| Image | .png/.jpg/.jpeg/.bmp/.tiff | OCR |
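
The routing rule in the table can be sketched as a small lookup. This is an illustration only, not the project's actual dispatcher: `resolve_parse_mode` is a hypothetical helper, and the `"standard"` mode string is an assumption (the documented response only shows `"ocr"`).

```python
from pathlib import Path

# Extensions that default to OCR, taken from the table above.
OCR_CAPABLE = {".pdf", ".docx", ".doc", ".pptx", ".ppt",
               ".png", ".jpg", ".jpeg", ".bmp", ".tiff"}
# Extensions that always use standard parsing.
STANDARD_ONLY = {".xlsx", ".csv", ".md", ".markdown", ".txt"}


def resolve_parse_mode(filename: str, use_ocr: bool = True) -> str:
    """Return "ocr" or "standard" for a file name (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    if ext in STANDARD_ONLY:
        return "standard"  # use_ocr is ignored for these formats
    if ext in OCR_CAPABLE:
        return "ocr" if use_ocr else "standard"
    raise ValueError(f"unsupported format: {ext or filename}")
```

For example, `resolve_parse_mode("data.xlsx", use_ocr=True)` still yields standard parsing, matching the rule that `use_ocr` only affects PDF, Word, PPT, and image files.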

## Quick Start

### 1. Install

```bash
pip install -r requirements.txt
```

### 2. Configure OCR (optional)

Create `.env` in the project root by copying the example file:

```bash
cp .env.example .env
```

Edit `.env`:

```dotenv
OCR_API_BASE=http://192.168.0.189:18084
OCR_ENGINE=ocr_vl
OCR_TIMEOUT=120
```

In local development mode, `.env` is read on every OCR request, so a changed address takes effect on the next OCR request.

### 3. Start

```bash
python app.py --reload          # development mode
python app.py --workers 4       # production mode
```

### 4. Call the API

```python
import requests

# Default call (PDF/Word/PPT/images go through OCR automatically)
with open("report.pdf", "rb") as f:
    resp = requests.post("http://localhost:8000/api/v1/parse", files={"file": f})
    resp.raise_for_status()  # surface HTTP errors before decoding the body
    result = resp.json()
    print(result["markdown"][0][:200])  # first 200 characters of page 1

# Disable OCR and force standard parsing
with open("data.docx", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/v1/parse?use_ocr=false",
        files={"file": f},
    )

# Excel/CSV/MD/TXT always use standard parsing, whatever use_ocr is set to
with open("data.xlsx", "rb") as f:
    resp = requests.post("http://localhost:8000/api/v1/parse", files={"file": f})
```

```bash
# curl: PDF uses OCR by default
curl -X POST "http://localhost:8000/api/v1/parse" -F "file=@report.pdf"

# curl: disable OCR
curl -X POST "http://localhost:8000/api/v1/parse?use_ocr=false" -F "file=@data.docx"
```

## API

### POST /api/v1/parse

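| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file | File | Yes | - | Document file (max 50 MB) |
| use_ocr | bool | No | true | Whether to use OCR (only applies to PDF/Word/PPT/images) |
| compress_images | bool | No | false | Whether to compress images to WebP |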
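
A client can check the parameters above before uploading. The sketch below is a hedged example: `build_parse_url` and `fits_upload_limit` are hypothetical helpers, it assumes `compress_images` is passed as a query parameter the same way `use_ocr` is in the Quick Start, and it assumes the 50 MB limit means 50 × 1024 × 1024 bytes.

```python
import os
from urllib.parse import urlencode

# 50 MB limit from the table above; the exact byte count is an assumption.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def build_parse_url(base_url: str, use_ocr: bool = True,
                    compress_images: bool = False) -> str:
    """Build the parse URL with lowercase boolean query parameters."""
    query = urlencode({
        "use_ocr": str(use_ocr).lower(),
        "compress_images": str(compress_images).lower(),
    })
    return f"{base_url.rstrip('/')}/api/v1/parse?{query}"


def fits_upload_limit(path: str) -> bool:
    """Return True if the file is within the documented upload limit."""
    return os.path.getsize(path) <= MAX_UPLOAD_BYTES
```

The resulting URL can be passed to `requests.post(..., files={"file": f})` exactly as in the Quick Start calls.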

**Response:**

```json
{
  "request_id": "550e8400-...",
  "status": "success",
  "processing_time_ms": 1250,
  "document": {"name": "report.pdf", "type": ".pdf", "pages": 4, "parse_mode": "ocr"},
  "parse_mode": "ocr",
  "markdown": ["# Page 1 content...", "# Page 2 content..."],
  "image_mapping": [{"imgs/img_1.jpg": "base64..."}, {}],
  "layout_visualizations": [{"layout_det_res": "base64..."}, {}]
}
```
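
Since `markdown` and `image_mapping` are per-page lists, a client usually wants to write them to disk. Here is a minimal sketch under stated assumptions: `save_parse_result` is a hypothetical helper, the image values are assumed to be plain base64 strings without a `data:` URI prefix, and pages are joined with a blank line.

```python
import base64
from pathlib import Path


def save_parse_result(result: dict, out_dir: str) -> Path:
    """Write the Markdown pages and decoded images under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Join pages into one file; the "\n\n" separator is an arbitrary choice.
    md_path = out / "document.md"
    md_path.write_text("\n\n".join(result["markdown"]), encoding="utf-8")

    # image_mapping holds one dict per page: relative path -> base64 payload.
    for page_images in result.get("image_mapping", []):
        for rel_path, payload in page_images.items():
            target = out / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(base64.b64decode(payload))
    return md_path
```

Because the image keys (for example `imgs/img_1.jpg`) are kept as relative paths, image references inside the Markdown should resolve if they use the same paths; that correspondence is not confirmed by the response example.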

### GET /health

```json
{"status": "healthy", "version": "4.1.0", "uptime_seconds": 3600, "supported_formats": [...]}
```
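
The health endpoint is handy for waiting until the service is up before sending documents. The following sketch polls it with the standard library only; `fetch_health` and `wait_until_healthy` are hypothetical helpers, and the retry count and delay are arbitrary defaults.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost:8000") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_healthy(fetch: Callable[[], dict] = fetch_health,
                       attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll until status == "healthy" or the attempts run out."""
    for i in range(attempts):
        try:
            if fetch().get("status") == "healthy":
                return True
        except OSError:
            pass  # connection refused or timed out: server not ready yet
        if i < attempts - 1:
            time.sleep(delay)
    return False
```

Passing `fetch` as a parameter keeps the polling loop testable without a running server.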

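### OCR Configuration Notes

- Local runs read `.env` from the project root by default
- If `OCR_API_BASE` / `OCR_ENGINE` / `OCR_TIMEOUT` are also set as system environment variables, the environment variables take precedence
- Docker deployments continue to use [docker/.env.example](docker/.env.example)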
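
The precedence rule above (environment variables over `.env`) can be illustrated as follows. This is a sketch, not the project's loader: `parse_dotenv` and `load_ocr_config` are hypothetical names, and the parser only handles plain `KEY=VALUE` lines without quoting or variable expansion.

```python
import os
from pathlib import Path

OCR_KEYS = ("OCR_API_BASE", "OCR_ENGINE", "OCR_TIMEOUT")


def parse_dotenv(text: str) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_ocr_config(dotenv_path: str = ".env", environ=None) -> dict:
    """Merge .env with the process environment; the environment wins."""
    environ = os.environ if environ is None else environ
    path = Path(dotenv_path)
    file_values = parse_dotenv(path.read_text(encoding="utf-8")) if path.is_file() else {}
    config = {}
    for key in OCR_KEYS:
        if key in environ:
            config[key] = environ[key]
        elif key in file_values:
            config[key] = file_values[key]
    return config
```

Calling `load_ocr_config()` once per OCR request, rather than once at startup, would give the behavior described in the Quick Start, where an edited `.env` applies to the next request.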

## Frontend Testing

| URL | Description |
|-----|-------------|
| http://localhost:8000 | Standard test page |
| http://localhost:8000/unified-test | Unified test page |
| http://localhost:8000/docs | Swagger docs |

## Project Structure

```
doc_extract/
├── app.py                    # Entry point
├── doc_parser_api.py         # API implementation (the only parsing endpoint)
├── document_parser.py        # Parser dispatch
├── core/                     # Core modules
│   ├── base.py              # BaseParser, LLMSegment
│   ├── middleware.py         # Request tracing, timing, file validation
│   ├── image_cell_matcher.py # Image-to-cell matching
│   └── wmf_converter.py     # WMF/EMF conversion
├── parsers/                  # Parsers
│   ├── pdf/word/ppt/excel/csv/md/text_parser.py
│   ├── ocr_parser.py        # OCR parsing
│   └── ocr_parse/           # OCR client
├── frontend/                 # Frontend test pages
└── data/                     # Test data
```

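## Version History

- **v4.1**: Reduced to a single parsing endpoint, `POST /api/v1/parse`
- **v4.0**: Production-grade API refactor (middleware, unified error handling, image compression)
- **v3.5**: Full-featured release (OCR integration, frontend visualization)
- **v3.0**: Modular refactor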

---

© 2024-2026 Document Parser Project