# semtools

**Repository Path**: mirrors/semtools

## Basic Information

- **Project Name**: semtools
- **Description**: A set of high-performance CLI tools for document processing and semantic search, built with Rust for speed and reliability
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: main
- **Homepage**: https://www.oschina.net/p/semtools
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-12-03
- **Last Updated**: 2026-02-28

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# SemTools

> Semantic search and document parsing tools for the command line

A high-performance CLI tool for document processing and semantic search, built with Rust for speed and reliability.

- **`semtools parse`** - Parse documents (PDF, DOCX, etc.) into markdown, using the LlamaParse API by default
- **`semtools search`** - Local semantic keyword search using multilingual embeddings, with cosine-similarity matching and per-line context
- **`semtools ask`** - AI agent with search and read tools for answering questions over document collections (defaults to OpenAI, but see the [config section](#configuration) to learn how to connect to any OpenAI-compatible API)
- **`semtools workspace`** - Workspace management for accelerating search over large collections

**NOTE:** By default, `parse` uses LlamaParse as a backend. Get your API key for free at [https://cloud.llamaindex.ai](https://cloud.llamaindex.ai). `search` and `workspace` remain local-only. `ask` requires an OpenAI API key.

## Key Features

- **Fast semantic search** using model2vec embeddings from [minishlab/potion-multilingual-128M](https://huggingface.co/minishlab/potion-multilingual-128M)
- **Reliable document parsing** with caching and error handling
- **Unix-friendly** design with proper stdin/stdout handling
- **Configurable** distance thresholds and returned chunk sizes
- **Multi-format support** for parsing documents (PDF, DOCX, PPTX, etc.)
- **Concurrent processing** for better parsing performance
- **Workspace management** for efficient document retrieval over large collections

## Installation

Prerequisites:

- For the `parse` subcommand: a LlamaIndex Cloud API key
- For the `ask` subcommand: an OpenAI API key

Install:

You can install `semtools` via npm:

```bash
npm i -g @llamaindex/semtools
```

Or via cargo:

```bash
# install the entire crate
cargo install semtools

# install only selected features
cargo install semtools --no-default-features --features=parse
```

Note: Installing from npm builds the Rust binaries locally during install if a prebuilt binary is not available, which requires Rust and Cargo to be available in your environment. Install them via `rustup` if needed: [https://www.rust-lang.org/tools/install](https://www.rust-lang.org/tools/install).

## Quick Start

Basic Usage:

```bash
# Parse some files
semtools parse my_dir/*.pdf

# Search some (text-based) files
semtools search "some keywords" *.txt --max-distance 0.3 --n-lines 5

# Ask questions about your documents using an AI agent
semtools ask "What are the main findings?" papers/*.txt

# Combine parsing and search
semtools parse my_docs/*.pdf | xargs semtools search "API endpoints"

# Ask a question about a set of files
semtools ask "Some question?" *.txt

# Combine parsing with the ask agent
semtools parse research_papers/*.pdf | xargs semtools ask "Summarize the key methodologies"

# Ask based on stdin content
cat README.md | semtools ask "How do I install SemTools?"
```

Advanced Usage:

```bash
# Combine with grep for exact-match pre-filtering and distance thresholding
semtools parse *.pdf | xargs cat | grep -i "error" | semtools search "network error" --max-distance 0.3

# Pipeline with content search (note the 'xargs' on search to search files instead of stdin)
find . -name "*.md" | xargs semtools parse | xargs semtools search "installation"

# Combine with grep for filtering (grep could come before or after parse/search!)
semtools parse docs/*.pdf | xargs semtools search "API" | grep -A5 "authentication"

# Save search results from a stdin search
semtools parse report.pdf | xargs cat | semtools search "summary" > results.txt
```

Using Workspaces:

```bash
# Create or select a workspace
# Workspaces are stored in ~/.semtools/workspaces/
semtools workspace use my-workspace
> Workspace 'my-workspace' configured.
> To activate it, run:
>   export SEMTOOLS_WORKSPACE=my-workspace
>
> Or add this to your shell profile (.bashrc, .zshrc, etc.)

# Activate the workspace
export SEMTOOLS_WORKSPACE=my-workspace

# All search commands will now use the workspace for caching embeddings
# The initial command is used to initialize the workspace
semtools search "some keywords" ./some_large_dir/*.txt --n-lines 5 --top-k 10

# If documents change, they are automatically re-embedded and cached
echo "some new content" > ./some_large_dir/some_file.txt
semtools search "some keywords" ./some_large_dir/*.txt --n-lines 5 --top-k 10

# If documents are removed, you can run prune to clean up stale files
semtools workspace prune

# You can see the stats of a workspace at any time
semtools workspace status
> Active workspace: arxiv
> Root: /Users/loganmarkewich/.semtools/workspaces/arxiv
> Documents: 3000
> Index: Yes (IVF_PQ)
```

## CLI Help

```bash
$ semtools parse --help
A CLI tool for parsing documents using various backends

Usage: semtools parse [OPTIONS] <FILES>...

Arguments:
  <FILES>...  Files to parse

Options:
  -c, --config <CONFIG>    Path to the config file. Defaults to ~/.semtools_config.json
  -b, --backend <BACKEND>  The backend type to use for parsing. Defaults to `llama-parse` [default: llama-parse]
  -v, --verbose            Verbose output while parsing
  -h, --help               Print help
```

```bash
$ semtools search --help
A CLI tool for fast semantic keyword search

Usage: semtools search [OPTIONS] <QUERY> [FILES]...

Arguments:
  <QUERY>     Query to search for (positional argument)
  [FILES]...
              Files to search, optional if using stdin

Options:
  -n, --n-lines <N_LINES>            How many lines before/after to return as context [default: 3]
      --top-k <TOP_K>                The top-k files or texts to return (ignored if max_distance is set) [default: 3]
  -m, --max-distance <MAX_DISTANCE>  Return all results with distance below this threshold (0.0+)
  -i, --ignore-case                  Perform case-insensitive search (default is false)
  -j, --json                         Output results in JSON format
  -h, --help                         Print help
```

```bash
$ semtools workspace --help
Manage semtools workspaces

Usage: semtools workspace [OPTIONS] <COMMAND>

Commands:
  use     Use or create a workspace (prints export command to run)
  status  Show active workspace and basic stats
  prune   Remove stale or missing files from store
  help    Print this message or the help of the given subcommand(s)

Options:
  -j, --json  Output results in JSON format
  -h, --help  Print help
```

```bash
$ semtools ask --help
A CLI tool for document-based question-answering

Usage: semtools ask [OPTIONS] <QUERY> [FILES]...

Arguments:
  <QUERY>     Query to prompt the agent with
  [FILES]...  Files to search, optional if using stdin

Options:
  -c, --config <CONFIG>      Path to the config file. Defaults to ~/.semtools_config.json
      --api-key <API_KEY>    OpenAI API key (overrides config file and env var)
      --base-url <BASE_URL>  OpenAI base URL (overrides config file)
  -m, --model <MODEL>        Model to use for the agent (overrides config file)
      --api-mode <API_MODE>  API mode to use: 'chat' or 'responses' (overrides config file)
  -j, --json                 Output results in JSON or text format
  -h, --help                 Print help
```

## Configuration

SemTools uses a unified configuration file at `~/.semtools_config.json` that contains settings for all CLI tools. You can also specify a custom config file path using the `-c` or `--config` flag on any command.

### Unified Configuration File

Create a `~/.semtools_config.json` file with settings for the tools you use. All sections are optional; if a section is not specified, sensible defaults are used.
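A malformed config file is a common source of confusion. As a quick sanity check (this is ordinary shell, not a semtools feature; `python3` is used here purely as a JSON validator), you can confirm the file parses before troubleshooting anything else:

```shell
# Sanity-check that the semtools config file is valid JSON.
# Not part of semtools itself; python3 acts only as a JSON validator here.
check_config() {
  if [ -f "$1" ]; then
    python3 -m json.tool "$1" > /dev/null && echo "config OK"
  else
    echo "no config file at $1 (built-in defaults will be used)"
  fi
}

check_config "${HOME}/.semtools_config.json"
```

If the file exists but is not valid JSON, `json.tool` prints the parse error and the function exits non-zero.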
(The `parse_kwargs` section is passed directly to LlamaParse; [see the docs](https://api.cloud.llamaindex.ai/docs#/V2/parse_file_api_v2_parse_post) for available options.)

```json
{
  "parse": {
    "api_key": "your_llama_cloud_api_key_here",
    "num_ongoing_requests": 10,
    "base_url": "https://api.cloud.llamaindex.ai",
    "parse_kwargs": {
      "tier": "agentic",
      "version": "latest",
      "disable_cache": false
    },
    "check_interval": 5,
    "max_timeout": 3600,
    "max_retries": 10,
    "retry_delay_ms": 1000,
    "backoff_multiplier": 2.0
  },
  "ask": {
    "api_key": "your_openai_api_key_here",
    "base_url": null,
    "model": "gpt-4o-mini",
    "max_iterations": 20,
    "api_mode": "responses"
  }
}
```

(`api_mode` can be `"responses"` or `"chat"`.)

Find out more about parsing configuration [on the dedicated documentation page](https://developers.llamaindex.ai/python/cloud/llamaparse/api-v2-guide/). See `example_semtools_config.json` in the repository for a complete example.

### Environment Variables

As an alternative or supplement to the config file, you can set API keys via environment variables:

```bash
# For the parse tool
export LLAMA_CLOUD_API_KEY="your_llama_cloud_api_key_here"

# For the ask tool
export OPENAI_API_KEY="your_openai_api_key_here"
```

### Configuration Priority

Configuration values are resolved in the following priority order (highest to lowest):

1. **CLI arguments** (e.g., `--api-key`, `--model`, `--base-url`)
2. **Config file** (`~/.semtools_config.json` or a custom path via `-c`)
3. **Environment variables** (`LLAMA_CLOUD_API_KEY`, `OPENAI_API_KEY`)
4. **Built-in defaults**

This allows you to set common defaults in the config file while overriding them on a per-command basis when needed.

### Subcommand-Specific Configuration

#### Parse Subcommand

The `parse` subcommand requires a LlamaParse API key. Get your free API key at [https://cloud.llamaindex.ai](https://cloud.llamaindex.ai).
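The retry settings shown in the config example above (`max_retries`, `retry_delay_ms`, `backoff_multiplier`) describe an exponential backoff between parse attempts. As an illustration only (plain shell with integer milliseconds, not semtools source code), the implied delay schedule looks like:

```shell
# Illustration only: the retry schedule implied by retry_delay_ms=1000 and
# backoff_multiplier=2.0, capped by max_retries. Integer math for simplicity;
# this is not the actual semtools implementation.
backoff_schedule() {
  delay=$1; mult=$2; retries=$3
  i=1
  while [ "$i" -le "$retries" ]; do
    echo "retry $i after ${delay}ms"
    delay=$((delay * mult))
    i=$((i + 1))
  done
}

backoff_schedule 1000 2 4
# retry 1 after 1000ms
# retry 2 after 2000ms
# retry 3 after 4000ms
# retry 4 after 8000ms
```

Raising `backoff_multiplier` spreads later retries further apart, which is gentler on the API when parses queue for a long time.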
Configuration options:

- `api_key`: Your LlamaParse API key
- `base_url`: API endpoint (default: `https://api.cloud.llamaindex.ai`)
- `num_ongoing_requests`: Number of concurrent requests (default: 10)
- `parse_kwargs`: Additional parsing parameters
- `check_interval`, `max_timeout`, `max_retries`, `retry_delay_ms`, `backoff_multiplier`: Retry and timeout settings

#### Ask Subcommand

The `ask` subcommand requires an OpenAI API key for the agent's LLM.

Configuration options:

- `api_key`: Your OpenAI API key
- `base_url`: Custom OpenAI-compatible API endpoint (optional, for using other providers)
- `model`: LLM model to use (default: "gpt-4o-mini")
- `max_iterations`: Maximum agent loop iterations (default: 10)

You can also override these per command:

```bash
semtools ask "What is this about?" docs/*.txt --model gpt-4o --api-key sk-...
```

## Agent Use Case Examples

- [Using Semtools with Coding Agents](examples/use_with_coding_agents.md)
- [Using Semtools with MCP](examples/use_with_mcp.md)

## Future Work

- [ ] More parsing backends (something local-only would be great!)
- [ ] Improved search algorithms
- [x] Built-in agentic search
- [x] Persistence for speedups on repeat searches over the same files

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- [LlamaIndex/LlamaParse](https://cloud.llamaindex.ai/) for document parsing capabilities
- [model2vec-rs](https://github.com/MinishLab/model2vec-rs) for fast embedding generation
- [minishlab/potion-multilingual-128M](https://huggingface.co/minishlab/potion-multilingual-128M) for an amazing default static embedding model
- [simsimd](https://github.com/ashvardanian/simsimd) for efficient similarity computation