# knowledge-center

**Repository Path**: nojun/knowledge-center

## Basic Information

- **Project Name**: knowledge-center
- **Default Branch**: master
- **Created**: 2026-04-02
- **Last Updated**: 2026-05-06

## README

# Knowledge Center

Multimodal knowledge base for documents, images, videos, and process-layer audio. The backend stores source files, extracts text and visual signals, writes PostgreSQL records and Milvus vectors, then serves semantic and keyword search.

## Architecture

Core stack:

- Backend: FastAPI, SQLAlchemy 2.0, Celery, Python 3.9
- Data stores: PostgreSQL, Redis, Milvus, MinIO
- Embeddings: bge-m3 text embeddings, CLIP ViT-B/32 visual embeddings
- Search enhancement: PostgreSQL `tsvector`, query expansion, optional HyDE, CrossEncoder reranking
- Frontend: `web/` React + Ant Design legacy UI, `web2/` React + TypeScript + Tailwind newer UI

Core flow:

```text
Upload/API
  -> upload service
  -> MinIO object storage + Document + ProcessingTask
  -> Celery task
  -> process service
  -> DocumentChunk + PostgreSQL search_vector + Milvus v2 vectors
  -> search service
  -> vector/keyword retrieval + RRF + rerank
```

Sync flow:

```text
DataSource
  -> sync batch
  -> folder scan
  -> sync task
  -> remote file download
  -> process service
  -> sync record
```

## Module Documentation

Read the module-local `_readme.md` before changing a core flow:

- [Process module](app/services/process/_readme.md): parsing, chunking, embedding, AI description, and indexing
- [Upload module](app/services/upload/_readme.md): upload validation, MinIO storage, DB records, and Celery dispatch
- [Search module](app/services/search/_readme.md): text/image/hybrid retrieval, RRF, reranking, and result assembly
- [Sync module](app/services/sync/_readme.md): SMB batch sync, folder/task state, and duplicate records
- [Tasks module](app/tasks/_readme.md): Celery app, task lifecycle, ACK behavior, and operational risks

When a structural module change lands, update the matching `_readme.md` in the same change. New module readmes should cover purpose, boundaries, architecture, core flows, component map, examples, risks, and improvement notes.

## Storage Model

Default upload storage is MinIO. `FileStorageService()` creates a MinIO client and stores uploaded objects under:

```text
{YYYY}/{MM}/{DD}/{uuid}.{ext}
```

For normal uploads, `Document.file_path` stores the MinIO object key and `Document.storage_type` is `minio`. The local filesystem branch is only used when `FileStorageService` is constructed with a `base_path`, mainly for tests or compatibility.

Size limits from `app/core/config.py`:

- Documents: 100MB
- Images: 100MB in the current upload validation path
- Videos: 10GB

The process layer has an `AudioProcessor`, but the default upload validator does not currently admit audio extensions. Opening audio upload requires validation, MIME, size-policy, API test, and documentation updates.

## Vector Retrieval Model

Current processing writes Milvus v2 records:

```text
MILVUS_COLLECTION_V2 = multimodal_knowledge_v2
├── text_embedding: 1024d bge-m3
└── visual_embedding: 512d CLIP ViT-B/32
```

Legacy v1 config still exists:

```text
MILVUS_COLLECTION = multimodal_knowledge
└── embedding: 512d single-vector legacy field
```

Current search should use the v2 collection. Text search combines bge-m3 vector recall, optional query expansion and HyDE, CLIP visual recall when requested, PostgreSQL keyword recall, RRF fusion, CrossEncoder reranking, and document-level deduplication.

## Database Schema

Current SQLAlchemy models live in `app/models/database.py`.
Primary tables:

- `users`: account and role data
- `datasets`: dataset metadata and ownership
- `documents`: source file metadata, storage reference, processing status, AI fields, and sync backreferences
- `document_chunks`: text, OCR, transcript, frame, image, summary chunks, embeddings, timestamps, `search_vector`, and `thumbnail_key`
- `processing_tasks`: upload/process task state
- `data_sources`: external source configuration
- `sync_batches`: sync batch lifecycle
- `sync_folders`: scanned folder or file entries
- `sync_tasks`: folder-level sync execution state
- `sync_records`: path-hash-based sync history and duplicate detection
- `access_logs`: access and search feedback logs

Retired sync tables such as `batch_tasks` and `datasource_files` are not part of the active schema.

## Local Setup

1. Create virtual environment and install dependencies:

   ```bash
   python3 -m venv .venv
   .venv/bin/pip install -r requirements.txt
   ```

2. Configure env:

   ```bash
   cp .env.example .env
   ```

3. Start infrastructure:

   ```bash
   docker compose up -d
   ```

4. Run migrations:

   ```bash
   PYTHONPATH=. .venv/bin/alembic upgrade head
   ```

5. Run API:

   ```bash
   .venv/bin/uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
   ```

6. Run worker:

   ```bash
   .venv/bin/celery -A app.tasks.processing worker --loglevel=info --concurrency=2
   ```

## Frontend

Legacy UI:

```bash
cd web
pnpm dev
```

Newer UI:

```bash
cd web2
npm run dev
```

## Tests

Unit tests:

```bash
.venv/bin/pytest tests/unit -v
```

Integration tests:

```bash
RUN_INTEGRATION_TESTS=1 .venv/bin/pytest tests/integration -v
```

Focused examples:

```bash
.venv/bin/pytest tests/unit/test_vector_retrieval.py -v
.venv/bin/pytest tests/unit/test_search_api.py -v
```
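For readers new to the RRF fusion step named under Vector Retrieval Model, the following is a minimal, standalone sketch of reciprocal rank fusion over two ranked result lists. It is not taken from `app/services/search`; the function name `rrf_fuse`, the chunk IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration, and the repository's own implementation may differ.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked ID lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) for every ID it contains,
    where rank starts at 1. IDs missing from a list contribute nothing.
    Hypothetical helper; `k` is an assumed default, not a project setting.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for stable output.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Illustrative inputs standing in for vector recall and keyword recall.
vector_hits = ["chunk-a", "chunk-b", "chunk-c"]
keyword_hits = ["chunk-b", "chunk-d", "chunk-a"]

fused = rrf_fuse([vector_hits, keyword_hits])
for item_id, score in fused:
    print(f"{item_id}\t{score:.6f}")
```

Items that appear near the top of both lists accumulate the largest scores, which is why `chunk-b` ends up ahead of `chunk-a` here. Because only ranks are used, the raw vector distances and keyword scores never need to be put on a common scale before fusion.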