# knowledge-center

**Repository Path**: nojun/knowledge-center

## Basic Information

- **Project Name**: knowledge-center
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-02
- **Last Updated**: 2026-05-06

## README

# Knowledge Center

Multimodal knowledge base for documents, images, and videos, with audio currently handled only at the process layer. The backend stores source files, extracts text and visual signals, writes PostgreSQL records and Milvus vectors, and serves semantic and keyword search.

## Architecture

Core stack:

- Backend: FastAPI, SQLAlchemy 2.0, Celery, Python 3.9
- Data stores: PostgreSQL, Redis, Milvus, MinIO
- Embeddings: bge-m3 text embeddings, CLIP ViT-B/32 visual embeddings
- Search enhancement: PostgreSQL `tsvector`, query expansion, optional HyDE, CrossEncoder reranking
- Frontend: `web/` React + Ant Design legacy UI, `web2/` React + TypeScript + Tailwind newer UI

Core flow:

```text
Upload/API
  -> upload service
  -> MinIO object storage + Document + ProcessingTask
  -> Celery task
  -> process service
  -> DocumentChunk + PostgreSQL search_vector + Milvus v2 vectors
  -> search service
  -> vector/keyword retrieval + RRF + rerank
```

Sync flow:

```text
DataSource
  -> sync batch
  -> folder scan
  -> sync task
  -> remote file download
  -> process service
  -> sync record
```

## Module Documentation

Read the module-local `_readme.md` before changing a core flow:

- [Process module](app/services/process/_readme.md): parsing, chunking, embedding, AI description, and indexing
- [Upload module](app/services/upload/_readme.md): upload validation, MinIO storage, DB records, and Celery dispatch
- [Search module](app/services/search/_readme.md): text/image/hybrid retrieval, RRF, reranking, and result assembly
- [Sync module](app/services/sync/_readme.md): SMB batch sync, folder/task state, and duplicate records
- [Tasks module](app/tasks/_readme.md): Celery app, task lifecycle, ACK behavior, and operational risks

When a structural module change lands, update the matching `_readme.md` in the same change. New module readmes should cover purpose, boundaries, architecture, core flows, component map, examples, risks, and improvement notes.

## Storage Model

Default upload storage is MinIO. `FileStorageService()` creates a MinIO client and stores uploaded objects under:

```text
{YYYY}/{MM}/{DD}/{uuid}.{ext}
```

For normal uploads, `Document.file_path` stores the MinIO object key and `Document.storage_type` is `minio`. The local filesystem branch is only used when `FileStorageService` is constructed with a `base_path`, mainly for tests or compatibility.
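The key layout above can be sketched as follows. This is a minimal illustration, not the code in `FileStorageService`: the function name `build_object_key`, the lowercasing of the extension, and the use of the dashed UUID string form are all assumptions.

```python
import uuid
from datetime import date
from pathlib import PurePosixPath
from typing import Optional


def build_object_key(filename: str, today: Optional[date] = None) -> str:
    """Return a key shaped like {YYYY}/{MM}/{DD}/{uuid}.{ext} (hypothetical helper)."""
    today = today or date.today()
    # Keep only the final extension, without the dot. Lowercasing is an assumption.
    ext = PurePosixPath(filename).suffix.lstrip(".").lower()
    name = str(uuid.uuid4())
    if ext:
        name = f"{name}.{ext}"
    # Zero-padded date parts form the key prefix.
    return f"{today:%Y/%m/%d}/{name}"


if __name__ == "__main__":
    print(build_object_key("Report.PDF", date(2026, 4, 2)))
```

Because the key carries a random UUID rather than the original filename, two uploads of the same file on the same day get distinct object keys.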

Size limits from `app/core/config.py`:

- Documents: 100MB
- Images: 100MB in the current upload validation path
- Videos: 10GB

The process layer has an `AudioProcessor`, but the default upload validator does not currently admit audio extensions. Opening audio upload requires validation, MIME, size-policy, API test, and documentation updates.
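A minimal sketch of how these limits might be enforced, and why audio files are rejected today. The extension sets, the `classify_upload` name, and the use of binary megabytes (1 MB = 1024 * 1024 bytes) are assumptions for illustration; the authoritative values live in `app/core/config.py` and the upload validator.

```python
from pathlib import PurePosixPath

MB = 1024 * 1024  # assumption: binary units
GB = 1024 * MB

# Size limits per category, matching the list above.
SIZE_LIMITS = {"document": 100 * MB, "image": 100 * MB, "video": 10 * GB}

# Hypothetical extension map; note that no audio extensions are listed.
EXTENSIONS = {
    "document": {"pdf", "docx", "txt"},
    "image": {"jpg", "jpeg", "png"},
    "video": {"mp4", "mov"},
}


def classify_upload(filename: str, size_bytes: int) -> str:
    """Return the category for an upload, or raise ValueError if it is rejected."""
    ext = PurePosixPath(filename).suffix.lstrip(".").lower()
    for category, extensions in EXTENSIONS.items():
        if ext in extensions:
            if size_bytes > SIZE_LIMITS[category]:
                raise ValueError(f"{category} exceeds {SIZE_LIMITS[category]} bytes")
            return category
    # Audio and anything else not in the map ends up here.
    raise ValueError(f"unsupported extension: {ext!r}")
```

In this shape, admitting audio would mean adding an `"audio"` entry to both maps, which mirrors the validation and size-policy work described above.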

## Vector Retrieval Model

Current processing writes Milvus v2 records:

```text
MILVUS_COLLECTION_V2 = multimodal_knowledge_v2
├── text_embedding:   1024d bge-m3
└── visual_embedding: 512d CLIP ViT-B/32
```
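Since the two v2 fields have different dimensions, a cheap length check before insertion can catch a vector routed to the wrong field. The sketch below is an assumption about how such a guard could look; `check_vector` is a hypothetical helper and does not call Milvus.

```python
from typing import Sequence

# Expected dimensions of the v2 vector fields, as listed above.
V2_DIMS = {"text_embedding": 1024, "visual_embedding": 512}


def check_vector(field: str, vector: Sequence[float]) -> None:
    """Raise ValueError if the vector length does not match the field's dimension."""
    if field not in V2_DIMS:
        raise ValueError(f"unknown v2 vector field: {field!r}")
    expected = V2_DIMS[field]
    if len(vector) != expected:
        raise ValueError(f"{field} expects {expected} dims, got {len(vector)}")
```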

Legacy v1 config still exists:

```text
MILVUS_COLLECTION = multimodal_knowledge
└── embedding: 512d single-vector legacy field
```

Current search should use the v2 collection. Text search combines bge-m3 vector recall, optional query expansion and HyDE, CLIP visual recall when requested, PostgreSQL keyword recall, RRF fusion, CrossEncoder reranking, and document-level deduplication.
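To illustrate the fusion and deduplication steps, here is a minimal sketch of reciprocal rank fusion (RRF) followed by document-level deduplication. It skips query expansion, HyDE, and CrossEncoder reranking. The function names, the constant `k = 60`, and the chunk-to-document mapping are assumptions, not the search module's actual code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked chunk-id lists: score(c) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def dedupe_by_document(
    fused: Sequence[Tuple[str, float]], chunk_to_doc: Dict[str, str]
) -> List[Tuple[str, str, float]]:
    """Keep only the best-scoring chunk per document, preserving order."""
    seen = set()
    results = []
    for chunk_id, score in fused:
        doc_id = chunk_to_doc[chunk_id]
        if doc_id not in seen:
            seen.add(doc_id)
            results.append((doc_id, chunk_id, score))
    return results


if __name__ == "__main__":
    vector_hits = ["c1", "c2", "c3"]   # e.g. bge-m3 vector recall
    keyword_hits = ["c3", "c1", "c4"]  # e.g. PostgreSQL keyword recall
    fused = rrf_fuse([vector_hits, keyword_hits])
    docs = {"c1": "d1", "c2": "d1", "c3": "d2", "c4": "d3"}
    for row in dedupe_by_document(fused, docs):
        print(row)
```

RRF uses only ranks, so it can combine vector similarity scores and keyword relevance scores without normalizing them onto a common scale.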

## Database Schema

Current SQLAlchemy models live in `app/models/database.py`.

Primary tables:

- `users`: account and role data
- `datasets`: dataset metadata and ownership
- `documents`: source file metadata, storage reference, processing status, AI fields, and sync backreferences
- `document_chunks`: text, OCR, transcript, frame, image, summary chunks, embeddings, timestamps, `search_vector`, and `thumbnail_key`
- `processing_tasks`: upload/process task state
- `data_sources`: external source configuration
- `sync_batches`: sync batch lifecycle
- `sync_folders`: scanned folder or file entries
- `sync_tasks`: folder-level sync execution state
- `sync_records`: path-hash-based sync history and duplicate detection
- `access_logs`: access and search feedback logs

Retired sync tables such as `batch_tasks` and `datasource_files` are not part of the active schema.
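As an illustration of path-hash-based duplicate detection in `sync_records`, the sketch below hashes a normalized remote path together with a data source id. The normalization rules (slash unification, lowercasing), the SHA-256 choice, and the helper names are assumptions; the sync module's `_readme.md` describes the real behavior.

```python
import hashlib
from typing import Set


def path_hash(source_id: int, remote_path: str) -> str:
    """Hash a remote path so the same file maps to the same record key (hypothetical)."""
    # Assumption: SMB-style paths are treated as case-insensitive and slash-agnostic.
    normalized = remote_path.replace("\\", "/").strip("/").lower()
    return hashlib.sha256(f"{source_id}:{normalized}".encode("utf-8")).hexdigest()


def is_duplicate(seen_hashes: Set[str], source_id: int, remote_path: str) -> bool:
    """Return True if the path was already synced; otherwise record it and return False."""
    digest = path_hash(source_id, remote_path)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```

In a real deployment the `seen_hashes` set would be replaced by a lookup against the `sync_records` table.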

## Local Setup

1. Create a virtual environment and install dependencies:

```bash
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
```

2. Configure the environment:

```bash
cp .env.example .env
```

3. Start infrastructure:

```bash
docker compose up -d
```

4. Run migrations:

```bash
PYTHONPATH=. .venv/bin/alembic upgrade head
```

5. Run the API:

```bash
.venv/bin/uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

6. Run the worker:

```bash
.venv/bin/celery -A app.tasks.processing worker --loglevel=info --concurrency=2
```

## Frontend

Legacy UI:

```bash
cd web
pnpm dev
```

Newer UI:

```bash
cd web2
npm run dev
```

## Tests

Unit tests:

```bash
.venv/bin/pytest tests/unit -v
```

Integration tests:

```bash
RUN_INTEGRATION_TESTS=1 .venv/bin/pytest tests/integration -v
```

Focused examples:

```bash
.venv/bin/pytest tests/unit/test_vector_retrieval.py -v
.venv/bin/pytest tests/unit/test_search_api.py -v
```