MetaSearch

A Deep Iterative Retrieval-Augmented Generation System

An educational project that walks you through building, from scratch, a RAG system that combines multiple advanced techniques with large language models, letting you experience intelligent information exploration first-hand.

What is MetaSearch?

MetaSearch is an advanced Retrieval-Augmented Generation (RAG) system built upon the principle of deep iterative retrieval. It progressively refines searches, integrates multiple retrieval techniques, and leverages Large Language Models (LLMs) to provide comprehensive and contextually rich answers.

Key Features

Modular RAG Framework

Learn best practices for building LLM projects with a clean, modular architecture following community standards.

Deep Iterative Retrieval

Implements cutting-edge RAG algorithms, exploring information deeply through multiple search iterations.

Hybrid Retrieval Fusion

Combines vector search, keyword search (TF-IDF), and knowledge graph retrieval for broader coverage.

Intelligent Query Expansion

Dynamically generates sub-queries using LLMs to achieve greater breadth and depth in knowledge exploration.

Adaptive Search Control

Decides whether to continue searching based on the ratio of newly discovered information, optimizing efficiency.

Diversity Re-ranking

Utilizes Maximal Marginal Relevance (MMR) to balance relevance and diversity, providing more comprehensive results.
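To make the idea concrete, here is a minimal MMR sketch over cosine similarities, assuming pre-computed embedding vectors; the function and parameter names are illustrative, not the project's actual API.

# Example sketch: MMR re-ranking (illustrative, not the project's actual API)
import numpy as np

def mmr_rerank(query_vec, doc_vecs, k=5, lambda_mult=0.7):
    """Select k documents balancing query relevance and mutual diversity."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of chosen documents, relevant-yet-diverse first

rng = np.random.default_rng(0)
docs = rng.normal(size=(10, 8))      # 10 dummy document embeddings
query = rng.normal(size=8)
print(mmr_rerank(query, docs, k=3))  # indices of 3 diverse, relevant docs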

System Architecture

Core Components

Document Processing

Splits raw documents into manageable text chunks with context and summaries.

Index Construction

Builds Vector, TF-IDF, and Knowledge Graph indexes for efficient retrieval.

Retrieval Module

Combines multiple retrieval methods (vector, keyword, graph) to find relevant documents.

Query Expansion

Generates new sub-queries based on retrieved information using LLMs.

Deep RAG Orchestrator

Coordinates the iterative retrieval process and synthesizes the final answer.

Workflow

  1. User Query Input
  2. Initial Retrieval (Hybrid)
  3. Generate Sub-queries (LLM)
  4. Iterative Retrieval & Expansion
  5. Re-rank & Synthesize
  6. Generate Final Answer (LLM)

Project Directory Structure

MetaSearch/
├── config/           # Configuration files (YAML)
├── data/             # Data directory (raw, processed, indexes)
│   ├── raw/
│   ├── processed/
│   └── indexes/
├── deepsearch/       # Core library code
│   ├── indexing/     # Indexing logic (vector, tfidf, graph)
│   ├── llm/          # LLM interface wrappers
│   ├── preprocessing/ # Document parsing and chunking
│   ├── rag/          # RAG pipeline implementation (standard, deep)
│   ├── retrieval/    # Retrieval strategies and re-ranking
│   └── utils/        # Utility functions
├── scripts/          # Helper scripts (downloading, processing)
├── app.py            # Main application entry point
├── requirements.txt  # Project dependencies
└── README.md         # Project documentation

Quick Start Guide

Install the project dependencies listed in requirements.txt, then launch the main entry point with app.py --interactive to try the system end to end. The Suggested Learning Path below walks through the codebase step by step.

Technical Deep Dive

Document Processing & Chunking

Raw documents are parsed and split into overlapping chunks. Each chunk stores:

  • content: The main text of the chunk.
  • chunk_id: A unique identifier.
  • parent_content: Optional larger context block.
  • abstract: LLM-generated summary (optional).
  • Metadata: Source document, page number, etc.
# Example config (config/config.yaml)
processing:
  chunk_size: 512      # Target size for each chunk
  overlap_size: 64     # Overlap between consecutive chunks
  generate_abstract: true # Whether to generate summaries
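As a concrete illustration, a minimal chunking sketch that mirrors those parameters might look like the following; the function and field names are illustrative, not the repo's actual code.

# Example sketch: overlapping chunking (illustrative, not the repo's actual code)
def split_into_chunks(text, source, chunk_size=512, overlap_size=64):
    """Split text into fixed-size character windows with overlap."""
    step = chunk_size - overlap_size
    chunks = []
    for i, start in enumerate(range(0, max(len(text) - overlap_size, 1), step)):
        piece = text[start:start + chunk_size]
        chunks.append({
            "content": piece,
            "chunk_id": f"{source}-{i}",
            "parent_content": None,   # could hold a larger surrounding window
            "abstract": None,         # filled in later by an LLM if enabled
            "metadata": {"source": source},
        })
    return chunks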

Hybrid Indexing

Multiple indexes capture different aspects of the data:

  1. Vector Index (FAISS): Uses embeddings (e.g., BCE-Embedding) for semantic similarity search.
  2. TF-IDF Index: Classic keyword-based retrieval, good for specific terms.
  3. Knowledge Graph (Optional): Extracts entities and relationships for structured queries.

The retrieval module fuses results from enabled indexes.
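One common way to fuse ranked result lists is reciprocal rank fusion (RRF); the sketch below is illustrative and not necessarily the scheme implemented in deepsearch/retrieval/, and the constant k=60 is just a conventional default.

# Example sketch: reciprocal rank fusion of two ranked result lists
# (illustrative; the repo's actual fusion logic lives in deepsearch/retrieval/)
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse several rankings of chunk IDs; higher fused score = better."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = [101, 103, 105, 102]   # e.g. from the FAISS index
tfidf_hits  = [103, 104, 101, 106]   # e.g. from the TF-IDF index
print(rrf_fuse([vector_hits, tfidf_hits]))  # 103 and 101 rise to the top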

Query Expansion Mechanism

Expands search scope iteratively (a code sketch follows the list):

  1. The LLM extracts key search terms and sub-questions from the retrieved results.
  2. Each candidate sub-query is scored for relevance against the original query.
  3. Candidates are pooled across iterations.
  4. The top-k candidates are selected, balancing relevance against their potential to surface new information.
  5. The selected sub-queries become the inputs for the next retrieval iteration.
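A toy sketch of steps 2–4, using TF-IDF cosine similarity as a stand-in for the real relevance scoring (and omitting the new-information criterion); all names here are illustrative.

# Example sketch: scoring and selecting sub-queries
# (TF-IDF cosine is a stand-in for the project's actual relevance scoring)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_subqueries(original_query, candidates, top_k=3):
    """Rank LLM-proposed sub-queries by similarity to the original query."""
    texts = [original_query] + candidates
    tfidf = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    ranked = sorted(zip(candidates, sims), key=lambda p: p[1], reverse=True)
    return [query for query, _ in ranked[:top_k]]

subs = ["Ming Grand Secretaries", "Evolution of Ming Cabinet", "Tang poetry"]
print(select_subqueries("Ming Dynasty Cabinet System", subs, top_k=2))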

Deep Iterative Retrieval Loop

The core engine operates in a loop (a skeletal sketch follows the config example below):

  1. Start with the initial user query.
  2. Perform standard RAG (retrieve, re-rank, generate response snippet).
  3. Calculate Information Growth Rate (IGR): `len(new_chunk_ids) / len(existing_chunk_ids)`.
  4. If IGR < threshold or max iterations reached, stop.
  5. Else, use Query Expansion to generate new sub-queries for the next loop.
  6. Finally, synthesize all gathered knowledge into a final, comprehensive answer.
# Example config (config/config.yaml)
deepsearch:
  max_iterations: 3          # Max number of retrieval loops
  growth_rate_threshold: 0.1 # Stop if less than 10% new info found
  extend_query_num: 3        # Number of sub-queries per iteration
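Putting these pieces together, a skeletal version of the loop might look like this; retrieve, generate, expand_queries, and synthesize are placeholders for the project's actual components.

# Example sketch: the deep iterative retrieval loop
# (retrieve/generate/expand_queries/synthesize are placeholder callables)
def deep_search(query, retrieve, generate, expand_queries, synthesize,
                max_iterations=3, growth_rate_threshold=0.1, extend_query_num=3):
    knowledge, seen_ids = [], set()
    queries = [query]
    for _ in range(max_iterations):
        new_ids = set()
        for q in queries:
            chunks = retrieve(q)                       # hybrid retrieval + re-rank
            new_ids |= {c["chunk_id"] for c in chunks} - seen_ids
            knowledge.append(generate(q, chunks))      # per-query response snippet
        igr = len(new_ids) / max(len(seen_ids), 1)     # Information Growth Rate
        seen_ids |= new_ids
        if igr < growth_rate_threshold:                # too little new info: stop
            break
        queries = expand_queries(query, knowledge)[:extend_query_num]
    return synthesize(query, knowledge)                # final comprehensive answer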

Example Workflow: "Ming Dynasty Cabinet System"

Iteration 1

  1. Input Query: "Ming Dynasty Cabinet System"
    • Initial knowledge: Empty
    • Existing chunk IDs: Empty set
  2. Standard RAG Execution:
    • Retrieve relevant chunks (e.g., IDs: {101, 102, 103, 104, 105}).
    • Generate initial response snippet: "The Ming Cabinet originated during the Yongle era..."
    • Add snippet to knowledge base.
    • Update existing IDs: {101, 102, 103, 104, 105}
  3. Calculate IGR: the existing-ID set is empty, so the denominator is floored at 1, giving 5 (new) / 1 = 5.0. Since 5.0 > 0.1 (threshold), continue.
  4. Expand Query:
    • LLM analyzes snippet, suggests sub-queries: ["Grand Secretaries", "Cabinet powers", "Zhang Juzheng reforms", ...].
    • Select top 3 relevant sub-queries for next iteration: e.g., ["Ming Grand Secretaries", "Evolution of Ming Cabinet", "Cabinet vs Imperial Power"].

Iteration 2

  1. Input Queries: ["Ming Grand Secretaries", "Evolution of Ming Cabinet", "Cabinet vs Imperial Power"]
  2. Standard RAG Execution (for each sub-query):
    • Retrieve chunks for each query (e.g., new IDs discovered: {201, 202, 203, 204}).
    • Generate response snippets for each sub-query.
    • Add snippets to knowledge base.
    • Update existing IDs: {101, 102, ..., 105, 201, ..., 204} (Total: 9)
  3. Calculate IGR: 4 (new) / 5 (existing) = 0.8. Since 0.8 > 0.1, potentially continue (depending on max iterations).
  4. Expand Query (if continuing): Generate and select next set of sub-queries.

Final Answer Synthesis

  1. Aggregate & Format Knowledge:
    • Collect all generated response snippets from all iterations.
    • Re-rank all gathered chunks/snippets based on relevance to the *original* query using a re-ranking model.
  2. Generate Comprehensive Answer:
    • Construct a final prompt including the original query and the ranked, aggregated knowledge.
    • Use the LLM to generate a coherent, structured final answer that integrates the diverse information gathered (see the prompt sketch below).
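A toy sketch of the prompt assembly; the template wording is illustrative, not the repo's actual prompt.

# Example sketch: assembling the synthesis prompt (template is illustrative)
def build_final_prompt(original_query, ranked_snippets):
    """Combine the original query with ranked knowledge into one LLM prompt."""
    context = "\n\n".join(
        f"[{i + 1}] {snippet}" for i, snippet in enumerate(ranked_snippets)
    )
    return (
        "Answer the question using only the numbered context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {original_query}\n"
        "Write a coherent, well-structured answer that integrates the context."
    )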

Suggested Learning Path

Beginner Path

  1. Run the system (app.py --interactive) to see it in action.
  2. Read app.py: Understand initialization and the main RAG call.
  3. Explore deepsearch/preprocessing/: How are documents loaded and chunked?
  4. Study deepsearch/indexing/vector_index.py: Basic vector index creation and search.
  5. Look at deepsearch/llm/: How does the code interface with LLMs (local or API)?
  6. Examine deepsearch/rag/standard_rag.py: The basic retrieve-then-generate pipeline.

Advanced Path

  1. Dive into deepsearch/rag/deep_rag.py: Understand the iterative loop, IGR, and query expansion logic.
  2. Study deepsearch/retrieval/: Explore hybrid retrieval and MMR re-ranking.
  3. Compare different index implementations in deepsearch/indexing/.
  4. Analyze the query expansion implementation in deepsearch/rag/query_expansion.py.
  5. Experiment with config/config.yaml parameters (iterations, thresholds, models) and observe changes.
  6. Try adding a new document type parser or a custom retrieval strategy.

Frequently Asked Questions

Join the Community

MetaSearch is an open-source educational project. Contributions of all kinds are welcome!

Found a Bug?

Help improve the project by reporting issues you encounter.

Have an Idea?

Share your innovations and improvements by contributing code.

Like the Project?

Show your support and help others discover it by giving us a star!

Forks and PRs are welcome! ✨