1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
|
# Certificate Vector Search Demo
This is a single-file Python script that demonstrates how to process and search through certificate documents using vector embeddings. It’s not a full-blown project but rather a proof-of-concept for semantic search over certificate datasets. It’s designed to be simple and accurate, though not highly scalable out of the box.
---
## What It Does
1. **Processes Certificate Documents**:
- Takes a dataset of certificates (loaded from a JSON file).
- Extracts text from associated `.txt` files (e.g., reports, security targets, certificates).
- Splits the text into smaller, overlapping chunks (512 tokens max, with 128 tokens of overlap) to handle large documents.
2. **Generates Embeddings**:
- Uses the `sentence-transformers/all-MiniLM-L6-v2` model to convert each text chunk into a vector embedding. It's a small model that can be run on the CPU.
- Stores these embeddings in a SQLite database with vector support (thanks to `sqlite-vec`).
3. **Makes It Searchable**:
- You can query the database with a text string, and it’ll find the most semantically similar chunks of text from the certificates. The query is embedded with the aforementioned `sentence-transformers` model.
- Results include metadata (like certificate name, source type, etc.) and the actual text of the chunk.
4. **Provides a Web Interface**:
- A simple HTTP server lets you interact with the system via a browser.
- Submit a query, and it’ll show you the top results with snippets of the most relevant text.
---
## How It Works
### Dataset Processing
- The dataset is validated to ensure it contains the required files (`dataset.json` and `.txt` files).
- Text is extracted from the `.txt` files, split into chunks, and converted into embeddings.
### Database Setup
The SQLite database has three main tables:
- **`cert_chunks`**: Stores the vector embeddings, along with metadata like the certificate digest and chunk index.
- **`metadata`**: Stores certificate-level info (name, manufacturer, validity dates, etc.).
- **`chunk_texts`**: Stores the actual text of each chunk for easy retrieval.
The database is initialized with vector support using the `sqlite-vec` extension, which enables similarity searches.
### Querying
- When you submit a query, the system:
1. Converts the query text into an embedding.
2. Searches the database for chunks with the closest embeddings.
3. Groups chunks by unique document (digest x document type: report, security target) combination.
4. Ranks results using a weighted score (combining the closest match and average similarity).
5. Returns the top `k` results, including metadata and the text of the most relevant chunk.
### Web Interface
- A basic HTTP server runs locally, serving a simple HTML page where you can enter queries and see results.
- The user query is embedded on the back-end and a similarity search is performed.
- Results are displayed with the certificate name, source type, similarity score, and a snippet of the text.
---
## Scalability Notes
This demo uses [brute-force](https://github.com/asg017/sqlite-vec/issues/172#issuecomment-2608754427) linear search for simplicity and accuracy. While this works fine for small to medium datasets, it’s not scalable for large datasets. For better performance, you could integrate:
- **Approximate Nearest Neighbor (ANN) Search**: Plugins supporting approaches like [FAISS](https://github.com/maylad31/vector_sqlite) or [HNSW](https://github.com/nmslib/hnswlib) can speed up retrieval for large datasets, at the cost of some accuracy.
- **Vectorlite**: A lightweight SQLite [extension](https://github.com/1yefuwang1/vectorlite) for fast vector search, seems most reasonable as an upgrade.
---
## Mixed Search
In the future, it might be desirable to combine semantic search (vector embeddings) with keyword-based search (BM25 or TF-IDF) for hybrid retrieval.
## How to Use It
1. **Install Dependencies**:
```bash
pip install sec_certs sentence-transformers nltk sqlite-vec
```
2. **Run the Script**:
```bash
python vector_search.py --data-path /path/to/dataset --db-path /path/to/database.sqlite --force-rebuild --port 8080
```
- `--data-path`: Path to the dataset directory (should contain `dataset.json` and the `certs` folder).
- `--db-path`: Path to the SQLite database file.
- `--force-rebuild`: Rebuild the database from scratch (optional).
- `--port`: Port to run the HTTP server on (default is 8000).
3. **Search**:
- Open your browser and go to `http://localhost:8000`.
- Enter a query, and see the results!
---
## Code Overview
The script is a single Python file with the following key components:
- **`CertProcessor`**: Handles dataset processing, embedding generation, and database storage.
- **`query_similar_chunks`**: Queries the database for similar chunks and returns results with metadata.
- **`Server`**: A simple HTTP server that serves the web interface and handles search queries.
---
## Dependencies
- `sentence-transformers`: For generating text embeddings.
- `nltk`: For tokenizing text into words.
- `sqlite-vec`: For enabling vector operations in SQLite.
- `sec_certs`: For loading and processing the certificate dataset.
|