# Copilot Instructions for sec-certs

## Repository Overview

**sec-certs** is a Python data scraping and analysis tool for security certificates from Common Criteria (CC) and FIPS 140-2/3 frameworks. The tool processes certification artifacts (PDFs, HTML), extracts data, matches to CVEs/CPEs, and provides datasets for security research.

### Tech Stack

- **Language**: Python 3.10+ (tested on 3.10, 3.11, 3.12)
- **Size**: ~75 Python source files (~13.5k LOC), ~36 test files
- **Package Management**: uv with pinned requirements in `uv.lock`
- **Key Dependencies**: BeautifulSoup4, pandas, spacy, pdftotext (requires Poppler), pikepdf, pytesseract, scikit-learn, matplotlib, networkx, pydantic
- **Build System**: setuptools with setuptools-scm for versioning
- **Testing**: pytest with custom markers (`slow`, `remote`)
- **Linting**: Ruff (formatter + linter) and MyPy (type checking)
- **Documentation**: Sphinx with myst-nb, hosted at sec-certs.org
- **Distribution**: PyPI package and DockerHub image

## Critical Setup Requirements

### System Dependencies (REQUIRED)

**ALWAYS install these system dependencies before pip packages. Code WILL fail without them:**

- **Poppler** (≥20.x): Required by pdftotext library. Older 0.x versions WILL fail.
- **Tesseract**: Required for OCR of malformed PDFs (with English, French, and German data).
- **Java**: Required to parse tables in FIPS PDF documents. Must be in PATH. Used by `tabula-java` via `tabula-py`.

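
The presence of these binaries can also be checked from Python, for example at the top of a setup script. The following is a minimal sketch, not part of sec-certs: the names `REQUIRED_TOOLS` and `missing_tools` are hypothetical, and it assumes the executables are called `pdftotext`, `tesseract`, and `java`.

```python
import shutil

# Assumed executable names; pdftotext is shipped with Poppler.
REQUIRED_TOOLS = ("pdftotext", "tesseract", "java")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system dependencies: " + ", ".join(missing))
    else:
        print("All required binaries were found on PATH.")
```

This only confirms that the binaries exist on PATH. It does not verify versions, so the Poppler ≥20.x requirement still has to be checked separately.
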

Check the installation via:

```bash
pdftotext -v
tesseract --version
java -version
```

#### Ubuntu/Debian Installation

For Ubuntu/Debian systems, run:

```bash
sudo apt-get update
sudo apt-get install -y \
  build-essential \
  libpoppler-cpp-dev \
  pkg-config \
  python3-dev \
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu tesseract-ocr-fra \
  default-jdk
```

### Python Environment Setup

**The version file `src/sec_certs/_version.py` is auto-generated by setuptools-scm and must NOT be committed.** If missing during development, create a temporary version: `echo '__version__ = "dev"' > src/sec_certs/_version.py`

**Development install (for testing and development):**

```bash
# Create a virtual environment
uv venv

# Install all dependencies (including dev ones) and the project in editable mode
uv sync --dev

# ALWAYS download the spacy language model after install
uv run spacy download en_core_web_sm

# Optionally, you can activate the virtual environment and avoid all the "uv run" prefixes
source .venv/bin/activate
```

Verify the installation (sec-certs and spacy language model) by importing the package:

```python
import sec_certs._version
print(sec_certs._version.__version__)

import spacy
print(spacy.load("en_core_web_sm"))
```

## Build, Test, and Validation

### Running Tests

**Basic test run (excludes remote/flaky tests):**

```bash
uv run pytest tests -m "not remote" -v
```

**Test with coverage (as in CI):**

```bash
uv run pytest --cov=sec_certs -m "not remote" --junitxml=junit.xml tests
```

**Test markers:**

- `slow`: Tests that take significant time (run with `-m "slow"` or exclude with `-m "not slow"`)
- `remote`: Tests requiring remote resources (flaky, run weekly via cron workflow)
- `xfail`: Known flaky tests due to external server errors

**Typical test runtime**: Fast tests complete in seconds. Full suite will take minutes.

### Linting and Code Quality

**ALWAYS run these before committing.
CI will fail if they don't pass.**

**Using pre-commit (recommended):**

```bash
uv run pre-commit install
uv run pre-commit run --all-files
```

**Manual linting:**

```bash
# Ruff linting (checks code style, imports, complexity)
uv run ruff check .

# Ruff with auto-fix
uv run ruff check . --fix

# Ruff formatting check
uv run ruff format --check .

# Ruff auto-format
uv run ruff format .

# MyPy type checking
uv run mypy .
```

**Linting configuration**: See `pyproject.toml` for Ruff and MyPy settings. Target Python 3.10. Line length: 120. Notebooks (`*.ipynb`) are excluded from linting.

### Building Documentation

```bash
cd docs
uv run make html
```

Output goes to `docs/_build/html/`. Documentation uses Sphinx with myst-nb for Markdown and Jupyter notebooks.

### Building for Distribution

```bash
uv build
```

This creates source and wheel distributions in `dist/`.

## Project Architecture

### Directory Structure

```
sec-certs/
├── src/sec_certs/          # Main package source
│   ├── dataset/            # Dataset classes (CCDataset, FIPSDataset, etc.)
│   ├── sample/             # Certificate classes (CCCertificate, FIPSCertificate)
│   ├── heuristics/         # Heuristic extractors and analyzers
│   ├── model/              # ML models for matching and NLP
│   ├── utils/              # Utility functions
│   ├── serialization/      # JSON schemas and serialization
│   ├── data/               # Embedded data (annotations, CPEs, etc.)
│   ├── cli.py              # Click-based CLI entrypoint
│   ├── configuration.py    # Pydantic config with env var support
│   ├── rules.yaml          # Regular expressions for cert parsing
│   └── constants.py        # Constants and enums
├── tests/                  # Test suite
│   ├── cc/                 # Common Criteria tests
│   ├── fips/               # FIPS 140 tests
│   ├── data/               # Test fixtures and data
│   └── conftest.py         # Pytest configuration and fixtures
├── docs/                   # Sphinx documentation source
├── notebooks/              # Jupyter notebooks (examples, analysis)
├── pyproject.toml          # Package metadata, build config, tool settings
├── .pre-commit-config.yaml # Pre-commit hooks configuration
├── Dockerfile              # Docker image for reproducible environment
└── uv.lock                 # uv lockfile with pinned dependencies
```

### Key Files and Configurations

- **pyproject.toml**: Package definition, dependencies, Ruff/MyPy/pytest config. Single source of truth for dependencies (unpinned).
- **src/sec_certs/rules.yaml**: Regular expressions for extracting data from certificates. Add patterns here.
- **src/sec_certs/configuration.py**: Runtime configuration using pydantic-settings. Reads from env vars with `SECCERTS_` prefix.
- **.pre-commit-config.yaml**: Defines pre-commit hooks (ruff, mypy). Versions should match pyproject.toml.

### Main Components

1. **Datasets** (`src/sec_certs/dataset/`):
   - `CCDataset`, `FIPSDataset`, `ProtectionProfileDataset`: Main dataset classes
   - `CPEDataset`, `CVEDataset`: Auxiliary datasets from NVD
   - Load from JSON, web snapshots, or build from scratch

2. **Certificates** (`src/sec_certs/sample/`):
   - `CCCertificate`, `FIPSCertificate`: Individual certificate representations
   - Store metadata, extracted text, heuristics, references, CVEs

3. **CLI** (`src/sec_certs/cli.py`):
   - Entrypoint: `sec-certs {cc|fips|pp} {all|build|download|convert|analyze} [options]`
   - Actions: `all` (full pipeline), `download` (fetch certs), `convert` (PDFs to text), `analyze` (extract features)

4. 
   **Heuristics** (`src/sec_certs/heuristics/`):
   - Extract certification metadata (dates, vendors, products, security levels)
   - CVE/CPE matching and vulnerability analysis

## CI/CD Pipelines

### GitHub Workflows (`.github/workflows/`)

1. **tests.yml** (runs on every push):
   - Tests on Python 3.10, 3.11, 3.12 (Ubuntu 22.04)
   - Installs system deps, test_requirements.txt, spacy model
   - Runs: `pytest --cov=sec_certs -m "not remote" tests`
   - Uploads coverage to Codecov

2. **pre-commit.yml** (runs on every push):
   - Runs pre-commit hooks (Ruff, MyPy) on all files
   - Fails if linting issues are found

3. **docs.yml** (runs on push, release):
   - Builds Sphinx docs with `cd docs && make html`
   - Uploads to sec-certs.org on main branch or tag push

4. **release.yml** (triggered by GitHub release):
   - Builds package with `python -m build`
   - Publishes to PyPI
   - Builds multi-arch Docker image (amd64, arm64) and pushes to DockerHub

5. **cron.yml** (weekly, Wednesday midnight):
   - Runs remote/flaky tests with `-m "remote"`
   - Continues on error (expected to be flaky)

## Common Workflows

### Adding a New Feature

1. Create branch from `main` (only stable branch for PRs)
2. Make minimal code changes
3. Add tests in appropriate `tests/` subdirectory
4. Run linters: `uv run pre-commit run --all-files` or `uv run ruff check . && uv run mypy .`
5. Run tests: `uv run pytest tests -m "not remote" -v`
6. Update docs if public API changed
7. Commit and push (CI will validate)

### Updating Dependencies

```bash
# Edit pyproject.toml to add/update dependency
# Regenerate the pinned lockfile
uv lock
# Commit both pyproject.toml and uv.lock changes
```

### Working with Datasets

**Loading pre-processed datasets (recommended):**

```python
from sec_certs.dataset.cc import CCDataset

dset = CCDataset.from_web()  # Downloads from sec-certs.org
```

**Processing from scratch (requires full setup, takes hours, DO NOT DO THIS):**

```bash
uv run sec-certs cc all -o ./dataset
```

## Common Pitfalls and Gotchas

1. 
   **Missing `_version.py`**: Auto-generated by setuptools-scm. Create manually for dev: `echo '__version__ = "dev"' > src/sec_certs/_version.py`
2. **Poppler version**: Ensure Poppler ≥20.x. Version 0.x will cause pdftotext failures.
3. **Spacy model**: ALWAYS run `python -m spacy download en_core_web_sm` after install. Code will fail without it.
4. **Java in PATH**: Required for FIPS table parsing. Verify with `java -version`.
5. **Test markers**: Exclude flaky remote tests with `-m "not remote"` for stable local testing.
6. **Default dataset location**: CLI creates `./dataset` by default. Add to .gitignore if working locally.
7. **Pre-commit hook behavior**: Pre-commit hooks warn about issues but don't auto-fix. Run `ruff check . --fix` to apply fixes.
8. **Long-running commands**: Full dataset processing (`sec-certs cc all`) takes hours. Use pre-processed datasets from web for analysis.

## Additional Resources

- **README.md**: Quick start, installation, basic usage examples
- **CONTRIBUTING.md**: Detailed contribution guidelines, release process, dependency management
- **docs/installation.md**: System dependencies, multiple install methods
- **docs/quickstart.md**: Quick usage examples for CC and FIPS datasets
- **docs/user_guide.md**: Advanced topics (NVD datasets, reference context inference)
- **notebooks/examples/**: Jupyter notebooks demonstrating dataset analysis
- **Website**: https://sec-certs.org (dataset downloads, interactive docs)
- **Documentation**: https://sec-certs.org/docs

## Trust These Instructions

These instructions have been validated by examining repository structure, workflows, documentation, and testing commands. When working on this repository:

1. **Trust these build/test commands** - they are verified to work
2. **Follow the setup order** (system deps → python deps and install (uv sync) → spacy model)
3. **Only search/explore if** these instructions are incomplete or incorrect
4. 
   **Refer to these instructions first** before trying alternative approaches

If you encounter issues not covered here, check CONTRIBUTING.md and docs/ before extensive exploration.