{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true }, "pycharm": { "name": "#%% md\n" } }, "source": [ "# CC certificate id evaluation\n", "This notebook can be used to evaluate our heuristics for certificate id assignment\n", "and canonicalization.\n", "\n", "It looks at several aspects & issues:\n", "\n", "1. Certificates with no id assigned.\n", "2. Duplicate certificate id assignments (when two certificates get the same ID assigned).\n", "3. Certificates that have the same certification report document (an issue of the input data that we get\n", " that explains some of the duplicate certificate id assignments).\n", "4. Compares a random sample of certificates with assigned ground truth ID.\n", "5. Compares a random sample of certificates with assigned ground truth references and context." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "from sec_certs.dataset import CCDataset\n", "from sec_certs.cert_rules import cc_rules\n", "import csv\n", "import pandas as pd\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_schemes = len(cc_rules[\"cc_cert_id\"])\n", "num_scheme_rules = sum(len(rules) for rules in cc_rules[\"cc_cert_id\"].values())\n", "print(f\"\\\\newcommand{{\\\\numccschemes}}{{{num_schemes}}}\")\n", "print(f\"\\\\newcommand{{\\\\numccschemeidrules}}{{{num_scheme_rules}}}\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "dset = CCDataset.from_web()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_ids = len(list(filter(lambda cert: cert.heuristics.cert_id, dset)))\n", "print(f\"\\\\newcommand{{\\\\numcccerts}}{{{len(dset)}}}\")\n", "print(f\"\\\\newcommand{{\\\\numccids}}{{{num_ids}}}\")\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 1. Certificates with no id\n", "\n", "Here we report the number of certificates in our dataset that we have no certificate ID for." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "missing_id_dgsts = set()\n", "missing_id = []\n", "for cert in dset:\n", " if not cert.heuristics.cert_id:\n", " missing_id_dgsts.add(cert.dgst)\n", " missing_id.append((cert.dgst, cert.scheme))\n", "pd.DataFrame(missing_id, columns=[\"id\", \"scheme\"])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check manually evaluated missing\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "missing_manual = pd.read_csv(\"../../data/cert_id_eval/missing_ids.csv\")\n", "print(set(missing_manual.id) == missing_id_dgsts)\n", "print(set(missing_manual.id).difference(missing_id_dgsts))\n", "print(set(missing_id_dgsts).difference(missing_manual.id))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "num_missing_manual = missing_manual.shape[0]\n", "num_missing_manual_fixable = missing_manual.cert_id.count()\n", "num_missing_manual_unfixable = num_missing_manual - num_missing_manual_fixable\n", "print(f\"\\\\newcommand{{\\\\numccmissingid}}{{{num_missing_manual}}}\")\n", "print(f\"\\\\newcommand{{\\\\numccmissingidfixable}}{{{num_missing_manual_fixable}}}\")\n", "print(f\"\\\\newcommand{{\\\\numccmissingidunfixable}}{{{num_missing_manual_unfixable}}}\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "missing_manual.loc[missing_manual.cert_id.isnull()].reason.value_counts()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "missing_manual.loc[missing_manual.cert_id.notnull()].reason.value_counts()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Duplicate certificate id assignment\n", "\n", "Here we report the number of certificates in our dataset that have a duplicate certiticate\n", "ID assigned." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "id_mapping = {}\n", "for cert in dset:\n", " if cert.heuristics.cert_id is not None:\n", " c_list = id_mapping.setdefault(cert.heuristics.cert_id, [])\n", " c_list.append(cert.dgst)\n", "\n", "duplicate_id_dgsts = set()\n", "for idd, entries in id_mapping.items():\n", " if len(entries) > 1 and idd:\n", " print(idd, entries)\n", " duplicate_id_dgsts.update(entries)\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 3. Duplicate report documents\n", "\n", "Some certificates have erroneously uploaded certificate reports, here we check their\n", "hashes and report such duplicates in the input data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "duplicate_docs = {}\n", "\n", "for cert in dset:\n", " if cert.state.report_pdf_hash is not None:\n", " r_list = duplicate_docs.setdefault(cert.state.report_pdf_hash, [])\n", " r_list.append(cert.dgst)\n", "\n", "duplicate_doc_dgsts = set()\n", "for hash, entries in duplicate_docs.items():\n", " if len(entries) > 1:\n", " print(hash, entries)\n", " for entry in entries:\n", " duplicate_doc_dgsts.add(entry)\n", "\n", "duplicate_ids_due_doc = duplicate_doc_dgsts.intersection(duplicate_id_dgsts)\n", "duplicate_ids_issue = duplicate_id_dgsts.difference(duplicate_doc_dgsts)\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "The following prints the amount of certificate id duplicates that are not due to input data (and are really our problem)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "for id in duplicate_ids_issue:\n", " print(id, dset[id].heuristics.cert_id)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"\\\\newcommand{{\\\\numccduplicateid}}{{{len(duplicate_id_dgsts)}}}\")\n", "print(f\"\\\\newcommand{{\\\\numccduplicateidcolission}}{{{len(duplicate_ids_due_doc)}}}\")\n", "print(f\"\\\\newcommand{{\\\\numccduplicateidissue}}{{{len(duplicate_ids_issue)}}}\")\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Check manually evaluated duplicates" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "duplicate_manual = pd.read_csv(\"../../data/cert_id_eval/duplicate_ids.csv\")\n", "print(set(duplicate_manual.id) == duplicate_id_dgsts)\n", "print(set(duplicate_manual.id).difference(duplicate_id_dgsts))\n", "print(set(duplicate_id_dgsts).difference(duplicate_manual.id))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "duplicate_manual[duplicate_manual.result == \"tp\"].reason.value_counts()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "duplicate_manual[duplicate_manual.result == \"fp\"].reason.value_counts()\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "The following cell lists those duplicates that were fixed by changes since manual analysis." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## 4. Manually assigned ground truth comparison (cert_id)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "correct = set()\n", "possible = set()\n", "impossible = set()\n", "with open(\"../../data/cert_id_eval/truth.csv\", \"r\") as f:\n", " reader = csv.DictReader(f)\n", " for line in reader:\n", " if line[\"id\"] not in dset.certs:\n", " continue\n", " else:\n", " cert = dset[line[\"id\"]] # and (line[\"cert_id\"] or cert.heuristics.cert_id is not None)\n", " if line[\"cert_id\"] != cert.heuristics.cert_id:\n", " print(line[\"id\"], line[\"cert_id\"], cert.heuristics.cert_id, line[\"source\"], line[\"possible\"])\n", " if line[\"possible\"] == \"y\":\n", " possible.add(line[\"id\"])\n", " else:\n", " impossible.add(line[\"id\"])\n", " else:\n", " correct.add(line[\"id\"])\n", "print(len(correct), len(possible), len(impossible))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "correct = set()\n", "incorrect = set()\n", "with open(\"../../data/cert_id_eval/random.csv\", \"r\") as f:\n", " reader = csv.DictReader(f)\n", " for line in reader:\n", " cert = dset[line[\"id\"]]\n", " if line[\"cert_id\"] != cert.heuristics.cert_id:\n", " print(line[\"id\"], line[\"cert_id\"], cert.heuristics.cert_id)\n", " incorrect.add(line[\"id\"])\n", " else:\n", " correct.add(line[\"id\"])\n", "print(len(correct), len(incorrect))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "print(f\"\\\\newcommand{{\\\\numccideval}}{{{len(correct) + len(incorrect)}}}\")\n", "print(f\"\\\\newcommand{{\\\\numccidevalcorrect}}{{{len(correct)}}}\")\n", "print(f\"\\\\newcommand{{\\\\numccidevalincorrect}}{{{len(incorrect)}}}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Manually assigned ground truth comparison (references)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "manual_references = pd.read_csv(\"../../data/cert_id_eval/random_references.csv\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"The referenced cert is a...\")\n", "print(manual_references[manual_references.reason != \"self\"].reason.value_counts())\n", "print(\"... in the current cert.\")\n", "print(\"Total refs:\", sum(manual_references.reason != \"self\"))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"\\\\newcommand{{\\\\numCcRefEval}}{{{manual_references.id.nunique()}}}\")\n", "print(f\"\\\\newcommand{{\\\\numCcRefEvalNotSelf}}{{{sum(manual_references.reason != 'self')}}}\")\n", "print(f\"\\\\newcommand{{\\\\numCcRefEvalComponent}}{{{sum(manual_references.reason == 'component used')}}}\")\n", "print(f\"\\\\newcommand{{\\\\numCcRefEvalRecertification}}{{{sum(manual_references.reason == 'basis of recertification')}}}\")\n", "print(f\"\\\\newcommand{{\\\\numCcRefEvalUsedEval}}{{{sum(manual_references.reason == 'basis of eval')}}}\")\n", "print(f\"\\\\newcommand{{\\\\numCcRefEvalIsUsed}}{{{sum(manual_references.reason == 'basis for')}}}\")\n", "print(f\"\\\\newcommand{{\\\\numCcRefEvalPrevVersion}}{{{sum(manual_references.reason == 'previous version')}}}\")\n", "\n", "print(f\"\\\\newcommand{{\\\\numCcRefEvalInReport}}{{{sum((manual_references.location == 'report') & (manual_references.reason != 'self'))}}}\")\n", "print(f\"\\\\newcommand{{\\\\numCcRefEvalInTarget}}{{{sum((manual_references.location == 'target') & (manual_references.reason != 'self'))}}}\")\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" }, "vscode": { "interpreter": { "hash": "a5b8c5b127d2cfe5bc3a1c933e197485eb9eba25154c3661362401503b4ef9d4" } } }, "nbformat": 4, "nbformat_minor": 4 }