| author | Adam Janovsky | 2023-10-20 10:14:51 +0200 |
|---|---|---|
| committer | Adam Janovsky | 2023-10-20 10:14:51 +0200 |
| commit | 36d48ebf09e41d02502c18682ee23e60ef9b2eec (patch) | |
| tree | bfa17883a57f24d3a60bcb0258b67ed76de326e2 /src/sec_certs/data/reference_annotations | |
| parent | 4ae9bec3f7ce7c8a4666d16dc23601a4ef000aba (diff) | |
improve write-up of reference annotation methodology
Diffstat (limited to 'src/sec_certs/data/reference_annotations')
| -rw-r--r-- | src/sec_certs/data/reference_annotations/readme.md | 61 |
1 files changed, 30 insertions, 31 deletions
diff --git a/src/sec_certs/data/reference_annotations/readme.md b/src/sec_certs/data/reference_annotations/readme.md
index b0d978e6..b53697f8 100644
--- a/src/sec_certs/data/reference_annotations/readme.md
+++ b/src/sec_certs/data/reference_annotations/readme.md
@@ -1,17 +1,39 @@
 # Reference annotations

-This folder contains data and the methodology (presented below) related to learning the reference annotations.
+This folder contains data related to learning the reference annotations. This document also describeds the utilized methodology.

 - The folder [split](split) contains split of the CC Dataset to `train/valid/test` splits for learning.
-- The csv file [manually_annotated_references.csv](./manually_annotated_references.csv) contains manually acquired labels to references obtained with the methodology outlined below.
-- The folder `adam` contains manual annotations created by Adam
-- The folder `jano` contains manual annotations created by Jano
-- The folder `conflicts` contains conflicting annotations between Adam and Jano, as weel as their resolution
-- The contents of the `final` folder can thus be obtained by taking annotations either from `adam` or `jano` folder and masking them by `resolution_label` from `conflicts` folder.
+- The csv file [outdated_manually_annotated_references.csv](./outdated_manually_annotated_references.csv) contains manually acquired labels to references obtained with **old methodology** for the sake of paper 1 submission.
+- The folder [adam](adam/) contains manual annotations created by Adam
+- The folder [jano](jano/) contains manual annotations created by Jano
+- The folder [conflicts](conflicts/) contains conflicting annotations between Adam and Jano, as weel as their resolution
+- The contents of the [final](final/) folder can thus be obtained by taking annotations either from `adam` or `jano` folder and masking them by `resolution_label` from `conflicts` folder.
+
+## Reference classification methodology
+
+### Data splits and manual annotations
+
+1. Two co-authors independently inspect identical set of 100 random certificates and capture the observed relations into reference taxonomy to form the annotation guidelines. See [reference taxonomy](#reference-taxonomy) below.
+2. We split all certificates for which we register a direct outgoing reference in either security target or certification report into `train/valid/test` splits in `30/20/50` fashion (see [split](split/)).
+3. We sample 100 train, 100 valid, 200 test pairs of reference instances (represented by `(dgst, canonical_reference_keyword)` pairs) for manual annotations.
+4. Two co-authors independently assign each of these instances with a single label from the reference taxonomy.
+5. We measure the inter-annotator agreement with Cohen's Kappa and percentage, see [inter-annotator agreement](#inter-annotator-agreement).
+6. We resolve conflicts in the annotations in a meeting held by the co-authors. We use this consensual annotations for training and evaluation described below.
+
+### Supervised learning of the annotations
+
+1. For each pair `(dgst, referenced_cert_id)`, we recover the relevant text segments both from certification report and security target that mention the `referenced_cert_id`.
+2. We apply text processing on the segments (e.g., unify re-certification vs. recertification, etc.)
+3. We train a baseline model based on TF-IDF (or count vectorization in general), random forest and a soft-voting layer on top of that.
+    - Random forest classifies single segment with a probability of a given label.
+    - Soft voting compares probabilities of the given labels on all segments, takes their square and chooses the maximum.
+4. We train a sentence transformer with the same soft-voting layer on top of that.
+5. Finetune hyperparameters.
+6. We evalute the results on the test set using weighted F1 score.

 ### Reference taxonomy

-After manually inspecting random certificates, we have identified the following reference meanings:
+After manually inspecting ~100 random certificates, we have identified the following reference meanings:

 - **Component used**: The referenced certificate is a component used in the examined certificate (e.g., IC used by a smartcard). Some evaluation results were likely shared/re-used.
 - **Component shared**: The referenced certificate shares some components with the examined certificate. Some evaluation results were likely shared/re-used.
@@ -27,30 +49,7 @@ These can be further merged into the following super-categories:
 - **Previous version**: `previous_version` and `recertification`
 - **None**: `None` or `irrelevant`

-### Reference classification methodology
-
-**Data splits and manual annotations**:
-
-1. Two authors inspect random certificates (~100) and capture the observed relations into reference taxonomy
-2. Split all certificates for which we register a direct outgoing reference in either security target or certification report into `train/valid/test` splits in `30/20/50` fashion (see [split](split/)).
-3. Sample 100 train, 100 valid, 200 test pairs of `(dgst, canonical_reference_keyword)` for manual annotations.
-4. Two co-authors independently assign each of these pairs with a single label from the reference taxonomy.
-5. Measure the inter-annotator agreement with Cohen's Kappa.
-6. Resolve conflicts in the annotations in a meeting held by the co-authors. Use this consensual annotations for training and evaluation described below.
-
-**Learning the annotations**:
-
-1. For each pair `(dgst, referenced_cert_id)`, recover the relevant segments both from certification report and security target that mention the `referenced_cert_id`
-2. Apply text processing on the segments (e.g., unify re-certification vs. recertification, etc.)
-3. Train a baseline model based on TF-IDF (or count vectorization in general), random forest, and a soft-voting layer on top of that.
-    - Random forest classifies single segment to a probability of a given label
-    - Soft voting compares probabilities of the given labels on all segments, takes their square and chooses the maximum.
-4. Train a sentence transformer with the same soft-voting layer on top of that.
-5. Finetune hyperparameters.
-6. Evaluate on test set.
-
-
-## Inter-annotator agreement
+### Inter-annotator agreement

 The inter-annotator agreement is measured both with Cohen's Kappa and with percentage. The results are as follows:
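For readers unfamiliar with the two agreement measures named in the readme, the following is a minimal sketch of how they can be computed for two annotators who labelled the same instances. It is not code from the sec-certs repository: the function names `percentage_agreement` and `cohens_kappa` are hypothetical, and the label lists in the usage note are toy data, not the actual annotations.

```python
from collections import Counter


def percentage_agreement(labels_a, labels_b):
    """Fraction of instances on which both annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of instances.")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percentage_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label
    # if each labelled independently according to their own label frequencies.
    p_expected = sum(
        counts_a[label] * counts_b[label] for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if p_expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)
```

With the toy lists `["previous_version", "previous_version", "irrelevant", "irrelevant"]` and `["previous_version", "irrelevant", "irrelevant", "irrelevant"]`, the annotators agree on three of four instances, while chance agreement is 0.5, so Kappa is lower than the raw percentage.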

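The soft-voting step in the methodology above is described only briefly, so the sketch below shows one possible reading of it: square each per-segment label probability, sum the squares per label across all segments of a reference, and pick the label with the highest total. The summation is an assumption, since the readme only says the layer "takes their square and chooses the maximum". The function name `soft_vote` and the label identifier `component_used` are likewise hypothetical and not taken from the repository.

```python
def soft_vote(segment_probas, labels):
    """Aggregate per-segment label probabilities into one label per reference.

    segment_probas: list of probability vectors, one per text segment,
                    each aligned with `labels`.
    Assumed rule: sum squared probabilities per label, return the arg max.
    """
    if not segment_probas:
        raise ValueError("At least one segment is required.")
    scores = [0.0] * len(labels)
    for probas in segment_probas:
        if len(probas) != len(labels):
            raise ValueError("Each probability vector must match the label list.")
        for i, p in enumerate(probas):
            # Squaring gives confident segments more weight than uncertain ones.
            scores[i] += p * p
    best_index = max(range(len(labels)), key=scores.__getitem__)
    return labels[best_index]


# Toy example: one confident segment and one uncertain segment.
labels = ["component_used", "previous_version", "irrelevant"]
segments = [[0.9, 0.05, 0.05], [0.3, 0.4, 0.3]]
prediction = soft_vote(segments, labels)
```

In this toy example the confident first segment dominates the vote, which is the effect squaring is presumably meant to achieve.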