improve write-up of reference annotation methodology

author: Adam Janovsky 2023-10-20 10:14:51 +0200
committer: Adam Janovsky 2023-10-20 10:14:51 +0200
commit: 36d48ebf09e41d02502c18682ee23e60ef9b2eec (patch)
tree: bfa17883a57f24d3a60bcb0258b67ed76de326e2 /src/sec_certs/data/reference_annotations
parent: 4ae9bec3f7ce7c8a4666d16dc23601a4ef000aba (diff)
download: sec-certs-36d48ebf09e41d02502c18682ee23e60ef9b2eec.tar.gz
sec-certs-36d48ebf09e41d02502c18682ee23e60ef9b2eec.tar.zst
sec-certs-36d48ebf09e41d02502c18682ee23e60ef9b2eec.zip
1 files changed, 30 insertions, 31 deletions
diff --git a/src/sec_certs/data/reference_annotations/readme.md b/src/sec_certs/data/reference_annotations/readme.md
index b0d978e6..b53697f8 100644
--- a/src/sec_certs/data/reference_annotations/readme.md
+++ b/src/sec_certs/data/reference_annotations/readme.md
@@ -1,17 +1,39 @@
 # Reference annotations
 
-This folder contains data and the methodology (presented below) related to learning the reference annotations.
+This folder contains data related to learning the reference annotations. This document also describes the methodology used.
 
 - The folder [split](split) contains split of the CC Dataset to `train/valid/test` splits for learning.
-- The csv file [manually_annotated_references.csv](./manually_annotated_references.csv) contains manually acquired labels to references obtained with the methodology outlined below.
-- The folder `adam` contains manual annotations created by Adam
-- The folder `jano` contains manual annotations created by Jano
-- The folder `conflicts` contains conflicting annotations between Adam and Jano, as well as their resolution
-- The contents of the `final` folder can thus be obtained by taking annotations either from `adam` or `jano` folder and masking them by `resolution_label` from `conflicts` folder.
+- The csv file [outdated_manually_annotated_references.csv](./outdated_manually_annotated_references.csv) contains manually acquired reference labels obtained with the **old methodology** for the paper 1 submission.
+- The folder [adam](adam/) contains manual annotations created by Adam
+- The folder [jano](jano/) contains manual annotations created by Jano
+- The folder [conflicts](conflicts/) contains conflicting annotations between Adam and Jano, as well as their resolution.
+- The contents of the [final](final/) folder can thus be obtained by taking annotations from either the `adam` or the `jano` folder and masking them by `resolution_label` from the `conflicts` folder.
+
+## Reference classification methodology
+
+### Data splits and manual annotations
+
+1. Two co-authors independently inspect an identical set of 100 random certificates and capture the observed relations in a reference taxonomy that forms the annotation guidelines. See [reference taxonomy](#reference-taxonomy) below.
+2. We split all certificates for which we register a direct outgoing reference in either the security target or the certification report into `train/valid/test` splits in a `30/20/50` ratio (see [split](split/)).
+3. We sample 100 train, 100 valid, and 200 test reference instances (each represented by a `(dgst, canonical_reference_keyword)` pair) for manual annotation.
+4. Two co-authors independently assign each of these instances a single label from the reference taxonomy.
+5. We measure the inter-annotator agreement with Cohen's Kappa and percentage agreement; see [inter-annotator agreement](#inter-annotator-agreement).
+6. We resolve conflicts in the annotations in a meeting held by the co-authors. We use these consensus annotations for the training and evaluation described below.
+
+### Supervised learning of the annotations
+
+1. For each pair `(dgst, referenced_cert_id)`, we recover the relevant text segments that mention the `referenced_cert_id` from both the certification report and the security target.
+2. We apply text processing to the segments (e.g., unifying spelling variants such as re-certification vs. recertification).
+3. We train a baseline model based on TF-IDF (or count vectorization in general), a random forest, and a soft-voting layer on top of that.
+    - The random forest classifies a single segment, yielding a probability for each label.
+    - Soft voting compares the label probabilities across all segments, squares them, and chooses the maximum.
+4. We train a sentence transformer with the same soft-voting layer on top of that.
+5. We fine-tune the hyperparameters.
+6. We evaluate the results on the test set using the weighted F1 score.
 
 ### Reference taxonomy
 
-After manually inspecting random certificates, we have identified the following reference meanings:
+After manually inspecting ~100 random certificates, we have identified the following reference meanings:
 
 - **Component used**: The referenced certificate is a component used in the examined certificate (e.g., IC used by a smartcard). Some evaluation results were likely shared/re-used.
 - **Component shared**: The referenced certificate shares some components with the examined certificate. Some evaluation results were likely shared/re-used.
@@ -27,30 +49,7 @@ These can be further merged into the following super-categories:
 - **Previous version**: `previous_version` and `recertification`
 - **None**: `None` or `irrelevant`
 
-### Reference classification methodology
-
-**Data splits and manual annotations**:
-
-1. Two authors inspect random certificates (~100) and capture the observed relations into reference taxonomy
-2. Split all certificates for which we register a direct outgoing reference in either security target or certification report into `train/valid/test` splits in `30/20/50` fashion (see [split](split/)).
-3. Sample 100 train, 100 valid, 200 test pairs of `(dgst, canonical_reference_keyword)` for manual annotations.
-4. Two co-authors independently assign each of these pairs with a single label from the reference taxonomy.
-5. Measure the inter-annotator agreement with Cohen's Kappa.
-6. Resolve conflicts in the annotations in a meeting held by the co-authors. Use this consensual annotations for training and evaluation described below.
-
-**Learning the annotations**:
-
-1. For each pair `(dgst, referenced_cert_id)`, recover the relevant segments both from certification report and security target that mention the `referenced_cert_id`
-2. Apply text processing on the segments (e.g., unify re-certification vs. recertification, etc.)
-3. Train a baseline model based on TF-IDF (or count vectorization in general), random forest, and a soft-voting layer on top of that.
-    - Random forest classifies single segment to a probability of a given label
-    - Soft voting compares probabilities of the given labels on all segments, takes their square and chooses the maximum.
-4. Train a sentence transformer with the same soft-voting layer on top of that.
-5. Finetune hyperparameters.
-6. Evaluate on test set.
-
-
-## Inter-annotator agreement
+### Inter-annotator agreement
 
 The inter-annotator agreement is measured both with Cohen's Kappa and with percentage. The results are as follows:
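Note on the soft-voting step in the added readme text (item 3 under "Supervised learning of the annotations"): the readme does not spell out how the squared probabilities are combined, so the following Python sketch is only one possible reading, not the project's implementation. It assumes that the per-segment probabilities are squared, summed per label over all segments, and that the label with the highest total wins. The function name `soft_vote`, the label strings, and the probability values are hypothetical and chosen for illustration only.

```python
def soft_vote(segment_probas, labels):
    """Aggregate per-segment label probabilities into a single label.

    Assumed reading of the readme: square every probability, sum the
    squares per label over all segments, and return the top-scoring label.
    """
    if not segment_probas:
        raise ValueError("need at least one segment")
    scores = [0.0] * len(labels)
    for probas in segment_probas:
        if len(probas) != len(labels):
            raise ValueError("each segment needs one probability per label")
        for i, p in enumerate(probas):
            # Squaring rewards confident segments over lukewarm ones.
            scores[i] += p ** 2
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best]


# Hypothetical labels and probabilities for one reference with three segments.
labels = ["component_used", "previous_version", "irrelevant"]
segment_probas = [
    [0.40, 0.35, 0.25],  # weakly favours component_used
    [0.40, 0.35, 0.25],  # weakly favours component_used
    [0.05, 0.90, 0.05],  # strongly favours previous_version
]
predicted = soft_vote(segment_probas, labels)
print(predicted)
```

In this made-up example two of the three segments weakly favour `component_used`, yet the single confident segment decides the outcome. Under the assumed reading, that is the effect squaring has compared with a plain majority vote over segments.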