# Reference annotations

This folder contains data related to learning the reference annotations. This document also describes the methodology used.

- The folder [split](split) contains the split of the CC dataset into `train/valid/test` splits for learning.
- The CSV file [outdated_manually_annotated_references.csv](./outdated_manually_annotated_references.csv) contains manually acquired reference labels obtained with the **old methodology** for the paper 1 submission.
- The folder [adam](adam/) contains manual annotations created by Adam.
- The folder [jano](jano/) contains manual annotations created by Jano.
- The folder [conflicts](conflicts/) contains conflicting annotations between Adam and Jano, as well as their resolution.
- The contents of the [final](final/) folder can thus be obtained by taking the annotations from either the `adam` or the `jano` folder and masking them with the `resolution_label` from the `conflicts` folder.
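
The masking step in the last bullet can be sketched as follows. This is a minimal illustration, not the code used to build `final`: it assumes each annotation is keyed by a `(dgst, canonical_reference_keyword)` pair, and the function name `build_final` and the sample keys are hypothetical.

```python
def build_final(annotations, resolutions):
    """Overlay resolved labels on one annotator's labels.

    annotations: {(dgst, keyword): label} from either annotator (assumed layout).
    resolutions: {(dgst, keyword): resolution_label} for conflicting instances.
    """
    final = dict(annotations)   # start from one annotator's labels
    final.update(resolutions)   # resolved conflicts take precedence
    return final


# Hypothetical sample data; the keys are placeholders, not real certificates.
adam = {
    ("dgst-1", "CERT-A"): "component_used",
    ("dgst-2", "CERT-B"): "previous_version",
}
conflicts = {("dgst-2", "CERT-B"): "re-evaluation"}
final = build_final(adam, conflicts)
```

Because every disagreement has a resolution, starting from either annotator should give the same result: the two only differ on the instances that the resolution overrides.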

## Reference classification methodology

### Data splits and manual annotations

1. Two co-authors independently inspect an identical set of 100 random certificates and capture the observed relations in a reference taxonomy to form the annotation guidelines. See [reference taxonomy](#reference-taxonomy) below.
2. We split all certificates for which we register a direct outgoing reference in either the security target or the certification report into `train/valid/test` splits in a `30/20/50` ratio (see [split](split/)).
3. We sample 100 train, 100 valid, and 200 test reference instances (represented by `(dgst, canonical_reference_keyword)` pairs) for manual annotation.
4. Two co-authors independently assign each of these instances a single label from the reference taxonomy.
5. We measure the inter-annotator agreement with Cohen's Kappa and percentage agreement, see [inter-annotator agreement](#inter-annotator-agreement).
6. We resolve conflicts in the annotations in a meeting held by the co-authors. We use these consensus annotations for the training and evaluation described below.
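
Steps 2 and 3 can be sketched in plain Python as shown below. This is an illustration under stated assumptions, not the code that produced the files in `split`: the function names, the seed, and the rounding of split sizes are all hypothetical choices.

```python
import random


def split_certificates(dgsts, ratios=(0.3, 0.2, 0.5), seed=0):
    """Shuffle certificate digests and cut them into train/valid/test."""
    shuffled = sorted(dgsts)              # sort first so the result is reproducible
    random.Random(seed).shuffle(shuffled)
    n_train = round(ratios[0] * len(shuffled))
    n_valid = round(ratios[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],   # remainder goes to test
    }


def sample_pairs(pairs, split_dgsts, n, seed=0):
    """Sample up to n (dgst, canonical_reference_keyword) pairs from one split."""
    members = set(split_dgsts)
    candidates = sorted(p for p in pairs if p[0] in members)
    return random.Random(seed).sample(candidates, min(n, len(candidates)))
```

Splitting on certificates rather than on pairs keeps all references of one certificate in the same split, which avoids leaking text from a certificate between training and evaluation.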

### Supervised learning of the annotations

1. For each pair `(dgst, referenced_cert_id)`, we recover the relevant text segments that mention the `referenced_cert_id` from both the certification report and the security target.
2. We apply text processing to the segments (e.g., unifying "re-certification" and "recertification").
3. We train a baseline model that combines TF-IDF (or count vectorization in general), a random forest, and a soft-voting layer on top.
    - The random forest classifies a single segment, yielding a probability for each label.
    - Soft voting compares the label probabilities across all segments, squares them, and chooses the maximum.
4. We train a sentence transformer with the same soft-voting layer on top.
5. We fine-tune the hyperparameters.
6. We evaluate the results on the test set using the weighted F1 score.
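
The soft-voting step is described only briefly, so the sketch below shows one possible reading of it, not necessarily the implemented one: each segment's per-label probability is squared, the squares are summed over all segments of a reference, and the label with the largest total wins. The function name `soft_vote` and the sample probabilities are hypothetical.

```python
def soft_vote(segment_probas, labels):
    """Pick one label for a reference from per-segment class probabilities.

    segment_probas: one probability vector per text segment, ordered as `labels`.
    Assumption: squared probabilities are summed per label before taking the max.
    """
    scores = [0.0] * len(labels)
    for probas in segment_probas:
        for i, p in enumerate(probas):
            scores[i] += p ** 2   # squaring rewards confident segments
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best]


labels = ["component_used", "previous_version", "none"]
segments = [
    [0.90, 0.05, 0.05],   # one very confident segment
    [0.30, 0.40, 0.30],   # two weakly decided segments
    [0.30, 0.40, 0.30],
]
winner = soft_vote(segments, labels)   # "component_used"
```

In this example a plain majority vote over segments would choose `previous_version`, while the squared scores let the single confident segment decide.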

### Reference taxonomy

After manually inspecting ~100 random certificates, we have identified the following reference meanings:

- **Component used**: The referenced certificate is a component used in the examined certificate (e.g., IC used by a smartcard). Some evaluation results were likely shared/re-used.
- **Component shared**: The referenced certificate shares some components with the examined certificate. Some evaluation results were likely shared/re-used.
- **Evaluation reused**: The evaluation results of the referenced certificate were used for evaluation of the examined certificate, due to reasons that could not be resolved.
- **Re-evaluation**: The examined certificate is a re-evaluation of the referenced certificate. For definition of re-evaluation, see [Assurance Continuity: CCRA Requirements](https://www.commoncriteriaportal.org/files/operatingprocedures/CCDB-011-v2.2-2021-Sep-30-Final-Assurance_Continuity.pdf).
- **Previous version**: The product in the referenced certificate is a previous version of the product in the examined certificate and the re-certification is not explicitly mentioned.
- **None**: The annotator could not assign any of the previous contexts.
- **Irrelevant**: The reference is irrelevant to the studied certificate (e.g., a typo or a reference left over from a template).

These can be further merged into the following super-categories:

- **Some sub-component relationship**: `component_used`, `component_shared`, and `evaluation_reused`
- **Previous version**: `previous_version` and `re-evaluation`
- **None**: `None` or `irrelevant`
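
The merge into super-categories amounts to a lookup table. The sketch below uses the label strings listed above as keys; the super-category identifiers (`sub_component`, `previous_version`, `none`) and the function name are assumptions made for illustration and may differ from the identifiers used in the codebook.

```python
# Keys are the fine-grained labels listed above; values are hypothetical names.
SUPER_CATEGORY = {
    "component_used": "sub_component",
    "component_shared": "sub_component",
    "evaluation_reused": "sub_component",
    "previous_version": "previous_version",
    "re-evaluation": "previous_version",
    "None": "none",
    "irrelevant": "none",
}


def to_super_category(label):
    """Map a fine-grained label to its super-category.

    str() lets a missing label given as Python None fall under the "None" key.
    Unknown labels raise KeyError instead of being silently merged.
    """
    return SUPER_CATEGORY[str(label)]
```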

The complete formal codebook can be found at [Overleaf](https://www.overleaf.com/read/dmcpzcfpgrdg#2d938a).

### Inter-annotator agreement

The inter-annotator agreement is measured both with Cohen's Kappa and with percentage. The results are as follows:

| Cohen's Kappa | Percentage |
| ------------- | ---------- |
| 0.71          | 0.82       |

The code used to measure the agreement is stored in `notebooks/cc/reference_annotations/inter_annotator_agreement.ipynb`.
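
For reference, both measures can be computed in a few lines of plain Python. This is a minimal sketch of the standard definitions, not the code from the notebook; the function names and the toy labels are illustrative.

```python
from collections import Counter


def percentage_agreement(a, b):
    """Fraction of instances on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percentage_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label frequencies.
    p_e = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / n ** 2
    # Undefined (division by zero) when both annotators always use one identical label.
    return (p_o - p_e) / (1 - p_e)


# Toy labels, not the real annotations.
annotator_1 = ["x", "x", "y", "y"]
annotator_2 = ["x", "x", "y", "x"]
agreement = percentage_agreement(annotator_1, annotator_2)   # 0.75
kappa = cohens_kappa(annotator_1, annotator_2)               # 0.5
```

Kappa is lower than the raw percentage because part of the observed agreement is expected by chance, which is consistent with the gap between the two values in the table above.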