Results
LeAK
We developed a corpus of court decisions from tenancy and transport law. Each document in this dataset was anonymised and realistically pseudonymised. We fine-tuned GottBERT, a German pretrained language model, on this corpus and achieved an F1 score of 84% over all anonymised data points on our test set, which indicates great potential for automatic anonymisation. In particular, we reached a recall of 96% for high-risk PII text spans. Further evaluation scores are given in the table below.
| Model | Precision | Recall | F1 | Recall (high risk) | Recall (medium risk) | Recall (low risk) |
|---|---|---|---|---|---|---|
| GottBERT (Scheible et al. 2020) | 0.80 | 0.90 | 0.84 | 0.96 | 0.80 | 0.89 |
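As a rough illustration of how such span-level scores can be computed, here is a minimal sketch assuming exact-match comparison of predicted and gold PII spans (the spans and labels below are hypothetical; the actual LeAK evaluation protocol may differ):

```python
def span_scores(gold, pred):
    """Exact-match span evaluation: precision, recall and F1.

    gold, pred: sets of (start, end, label) tuples."""
    tp = len(gold & pred)  # predicted spans with exactly matching boundaries and label
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: 10 gold PII spans; the model finds 9 of them
# and additionally predicts 2 spurious spans (false positives).
gold = {(i, i + 5, "PER") for i in range(0, 100, 10)}
pred = {s for s in gold if s[0] < 90} | {(200, 205, "LOC"), (300, 305, "ORG")}
p, r, f1 = span_scores(gold, pred)
```

Note that a single missed span lowers recall regardless of how many spans were predicted correctly, which is why recall is the critical metric for anonymisation.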
We also tested our fine-tuned GottBERT model on the original, non-pseudonymised data. To prevent any risk of data disclosure, we conducted this test in a secure environment. The table below shows that we achieved a fairly high recall for PII text spans (up to 98% for high-risk spans). That said, detecting 98% of high-risk data points still means that in 100,000 documents, about 2,000 may present privacy issues. This is why we continue to work on our anonymisation models.
| Domain | Precision | Recall | F1 | Recall (high risk) | Recall (medium risk) | Recall (low risk) |
|---|---|---|---|---|---|---|
| Landlord-tenant regulations | 0.79 | 0.90 | 0.84 | 0.96 | 0.76 | 0.89 |
| Transport law | 0.85 | 0.90 | 0.87 | 0.98 | 0.81 | 0.87 |
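The risk estimate above follows directly from the recall: with recall r on high-risk spans, up to a fraction (1 − r) of documents containing such spans may slip through. A back-of-the-envelope check, under the worst-case assumption that every missed high-risk span sits in a distinct document:

```python
recall_high_risk = 0.98   # best observed recall on high-risk PII spans (transport law)
documents = 100_000

# Worst case: each missed high-risk span occurs in a different document.
missed = round((1 - recall_high_risk) * documents)
print(missed)  # 2000 documents that may still present privacy issues
```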
To improve robustness, we also built a joint-learning model trained to solve three tasks simultaneously on the same corpus: span detection, entity classification, and risk prediction. We achieved a recall of around 98.8% for PII text spans and an overall F1 of 96.56% (precision 96.88%, recall 96.24%) across both domains. These results show a substantial improvement over our previous approach. To evaluate the generalisability of the multitask model, we additionally ran tests on verdicts from 11 law domains from higher regional courts (Oberlandesgericht). The following table presents the evaluation results.
| Domain | Precision | Recall | F1 | Recall (high risk) | Recall (medium risk) | Recall (low risk) |
|---|---|---|---|---|---|---|
| Bankensachen | 0.89 | 0.88 | 0.88 | 0.84 | 0.88 | 0.89 |
| Bausachen | 0.87 | 0.88 | 0.87 | 0.91 | 0.95 | 0.84 |
| Beschwerdeverfahren | 0.78 | 0.81 | 0.80 | 0.81 | 0.52 | 0.84 |
| Familiensachen | 0.86 | 0.86 | 0.86 | 0.91 | 0.68 | 0.85 |
| Handelssachen | 0.87 | 0.91 | 0.89 | 0.91 | 0.91 | 0.91 |
| Immaterialgüter | 0.70 | 0.69 | 0.69 | 0.73 | 0.62 | 0.73 |
| Kapitalanlagesachen | 0.79 | 0.85 | 0.82 | 0.91 | 0.89 | 0.81 |
| Kostensachen | 0.79 | 0.77 | 0.78 | 0.83 | 0.87 | 0.74 |
| Schiedssachen | 0.88 | 0.84 | 0.86 | 0.92 | 0.66 | 0.84 |
| Verkehrsunfallsachen | 0.84 | 0.90 | 0.87 | 0.87 | 0.93 | 0.91 |
| Zivilsachen | 0.83 | 0.84 | 0.83 | 0.82 | 0.85 | 0.85 |
| Avg. | 0.83 | 0.84 | 0.83 | 0.86 | 0.80 | 0.84 |
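A joint-learning model of this kind typically shares one encoder across the three tasks and minimises a single combined objective during training. A minimal sketch of such a joint loss (the uniform weighting is an assumption for illustration, not the exact LeAK setup):

```python
def joint_loss(span_loss, entity_loss, risk_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the per-task losses for span detection,
    entity classification and risk prediction."""
    losses = (span_loss, entity_loss, risk_loss)
    return sum(w * l for w, l in zip(weights, losses))

# One training step backpropagates this single scalar through the shared
# encoder, so all three task heads shape the same representation.
total = joint_loss(0.42, 0.31, 0.18)
```

Sharing the encoder is what yields the robustness gain: errors on one task are corrected by gradient signal from the other two.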
AnGer
See our recent posters, presentations and panel contributions on the Publications page for preliminary results obtained in the AnGer project.