LeAK & AnGer

Automatic Anonymisation and Pseudonymisation of Court Decisions

German courts are legally required to publish their verdicts, but according to estimates only a small share (under 3%) of all court decisions is published each year. The transparency and availability of such documents not only give legal professionals and the general public access to a crucial source of information, but are also important for the legal-tech industry and the digitalisation of government institutions. The main reason for the low publication rate is that decisions must first be anonymised, a time-consuming process that is still carried out manually. Moreover, to the best of our knowledge, no high-quality German corpus is available for training a fully automatic model, and most existing tools are semi-automatic systems that merely support the manual anonymisation process. In LeAK and AnGer, we want to fill these gaps by creating high-quality annotated training data from verdicts in different domains of law and by developing a prototype for fully automatic anonymisation.

AnGer

  • Start date: 01.01.2023
  • End date: 31.12.2025
  • Funding source: Bundesministerium für Bildung und Forschung (BMBF)

Motivated by the promising results of our prior work in LeAK, the main objectives of the AnGer project are the further development of our automatic anonymisation system and the creation of a new high-quality annotated dataset of court decisions from higher regional courts (Oberlandesgericht, OLG). The model's generalisation across different domains of law remains a challenging task, which this project addresses. We are working on different data augmentation techniques as well as robust neural networks to make our current prototype more robust. Data augmentation in particular can also help to generate more training samples for domains that lack training data. Furthermore, our findings in LeAK show that domain adaptation is still required in certain domains. We therefore create learning curves for the AG (Amtsgericht) and OLG datasets to visualise the learning quality at each training step and to analyse how much data is needed for robust domain adaptation. We also work on an approach for continuous domain adaptation that does not require retraining on all data. Finally, we want to carry out legal-tech studies and conduct de-anonymisation experiments with the annotated gold standard.
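The idea behind such a learning curve can be illustrated with a minimal sketch. It is not the project's training code: the memorising lexicon "model", the recall metric, the toy tokens and all function names below are assumptions chosen only to keep the example self-contained, whereas the project itself works with neural models.

```python
import random


def learning_curve(train, test, sizes, fit, score, seed=0):
    """Train on growing prefixes of the shuffled training data and score each model."""
    rng = random.Random(seed)
    shuffled = list(train)
    rng.shuffle(shuffled)  # fixed seed, so the subsets are nested and reproducible
    curve = []
    for n in sizes:
        model = fit(shuffled[:n])
        curve.append((n, score(model, test)))
    return curve


def fit_lexicon(examples):
    """Hypothetical stand-in for training: memorise every token marked as sensitive."""
    return {token for token, sensitive in examples if sensitive}


def recall(model, examples):
    """Share of sensitive test tokens that the lexicon recognises."""
    sensitive = [token for token, is_sensitive in examples if is_sensitive]
    hits = sum(token in model for token in sensitive)
    return hits / len(sensitive)


# Toy data: (token, needs_anonymisation). Purely illustrative.
train = [
    ("Mustermann", True), ("court", False), ("Erlangen", True),
    ("decision", False), ("Schmidt", True), ("appeal", False),
]
test = [("Mustermann", True), ("Schmidt", True), ("Berlin", True), ("court", False)]

curve = learning_curve(train, test, sizes=[2, 4, 6], fit=fit_lexicon, score=recall)
for n, value in curve:
    print(f"{n} training examples: recall {value:.2f}")
```

Plotting the resulting (size, score) pairs for each dataset shows where additional data stops improving the score, which is the question such a curve is meant to answer.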

LeAK

  • Start date: 01.04.2020
  • End date: 31.03.2022
  • Funding source: Bayerisches Staatsministerium der Justiz (StMJ)

This project aims to explore the feasibility of a fully automatic anonymisation system for German court decisions. One of our key contributions is the development of a high-quality, manually annotated gold standard for verdicts from district courts (Amtsgericht). To ensure the absence of privacy-related information, at least six people have to work on each document: every text in our dataset is annotated independently by four student annotators, who identify the text spans that need to be anonymised, their information categories (names, addresses, jobs or dates) and their risk levels (high, middle and low), and is then adjudicated by two further annotators. Text anonymisation is thus approached as a Named Entity Recognition (NER) task. Another essential aspect of the project is the systematic evaluation of different approaches to automatic anonymisation, using existing NER taggers as well as several Large Language Models (LLMs) fine-tuned on our gold standard. In particular, we are also working on a multitask architecture to enhance the robustness of the anonymisation system. In the course of these experiments, the transferability of our prototype is also to be validated on documents from other domains of law from higher regional courts (Oberlandesgericht). Preliminary results of this experiment indicate that domain adaptation is needed to achieve good performance on court decisions from other text domains.
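To make the span-based view of anonymisation concrete, here is a minimal sketch of how annotated spans could be turned into an anonymised text. The `Span` class, the `anonymise` and `find` helpers, the placeholder format and the example sentence are all hypothetical; the categories and risk levels follow the annotation scheme described above, but the project's actual data format and replacement strategy are not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One annotated text span (character offsets, end exclusive)."""
    start: int
    end: int
    category: str  # e.g. "name", "address", "job", "date"
    risk: str      # "high", "middle" or "low"


def anonymise(text: str, spans: list[Span], risks=("high", "middle", "low")) -> str:
    """Replace every span whose risk level is in `risks` with a category placeholder."""
    out = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.risk not in risks:
            continue  # leave spans below the chosen risk threshold untouched
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        out.append(text[cursor:span.start])
        out.append(f"[{span.category.upper()}]")
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)


def find(text: str, phrase: str, category: str, risk: str) -> Span:
    """Helper for the example: build a span from the first occurrence of `phrase`."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), category, risk)


# Invented example sentence; in practice the spans would come from the
# gold standard or from an NER model's predictions.
text = "The defendant Max Mustermann, a teacher from Erlangen, appeared on 3 May 2019."
spans = [
    find(text, "Max Mustermann", "name", "high"),
    find(text, "teacher", "job", "low"),
    find(text, "Erlangen", "address", "middle"),
    find(text, "3 May 2019", "date", "low"),
]

print(anonymise(text, spans))
print(anonymise(text, spans, risks=("high",)))
```

The `risks` parameter shows one possible use of the risk levels: the same annotations can yield a strict version that masks every span or a lighter version that masks only high-risk spans.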

 

 

Computational Corpus Linguistics
Prof. Dr. Stephanie Evert

Bismarckstraße 6
91054 Erlangen
Germany