Corpus tools and language technology

We develop algorithms and software tools for the automatic linguistic annotation, efficient indexing, flexible querying and quantitative analysis of large text corpora. These tools form the basis of innovative research in the digital humanities as well as practical and commercial applications in language technology.

Project funding

Reconstructing Arguments from Noisy Text (RANT)
(01/2018 – 12/2020)
RogTCS – text clustering for the analysis of open questions in market research
(03/2013 – 09/2014)

Key publications

Evert, Stefan; Greiner, Paul; Baigger, João Filipe; Lang, Bastian (2016). A distributional approach to open questions in market research. Computers in Industry 78, 16–28.
Evert, Stefan; Hardie, Andrew (2015). Ziggurat: A new data model and indexing format for large annotated text corpora. In Proceedings of the 3rd Workshop on the Challenges in the Management of Large Corpora (CMLC-3), pages 21–27, Lancaster, UK.
☞ specification & further information
Proisl, Thomas (2018). SoMeWeTa: A part-of-speech tagger for German social media and web texts. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.
☞ source code (GitHub)
Kabashi, Besim; Proisl, Thomas (2018). Albanian part-of-speech tagging: Gold standard and evaluation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.

Events

Empirikom Shared Task (EmpiriST 2015) on tokenization and POS tagging of German web corpora and computer-mediated communication