Corpus tools and language technology
We develop algorithms and software tools for the automatic linguistic annotation, efficient indexing, flexible querying and quantitative analysis of large text corpora. These tools form the basis of innovative research in the digital humanities as well as practical and commercial applications in language technology.
Project funding
- Reconstructing Arguments from Noisy Text (RANT)
(01/2018 – 12/2020)
- RogTCS – text clustering for the analysis of open questions in market research
(03/2013 – 09/2014)
Key publications
- Evert, Stefan; Greiner, Paul; Baigger, João Filipe; Lang, Bastian (2016). A distributional approach to open questions in market research. Computers in Industry 78, 16–28.
- Evert, Stefan and Hardie, Andrew (2015). Ziggurat: A new data model and indexing format for large annotated text corpora. In Proceedings of the 3rd Workshop on the Challenges in the Management of Large Corpora (CMLC-3), pages 21–27, Lancaster, UK.
☞ specification & further information
- Proisl, Thomas (2018). SoMeWeTa: A part-of-speech tagger for German social media and web texts. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.
☞ source code (GitHub)
- Kabashi, Besim and Proisl, Thomas (2018). Albanian part-of-speech tagging: Gold standard and evaluation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.
Events
- Empirikom Shared Task (EmpiriST 2015) on tokenization and POS tagging of German web corpora and computer-mediated communication