Together with Rafael Ehren, Kuba Waszczuk and Laura Kallmeyer, I’m currently organizing a shared task at KONVENS 2021 that aims at disambiguating literal and idiomatic occurrences of German verbal multi-word expressions, so-called verbal idioms (VIDs), in running text. An example is the idiomatic reading in Das Konzert fiel ins Wasser (‘The concert was canceled’), as opposed to literal counterparts such as Das Handy fiel ins Wasser (‘The mobile phone fell into the water’).
This sort of disambiguation is a well-known challenge for NLP applications such as parsing or machine translation. But the development and testing of disambiguation approaches generally suffer from a lack of data sets that are balanced in terms of literal and idiomatic occurrences, as literal occurrences are quite rare for most VIDs. Therefore, we have assembled a data set (by merging COLF-VID and data from the German SemEval-2013 task 5b) that shows a reasonably low idiomaticity rate and allows for a meaningful comparison of disambiguation systems.
The call for participation (and, as of today, the training data) can be found on GitHub and CodaLab: