Computational Learning and Computational Linguistics
research group
University of Geneva

Current projects

The Grammatical Basis of Corpus Frequencies

Current computational linguistic work shows great interest in extending successful probabilistic modelling to multilingual approaches. Many classical applications and tasks in natural language processing, such as tagging, parsing and the recovery of semantic representations, are being investigated from a multilingual perspective. The computational approaches are motivated by the desire to extend language technology robustly and automatically to many languages, but in so doing they also provide compelling computational models of problems that are being studied from a linguistic point of view. The properties of these computational models might shed light on some of the properties of the generative processes underlying natural language. In this research project, we investigate the probabilistic and statistical properties of a set of well-documented typological distributions of sentential word order and word order in the noun phrase. The fundamental research question we ask is the following: why are the frequency distributions of these phenomena as they are?


The Parlance Project

PARLANCE is a three-year research project that aims to design and build mobile applications that approach human performance in conversational interaction.


Inducing Semantic Representations from Multiple Data Sources (FNS No. 138140)

In this project we develop models of semantic analysis that are trained both on corpora annotated with semantic roles and on parallel corpora that are not annotated. Building on monolingual Bayesian models of semantic role labelling, we exploit results on topic modelling to induce classes that allow us to label verbs that are not annotated in the training data. We exploit results from Bayesian multilingual models of part-of-speech tagging and syntactic parsing to jointly model multiple languages and their semantic roles. The model is intended to be tested monolingually on any of the languages in the parallel corpus, even if no annotated corpus for that language was used in training.


Improving the Coherence of Machine Translation Output by Modeling Intersentential Relations (FNS/Sinergia)

The project aims at extending the current statistical sentence-to-sentence machine translation approach to modeling intersentential dependencies (ISDs) expressed by pronouns, verb tense/mode/aspect, discourse connectives, and politeness/style/register, which influence the perceived coherence of a translated text.


Former projects

Corpus-based Explorations of Cross-linguistic Syntactic and Semantic Role Parallelism (FNS No. 122643)

Significant advances in human-computer linguistic interaction, such as dialogue or speech-based systems, will require the development of adaptive systems. Recent statistical data-driven techniques in language processing aim at acquiring the needed adaptivity by modelling the grammar and meaning of large quantities of naturally occurring text. The modelling of another aspect of human-like behaviour, the ability to learn different languages given the environment, is also fundamental and has received a lot of attention in the form of research on machine translation and other uses of parallel multilingual text.

In this project, we investigate a recent area of research that is common to both machine translation and dialogue systems for human-computer interaction: the automatic projection of grammatical and meaning annotations across languages. So, for example, given the pair of sentences "She blames the government for her failure" and its most natural French translation, "Elle attribue son échec au gouvernement", and given that the English side of the pair bears the semantic annotation [JUDGE She] blames [EVALUEE the government] [REASON for her failure], we want to automatically project the annotation to the French sentence, obtaining [JUDGE Elle] attribue [REASON son échec] [EVALUEE au gouvernement]. In particular, from a linguistic and theoretical point of view, the cross-linguistic validity of the annotation remains to be explored, concentrating on divergent constructions in a corpus-based approach. The exploration needs to be extended to more languages and to correspondences that span more than two languages at a time. From a computational point of view, we need to investigate methods that project both the grammatical and the meaning labelling at the same time on the basis of word-aligned corpora.
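The projection step in the example above can be sketched as follows. This is a minimal illustration, not the project's method: it assumes a word alignment is already available, and the function name project_roles, the span encoding, and the hand-written alignment are all hypothetical. Each labelled English span is mapped to the smallest French span covering the tokens aligned to it.

```python
from typing import Dict, List, Tuple

# A labelled span: (role, start, end), end exclusive, in token indices.
Span = Tuple[str, int, int]


def project_roles(source_spans: List[Span],
                  alignment: Dict[int, List[int]]) -> List[Span]:
    """Map each source span to the smallest target span that covers
    every target token aligned to a token of the source span."""
    projected = []
    for role, start, end in source_spans:
        targets = [t for s in range(start, end) for t in alignment.get(s, [])]
        if not targets:
            continue  # nothing aligned: this role is left unprojected
        projected.append((role, min(targets), max(targets) + 1))
    # Order the projected spans by their position in the target sentence.
    return sorted(projected, key=lambda span: span[1])


english = ["She", "blames", "the", "government", "for", "her", "failure"]
french = ["Elle", "attribue", "son", "échec", "au", "gouvernement"]

# Annotation taken from the example in the text.
source_spans = [("JUDGE", 0, 1), ("EVALUEE", 2, 4), ("REASON", 4, 7)]

# Hypothetical hand-written alignment (English index -> French indices).
# "for" (index 4) is deliberately left unaligned.
alignment = {
    0: [0],  # She -> Elle
    1: [1],  # blames -> attribue
    2: [4],  # the -> au
    3: [5],  # government -> gouvernement
    5: [2],  # her -> son
    6: [3],  # failure -> échec
}

projected = project_roles(source_spans, alignment)
for role, start, end in projected:
    print(f"[{role} {' '.join(french[start:end])}]")
```

With this alignment the REASON and EVALUEE spans swap order on the French side, which is the kind of divergent construction the text refers to. A sketch this simple would break down when alignments are noisy or when a source span maps to discontinuous target tokens.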


Computational Learning in Adaptive Systems for Spoken Conversation (CLASSiC)

The overall goal of the CLASSiC project is to facilitate the rapid deployment of accurate and robust spoken dialogue systems that can learn from experience.


Sampling and Regularization for Latent Structure Models with Feature Induction (FNS No. 125137)

Supervised machine learning has been extremely successful in a wide range of domains, including structure prediction problems. In this paradigm, knowledge from domain experts is exploited by annotating data with representations which are both learnable and useful for solving a given task. But the design of such representations and the annotation of sufficient data have become a bottleneck in system development. Latent variable models also use supervised learning, but they do not require that the representation used for annotation be sufficient to learn the task. Instead, domain knowledge is used to specify the structure of the additional representation which is needed, and automatic methods are used to learn the best annotation of this form for solving the task. To be successful, machine learning methods for latent variable models need solutions to two difficult problems: inference (calculating probabilities from the model) and learning (calculating a model from data). For structure prediction problems, particularly in the field of natural language processing, there has recently been some success in solving these problems, using two alternative approaches. Latent Probabilistic Context-Free Grammar methods are relatively easy to perform inference with, but learning has proved difficult, only recently achieving good results with the "split-and-merge" method of Petrov et al. (ACL 2006). The latent feature-vector methods of Titov and Henderson (ICML 2007) have had relatively little trouble with learning, but exact inference is intractable. However, approximate inference methods have achieved good empirical results (Titov and Henderson, ACL 2007; Titov and Henderson, CoNLL 2007; Henderson et al., CoNLL 2008).
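To make the notion of inference concrete, here is a toy sketch of calculating probabilities from a model with a single latent variable. It is not one of the models cited above: the latent states "A" and "B", the observed labels, and all probability values are invented for illustration. With two latent states the sum over them is trivial; in the structured models discussed here the number of latent configurations grows too quickly for this kind of enumeration, which is why approximate methods are needed.

```python
from typing import Dict

# Hypothetical toy model: a latent state h generates an observed label y.
prior: Dict[str, float] = {"A": 0.6, "B": 0.4}          # P(h)
likelihood: Dict[str, Dict[str, float]] = {              # P(y | h)
    "A": {"NP": 0.7, "VP": 0.3},
    "B": {"NP": 0.2, "VP": 0.8},
}


def marginal(y: str) -> float:
    """P(y): sum the joint probability over every latent state."""
    return sum(prior[h] * likelihood[h][y] for h in prior)


def posterior(y: str) -> Dict[str, float]:
    """P(h | y) by Bayes' rule, normalising with the marginal."""
    z = marginal(y)
    return {h: prior[h] * likelihood[h][y] / z for h in prior}


p_np = marginal("NP")
post_np = posterior("NP")
print(p_np, post_np)
```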

This project continues our previous line of research in latent feature-vector models for complex structure prediction problems, which has focused on mean-field approximation methods. This project investigates two new directions: sampling methods for approximate inference, and regularisation for both approximate inference and learning. These advances in latent variable models for structure prediction will make a contribution to both machine learning research and specific structure processing domains, in particular natural language processing. Our methods are evaluated primarily on various types of natural language parsing, with varying degrees of complexity.

We anticipate continuing our track record of state-of-the-art performance on this topic, and expect that the results of this project will lead to broader adoption of our methods outside of natural language parsing.


Learning Verb Meaning with Hidden Grammars (FNS No. 114044)

Recent developments in natural language understanding have focused on data-driven statistical techniques, which exploit the large amounts of text currently available as a resource for learning the meaning of a sentence. Natural language understanding has immediate applications in question-answering and information extraction. For example, an automatic flight reservation system processing the sentence "I want to book a flight from Geneva to New York" will need to know that "from Geneva" is the origin of the flight and "to New York" is the destination. The applications are manifold and extremely relevant, in particular question-answering for dialogue systems and information extraction for biological data. Technically, the solution of this problem requires us to be able to assign semantic roles (origin and destination in the given example) automatically to any kind of text.

We propose a method that is based on automatically inferring a hidden grammatical structure, which is the unobserved cause of the meaning of the sentence. We therefore model the problem as one of inferring hidden causes of observed events in uncertain environments. These methods hold promise for both the accuracy and the coverage of the results.