A Project coordinated by IIIA.
The SPECULA project develops theoretically grounded machine learning methods for natural language based on spectral learning. Current machine learning methods for Natural Language Processing (NLP) rely heavily on supervised approaches, which limits the ability to apply natural language models to new textual domains or to the specific information needs of an application. The fundamental limitation is the lack of a universal representation of natural language that enables effective generalization. While deep learning methods have made significant progress in this area, fundamental open questions remain: what class of models should we use to capture the structure and meaning of natural language, and how should we use such representations for specific natural language tasks?
We will study these questions theoretically and empirically using the paradigm of grammar induction and transduction, which encompasses two processes. By grammar induction we refer to unsupervised machine learning methods that learn the structure of natural language from a large and representative collection of sentences. By transduction we refer to the process of transforming the unsupervised structures we have learned into human-designed linguistic representations, such as syntactic-semantic trees. By composing the two processes, we obtain models that predict the linguistic structure of sentences through an intermediate grammatical representation that is learned from textual data. Under this induction and transduction paradigm, our research will focus on unsupervised learning of grammars for natural language, and we will test whether the grammars we learn are rich enough to solve NLP tasks via transduction.
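To make the composition concrete, the following is a deliberately small sketch: an unsupervised step induces latent word classes from raw text, and a transduction step maps those classes onto human-designed part-of-speech labels using a handful of annotations. The toy corpus, the greedy context-based clustering used as the induction step, and the majority-vote transducer are expository assumptions, not the project's methods, which target grammars over sentences rather than flat word classes.

```python
import numpy as np
from collections import Counter, defaultdict

# Toy illustration of induction followed by transduction.
# Corpus, clustering heuristic and majority-vote mapping are assumptions.

corpus = [
    "the dog runs", "a cat sleeps", "the cat runs",
    "a dog sleeps", "the bird sings", "a bird runs",
]

# Induction: unsupervised left/right context counts for every word.
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
ctx = np.zeros((len(vocab), 2 * len(vocab)))
for s in corpus:
    ws = s.split()
    for i, w in enumerate(ws):
        if i > 0:
            ctx[idx[w], idx[ws[i - 1]]] += 1                 # left neighbour
        if i + 1 < len(ws):
            ctx[idx[w], len(vocab) + idx[ws[i + 1]]] += 1    # right neighbour

# Greedy prototype clustering of context vectors into latent word classes.
unit = ctx / np.linalg.norm(ctx, axis=1, keepdims=True)
prototypes, classes = [], {}
for w in vocab:
    sims = [unit[idx[w]] @ p for p in prototypes]
    if sims and max(sims) > 0.5:
        classes[w] = int(np.argmax(sims))
    else:
        prototypes.append(unit[idx[w]])
        classes[w] = len(prototypes) - 1

# Transduction: map latent classes to human-designed labels (POS tags)
# using a small annotated sample and a per-class majority vote.
annotated = {"the": "DET", "dog": "NOUN", "runs": "VERB"}
votes = defaultdict(Counter)
for w, tag in annotated.items():
    votes[classes[w]][tag] += 1
class_to_tag = {c: votes[c].most_common(1)[0][0] if votes[c] else "?"
                for c in set(classes.values())}

# Composition: every word receives a linguistic label through its induced class.
for w in vocab:
    print(w, "->", class_to_tag[classes[w]])
```

On this toy corpus the induced classes line up roughly with determiners, nouns and verbs, so the three annotated words are enough for the transducer to label the whole vocabulary.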
The technical objectives of the project are to develop formulations of grammar induction and transduction that are efficient and that scale to large collections of natural language text. The main technical workhorse is spectral learning, which offers formal tools to reduce the problems of grammar induction and transduction to low-rank matrix learning problems. In doing so, we expect to improve our understanding of grammar induction for natural language, and to establish connections between deep and spectral approaches to grammar induction.
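As an illustration of this kind of low-rank reduction, the sketch below applies the standard spectral method for weighted automata: it builds Hankel matrices of prefix-suffix statistics, takes a rank-2 truncated SVD, and recovers initial, final and transition operators that reproduce the original function. The toy two-state automaton and the prefix/suffix basis are assumptions chosen for illustration; they stand in for statistics that would be estimated from a text corpus.

```python
import numpy as np

# Minimal sketch of spectral learning of a weighted automaton via a
# truncated SVD of a Hankel matrix. The toy automaton and basis below
# are illustrative assumptions only.

# A toy weighted automaton defining a function f over strings in {a, b}*.
alpha = np.array([1.0, 0.0])                     # initial weights
beta = np.array([0.2, 0.5])                      # final weights
A = {"a": np.array([[0.5, 0.2], [0.0, 0.4]]),    # operator for symbol 'a'
     "b": np.array([[0.1, 0.3], [0.3, 0.2]])}    # operator for symbol 'b'

def f(word):
    """Weight the automaton assigns to a string."""
    v = alpha
    for sym in word:
        v = v @ A[sym]
    return float(v @ beta)

# Hankel matrices of prefix-suffix statistics over a small complete basis.
prefixes = ["", "a", "b", "aa", "ab"]
suffixes = ["", "a", "b", "ba", "bb"]
H = np.array([[f(p + s) for s in suffixes] for p in prefixes])
H_sym = {sym: np.array([[f(p + sym + s) for s in suffixes] for p in prefixes])
         for sym in A}
h_P = np.array([f(p) for p in prefixes])         # column of H for the empty suffix
h_S = np.array([f(s) for s in suffixes])         # row of H for the empty prefix

# Spectral step: a rank-2 truncated SVD of H gives the low-rank factorization
# from which initial, final and transition operators are recovered.
rank = 2
U, D, Vt = np.linalg.svd(H)
V = Vt[:rank, :].T                               # |suffixes| x rank
F_pinv = np.linalg.pinv(H @ V)                   # pseudo-inverse of the forward factor

alpha_hat = h_S @ V
beta_hat = F_pinv @ h_P
A_hat = {sym: F_pinv @ H_sym[sym] @ V for sym in A}

def f_hat(word):
    """Weight computed by the recovered low-rank model."""
    v = alpha_hat
    for sym in word:
        v = v @ A_hat[sym]
    return float(v @ beta_hat)

# The recovered model matches the original function up to numerical error.
for w in ["", "a", "ab", "bba", "abab"]:
    print(repr(w), f(w), f_hat(w))
```

In a learning setting the exact values f(p + s) would be replaced by empirical frequencies estimated from data, and the truncation rank would control the capacity of the induced model.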