M1 or M2

Laboratoire de Sciences Cognitives et Psycholinguistique


Pavillon Jardin
29 rue d'Ulm
75005 Paris, FRANCE

Language acquisition across cultures
Length of internship
2-7 months
French, English, Spanish, ...

Babies start learning their language well before they start speaking, and they do so at an amazing speed. You may think this is because parents help infants by speaking clearly and helpfully. But it turns out that this may not be true in all cultures. In fact, some have said that, in certain cultures, adults rarely, if ever, talk to babies until the babies themselves can talk. Do babies actually learn language differently across cultures? Could we be wrong about the universal validity of postulated language acquisition algorithms? Or has the extent of diversity perhaps been overestimated by previous work?

To answer these fascinating (and unsettling) questions, we work mainly on two fronts. First, we try to document more accurately the “real” language input that infants experience across a range of cultures and social settings. Naturalistic observations are extremely useful for describing the experiences available to infants in their everyday life in a completely ecological manner. Nowadays, it has even become relatively easy to do this via small recording devices worn by the child throughout one or more typical days. These very long recordings are analyzed using a combination of expert linguistic annotation and automated big data analyses. To document young children’s early achievements, we also carry out field-adapted psycholinguistic experiments. Second, we engage in tightly controlled laboratory experiments and computational modeling to test the plausibility and promise of diverse theories or mechanistic pathways. Across our projects, we are committed to refining our scientific methods to improve the robustness and generalizability of our results.

An internship in our team will typically allow you to gain experience in one or more of the following methods/disciplines:

  • developmental science
  • computational modeling
  • speech technology
  • linguistics
  • psycholinguistics
  • experimental psychology
  • anthropology
  • meta-analyses

More specific examples of internships:

Words, WORDS, syllables: How should we count?

Children whose parents talk to them more end up with better language skills. But what is helpful in that extra experience? Is it the overall quantity of speech, or the number of different words, or the number of different concepts, or...? Figuring this out is important not just for our theories of language acquisition, but also for potential applications. In this project, you would investigate the usefulness of syllable versus word counts when quantifying infants’ language experiences. We invite candidates with a background in linguistics, speech sciences, and/or engineering. The goals and methods will be adapted to the candidate’s strengths and learning objectives.
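To make the contrast between the two measures concrete, here is a minimal sketch in Python. It is only an illustration, not the project's method: the syllable counter is a rough orthographic heuristic for English (one syllable per run of vowel letters), and the example utterances are invented. Real analyses would more likely rely on phonological transcriptions or on the audio itself.

```python
import re

# Assumption: English orthography, one syllable per run of vowel letters.
# This is a crude stand-in for a proper phonological syllabifier.
WORD = re.compile(r"[a-z']+")
VOWEL_GROUP = re.compile(r"[aeiouy]+")


def count_words(utterance: str) -> int:
    """Count orthographic words in an utterance."""
    return len(WORD.findall(utterance.lower()))


def count_syllables(utterance: str) -> int:
    """Estimate syllables by counting vowel-letter runs, at least one per word."""
    total = 0
    for word in WORD.findall(utterance.lower()):
        total += max(1, len(VOWEL_GROUP.findall(word)))
    return total


# Invented examples: similar word counts can hide different syllable counts.
for utterance in ["Look at the doggy", "Do you want a banana"]:
    print(utterance, count_words(utterance), count_syllables(utterance))
```

Comparing the two counts over a whole recording is one simple way to see where the measures diverge, for instance when caregivers use many long words.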

Speech technologies for under-resourced languages

Speech technology and NLP tools have soared in importance as the GAFA (Google, Apple, Facebook, Amazon) started adapting their systems to about 100 major languages (English, French, etc.) in an attempt to reach more people with their services. Those languages were actually the easy part of democratizing speech services. The hard part now is to create systems that work for the other 6,900 typologically diverse languages! As a result, there is enormous demand in the market today for people who can build speech and text systems for under-resourced languages. This is exactly what you would learn in this internship, working on Tsimane, a linguistic isolate spoken in Bolivia. You would start with a large corpus containing audio and video recordings, some of which have been transcribed and even translated into Spanish. Your task would be to clean up the textual annotations and force-align text and audio, using standard and state-of-the-art recipes (e.g., in Kaldi).

Requirements: Good programming skills; experience with speech and phonetics is a plus. 

State-of-the-art diarization

Diarization is the task of automatically determining speaker turns in an audio recording of a conversation (or, as it is more commonly put, deciding who spoke when). Diarization is currently attracting a lot of interest, since many real-life applications depend on it (e.g., movie subtitling, detection of speech from wearables). The DiHARD Challenge yielded a ranking of state-of-the-art systems, with Brno University of Technology scoring first on raw input (track 2, Diez et al., 2018). This internship aims to reproduce their top-ranking system, which involves i-vectors, x-vectors (neural networks), and variational Bayes methods, and to integrate it into an existing diarization framework (LeFranc et al., 2018) in order to apply it to a specific domain. Given the high level of technical skill required, only long internships will be considered. Required skills: bash and Python programming; good spoken and written English. The candidate must be eager to learn in a collaborative environment. Ideally, the candidate would have knowledge of or experience in speech processing techniques (feature extraction, modeling, i-vectors, etc.), neural networks (Keras and/or TensorFlow), and/or other machine learning techniques.
Requirements: Good programming skills and excellent math skills.
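For readers new to the task, the sketch below shows one simple way to represent the output of a diarization system ("who spoke when") and to query it. It does not implement any part of the system described above; the `Turn` type, the helper functions, and the speaker labels are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Turn(NamedTuple):
    """One speaker turn: a time span in seconds plus a speaker label."""
    start: float
    end: float
    speaker: str


def speaking_time(turns):
    """Total duration of speech attributed to each speaker."""
    totals = defaultdict(float)
    for turn in turns:
        totals[turn.speaker] += turn.end - turn.start
    return dict(totals)


def who_spoke_at(turns, time):
    """Speakers active at a given time (several if turns overlap)."""
    return sorted({t.speaker for t in turns if t.start <= time < t.end})


# Invented example with overlapping speech between 2.0 s and 2.5 s.
turns = [
    Turn(0.0, 2.5, "child"),
    Turn(2.0, 5.0, "adult"),
    Turn(6.0, 7.0, "child"),
]
print(speaking_time(turns))
print(who_spoke_at(turns, 2.25))
print(who_spoke_at(turns, 5.5))
```

Overlapping turns are kept as separate entries rather than merged, since overlapping speech is common in real conversations.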

Language data galore: Updating The Language Goldmine (bibliographic internship)

The Language Goldmine is a repository listing hundreds of data sources ranging from phonotactic frequency estimations to affect/emotion ratings of words.
We are looking for students interested in languages and language psychology who want to gain experience in database management and GitHub. Specifically, you would help us detect new data sources, classify them, and add them to the database. There will be optional opportunities to increase your visibility and improve your network via public announcements of new additions.
Requirements: Interest in linguistics and computational modeling.

Hear who's talking? Modeling effects of speaker variation on infant word recognition

Some babies spend most of their waking time with only one person; others hear speech from many different speakers. Even if two such babies heard the very same words, their experiences would not be equivalent. In fact, although for us adults the word “dog” said by a girl and “dog” said by a grandfather mean the same, careful analysis reveals that the word sounds very different depending on who speaks it. So does the number of people who speak to babies affect how they learn language? Should babies who only hear mommy end up being overly strict, only understanding "dog" when mommy says it? Or might babies who hear lots of speakers end up so confused that they are unable to realize that "dog" and "bog" mean different things? Is more less or more? You’ll find out in this internship!
Requirements: Some background in programming (any language welcome); interest in gaining skills in Python and R while learning more about cognitive modeling, phonetic acoustics, and early language acquisition.

Language in babies: An interdisciplinary approach

At six months of age, babies can barely hold their bottle, yet they know the words "hand" and "bottle". In this project, we explore language acquisition in infants using behavioral techniques that uncover their hidden knowledge, and computational approaches that describe the input to language acquisition as well as the properties of the learning system.

Predicting vocabulary comprehension with infant word segmentation models across languages

Infants achieve a comprehension vocabulary of several words early on, just by listening to their caregivers’ speech. This speech contains no explicit information on where words start and end. However, infants somehow manage to segment words out of their caregivers’ input. It has been proposed that infants might do so in part by tracking distributional cues. Previous research has implemented several word segmentation algorithms using these cues, in order to investigate the segmentability of infants’ input.
In this study, we will take a closer look at whether these algorithms are cognitively plausible, by comparing their output to what is actually understood by infants. We will ask whether these models manage to segment the vocabulary that infants supposedly know according to Communicative Development Inventory (CDI) reports. We are working on large databases of diverse languages (CHILDES and AcqDiv), so we can check whether these models are cross-linguistically valid.
Requirements: Some background in programming (Python, bash, and R are not necessary but desirable) and text processing; interest in learning more about cognitive modeling, cross-linguistic comparisons, and early language acquisition.
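As a rough illustration of the kind of comparison described in this project, the sketch below segments a toy corpus using one distributional cue, forward transitional probabilities between syllables, and then checks which words from a word list were recovered. It is not one of the algorithms used in the study: the corpus, the threshold, and the stand-in for a CDI word list are invented for illustration, and syllables are written orthographically for readability.

```python
from collections import Counter


def transitional_probabilities(utterances):
    """Forward TP: P(next syllable | current syllable), within utterances."""
    pair_counts = Counter()
    first_counts = Counter()
    for syllables in utterances:
        for a, b in zip(syllables, syllables[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment(syllables, tps, threshold):
    """Posit a word boundary wherever the TP falls below the threshold."""
    if not syllables:
        return []
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tps.get((a, b), 0.0) < threshold:
            words.append("".join(current))
            current = [b]
        else:
            current.append(b)
    words.append("".join(current))
    return words


# Invented toy corpus: each utterance is a list of syllables, no word boundaries.
corpus = [
    ["pre", "tty", "ba", "by"],
    ["ba", "by", "do", "ggy"],
    ["pre", "tty", "do", "ggy"],
    ["do", "ggy", "ba", "by"],
    ["ba", "by", "pre", "tty"],
    ["do", "ggy", "pre", "tty"],
]
tps = transitional_probabilities(corpus)
segmented_vocabulary = {w for utt in corpus for w in segment(utt, tps, 0.75)}

# Hypothetical stand-in for words reported as understood on a CDI form.
cdi_words = {"baby", "doggy", "ball"}
recovered = cdi_words & segmented_vocabulary
print(sorted(recovered), len(recovered) / len(cdi_words))
```

In this toy corpus, syllable pairs inside words always co-occur while pairs that straddle a word boundary do so only half the time, so a threshold of 0.75 separates the two cases. Real child-directed speech is far noisier, which is part of what makes the comparison with CDI reports informative.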