Wave to Syntax: Probing spoken language models for syntax

This repo hosts the code for the following paper:

@misc{shen2023wave,
      title={Wave to Syntax: Probing spoken language models for syntax}, 
      author={Gaofei Shen and Afra Alishahi and Arianna Bisazza and Grzegorz Chrupała},
      year={2023},
      eprint={2305.18957},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Installation

Clone the repo, then create and activate a virtual environment with conda:

conda create --name wav2syn --file requirements.txt
conda activate wav2syn

The exact configuration of the conda environment used to conduct the experiments can be found in spec-file.txt.

Pre-processing

Modify the dataset paths in preprocessing.py, then run python preprocessing.py to generate a dataset csv file for the textCorpus and Corpus classes.

preprocessing.py also uses the stanza parser to save the constituency tree of each utterance in the dataset.
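To make the notion of a saved constituency tree concrete, here is a minimal sketch that reads a bracketed tree string and measures its depth. It assumes the trees are available in the usual Penn-style bracketed notation and that depth is counted as the number of edges from the root to the deepest word; both the helper names (`parse_bracketed`, `tree_depth`) and that depth definition are illustrative assumptions, not taken from preprocessing.py.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Words (leaves) are represented as (word, []).
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # consume ")"
            return (label, children)
        word = tokens[pos]                # a bare token is a leaf
        pos += 1
        return (word, [])

    return read()


def tree_depth(node):
    """Number of edges on the longest path from this node down to a leaf."""
    _, children = node
    if not children:
        return 0
    return 1 + max(tree_depth(child) for child in children)


if __name__ == "__main__":
    example = "(ROOT (S (NP (DT the) (NN dog)) (VP (VBZ barks))))"
    print(tree_depth(parse_bracketed(example)))
```

Other conventions (for example, not counting the ROOT node or the word level) would shift the value by a constant, so check the definition used in the repo before comparing numbers.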

Datasets

This project implements the embedding extraction script for the LibriSpeech and SpokenCOCO corpora. You can download the two corpora from the links here: SpokenCOCO, LibriSpeech.

After downloading and extracting the datasets, read the main() function in preprocessing.py and change the root directories of the datasets and splits you want to use.

Running preprocessing.py generates dataset csv files that can be understood by the probing scripts.

The preprocessing script also builds the bag-of-words model in the same run.
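For readers unfamiliar with the term, a bag-of-words model represents each utterance by its word counts, ignoring word order. The sketch below shows the general idea using only the standard library. How preprocessing.py actually tokenizes, builds the vocabulary, or stores the result is not described here, so the function names and the whitespace tokenization are assumptions.

```python
from collections import Counter


def build_vocab(transcripts):
    """Map each word to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter()
    for text in transcripts:
        counts.update(text.lower().split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: index for index, word in enumerate(ordered)}


def bow_vector(text, vocab):
    """Count vector for one utterance; out-of-vocabulary words are ignored."""
    vector = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector


if __name__ == "__main__":
    vocab = build_vocab(["the dog barks", "the cat sleeps"])
    print(vocab)
    print(bow_vector("the dog the", vocab))
```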

Models

This repo uses Huggingface Hub to load models.

If you would like to replicate findings with the FaST-VGS family of models, see the instructions at https://github.com/jasonppy/FaST-VGS-Family.

Extract embeddings from spoken language model

Feature extraction is implemented in embedding_generation.py. You may want to modify embedding_generation.py to restrict which models are investigated. The script saves the extracted features, including the mean-pooled layerwise embeddings, the tree depth, the annotation, the audio path, the audio length, and the word count, to a .pt file under embeddings.
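Mean pooling here means averaging a layer's frame-level vectors over time, so that each layer yields one fixed-size vector per utterance. The following is a dependency-free sketch of that operation on plain Python lists; the actual script presumably works on tensors, and the names `mean_pool` and `mean_pool_layers` are hypothetical.

```python
def mean_pool(frames):
    """Average a sequence of equal-length frame vectors over the time axis."""
    if not frames:
        raise ValueError("cannot pool an empty sequence")
    n_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]


def mean_pool_layers(hidden_states):
    """Apply mean pooling to every layer: [layer][frame][dim] -> [layer][dim]."""
    return [mean_pool(layer) for layer in hidden_states]


if __name__ == "__main__":
    # Two layers, two frames each, two dimensions per frame (toy values).
    hidden_states = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [2.0, 2.0]],
    ]
    print(mean_pool_layers(hidden_states))
```

Pooling per layer keeps the layers separate, which is what allows a probe to be trained on each layer independently.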

Running TreeDepth probe

Run

python treedepthprobe.py >> treedepth.out

Running TreeKernel probe

Run

python treekernel_prep.py
python treekernelprobe.py >> treekernel.out
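As background, a tree kernel scores the similarity of two parse trees by counting the tree fragments they share. The sketch below implements the classic Collins and Duffy subset-tree kernel on small hand-written trees. Whether treekernel_prep.py uses this exact kernel, a decay factor, or normalization is not stated here, so treat the functions and the `lam` parameter as illustrative assumptions.

```python
import math

# A tree is (label, children); a word (leaf) is (word, ()).


def production(node):
    """The grammar rule at a node: its label plus the labels of its children."""
    label, children = node
    return (label, tuple(child[0] for child in children))


def is_preterminal(node):
    """True if every child of the node is a leaf (a word)."""
    return all(not child[1] for child in node[1])


def internal_nodes(node):
    """All nodes that have at least one child, in pre-order."""
    if not node[1]:
        return []
    found = [node]
    for child in node[1]:
        found.extend(internal_nodes(child))
    return found


def common_fragments(n1, n2, lam=1.0):
    """Weighted count of fragments rooted at both n1 and n2."""
    if not n1[1] or not n2[1]:
        return 0.0                        # leaves root no fragments
    if production(n1) != production(n2):
        return 0.0
    if is_preterminal(n1):
        return lam
    score = lam
    for c1, c2 in zip(n1[1], n2[1]):
        score *= 1.0 + common_fragments(c1, c2, lam)
    return score


def tree_kernel(t1, t2, lam=1.0):
    """Sum the shared-fragment counts over all pairs of internal nodes."""
    return sum(
        common_fragments(a, b, lam)
        for a in internal_nodes(t1)
        for b in internal_nodes(t2)
    )


def normalized_tree_kernel(t1, t2, lam=1.0):
    """Kernel value scaled to [0, 1], with 1 for identical trees."""
    return tree_kernel(t1, t2, lam) / math.sqrt(
        tree_kernel(t1, t1, lam) * tree_kernel(t2, t2, lam)
    )


def sentence(noun):
    """Toy tree for 'the <noun> barks'."""
    np = ("NP", (("DT", (("the", ()),)), ("NN", ((noun, ()),))))
    vp = ("VP", (("VBZ", (("barks", ()),)),))
    return ("S", (np, vp))


if __name__ == "__main__":
    print(normalized_tree_kernel(sentence("dog"), sentence("cat")))
```

The two toy sentences differ in a single word, so they share most but not all fragments and the normalized score falls strictly between 0 and 1.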
