
Hi there 👋

I'm Stéphan Tulkens! I'm a computational linguistics/AI person. I am currently working as a machine learning engineer/NLP scientist at Metamaze, where I work with transformers and generative AI models to automate document processing.

I got my PhD at CLiPS at the University of Antwerp under the watchful eyes of Walter Daelemans (Computational Linguistics) and Dominiek Sandra (Psycholinguistics). The topic of my PhD was the way people process orthography during reading. You can find a copy here. Before that I studied computational linguistics (MA), philosophy (BA) and software engineering (BA).

My goal is always to make things as fast and small as possible. I like it when simple models work well, and I love it when simple models get close in accuracy to big models. I do not believe absolute accuracy is a metric to be chased, and I think we should always be mindful of what a model computes or learns from the data.

I'm currently working on 🏃‍♂️:

  • reach: a library for loading and working with word embeddings.
  • piecelearn: a library that trains a subword tokenizer and embeddings on the same corpus, giving you open vocabulary embeddings.
  • unitoken: a library for easy pre-tokenization.
  • hashing_split: a library for hash-based data splits (stable splits!)
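
The idea behind hash-based splits can be sketched roughly as follows. This is a minimal illustration, not the API of hashing_split: the function name `assign_split` and the parameters `test_fraction` and `salt` are hypothetical. Each record gets its split from a hash of its own key, so the assignment does not depend on dataset order or on which other records are present.

```python
import hashlib


def assign_split(key: str, test_fraction: float = 0.2, salt: str = "") -> str:
    """Deterministically assign `key` to "train" or "test" based on its hash.

    Hypothetical helper for illustration; not taken from any library.
    """
    digest = hashlib.sha256((salt + key).encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare against an integer threshold to avoid float rounding at the edges.
    threshold = int(test_fraction * 2**64)
    return "test" if bucket < threshold else "train"


# Made-up record identifiers, used only to show the call.
records = ["doc-001", "doc-002", "doc-003", "doc-004"]
splits = {record: assign_split(record) for record in records}
```

Because the assignment is a pure function of the key, adding new records later leaves the existing assignments unchanged, which is what makes the split stable.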

Other stuff I made (most of it from my PhD):

  • wordkit: a library for working with orthography
  • old20: calculate the orthographic Levenshtein distance 20 (OLD20) metric.
  • metameric: fast interactive activation networks in NumPy.
  • humumls: load the UMLS database into a MongoDB instance. Fast!
  • dutchembeddings: word embeddings for Dutch (back when this was a cool thing to do)
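
For readers unfamiliar with OLD20: it is the mean Levenshtein distance from a word to its 20 closest neighbors in a lexicon. The sketch below is a plain reference version, assuming a small in-memory word list. It is not the old20 library's implementation or API; `levenshtein` and `old_n` are hypothetical names, and averaging over fewer neighbors when the lexicon is small is a choice made here for the demo.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between `a` and `b` (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute
                )
            )
        previous = current
    return previous[-1]


def old_n(word: str, lexicon: list[str], n: int = 20) -> float:
    """Mean distance from `word` to its `n` closest words in `lexicon`."""
    distances = sorted(levenshtein(word, other) for other in lexicon if other != word)
    closest = distances[:n]
    if not closest:
        raise ValueError("lexicon contains no words other than the target")
    return sum(closest) / len(closest)


# Toy lexicon, made up for illustration.
lexicon = ["cat", "cot", "coat", "cart", "dog"]
score = old_n("cat", lexicon, n=4)
```

This brute-force version compares the target against every word in the lexicon, so it is only practical for small word lists.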

My research interests 🤖:

  • Tokenizers, specifically subword tokenizers.
  • Embeddings, specifically static embeddings (so old-fashioned! 💀), and how to combine these in meaningful ways.
  • String similarity, and how to compute it without using dynamic programming.

Contact:

Stephan Tulkens's Projects

argilla

✨ Argilla: Open-source data platform for LLMs and Human Feedback

conch

Unsupervised concept extraction from clinical text

diora

Deep Inside-Outside Recursive Autoencoder

lrec2018

Code for the experiments in the LREC 2018 paper "WordKit: a Python Package for Orthographic and Phonological Featurization"

mteb

MTEB: Massive Text Embedding Benchmark

old20

Calculate Yarkoni, Balota & Yap's OLD20.

opendutchwordnet

This repo provides a Python module to work with Open Dutch WordNet. It was created using Python 3.4.

orst

A pixel sorting program, written in Python 3.x.

piecelearn

Learning BPE embeddings by first learning a segmentation model and then training word2vec

plate

Holographic reduced representations

rd

Representation distance

reach

Load embeddings and featurize your sentences.

ruly

A short script to generate stuff based on binary cellular automata.

somber

Recursive Self-Organizing Map/Neural Gas.

spacy_conll

Parse text with spaCy and print the output in CoNLL-U format

tacosdetection

Contains the supplementary materials from the paper: "A Dictionary-based Approach to Racism Detection in Dutch Social Media", under review for the TACOS workshop at LREC 2016.

torchic

Simple linear thing in Torch, with a scikit-learn compatible API.

transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.

unitoken

Tokenization across languages. Useful as preprocessing for subword tokenization.

vicinage

Fast implementations of various string- and vector-based neighborhood metrics

wavesom

Base part of the global space model.

wubi

Python 3 code for transliterating corpora of Chinese characters to Wubi encoding
