kaleidophon / token2index

A lightweight but powerful library to build token indices for NLP tasks, compatible with major Deep Learning frameworks like PyTorch and Tensorflow.

Home Page: http://token2index.readthedocs.io

License: GNU General Public License v3.0

Topics: indexing, token, nlp, pytorch, tensorflow, numpy, w2i, t2i, stoi, itos, i2t, i2w, deeplearning, seq2seq, deep-learning, python, transformers, rnn, rnns, transformer

token2index's Introduction

⚡ 📇 token2index: A lightweight but powerful library for token indexing


token2index is a small yet powerful library that facilitates the fast and easy creation of a data structure mapping tokens to indices, primarily aimed at Natural Language Processing applications. The library is fully tested and has no external dependencies. The documentation can be found here; some feature highlights are shown below.

Who / what is this for?

This class is written for NLP applications where we want to assign an index to every word in a sequence, e.g. to later look up corresponding word embeddings. Building an index and indexing batches of sequences for Deep Learning models in frameworks like PyTorch or Tensorflow are common steps, but they are often written from scratch every time. This library provides a ready-made solution combining many useful features, like reading vocabulary files, building indices from a corpus, or indexing entire batches in one single function call, all while being fully tested.
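The token → index → embedding pattern described above can be sketched in plain Python (this is an illustration of the general idea, not token2index's implementation; all names here are made up for the example):

```python
# Sketch of the token -> index -> embedding lookup pattern
# (plain Python, illustrative only; not token2index itself).
sentence = "the horse raced past the barn".split()

# Build a token-to-index mapping from the vocabulary.
vocab = sorted(set(sentence))
token2idx = {token: idx for idx, token in enumerate(vocab)}

# Index the sentence.
indices = [token2idx[token] for token in sentence]
print(indices)  # [4, 1, 3, 2, 4, 0]

# Use the indices to look up rows of a toy embedding table.
embedding_table = [[float(i), float(i) + 0.5] for i in range(len(vocab))]
embeddings = [embedding_table[i] for i in indices]
print(embeddings[0])  # [4.0, 4.5]
```

token2index takes care of the first two steps (building the mapping and indexing), including edge cases such as unknown tokens and padding.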

✨ Feature Highlights

  • Building and extending vocab

    One way to build the index from a corpus is using the build() function:

    >>> from t2i import T2I
    >>> t2i = T2I.build(["colorless green ideas dream furiously", "the horse raced past the barn fell"])
    >>> t2i
    T2I(Size: 13, unk_token: <unk>, eos_token: <eos>, pad_token: <pad>, {'colorless': 0, 'green': 1, 'ideas': 2, 'dream': 3, 'furiously': 4, 'the': 5, 'horse': 6, 'raced': 7, 'past': 8, 'barn': 9, 'fell': 10, '<unk>': 11, '<eos>': 12, '<pad>': 13})

    The index can always be extended again later using extend():

    >>> t2i = t2i.extend("completely new words")
    T2I(Size: 16, unk_token: <unk>, eos_token: <eos>, pad_token: <pad>, {'colorless': 0, 'green': 1, 'ideas': 2, 'dream': 3, 'furiously': 4, 'the': 5, 'horse': 6, 'raced': 7, 'past': 8, 'barn': 9, 'fell': 10, 'completely': 13, 'new': 14, 'words': 15, '<unk>': 16, '<eos>': 17, '<pad>': 18})

    Both methods and index() also work with an already tokenized corpus in the form of

      [["colorless", "green", "ideas", "dream", "furiously"], ["the", "horse", "raced", "past", "the", "barn", "fell"]]    
    
  • Easy indexing (of batches)

    Index multiple sentences at once in a single function call!

    >>> t2i.index(["the green horse raced <eos>", "ideas are a dream <eos>"])
    [[5, 1, 6, 7, 12], [2, 11, 11, 3, 12]]

    where unknown tokens are always mapped to unk_token.

  • Easy conversion back to strings

    Reverting indices back to strings is just as easy:

    >>> t2i.unindex([5, 14, 16, 3, 6])
    'the new <unk> dream horse'
  • Automatic padding

    You are indexing multiple sentences of different lengths and want to add padding? No problem! index() has two options, available via the pad_to argument. The first is padding to the maximum length of all the sentences:

    >>> padded_sents = t2i.index(["the green horse raced <eos>", "ideas <eos>"], pad_to="max")
    >>> padded_sents
    [[5, 1, 6, 7, 12], [2, 12, 13, 13, 13]]
    >>> t2i.unindex(padded_sents)
    ['the green horse raced <eos>', 'ideas <eos> <pad> <pad> <pad>']

    Alternatively, you can also pad to a pre-defined length:

    >>> padded_sents = t2i.index(["the green horse <eos>", "past ideas <eos>"], pad_to=5)
    >>> padded_sents
    [[5, 1, 6, 12, 13], [8, 2, 12, 13, 13]]
    >>> t2i.unindex(padded_sents)
    ['the green horse <eos> <pad>', 'past ideas <eos> <pad> <pad>']
  • Vocab from file

    Using T2I.from_file(), the index can be created directly by reading from an existing vocab file. Refer to its documentation here for more info.
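Conceptually, building an index from a vocab file boils down to reading one token per line and assigning line numbers as indices. The sketch below illustrates that idea in plain Python; it is not T2I.from_file's actual implementation, whose exact signature and supported file formats are described in the documentation:

```python
import os
import tempfile

# Write a toy vocab file with one token per line (illustrative only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("colorless\ngreen\nideas\n")
    vocab_path = f.name

# Conceptually, reading a vocab file assigns each line's token its
# line number as index (a sketch of the idea, not T2I.from_file itself).
with open(vocab_path) as f:
    index = {line.strip(): i for i, line in enumerate(f) if line.strip()}

os.remove(vocab_path)
print(index)  # {'colorless': 0, 'green': 1, 'ideas': 2}
```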

  • Fixed memory size

    Although the defaultdict class from Python's collections package also possesses the functionality to map unknown keys to a certain value, it grows in size with every new key. A T2I object's memory size stays fixed after the index is built.
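The difference can be demonstrated with plain Python (the fixed-size mapping below mirrors the idea behind T2I using an ordinary dict; it is not T2I's actual implementation):

```python
from collections import defaultdict

# A defaultdict silently adds an entry every time an unknown key is
# looked up ...
d = defaultdict(lambda: -1)
d["known"] = 0
size_before = len(d)
_ = d["unknown_1"]
_ = d["unknown_2"]
assert len(d) == size_before + 2  # two new entries were added

# ... whereas a fixed mapping with an explicit unknown fallback stays
# the same size, no matter how many unknown tokens are looked up.
index = {"known": 0, "<unk>": 1}
size_before = len(index)
_ = index.get("unknown_1", index["<unk>"])
_ = index.get("unknown_2", index["<unk>"])
assert len(index) == size_before  # size unchanged
```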

  • Support for special tokens

    To enable flexibility in modern NLP applications, T2I allows for an arbitrary number of special tokens (like a masking or a padding token) during init!

    >>> t2i = T2I(special_tokens=["<mask>"])
    >>> t2i
    T2I(Size: 3, unk_token: <unk>, eos_token: <eos>, pad_token: <pad>, {'<unk>': 0, '<eos>': 1, '<mask>': 2, '<pad>': 3})
  • Explicitly supported programmer laziness

    Too lazy to type? The library saves you a few keystrokes here and there. Instead of calling t2i.index(...), you can directly call t2i(...) to index one or multiple sequences. Furthermore, key functions like index(), unindex(), build() and extend() support strings and iterables of strings alike as arguments.
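The callable shortcut boils down to delegating __call__ to index(). A minimal plain-Python sketch of that pattern (illustrative only, not T2I's actual implementation):

```python
# Sketch of an index object that is directly callable and accepts
# a single string or an iterable of strings alike (illustrative only).
class TinyIndex:
    def __init__(self, tokens):
        self.token2idx = {token: i for i, token in enumerate(tokens)}

    def index(self, corpus):
        # A single string is indexed directly; an iterable of strings
        # is indexed sentence by sentence.
        if isinstance(corpus, str):
            return [self.token2idx[token] for token in corpus.split()]
        return [self.index(sentence) for sentence in corpus]

    # Calling the object delegates to index().
    __call__ = index

t = TinyIndex(["the", "horse", "raced"])
print(t("the horse"))             # [0, 1]
print(t(["the horse", "raced"]))  # [[0, 1], [2]]
```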

🔌 Compatibility with other frameworks (Numpy, PyTorch, Tensorflow)

It is also ensured that T2I is easily compatible with frameworks like Numpy, PyTorch and Tensorflow, without requiring them as dependencies:

Numpy

>>> import numpy as np
>>> t = np.array(t2i.index(["the new words are ideas <eos>", "the green horse <eos> <pad> <pad>"]))
>>> t
array([[ 5, 15, 16, 17,  2, 18],
       [ 5,  1,  6, 18, 19, 19]])
>>> t2i.unindex(t)
['the new words <unk> ideas <eos>', 'the green horse <eos> <pad> <pad>']

PyTorch

>>> import torch
>>> t = torch.LongTensor(t2i.index(["the new words are ideas <eos>", "the green horse <eos> <pad> <pad>"]))
>>> t
tensor([[ 5, 15, 16, 17,  2, 18],
        [ 5,  1,  6, 18, 19, 19]])
>>> t2i.unindex(t)
['the new words <unk> ideas <eos>', 'the green horse <eos> <pad> <pad>']

Tensorflow

>>> import tensorflow as tf
>>> t = tf.convert_to_tensor(t2i.index(["the new words are ideas <eos>", "the green horse <eos> <pad> <pad>"]), dtype=tf.int32)
>>> t
<tf.Tensor: shape=(2, 6), dtype=int32, numpy=
array([[ 5, 15, 16, 17,  2, 18],
       [ 5,  1,  6, 18, 19, 19]], dtype=int32)>
>>> t2i.unindex(t)
['the new words <unk> ideas <eos>', 'the green horse <eos> <pad> <pad>']

📥 Installation

Installation can simply be done using pip:

pip3 install token2index

🎓 Citing

If you use token2index for research purposes, please cite the library using the following citation info:

@misc{ulmer2020token2index,
    title={token2index: A lightweight but powerful library for token indexing},
    author={Ulmer, Dennis},
    journal={https://github.com/Kaleidophon/token2index},
    year={2020}
}


token2index's Issues

Build vocab from tokenized corpus, i.e. List[List[str]]

Describe the solution you'd like
Allow the T2I.build method to accept List[List[str]].

Example:

tokenized_corpus = [
    ["This", "is", "a", "sentence"],
    ["This", "is", "another", "sentence"]
]
T2I.build(tokenized_corpus)

Currently it can be implemented as follows, but it would be nice to have it supported automatically:

# flatten List[List[str]] with a generator to avoid building the full list in memory
gen = (i for sent in tokenized_corpus for i in sent)
T2I.build(gen)
