
fastDNA

Introduction

fastDNA is a library for classification of short DNA sequences. It is adapted from the fastText library.

Requirements

Generally, fastDNA builds on modern Mac OS and Linux distributions. Since it uses some C++11 features, it requires a compiler with good C++11 support. These include:

  • (g++-4.7.2 or newer) or (clang-3.3 or newer)

Compilation is carried out using a Makefile, so you will need to have a working make.

Building fastDNA

$ git clone https://github.com/rmenegaux/fastDNA.git
$ cd fastDNA
$ make

This will produce object files for all the classes as well as the main binary fastdna.

For a trial run:

$ cd test
$ sh test.sh

This should train and evaluate a small model on the toy dataset provided.

DNA short read classification

In order to train a DNA classifier using the method described in [1], use:

$ ./fastdna supervised -input train.fasta -labels labels.txt -output model

where train.fasta is a FASTA file containing the full reference genomes and labels.txt is a text file containing the genome labels (one label per line). This will output two files: model.bin and model.vec.
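As a concrete illustration, the sketch below (with made-up genome names and labels) writes a pair of files in this format; the key constraint is that the FASTA records and the label lines must be in the same order.

```python
# Write a toy training set in the format fastdna expects (hypothetical
# genome names and labels): one FASTA record per reference genome, and
# a label file with one label per line, in the same order.
genomes = [
    ("genome_A", "ACGTACGTACGTACGTACGT"),
    ("genome_B", "TTGGCCAATTGGCCAATTGG"),
]
labels = ["species_1", "species_2"]

with open("train.fasta", "w") as fasta:
    for name, seq in genomes:
        fasta.write(">%s\n%s\n" % (name, seq))

with open("labels.txt", "w") as lab:
    for label in labels:
        lab.write(label + "\n")
```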

Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

$ ./fastdna test model.bin test.fasta test_labels.txt n

where test.fasta is a FASTA file containing the DNA fragments to be classified, and test_labels.txt contains the labels for each of the fragments.

The argument n is optional, and is equal to 1 by default.
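For intuition, P@k and R@k for single-label reads can be sketched as below. This is an illustration of the metrics, not fastDNA's own implementation: a read counts as a hit if its true label appears among the top k predictions.

```python
def precision_recall_at_k(ranked_preds, true_labels, k=1):
    """Illustrative P@k / R@k for single-label reads.

    ranked_preds[i] is a list of candidate labels for read i,
    ordered from most to least likely; true_labels[i] is the
    single correct label for that read.
    """
    hits = sum(t in p[:k] for p, t in zip(ranked_preds, true_labels))
    n = len(true_labels)
    # P@k: correct labels among the k*n predictions made;
    # R@k: fraction of reads whose true label was retrieved.
    return hits / (k * n), hits / n
```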

In order to obtain the n most likely labels for a set of reads, use:

$ ./fastdna predict model.bin test.fasta n

or use predict-prob to also get the probability for each label:

$ ./fastdna predict-prob model.bin test.fasta n

Doing so will print the n most likely labels for each sequence to the standard output. The argument n is optional, and equal to 1 by default.
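Assuming the usual fastText-style output of alternating label and probability tokens on each line (an assumption worth checking against your own runs), the predictions can be parsed with a small helper:

```python
def parse_predict_prob(line):
    """Parse one line of `predict-prob` output into (label, prob)
    pairs, assuming alternating label/probability tokens, e.g.
    '__label__12 0.85 __label__7 0.1'."""
    tokens = line.split()
    return [(tokens[i], float(tokens[i + 1]))
            for i in range(0, len(tokens), 2)]

pairs = parse_predict_prob("__label__12 0.85 __label__7 0.1")
```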

If you want to compute vector representations of DNA sequences, please use:

$ ./fastdna print-word-vectors model.bin < text.fasta

This assumes that the text.fasta file contains the DNA sequences that you want to get vectors for. The program will output one vector representation per sequence in the file.

To write the vectors to a file, redirect the output like so:

$ ./fastdna print-word-vectors model.bin < text.fasta > vectors.txt
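Each line of vectors.txt then holds one embedding as whitespace-separated numbers, which can be read back with plain Python (a minimal sketch; for large files numpy.loadtxt would be the natural choice):

```python
def load_vectors(path):
    """Read one embedding per non-empty line, as whitespace-
    separated floats (the format written by print-word-vectors)."""
    with open(path) as f:
        return [[float(x) for x in line.split()]
                for line in f if line.strip()]
```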

To get vectors from standard input, just type

$ ./fastdna print-word-vectors model.bin

Press Enter, then type the sequence and finish with Ctrl+D (Linux, Mac) or Ctrl+Z (Windows).

You can also quantize a supervised model to reduce its memory usage with the following command:

$ ./fastdna quantize -output model

This will create a .ftz file with a smaller memory footprint. All the standard functionality, such as test or predict, works the same way on quantized models:

$ ./fastdna test model.ftz test.fasta test_labels.txt

The quantization procedure follows the steps described in [4].

Full documentation

Invoke a command without arguments to list available arguments and their default values:

$ ./fastdna supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -noise              mutation rate (/100,000) [0]
  -length             length of fragments for training [200]
  -epoch              number of epochs [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -loadModel          pretrained model for supervised learning []
  -saveOutput         whether output params should be saved [false]
  -freezeEmbeddings   model does not update the embedding vectors [false]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            whether embeddings are finetuned if a cutoff is applied [false]
  -qnorm              whether the norm is quantized separately [false]
  -qout               whether the classifier is quantized [false]
  -dsub               size of each sub-vector [2]

Python

Most use cases are covered in the Python script fdna.py.

To reproduce the results from the paper, download the data then run:

python fdna.py -train -train_fasta /path/to/train_large_fasta -train_labels /path/to/train_large_labels \
    -eval -test_fasta /path/to/test_large_fasta -test_labels /path/to/test_large_labels \
    -k 13 -d 100 -noise 4 -e 200

NB: The best parameters for classification tasks are k=14, d=50, noise=4.

Full usage:

python fdna.py --help
usage: fdna.py [-h] [-train] [-quantize] [-predict] [-eval] [-predict_quant]
               [-train_fasta TRAIN_FASTA] [-train_labels TRAIN_LABELS]
               [-test_fasta TEST_FASTA] [-test_labels TEST_LABELS]
               [-output_dir OUTPUT_DIR] [-model_name MODEL_NAME]
               [-threads THREADS] [-d D] [-k K] [-e E] [-lr LR] [-noise NOISE]
               [-L L] [-freeze] [-pretrained_vectors PRETRAINED_VECTORS]
               [-verbose VERBOSE]

train, predict and/or quantize fdna model

optional arguments:
  -h, --help            show this help message and exit
  -train                train model
  -quantize             quantize model
  -predict              make predictions
  -eval                 make and evaluate predictions
  -predict_quant        make and evaluate predictions with quantized model
  -train_fasta TRAIN_FASTA
                        training dataset, fasta file containing full genomes
  -train_labels TRAIN_LABELS
                        training labels, text file containing as many labels
                        as there are training genomes
  -test_fasta TEST_FASTA
                        testing dataset, fasta file containing reads
  -test_labels TEST_LABELS
                        testing dataset, text file containing as many labels
                        as there are reads
  -output_dir OUTPUT_DIR
                        output directory
  -model_name MODEL_NAME
                        optional user-defined model name
  -threads THREADS      number of threads
  -d D                  embedding dimension
  -k K                  k-mer length
  -e E                  number of training epochs
  -lr LR                learning rate
  -noise NOISE          level of training noise, percent of random mutations
  -L L                  training read length
  -freeze               freeze the embeddings
  -pretrained_vectors PRETRAINED_VECTORS
                        pretrained vectors .vec files
  -verbose VERBOSE      output verbosity, 0 1 or 2

The Python scripts require numpy and scikit-learn for evaluating predictions.

Data

The small and large datasets used in the paper are available at http://projects.cbio.mines-paristech.fr/largescalemetagenomics/.

References

Continuous Embedding of DNA reads and application to metagenomics

[1] R. Menegaux, J.-P. Vert, Continuous Embedding of DNA reads and application to metagenomics

@article{menegaux2018continuous,
  title={Continuous Embedding of DNA reads and application to metagenomics},
  author={Menegaux, Romain and Vert, Jean-Philippe},
  journal={bioRxiv preprint 335943},
  year={2018}
}

Enriching Word Vectors with Subword Information

[2] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

Bag of Tricks for Efficient Text Classification

[3] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

FastText.zip: Compressing text classification models

[4] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

License

fastDNA, like the fastText library it is adapted from, is BSD-licensed. An additional patent grant is also provided.


