evolocity's People

Contributors

brianhie, dependabot[bot], dongspy, samsledje, seyonechithrananda

evolocity's Issues

Counter is not defined in preprocessing/featurize_seqs.py

Error message below:

~/Library/Python/3.8/lib/python/site-packages/evolocity/preprocessing/featurize_seqs.py in seqs_to_anndata(seqs)
    132             if key not in obs:
    133                 obs[key] = []
--> 134             obs[key].append(Counter([
    135                 meta[key] for meta in seqs[seq]
    136             ]).most_common(1)[0][0])

NameError: name 'Counter' is not defined
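
A minimal sketch of the likely fix, assuming the module is only missing the standard-library import. The helper name `most_common_value` and the sample metadata are hypothetical; the expression itself mirrors the one in the traceback.

```python
from collections import Counter  # the import the traceback suggests is missing


def most_common_value(metas, key):
    """Return the most frequent value of `key` across a list of metadata dicts."""
    return Counter(meta[key] for meta in metas).most_common(1)[0][0]


# Hypothetical metadata for one sequence seen in three records.
metas = [{"host": "human"}, {"host": "avian"}, {"host": "human"}]
print(most_common_value(metas, "host"))  # human
```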

Fasta file seems to require metadata?

Passing in a simple FASTA file with no metadata in the header (just >some_name) fails during parsing. Many FASTA files don't carry this metadata. Changing the header to >name=some_seq|attr=some_attr seems to be required. At the very least, could this be mentioned in the documentation?

fasta_fname = 'all_sequences.fasta'
adata = evo.pp.featurize_fasta(fasta_fname)

Gives this error:

IndexError                                Traceback (most recent call last)
/var/folders/bc/2q6dqwns31s514lsw0zmyyf00000gp/T/ipykernel_59585/710921688.py in <module>
      1 fasta_fname = 'all_sequences.fasta'
----> 2 adata = evo.pp.featurize_fasta(fasta_fname)

~/Library/Python/3.8/lib/python/site-packages/evolocity/preprocessing/featurize_seqs.py in featurize_fasta(fname, model_name, mkey, embed_batch_size, use_cache, cache_namespace)
    256         for record in SeqIO.parse(f, 'fasta'):
    257             fields = record.id.split('|')
--> 258             meta = {
    259                 field.split('=')[0]: field.split('=')[1]
    260                 for field in fields

~/Library/Python/3.8/lib/python/site-packages/evolocity/preprocessing/featurize_seqs.py in <dictcomp>(.0)
    257             fields = record.id.split('|')
    258             meta = {
--> 259                 field.split('=')[0]: field.split('=')[1]
    260                 for field in fields
    261             }

IndexError: list index out of range
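
One way the header parsing could tolerate plain headers is sketched below. This is not evolocity's code: `parse_header` and the fallback key `"name"` are assumptions chosen for illustration. It uses `str.partition` so that a field without `=` does not raise `IndexError`.

```python
def parse_header(record_id):
    """Parse 'key=value|key=value' headers, tolerating fields without '='."""
    meta = {}
    for field in record_id.split("|"):
        key, sep, value = field.partition("=")
        if sep:
            meta[key] = value
        else:
            # Assumption: store a bare field under a hypothetical 'name' key,
            # without overwriting an explicit name=... field.
            meta.setdefault("name", field)
    return meta


print(parse_header("some_name"))
print(parse_header("name=some_seq|attr=some_attr"))
```

With this sketch a header such as >some_name yields a one-entry metadata dict instead of an exception.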

Preprocessing utils require os to be imported

Installed from pip.

~/Library/Python/3.8/lib/python/site-packages/evolocity/preprocessing/featurize_seqs.py in populate_embedding(model, seqs, namespace, use_cache, batch_size, verbose)
     69 
     70     if use_cache:
---> 71         mkdir_p('target/{}/embedding'.format(namespace))
     72         embed_prefix = ('target/{}/embedding/{}_512'
     73                         .format(namespace, model.name_,))

~/Library/Python/3.8/lib/python/site-packages/evolocity/preprocessing/utils.py in mkdir_p(path)
     66 def mkdir_p(path):
     67     try:
---> 68         os.makedirs(path)
     69     except OSError as exc:  # Python >2.5
     70         if exc.errno == errno.EEXIST and os.path.isdir(path):

NameError: name 'os' is not defined
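
A minimal sketch of a `mkdir_p` that does not hit this error, assuming Python 3, where `os.makedirs` accepts `exist_ok=True`. It is a stand-in for the helper in the traceback, not the package's actual fix; adding `import os` and `import errno` to `utils.py` would presumably work as well.

```python
import os
import tempfile


def mkdir_p(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)


# Try it in a throwaway directory; the path layout mimics the traceback.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "target", "demo", "embedding")
    mkdir_p(target)
    mkdir_p(target)  # second call is a no-op
    print(os.path.isdir(target))  # True
```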

How to save results?

I am trying to save results using the methods available in Scanpy. However, it seems that saving has not been implemented in evolocity yet.
My code is:

adata.isbacked
adata.filename = 'cytochrome_final.h5ad'
adata.write_csvs('cytochrome_final_csvs')

It returns:

Traceback (most recent call last):
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/utils.py", line 209, in func_wrapper
    return func(elem, key, val, *args, **kwargs)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/h5ad.py", line 149, in write_not_implemented
    f"Failed to write value for {key}, "
NotImplementedError: Failed to write value for uns/model, since a writer for type <class 'evolocity.tools.fb_model.FBModel'> has not been implemented yet.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test.py", line 28, in <module>
    adata.filename = 'cytochrome_final.h5ad'
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_core/anndata.py", line 1085, in filename
    self.write(filename, force_dense=True)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_core/anndata.py", line 1918, in write_h5ad
    as_dense=as_dense,
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/h5ad.py", line 118, in write_h5ad
    write_attribute(f, "uns", adata.uns, dataset_kwargs=dataset_kwargs)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/functools.py", line 840, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/h5ad.py", line 130, in write_attribute_h5ad
    _write_method(type(value))(f, key, value, *args, **kwargs)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/h5ad.py", line 294, in write_mapping
    write_attribute(f, f"{key}/{sub_key}", sub_value, dataset_kwargs=dataset_kwargs)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/functools.py", line 840, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/h5ad.py", line 130, in write_attribute_h5ad
    _write_method(type(value))(f, key, value, *args, **kwargs)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/utils.py", line 216, in func_wrapper
    ) from e
NotImplementedError: Failed to write value for uns/model, since a writer for type <class 'evolocity.tools.fb_model.FBModel'> has not been implemented yet.

Above error raised while writing key 'uns/model' of <class 'h5py._hl.files.File'> from /.
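
A possible workaround, sketched under the assumption that the only entry blocking the write is the `FBModel` stored at `uns['model']`, as the error message states. The helper `pop_unserializable` is hypothetical, and a plain dict stands in for `adata.uns` so the example runs without AnnData.

```python
def pop_unserializable(uns, keys=("model",)):
    """Remove the given keys from a dict-like `uns` and return what was removed."""
    return {key: uns.pop(key) for key in keys if key in uns}


# Stand-in for adata.uns; the real mapping holds an FBModel under 'model'.
uns = {"model": object(), "other": 1}
removed = pop_unserializable(uns)
print(sorted(uns))  # ['other']

# With a real AnnData object one would now write the file, e.g.:
#   adata.write_h5ad('cytochrome_final.h5ad')
# and afterwards put the entries back:
uns.update(removed)
```

The model object is not saved by this approach, so it would have to be reloaded after reading the file back.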

KeyError: 'J'

Hi @brianhie! Thanks for the great documentation and tutorials surrounding your evolocity work.

I'm getting a KeyError when trying to generate the velocity graph in the influenza A nucleoprotein tutorial. I believe this is because the token J is an ambiguous amino acid code that ESM-1b wasn't trained on. What confuses me is that the error occurs while the models are still being downloaded. Could this be because I am using a newer version of the ESM package (fair-esm=2.0.0)?

KeyError Traceback (most recent call last)
Cell In[8], line 1
----> 1 evo.tl.velocity_graph(adata)

File ~/anaconda3/lib/python3.10/site-packages/evolocity/tools/velocity_graph.py:503, in velocity_graph(adata, model_name, mkey, score, seqs, vkey, n_recurse_neighbors, random_neighbors_at_max, mode_neighbors, include_set, copy, verbose)
501 if verbose:
502 logg.msg('Computing likelihoods...')
--> 503 vgraph.compute_likelihoods(vocabulary, model)
504 if verbose:
505 print('')

File ~/anaconda3/lib/python3.10/site-packages/evolocity/tools/velocity_graph.py:322, in VelocityGraph.compute_likelihoods(self, vocabulary, model)
319 return
321 for seq in iterator:
--> 322 y_pred = predict_sequence_prob(
323 seq, vocabulary, model, verbose=self.verbose
324 )
326 if self.score == 'lm' or self.score == 'edgerand':
327 self.seq_probs[seq] = np.array([
328 y_pred[i + 1, (
329 vocabulary[seq[i]]
(...)
332 )] for i in range(len(seq))
333 ])

File ~/anaconda3/lib/python3.10/site-packages/evolocity/tools/velocity_graph.py:89, in predict_sequence_prob(seq_of_interest, vocabulary, model, verbose)
87 if 'esm' in model.name_:
88 from .fb_semantics import predict_sequence_prob_fb
---> 89 return predict_sequence_prob_fb(
90 seq_of_interest, model.alphabet_, model.model_,
91 model.repr_layers_, verbose=verbose,
92 )
93 elif model.name_ == 'tape':
94 from .tape_semantics import predict_sequence_prob_tape

File ~/anaconda3/lib/python3.10/site-packages/evolocity/tools/fb_semantics.py:27, in predict_sequence_prob_fb(seq, alphabet, model, repr_layers, batch_size, verbose)
21 data_loader = torch.utils.data.DataLoader(
22 dataset, collate_fn=alphabet.get_batch_converter(),
23 batch_sampler=batches
24 )
26 with torch.no_grad():
---> 27 for batch_idx, (labels, strs, toks) in enumerate(data_loader):
28 if torch.cuda.is_available():
29 toks = toks.to(device="cuda", non_blocking=True)

File ~/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py:628, in _BaseDataLoaderIter.__next__(self)
625 if self._sampler_iter is None:
626 # TODO(pytorch/pytorch#76750)
627 self._reset() # type: ignore[call-arg]
--> 628 data = self._next_data()
629 self._num_yielded += 1
630 if self._dataset_kind == _DatasetKind.Iterable and
631 self._IterableDataset_len_called is not None and
632 self._num_yielded > self._IterableDataset_len_called:

File ~/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py:671, in _SingleProcessDataLoaderIter._next_data(self)
669 def _next_data(self):
670 index = self._next_index() # may raise StopIteration
--> 671 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
672 if self._pin_memory:
673 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

File ~/anaconda3/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py:61, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
59 else:
60 data = self.dataset[possibly_batched_index]
---> 61 return self.collate_fn(data)

File ~/anaconda3/lib/python3.10/site-packages/esm/data.py:266, in BatchConverter.__call__(self, raw_batch)
264 batch_size = len(raw_batch)
265 batch_labels, seq_str_list = zip(*raw_batch)
--> 266 seq_encoded_list = [self.alphabet.encode(seq_str) for seq_str in seq_str_list]
267 if self.truncation_seq_length:
268 seq_encoded_list = [seq_str[:self.truncation_seq_length] for seq_str in seq_encoded_list]

File ~/anaconda3/lib/python3.10/site-packages/esm/data.py:266, in <listcomp>(.0)
264 batch_size = len(raw_batch)
265 batch_labels, seq_str_list = zip(*raw_batch)
--> 266 seq_encoded_list = [self.alphabet.encode(seq_str) for seq_str in seq_str_list]
267 if self.truncation_seq_length:
268 seq_encoded_list = [seq_str[:self.truncation_seq_length] for seq_str in seq_encoded_list]

File ~/anaconda3/lib/python3.10/site-packages/esm/data.py:250, in Alphabet.encode(self, text)
249 def encode(self, text):
--> 250 return [self.tok_to_idx[tok] for tok in self.tokenize(text)]

File ~/anaconda3/lib/python3.10/site-packages/esm/data.py:250, in <listcomp>(.0)
249 def encode(self, text):
--> 250 return [self.tok_to_idx[tok] for tok in self.tokenize(text)]

KeyError: 'J'
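
If the cause is indeed a residue code the tokenizer does not know, one hedged workaround is to replace such characters before calling evolocity. The sketch below is an assumption-laden illustration: `replace_unknown` is a hypothetical helper, and the allowed set (the 20 standard amino acids plus `X`) is a conservative guess rather than the model's actual alphabet.

```python
# Assumption: 20 standard amino acids plus 'X'. The real alphabet may accept
# more tokens; the traceback suggests its keys live in `alphabet.tok_to_idx`.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}


def replace_unknown(seq, allowed=ALLOWED, placeholder="X"):
    """Replace every character not in `allowed` with `placeholder`."""
    return "".join(ch if ch in allowed else placeholder for ch in seq)


print(replace_unknown("MASQJTKR"))  # MASQXTKR
```

Masking residues changes the input sequences, so whether this is acceptable depends on the analysis.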

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3000,) + inhomogeneous part.

File ~/tools/miniconda3/lib/python3.9/site-packages/evolocity/preprocessing/featurize_seqs.py:59, in embed_seqs(model, seqs, namespace, verbose)
     54     seqs_fb = sorted([ seq for seq in seqs ])
     55     embedded = embed_seqs_fb(
     56         model.model_, seqs_fb, model.repr_layers_, model.alphabet_,
     57         use_cache=False, verbose=verbose,
     58     )
---> 59     X_embed = np.array([
     60         embedded[seq][0]['embedding'] for seq in seqs_fb
     61     ])
     62 else:
     63     raise ValueError('Model {} not supported for sequence embedding'
     64                      .format(model.name_))

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3000,) + inhomogeneous part.

The error occurs when running the code below. The cause is that the input sequences have different lengths, which prevents NumPy from stacking them into a single array.

import evolocity
adata = evolocity.pp.featurize_fasta(fasta_fname, use_cache=False)
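
To check the diagnosis above before featurizing, the sequence lengths can be tallied first. This is only a diagnostic sketch: `length_counts` and the toy sequences are hypothetical, and it does not decide whether padding, truncating, or pooling is the right remedy.

```python
from collections import Counter


def length_counts(seqs):
    """Count how many sequences there are of each length."""
    return Counter(len(seq) for seq in seqs)


# Toy sequences of unequal length, for illustration only.
seqs = ["MKT", "MKTA", "MRT"]
print(length_counts(seqs))  # Counter({3: 2, 4: 1})
```

More than one key in the result means the sequences differ in length, which is consistent with the inhomogeneous-shape error.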
