brianhie / evolocity

Evolutionary velocity with protein language models

Home Page: https://evolocity.readthedocs.io

License: MIT License

Python 98.45% Shell 1.55%

evolocity's Issues

Issues with long proteins (>1024 residues) with ESM_1b?

Hello, I'm wondering whether evolocity works with long proteins (>1024 residues) when embedding with ESM-1b, since the ESM repo reports issues with such proteins:
facebookresearch/esm#49

That said, I see in the preprint that evolocity was applied to proteins such as Spike, which is longer than 1024 residues. So is this not an issue?
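Before embedding, it may help to check which sequences exceed the model's context at all. The sketch below is only a pre-check and says nothing about how evolocity itself handles long inputs. The helper name `find_long_sequences` is hypothetical, and the default limit of 1022 residues is an assumption (1024 positions minus a start and an end token); adjust it to whatever the model actually accepts.

```python
def find_long_sequences(seqs, max_residues=1022):
    """Return the names of sequences longer than max_residues.

    seqs maps a sequence name to its amino-acid string. The default limit
    is an assumption for ESM-1b, not a value taken from evolocity.
    """
    return [name for name, seq in seqs.items() if len(seq) > max_residues]


# Example: one short sequence and one that exceeds the assumed limit.
example = {'short': 'MKT' * 10, 'long': 'A' * 1300}
too_long = find_long_sequences(example)
```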

Counter is not defined in preprocessing/featurize_seqs.py

Error message below:

~/Library/Python/3.8/lib/python/site-packages/evolocity/preprocessing/featurize_seqs.py in seqs_to_anndata(seqs)
    132             if key not in obs:
    133                 obs[key] = []
--> 134             obs[key].append(Counter([
    135                 meta[key] for meta in seqs[seq]
    136             ]).most_common(1)[0][0])

NameError: name 'Counter' is not defined
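The traceback points to a missing `from collections import Counter` in featurize_seqs.py. As a minimal, standalone sketch of what the failing lines appear to do (pick the most frequent metadata value for a key), with the function name `most_common_value` being hypothetical:

```python
from collections import Counter  # the import the traceback shows is missing


def most_common_value(metas, key):
    """Return the most frequent value of `key` across a list of metadata dicts."""
    return Counter(meta[key] for meta in metas).most_common(1)[0][0]


metas = [{'host': 'human'}, {'host': 'avian'}, {'host': 'human'}]
winner = most_common_value(metas, 'host')
```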

Fasta file seems to require metadata?

Passing in a simple FASTA file with no metadata in the header (just >some_name) seems to fail parsing. Many FASTA files don't have this data. Changing the header to >name=some_seq|attr=some_attr seems to be needed. At the very least, this could be mentioned in the documentation.

fasta_fname = 'all_sequences.fasta'
adata = evo.pp.featurize_fasta(fasta_fname)

Gives this error:

IndexError                                Traceback (most recent call last)
/var/folders/bc/2q6dqwns31s514lsw0zmyyf00000gp/T/ipykernel_59585/710921688.py in <module>
      1 fasta_fname = 'all_sequences.fasta'
----> 2 adata = evo.pp.featurize_fasta(fasta_fname)

~/Library/Python/3.8/lib/python/site-packages/evolocity/preprocessing/featurize_seqs.py in featurize_fasta(fname, model_name, mkey, embed_batch_size, use_cache, cache_namespace)
    256         for record in SeqIO.parse(f, 'fasta'):
    257             fields = record.id.split('|')
--> 258             meta = {
    259                 field.split('=')[0]: field.split('=')[1]
    260                 for field in fields

~/Library/Python/3.8/lib/python/site-packages/evolocity/preprocessing/featurize_seqs.py in <dictcomp>(.0)
    257             fields = record.id.split('|')
    258             meta = {
--> 259                 field.split('=')[0]: field.split('=')[1]
    260                 for field in fields
    261             }

IndexError: list index out of range
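The IndexError comes from `field.split('=')[1]` when a header field contains no `=`. One possible way to tolerate plain headers is sketched below. This is not the library's parser: the function name `parse_header` and the choice to store a bare field under the key `name` are assumptions made for illustration.

```python
def parse_header(record_id):
    """Parse 'key=value|key=value' headers, tolerating fields without '='.

    The first bare field (no '=') is stored under 'name' unless a name is
    already set; later bare fields are ignored.
    """
    meta = {}
    for field in record_id.split('|'):
        if '=' in field:
            # Split only on the first '=' so values may contain '='.
            key, value = field.split('=', 1)
            meta[key] = value
        elif field and 'name' not in meta:
            meta['name'] = field
    return meta


plain = parse_header('some_name')
rich = parse_header('name=some_seq|attr=some_attr')
```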

Apply evolocity to other language models?

ESM-1b is currently the most widely used protein language model. I want to use evolocity to test other language models. Do you have any suggestions?

Preprocessing utils require os to be imported

Installed from Pip.

~/Library/Python/3.8/lib/python/site-packages/evolocity/preprocessing/featurize_seqs.py in populate_embedding(model, seqs, namespace, use_cache, batch_size, verbose)
     69 
     70     if use_cache:
---> 71         mkdir_p('target/{}/embedding'.format(namespace))
     72         embed_prefix = ('target/{}/embedding/{}_512'
     73                         .format(namespace, model.name_,))

~/Library/Python/3.8/lib/python/site-packages/evolocity/preprocessing/utils.py in mkdir_p(path)
     66 def mkdir_p(path):
     67     try:
---> 68         os.makedirs(path)
     69     except OSError as exc:  # Python >2.5
     70         if exc.errno == errno.EEXIST and os.path.isdir(path):

NameError: name 'os' is not defined

How to save results?

I am trying to save results using the methods available in Scanpy. However, it seems that this is not yet supported in evolocity.
My code is:

adata.isbacked
adata.filename = 'cytochrome_final.h5ad'
adata.write_csvs('cytochrome_final_csvs', )

It returns:

Traceback (most recent call last):
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/utils.py", line 209, in func_wrapper
    return func(elem, key, val, *args, **kwargs)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/h5ad.py", line 149, in write_not_implemented
    f"Failed to write value for {key}, "
NotImplementedError: Failed to write value for uns/model, since a writer for type <class 'evolocity.tools.fb_model.FBModel'> has not been implemented yet.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test.py", line 28, in <module>
    adata.filename = 'cytochrome_final.h5ad'
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_core/anndata.py", line 1085, in filename
    self.write(filename, force_dense=True)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_core/anndata.py", line 1918, in write_h5ad
    as_dense=as_dense,
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/h5ad.py", line 118, in write_h5ad
    write_attribute(f, "uns", adata.uns, dataset_kwargs=dataset_kwargs)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/functools.py", line 840, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/h5ad.py", line 130, in write_attribute_h5ad
    _write_method(type(value))(f, key, value, *args, **kwargs)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/h5ad.py", line 294, in write_mapping
    write_attribute(f, f"{key}/{sub_key}", sub_value, dataset_kwargs=dataset_kwargs)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/functools.py", line 840, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/h5ad.py", line 130, in write_attribute_h5ad
    _write_method(type(value))(f, key, value, *args, **kwargs)
  File "/gpfs/share/home/2101111835/anaconda3/envs/evolocity/lib/python3.7/site-packages/anndata/_io/utils.py", line 216, in func_wrapper
    ) from e
NotImplementedError: Failed to write value for uns/model, since a writer for type <class 'evolocity.tools.fb_model.FBModel'> has not been implemented yet.

Above error raised while writing key 'uns/model' of <class 'h5py._hl.files.File'> from /.
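The error message says the writer fails on `uns/model`, which holds an FBModel object. A possible workaround, sketched under that assumption, is to take the model out of `adata.uns` before writing and put it back afterwards. The helper `pop_unwritable` is hypothetical and works on any dict-like object, so it can be tried without AnnData installed.

```python
def pop_unwritable(uns, keys=('model',)):
    """Remove the given keys from a dict-like `uns` and return what was removed."""
    removed = {}
    for key in keys:
        if key in uns:
            removed[key] = uns.pop(key)
    return removed


# Intended use with an AnnData object (not executed here):
#   removed = pop_unwritable(adata.uns)
#   adata.write('cytochrome_final.h5ad')
#   adata.uns.update(removed)   # restore the model for further analysis

# Stand-in for adata.uns, with a plain object playing the model's role.
uns = {'model': object(), 'neighbors': {'n_neighbors': 30}}
removed = pop_unwritable(uns)
```

The model would need to be reloaded or reattached after reading the file back, since it is not stored in the .h5ad file with this approach.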

Error with colab notebook for nucleoprotein

A strange KeyError: 'J' showed up when I ran the notebook; please see the attached figure. Thanks!

KeyError: 'J'

Hi @brianhie! Thanks for the great documentation and tutorials surrounding your evolocity work.

I'm getting a KeyError when trying to generate the velocity graph from the influenza A nucleoprotein tutorial. I believe this is because the token J is an ambiguous amino acid code that ESM-1b was not trained on. What confuses me is why this occurs while the models are still being downloaded. Could this be because I am using a newer version of the ESM package (fair-esm=2.0.0)?

KeyError Traceback (most recent call last)
Cell In[8], line 1
----> 1 evo.tl.velocity_graph(adata)

File ~/anaconda3/lib/python3.10/site-packages/evolocity/tools/velocity_graph.py:503, in velocity_graph(adata, model_name, mkey, score, seqs, vkey, n_recurse_neighbors, random_neighbors_at_max, mode_neighbors, include_set, copy, verbose)
501 if verbose:
502 logg.msg('Computing likelihoods...')
--> 503 vgraph.compute_likelihoods(vocabulary, model)
504 if verbose:
505 print('')

File ~/anaconda3/lib/python3.10/site-packages/evolocity/tools/velocity_graph.py:322, in VelocityGraph.compute_likelihoods(self, vocabulary, model)
319 return
321 for seq in iterator:
--> 322 y_pred = predict_sequence_prob(
323 seq, vocabulary, model, verbose=self.verbose
324 )
326 if self.score == 'lm' or self.score == 'edgerand':
327 self.seq_probs[seq] = np.array([
328 y_pred[i + 1, (
329 vocabulary[seq[i]]
(...)
332 )] for i in range(len(seq))
333 ])

File ~/anaconda3/lib/python3.10/site-packages/evolocity/tools/velocity_graph.py:89, in predict_sequence_prob(seq_of_interest, vocabulary, model, verbose)
87 if 'esm' in model.name_:
88 from .fb_semantics import predict_sequence_prob_fb
---> 89 return predict_sequence_prob_fb(
90 seq_of_interest, model.alphabet_, model.model_,
91 model.repr_layers_, verbose=verbose,
92 )
93 elif model.name_ == 'tape':
94 from .tape_semantics import predict_sequence_prob_tape

File ~/anaconda3/lib/python3.10/site-packages/evolocity/tools/fb_semantics.py:27, in predict_sequence_prob_fb(seq, alphabet, model, repr_layers, batch_size, verbose)
21 data_loader = torch.utils.data.DataLoader(
22 dataset, collate_fn=alphabet.get_batch_converter(),
23 batch_sampler=batches
24 )
26 with torch.no_grad():
---> 27 for batch_idx, (labels, strs, toks) in enumerate(data_loader):
28 if torch.cuda.is_available():
29 toks = toks.to(device="cuda", non_blocking=True)

File ~/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py:628, in _BaseDataLoaderIter.__next__(self)
625 if self._sampler_iter is None:
626 # TODO(pytorch/pytorch#76750)
627 self._reset() # type: ignore[call-arg]
--> 628 data = self._next_data()
629 self._num_yielded += 1
630 if self._dataset_kind == _DatasetKind.Iterable and
631 self._IterableDataset_len_called is not None and
632 self._num_yielded > self._IterableDataset_len_called:

File ~/anaconda3/lib/python3.10/site-packages/torch/utils/data/dataloader.py:671, in _SingleProcessDataLoaderIter._next_data(self)
669 def _next_data(self):
670 index = self._next_index() # may raise StopIteration
--> 671 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
672 if self._pin_memory:
673 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

File ~/anaconda3/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py:61, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
59 else:
60 data = self.dataset[possibly_batched_index]
---> 61 return self.collate_fn(data)

File ~/anaconda3/lib/python3.10/site-packages/esm/data.py:266, in BatchConverter.__call__(self, raw_batch)
264 batch_size = len(raw_batch)
265 batch_labels, seq_str_list = zip(*raw_batch)
--> 266 seq_encoded_list = [self.alphabet.encode(seq_str) for seq_str in seq_str_list]
267 if self.truncation_seq_length:
268 seq_encoded_list = [seq_str[:self.truncation_seq_length] for seq_str in seq_encoded_list]

File ~/anaconda3/lib/python3.10/site-packages/esm/data.py:266, in <listcomp>(.0)
264 batch_size = len(raw_batch)
265 batch_labels, seq_str_list = zip(*raw_batch)
--> 266 seq_encoded_list = [self.alphabet.encode(seq_str) for seq_str in seq_str_list]
267 if self.truncation_seq_length:
268 seq_encoded_list = [seq_str[:self.truncation_seq_length] for seq_str in seq_encoded_list]

File ~/anaconda3/lib/python3.10/site-packages/esm/data.py:250, in Alphabet.encode(self, text)
249 def encode(self, text):
--> 250 return [self.tok_to_idx[tok] for tok in self.tokenize(text)]

File ~/anaconda3/lib/python3.10/site-packages/esm/data.py:250, in <listcomp>(.0)
249 def encode(self, text):
--> 250 return [self.tok_to_idx[tok] for tok in self.tokenize(text)]

KeyError: 'J'
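The traceback ends in the tokenizer's lookup table, which suggests the input contains a residue code the alphabet does not know. A small pre-check like the one below can locate such sequences before calling velocity_graph. The function names and the default allowed set (the 20 standard amino acids plus X) are assumptions for illustration; the real model alphabet may accept more codes.

```python
STANDARD_AAS = frozenset('ACDEFGHIKLMNPQRSTVWY')
DEFAULT_ALLOWED = STANDARD_AAS | {'X'}  # assumed set, not read from the model


def find_unsupported(seqs, allowed=DEFAULT_ALLOWED):
    """Map each offending sequence to the sorted list of characters not allowed."""
    bad = {}
    for seq in seqs:
        extra = set(seq) - allowed
        if extra:
            bad[seq] = sorted(extra)
    return bad


def mask_unsupported(seq, allowed=DEFAULT_ALLOWED, mask='X'):
    """Replace every character outside `allowed` with the mask character."""
    return ''.join(c if c in allowed else mask for c in seq)


bad = find_unsupported(['MAK', 'MAJK'])
cleaned = mask_unsupported('MAJK')
```

Masking changes the sequence itself, so dropping the affected sequences may be the safer choice when sequence identity matters downstream.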

errno is required when directories already exist

import errno

is required in
evolocity/preprocessing/utils.py

so that the exception raised when the directory already exists is properly caught.
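For reference, a self-contained version of the helper with both `os` and `errno` imported might look like the sketch below. The lines after the `if` are the usual form of this recipe and are an assumption, since the traceback elsewhere on this page only shows the function up to that condition.

```python
import errno
import os


def mkdir_p(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    try:
        os.makedirs(path)
    except OSError as exc:
        # Ignore the error only when the path exists and is a directory.
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise
```

On Python 3.2 and later, `os.makedirs(path, exist_ok=True)` gives similar behavior without needing `errno`.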

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3000,) + inhomogeneous part.

File ~/tools/miniconda3/lib/python3.9/site-packages/evolocity/preprocessing/featurize_seqs.py:59, in embed_seqs(model, seqs, namespace, verbose)
     54     seqs_fb = sorted([ seq for seq in seqs ])
     55     embedded = embed_seqs_fb(
     56         model.model_, seqs_fb, model.repr_layers_, model.alphabet_,
     57         use_cache=False, verbose=verbose,
     58     )
---> 59     X_embed = np.array([
     60         embedded[seq][0]['embedding'] for seq in seqs_fb
     61     ])
     62 else:
     63     raise ValueError('Model {} not supported for sequence embedding'
     64                      .format(model.name_))

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3000,) + inhomogeneous part.

The error occurs when running the code below. The cause is that the input sequences have different lengths, which prevents NumPy from stacking their embeddings into a single array.

import evolocity
adata = evolocity.pp.featurize_fasta(fasta_fname, use_cache=False)
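One common way to get a rectangular array from variable-length inputs is to average each embedding over its length dimension before stacking. The sketch below assumes each embedding is a per-residue array of shape (sequence length, embedding dimension); that shape and the helper name `mean_pool` are assumptions, and this is not necessarily how the library should be patched.

```python
import numpy as np


def mean_pool(embeddings):
    """Average each (length, dim) array over its length axis and stack the results."""
    return np.stack([np.asarray(e).mean(axis=0) for e in embeddings])


# Two toy "sequences" of different lengths with a 4-dimensional embedding.
embeddings = [np.ones((3, 4)), 2 * np.ones((5, 4))]
X_embed = mean_pool(embeddings)
```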

Random walk groupby and scipy sparse indexing

Random walk only works with groupby; fix it to allow a random walk on the complete graph:

evolocity/evolocity/tools/random_walk.py

Line 90 in 07d8ea1

if not node_subset[root_node]:

Scipy sparse indexing is broken, perhaps due to updates in the underlying package, but still investigating:

evolocity/evolocity/tools/random_walk.py

Line 106 in 07d8ea1

prob = scipy.special.softmax(T[paths[w, t], :].toarray().ravel())
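To illustrate the first point, the sketch below runs a random walk that falls back to the complete graph when no node subset is given. It is a simplified stand-in, not the code in random_walk.py: it uses a dense NumPy matrix rather than a scipy sparse one, and the function names, the `seed` parameter, and the error raised for a root outside the subset are all assumptions.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def random_walk(T, root_node, n_steps, node_subset=None, seed=0):
    """Walk n_steps from root_node, sampling each step from softmax(T[current])."""
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    if node_subset is None:
        # No groupby: treat every node as part of the walk.
        node_subset = np.ones(n, dtype=bool)
    if not node_subset[root_node]:
        raise ValueError('root_node is not in node_subset')
    rng = np.random.default_rng(seed)
    path = [root_node]
    for _ in range(n_steps):
        prob = softmax(T[path[-1], :])
        path.append(int(rng.choice(n, p=prob)))
    return path


path = random_walk([[0.0, 1.0], [1.0, 0.0]], root_node=0, n_steps=5)
```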
