douweschulte / pdbtbx Goto Github PK
View Code? Open in Web Editor NEWA library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.
Home Page: https://crates.io/crates/pdbtbx
License: MIT License
A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.
Home Page: https://crates.io/crates/pdbtbx
License: MIT License
It would be great if the library supported ways to get close atoms to specified points. Proposed is to do this by making an r*tree of all atoms upon request from the user fn create_rtree(self: &PDB)
. For this Atom should implement Point. Caution should be taken to avoid mutability problems as the rtree will be disentangled from the PDB hierarchy.
https://docs.rs/rstar/0.8.2/rstar/struct.RTree.html
Update the image and the planning based on the following link.
https://mmcif.wwpdb.org/docs/pdb_to_pdbx_correspondences.html
There is a method that does exactly what the implementation in remove_child_by
does, so it could be that this is faster, in which case it should be used. If it is equal use this method as well, because it is cleaner.
CONNECT provides connectivity information between atoms. This needs to be supported in the parser, IR, and save systems. I guess that having a vec
of serial numbers (Vec<usize>
) is a right way of saving the information. But thought needs to go into changing the serial number of an Atom and how that should be updated in the whole PDB struct.
Internally charge
is represented using [char; 2]
which can be changed to isize
. The format if pretty easy but somewhat of a pain to parse and save. In the PDB file the charge is saved like this 1+
, 2-
,
(no charge). So to parse this the chars have to be reverted. And to save this needs to happen again.
Also the atom struct needs adequate getters/setters of which the last variant needs to do bounds checking on the new charge [-9..10]
.
Hi again!
Your fix for the TER statements works fine but I realized that another issue cropped up: I use the occupancy and b-value columns to assign atom to regions in QM/MM calculations. As such, most of the values are necessarily zero. I realize that this is contrary to the intended use of the b-value column but this behavior breaks my use case.
I'm not sure what a good solution might be here. Maybe b-values of 0.00 could be allowed (with warnings shown perhaps) with StrictnessLevel::Loose for open_pdb?
I'm currently in the process of refactoring my code to use this crate and it would be a pity if this excluded the possibility.
If a number is too big (takes 5 chars in a place where 4 is the max) it now just overflows and causes trouble with the file format. This should be handled. I suppose generating a warning while at the same time using modulo to set it back to a valid number. Also PDB.renumber()
should never introduce numbers that are too big.
This will allow semantics that looks more idiomatic for some people, such as:
for chain in &mut model { // instead of model.chains_mut()
for residue in &mut chain { // instead of chain.residues_mut()
....
}
}
and
chain[256] = some_residue
[The answers to this stackoverflow question] seem to help to achieve this.
Make sure all ENDMDL records are properly placed when saving a pdb.
Having the option to join/extend two PDBs is quite handy. So a &mut PDB.join(PDB) in the fashion of the other join functions would be cool, otherwise adding an &mut PDB.extend(IntoIterator) would be cool, as long as it is ergonomic to use from the position of two PDB variables in the top scope, and with the minimum amount of cloning structs.
If I open and save pTLS-6484.pdb
(an ensemble crystallographic pdb file) it crashes Chimera, but the original file does not. So clearly there is something different in the output. Also it would be cool to know what has to be enforced in terms of Atom list equality between the models, right now it enforces that every model has the exact same Atoms (except different position etc) which is not true for the mentioned PDB file.
Having unit tests, especially in combination with continuous integration is needed if this project will become more used.
Some tests that can be written:
Supporting serde could be a great addition to this library, maybe in the way of a optional thing (like in https://docs.rs/rstar/0.8.2/rstar/struct.RTree.html).
To keep an eye on performance and see when thing unnecessarily digress it would be cool to have continuous benchmarks upon pushing to the server. There is a GitHub action made for this: https://github.com/marketplace/actions/continuous-benchmark. But keeping in mind that lots of behaviour is not yet implemented the run times will get higher over time.
Proposal for benchmarks
open
pdb.apply_transformation,
rotation x 90°pdb.remove_atom_by
, every odd numbered atomsave
pdb.atoms
, calculate average B factorvalidate
pdb.clone
All benchmarks should be run on 1ubq.pdb
and pTLS-6484.pdb
to give an idea of the impact of the PDB size.
Hi,
I have a test PDB file. Excerpt shown here:
ATOM 26 N TYR 3 71.828 47.686 89.455 0.00 0.00 N
ATOM 27 H TYR 3 71.781 46.672 89.403 0.00 0.00 H
ATOM 28 CA TYR 3 72.947 48.274 90.203 0.00 0.00 C
ATOM 29 HA TYR 3 73.190 49.241 89.761 0.00 0.00 H
ATOM 30 CB TYR 3 72.523 48.510 91.664 0.00 0.00 C
ATOM 31 HB2 TYR 3 73.354 48.961 92.204 0.00 0.00 H
ATOM 32 HB3 TYR 3 71.713 49.240 91.671 0.00 0.00 H
ATOM 33 CG TYR 3 72.079 47.277 92.434 0.00 0.00 C
When I call pdb.atom(26)
I find that, in fact, The H atom of TYR is returned. Is this intentional? I must say I find this counterintuitive. If unintentional this is probably due to 0- vs. 1-based indexing in Rust and PDB files respectively.
It would be nice if the validation detected atoms being way too close to each other.
Hi,
I saw that you updated the docs with the version upgrade. If I'm reading the rtree
docs correctly (and from my own experience), the locate_within_distance
takes a squared distance as input so in your example the atoms within 12.5^0.5 would be located.
You may want to check up on that and adjust tests accordingly.
MMTF is a more compact and easier to parse (apperently) file format for molecular structures. See https://mmtf.rcsb.org/ for more details. Also there already is a (WIP) crate to parse these structures: https://crates.io/crates/mmtf. It could be interesting to implement, but it does not have high priority (I have never seen these files 'in the wild').
I realized that the AtomWithHierarchy
struct comes with the limitation of containing only immutable references to everything. So I came across the issue that if I want to replace iterations over, e.g., all residues and then all atoms within those residues with just the iteration over AtomWithHierarchy
s, I cannot mutate those (which is something i frequently need to do).
Something like:
pub struct AtomWithHierarchyMut<'a> {
/// The Chain containing this Atom
pub chain: &'a mut Chain,
/// The Residue containing this Atom
pub residue: &'a mut Residue,
/// The Conformer containing this Atom
pub conformer: &'a mut Conformer,
/// This Atom
pub atom: &'a mut Atom,
}
And a method (or two) to instantiate this?
Would it be feasible to create a method like atoms_with_hierarchy_mut
to remedy this? I haven't tried implementing this because I was afraid I had overlooked some obvious problem with borrowing this would raise and I wanted to discuss it first.
I added a CITATON.cff file to help other people cite the software and recognise all contributors. If there is anything you would like to add/change or if you do not want to be listed please let me know.
@TianyiShi2001 @DocKDE
current authors section:
authors:
- affiliation: Utrecht University
family-names: Schulte
given-names: Douwe
orcid: https://orcid.org/0000-0003-0594-0993
- family-names: Shi
given-names: Tianyi
- name: DocKDE
I think I have reported this earlier and misinterpreted this somewhat.
I noticed that when validate_pdb()
is called on a sample PDB of mine and the resulting vec printed out, a lot of warnings are shown that say:
LooseWarning: Atom b factor out of bounds
Atom 24628 has a b factor which is out of bounds, max is 999.99 min is 0.00.
However, none of my b factors are actually below 0.00 or above 999.99.
After some thought I think that the pointers from children to its parent are suboptimal. And completely removing them would solve some issues I foresee in the future (eg .remove()
on an element while iterating...). So I guess I should vote in favour of removing them altogether, but at the same time I like the idea of having acces to some information, like the backbone()
function. But that could be solved by saving a boolean in the Atom. And I guess that function (and some like that I would like to provide) are much less important then the safety and stability of the library.
So removing all pointers and the functions around it would be my proposal, but @TianyiShi2001 I would like some input if you have some? Do you think those pointers serve a purpose or are they just forced by me as I wanted to copy CCTBX?
Apparently this is a thing...
https://proteopedia.org/wiki/index.php/Unusual_sequence_numbering
Example: 5LCC
As documentation is very important for reuse of code and especially important for libraries it would be great if people using this library can help make the docs readable and understandable. As the main writer of the library it is hard to write documentation that is sufficiently clear for starters with this library.
Feel free to open a PR (Pull Request) with changes to the docs or point grammar errors etc out here.
To help people using the library in depth documentation for each function is needed along with structs etc. Also documentation to tricky parts of the code internals would be nice.
It would be best if those can be automatically compiled by cargo doc
so use doc comments (https://doc.rust-lang.org/rust-by-example/meta/doc.html).
I suggest using thiserror for error handling. There's a great article explaining why it's more ergonomic and efficient than manual implementation (which convinced me to use thiserror in my crates): https://nick.groenen.me/posts/rust-error-handling/
rust-bio have also adopted thiserror recently.
This probably never comes up when using "regular" PDB files but when solvating a protein (usually with water) it frequently happens that the number of residues exceeds 9999. At first I thought there was a bug in my code but it seems the pdbtbx open_pdb function is to blame. This is the output I get after opening such a PDB file and printing it back to stdout:
ATOM 1 N HIM 1 20.872 37.523 43.475 0.00 0.00 N
ATOM 2 H1 HIM 1 21.537 38.222 43.773 0.00 0.00 H
ATOM 3 H2 HIM 1 21.347 36.635 43.400 0.00 0.00 H
ATOM 4 CA HIM 1 20.442 37.855 42.095 0.00 0.00 C
ATOM 5 HA HIM 1 20.697 37.031 41.428 0.00 0.00 H
ATOM 6 CB HIM 1 18.932 38.092 42.024 0.00 0.00 C
ATOM 7 HB2 HIM 1 18.594 38.695 42.867 0.00 0.00 H
ATOM 8 HB3 HIM 1 18.678 38.597 41.092 0.00 0.00 H
ATOM 9 CG HIM 1 18.135 36.826 42.056 0.00 0.00 C
ATOM 10 ND1 HIM 1 18.100 35.999 43.156 0.00 0.00 N
ATOM 11 CE1 HIM 1 17.336 34.954 42.900 0.00 0.00 C
ATOM 12 HE1 HIM 1 17.192 34.166 43.639 0.00 0.00 H
ATOM 13 NEM HIM 1 16.857 35.084 41.678 0.00 0.00 N
ATOM 14 CD2 HIM 1 17.352 36.239 41.124 0.00 0.00 C
ATOM 15 HD2 HIM 1 17.072 36.495 40.102 0.00 0.00 H
ATOM 16 C HIM 1 21.229 39.064 41.640 0.00 0.00 C
ATOM 17 O HIM 1 21.045 40.168 42.148 0.00 0.00 O
ATOM 18 CME HIM 1 15.954 34.127 41.011 0.00 0.00 C
ATOM 19 HM1 HIM 1 15.474 34.610 40.161 0.00 0.00 H
ATOM 20 HM2 HIM 1 15.192 33.791 41.715 0.00 0.00 H
ATOM 21 HM3 HIM 1 16.527 33.267 40.662 0.00 0.00 H
ATOM 32543 O HIM 1 11.318 29.317 3.009 0.00 0.00 O
ATOM 32544 H1 HIM 1 11.670 28.720 2.350 0.00 0.00 H
ATOM 32545 H2 HIM 1 10.434 28.992 3.180 0.00 0.00 H
ATOM 22 N THR 2 22.144 38.833 40.709 0.00 0.00 N
ATOM 23 H THR 2 22.246 37.909 40.315 0.00 0.00 H
ATOM 24 CA THR 2 23.072 39.863 40.292 0.00 0.00 C
ATOM 25 HA THR 2 22.590 40.827 40.129 0.00 0.00 H
ATOM 26 CB THR 2 24.177 40.038 41.352 0.00 0.00 C
ATOM 27 HB THR 2 23.704 40.224 42.316 0.00 0.00 H
I think what happened is that all residues with serial number > 9999 (all waters in my case) got wrapped around to serial number 1 again and were then appended to the already existing residue with that serial number (retaining the atom serial number). Would it be possible to fix this?
I know that in cases like this there should be an insertion code or something to distinguish residues but unfortunately I have no say in the creating of these files.
If an atom is shared between conformers (it's alt_loc is blank) it is now copied to all conformers as is. But it's occupancy should reflect the fact that it is now present multiple times. So the occupancy should be set to one over the amount of conformers.
Sometimes the symmetry in the CRYST line is provided as a Hall symbol instead of Hermann Mauguin. It would be great if the program can detect this and still handle it.
Looking up the elementary number && radius would be handy functionality. Obviously it should have all elements (up to 118) and return Option<>s.
I really like fancy (helpful) error messages so I think it would be really helpful to add those to the parser (and rest of the project). In general a sort of inner library (sub-module) needs to be constructed to help with presenting the information in a way that is as clear as possible to users as to programmers of the library itself.
I think the Rust approach is quite good (see below) where first a header is with the concise error (along with the level error/warning), followed by the line (in a parser, or something else for the structs) where the error is visually underlined/shown, followed by some explanation and/or help to help resolve the issue as fast as possible for the users.
error[E0616]: field `unit_cell` of struct `pdb::PDB` is private
--> src\parser\parser.rs:124:25
|
124 | pdb.unit_cell = Some(UnitCell::new(a, b, c, alpha, beta, gamma));
| ^^^^^^^^^ private field
error: aborting due to previous error
I wrote something like this in an earlier project (https://git.science.uu.nl/d.schulte/research-project-amino-acid-alignment/-/blob/master/source/ParseBatchfiles/ErrorMessage.cs) so this can be blatantly translated into Rust (licenses are all fine).
Currently space groups are represented as a char followed by a vec of usize. Which proves to be wrong for many types of space groups, so the best would be to save it as a string, while at the same time saving it as the index in the IToC Vol A table (reference_tables::get_index_for_symbol
).
As there is a pdb and mmCIF open and save options, it will now be needed to document what is supported in each of those functions. Also this can be used to show what will have to be supported for version 1.0.
(thanks to Oleg for raising the issue!)
I would propose to have a minimum of 1 decimal and a maximum of 5 decimals.
It would be really nice if a 'human friendly' find method could be built in. This should allow for the selection of all matching hierarchies based on the given information. Every level in the hierarchy should have the option to present no information, the index, or any of the info of that level. An additional variant which gives mutable hierarchies is also useful.
//PDB
pub fn find(&'a self, model: ModelFind, chain: ChainFind, residue: ResidueFind, conformer: ConformerFind, atom: AtomFind) ->
Vec<AtomConformerResidueChainModel<'a>> {
//find all matching models and dispatch the find code to those models using their comparable find methods
//integrate the model info in the resulting hierarchies
}
//eg ModelFind
pub enum ModelFind {
NoInfo,
Index(usize),
SerialNumber(usize),
}
In the international tables for crystallography vol A are some tables which provide the transformations for every space group to get the full unit cell construct. A way to retrieve those in the code would be cool. But there are some problems: I cannot find the tables (no online access and no tables ready to be found), issues with copyright? (think about it a little more).
Okay, so I have the following situation:
pub fn find_contacts(&self, level: u8) {
let mut vdw_radii = HashMap::new();
vdw_radii.insert("H".to_string(), 1.54);
vdw_radii.insert("C".to_string(), 1.90);
vdw_radii.insert("N".to_string(), 1.79);
vdw_radii.insert("O".to_string(), 1.71);
vdw_radii.insert("P".to_string(), 2.23);
vdw_radii.insert("S".to_string(), 2.14);
vdw_radii.insert("CL".to_string(), 2.06);
vdw_radii.insert("NA".to_string(), 2.25);
vdw_radii.insert("CU".to_string(), 2.17);
let mut table = Table::new();
table.set_format(*format::consts::FORMAT_NO_LINESEP_WITH_TITLE);
table.set_titles(row![
"Atom ID 1",
"Atom Name 1",
"Residue Name 1",
"Atom ID 2",
"Atom Name 2",
"Residue Name 2",
"Distance"
]);
let mut tree = RTree::new();
for residue in &self.residues {
for atom in &residue.atoms {
tree.insert(PointWithData::new(atom, atom.coords))
}
}
for residue in &self.residues {
for atom in &residue.atoms {
let radius: f64 = match level {
0 => 1.0,
1 => {
let rad = vdw_radii
.get(&atom.element.to_uppercase())
.expect("No Radius found for given element.");
rad * rad
}
_ => panic!("Too high level given"),
};
let contacts = tree.locate_within_distance(atom.coords, radius);
for item in contacts {
if item.data.atom_id < atom.atom_id
&& !(item.data.res_id == atom.res_id && item.data.res_name == atom.res_name)
&& !(item.data.atom_name == "C"
&& atom.atom_name == "N"
&& item.data.res_id + 1 == atom.res_id)
{
let distance = item.data.calc_distance(atom);
table.add_row(row![
bByFd =>
item.data.atom_id,
item.data.atom_name,
item.data.res_name,
atom.atom_id,
atom.atom_name,
atom.res_name,
format!("{:.2}", distance)
}
}
}
}
if table.len() > 1 {
if level == 0 {
println!("\nClash Analysis");
} else if level == 1 {
println!("\nContact Analysis");
}
table.printstd();
}
}
This function builds an R-Tree from Atoms and uses that to conduct searches for other Points that are spatially nearby. The idea is to use such information to find clashes of atoms that are too close by (had a problem with such a case recently). In order to filter out results from atoms that belong to the same residue (and thus are close to each other by construction) I use the information about the respective residue that I stored in the Atom structs.
During the process of refactoring my code with pdbtbx I couldn't think of a straightforward way to do this. In most cases I can get what I need by just iterating over residues and the atoms within them but here I only have the Points in the R-Tree as results which are Atom structs. I would have to do a separate iteration over the PDB struct to determine which residues these belonged to and then go from there.
So for cases such as this I would propose to add a residue_serial_number
and residue_name
field to the Atom struct (or maybe re-add since I think you had something like this at some point). If the respective fields of the Residue struct are kept, this should not break existing code or cause trouble with ownership, yes? I realize this would necessitate changing the instantiation methods and unit test and I tried implementing this myself but failed making the macro errors that cropped up go away...
What are your thoughts on this?
When creating a Vec<char>
from a string, s.chars().collect()
is the standard practice. The .chars()
method works on any UTF-8 encoded string and it takes time to find the boundary for each character. However, we know that PDB files contain ASCII characters only, therefore it will be faster to transmute a string directly to bytes using as_bytes()
and use vectors/arrays/slices of bytes instead of char
s throughout the crate. This practice is used, for example in rust-bio and seq-io
Based on the strictnesslevel different records should be omitted. I propose to only give the following with Strict:
The following should only be generated if a DatabaseReference is present:
The following should only be generated in medium and up:
As &mut Struct is automatically dereferenced to *mut Struct the set_{parent}pointer functions are unnecessary and should be removed. The set{parent} functions should accept *mut Struct instead.
As some operations over iterators van benefit greatly from multithreading git would be great if the library supported Rayon iterators. Problems could arise in combination with #57 though so that should be looked into.
As outlined in merge request #10, a fast way to remove children of a struct (PDB/Model/Chain/Residue) is really nice to have. The merge request implemented this for Chains. It would be nice to extend this to the other structs.
Okay, so I have what might be construed as a weird edge case. I am only working with the atoms section of PDB files, the header and all other info is of no interest to me. Also I usually run some kind of preprocessing on them to add missing hydrogens, remove alternative conformations and so on. I just tried to read such a file and save it again, however it was mangled in the process:
Input file:
ATOM 1 N HIE 1 66.397 49.061 85.017 0.00 1.00 N
ATOM 2 H1 HIE 1 66.306 48.101 84.696 0.00 1.00 H
ATOM 3 H2 HIE 1 67.181 49.491 84.536 0.00 1.00 H
ATOM 4 CA HIE 1 66.603 49.087 86.441 2.00 1.00 C
ATOM 5 HA HIE 1 67.052 50.039 86.723 0.00 1.00 H
ATOM 6 CB HIE 1 65.332 48.876 87.271 0.00 1.00 C
ATOM 7 HB2 HIE 1 64.794 47.999 86.927 0.00 1.00 H
ATOM 8 HB3 HIE 1 65.620 48.701 88.303 0.00 1.00 H
ATOM 9 CG HIE 1 64.499 50.108 87.226 0.00 1.00 C
ATOM 10 ND1 HIE 1 64.139 50.737 86.058 0.00 1.00 N
...
Output file:
ATOM 1 N HIE 1 66.397 49.061 85.017 0.00 1.00 N
TER 1 HIE 1
ATOM 2 H1 HIE 1 66.306 48.101 84.696 0.00 1.00 H
TER 2 HIE 1
ATOM 3 H2 HIE 1 67.181 49.491 84.536 0.00 1.00 H
TER 3 HIE 1
ATOM 4 CA HIE 1 66.603 49.087 86.441 2.00 1.00 C
TER 4 HIE 1
ATOM 5 HA HIE 1 67.052 50.039 86.723 0.00 1.00 H
TER 5 HIE 1
ATOM 6 CB HIE 1 65.332 48.876 87.271 0.00 1.00 C
TER 6 HIE 1
So apparently a TER statement was inserted after each atom. I assume this isn't intended behavior and it probably is a result of me using unusual PDB files. I just wanted to bring this to your attention because I'm not sure if this is something you'd want to support or not.
In principle I wouldn't mind helping out with a pull request but I'm only just learning Rust and your PDB parsing function looks a bit daunting to me :D
However, if I can help (with some directions maybe) I will.
Best!
I am not sure the current placement (of separate Hetero Chains) is flexible enough to handle all valid ways to use hetero atoms. I propose to take a look at other PDB packages to see how they handle Hetero atoms and look again at the PDB file format to see if another placement would be better.
Okay, last one for today:
I noticed the atom serial number in the PDB struct wraps around at 9999 even if that's not the case in the input PDB file. Is this standard behavior? Because I seem to recall that 5 columns are reserved for this value so it could contain larger numbers, no?
Is that something that could be changed with a separate method or an argument?
Sorry for pestering you lately.
Have a great weekend!
Having examples which show how a library works and how people can use it for their own projects is a very important part of documentation. I would like to invite anyone to add more examples to the documentation, especially if you have used this library for one of your own projects.
A great place to initially add an example would be to the integration tests /tests/
folder in this way it will automatically be run every time the code is updated so it is guaranteed to keep running. As soon as an example is added we can discuss the best place to put it like the readme or docs or something else.
I think calling it rust-pdb
would be pretty lame, so @TianyiShi2001 do you have any ideas? I think about something stressing the crystallographic or protein part as there apparently is a Microsoft data structure also called pdb getting high results when googling for rust pdb
. With a good name I think we are ready to publish the first version on crates.io with the remark that it still is pretty experimental and so can (and probably will) break between versions.
I recently tried to point the openPDB
function at a random text file and it seems no error is thrown when that happens. I realize this is not an intended use of the function at all but I thought it might be worth a thought to add a warning or breaking error for such cases. Something like "No data could be parsed from file, please make sure it is a valid PDB file" maybe.
Would that maybe be reasonable?
Hello!
I recently started working on a project involving reading and editing PDB files for a very specific purpose. Somewhere along the way I became aware of this crate and am considering switching to using it because it's much more feature-complete than what I have written.
However, a question came to mind: Is there a way to access information across different hierarchies in the Structure? E.g. get a list of atoms from a given residue or determine which residue a given atom belongs to? I have not seen this functionality in the docs but I may have missed it.
Great work by the way!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.