Output details
8 - Chemistry
University of York
BUCCANEER
Since 2008 I have undertaken the following research, enabling new functionality in the Buccaneer software:
1. Auto-built protein models typically contain fragments separated by poorly resolved flexible loop regions. There may be hundreds of fragments belonging to tens of chains of different types, so allocation of fragments to chains is a massive combinatorial problem with potentially billions of solutions. I developed a tree search algorithm to evaluate possible assemblies of the fragments and then to score them for compactness and consistency, along with a pruning method to eliminate unproductive branches and make the problem computationally tractable. I tested the resulting method against 24 known structures with starting data of varying quality to establish that the method is robust and significantly reduces the manual rebuilding required.
2. Automated model building is frequently a rate-limiting step in structure solution. I therefore developed code to allow the calculation to be spread across multiple processor cores. Automatic parallelisation procedures can lead to differing and unreproducible results, so I implemented the threading algorithms by hand with careful ordering of the computational steps to ensure exact reproducibility. In addition, I developed provisional caching strategies to eliminate recalculation of intermediate results. The resulting methods were benchmarked against competing software and gave an order-of-magnitude improvement, so that refinement is now the rate-limiting step.
3. Side chains are hard to see at low resolution, so additional sources of sequence information, used as a 'prior' probability, are helpful. I developed a statistical scoring scheme enabling known selenium atom positions in selenomethionine phasing to increase the probability of placing a methionine near that position. A similar approach allows a partial or molecular replacement model to be used as a restraint on probable residue types. Both methods were tested against a library of 55 known structures with starting data of varying quality.
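The tree search with pruning described in item 1 can be illustrated with a minimal branch-and-bound sketch. This is not the Buccaneer implementation: the `Fragment` type, the cost function (distance from a fragment centroid to a chain centre, standing in for compactness) and the consistency rule (fragments on one chain must not overlap in residue range) are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass
from math import dist, inf


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment: a residue range plus a 3D centroid."""
    start: int
    end: int
    centroid: tuple


def overlaps(a, b):
    # Two fragments clash if their residue ranges intersect.
    return a.start <= b.end and b.start <= a.end


def assign_fragments(fragments, chain_centres):
    """Depth-first search over fragment-to-chain assignments with pruning."""
    n = len(fragments)
    cost = [[dist(f.centroid, c) for c in chain_centres] for f in fragments]

    # Optimistic bound: cheapest possible cost of fragments i..n-1.
    remaining = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        remaining[i] = remaining[i + 1] + min(cost[i])

    best = {"cost": inf, "assignment": None}
    current = []

    def search(i, partial):
        # Prune: this branch cannot beat the best complete assembly so far.
        if partial + remaining[i] >= best["cost"]:
            return
        if i == n:
            best["cost"] = partial
            best["assignment"] = list(current)
            return
        # Try the cheapest chains first so good bounds are found early.
        for c in sorted(range(len(chain_centres)), key=lambda k: cost[i][k]):
            clash = any(
                current[j] == c and overlaps(fragments[j], fragments[i])
                for j in range(i)
            )
            if clash:
                continue
            current.append(c)
            search(i + 1, partial + cost[i][c])
            current.pop()

    search(0, 0.0)
    return best["assignment"], best["cost"]


fragments = [
    Fragment(1, 50, (0.0, 0.0, 0.0)),
    Fragment(40, 90, (1.0, 0.0, 0.0)),
    Fragment(60, 120, (10.0, 0.0, 0.0)),
]
chain_centres = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
assignment, total_cost = assign_fragments(fragments, chain_centres)
```

In this toy case the middle fragment overlaps both of the others, so it must sit alone on one chain; the search returns a chain index per fragment and the total cost, or `(None, inf)` if no consistent assembly exists.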
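The reproducibility concern in item 2 arises because floating-point addition is not associative, so combining partial results in a thread-dependent order can change the final value. The sketch below shows one general way to avoid this: split the work into chunks whose boundaries do not depend on the number of workers, and combine the partial sums in a fixed order. It is an assumption-laden illustration rather than the threading code in Buccaneer; `score_position` is a hypothetical stand-in for an expensive calculation, and `functools.lru_cache` is used only as a simple example of avoiding recalculation.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=None)
def score_position(i):
    # Hypothetical expensive score; cached so repeated requests are not recomputed.
    return 1.0 / (1 + i) ** 2


def chunk_sum(bounds):
    # Sum one chunk sequentially, always in increasing index order.
    lo, hi = bounds
    total = 0.0
    for i in range(lo, hi):
        total += score_position(i)
    return total


def reproducible_total(n, chunk=1000, workers=4):
    # Chunk boundaries depend only on n and chunk, never on the worker count.
    chunks = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, whichever thread finishes first.
        partials = list(pool.map(chunk_sum, chunks))
    # Combine partial sums in a fixed order so the rounding is identical every run.
    total = 0.0
    for p in partials:
        total += p
    return total


serial = reproducible_total(10_000, workers=1)
threaded = reproducible_total(10_000, workers=8)
```

Because both the chunking and the order of combination are fixed, `serial` and `threaded` are bit-for-bit identical, whatever the thread scheduling.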
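The use of selenium positions as a prior in item 3 can be sketched as a simple Bayesian reweighting: the residue-type likelihoods derived from the density are multiplied by a prior that favours methionine when a known selenium site lies nearby, and the result is renormalised. The function name, the `radius` cut-off and the `boost` factor below are hypothetical values chosen for illustration, not the scoring scheme or parameters used in Buccaneer.

```python
from math import dist


def apply_selenium_prior(likelihoods, ca_position, selenium_sites,
                         radius=6.0, boost=10.0):
    """Reweight residue-type probabilities using nearby selenium sites.

    likelihoods    -- dict of residue type to probability from the density
    ca_position    -- (x, y, z) of the residue's C-alpha atom
    selenium_sites -- list of (x, y, z) selenium positions from phasing
    radius, boost  -- hypothetical cut-off distance and prior weight
    """
    near = any(dist(ca_position, site) <= radius for site in selenium_sites)
    weighted = {
        res: p * (boost if near and res == "MET" else 1.0)
        for res, p in likelihoods.items()
    }
    norm = sum(weighted.values())
    # Renormalise so the posterior probabilities sum to one.
    return {res: w / norm for res, w in weighted.items()}


likelihoods = {"MET": 0.2, "LEU": 0.5, "ALA": 0.3}
posterior = apply_selenium_prior(likelihoods, (0.0, 0.0, 0.0), [(3.0, 0.0, 0.0)])
most_probable = max(posterior, key=posterior.get)
```

With a selenium site 3 units from the residue, methionine moves from the least likely of the three candidates to the most likely; with no site within `radius`, the probabilities are left unchanged.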