*********************************************************************
*                                                                   *
*                            PDB_SELECT                             *
*    REPRESENTATIVE LIST OF PROTEIN DATA BANK CHAIN IDENTIFIERS     *
*                                                                   *
*                     UWE HOBOHM & CHRIS SANDER                     *
*                                                                   *
*********************************************************************

-----> AUTO-FTP script (sep 2000) - see below !! <-----
-----> NEW ALGORITHM from release November 1999 - see below !! <-----


DO YOU WANT TO SUBSCRIBE TO PDB_SELECT ?
******************************************************************************
If you want to get an alert when a new list has been made, please send
an email with subject or message:

   subscribe pdb_select
or
   unsubscribe pdb_select

to uwe.hobohm@tg.fh-giessen.de


AIM OF THE PDB_SELECT LISTS
******************************************************************************
The representative lists of protein chains are intended for anyone
interested in working with currently known protein structures. They are
intended to save time and effort by offering a representative selection
that is currently about a factor of fifteen smaller than the entire
database. Typical uses are introductory browsing, analysis of protein
architecture, development of prediction methods, and model building by
modular construction. To use the lists, you need access to data sets from
the Protein Data Bank and to software that reads protein structure files
(see below).


REDISTRIBUTION AND COPYRIGHT
******************************************************************************
This file may be redistributed freely to anyone provided it is not altered
in content or abridged. The lists may not be sold as such or as part of a
service without a license agreement with the authors.
Copyright by Chris Sander and Uwe Hobohm, August 1992 and later.

In work making use of the list, please cite one of the following:

   U.Hobohm, M.Scharf, R.Schneider, C.Sander: "Selection of a
   representative set of structures from the Brookhaven Protein Data
   Bank", Protein Science 1 (1992),409-417.
   U.Hobohm & C.Sander: "Enlarged representative set of protein
   structures", Protein Science 3 (1994) 522


WHERE TO OBTAIN THE LATEST LIST
******************************************************************************
Retrieve the most recent list from the EMBL fileserver at URL

   http://homepages.fh-giessen.de/~hg12640/pdbselect

Questions and/or remarks to: uwe.hobohm@tg.fh-giessen.de


NEW ALIGNMENT ALGORITHM from release November 1999
******************************************************************************
Instead of Smith & Waterman alignment, the much faster Huang-Miller
algorithm is used from November 1999 on (Xiaoqiu Huang & Webb Miller,
"A Time-Efficient, Linear-Space Local Similarity Algorithm", Advances in
Applied Mathematics, 12: 337-357, 1991). Keep this in mind when you find
tiny differences in percent identity and alignment length between this and
other alignment algorithms. Parameters used were 12, 4 and 3 for gap open
penalty, gap elongation penalty and Abagyan-sigma, respectively.


DIFFERENT AMINO ACID EXCHANGE MATRIX used from release November 1999
******************************************************************************
Instead of the Dayhoff matrix used in previous releases, we use the more
sensitive Blosum65 matrix from November 1999 on. Since Blosum65 detects
more relationships than Dayhoff, the resulting pdb_select list is shorter
(cleaner).


NEW SELECTION ALGORITHM from release June 1998
******************************************************************************
This selection is no longer generated using the greedy algorithm described
in U.Hobohm, M.Scharf, R.Schneider, C.Sander, Protein Science 1 (1992),
409-417, since with the growth of the PDB its demands on computer memory
and speed became prohibitive. Instead, the following procedure is applied
to produce the PDB_SELECT list:

1. Excluded from the selection are models and chains with
     - a length of less than 30 residues.
     - number of non-standard amino acid residues (including chain
       breaks) more than 5% of chain length
     - a resolution above 3.0 Angstrom.
     - an R-factor above 30%.
   and some chains of known inferior quality.

2. De-compose all PDB entries into a list of PDB chains.

3. In a pre-scan, from two chains with identical amino acid sequence,
   remove the one with lower quality.

4. Sort chains by quality, with lowest quality chains on top. Quality is
   defined as "resolution plus R-factor/20", so a lower value means a
   better chain. The following chains obtain an artificial quality of 99:
     - NMR chains
     - number of residues with side chain coordinates is < 90% of chain
       length
     - number of residues with backbone coordinates is < 90% of chain
       length
     - the content of (ALA plus GLY) is more than 40% of chain length

5. Do an all-against-all Huang-Miller sequence alignment (X.Huang &
   W.Miller Adv.Appl.Math.12(91)337). If the distance between two chains
   exceeds zero, remove the chain with lower quality from the
   PDB_SELECT-list. Distance D is defined by a function derived by
   Abagyan+Batalov (JMB 273(1997)355-368), namely:

      T = (pow(A, -0.124))*31 + S*18.2*(pow(A,-0.305))

         A is length of alignment,
         S (number of standard deviations) is set to 2,
         T is the Hssp-threshold,
         pow(A,B) means A to the power of B

      D = (P - T) - (L - 25)

         P is percent sequence identity along the alignment,
         L is the PDB_SELECT-limit, for instance for the 25% list L = 25,
         for the 95% list L = 95

   If distance D is above zero the two sequences are considered as
   related, and one is removed.

   Please be aware that by applying the Abagyan-function the
   "25%-PDB_SELECT-list" - which was a list with no chain pair of more
   than 25% sequence identity - in fact becomes an
   "about-(22% - 45%)-PDB_SELECT-list" (see Fig.1 in Abagyan-paper), with
   higher similarities allowed for short alignments. This does not apply
   to the 90% list.

   Chains which cannot be aligned over more than 19 residues are
   considered to be unrelated, irrespective of their identity.

6. Try to put removed chains back in (applied to 25%-list). In the course
   of the all-against-all comparison chain A may have been removed because
   of similarity with chain B, but B has been removed later because of
   similarity with chain C. If A has no similarity with any other chain in
   the list, A is put back in.


ALIGNMENTS SHORTER THAN 20 RESIDUES

PDB_Select is aimed at selecting unique structures. To avoid rejecting too
many chains which share short sequence patches, two chains aligning over
less than 20 residues are considered as unrelated, irrespective of their
percent identity. If you feel this is too generous, let me know.


HOMOLOGS

A list of homologous chains is in file 'homologs.25', with percent
identity, alignment length and Abagyan-distance given in columns 3, 4 and
5, respectively. This is essentially a log written during list processing,
including all relationships encountered. Unique chains (i.e. those without
any relationship) thus do not appear in the homologs-file.


DISCREPANCIES BETWEEN LIST AND PDB-ENTRY

Due to ongoing changes at PDB it may happen on rare occasions that a chain
in the list has been removed or renamed by PDB after processing of
PDB_Select. These discrepancies are unavoidable and can only be reduced by
improved pre-release quality control at PDB and increased release
frequency of PDB_Select.


BUGS
******************************************************************************
- Since R-factor and resolution are given in unformatted form in
  PDB-files, the automatic PDB file parsing program may in a few cases
  find an incorrect value for R-factor and/or resolution.
- Models are indicated in the field EXPDTA in recent PDB files. However,
  this is not true for older PDB files. Thus, we cannot guarantee that our
  automatic PDB text parsing program excludes all models.
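
Returning to step 5 of the selection procedure above: the relatedness test
can be written out as a short shell script. This is a minimal sketch and
not the program used to build the lists. The names pdbsel_distance and
pdbsel_related are invented for illustration, awk does the floating point
arithmetic, and S defaults to 2 as stated in step 5. The alignment
parameters above mention an Abagyan-sigma of 3; that value can be passed
as an optional fourth argument to see its effect.

```shell
#!/bin/sh
# Illustrative sketch only: function names are hypothetical and are not
# part of the PDB_SELECT programs.

# awk helpers for the two formulas quoted in step 5.
PDBSEL_AWK='
function threshold(a, s) {
    # T = pow(A,-0.124)*31 + S*18.2*pow(A,-0.305), written with exp/log
    return 31 * exp(-0.124 * log(a)) + s * 18.2 * exp(-0.305 * log(a))
}
function distance(p, a, l, s) {
    # D = (P - T) - (L - 25)
    return (p - threshold(a, s)) - (l - 25)
}
'

# pdbsel_distance P A L [S]
#   P = percent identity, A = alignment length, L = list limit (e.g. 25),
#   S = number of standard deviations (assumed default: 2).
#   Prints D rounded to two decimals.
pdbsel_distance() {
    awk -v p="$1" -v a="$2" -v l="$3" -v s="${4:-2}" \
        "$PDBSEL_AWK"'BEGIN { printf "%.2f\n", distance(p, a, l, s) }'
}

# pdbsel_related P A L [S]
#   Exit status 0 if the pair counts as related: aligned over at least
#   20 residues and D above zero. Exit status 1 otherwise.
pdbsel_related() {
    awk -v p="$1" -v a="$2" -v l="$3" -v s="${4:-2}" \
        "$PDBSEL_AWK"'BEGIN { exit !(a >= 20 && distance(p, a, l, s) > 0) }'
}
```

For example, after loading these functions into a shell,
"pdbsel_distance 40 100 25" prints 13.55, so a pair with 40% identity over
100 aligned residues counts as related for the 25% list, while
"pdbsel_related 90 15 25" fails because the alignment is shorter than 20
residues.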

PLEASE BE SO KIND AS TO REPORT ANY FUNNIES TO US - THIS WILL HELP TO
IMPROVE THE NEXT RELEASE OF THE LIST


COLUMNS
******************************************************************************
thrsh  : threshold (percent identity cutoff)
ID     : PDB-identifier (last letter: chain identifier)
naa    : number of amino acid residues (standard plus non-standard)
Res    : resolution
Rfac   : R-factor
Methd  : Method (X : X-ray structure, N : NMR structure)
n_sid  : number of residues with side chain coordinates
n_bck  : number of residues with backbone coordinates
n_naa  : number of non-standard amino acid residues
n_hlx  : number of residues in helical conformation (DSSP-assignment)
n_bta  : number of residues in beta conformation (DSSP-assignment)
compnd : compound


AUTO-FTP
******************************************************************************
A selection of PDB-files may be retrieved using an autoftp script.

1. prepare a list of files "pdblist", one filename per line, for example:

## begin of file (don't copy this line; don't use chain identifiers)
pdb9xim.ent.Z
pdb9msi.ent.Z
pdb9icy.ent.Z
## end of file (don't copy this line)

2. prepare a loop-script "doit", for example (if someone knows how to put
   this loop inside the ftp-part please let me know):

## begin of file (don't copy this line)
#!/bin/sh
# call autoftp once for every file named in pdblist
for i in `cat pdblist`
do
  echo "$i"
  ./autoftp "$i"
done
## end of file (don't copy this line)

3. prepare an autoftp-script "autoftp", for example:

## begin of file (don't copy this line)
#!/bin/sh
# anonymous ftp download of the file given as first argument
ftp -n ftp.rcsb.org <<*eof
quote user ftp
quote pass guest
binary
cd pub/pdb/data/structures/all/pdb
get $1
*eof
## end of file (don't copy this line)

4. make "doit" and "autoftp" executable:

   chmod +x doit
   chmod +x autoftp

5. retrieve files:

   ./doit

6. decompress files:

   gunzip *.Z
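
Regarding the question in step 2 of how to put the loop inside the ftp
part: one possible approach is to generate the ftp commands for all files
first and feed them to a single ftp session. The following is a sketch
under assumptions, not a tested replacement for the scripts above. The
helper name make_ftp_batch is made up for this example; host, login and
directory are simply copied from the autoftp script, and whether that
server still accepts anonymous ftp is not something this sketch can
guarantee.

```shell
#!/bin/sh
# Sketch: build one ftp command batch for all files named in a list.
# make_ftp_batch is a hypothetical helper name.

# make_ftp_batch [LISTFILE]
#   Prints the ftp commands needed to fetch every file named in LISTFILE
#   (default: pdblist). Empty lines and lines starting with '#' are
#   skipped, so the "## begin of file" markers do no harm.
make_ftp_batch() {
    echo "quote user ftp"
    echo "quote pass guest"
    echo "binary"
    echo "cd pub/pdb/data/structures/all/pdb"
    # "|| [ -n ... ]" also picks up a last line without trailing newline
    while read -r name || [ -n "$name" ]; do
        case "$name" in
            ''|'#'*) continue ;;
        esac
        echo "get $name"
    done < "${1:-pdblist}"
    echo "bye"
}

# The transfer itself would then be a single pipeline (left commented out
# so that loading this file does not open a network connection):
# make_ftp_batch pdblist | ftp -n ftp.rcsb.org
```

Running "make_ftp_batch pdblist" on its own only prints the commands, which
makes it easy to check them before starting the transfer. With this
variant the separate "doit" loop is not needed, since all "get" commands
run inside one login.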