Empirically Determined Protein Structures
Source*  Sequence Identity <=  Taxon             Resolution    Entries (July, 2003)  Entries (February, 2004)  Entries (Sept, 2005)
OCA      No limit              All               All (Note 1)  22,000                25,000                    33,000
         Representative        All               "             5,500                 5,600                     5,600
         No limit              Human (Note 3)    "             4,420                 4,520                     6,020
         Representative        Human             "             1,100                 1,100                     1,100

RCSB     50%                   (Note 2)                        6,357                 6,915                     9,744
         All                   "human" (Note 4)                                      4,454                     6,814
         50%                   "human"                                               1,159                     1,828
         70%                                                   7,154
         90%                                                   8,005

Redundancy Among Protein Structures (according to RCSB)

Search                                           Entries  Entries non-redundant  Redundant
All (Note 2)                                     24,168   6,940 (<50%)           71%
                                                          7,779 (<70%)           68%
                                                          8,637 (<90%)           64%

"structural genomics"                            483      393 (<70%)             19%

Deposition before 1980                           65       43 (<70%)              34%
Deposition 1980-1989                             434      208 (<70%)             52%
1990-1994                                        2,659    952 (<70%)             64%
Jan 1 - June 10, 1995                            494      282 (<70%)             43%
1995-1999                                        8,950    3,414 (<70%)           62%
2000-2002                                        9,187    4,013 (<70%)           56%
2003                                             2,642    1,435 (<70%)           46%
Deposition after Sept 15, 2003 (on Feb 7, 2004)  479      331 (<70%)             31%
Deposition after Aug 10, 2003 (on 2/7/04) and    480      318 (<70%)             34%
  "structure not structural genomics"
1980-2004                                        24,085   7,751 (<70%)           68%
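
The Redundant column is consistent with the share of entries that drop out at the stated identity cutoff, i.e. 100 * (1 - non-redundant / entries). The source does not state this formula, so the following Python sketch treats it as an assumption; the function name redundancy_percent is hypothetical. The input figures are copied from the table above.

```python
def redundancy_percent(entries: int, non_redundant: int) -> int:
    """Percent of entries that are redundant, rounded to a whole number.

    Assumed formula (not stated in the source):
    100 * (1 - non_redundant / entries)
    """
    if entries <= 0:
        raise ValueError("entries must be positive")
    return round(100 * (1 - non_redundant / entries))


# (search label, entries, non-redundant entries) copied from the table
rows = [
    ("All, <50%", 24168, 6940),
    ("All, <70%", 24168, 7779),
    ("All, <90%", 24168, 8637),
    ('"structural genomics", <70%', 483, 393),
    ("2003, <70%", 2642, 1435),
]

for label, entries, non_redundant in rows:
    print(f"{label}: {redundancy_percent(entries, non_redundant)}% redundant")
```

Under this assumption the sketch reproduces the tabulated 71%, 68%, 64%, 19%, and 46% for these rows.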

 

Note 1: At OCA, limiting the resolution (e.g. to 3.5 Å) excludes 3,600 NMR results (2/04). These include 860 representative structures and 240 representative human structures. The exclusion of NMR entries accounts for the big drop between unlimited resolution (25,000) and 0.1-3.5 Å (20,000); only about 200 entries have resolutions >3.5 Å. Therefore, more accurate results (including both NMR and X-ray) are obtained when no resolution limit is enforced.

Note 2: At RCSB in July, 2003, one could put "e" in the text query slot to find all entries. This no longer worked in February, 2004. However, one could still find all entries with "deposited after 1970". It is better to use "released after 1970", which avoids counting thousands of on-hold entries.

(In Feb 2004, due to a bug in OCA, it was important to set the upper date to a week before "today" and to request only "released" entries, in order to avoid including unreleased entries.)

OCA gives similar results for "human", "homo", or "man". While it was problematic to get a count for human entries at RCSB in early 2003, by 2/04 "human" and "homo" gave similar results in the new (?) "Source" slot. Details follow.

Note 3: OCA (2/04) reports 5,791 for "human", and 5,100 for "homo" or "homo sapiens" or "man" ("Organism" slot). When "man", "homo", or "homo sapiens" is limited to eukaryota, the count is 4,520. "Human" limited to eukaryota gives 4,601. Counts in the table above are for "homo" limited to eukaryota.

Note 4: At RCSB (2/04):

Source contains "human" gets 4,911 entries.

Source contains "homo" gets 4,486.

Source contains "homo sapiens" gets 4,460.

Source matches exactly "homo sapiens" gets 4,450.

Source matches exactly "homo" gets 0.
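
The pattern in these counts is what one would expect from substring versus exact matching on the Source field. The Python sketch below illustrates this with made-up source strings; the sample values and both helper functions are hypothetical, are not RCSB data or an RCSB interface, and assume case-insensitive matching.

```python
# Hypothetical Source-field values, invented for illustration only.
sources = [
    "homo sapiens",
    "homo sapiens (human)",
    "human",
    "human immunodeficiency virus 1",
    "mus musculus",
]


def count_contains(values, term):
    """Count values that contain term as a case-insensitive substring."""
    term = term.lower()
    return sum(term in value.lower() for value in values)


def count_exact(values, term):
    """Count values equal to term, ignoring case and surrounding whitespace."""
    term = term.lower()
    return sum(value.strip().lower() == term for value in values)


print(count_contains(sources, "human"))         # 3
print(count_contains(sources, "homo"))          # 2
print(count_contains(sources, "homo sapiens"))  # 2
print(count_exact(sources, "homo sapiens"))     # 1
print(count_exact(sources, "homo"))             # 0
```

In this toy data, "contains human" also picks up a viral source, and an exact match on "homo" finds nothing because no field consists of that word alone.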


Non-Redundant Chains (Dunbrack, Hobohm)

 
Source*   Sequence Identity <=  Chains  Resolution <=  R Factor <=  Chain length  Non X-Ray  Alpha Carbon Only  Entries/Chains? (if Entries, within?)
Dunbrack  25%                   1,718   2.0 Å          0.3          20-10K        Excluded   Excluded
(7/03)    25%                   2,656   3.5 Å          0.3          20-10K        Excluded   Excluded
"         25%                   2,666   3.5 Å          0.5          20-10K        Excluded   Excluded
9/05      "                     3,300   "              "            40-10K        "          "                  Entries/no
(7/03)    50%                   4,272   3.5 Å          0.5          20-10K        Excluded   Excluded
"         50%                   5,042   3.5 Å          0.5          20-10K        Included   Excluded
"         50%                   5,089   3.5 Å          0.5          20-10K        Included   Included
"         75%                   6,158   3.5 Å          0.5          20-10K        Included   Included
"         100%                  38,760  3.0 Å          0.5          20-10K        Included   Included

Hobohm    25%                   1,999   3.0 Å          0.3          >30           Excluded?  Excluded?
(4/03)    90%                   6,254   "              "            "             "          "
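
The quality columns in the table above (resolution, R factor, chain length, non-X-ray, alpha-carbon-only) can be read as a per-chain filter that is applied alongside the sequence identity cutoff. The Python sketch below covers only that filter, not the identity culling itself. The Chain record, the passes function, and the sample chains are hypothetical, and skipping the resolution and R factor cutoffs for non-X-ray chains is an assumption, not documented behavior of either server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chain:
    """Hypothetical chain record, for illustration only."""
    name: str
    length: int
    is_xray: bool
    ca_only: bool
    resolution: Optional[float] = None  # in Å; None for non-X-ray chains
    r_factor: Optional[float] = None


def passes(chain, max_resolution, max_r_factor, min_length, max_length,
           include_non_xray, include_ca_only):
    """Return True if the chain meets the quality criteria."""
    if not (min_length <= chain.length <= max_length):
        return False
    if chain.ca_only and not include_ca_only:
        return False
    if not chain.is_xray:
        # Assumption: the two cutoffs below do not apply to non-X-ray chains.
        return include_non_xray
    if chain.resolution is None or chain.r_factor is None:
        return False
    return chain.resolution <= max_resolution and chain.r_factor <= max_r_factor


# Criteria from the first Dunbrack row and from the 50% "Included/Excluded" row.
strict = dict(max_resolution=2.0, max_r_factor=0.3, min_length=20,
              max_length=10_000, include_non_xray=False, include_ca_only=False)
loose = dict(max_resolution=3.5, max_r_factor=0.5, min_length=20,
             max_length=10_000, include_non_xray=True, include_ca_only=False)

chains = [
    Chain("A", 150, True, False, resolution=1.8, r_factor=0.19),
    Chain("B", 150, True, False, resolution=3.2, r_factor=0.28),
    Chain("C", 80, False, False),  # e.g. an NMR structure
    Chain("D", 15, True, False, resolution=1.5, r_factor=0.18),
]

print([c.name for c in chains if passes(c, **strict)])  # ['A']
print([c.name for c in chains if passes(c, **loose)])   # ['A', 'B', 'C']
```

Relaxing the cutoffs and including non-X-ray chains enlarges the pool before identity culling, in line with the larger chain counts in the less restrictive rows.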
               

* Sources:

Dunbrack: http://dunbrack.fccc.edu/PISCES.php

Hobohm: http://homepages.fh-giessen.de/~hg12640/pdbselect/

OCA: http://bioportal.weizmann.ac.il/oca-bin/ocamain

RCSB: http://www.pdb.org