Empirically Determined Protein Structures

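Source* | Sequence Identity <= | Taxon          | Resolution   | Entries, July 2003 | Entries, February 2004
--------+----------------------+----------------+--------------+--------------------+-----------------------
OCA     | No limit             | All            | All (Note 1) | 22,000             | 25,000
        | Representative       | All            | All (Note 1) | 5,500              | 5,600
        | No limit             | Human (Note 3) | All (Note 1) | 4,420              | 4,520
        | Representative       | Human          | All (Note 1) | 1,100              | 1,100
RCSB    | 50%                  |                |              | 6,357              |
        | 70%                  |                |              | 7,154              |
        | 90%                  |                |              | 8,005              |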

 

 

Redundancy Among Protein Structures (according to RCSB)

Search                                          | Entries | Entries non-redundant | Redundant
------------------------------------------------+---------+-----------------------+----------
All (Note 2)                                    | 24,168  | 6,940 (<50%)          | 71%
                                                |         | 7,779 (<70%)          | 68%
                                                |         | 8,637 (<90%)          | 64%
"structural genomics"                           | 483     | 393 (<70%)            | 19%
Deposition before 1980                          | 65      | 43 (<70%)             | 34%
Deposition 1980-1989                            | 434     | 208 (<70%)            | 52%
1990-1994                                       | 2,659   | 952 (<70%)            | 64%
Jan 1 - June 10, 1995                           | 494     | 282 (<70%)            | 43%
1995-1999                                       | 8,950   | 3,414 (<70%)          | 62%
2000-2002                                       | 9,187   | 4,013 (<70%)          | 56%
2003                                            | 2,642   | 1,435 (<70%)          | 46%
Deposition after Sept 15, 2003 (on Feb 7, 2004) | 479     | 331 (<70%)            | 31%
Dep. after Aug 10, 2003 (on 2/7/04) and         |         |                       |
  "structure not structural genomics"           | 480     | 318 (<70%)            | 34%
1980-2004                                       | 24,085  | 7,751 (<70%)          | 68%
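
The "Redundant" column appears to be the share of entries that fall outside the non-redundant set, i.e. 1 minus (non-redundant entries / entries), rounded to a whole percent. The sketch below recomputes it from counts in the table; the function name and the row labels are illustrative only, not part of any RCSB tool.

```python
def redundancy_percent(entries: int, non_redundant: int) -> int:
    """Percent of entries that are redundant, rounded to a whole percent."""
    return round(100 * (1 - non_redundant / entries))


# (label, entries, non-redundant entries, percent listed in the table)
ROWS = [
    ("All, <50% identity", 24168, 6940, 71),
    ("All, <70% identity", 24168, 7779, 68),
    ("All, <90% identity", 24168, 8637, 64),
    ("structural genomics, <70%", 483, 393, 19),
    ("Deposition before 1980, <70%", 65, 43, 34),
    ("2003, <70%", 2642, 1435, 46),
    ("1980-2004, <70%", 24085, 7751, 68),
]

for label, entries, non_redundant, listed in ROWS:
    computed = redundancy_percent(entries, non_redundant)
    print(f"{label}: computed {computed}%, listed {listed}%")
```

The same formula can be applied to the remaining rows of the table.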

 

Note 1: At OCA, setting a resolution limit (e.g. 3.5 Å) excludes the 3,600 NMR entries (2/04), since NMR structures carry no resolution value. These include 860 representative structures and 240 representative human structures. The exclusion of NMR entries accounts for the large drop between unlimited resolution (25,000) and 0.1-3.5 Å (20,000); only about 200 entries have resolutions >3.5 Å. Counts that are meant to cover both NMR and X-ray structures are therefore more accurate when no resolution limit is applied.

Note 2: At RCSB in July 2003, entering "e" in the text query slot returned all entries. This no longer worked in February 2004, but all entries could still be found with "deposited after 1970".

(As of Feb 2004, because of a bug in OCA, the upper date must be set to a week before "today" and only "released" entries requested, in order to avoid counting unreleased entries.)

OCA gives similar results for "human", "homo", or "man". While it was problematic to get a count for human entries at RCSB in early 2003, by 2/04 "human" and "homo" gave similar results in the new (?) "Source" slot. Details follow.

Note 3: OCA (2/04) reports 5,791 for "human" and 5,100 for "homo", "homo sapiens", or "man" ("Organism" slot). Limiting "man", "homo", or "homo sapiens" to eukaryota gives 4,520; limiting "human" to eukaryota gives 4,601. Counts in the table above are for "homo" limited to eukaryota.

At RCSB (2/04):

Source contains "human" gets 4,911 entries.

Source contains "homo" gets 4,486.

Source contains "homo sapiens" gets 4,460.

Source matches exactly "homo sapiens" gets 4,450.

Source matches exactly "homo" gets 0.
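
The gap between "contains" and "matches exactly" is what one would expect from substring versus whole-field comparison: no Source field consists of the single word "homo", so the exact match returns nothing. A minimal sketch of that distinction follows. The field values are invented for illustration, and case-insensitive comparison is an assumption; how RCSB actually implemented these searches is not stated here.

```python
def contains(source: str, term: str) -> bool:
    """True if the term occurs anywhere in the field (case-insensitive)."""
    return term.lower() in source.lower()


def matches_exactly(source: str, term: str) -> bool:
    """True only if the whole field equals the term (case-insensitive)."""
    return source.strip().lower() == term.lower()


# Hypothetical "Source" field values, made up for this example.
SOURCES = [
    "Homo sapiens",
    "Homo sapiens (human)",
    "Human",
    "Human placenta",
    "Mus musculus",
]


def count(predicate, term: str) -> int:
    """Number of hypothetical entries whose Source field satisfies the predicate."""
    return sum(predicate(s, term) for s in SOURCES)


for name, predicate, term in [
    ("contains", contains, "human"),
    ("contains", contains, "homo"),
    ("contains", contains, "homo sapiens"),
    ("matches exactly", matches_exactly, "homo sapiens"),
    ("matches exactly", matches_exactly, "homo"),
]:
    print(f'Source {name} "{term}": {count(predicate, term)}')
```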

 

Non-Redundant Chains (Dunbrack, Hobohm)

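Source*  | Sequence Identity <= | Chains | Resolution <= | R Factor <= | Chain length | Non X-Ray | Alpha Carbon Only
---------+----------------------+--------+---------------+-------------+--------------+-----------+------------------
Dunbrack | 25%                  | 1,718  | 2.0 Å         | 0.3         | 20-10K       | Excluded  | Excluded
(7/03)   | 25%                  | 2,656  | 3.5 Å         | 0.3         | 20-10K       | Excluded  | Excluded
         | 25%                  | 2,666  | 3.5 Å         | 0.5         | 20-10K       | Excluded  | Excluded
         | 50%                  | 4,272  | 3.5 Å         | 0.5         | 20-10K       | Excluded  | Excluded
         | 50%                  | 5,042  | 3.5 Å         | 0.5         | 20-10K       | Included  | Excluded
         | 50%                  | 5,089  | 3.5 Å         | 0.5         | 20-10K       | Included  | Included
         | 75%                  | 6,158  | 3.5 Å         | 0.5         | 20-10K       | Included  | Included
         | 100%                 | 38,760 | 3.0 Å         | 0.5         | 20-10K       | Included  | Included
Hobohm   | 25%                  | 1,999  | 3.0 Å         | 0.3         | >30          | Excluded? | Excluded?
(4/03)   | 90%                  | 6,254  | 3.0 Å         | 0.3         | >30          | Excluded? | Excluded?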
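
Apart from sequence identity, the columns above act as per-chain filters: resolution, R factor, chain length, experimental method, and whether only alpha carbons are present. The sketch below shows how such filters might combine. It is not the Dunbrack or Hobohm code: the `Chain` and `Criteria` classes and their field names are hypothetical, the sequence-identity culling step is omitted, and letting non-X-ray chains bypass the resolution and R-factor limits when they are included is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chain:
    chain_id: str
    method: str                  # e.g. "X-RAY" or "NMR"
    resolution: Optional[float]  # in Å; None when not applicable
    r_factor: Optional[float]
    length: int                  # number of residues
    ca_only: bool                # True if only alpha carbons are present


@dataclass
class Criteria:
    max_resolution: float
    max_r_factor: float
    min_length: int = 20
    max_length: int = 10_000
    include_non_xray: bool = False
    include_ca_only: bool = False


def passes(chain: Chain, c: Criteria) -> bool:
    """Apply the per-chain filters; identity culling would come afterwards."""
    if not (c.min_length <= chain.length <= c.max_length):
        return False
    if chain.ca_only and not c.include_ca_only:
        return False
    if chain.method != "X-RAY":
        # Assumption: included non-X-ray chains skip the X-ray quality limits.
        return c.include_non_xray
    if chain.resolution is None or chain.r_factor is None:
        return False
    return (chain.resolution <= c.max_resolution
            and chain.r_factor <= c.max_r_factor)


# Settings modeled on the first Dunbrack row: 2.0 Å, R factor 0.3.
strict = Criteria(max_resolution=2.0, max_r_factor=0.3)
example = Chain("1ABC_A", "X-RAY", 1.8, 0.21, 150, False)  # made-up chain
print(passes(example, strict))
```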

               

* Sources:

Dunbrack: http://www.fccc.edu/research/labs/dunbrack/pisces/

Hobohm: http://homepages.fh-giessen.de/~hg12640/pdbselect/

OCA: http://bioportal.weizmann.ac.il/oca-bin/ocamain

RCSB: http://www.pdb.org