FigFams: Yet Another Set of Protein Families
What Are the FigFams?
Each individual FigFam is a set of protein sequences together with a decision procedure. All of the protein sequences that make up a single FigFam are believed to implement the same functional role, and all of them are recognizably similar over at least 70% of their length. We think of this as roughly equivalent to "proteins that have the same function and are globally homologous". The decision procedure is a short piece of code (a Perl routine that internally uses BLAST to estimate regions of similarity) that takes a protein sequence as input and returns a decision as to whether or not the new sequence should be added to the FigFam.
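To make the 70% criterion concrete, here is a minimal Python sketch of a coverage test. It is not the Perl routine shipped with the FigFams: the function names, the 1-based inclusive region format, and the idea of applying the threshold to a single sequence are all assumptions made for illustration. It only shows how similarity regions (such as those reported by BLAST) can be merged and compared against a length threshold.

```python
def covered_fraction(regions, seq_len):
    """Fraction of a sequence covered by the union of similarity regions.

    regions: iterable of (start, end) pairs, 1-based and inclusive
             (a hypothetical format chosen for this sketch).
    seq_len: length of the sequence in residues.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(regions):
        # Skip the part of this region that was already counted.
        start = max(start, last_end + 1)
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / seq_len


def passes_coverage(regions, seq_len, min_fraction=0.70):
    """True if the merged regions cover at least min_fraction of the sequence."""
    return covered_fraction(regions, seq_len) >= min_fraction


# Two overlapping regions on a 100-residue protein cover residues 1..80.
print(passes_coverage([(1, 50), (40, 80)], 100))
# A single short region covers only residues 10..30.
print(passes_coverage([(10, 30)], 100))
```

Merging overlapping regions before measuring coverage avoids counting the same residues twice when several local alignments overlap.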
It is now common for major annotation groups to provide protein families, along with associated decision procedures. A number of groups have chosen to use Hidden Markov Model technology to implement the decision procedures they distribute. For a number of reasons, we have not chosen this approach (we describe our efforts to evaluate differing types of decision procedures in a manuscript that will be submitted shortly for publication).
The initial release of the FigFams includes 1,140,115 protein sequences that are grouped into 117,425 families. Approximately 55,000 families contain only two members. On the other hand, about 830,000 protein sequences are members of the approximately 17,000 families that contain 10 or more proteins. We considered removing the families with few members, but decided to leave them in the release. We find them useful, and anyone else using the collection can easily discard them.
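As an illustration of how small families could be discarded, the following Python sketch filters an in-memory mapping from family identifier to member list. The mapping layout, the identifiers, and the function name are hypothetical; the on-disk layout of the release is not described here, so loading it is left out. The last lines simply recompute the average family size from the release totals quoted above.

```python
def drop_small_families(families, min_members=3):
    """Keep only the families with at least min_members sequences.

    families: dict mapping a family identifier to a list of protein
              identifiers (a hypothetical in-memory layout).
    """
    return {
        fam_id: members
        for fam_id, members in families.items()
        if len(members) >= min_members
    }


# Made-up example data, not taken from the release.
example = {
    "family_1": ["protein_a", "protein_b"],
    "family_2": ["protein_c", "protein_d", "protein_e"],
}
print(sorted(drop_small_families(example)))

# Average family size implied by the release totals.
mean_family_size = 1_140_115 / 117_425
print(round(mean_family_size, 1))
```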
Let us briefly explain a few aspects of the project:
- The FigFams are an attempt to form sets of proteins that all perform the same cellular function. The most reliable FigFams are based on manually curated subsystems (see "The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes" for the original reference, or the Subsystem Tutorial for a basic description of what it means to annotate using subsystems). In these cases a manual analysis (ideally by an expert in the specific cellular subsystem) has produced a set of assertions that identify proteins that all implement the same functional role.
- There are two other cases in which two distinct proteins are placed in the same FigFam:
- when we can align two very similar genomes, and with confidence establish a correspondence between the genes in some region, then we will place the corresponding genes in the same FigFam, and
- when we note a case in which proximity on the chromosome has been preserved over a number of genomes, we place the corresponding genes (believed to be playing the same functional role, although we may well not have any idea what role) into the same FigFam.
- We are using the FigFams to support our RAST server (Rapid Annotation using Subsystem Technology). In this context, it is important that we avoid false positives. We would like to correctly identify the functions of the proteins encoded by as many genes as possible, but we set the thresholds to minimize false positives, which means that some genes that could be annotated using FigFams are left unannotated.
- We are making these protein families available with no restrictions.
How Good Are They?
This is, of course, the real question. Why would anyone use these as opposed to several of the other carefully curated sets of protein families? Well, first, you should check out several of the other efforts, and we believe that you will find that each has some areas of strength and some areas of weakness. The connection to the subsystems is certainly the major strength of the FigFams, and we believe that it will continue to become more significant as the collection of subsystems expands.
In any event, let us describe our attempts to offer a quantitative analysis of the decision procedures we have installed:
- For each FigFam, we have taken the protein sequences in the FigFam, along with a large collection of proteins that are not in the FigFam but are similar to those that are in it. We broke this expanded set of sequences into 3 categories: those sequences in the FigFam, those that are not in and have not been assigned the same function as those that are in, and those that are not in but have been assigned the same function. We ignored this last set, but took all of the other sequences, ran them through the decision procedures and tabulated the results.
- We call a sequence from the FigFam that was classified as being in the FigFam a true positive; one that was in the FigFam but was classified as not in we call a false negative; one that was not in (and apparently did not have a matching function) and was classified as in we call a false positive; finally, one that was not in the FigFam and was correctly classified as not in we call a true negative.
- If we compute the sensitivity of the decision procedure as the number of true positives divided by the sum of the true positives and the false negatives, we get a value of 0.874. This means that we failed to classify almost 13% of the sequences that were actually in a FigFam as belonging to it.
- On the other hand, the specificity, which we define here as the number of true positives divided by the sum of the true positives and the false positives (a quantity more commonly called precision or positive predictive value), was 0.975. This means that only about 2.5% of the sequences we classified as belonging to a FigFam should not have been.
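The two ratios described above can be written as small Python functions. The counts in the example are invented purely to exercise the formulas; they are not the counts behind the reported 0.874 and 0.975, which are not given here. The sketch also includes the true negative rate, TN / (TN + FP), because that is the other quantity the word "specificity" often refers to, and it helps to keep the two apart.

```python
def sensitivity(tp, fn):
    """True positives divided by all sequences that really are in the family."""
    return tp / (tp + fn)


def precision(tp, fp):
    """True positives divided by all sequences classified as in the family.

    This is the ratio the text above calls specificity.
    """
    return tp / (tp + fp)


def true_negative_rate(tn, fp):
    """True negatives divided by all sequences that really are not in the family."""
    return tn / (tn + fp)


# Invented counts, for illustration only.
tp, fn, fp, tn = 90, 10, 5, 895

print(f"sensitivity        = {sensitivity(tp, fn):.3f}")
print(f"precision          = {precision(tp, fp):.3f}")
print(f"true negative rate = {true_negative_rate(tn, fp):.3f}")
```

A decision procedure tuned to avoid false positives trades sensitivity for precision, which matches the pattern in the reported values.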
We will certainly try to improve these values by attaching more complex decision procedures to the more problematic FigFams. In fact, we will be attaching error estimates to each of the FigFams in the near future. These will be far from perfect, but we feel that they will give reasonable estimates.
A Simple Illustration of Intended Use
If you have installed a copy of the FigFams on your machine, you will find a simple utility called assign_using_ff. To see how it is intended to be used, just get the proteins from a genome you are familiar with (say, E. coli) and run
assign_using_ff ../FigfamsData.Release.1 < fasta.file > assignments 2> ones.that.could.not.be.assigned
This will create a file called assignments, and that file will be a two-column tab-separated table containing id-function pairs. The program is a simple Perl script, and by studying it you will see the essential functionality needed to embed the use of FigFams in a more comprehensive annotation effort.
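For downstream processing, the two-column table can be read with a few lines of Python. This is a sketch that assumes only what is stated above (one id-function pair per line, separated by a tab); the function name and the sample identifiers are hypothetical.

```python
def read_assignments(lines):
    """Parse tab-separated id/function lines into a dict keyed by id."""
    assignments = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        seq_id, sep, function = line.partition("\t")
        if not sep:
            continue  # ignore lines without a tab
        assignments[seq_id] = function
    return assignments


# Made-up sample lines in the two-column format.
sample = [
    "seq_1\tFIG004507 (not subsystem-based): hypothetical protein\n",
    "seq_2\tsome functional role\n",
]
table = read_assignments(sample)
print(len(table))
print(table["seq_1"])
```

Any iterable of lines works, so an open file handle on the assignments file can be passed in place of the sample list.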
It is worth noting that this little utility includes functionality we have not described above -- it does more than just submit the sequence to a decision procedure for a FigFam. It must take each sequence and estimate which FigFams it might belong to, and then it submits the sequence to individual decision procedures until it gets a "hit" or it runs out of FigFams to try.
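The control flow just described can be sketched in Python as follows. This is not the Perl script itself: the function names and the toy candidate estimator, decision procedures, and family names are stand-ins invented for the example. Only the loop structure (estimate candidates, then try decision procedures until one accepts) comes from the description above.

```python
def assign_function(sequence, candidate_families, decision_procedures, functions):
    """Return (family_id, function) for the first family that accepts the sequence.

    candidate_families:  callable returning family ids worth trying, in order.
    decision_procedures: dict mapping family id to a callable returning True/False.
    functions:           dict mapping family id to its function string.
    Returns None when no candidate family accepts the sequence.
    """
    for fam_id in candidate_families(sequence):
        if decision_procedures[fam_id](sequence):
            return fam_id, functions[fam_id]
    return None


# Toy stand-ins; real decision procedures are similarity based, not motif based.
def toy_candidates(sequence):
    return ["FAM_A", "FAM_B"]


toy_procedures = {
    "FAM_A": lambda seq: seq.startswith("MKT"),
    "FAM_B": lambda seq: "GGS" in seq,
}
toy_functions = {"FAM_A": "hypothetical role A", "FAM_B": "hypothetical role B"}

for seq in ("MKTAAA", "AAGGSAA", "AAAA"):
    print(seq, assign_function(seq, toy_candidates, toy_procedures, toy_functions))
```

Stopping at the first hit keeps the number of decision procedures run per sequence small, at the cost of depending on the order in which candidates are proposed.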
When we tried this little utility on a set of proteins from the E. coli genome (83333.1), it made assignments to approximately 90% of the sequences, and we feel that they were fairly good. Remember that an assignment may simply assert that a sequence is something like
FIG004507 (not subsystem-based): hypothetical protein
if it can place the sequence in a FigFam (in this case FIG004507), but the FigFam function asserts that we cannot yet characterize the proteins in that family.
How Can One Get a Copy?
We provide the release in the form of two files: a README file and a tar file. The tar file is about 8.5 gigabytes. You need to download these, install the families on a Unix system that has Perl with Berkeley DB support, and then check them out. You can download these files from here.
Please feel free to send us suggestions and (within reason) complaints. We sincerely hope that these families will prove useful to you. If you employ them in your research, we would appreciate an acknowledgment using the citation string below.