FIG Genome Annotation System

Introduction

This is the main object for access to the SEED data store. The data store itself is a combination of flat files and a database. The flat files can be moved easily between systems and the database rebuilt as needed.

A reduced set of this object's functions is available via the SFXlate object. The SFXlate object uses a single database to represent all its genomic information. It provides a much smaller capability for updating the data, and eliminates all similarities except for bidirectional best hits.

The key to making the FIG system work is proper configuration of the FIG_Config.pm file. This file contains names and URLs for the key directories as well as the type and login information for the database.

FIG was designed to operate as a series of peer instances. Each instance is updated independently by its owner, and the instances can be synchronized using a process called a peer-to-peer update. The terms SEED instance and peer are used more-or-less interchangeably.

The POD documentation for this module is still in progress, and is provided on an AS IS basis without warranty. If you have a correction and you're not a developer, EMAIL the details to bruce@gigabarb.com and I'll fold it in.

NOTE: The usage example for each method specifies whether it is static

    FIG::something

or dynamic

    $fig->something

If the method is static and has no parameters (FIG::something()) it can also be invoked dynamically. This is a general artifact of the way Perl implements object-oriented programming.
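
The following is a minimal sketch of why a parameterless static function also works as a method call. The package Demo and the sub system_name are hypothetical and are not part of FIG; the point is only that a sub which never reads @_ does not care whether Perl passes it an invocant.

```perl
use strict;
use warnings;

package Demo;    # hypothetical package, used only for illustration

sub new { return bless {}, shift }

# Takes no parameters, so it never looks at @_.
sub system_name { return 'demo' }

package main;

my $obj     = Demo->new;
my $static  = Demo::system_name();     # static call: @_ is empty
my $dynamic = $obj->system_name();     # dynamic call: @_ holds $obj, which is ignored
```

Both calls return the same value because the only difference between them is the unused first argument.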

Hiding/Caching in a FIG object

We save the DB handle, cache taxonomies, and put a few other odds and ends in the FIG object. We expect users to invoke these services using the object $fig constructed using:

    use FIG;
    my $fig = new FIG;

$fig is then used as the basic mechanism for accessing FIG services. It is, of course, just a hash that is used to retain/cache data. The most commonly accessed item is the DB filehandle, which is accessed via $self->db_handle.

We cache genus/species expansions, taxonomies, distances (very crudely estimated) between genomes, and a variety of other things.

Public Methods

new

    my $fig = FIG->new();

This is the constructor for a FIG object. It uses no parameters. If tracing has not yet been turned on, it will be turned on here. The tracing type and level are specified by the configuration variables $FIG_Config::trace_levels and $FIG_Config::trace_type. These defaults can be overridden using the environment variables Trace and TraceType, respectively.

CacheTrick

    my $value = $fig->CacheTrick($self, $field => $evalString);

This is a helper method used to create simple field caching in another object. If the named field is found in $self, then it will be returned directly. Otherwise, the eval string will be executed to compute the value. The value is then cached in the $self object so it can be retrieved easily when needed. Use this method to make a FIG data-access object more like an object created by PPO or ERDB.

self

Hash or blessed object containing the cached fields.

field

Name of the field desired.

evalString

String that can be evaluated to compute the field value.

RETURN

Returns the value of the desired field.
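
As a rough illustration of the caching pattern described for CacheTrick, the sketch below uses a hypothetical stand-alone function, cache_trick_sketch. It assumes the eval string may refer to the lexical $self; whether the real method works that way is not stated here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the caching helper; not the FIG implementation.
sub cache_trick_sketch {
    my ($self, $field, $evalString) = @_;
    # Compute and store the value only on the first request.
    if (! exists $self->{$field}) {
        my $value = eval $evalString;    # string eval can see the lexical $self
        die "Could not evaluate '$evalString': $@" if $@;
        $self->{$field} = $value;
    }
    return $self->{$field};
}

my %feature = (id => 'fig|83333.1.peg.4');
# The first call evaluates the string; the second finds the cached field
# and never evaluates its string at all.
my $first  = cache_trick_sketch(\%feature, idLength => 'length($self->{id})');
my $second = cache_trick_sketch(\%feature, idLength => 'die "not reached"');
```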

go_number_to_term

Returns the GO term for a GO number from the go_number_to_term table in the database.

db_handle

    my $dbh = $fig->db_handle;

Return the handle to the internal DBrtns object. This allows direct access to the database methods.

cached

    my $x = $fig->cached($name);

Return a reference to a hash containing transient data. If no hash exists with the specified name, create an empty one under that name and return it.

The idea behind this method is to allow clients to cache data in the FIG object for later use. (For example, a method might cache feature data so that it can be retrieved later without using the database.) This facility should be used sparingly, since different clients may destroy each other's data if they use the same name.

name

Name assigned to the cached data.

RETURN

Returns a reference to a hash that is permanently associated with the specified name. If no such hash exists, an empty one will be created for the purpose.

get_system_name

    my $name = $fig->get_system_name;

Returns seed, indicating that this object is using the SEED database. The same method on an SFXlate object will return sprout.

DESTROY

The destructor releases the database handle.

same_seqs

    my $sameFlag = FIG::same_seqs($s1, $s2);

Return TRUE if the specified protein sequences are considered equivalent and FALSE otherwise. The sequences should be presented in nr-analysis form, which is in reverse order and upper case with the stop codon omitted.

The sequences will be considered equivalent if the shorter matches the initial portion of the longer one and is no more than 30% smaller. Since the sequences are in nr-analysis form, equivalent start portions mean that the sequences have the same tail. The importance of the tail is that the stop point of a PEG is easier to find than the start point, so a same tail means that the two sequences are equivalent except for the choice of start point.

s1

First protein sequence, reversed and with the stop codon removed.

s2

Second protein sequence, reversed and with the stop codon removed.

RETURN

Returns TRUE if the two protein sequences are equivalent, else FALSE.
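
The comparison rule can be sketched as follows. The function same_seqs_sketch is hypothetical, and it reads "no more than 30% smaller" as "at least 70% of the longer sequence's length", which is an assumption about the exact threshold.

```perl
use strict;
use warnings;

# Hypothetical sketch of the equivalence test; inputs are in nr-analysis form.
sub same_seqs_sketch {
    my ($s1, $s2) = @_;
    # Arrange the pair so that $short is never longer than $long.
    my ($short, $long) = length($s1) <= length($s2) ? ($s1, $s2) : ($s2, $s1);
    # Assumption: the shorter sequence must be at least 70% of the longer one.
    return 0 if length($short) < 0.7 * length($long);
    # The shorter sequence must match the initial portion of the longer one.
    return (substr($long, 0, length($short)) eq $short) ? 1 : 0;
}
```

Because the sequences are reversed, the prefix test here is effectively a comparison of the original tails.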

is_locked_fid

    $fig->is_locked_fid($fid);

Returns 1 if and only if $fid is locked.

lock_fid

    $fig->lock_fid($user,$fid);

Sets a lock on annotations for $fid.

unlock_fid

    $fig->unlock_fid($user,$fid);

Removes the lock on annotations for $fid.

delete_genomes

    $fig->delete_genomes(\@genomes);

Delete the specified genomes from the data store. This requires making system calls to move and delete files.

add_genome

    my $ok = $fig->add_genome($genomeF, $force, $skipnr);

Add a new genome to the data store. A genome's data is kept in a directory by itself, underneath the main organism directory. This method essentially moves genome data from an external directory to the main directory and performs some indexing tasks to integrate it.

genomeF

Name of the directory containing the genome files. This should be a fully-qualified directory name. The last segment of the directory name should be the genome ID.

force

This will ignore errors thrown by verify_genome_directory. This is bad, and you should never do it, but I am in the situation where I need to move a genome from one machine to another, and although I trust the genome I can't.

skipnr

We don't always want to add the proteins into the nr database, for example with a metagenome that has been called by blastx. This will just skip appending the proteins into the NR file.

RETURN

Returns TRUE if successful, else FALSE.

parse_genome_args

    my ($mode, @genomes) = FIG::parse_genome_args(@args);

Extract a list of genome IDs from an argument list. If the argument list is empty, return all the genomes in the data store.

This is a function that is performed by many of the FIG command-line utilities. The user has the option of specifying a list of specific genome IDs or specifying none in order to get all of them. If your command requires additional arguments in the command line, you can still use this method if you shift them out of the argument list before calling. The $mode return value will be all if the user asked for all of the genomes or some if he specified a list of IDs. This is useful to know if, for example, we are loading a table. If we're loading everything, we can delete the entire table; if we're only loading some genomes, we must delete them individually.

This method uses the genome directory rather than the database because it may be used before the database is ready.

args1, args2, ... argsN

List of genome IDs. If all genome IDs are to be processed, then this list should be empty.

RETURN

Returns a list. The first element of the list is all if the user is asking for all the genome IDs and some otherwise. The remaining elements of the list are the desired genome IDs.
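
A minimal sketch of the all/some convention follows. The function parse_genome_args_sketch is hypothetical; it takes the complete genome list as an explicit first argument, whereas the real method is described as reading the genome directory.

```perl
use strict;
use warnings;

# Hypothetical sketch: $allGenomes stands in for the scan of the genome directory.
sub parse_genome_args_sketch {
    my ($allGenomes, @args) = @_;
    # An empty argument list means the caller wants every genome.
    return ('all', @$allGenomes) if ! @args;
    # Otherwise the caller named the genomes explicitly.
    return ('some', @args);
}

# Usage: shift any extra command-line arguments off first, then parse the rest.
my @argv      = ('myTable', '83333.1');    # example command line
my $tableName = shift @argv;
my ($mode, @genomes) = parse_genome_args_sketch(['83333.1', '100226.1'], @argv);
```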

reload_table

    $fig->reload_table($mode, $table, $flds, $xflds, $fileName, $keyList, $keyName);

Reload a database table from a sequential file. If $mode is all, the table will be dropped and re-created. If $mode is some, the data for the individual items in $keyList will be deleted before the table is loaded. Thus, the load process is optimized for the type of reload.

mode

all if we are reloading the entire table, some if we are only reloading specific entries.

table

Name of the table to reload.

flds

String defining the table columns, in SQL format. In general, this is a comma-delimited set of field specifiers, each specifier consisting of the field name followed by the field type and any optional qualifiers (such as NOT NULL or DEFAULT); however, it can be anything that would appear between the parentheses in a CREATE TABLE statement. The order in which the fields are specified is important, since it is presumed that is the order in which they are appearing in the load file.

xflds

Reference to a hash that describes the indexes. The hash is keyed by index name. The value is the index's field list. This is a comma-delimited list of field names in order from most significant to least significant. If a field is to be indexed in descending order, its name should be followed by the qualifier DESC. For example, the following $xflds value will create two indexes, one for name followed by creation date in reverse chronological order, and one for ID.

    { name_index => "name, createDate DESC", id_index => "id" }

fileName

Fully-qualified name of the file containing the data to load. Each line of the file must correspond to a record, and the fields must be arranged in order and tab-delimited. If the file name is omitted, the table is dropped and re-created but not loaded.

keyList

Reference to a list of the IDs for the objects being reloaded. This parameter is only used if $mode is some.

keyName (optional)

Name of the key field containing the IDs in the keylist. If omitted, genome is assumed.

enqueue_similarities

    FIG::enqueue_similarities(\@fids);

Queue the passed Feature IDs for similarity computation. The actual computation is performed by create_sim_askfor_pool. The queue is a persistent text file in the global data directory, and this method essentially writes new IDs on the end of it.

fids

Reference to a list of feature IDs.

export_similarity_request

Creates a similarity computation request from the queued similarities and the current NR.

We keep track of the exported requests in case one gets lost.

create_sim_askfor_pool

    $fig->create_sim_askfor_pool($chunk_size);

Creates an askfor pool, which is a snapshot of the current NR and similarity queue. This process clears the old queue.

The askfor pool needs to keep track of which sequences need to be calculated, which have been handed out, etc. To simplify this task we chunk the sequences into fairly small numbers (20k characters) and allocate work on a per-chunk basis. We make use of the relational database to keep track of chunk status as well as the seek locations into the file of sequence data. The initial creation of the pool involves indexing the sequence data with seek offsets and lengths and populating the sim_askfor_index table with this information and with initial status information.

chunk_size

Number of features to put into a processing chunk. The default is 15.

get_sim_work

    my ($nrPath, $fasta) = $fig->get_sim_work();

Get the next piece of sim computation work to be performed. Returned are the path to the NR and a string containing the fasta data.

sim_work_done

    $fig->sim_work_done($pool_id, $chunk_id, $out_file);

Declare that the work in pool_id/chunk_id has been completed, and output written to the pool directory (get_sim_work gave it the path).

pool_id

The ID number of the pool containing the work that just completed.

chunk_id

The ID number of the chunk completed.

out_file

The file into which the work was placed.

schedule_sim_pool_postprocessing

    $fig->schedule_sim_pool_postprocessing($pool_id);

Schedule a job to do the similarity postprocessing for the specified pool.

pool_id

ID of the pool whose similarity postprocessing needs to be scheduled.

postprocess_computed_sims

    $fig->postprocess_computed_sims($pool_id);

Set up to reduce, reformat, and split the similarities in a given pool. We build a pipe to this pipeline:

    reduce_sims peg.synonyms 300 | reformat_sims nr | split_sims dest prefix

Then we put the new sims in the pool directory, and then copy to NewSims.

pool_id

ID of the pool whose similarities are to be post-processed.

get_active_sim_pools

    @pools = $fig->get_active_sim_pools();

Return a list of the pool IDs for the sim processing queues that have entries awaiting computation.

compute_clusters

    my @clusterList = $fig->compute_clusters(\@pegList, $subsystem, $distance);

Partition a list of PEGs into sections that are clustered close together on the genome. The basic algorithm used builds a graph connecting PEGs to other PEGs close by them on the genome. Each connected subsection of the graph is then separated into a cluster. Singleton clusters are thrown away, and the remaining ones are sorted by length. All PEGs in the incoming list should belong to the same genome, but this is not a requirement. PEGs on different genomes will simply find themselves in different clusters.

pegList

Reference to a list of PEG IDs.

subsystem

Subsystem object for the relevant subsystem. This parameter is not used, but is required for compatibility with Sprout.

distance (optional)

The maximum distance between PEGs that makes them considered close. If omitted, the distance is 5000 bases.

RETURN

Returns a list of lists. Each sub-list is a cluster of PEGs.
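
The graph-based clustering can be sketched as a search for connected components. Everything in this example is an assumption made for illustration: the function cluster_sketch, the input hash mapping each PEG to a contig and a single position, and the choice to sort clusters from largest to smallest.

```perl
use strict;
use warnings;

# Hypothetical sketch. $pos maps each PEG ID to [contig, position].
sub cluster_sketch {
    my ($pos, $distance) = @_;
    $distance = 5000 unless defined $distance;
    my @pegs = sort keys %$pos;
    my (%seen, @clusters);
    for my $start (@pegs) {
        next if $seen{$start}++;
        # Breadth-first walk over PEGs reachable through "close" links.
        my @cluster;
        my @queue = ($start);
        while (defined(my $peg = shift @queue)) {
            push @cluster, $peg;
            for my $other (@pegs) {
                next if $seen{$other};
                next if $pos->{$peg}[0] ne $pos->{$other}[0];
                next if abs($pos->{$peg}[1] - $pos->{$other}[1]) > $distance;
                $seen{$other} = 1;
                push @queue, $other;
            }
        }
        # Singleton clusters are thrown away.
        push @clusters, \@cluster if @cluster > 1;
    }
    # Assumption: larger clusters come first.
    my @sorted = sort { @$b <=> @$a } @clusters;
    return @sorted;
}
```

Note that two PEGs farther apart than the distance limit can still share a cluster if a chain of intermediate PEGs connects them.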

get_sim_pool_info

    my ($total_entries, $n_finished, $n_assigned, $n_unassigned) = $fig->get_sim_pool_info($pool_id);

Return information about the given sim pool.

pool_id

Pool ID of the similarity processing queue whose information is desired.

RETURN

Returns a four-element list. The first is the number of features in the queue; the second is the number of features that have been processed; the third is the number of features that have been assigned to a processor, and the fourth is the number of features left over.

get_local_hostname

    my $result = FIG::get_local_hostname();

Return the local host name for the current processor. The name may be stored in a configuration file, or we may have to get it from the operating system.

get_hostname_by_adapter

    my $name = FIG::get_hostname_by_adapter();

Return the local host name for the current network environment.

get_seed_id

    my $id = FIG::get_seed_id();

Return the Universally Unique ID for this SEED instance. If one does not exist, it will be created.

get_release_info

    my ($name, $id, $inst, $email, $parent_id, $description) = FIG::get_release_info();

Return the current data release information.

The release info comes from the file FIG/Data/RELEASE. It is formatted as:

 <release-name>
 <unique id>
 <institution>
 <contact email>
 <unique id of data release this release derived from>
 <description>

For instance:

 -----
 SEED Data Release, 09/15/2004.
 4148208C-1DF2-11D9-8417-000A95D52EF6
 ANL/FIG
 olson@mcs.anl.gov
 Test release.
 -----

If no RELEASE file exists, this routine will create one with a new unique ID. This lets a peer optimize the data transfer by being able to cache ID translations from this instance.

Title

    my $title = $fig->Title();

Return the title of this database. For SEED, this will return SEED, for Sprout it will return NMPDR, and so forth.

FIG

    my $realFig = $fig->FIG();

Return this object. This method is provided for compatibility with SFXlate.

get_peer_last_update

    my $date = $fig->get_peer_last_update($peer_id);

Return the timestamp from the last successful peer-to-peer update with the given peer. If the specified peer has made updates, comparing this timestamp to the timestamp of the updates can tell you whether or not the updates have been integrated into your SEED data store.

We store this information in FIG/Data/Global/Peers/<peer-id>.

peer_id

Universally Unique ID for the desired peer.

RETURN

Returns the date/time stamp for the last peer-to-peer update performed with the identified SEED instance.

set_peer_last_update

    $fig->set_peer_last_update($peer_id, $time);

Manually set the update timestamp for a specified peer. This informs the SEED that you have all of the assignments and updates from a particular SEED instance as of a certain date.

clean_spaces

Remove any extra spaces from input fields. This will (currently) remove ^\s, \s$, and concatenate multiple spaces into one.

    my $input = $fig->clean_spaces($cgi->param('input'));
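
The three clean-up steps can be sketched with plain substitutions. The function clean_spaces_sketch is hypothetical, and it collapses any run of whitespace (including tabs) to a single space, which may be broader than what the real method does.

```perl
use strict;
use warnings;

# Hypothetical sketch of the clean-up described above.
sub clean_spaces_sketch {
    my ($text) = @_;
    $text =~ s/^\s+//;      # remove leading whitespace
    $text =~ s/\s+$//;      # remove trailing whitespace
    $text =~ s/\s+/ /g;     # collapse interior runs to one space
    return $text;
}
```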

cgi_url

    my $url = $fig->cgi_url();

Return the URL for the CGI script directory.

top_link

    my $url = FIG::top_link();

Return the relative URL for the top of the CGI script directory.

We determine this based on the SCRIPT_NAME environment variable, falling back to FIG_Config::cgi_base if necessary.

temp_url

    my $url = FIG::temp_url();

Return the URL of the temporary file directory.

plug_url

    my $url2 = $fig->plug_url($url);

Change the domain portion of a URL to point to the current domain. This essentially relocates URLs into the current environment.

url

URL to relocate.

RETURN

Returns a new URL with the base portion converted to the current operating host. If the URL does not begin with http://, the URL will be returned unmodified.

file_read

    my $text = $fig->file_read($fileName);

or

    my @lines = $fig->file_read($fileName);

or

    my $text = FIG::file_read($fileName);

or

    my @lines = FIG::file_read($fileName);

Read an entire file into memory. In a scalar context, the file is returned as a single text string with line delimiters included. In a list context, the file is returned as a list of lines, each line terminated by a line delimiter. (For a method that automatically strips the line delimiters, use Tracer::GetFile.)

fileName

Fully-qualified name of the file to read.

RETURN

In a list context, returns a list of the file lines. In a scalar context, returns a string containing all the lines of the file with delimiters included.
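
The context-sensitive return can be sketched with wantarray. The function file_read_sketch is hypothetical, and dying on an open failure is an assumption; the real method's error handling is not described here.

```perl
use strict;
use warnings;

# Hypothetical sketch of a context-sensitive file reader.
sub file_read_sketch {
    my ($fileName) = @_;
    open(my $fh, '<', $fileName) or die "Cannot open $fileName: $!";
    my @lines = <$fh>;      # each line keeps its delimiter
    close $fh;
    # List context gets the lines; scalar context gets one joined string.
    return wantarray ? @lines : join('', @lines);
}
```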

file_head

    my $text = $fig->file_head($fileName, $count);

or

    my @lines = $fig->file_head($fileName, $count);

or

    my $text = FIG::file_head($fileName, $count);

or

    my @lines = FIG::file_head($fileName, $count);

Read a portion of a file into memory. In a scalar context, the file portion is returned as a single text string with line delimiters included. In a list context, the file portion is returned as a list of lines, each line terminated by a line delimiter.

fileName

Fully-qualified name of the file to read.

count (optional)

Number of lines to read from the file. If omitted, 1 is assumed. If the non-numeric string * is specified, the entire file will be read.

RETURN

In a list context, returns a list of the desired file lines. In a scalar context, returns a string containing the desired lines of the file with delimiters included.

min

    my $min = FIG::min(@x);

or

    my $min = $fig->min(@x);

Return the minimum numeric value from a list.

x1, x2, ... xN

List of numbers to process.

RETURN

Returns the numeric value of the list entry possessing the lowest value. Returns undef if the list is empty.

max

    my $max = FIG::max(@x);

or

    my $max = $fig->max(@x);

Return the maximum numeric value from a list.

x1, x2, ... xN

List of numbers to process.

RETURN

Returns the numeric value of the list entry possessing the highest value. Returns undef if the list is empty.

between

    my $flag = FIG::between($x, $y, $z);

or

    my $flag = $fig->between($x, $y, $z);

Determine whether or not $y is between $x and $z.

x

First edge number.

y

Number to examine.

z

Second edge number.

RETURN

Return TRUE if the number $y is between the numbers $x and $z. The check is inclusive (that is, if $y is equal to $x or $z the function returns TRUE), and the order of $x and $z does not matter. If $x is lower than $z, then the return is TRUE if $x <= $y <= $z. If $z is lower, then the return is TRUE if $x >= $y >= $z.
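
The order-independent, inclusive check can be sketched as below; between_sketch is a hypothetical name used only for this example.

```perl
use strict;
use warnings;

# Hypothetical sketch of the inclusive, order-independent range check.
sub between_sketch {
    my ($x, $y, $z) = @_;
    return ($x <= $z) ? ($x <= $y && $y <= $z)
                      : ($z <= $y && $y <= $x);
}
```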

get_organism_info_from_ncbi

    my $code = FIG::get_organism_info_from_ncbi( $taxonomyID );

For a given taxonomy ID, returns a hash containing the scientific name, genetic code, synonyms, and lineage.

standard_genetic_code

    my $code = FIG::standard_genetic_code();

Return a hash containing the standard translation of nucleotide triples to proteins. Methods such as translate can take a translation scheme as a parameter. This method returns the default translation scheme. The scheme is implemented as a reference to a hash that contains nucleotide triplets as keys and has protein letters as values.

translate

    my $aa_seq = &FIG::translate($dna_seq, $code, $fix_start);

Translate a DNA sequence to a protein sequence using the specified genetic code. If $fix_start is TRUE, will translate an initial TTG or GTG code to M. (In the standard genetic code, these two combinations normally translate to L and V, respectively.)

dna_seq

DNA sequence to translate. Note that the DNA sequence can only contain known nucleotides.

code

Reference to a hash specifying the translation code. The hash is keyed by nucleotide triples, and the value for each key is the corresponding protein letter. If this parameter is omitted, the standard_genetic_code will be used.

fix_start

TRUE if the first triple is to get special treatment, else FALSE. If TRUE, then a value of TTG or GTG in the first position will be translated to M instead of the value specified in the translation code.

RETURN

Returns a string resulting from translating each nucleotide triple into a protein letter.
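
A minimal sketch of triplet translation with the start-codon fix follows. The names translate_sketch and sketch_code are hypothetical, the code table is deliberately incomplete, and emitting X for a triplet missing from the table is an assumption.

```perl
use strict;
use warnings;

# A deliberately tiny, incomplete code table for illustration only.
sub sketch_code {
    return { ATG => 'M', TTG => 'L', GTG => 'V', GCT => 'A', AAA => 'K', TAA => '*' };
}

# Hypothetical sketch of the translation loop.
sub translate_sketch {
    my ($dna, $code, $fix_start) = @_;
    $dna = uc $dna;
    my $protein = '';
    # Walk the sequence one complete triplet at a time.
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my $triple = substr($dna, $i, 3);
        $protein .= exists $code->{$triple} ? $code->{$triple} : 'X';
    }
    # An initial TTG or GTG is read as a start codon when requested.
    substr($protein, 0, 1, 'M') if $fix_start && $dna =~ /^[TG]TG/;
    return $protein;
}
```

With this table, translating TTGGCTAAA gives LAK normally and MAK when $fix_start is TRUE.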

reverse_comp

    my $dnaR = FIG::reverse_comp($dna);

or

    my $dnaR = $fig->reverse_comp($dna);

Return the reverse complement of the specified DNA sequence.

NOTE: for extremely long DNA strings, use rev_comp, which allows you to pass the strings around in the form of pointers.

dna

DNA sequence whose reverse complement is desired.

RETURN

Returns the reverse complement of the incoming DNA sequence.

rev_comp

    my $dnaRP = FIG::rev_comp(\$dna);

or

    my $dnaRP = $fig->rev_comp(\$dna);

Return the reverse complement of the specified DNA sequence. The DNA sequence is passed in as a string reference rather than a raw string for performance reasons. If this is unnecessary, use reverse_comp, which processes strings instead of references to strings.

dna

Reference to the DNA sequence whose reverse complement is desired.

RETURN

Returns a reference to the reverse complement of the incoming DNA sequence.
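
The difference between the string and reference styles can be sketched as follows. Both functions are hypothetical, and they complement only A, C, G, and T in either case; handling of ambiguity codes is left out.

```perl
use strict;
use warnings;

# Hypothetical sketch: string in, string out.
sub reverse_comp_sketch {
    my ($dna) = @_;
    my $rev = reverse $dna;             # scalar context reverses the string
    $rev =~ tr/ACGTacgt/TGCAtgca/;      # complement each base
    return $rev;
}

# Hypothetical sketch: reference in, reference out, so the caller's
# string is never copied into the argument list.
sub rev_comp_sketch {
    my ($dnaP) = @_;
    my $rev = reverse $$dnaP;
    $rev =~ tr/ACGTacgt/TGCAtgca/;
    return \$rev;
}
```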

verify_dir

    FIG::verify_dir($dir);

or

    $fig->verify_dir($dir);

Ensure that the specified directory exists. If it must be created, the permissions will be set to 0777.

run

    FIG::run($cmd);

or

    $fig->run($cmd);

Run a command. If the command fails, the error will be traced.

run_gathering_output

    FIG::run_gathering_output($cmd, @args);

or

    $fig->run_gathering_output($cmd, @args);

Run a command, gathering the output. This is similar to the backtick operator, but it does not invoke the shell. Note that the argument list must be explicitly passed one command line argument per argument to run_gathering_output.

If the command fails, the error will be traced.
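
One way to capture output without invoking the shell is the list form of a pipe open, sketched below. The function run_gathering_output_sketch is hypothetical; the context-sensitive return value and the use of warn in place of tracing are assumptions, and the list form of pipe open may not be available on every platform.

```perl
use strict;
use warnings;

# Hypothetical sketch: run a command with an explicit argument list.
sub run_gathering_output_sketch {
    my ($cmd, @args) = @_;
    # The list form passes each argument directly, with no shell parsing.
    open(my $pipe, '-|', $cmd, @args) or die "Cannot run $cmd: $!";
    my @out = <$pipe>;
    close $pipe;
    warn "Command $cmd ended with status $?\n" if $?;
    return wantarray ? @out : join('', @out);
}
```

Because no shell is involved, an argument such as "a b" reaches the command as a single argument rather than being split on the space.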

interpret_error_code

    ($exitcode, $signal, $msg) = &FIG::interpret_error_code($rc);

Determine if the given result code was due to a process exiting abnormally or by receiving a signal.
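
The standard decoding of a Perl wait status can be sketched as below. The function interpret_error_code_sketch is hypothetical, and the exact values and message text the real method returns are not documented here, so the return format is an assumption.

```perl
use strict;
use warnings;

# Hypothetical sketch: split a wait status (as found in $?) into its parts.
sub interpret_error_code_sketch {
    my ($rc) = @_;
    my $signal   = $rc & 127;    # low seven bits: signal number, 0 if none
    my $exitcode = $rc >> 8;     # high byte: the process exit code
    my $msg = $signal   ? "terminated by signal $signal"
            : $exitcode ? "exited with code $exitcode"
            :             "completed normally";
    return ($exitcode, $signal, $msg);
}
```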

augment_path

    FIG::augment_path($dirName);

Add a directory to the system path.

This method adds a new directory to the front of the system path. It looks in the configuration file to determine whether this is Windows or Unix, and uses the appropriate separator.

dirName

Name of the directory to add to the path.

read_fasta_record

    my ($seq_id, $seq_pointer, $comment) = FIG::read_fasta_record(\*FILEHANDLE);

or

    my ($seq_id, $seq_pointer, $comment) = $fig->read_fasta_record(\*FILEHANDLE);

Read and parse the next logical record of a FASTA file. A FASTA logical record consists of multiple lines of text. The first line begins with a > symbol and contains the sequence ID followed by an optional comment. (NOTE: comments are currently deprecated, because not all tools handle them properly.) The remaining lines contain the sequence data.

This method uses a trick to smooth its operation: the line terminator character is temporarily changed to \n> so that a single read operation brings in the entire logical record.

FILEHANDLE

Open handle of the FASTA file. If not specified, STDIN is assumed.

RETURN

If we are at the end of the file, returns undef. Otherwise, returns a three-element list. The first element is the sequence ID, the second is a pointer to the sequence data (that is, a string reference as opposed to a string), and the third is the comment.
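
The line-terminator trick can be sketched as follows. The function read_fasta_record_sketch is hypothetical; returning an empty string for a missing comment and an empty list at end of file are assumptions.

```perl
use strict;
use warnings;

# Hypothetical sketch of reading one logical FASTA record per call.
sub read_fasta_record_sketch {
    my ($fh) = @_;
    local $/ = "\n>";           # a "line" now ends where the next record begins
    my $record = <$fh>;
    return unless defined $record;
    chomp $record;              # drops the trailing "\n>", if present
    $record =~ s/^>//;          # only the first record still has its ">"
    my ($header, @seqLines) = split /\n/, $record;
    my ($id, $comment) = split /\s+/, $header, 2;
    my $seq = join('', @seqLines);
    $seq =~ s/\s+//g;
    return ($id, \$seq, defined $comment ? $comment : '');
}
```

The local keyword restores the normal line terminator as soon as the function returns, so callers are not affected.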

display_id_and_seq

    FIG::display_id_and_seq($id_and_comment, $seqP, $fh);

Display a fasta ID and sequence to the specified open file. This method is designed to work well with read_fasta_record and rev_comp, because it takes as input a string pointer rather than a string. If the file handle is omitted it defaults to STDOUT.

The output is formatted into a FASTA record. The first line of the output is preceded by a > symbol, and the sequence is split into 60-character chunks displayed one per line. Thus, this method can be used to produce FASTA files from data gathered by the rest of the system.

id_and_comment

The sequence ID and (optionally) the comment from the sequence's FASTA record. The ID

seqP

Reference to a string containing the sequence. The sequence is automatically formatted into 60-character chunks displayed one per line.

fh

Open file handle to which the ID and sequence should be output. If omitted, \*STDOUT is assumed.
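
The output layout can be sketched as below. The function format_fasta_sketch is hypothetical and returns the formatted record as a string instead of printing it to a file handle, which makes the 60-character chunking easy to inspect.

```perl
use strict;
use warnings;

# Hypothetical sketch: build a FASTA record from an ID and a sequence reference.
sub format_fasta_sketch {
    my ($id_and_comment, $seqP) = @_;
    # Split the sequence into pieces of at most 60 characters.
    my @chunks = $$seqP =~ /(.{1,60})/g;
    return join("\n", ">$id_and_comment", @chunks) . "\n";
}
```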

display_seq

    FIG::display_seq(\$seqP, $fh);

or

    $fig->display_seq(\$seqP, $fh);

Display a fasta sequence to the specified open file. This method is designed to work well with read_fasta_record and rev_comp, because it takes as input a string pointer rather than a string. If the file handle is omitted it defaults to STDOUT.

The sequence is split into 60-character chunks displayed one per line for readability.

seqP

Reference to a string containing the sequence.

fh

Open file handle to which the sequence should be output. If omitted, STDOUT is assumed.

flatten_dumper

    FIG::flatten_dumper( $perl_ref_or_object_1, ... );
    $fig->flatten_dumper( $perl_ref_or_object_1, ... );

Takes a list of perl references or objects, and "flattens" their Data::Dumper() output so that it can be printed on a single line.

ec_name

    my $enzymatic_function = $fig->ec_name($ec);

Returns the enzymatic name corresponding to the specified enzyme code.

ec

Code number for the enzyme whose name is desired. The code number is actually a string of digits and periods (e.g. 1.2.50.6).

RETURN

Returns the name of the enzyme specified by the indicated code, or a null string if the code is not found in the database.

all_roles

    my @roles = $fig->all_roles;

Return a list of the known roles. Currently, this is a list of the enzyme codes and names.

The return value is a list of list references. Each element of the big list contains an enzyme code (EC) followed by the enzymatic name.

expand_ec

    my $expanded_ec = $fig->expand_ec($ec);

Expands "1.1.1.1" to "1.1.1.1 - alcohol dehydrogenase" or something like that.

clean_tmp

    FIG::clean_tmp();

Delete temporary files more than two days old.

We store temporary files in $FIG_Config::temp. There are specific classes of files that are created and should be saved for at least a few days. This routine can be invoked to clean out those that are over two days old.

genomes

    my @genome_ids = $fig->genomes($complete, $restrictions, $domain);

Return a list of genome IDs. If called with no parameters, all genome IDs in the database will be returned.

Genomes are assigned ids of the form X.Y where X is the taxonomic id maintained by NCBI for the species (not the specific strain), and Y is a sequence digit assigned to this particular genome (as one of a set with the same genus/species). Genomes also have versions, but that is a separate issue.

complete

TRUE if only complete genomes should be returned, else FALSE.

restrictions

TRUE if only restriction genomes should be returned, else FALSE.

domain

Name of the domain from which the genomes should be returned. Possible values are Bacteria, Virus, Eukaryota, unknown, Archaea, and Environmental Sample. If no domain is specified, all domains will be eligible.

RETURN

Returns a list of all the genome IDs with the specified characteristics.

genome_info

    my $info = $fig->genome_info();

Return an array reference of information from the genome table

RETURN

This will return an array reference of genome table entries. All entries of the table will be returned. The columns will be the following:

genome, gname, szdna, maindomain, pegs, rnas, complete, taxonomy

is_complete

    my $flag = $fig->is_complete($genome);

Return TRUE if the genome with the specified ID is complete, else FALSE.

genome

ID of the relevant genome.

RETURN

Returns TRUE if there is a complete genome in the database with the specified ID, else FALSE.

is_genome

    my $flag = $fig->is_genome($genome);

Return TRUE if the specified genome exists, else FALSE.

genome

ID of the genome to test.

RETURN

Returns TRUE if a genome with the specified ID exists in the data store, else FALSE.

assert_genomes

    $fig->assert_genomes(gid, gid, ...);

Assert that the given list of genomes does exist, and allow is_genome() to succeed for them.

This is used in FIG-based computations in the context of the RAST genome-import code, so that genomes that currently exist only in RAST are treated as present for the purposes of FIG.pm-based code.

genome_counts

    my ($arch, $bact, $euk, $vir, $env, $unk) = $fig->genome_counts($complete);

Count the number of genomes in each domain. If $complete is TRUE, only complete genomes will be included in the counts.

complete

TRUE if only complete genomes are to be counted, FALSE if all genomes are to be counted

RETURN

A six-element list containing the number of genomes in each of six categories-- Archaea, Bacteria, Eukaryota, Viral, Environmental, and Unknown, respectively.

genome_domain

    my $domain = $fig->genome_domain($genome_id);

Find the domain of a genome.

genome_id

ID of the genome whose domain is desired.

RETURN

Returns the name of the genome's domain (archaea, bacteria, etc.), or undef if the genome is not in the database.

genome_pegs

    my $num_pegs = $fig->genome_pegs($genome_id);

Return the number of protein-encoding genes (PEGs) for a specified genome.

genome_id

ID of the genome whose PEG count is desired.

RETURN

Returns the number of PEGs for the specified genome, or undef if the genome is not indexed in the database.

genome_rnas

    my $num_rnas = $fig->genome_rnas($genome_id);

Return the number of RNA-encoding genes for a genome if "$genome_id" is indexed in the "genome" database, and undef otherwise.

genome_id

ID of the genome whose RNA count is desired.

RETURN

Returns the number of RNAs for the specified genome, or undef if the genome is not indexed in the database.

genome_szdna

    my $szdna = $fig->genome_szdna($genome_id);

Return the number of DNA base-pairs in a genome's contigs.

genome_id

ID of the genome whose base-pair count is desired.

RETURN

Returns the number of base pairs in the specified genome's contigs, or undef if the genome is not indexed in the database.

genome_version

    my $version = $fig->genome_version($genome_id);

Return the version number of the specified genome.

Versions are incremented for major updates. They are put in as major updates of the form 1.0, 2.0, ...

Users may do local "editing" of the DNA for a genome, but when they do, they increment the digits to the right of the decimal. Two genomes remain comparable only if the versions match identically. Hence, minor updating should be committed only by the person/group responsible for updating that genome.

We can, of course, identify which genes are identical between any two genomes (by matching the DNA or amino acid sequences). However, the basic intent of the system is to support editing by the main group issuing periodic major updates.

genome_id

ID of the genome whose version is desired.

RETURN

Returns the version number of the specified genome, or undef if the genome is not in the data store or no version number has been assigned.

genome_md5sum

    my $md5sum = $fig->genome_md5sum($genome_id);

Returns the MD5 checksum of the specified genome.

The checksum of a genome is defined as the checksum of its signature file. The signature file consists of tab-separated lines, one for each contig, ordered by the contig id. Each line contains the contig ID, the length of the contig in nucleotides, and the MD5 checksum of the nucleotide data, with uppercase letters forced to lower case.

The checksum is indexed in the database. If you know a genome's checksum, you can use the genome_with_md5sum method to find its ID in the database.

genome_id

ID of the genome whose checksum is desired.

RETURN

Returns the specified genome's checksum, or undef if the genome is not in the database.
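
The signature-file definition above can be sketched in a few lines. This is a minimal illustration, not the FIG implementation: the helper name genome_signature_md5, the plain string sort of contig IDs, and the newline-terminated line layout are all assumptions. Only Digest::MD5 (a core Perl module) is used.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: takes a hash reference of contig ID => DNA string
# and returns the MD5 checksum of the resulting signature as a hex string.
sub genome_signature_md5 {
    my ($contigs) = @_;
    my $signature = '';
    # One tab-separated line per contig, ordered by contig ID
    # (a plain string sort is an assumption here).
    for my $contig_id (sort keys %$contigs) {
        my $dna = lc $contigs->{$contig_id};    # force upper case to lower case
        $signature .= join("\t", $contig_id, length($dna), md5_hex($dna)) . "\n";
    }
    return md5_hex($signature);
}

my $md5 = genome_signature_md5({ contigB => 'ACGTAC', contigA => 'ttgaca' });
print "$md5\n";
```

Because the contigs are sorted and lower-cased before hashing, the result does not depend on input order or letter case.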

genome_with_md5sum

    my $genome = $fig->genome_with_md5sum($cksum);

Find a genome with the specified checksum.

The MD5 checksum is computed from the content of the genome (see genome_md5sum). This method can be used to determine if a genome already exists for a specified content.

cksum

Checksum to use for searching the genome table.

RETURN

The ID of a genome with the specified checksum, or undef if no such genome exists.

contig_md5sum

    my $cksum = $fig->contig_md5sum($genome, $contig);

Return the MD5 checksum for a contig. The MD5 checksum is computed from the content of the contig. This method retrieves the checksum stored in the database. The checksum can be compared to the checksum of an external contig as a cheap way of seeing if they match.

genome

ID of the genome containing the contig.

contig

ID of the relevant contig.

RETURN

Returns the checksum of the specified contig, or undef if the contig is not in the database.

md5_of_peg

    my $cksum = $fig->md5_of_peg( $peg );

Return the MD5 checksum for a peg. The MD5 checksum is computed from the uppercase sequence of the protein. This method retrieves the checksum stored in the database.

peg

FIG ID of the peg.

RETURN

Returns the checksum of the specified peg as a hex string, or undef if the peg is not in the database.

get_representative_genome

    my $rep_id = $fig->get_representative_genome($id);

Return the representative genome of the set that contains $id.

id

ID of the genome used for the set lookup.

RETURN

Returns the representative genome of the set that contains $id, or undef if the genome is not in any set.


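The set table is loaded from $FIG_Config::data/Global/genome.sets on first use and cached in the FIG object:

    sub get_representative_genome {
        my ($self, $id) = @_;
        my $repH;

        # Build the genome-to-representative map once and cache it in the object.
        if (! ($repH = $self->{_repG})) {
            # Each row is tab-separated: set ID, then genome ID.
            my @tab = map { [split(/\t/,$_)] } `cat $FIG_Config::data/Global/genome.sets`;
            my $x = shift @tab;
            while ($x)
            {
                # The first genome listed for a set is its representative.
                my $set  = $x->[0];
                my $repG = $x->[1];
                while ($x && ($x->[0] eq $set))
                {
                    $repH->{$x->[1]} = $repG;
                    $x = shift @tab;
                }
            }
            $self->{_repG} = $repH;
        }
        return $repH->{$id};
    }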

pegs_with_md5

    my @pegs = $fig->pegs_with_md5( $md5 );

Return all pegs with a sequence matching the checksum. Thus,

    my @pegs = $fig->pegs_with_md5( $fig->md5_of_peg( $peg ) );

produces all pegs with a sequence identical to that of the query peg.

md5

The md5 checksum as a hex string (32 characters).

RETURN

Returns the list of pegs matching the given md5 checksum.

prots_with_md5

    my @fids = $fig->prots_with_md5( $md5 );

Return all proteins with a sequence matching the checksum, including non-FIG IDs.

md5

The md5 checksum as a hex string (32 characters).

RETURN

Returns the list of protein ids matching the given md5 checksum.

genus_species

    my $gs = $fig->genus_species($genome_id);

Return the genus, species, and possibly also the strain of a specified genome.

This method converts a genome ID into a more recognizable species name. The species name is stored directly in the genome table of the database. Essentially, if the strain is present in the database, it will be returned by this method, and if it's not present, it won't.

genome_id

ID of the genome whose name is desired.

RETURN

Returns the scientific species name associated with the specified ID, or undef if the ID is not in the database.

set_genus_species

    my $gs = $fig->set_genus_species($genome_id, $genus_species_strain);

Sets the contents of the GENOME file of the specified genome ID.

Does not (currently) update the relational DB.

genome_id

ID of the genome whose name is to be set.

genus_species_strain

The new biological name that will correspond to the genome_id.

RETURN

Returns 1 if the write was successful, and undef if the write fails.

org_of

    my $org = $fig->org_of($prot_id);

Return the genus/species name of the organism containing a protein. Note that in this context protein is not a certain string of amino acids but a protein encoding region on a specific contig.

For a FIG protein ID (e.g. fig|134537.1.peg.123), the organism and strain information is always available. In the case of external proteins, we can usually determine an organism, but not anything more precise than genus/species (and often not that). When the organism name is not present, a null string is returned.

prot_id

Protein or feature ID.

RETURN

Returns the displayable scientific name (genus, species, and strain) of the organism containing the identified PEG. If the name is not available, returns a null string. If the PEG is not found, returns undef.

orgid_of_orgname

    my $genomeID = $fig->orgid_of_orgname($genomeName);

Return the ID of the genome corresponding to the specified organism name, or a null string if the genome is not found.

genomeName

Name of the organism, consisting of the organism's genus, species, and unique characterization, separated by spaces.

RETURN

Returns the genome ID number for the named organism, or an empty string if the genome is not found.

orgname_of_orgid

    my $genomeName = $fig->orgname_of_orgid($genomeID);

Return the name of the genome corresponding to the specified organism ID.

genomeID

ID of the relevant genome.

RETURN

Returns the name of the organism, consisting of the organism's genus, species, and unique characterization, separated by spaces, or a null string if the genome is not found.

genus_species_domain

    my ($gs, $domain) = $fig->genus_species_domain($genome_id);

Returns a genome's genus and species (and strain if that has been properly recorded) in a printable form, along with its domain. This method is similar to genus_species, except it also returns the domain name (archaea, bacteria, etc.).

genome_id

ID of the genome whose species and domain information is desired.

RETURN

Returns a two-element list. The first element is the species name and the second is the domain name.

domain_color

    my $web_color = FIG::domain_color($domain);

Return the web color string associated with a specified domain. The colors are extremely subtle (86% luminance), so they absolutely require a black background. Archaea are slightly cyan, bacteria are slightly magenta, eukaryota are slightly yellow, viruses are slightly silver, environmental samples are slightly gray, and unknown or invalid domains are pure white.

domain

Name of the domain whose color is desired.

RETURN

Returns a web color string for the specified domain (e.g. #FFDDFF for bacteria).

org_and_color_of

    my ($org, $color) = $fig->org_and_color_of($prot_id);

Return the best-guess organism name and the domain HTML color string for a protein. In the case of external proteins, we can usually determine an organism, but not anything more precise than genus/species (and often not that).

prot_id

Relevant protein or feature ID.

RETURN

Returns a two-element list. The first element is the displayable organism name, and the second is an HTML color string based on the domain (see domain_color).

partial_genus_matching

Return a list of genome IDs that match a partial genus.

For example, partial_genus_matching("Listeria") will return the IDs of all genomes whose names begin with Listeria. The search can be restricted to complete genomes with a second argument, as in partial_genus_matching("Listeria", 1).

abbrev

    my $abbreviated_name = FIG::abbrev($genome_name);

or

    my $abbreviated_name = $fig->abbrev($genome_name);

Abbreviate a genome name to 10 characters or less.

For alignments and such, it is very useful to be able to produce an abbreviation of genus/species. That's what this does. Note that multiple genus/species might reduce to the same abbreviation, so be careful (disambiguate them, if you must).

The abbreviation is formed from the first three letters of the genus name followed by the first three letters of the species name and then the next four nonblank characters.

genome_name

The name to abbreviate.

RETURN

An abbreviated version of the specified name.

wikipedia_link

    my $wikipedia_link = $fig->wikipedia_link($genome_name);

Check if Wikipedia has a page about this genome. If so, return its URL.

genome_name

The genome to find.

RETURN

The URL of the Wikipedia page.

organism_directory

    my $organism_directory = $fig->organism_directory($genome_id);

Get the directory that contains the organism data. This is just like the FIGV version.

genome_id

The id of the organism, e.g. 83333.1.

RETURN

A string containing the path to the organism directory.

ncbi_contig_description

    my $name = ncbi_contig_description($contig_id);

Looks up the NCBI description line for this contig identifier. Values are cached in the directory $FIG_Config::var/ncbi_contigs.

ftype

    my $type = FIG::ftype($fid);

or

    my $type = $fig->ftype($fid);

Returns the type of a feature, given the feature ID. This just amounts to lifting it out of the feature ID, since features have IDs of the form

        fig|x.y.f.n

where x.y is the genome ID, f is the type of feature, and n is an integer that is unique within the genome/type.

fid

FIG ID of the feature whose type is desired.

RETURN

Returns the feature type (e.g. peg, rna, pi, or pp), or undef if the feature ID is not a FIG ID.
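
The ID layout above lends itself to a single regular expression. The sketch below is illustrative only; parse_fig_id is a hypothetical helper, not a FIG method, and it assumes the genome ID is always two dot-separated integers.

```perl
use strict;
use warnings;

# Hypothetical parser for IDs of the form fig|x.y.f.n.
# Returns (genome ID, feature type, feature number), or an empty list
# if the string is not a FIG ID.
sub parse_fig_id {
    my ($fid) = @_;
    return unless defined $fid;
    if ($fid =~ /^fig\|(\d+\.\d+)\.([^.]+)\.(\d+)$/) {
        return ($1, $2, $3);
    }
    return;
}

my ($genome, $type, $num) = parse_fig_id('fig|83333.1.peg.4');
print "$genome $type $num\n";    # prints "83333.1 peg 4"
```

The same capture groups supply the feature type and the genome ID, which is why both can be lifted from the ID without a database lookup.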

genome_of

    my $genome_id = $fig->genome_of($fid);

or

    my $genome_id = FIG::genome_of($fid);

Return the genome ID from a feature ID.

fid

ID of the feature whose genome ID is desired.

RETURN

If the feature ID is a FIG ID, returns the genome ID embedded inside it; otherwise, it returns undef.

genome_and_peg_of

    my ($genome_id, $peg_number) = FIG::genome_and_peg_of($fid);

or

    my ($genome_id, $peg_number) = $fig->genome_and_peg_of($fid);

Return the genome ID and peg number from a feature ID.

fid

ID of the feature whose genome and PEG number are desired.

RETURN

Returns the genome ID and peg number associated with a feature if the feature is represented by a FIG ID, else undef.

by_fig_id

    my @sorted_by_fig_id = sort { FIG::by_fig_id($a,$b) } @fig_ids;

Compare two feature IDs.

This function is designed to assist in sorting features by ID. The sort is by genome ID followed by feature type and then feature number.

a

First feature ID.

b

Second feature ID.

RETURN

Returns a negative number if the first parameter is smaller, zero if both parameters are equal, and a positive number if the first parameter is greater.
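
A comparator with the ordering described above might look like the following. It is a sketch under stated assumptions: compare_fig_ids is a hypothetical name, the genome ID is compared numerically by its two components, and IDs that do not parse fall back to a string comparison.

```perl
use strict;
use warnings;

# Hypothetical comparator: order by genome ID (first number, then second),
# then feature type, then feature number.
sub compare_fig_ids {
    my ($a_id, $b_id) = @_;
    my @x = $a_id =~ /^fig\|(\d+)\.(\d+)\.([^.]+)\.(\d+)$/;
    my @y = $b_id =~ /^fig\|(\d+)\.(\d+)\.([^.]+)\.(\d+)$/;
    # Fall back to a plain string comparison for anything that is not a FIG ID.
    return $a_id cmp $b_id unless @x && @y;
    return $x[0] <=> $y[0]
        || $x[1] <=> $y[1]
        || $x[2] cmp $y[2]
        || $x[3] <=> $y[3];
}

my @fids   = ('fig|83333.1.peg.10', 'fig|83333.1.peg.2', 'fig|9606.3.rna.1');
my @sorted = sort { compare_fig_ids($a, $b) } @fids;
print join("\n", @sorted), "\n";
```

Comparing the feature number numerically places peg.2 before peg.10, which a plain string sort would not do.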

by_locus

    my @sorted_by_location = sort { FIG::by_locus($a,$b) } @locations;

Compare two locations.

This function is designed to assist in sorting features by location. The sort is by contig ID, followed by left boundary, then by right boundary, then by strand.

a

First location.

b

Second location.

RETURN

Returns a negative number if the first location is to the left of the second, zero if both locations are identical, and a positive number if the first location is to the right of the second.

by_genome_id

    my @sorted_by_genome_id = sort { FIG::by_genome_id($a,$b) } @genome_ids;

Compare two genome IDs.

This function is designed to assist in sorting genomes by ID.

a

First genome ID.

b

Second genome ID.

RETURN

Returns a negative number if the first parameter is smaller, zero if both parameters are equal, and a positive number if the first parameter is greater.

next_feature

    my $feature = $fig->next_feature( \%options );

Locate the next feature (optionally filtered by type) in a contig. The start position for the search can be defined by supplying genome, contig and position, or by supplying a feature id. Feature locations are defined by their midpoint. If a fid is supplied with contig and position, the latter are used to resolve ambiguities in the desired segment of a feature with a complex location.

options

Options:

after => $fid or after => \@fids

Id(s) of features that should precede the returned feature. This is a local operation, and is only meant to resolve features that are otherwise tied in location.

contig => $contig

Name of contig of features.

exclude => $id or exclude => \@ids

Id(s) of features to exclude. Note that features listed with the 'after' option are also excluded (and that is most commonly the desired behavior).

fid => $fid

Alternative to supplying a location. It is possible to supply a fid and contig and position, which allows disambiguating the desired segment of a feature with a complex location.

genome => $genome

Name of genome of features.

position => $position

Feature midpoint must be >= $position. Note that this can be any multiple of 1/2. If the supplied value is negative, the position is taken from the right end of the contig.

type => $type or type => \@types

Type(s) of desired feature (default is any type).

RETURN

Feature id or undef.

previous_feature

    my $feature = $fig->previous_feature( \%options );

Locate the previous feature (optionally filtered by type) in a contig. The start position for the search can be defined by supplying genome, contig and position, or by supplying a feature id. Feature locations are defined by their midpoint. If a fid is supplied with contig and position, the latter are used to resolve ambiguities in the desired segment of a feature with a complex location.

options

Options:

before => $fid or before => \@fids

Id(s) of features that should follow the returned feature. This is a local operation, and is only meant to resolve features that are otherwise tied in location.

contig => $contig

Name of contig of features.

exclude => $id or exclude => \@ids

Id(s) of features to exclude. Note that features listed with the 'before' option are also excluded (and that is most commonly the desired behavior).

fid => $fid

Alternative to supplying a location. It is possible to supply a fid and contig and position, which allows disambiguating the desired segment of a feature with a complex location.

genome => $genome

Name of genome of features.

position => $position

Feature midpoint must be <= $position. Note that this can be any multiple of 1/2. If the supplied value is negative, the position is taken from the right end of the contig.

type => $type or type => \@types

Type(s) of desired feature (default is any type).

RETURN

Feature id or undef.

genes_in_region

    my ($features_in_region, $beg1, $end1) = $fig->genes_in_region($genome, $contig, $beg, $end, $size_limit);

Locate features that overlap a specified region of a contig. This includes features that begin or end outside that region, just so long as some part of the feature can be found in the region of interest.

It is often important to be able to find the genes that occur in a specific region on a chromosome. This routine is designed to provide this information. It returns all genes that overlap positions from $beg through $end in the specified contig.

The $size_limit parameter limits the search process. It is presumed that no features are longer than the specified size limit. A shorter size limit means you'll miss some features; a longer size limit significantly slows the search process. For prokaryotes, a value of 10000 (the default) seems to work best.

genome

ID of the genome containing the relevant contig.

contig

ID of the relevant contig.

beg

Position of the first base pair in the region of interest.

end

Position of the last base pair in the region of interest.

size_limit

Maximum allowable size for a feature. If omitted, 10000 is assumed.

RETURN

Returns a three-element list. The first element is a reference to a list of the feature IDs found. The second element is the position of the leftmost base pair in any feature found. This may be well before the region of interest begins or it could be somewhere inside. The third element is the position of the rightmost base pair in any feature found. Again, this can be somewhere inside the region or it could be well to the right of it.
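
The shape of the return value can be illustrated with an in-memory overlap test. This is only a sketch: features_overlapping and its [fid, left, right] input tuples are hypothetical, and the real method works against the database and the size limit rather than a Perl list.

```perl
use strict;
use warnings;

# Hypothetical overlap scan over a list of [feature ID, left, right] tuples.
# Returns (reference to list of IDs, leftmost base pair, rightmost base pair).
sub features_overlapping {
    my ($features, $beg, $end) = @_;
    my (@found, $min, $max);
    for my $f (@$features) {
        my ($fid, $left, $right) = @$f;
        # Skip features that lie entirely outside the region of interest.
        next if $right < $beg || $left > $end;
        push @found, $fid;
        $min = $left  if !defined $min || $left  < $min;
        $max = $right if !defined $max || $right > $max;
    }
    return (\@found, $min, $max);
}

my @features = (
    ['fig|83333.1.peg.1',  190,  255],
    ['fig|83333.1.peg.2',  337, 2799],
    ['fig|83333.1.peg.3', 2801, 3733],
);
my ($found, $beg1, $end1) = features_overlapping(\@features, 300, 2900);
print join(' ', @$found), " from $beg1 to $end1\n";
```

In this example the third feature starts inside the region but ends beyond it, so the returned right boundary (3733) lies outside the requested range.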

regions_spanned

    my ( [ $contig, $beg, $end ], ... ) = $fig->regions_spanned( $loc );

or

    my ( [ $contig, $beg, $end ], ... ) = FIG::regions_spanned( $loc );

The location of a feature in a scalar context is

    contig_b1_e1, contig_b2_e2, ...   [one contig_b_e for each segment]

This routine takes as input a scalar location in the above form and reduces it to one or more regions spanned by the gene. This involves combining regions in the location string that are on the same contig and going in the same direction. Unlike boundaries_of, which returns one region in which the entire gene can be found, regions_spanned handles wrapping through the origin, features split over contigs, and exons that are not ordered nicely along the chromosome (ugly but true).

loc

The location string for a feature.

RETURN

Returns a list of list references. Each inner list contains a contig ID, a starting position, and an ending position. The starting position may be numerically greater than the ending position (which indicates a backward-traveling gene). It is guaranteed that the entire feature is covered by the regions in the list.

filter_regions

    my  @regions = FIG::filter_regions( $contig, $min, $max,  @regions );

or

    my $regions = FIG::filter_regions( $contig, $min, $max,  @regions );

or

    my @regions = FIG::filter_regions( $contig, $min, $max, \@regions );

or

    my $regions = FIG::filter_regions( $contig, $min, $max, \@regions );

Filter a list of regions to those that overlap a specified section of a particular contig. Region definitions correspond to those produced by regions_spanned. That is, [contig,beg,end]. In the function call, either $contig or $min and $max can be undefined (permitting anything). So, for example,

    my @regions = FIG::filter_regions(undef, 1, 5000, $regionList);

would return all regions in $regionList that overlap the first 5000 base pairs in any contig. Conversely,

    my @regions = FIG::filter_regions('NC_003904', undef, undef, $regionList);

would return all regions on the contig NC_003904.

contig

ID of the contig whose regions are to be passed by the filter, or undef if the contig doesn't matter.

min

Leftmost position of the region used for filtering. Only regions which contain at least one base pair at or beyond this position will be passed. A value of undef is equivalent to zero.

max

Rightmost position of the region used for filtering. Only regions which contain at least one base pair on or before this position will be passed. A value of undef is equivalent to the length of the contig.

regionList

A list of regions, or a reference to a list of regions. Each region is a reference to a three-element list, the first element of which is a contig ID, the second element of which is the start position, and the third element of which is the ending position. (The ending position can be before the starting position if the region is backward-traveling.)

RETURN

In a scalar context, returns a reference to a list of the filtered regions. In a list context, returns the list itself.
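
The filtering rule can be sketched as follows. The function overlapping_regions is a hypothetical stand-in, not the FIG code; for brevity it accepts the region list only as a reference, and it treats an undefined contig, min, or max as "no restriction".

```perl
use strict;
use warnings;

# Hypothetical filter: keep regions that overlap [$min, $max] on $contig.
sub overlapping_regions {
    my ($contig, $min, $max, $regions) = @_;
    my @kept;
    for my $region (@$regions) {
        my ($r_contig, $beg, $end) = @$region;
        # Normalize backward-traveling regions so that $left <= $right.
        my ($left, $right) = $beg <= $end ? ($beg, $end) : ($end, $beg);
        next if defined $contig && $r_contig ne $contig;
        next if defined $min    && $right < $min;
        next if defined $max    && $left  > $max;
        push @kept, $region;
    }
    # Return the list itself in list context, a reference otherwise.
    return wantarray ? @kept : \@kept;
}

my $region_list = [
    ['NC_003904',  100,  900],
    ['NC_003904', 7000, 6000],
    ['NC_000913', 4000, 4500],
];
my @near_start = overlapping_regions(undef, 1, 5000, $region_list);
print scalar(@near_start), " regions overlap the first 5000 base pairs\n";
```

The wantarray check is what lets one function serve both the list and the scalar calling styles shown in the usage lines.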

close_genes

    my @features = $fig->close_genes($fid, $dist);

Return all features within a certain distance of a specified other feature.

This method is a quick way to get genes that are near another gene. It calls boundaries_of to get the boundaries of the incoming gene, then passes the region computed to genes_in_region.

So, for example, if the specified $dist is 500, the method would select a region that extends 500 base pairs to either side of the boundaries for the gene $fid, and pass it to genes_in_region for analysis. The features returned would be those that overlap the selected region. Note that the flaws inherent in genes_in_region are also inherent in this method: if a feature is more than 10000 base pairs long, it may not be caught even though it has an overlap in the specified region.

fid

ID of the relevant feature.

dist

Desired maximum distance.

RETURN

Returns a list of feature IDs for genes that overlap or are close to the boundaries for the specified incoming feature.

adjacent_genes

    my ($left_fid, $right_fid) = $fig->adjacent_genes($fid, $dist);

Return the IDs of the genes immediately to the left and right of a specified feature.

This method gets a list of the features within the specified distance of the incoming feature (using close_genes), and then chooses the two closest to the feature found. If the incoming feature is on the + strand, these are features to the left and the right. If the incoming feature is on the - strand, the features will be returned in reverse order.

fid

ID of the feature whose neighbors are desired.

dist

Maximum permissible distance to the neighbors.

RETURN

Returns a two-element list containing the IDs of the features on either side of the incoming feature.

compute_genome_similarity

Compute a rough estimate of "similarity" between genomes using the following algorithm:

    1.  You need at least five "genes" from each genome (let's work with incomplete as well as complete). You get these by
            a. taking up to 5 of the "universal genes"
            b. supplemented by genes (starting from 1) that are greater than 300 aa
    2.  For each gene from the set, consider the set of similarities for it.
            For each match that covers over 200 aa of the gene,
                if the % identity > 70, count a "too-similar{Genome2}"
                else count a "not-too-similar{Genome2}"
        For each Genome2, if the "too-similar{Genome2}" count > "not-too-similar{Genome2}" count,
            the Genome1-Genome2 matches are too similar;
            else, they are not.

Used for filtering candidate PCHs in remove_clustered_pchs2.pl.

univ_hash

Hash where the keys are the annotations for the universal proteins to be used in the similarity computation.

match_len

Minimum length of similarity match required to be considered for genome similarity.

num_genes

Number of genes to consider for the computation.

RETURN

List of lists of the form [genome2, is-similar, count of too-similar hits, count of not-too-similar hits]
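
The voting in step 2 can be sketched independently of the similarity search. Everything here is an assumption made for illustration: the helper name tally_similarity, the [genome2, match length, percent identity] hit tuples, and the sorted output order. Gene selection (step 1) is not shown.

```perl
use strict;
use warnings;

# Hypothetical tally for step 2. Each hit is
# [genome2, match length in aa, percent identity].
sub tally_similarity {
    my ($hits, $match_len, $max_identity) = @_;
    my (%too_similar, %not_too_similar);
    for my $hit (@$hits) {
        my ($genome2, $len, $identity) = @$hit;
        next unless $len > $match_len;    # ignore short matches
        if ($identity > $max_identity) { $too_similar{$genome2}++ }
        else                           { $not_too_similar{$genome2}++ }
    }
    my %seen = (%too_similar, %not_too_similar);
    # One row per genome: [genome2, is-similar, too-similar, not-too-similar].
    return map {
        my $too = $too_similar{$_}     || 0;
        my $not = $not_too_similar{$_} || 0;
        [$_, ($too > $not ? 1 : 0), $too, $not];
    } sort keys %seen;
}

my @hits = (
    ['83333.1',  250, 95], ['83333.1',  300, 80], ['83333.1',  220, 40],
    ['224308.1', 400, 35], ['224308.1', 150, 99],
);
for my $row (tally_similarity(\@hits, 200, 70)) {
    print join("\t", @$row), "\n";
}
```

With the thresholds from the description (200 aa, 70% identity), the first genome in this made-up data collects two too-similar votes against one and is flagged, while the second is not.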

feature_location

    my $loc = $fig->feature_location($fid);

or

    my @loc = $fig->feature_location($fid);

Return the location of a feature. The location consists of a list of (contigID, begin, end) triples encoded as strings with an underscore delimiter. So, for example, NC_002755_100_199 indicates a location starting at position 100 and extending through 199 on the contig NC_002755. If the location goes backward, the start location will be higher than the end location (e.g. NC_002755_199_100).

In a scalar context, this method returns the locations as a comma-delimited string

    NC_002755_100_199,NC_002755_210_498

In a list context, the locations are returned as a list

    (NC_002755_100_199, NC_002755_210_498)

fid

ID of the feature whose location is desired.

RETURN

Returns the locations of a feature, either as a comma-delimited string or a list.
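
Because contig IDs such as NC_002755 contain underscores themselves, a location string has to be split from the right. The sketch below shows one way to do that; parse_location is a hypothetical helper, not part of FIG.

```perl
use strict;
use warnings;

# Hypothetical parser: split a comma-delimited location string into
# [contig, beg, end] triples. The greedy (.+) keeps any underscores in
# the contig ID and leaves the last two numeric fields for beg and end.
sub parse_location {
    my ($location) = @_;
    my @segments;
    for my $piece (split /\s*,\s*/, $location) {
        if ($piece =~ /^(.+)_(\d+)_(\d+)$/) {
            push @segments, [$1, $2, $3];
        }
    }
    return @segments;
}

for my $seg (parse_location('NC_002755_100_199,NC_002755_210_498')) {
    my ($contig, $beg, $end) = @$seg;
    print "$contig: $beg to $end\n";
}
```

The same pattern yields the contig, the beginning point, and the ending point of each segment in one pass.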

contig_of

    my $contigID = $fig->contig_of($location);

Return the ID of the contig containing a location.

This method only works with SEED-style locations (contigID_beg_end). For more comprehensive location parsing, use the Location object.

location

A SEED-style location (contigID_beg_end), or a comma-delimited list of SEED-style locations. In the latter case, only the first location in the list will be processed.

RETURN

Returns the contig ID from the first location in the incoming string.

beg_of

    my $beg = $fig->beg_of($location);

Return the beginning point of a location.

This method only works with SEED-style locations (contigID_beg_end). For more comprehensive location parsing, use the Location object.

location

A SEED-style location (contigID_beg_end), or a comma-delimited list of SEED-style locations. In the latter case, only the first location in the list will be processed.

RETURN

Returns the beginning point from the first location in the incoming string.

end_of

    my $end = $fig->end_of($location);

Return the ending point of a location.

This method only works with SEED-style locations (contigID_beg_end). For more comprehensive location parsing, use the Location object.

location

A SEED-style location (contigID_beg_end), or a comma-delimited list of SEED-style locations. In the latter case, only the first location in the list will be processed.

RETURN

Returns the ending point from the first location in the incoming string.

upstream_of

    my $dna = $fig->upstream_of($peg, $upstream, $coding);

Return the DNA immediately upstream of a feature. This method contains code lifted from the upstream.pl script.

peg

ID of the feature whose upstream DNA is desired.

upstream

Number of base pairs considered upstream.

coding

Number of base pairs inside the feature to be included in the upstream region.

RETURN

Returns the DNA sequence upstream of the feature's begin point and extending into the coding region. Letters inside a feature are in upper case and inter-genic letters are in lower case. A hyphen separates the true upstream letters from the coding region.

strand_of

    my $strand = $fig->strand_of($location);

Return the strand (+ or -) of a location.

This method only works with SEED-style locations (contigID_beg_end). For more comprehensive location parsing, use the Location object.

location

A comma-delimited list of SEED-style locations (contigID_beg_end).

RETURN

Returns + if the list describes a forward-oriented location, and - if the list describes a backward-oriented location.
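
Since a backward location has a start position higher than its end position, the strand can be read off the coordinates. The following sketch assumes the orientation is judged from the first segment only; strand_of_location is a hypothetical name and may not match how FIG decides.

```perl
use strict;
use warnings;

# Hypothetical strand test based on the first segment of a location string.
sub strand_of_location {
    my ($location) = @_;
    my ($first) = split /,/, $location;
    if (defined $first && $first =~ /^.+_(\d+)_(\d+)$/) {
        return $1 <= $2 ? '+' : '-';
    }
    return undef;    # not a SEED-style location
}

print strand_of_location('NC_002755_100_199'), "\n";    # prints "+"
print strand_of_location('NC_002755_199_100'), "\n";    # prints "-"
```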

find_contig_with_checksum

    my $contigID = $fig->find_contig_with_checksum($genome, $checksum);

Find a contig in the given genome with the given checksum.

This method is useful for determining if a particular contig has already been recorded for the given genome. The checksum is computed from the contig contents, so a matching checksum indicates that the contigs may have the same content.

genome

ID of the genome whose contigs are to be examined.

checksum

Checksum value for the desired contig.

RETURN

Returns the ID of a contig in the given genome that has the caller-specified checksum, or undef if no such contig exists.

contig_checksum

    my $checksum = $fig->contig_checksum($genome, $contig);

or

    my @checksum = $fig->contig_checksum($genome, $contig);

Return the checksum of the specified contig. The checksum is computed from the contig's content in a parallel process. The process returns a space-delimited list of numbers. The numbers can be split into a real list if the method is invoked in a list context. For b

read_contig

Read a single contig from the contigs file.

boundaries_of

    my ($contig, $beg, $end) = $fig->boundaries_of($loc);

The location of a feature in a scalar context is

    contig_b1_e1,contig_b2_e2,...   [one contig_b_e for each exon]

This routine takes as input such a location and reduces it to a single description of the entire region containing the gene.
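
One plausible reduction is sketched below: take the contig and begin point of the first part and the end point of the last part, and give up if the two parts are on different contigs. The helper span_of_location is hypothetical and is not claimed to match the FIG code line for line.

```perl
use strict;
use warnings;

# Hypothetical reduction of a multi-part location to (contig, beg, end).
# Returns an empty list if the location cannot be reduced.
sub span_of_location {
    my ($location) = @_;
    my @parts = split /,/, $location;
    return unless @parts;
    my ($contig1, $beg) = $parts[0]  =~ /^(.+)_(\d+)_\d+$/ or return;
    my ($contig2, $end) = $parts[-1] =~ /^(.+)_\d+_(\d+)$/ or return;
    # A single region only makes sense when both ends share a contig.
    return unless $contig1 eq $contig2;
    return ($contig1, $beg, $end);
}

my ($contig, $beg, $end) = span_of_location('NC_002755_100_199,NC_002755_210_498');
print "$contig $beg $end\n";    # prints "NC_002755 100 498"
```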

boundaries_of_2

    \@regions = $fig->boundaries_of_2( $location );
     @regions = $fig->boundaries_of_2( $location );

Locations can be a list of intervals (contig_beg_end), but the intervals need not be on a single contig, contiguous, or in a consistent orientation (e.g., a feature that wraps from the end to the beginning of a genome, or a trans-spliced protein). This function defines a region of a gene as a sequence of parts of the location that are on the same contig, in the same orientation, and with end points that progress along the contig in the same direction as the individual parts. This function is a generalization of boundaries_of(). The latter function returns undef if the first and last contigs are not the same, and returns a location spanning nearly the entire contig if the location spans the origin.

location

contig1_beg1_end1,contig2_beg2_end2,...

@regions
   ( [contig, beg, end], [contig, beg, end], ... )

where consecutive location intervals with a common contig, direction, and consistent direction of progression along the contig are merged. The vast majority of genes will be reduced to a single region, which is that returned by boundaries_of(). That is, most of the time:

    boundaries_of( $loc )

is the same as

    @{ boundaries_of_2( $loc )->[0] || [] }
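
The merging rule can be sketched as a single pass over the intervals. This is an illustration under assumptions, not the FIG implementation: merge_segments is a hypothetical name, and an interval of length one is treated as forward-oriented.

```perl
use strict;
use warnings;

# Hypothetical merge: consecutive intervals are combined when they share a
# contig and orientation and the next interval continues in that direction.
sub merge_segments {
    my ($location) = @_;
    my @regions;
    for my $piece (split /,/, $location) {
        my ($contig, $beg, $end) = $piece =~ /^(.+)_(\d+)_(\d+)$/ or next;
        my $dir  = $end >= $beg ? 1 : -1;
        my $last = $regions[-1];
        if ($last) {
            my ($l_contig, $l_beg, $l_end) = @$last;
            my $l_dir = $l_end >= $l_beg ? 1 : -1;
            # Extend the current region only if this part moves onward
            # along the contig in the region's own direction.
            if ($l_contig eq $contig && $l_dir == $dir && ($beg - $l_end) * $dir > 0) {
                $last->[2] = $end;
                next;
            }
        }
        push @regions, [$contig, $beg, $end];
    }
    return wantarray ? @regions : \@regions;
}

# A feature that wraps through the origin stays as two regions.
for my $region (merge_segments('c1_900_1000,c1_1_50')) {
    print join('_', @$region), "\n";
}
```

Two ordinary exons such as c1_100_199,c1_210_498 collapse to the single region c1_100_498, which is the common case described above.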

all_features_detailed

    my $featureList = $fig->all_features_detailed($genome);

Returns a list of all features in the designated genome, with their location, alias, and type information included. This is used in the GenDB import and Sprout load to speed up the process.

Deleted features are not returned!

genome

ID of the genome whose features are desired.

RETURN

Returns a reference to a list of tuples. Each tuple consists of four elements: (1) the feature ID, (2) the feature location (as a comma-delimited list of location specifiers), (3) the feature aliases (as a comma-delimited list of named aliases), and (4) the feature type.

all_features_detailed_fast

    my $featureList = $fig->all_features_detailed_fast($genome, $min, $max, $contig);

Returns a list of all features in the designated genome, with various useful information included.

Deleted features are not returned!

genome

ID of the genome whose features are desired.

min (optional)

If specified, the minimum contig location of interest. Features not entirely to the right of this location are ignored.

max (optional)

If specified, the maximum contig location of interest. Features not entirely to the left of this location are ignored.

contig (optional)

If specified, the contig of interest. Features not on this contig are ignored.

RETURN

Returns a reference to a list of tuples. Each tuple consists of nine elements: (1) the feature ID, (2) the feature location (as a comma-delimited list of location specifiers), (3) the feature aliases (as a comma-delimited list of named aliases), (4) the feature type, (5) the leftmost index of the feature's first location, (6) the rightmost index of the feature's last location, (7) the current functional assignment, (8) the user who made the assignment, and (9) the quality of the assignment (which is usually an empty string).

all_features

    my @fidList = $fig->all_features($genome,$type);

Returns a list of all feature IDs of a specified type in the designated genome. You would usually use just

    $fig->pegs_of($genome) or
    $fig->rnas_of($genome)

which simply invoke this routine.

genome

ID of the genome whose features are desired.

type (optional)

Type of feature desired (peg, rna, etc.). If omitted, all features are returned.

RETURN

Returns a list of the IDs for the desired features.

pegs_of

usage: $fig->pegs_of($genome)

Returns a list of all PEGs in the specified genome. Note that order is not specified.

rnas_of

usage: $fig->rnas_of($genome)

Returns a list of all RNAs for the given genome.

feature_aliases

usage: @aliases = $fig->feature_aliases($fid) OR $aliases = $fig->feature_aliases($fid)

Returns a list of aliases (gene IDs, arbitrary numbers assigned by authors, etc.) for the feature. These must come from the tbl files, so add them there if you want to see them here.

In a scalar context, the aliases come back with commas separating them.

uniprot_aliases

    my @aliases = $fig->uniprot_aliases($fid)
    OR
    my $aliases = $fig->uniprot_aliases($fid)

Return the uniprot aliases (SwissProt, TREMBL and UniProt) for a PEG.

The aliases returned may be from a different organism than the organism of the input feature $fid.

A call to get_corresponding_ids is done first and will return the same-sequence same-genome ids. If none are found, mapped_prot_ids is called which will give the same-sequence ids.

If you need to know which form of alias is being returned, call these methods directly.

Only one id is returned for every accession found. Example 1: If both uni|Q8FLC2 and uni|Q8FLC2_ECOL6 are found in the database, only uni|Q8FLC2 will be returned. Example 2: If sp|P75616 and uni|P75616 are found in the database, only sp|P75616 will be returned. The order of preference here is sp before tr before uni.

fid

Feature ID of the PEG whose aliases are desired.

RETURN

Depending on the context of the call, either a list of aliases (sp, tr and uni) is returned, or a comma-separated string. If no aliases are found, the empty list or string will be returned.
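The "one id per accession" rule and the sp/tr/uni preference order can be illustrated with a short sketch. This is not the code used by uniprot_aliases. The helper name preferred_uniprot_aliases is hypothetical, and the sketch assumes that an ID such as Q8FLC2_ECOL6 can be tied to accession Q8FLC2 by dropping everything after the first underscore.

```perl
use strict;
use warnings;

# Lower rank wins: sp before tr before uni, as described above.
my %rank = (sp => 0, tr => 1, uni => 2);

# Hypothetical helper: keep one alias per accession.
sub preferred_uniprot_aliases {
    my @aliases = @_;
    my (%best, @order);
    for my $alias (@aliases) {
        my ($prefix, $id) = $alias =~ /^(sp|tr|uni)\|(\S+)$/ or next;
        # ASSUMPTION: an ID such as Q8FLC2_ECOL6 belongs to accession Q8FLC2.
        (my $accession = $id) =~ s/_.*$//;
        my $candidate = [$alias, $rank{$prefix}, length $id];
        my $current = $best{$accession};
        if (!$current) {
            push @order, $accession;
            $best{$accession} = $candidate;
        }
        elsif ($candidate->[1] < $current->[1]
            || ($candidate->[1] == $current->[1] && $candidate->[2] < $current->[2])) {
            # Better prefix, or same prefix with the shorter (bare) ID.
            $best{$accession} = $candidate;
        }
    }
    return map { $best{$_}[0] } @order;
}

print join(",", preferred_uniprot_aliases(
    'uni|Q8FLC2', 'uni|Q8FLC2_ECOL6', 'uni|P75616', 'sp|P75616')), "\n";
```

With the two examples from the text as input, the sketch keeps uni|Q8FLC2 and sp|P75616.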

uniprot_aliases_bulk

    my $hash = $fig->uniprot_aliases_bulk(\@fids, $no_del_check);

Return a hash mapping the specified feature IDs to lists of their uniprot aliases.

fids

A list of FIG feature IDs.

no_del_check

If TRUE, deleted feature IDs will not be removed from the feature ID list before processing. The default is FALSE, which means deleted feature IDs will be removed before processing.

RETURN

Returns a hash mapping each feature ID to a list of its uniprot aliases.

rewrite_db_xrefs_brc

Convert an alias to a db_xref. This uses the BRC format db_xref, which is a conglomeration of NCBI, GO, and BioMoby.

This method will return a correctly formatted db_xref if the argument is in one of our currently recognized formats; otherwise it returns undef.

This example code should provide the functions you want:

    foreach my $alias ($fig->feature_aliases($peg)) {
        if (my $dbxref = $fig->rewrite_db_xrefs_brc($alias)) {
            print "The dbxref is $dbxref\n";
        } else {
            print "The alias is $alias\n";
        }
    }

For a list of approved dbxrefs, see http://www.brc-central.org/cgi-bin/brc-central/dbxref_list.cgi

by_alias

usage: $peg = $fig->by_alias($alias)

Returns a FIG id if the alias can be converted. Right now we convert aliases of the form NP_* (RefSeq IDs), gi|* (GenBank IDs), sp|* (SwissProt), uni|* (UniProt), kegg|* (KEGG), and maybe a few more.

by_raw_alias

usage: $peg = $fig->by_raw_alias($alias)

Returns all FIG ids having the given alias. Unlike by_alias, we do not attempt any kind of normalization. I'm not sure this function is needed, but by_alias searches only in the ext_alias table, whereas here I'm searching in the features table. ext_alias does not have all external aliases, which is keeping my code from working. In particular, it lacks EnsemblGene. It would be nice to combine these two functions. -Ed

sub by_raw_alias {
    my($self,$alias) = @_;
    my($rdbH,$relational_db_response);
    my ($peg);

    $rdbH = $self->db_handle;
    if (($relational_db_response = $rdbH->SQL("SELECT id FROM features WHERE aliases LIKE \'%,$alias,%\'")) && (@$relational_db_response > 0)) {
        if (@$relational_db_response == 1) {
            $peg = $relational_db_response->[0]->[0];
            return wantarray() ? ($peg) : $peg;
        } elsif (wantarray()) {
            return map { $_->[0] } @$relational_db_response;
        }
    }
    return wantarray() ? () : "";
}

sub to_alias {
    my($self,$fid,$type) = @_;

    my @aliases = $self->feature_aliases($fid);
    if ($type)
    {
        @aliases = grep { $_ =~ /^$type\|/ } @aliases;
    }
    if (wantarray())
    {
        return @aliases;
    }
    elsif (@aliases > 0)
    {
        return $aliases[0];
    }
    else
    {
        return "";
    }
}

possibly_truncated

usage: $fig->possibly_truncated($feature_id) or $fig->possibly_truncated($genome, $loc)

Returns the empty string if the feature or location is not near either end of a contig.

Returns 'stop' if the feature or location is on the 'plus' strand and near the end of a contig, or is on the 'minus' strand and near the beginning of the contig.

Returns 'start' if the feature or location is on the 'plus' strand and near the beginning of a contig, or is on the 'minus' strand and near the end of the contig.

Possibly truncated STOPs have return priority over possibly truncated STARTs.
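The rules above can be sketched as a small stand-alone function. This is an illustration only, not the FIG implementation: the helper name truncation_status and the $margin parameter (how close to a contig end counts as "near") are assumptions, since the actual threshold is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper. $beg > $end means the location is on the minus strand.
# $margin is an ASSUMED "nearness" threshold in base pairs.
sub truncation_status {
    my ($beg, $end, $contigLen, $margin) = @_;
    my $plus = $beg <= $end;
    my ($left, $right) = $plus ? ($beg, $end) : ($end, $beg);
    my $nearContigStart = $left <= $margin;
    my $nearContigEnd   = $right > $contigLen - $margin;
    # The STOP lies toward the contig end on the plus strand and toward
    # the contig beginning on the minus strand.
    my $stopTruncated  = $plus ? $nearContigEnd   : $nearContigStart;
    my $startTruncated = $plus ? $nearContigStart : $nearContigEnd;
    return 'stop'  if $stopTruncated;     # STOP has priority over START
    return 'start' if $startTruncated;
    return '';
}

for my $loc ([5, 900], [9800, 9990], [9990, 9800], [900, 5], [4000, 5000]) {
    my $status = truncation_status(@$loc, 10000, 300);
    printf "%d_%d: '%s'\n", @$loc, $status;
}
```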

possible_frameshift

USAGE:

my $fs = $fig->possible_frameshift($peg);

RETURNS:

A pointer to a list of the form [ContigName,BegOfRegionContaining,EndOfContainingRegion,DNAofContaining,TemplatePEGid] if a possible frameshift is found, and boolean FALSE otherwise.

merge

Merge two HSPs unless their overlap or separation is too large.

RETURNS: Merged boundaries if merger succeeds, and undef if merger fails.

map_peg_to_ids

    my ($gnum, $pnum) = $fig->map_peg_to_ids($peg);

Map a peg ID to a pair of numbers describing that peg.

In order to conserve storage and increase performance for some operations (the functional coupling computation, for instance), we provide a mechanism by which a full peg (of the form fig|X.Y.peg.Z) is mapped to a pair of integers: a genome number and a PEG index. We maintain a table genome_mapping that retains the mapping between genome ID and local genome number. No effort is expended to ensure this mapping is at all coherent between SEED instances; this is purely a local mechanism for performance enhancement.

$peg

ID of the peg to be mapped.

RETURN

A pair of numbers ($gnum, $pnum)
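As a rough sketch of the idea, the mapping can be modeled with an in-memory hash standing in for the genome_mapping table. The helper name sketch_map_peg_to_ids, the hash, and the policy of handing out genome numbers on first use are all assumptions made for illustration; the real method works against the database.

```perl
use strict;
use warnings;

# Stand-in for the genome_mapping table: genome ID => local genome number.
my %genome_mapping;
my $next_gnum = 1;

# Hypothetical helper mirroring the documented behavior.
sub sketch_map_peg_to_ids {
    my ($peg) = @_;
    # Split fig|X.Y.peg.Z into the genome ID (X.Y) and the PEG index (Z).
    my ($genome, $pnum) = $peg =~ /^fig\|(\d+\.\d+)\.peg\.(\d+)$/
        or return;
    # ASSUMPTION: a genome gets the next free local number on first use.
    $genome_mapping{$genome} //= $next_gnum++;
    return ($genome_mapping{$genome}, $pnum + 0);
}

for my $peg ('fig|83333.1.peg.4', 'fig|100226.1.peg.1004', 'fig|83333.1.peg.9') {
    my ($gnum, $pnum) = sketch_map_peg_to_ids($peg);
    print "$peg => ($gnum, $pnum)\n";
}
```

Because the numbers are assigned locally, the same PEG may map to a different pair on another SEED instance.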

abstract_coupled_to

    my @coupled_to = $fig->abstract_coupled_to($peg);

Return a list of functionally coupled PEGs.

peg

ID of the protein-encoding gene (PEG) whose functionally-coupled proteins are desired.

RETURN

Returns a list of 4-tuples, each consisting of the ID of a coupled PEG, a score, a "type" which indicates the method that produced the score, and "extra data" in the form of a pointer to a list. If there are no PEGs functionally coupled to the incoming PEG, it will return an empty list. If the PEG data is not present, it will return an empty list.

coupled_to

    my @coupled_to = $fig->coupled_to($peg);

Return a list of functionally coupled PEGs.

The new form of coupling and evidence computation is based on precomputed data. The old form took minutes to dynamically compute things when needed. The old form still works if the directory Data/CouplingData is not present. If it is present, it is assumed to contain comprehensive coupling data in the form of precomputed scores and PCHs.

If Data/CouplingData is present, this routine returns a list of 2-tuples [Peg,Sc]. It returns the empty list if the peg is not coupled. It returns undef if Data/CouplingData is not there.

peg

ID of the protein-encoding gene (PEG) whose functionally-coupled proteins are desired.

RETURN

Returns a list of 2-tuples, each consisting of the ID of a coupled PEG and a score. If there are no PEGs functionally coupled to the incoming PEG, it will return an empty list. If the PEG data is not present, it will return undef.

coupling_evidence

usage: @evidence = $fig->coupling_evidence($peg1,$peg2)

The new form of coupling and evidence computation is based on precomputed data. The old form took minutes to dynamically compute things when needed. The old form still works if the directory Data/CouplingData is not present. If it is present, it is assumed to contain comprehensive coupling data in the form of precomputed scores and PCHs.

If Data/CouplingData is present, this routine returns a list of 3-tuples [Peg3,Peg4,Rep]. Here, Peg3 is similar to Peg1, Peg4 is similar to Peg2, and Rep == 1 iff this is a "representative pair". That is, we take all pairs and create a representative set in which each pair is not "too close" to any other pair in the representative set. Think of "too close" as being roughly 95% identical at the DNA level. This keeps (usually) a single pair from a set of different genomes from the same species.

It returns the empty list if the peg is not coupled. It returns undef if Data/CouplingData is not there.

coupling_and_evidence

usage: @coupling_data = $fig->coupling_and_evidence($fid,$bound,$sim_cutoff,$coupling_cutoff,$keep_record)

A computation of couplings and evidence starts with a given peg and produces a list of 3-tuples. Each 3-tuple is of the form

    [Score,CoupledToFID,Evidence]

Evidence is a list of 2-tuples of FIDs that are close in other genomes (producing a "pair of close homologs" of [$peg,CoupledToFID]). The maximum score for a single PCH is 1, but "Score" is the sum of the scores for the entire set of PCHs.

NOTE: once the new version of precomputed coupling is installed (i.e., when Data/CouplingData is filled with the precomputed relations), the parameters on computing evidence are ignored.

If $keep_record is true, the system records the information, asserting coupling for each of the pairs in the set of evidence, and asserting a pin from the given $fid through all of the PCH entries used in forming the score.
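A short sketch of walking the [Score,CoupledToFID,Evidence] structure may help. The sample data and the helper name rank_couplings are hypothetical; real data would come from coupling_and_evidence.

```perl
use strict;
use warnings;

# Hypothetical sample in the documented [Score,CoupledToFID,Evidence] shape.
# Each evidence entry is a 2-tuple of FIDs that are close in another genome.
my @coupling_data = (
    [1,   'fig|83333.1.peg.3',
        [ ['fig|99287.1.peg.4', 'fig|99287.1.peg.3'] ]],
    [1.5, 'fig|83333.1.peg.5',
        [ ['fig|99287.1.peg.4',   'fig|99287.1.peg.5'],
          ['fig|211586.1.peg.10', 'fig|211586.1.peg.11'] ]],
);

# Hypothetical helper: order by score (highest first) and count the PCHs.
sub rank_couplings {
    my @data = @_;
    return map  { [ $_->[1], $_->[0], scalar @{ $_->[2] } ] }
           sort { $b->[0] <=> $a->[0] } @data;
}

for my $row (rank_couplings(@coupling_data)) {
    my ($fid, $score, $pchCount) = @$row;
    print "$fid\tscore=$score\tPCHs=$pchCount\n";
}
```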

add_chr_clusters_and_pins

usage: $fig->add_chr_clusters_and_pins($peg,$hits)

The system supports retaining data relating to functional coupling. If a user computes evidence once and then saves it with this routine, data relating to both "the pin" and the "clusters" (in all of the organisms supporting the functional coupling) will be saved.

$hits must be a pointer to a list of 3-tuples of the sort returned by $fig->coupling_and_evidence.

translatable

usage: $fig->translatable($prot_id)

The system takes any number of sources of protein sequences as input (and builds an nr for the purpose of computing similarities). For each of these input fasta files, it saves (in the DB) a filename, seek address and length so that it can go get the translation if needed. This routine simply returns true iff info on the translation exists.

translation_length

usage: $len = $fig->translation_length($prot_id)

The system takes any number of sources of protein sequences as input (and builds an nr for the purpose of computing similarities). For each of these input fasta files, it saves (in the DB) a filename, seek address and length so that it can go get the translation if needed. This routine returns the length of a translation. This does not require actually retrieving the translation.

get_translation

    my $translation = $fig->get_translation($prot_id);

The system takes any number of sources of protein sequences as input (and builds an nr for the purpose of computing similarities). For each of these input fasta files, it saves (in the DB) a filename, seek address and length so that it can go get the translation if needed. This routine returns the stored protein sequence of the specified PEG feature.

prot_id

ID of the feature (PEG) whose translation is desired.

RETURN

Returns the protein sequence string for the specified feature.

mapped_prot_ids

usage: @mapped = $fig->mapped_prot_ids($prot)

This routine is at the heart of maintaining synonyms for protein sequences. The system determines which protein sequences are "essentially the same". These may differ in length (presumably due to miscalled starts), but the tails are identical (and the heads are not "too" extended). Anyway, the set of synonyms is returned as a list of 2-tuples [Id,length] sorted by length.

get_corresponding_ids

    my @id_list = $fig->get_corresponding_ids($id, $with_type_info);

Return a list of the identifiers that correspond to the given identifier, based on the PIR id correspondence table.

id

Identifier to look up.

with_type_info

Pass a true value here to return tuples [id, source-type, link-information] instead of identifiers.

RETURN

A list of identifiers if $with_type_info not true; a list of tuples [id, source-type, link-information] otherwise.

function_of

    my $function = $fig->function_of($id, $user);

or

    my @functions = $fig->function_of($id);

In a scalar context, returns the most recently-determined functional assignment of a specified feature by a particular user. In a list context, returns a list of 2-tuples, each consisting of a user ID followed by a functional assignment by that user. In this case, the list contains all the functional assignments for the feature.

id

ID of the relevant feature.

user

ID of the user whose assignment is desired (scalar context only)

RETURN

Returns the most recent functional assignment by the given user in scalar context, and a list of functional assignments in list context. Each assignment in the list context is a 2-tuple of the form [$user, $assignment].

function_of_bulk

    my $functionHash = $fig->function_of_bulk(\@fids, $no_del_check);

Return a hash mapping the specified proteins to their master functional assignments.

fids

Reference to a list of feature IDs.

no_del_check

If TRUE, then deleted features will not be removed from the list. The default is FALSE, which means deleted features will be removed from the list.

RETURN

Returns a reference to a hash mapping feature IDs to their main functional assignments.

translated_function_of

usage: $function = $fig->translated_function_of($peg,$user)

You get just the translated function.

translate_function

usage: $translated_func = $fig->translate_function($func)

Translates a function based on the function.synonyms table.

assign_function

usage: $fig->assign_function($peg,$user,$function,$confidence)

Assigns a function. Note that confidence can (and should, if unusual) be included. Assignments are now logged in the annotation file by assign_function.

nsims

New sims code.

This code takes advantage of a network similarity server if it is available.

We gather sims in the following manner:

    If a local sims directory exists, gather the raw sims for our peg.
    If dynamic sims are available, gather the raw sims from there as well.
    Do an initial pruning of these raw sims, based on the conditions
    passed in to the sims call.
    Locally expand these sims.
    If we are using network sims, retrieve them now, and add to the local sims set.
    Do a final pruning of this set of sims, and sort.

osims

usage: @sims = $fig->osims($peg,$maxN,$maxP,$select,$max_expand, $filters)

Returns a list of similarities for $peg such that

    there will be at most $maxN similarities,
    each similarity will have a P-score <= $maxP, and
    $select gives processing instructions:
        "raw" means that the similarities will not be expanded (by far fastest option)
        "fig" means return only similarities to fig genes
        "all" means that you want all the expanded similarities.
        "figx" means expand until the maximum number of fig sims

By "expanded", we refer to taking a "raw similarity" against an entry in the non-redundant protein collection, and converting it to a set of similarities (one for each of the proteins that are essentially identical to the representative in the nr).

Each entry in @sims is a reference to an array. These are the values in each array position:

 0.  The query peg
 1.  The similar peg
 2.  The percent id
 3.  Alignment length
 4.  Mismatches
 5.  Gap openings
 6.  The start of the match in the query peg
 7.  The end of the match in the query peg
 8.  The start of the match in the similar peg
 9.  The end of the match in the similar peg
10.  E value
11.  Bit score
12.  Length of query peg
13.  Length of similar peg
14.  Method
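The array positions above are easier to work with when they are given names. The sketch below is an assumption-laden illustration: the constant names, the helpers select_sims and query_coverage, and the sample rows are all hypothetical, and position 10 is treated as the score compared against $maxP.

```perl
use strict;
use warnings;

# Names for the array positions listed above (the names are this sketch's own).
use constant {
    QUERY => 0, SUBJECT => 1, PCT_ID => 2, ALN_LEN => 3, MISMATCHES => 4,
    GAPS => 5, Q_START => 6, Q_END => 7, S_START => 8, S_END => 9,
    E_VALUE => 10, BIT_SCORE => 11, Q_LEN => 12, S_LEN => 13, METHOD => 14,
};

# Hypothetical sample rows in the documented layout.
my @sims = (
    ['fig|83333.1.peg.4', 'fig|99287.1.peg.4', 95.1, 428, 21, 0,
     1, 428, 1, 428, 1e-200, 830, 428, 428, 'blastp'],
    ['fig|83333.1.peg.4', 'fig|211586.1.peg.12', 40.2, 200, 110, 3,
     10, 209, 5, 204, 1e-5, 90, 428, 300, 'blastp'],
);

# Keep at most $maxN similarities whose score is <= $maxP.
sub select_sims {
    my ($sims, $maxN, $maxP) = @_;
    my @kept = grep { $_->[E_VALUE] <= $maxP } @$sims;
    splice(@kept, $maxN) if @kept > $maxN;
    return @kept;
}

# Fraction of the query peg covered by the match.
sub query_coverage {
    my ($sim) = @_;
    return ($sim->[Q_END] - $sim->[Q_START] + 1) / $sim->[Q_LEN];
}

for my $sim (select_sims(\@sims, 10, 1e-10)) {
    printf "%s vs %s: %.1f%% identity, coverage %.2f\n",
        $sim->[QUERY], $sim->[SUBJECT], $sim->[PCT_ID], query_coverage($sim);
}
```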

bbhs

    my @bbhList = $fig->bbhs($peg, $cutoff);

Return a list of the bi-directional best hits relevant to the specified PEG.

peg

ID of the feature whose bidirectional best hits are desired.

cutoff

Similarity cutoff. If omitted, 1e-10 is used.

RETURN

Returns a list of 3-tuples. The first element of each tuple is the best-hit PEG; the second element is the score. A lower score indicates a better match. The third element is the bit score for the pair, normalized to the length of the protein.

bbh_list

    my $bbhHash = $fig->bbh_list($genomeID, \@featureList);

Return a hash mapping the features in a specified list to their bidirectional best hits on a specified target genome.

(Modeled after the Sprout call of the same name.)

genomeID

ID of the genome from which the best hits should be taken.

featureList

List of the features whose best hits are desired.

RETURN

Returns a reference to a hash that maps the IDs of the incoming features to the best hits on the target genome.

get_figfams_data

usage: $dir = $fig->get_figfams_data($mydir) or $dir = &FIG::get_figfams_data($mydir)

Returns the Figfams data directory to use. If $mydir is passed, use that value. Otherwise see if $FIG_Config::FigfamsData is defined, and use that. Otherwise default to $FIG_Config::data/FigfamsData.
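The fallback order can be sketched as follows. This is only an illustration: a plain hash stands in for the FIG_Config package variables, the helper name figfams_dir and the directory paths are hypothetical, and "passed" is assumed to mean "defined".

```perl
use strict;
use warnings;

# Hypothetical helper. $config is a hash standing in for FIG_Config:
#   FigfamsData => $FIG_Config::FigfamsData, data => $FIG_Config::data
sub figfams_dir {
    my ($mydir, $config) = @_;
    return $mydir                 if defined $mydir;                  # 1. explicit argument
    return $config->{FigfamsData} if defined $config->{FigfamsData};  # 2. configured value
    return "$config->{data}/FigfamsData";                             # 3. default location
}

my %config = (data => '/home/fig/FIGdisk/FIG/Data');   # made-up path
print figfams_dir(undef, \%config), "\n";
print figfams_dir('/tmp/MyFigfams', \%config), "\n";
```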

dsims

usage: @sims = $fig->dsims($id,$seq,$maxN,$min_nbsc)

Returns a list of similarities for $seq against PEGs from FIGfams such that

    there will be at most $maxN similarities, and
    each similarity will have a normalized bit-score >= $min_nbsc

The "dsims" or "dynamic sims" are not precomputed. They are computed using a heuristic which is much faster than blast, but misses some similarities. Essentially, you have an "index" of representative sequences, a quick blast is done against it, and if there are any hits these are used to indicate which sub-databases to blast against. This implies that the p-scores are fairly meaningless; use the normalized bit-scores ($sim->nbsc) instead.

in_cluster_with

usage: @pegs = $fig->in_cluster_with($peg)

Returns the set of pegs that are thought to be clustered with $peg (on the chromosome).

add_chromosomal_clusters

usage: $fig->add_chromosomal_clusters($file)

The given file is supposed to contain one predicted chromosomal cluster per line (either comma or tab separated pegs). These will be added (to the extent they are new) to those already in $FIG_Config::global/chromosomal_clusters.

in_pch_pin_with

usage: $fig->in_pch_pin_with($peg)

Returns the set of pegs that are believed to be "pinned" to $peg (in the sense that PCHs occur containing these pegs over significant phylogenetic distances).

add_pch_pins

usage: $fig->add_pch_pins($file)

The given file is supposed to contain one set of pinned pegs per line (either comma or tab separated pegs). These will be added (to the extent they are new) to those already in $FIG_Config::global/pch_pins.

add_annotation

    my $okFlag = $fig->add_annotation($fid, $user, $annotation, $time_made);

Add an annotation to a feature.

fid

ID of the feature to be annotated.

user

Name of the user making the annotation.

annotation

Text of the annotation.

time_made (optional)

Time of the annotation, in seconds since the epoch. If omitted, the current time is used.

RETURN

Returns 1 if successful, 0 if any of the parameters are invalid or an error occurs.

add_annotation_batch

    my ($n_added, $badList) = $fig->add_annotation_batch($file);

Install a batch of annotations.

file

File containing annotations.

RETURN

Returns the number of annotations successfully added in $n_added. If annotations failed, they are returned in $badList as a tuple [$peg, $error_msg, $entry].

merged_related_annotations

usage: @annotations = $fig->merged_related_annotations($fids)

The set of annotations of a set of PEGs ($fids) is returned as a list of 4-tuples. Each entry in the list is of the form [$fid,$timestamp,$user,$annotation].

feature_annotations

    my @annotations = $fig->feature_annotations($fid, $rawtime);

Return a list of the specified feature's annotations. Each entry in the list returned is a 4-tuple containing the feature ID, time stamp, user ID, and annotation text. These are exactly the values needed to add the annotation using add_annotation, though in a different order.

fid

ID of the feature whose annotations are to be listed.

rawtime (optional)

If TRUE, the times will be returned as PERL times (seconds since the epoch); otherwise, they will be returned as formatted time strings.

RETURN

Returns a list of 4-tuples, one per annotation. Each tuple is of the form ($fid, $timeStamp, $user, $annotation) where $fid is the feature ID, $timeStamp is the time the annotation was made, $user is the name of the user who made the annotation, and $annotation is the text of the annotation.

read_all_annotations

    my @annotations = $fig->read_all_annotations($genomeID);

Return a list of the specified genome's annotations. Each entry in the list returned is a 4-tuple containing the feature ID, time stamp, user ID, and annotation text. The values are read directly from the annotation flat file without resorting to the database.

genomeID

ID of the genome whose annotations are to be read.

RETURN

Returns a list of 4-tuples, one per annotation. Each tuple is of the form ($fid, $timeStamp, $user, $annotation) where $fid is the feature ID, $timeStamp is the time the annotation was made, $user is the name of the user who made the annotation, and $annotation is the text of the annotation.

read_annotation_record

    my $annoString = FIG::read_annotation_record($fileHandle);

Read an annotation record from the specified file handle. Will return the annotation record if successful, and undef if end-of-file is read. An annotation record consists of multiple lines of text separated by a line containing a double-slash //.

fileHandle

The file handle from which to read the record.

RETURN

Returns either the entire annotation record (without the double-slash) or undef, indicating end-of-file. Null records will not be returned.
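The record format can be illustrated with a small reader that works on any file handle. This is a sketch written from the description above, not the FIG implementation; the helper name sketch_read_record and the sample record contents are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the next non-null record (without the //
# line), or undef at end-of-file.
sub sketch_read_record {
    my ($fh) = @_;
    my @lines;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        if ($line eq '//') {
            return join("\n", @lines) if @lines;   # skip null records
        }
        else {
            push @lines, $line;
        }
    }
    # A final record with no trailing // is still returned.
    return @lines ? join("\n", @lines) : undef;
}

# Made-up sample data, read through an in-memory file handle.
my $text = join("\n",
    'fig|83333.1.peg.1', '1100000000', 'master', 'Set function', '//',
    '//',
    'fig|83333.1.peg.2', '1100000001', 'master', 'A note', '//') . "\n";

open(my $fh, '<', \$text) or die "Cannot open in-memory handle: $!";
while (defined(my $record = sketch_read_record($fh))) {
    my ($fid) = split /\n/, $record;
    print "Record for $fid\n";
}
close $fh;
```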

parse_date

usage: $date = $fig->parse_date(date-string)

Parse a date string, returning seconds-since-the-epoch, or undef if the date did not parse.

Accepted formats include an integer, which is assumed to be seconds-since-the-epoch and is just returned; MM/DD/YYYY; or a date that can be parsed by the routines in the Date::Parse module.
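The first two accepted formats can be sketched with the core Time::Local module. This is an illustration only: the helper name sketch_parse_date is hypothetical, the Date::Parse fallback is left out so the example needs no non-core modules, and interpreting MM/DD/YYYY as local midnight is an assumption.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Hypothetical helper covering the integer and MM/DD/YYYY formats only.
sub sketch_parse_date {
    my ($date) = @_;
    # An integer is taken to be seconds since the epoch already.
    return $date if $date =~ /^\d+$/;
    if ($date =~ m{^(\d{1,2})/(\d{1,2})/(\d{4})$}) {
        my ($mon, $day, $year) = ($1, $2, $3);
        # timelocal dies on impossible dates; eval turns that into undef.
        # ASSUMPTION: the date means midnight in the local time zone.
        my $time = eval { timelocal(0, 0, 0, $day, $mon - 1, $year) };
        return $time;
    }
    return undef;   # the real method would try Date::Parse at this point
}

for my $input ('1100000000', '11/09/2004', '13/45/2004', 'yesterday-ish') {
    my $parsed = sketch_parse_date($input);
    print "$input => ", (defined $parsed ? $parsed : 'undef'), "\n";
}
```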

extract_assignments_from_annotations

Extract a list of assignments from an annotations package as created by annotations_made_fast. Assumes that the user and date filtering was done by the annotations extraction, so all this has to do is to sort the lists of annotations by date and grab the latest one.

Return value is a list of tuples [$peg, $assignment, $date, $who].

annotations_made

usage: @annotations = $fig->annotations_made($genomes, $who, $date)

Return the list of annotations on the genomes in @$genomes made by $who after $date.

Each returned annotation is of the form [$fid,$timestamp,$user,$annotation].

Attributes

The attribute system automatically detects whether you are using a local attribute database, a remote attribute server, or the SEED data store. For details on the new attribute system see the documentation for the CustomAttributes module.

Because of the enormous number of attributes in the system (1.5 million and growing), the old system, which combined a database table and flat file data stores, has become too slow for live SEEDs. It is maintained for small test SEEDs, such as what you might have running on a local PC. Be aware, however, that not all functions of the old system work in the new system, and vice versa. You can get a more accurate test system by linking to the test attribute server. Simply place

    $attrURL = "http://nmpdr-1.nmpdr.org/next/FIG/AttribXMLRPC.cgi";

in your FIG_Config file. This server contains old data that can be mangled without let or hindrance. To connect to the real server, use

    $attrURL = "http://nmpdr-1.nmpdr.org/next/FIG/AttribXMLRPC.cgi";

but be aware that any changes you make will automatically be migrated to all the production SEEDs.

The SEED Data Store Interface

There are several base attribute methods:

 get_attributes
 add_attribute
 delete_attribute
 change_attribute

There are also methods for more complex things:

 get_keys
 get_values
 guess_value_format

By default all keys are case sensitive, and all keys have leading and trailing white space removed. Keys cannot contain anything but [a-zA-Z0-9_] (or things matched by \w).

Keys and values are not in a 1:1 relationship, so a single key can have several values.

Most attribute files are stored in the genome-specific directories. These are in Organisms/nnnnn.n/Attributes for the organisms, and Organisms/nnnnn.n/Features/*/Attributes for the features. Attributes can also be stored in Global/Attributes where they will be loaded, but users are discouraged from doing this since there will be no packaging and sharing of those attributes. Global should be reserved for those attributes that are calculated on a database-wide basis. There are several "special" files that we are using:

1. Definition files

These are the raw text files stored in the appropriate locations (Organisms/nnnnn.n/Attributes, Organisms/nnnnn.n/Features/*/Attributes, and Global/Attributes). The files should consist of ONLY feature, key, value, and optional URL. Any other columns will be ignored and not loaded into the database.

2. Global/Attributes/attribute_keys

This contains the definition of the attribute keys. There are currently 3 defined columns although others may be added and this file can contain lines of an arbitrary length.

3. Global/Attributes/transaction_log, Organisms/nnnnnn.n/Attributes/transaction_log, and Organisms/nnnnnn.n/Features/*/Attributes/transaction_log

These are the transaction logs that contain any modifications to the data. In general the data is loaded from a single definition file that is not modified by the software. Any changes to the attributes are made in the database and then written to the transaction log. The transaction log has the following columns:

    1. command. This can be one of ADD/DELETE/CHANGE
    2. feature. The feature id to be modified
    3. key. The key to be modified
    4. old value. The original value of the key
    5. old url. The original URL
    6. new value. The new value for the key. Ignored if the key is deleted.
    7. new url. The new value for the URL. Ignored if the key is deleted.

Note that the old value and old url are optional. If they are not provided ALL instances of the key will be affected.

Notice also that the old file assigned_attributes is no longer used. This is replaced by the transaction log.

Finally, in the parsing of all files any line beginning with a pound sign is ignored as a comment.
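A minimal sketch of reading one transaction log line into named columns follows. It assumes the columns are tab-separated, which is not stated above, and the helper name parse_log_line and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Column names follow the seven columns of the transaction log.
my @columns = qw(command fid key old_value old_url new_value new_url);

# Hypothetical helper: returns a hash reference, or nothing for comments,
# blank lines, and unknown commands.
# ASSUMPTION: columns are separated by tabs.
sub parse_log_line {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/ || $line !~ /\S/;
    my %entry;
    @entry{@columns} = split /\t/, $line, -1;
    return unless $entry{command} =~ /^(?:ADD|DELETE|CHANGE)$/;
    return \%entry;
}

# Made-up sample lines.
my @log = (
    "# loaded from definition file",
    "ADD\tfig|83333.1.peg.4\tstructure\t\t\t1ABC\thttp://example.org/1ABC",
    "DELETE\tfig|83333.1.peg.4\tstructure",
);

for my $line (@log) {
    my $entry = parse_log_line($line) or next;
    print "$entry->{command} $entry->{key} on $entry->{fid}\n";
}
```

In the DELETE line no old value or old url is given, which according to the description above would affect all instances of the key.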

A method, read_attribute_transaction_log, is provided to read the transaction_logs and implement the changes therein. In each of the methods add_attribute, delete_attribute, and change_attribute there is an optional boolean that can be set to prevent writing of the transaction_log. The read_attribute_transaction_log reads the log and then adds/changes/deletes the records as appropriate. Without this boolean there is a circular reference.

get_attributes requires one of four keys: fid (which can be a genome, peg, rna, or other id, or a reference to a list of ids), key, value, or url.

It will find any attribute that has the characteristics that you request, and for each match it will return a 4-tuple of the form [fid, key, value, url].

You can request the attributes of the E. coli genome like this:

    $fig->get_attributes('83333.1');

You can request the attributes of a peg like this:

    $fig->get_attributes($peg);
    $fig->get_attributes("fig|83333.1.peg.4");

You can request any structure key like this:

    $fig->get_attributes(undef, 'structure');

You can request any url like this:

    $fig->get_attributes(undef, undef, undef, 'http://pir.georgetown.edu/sfcs-cgi/new/pirclassif.pl?id=SF001547');

NOTE: If there are no attributes an empty array will be returned. You need to check for this and not assume that it will be undef.

get_attributes

    my @attributeList = $fig->get_attributes($objectID, $key, @values);

In the database, attribute values are sectioned into pieces using a splitter value specified in the constructor (new). This is not a requirement of the attribute system as a whole, merely a convenience for the purpose of these methods. If a value has multiple sections, each section is matched against the corresponding criterion in the @values list.

This method returns a series of tuples that match the specified criteria. Each tuple will contain an object ID, a key, and one or more values. The parameters to this method therefore correspond structurally to the values expected in each tuple. In addition, you can ask for a generic search by suffixing a percent sign (%) to any of the parameters. So, for example,

    my @attributeList = $fig->get_attributes('fig|100226.1.peg.1004', 'structure%', 1, 2);

would return something like

    ['fig|100226.1.peg.1004', 'structure', 1, 2]
    ['fig|100226.1.peg.1004', 'structure1', 1, 2]
    ['fig|100226.1.peg.1004', 'structure2', 1, 2]
    ['fig|100226.1.peg.1004', 'structureA', 1, 2]

Use of undef in any position acts as a wild card (all values). You can also specify a list reference in the ID column. Thus,

    my @attributeList = $fig->get_attributes(['100226.1', 'fig|100226.1.%'], 'PUBMED');

would get the PUBMED attribute data for Streptomyces coelicolor A3(2) and all its features.

In addition to values in multiple sections, a single attribute key can have multiple values, so even

    my @attributeList = $fig->get_attributes($peg, 'virulent');

which has no wildcard in the key or the object ID, may return multiple tuples.

Value matching in this system works very poorly, because of the way multiple values are stored. For the object ID and key name, we create queries that filter for the desired results. For the values, we do a comparison after the attributes are retrieved from the database. As a result, queries that filter only on value end up reading the entire attribute table to find the desired results.

objectID

ID of object whose attributes are desired. If the attributes are desired for multiple objects, this parameter can be specified as a list reference. If the attributes are desired for all objects, specify undef or an empty string. Finally, you can specify attributes for a range of object IDs by putting a percent sign (%) at the end.

key

Attribute key name. A value of undef or an empty string will match all attribute keys. If the values are desired for multiple keys, this parameter can be specified as a list reference. Finally, you can specify attributes for a range of keys by putting a percent sign (%) at the end.

values

List of the desired attribute values, section by section. If undef or an empty string is specified, all values in that section will match. A generic match can be requested by placing a percent sign (%) at the end. In that case, all values that match up to and not including the percent sign will match. You may also specify a regular expression enclosed in slashes. All values that match the regular expression will be returned. For performance reasons, only values have this extra capability.

RETURN

Returns a list of tuples. The first element in the tuple is an object ID, the second is an attribute key, and the remaining elements are the sections of the attribute value. All of the tuples will match the criteria set forth in the parameter list.
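The matching rules (undef as a wild card, a trailing percent sign for a generic match, list references for IDs and keys, and slash-enclosed regular expressions for values only) can be sketched against an in-memory list of tuples. This is an illustration of the documented semantics, not the database-backed implementation; the helper names field_matches and filter_attributes and the sample table are hypothetical.

```perl
use strict;
use warnings;

# Match one criterion against one field. undef or '' matches anything, a
# trailing % requests a prefix match, and (for values only) /.../ is a regex.
sub field_matches {
    my ($pattern, $field, $allowRegex) = @_;
    return 1 if !defined $pattern || $pattern eq '';
    if ($allowRegex && $pattern =~ m{^/(.*)/$}s) {
        my $regex = $1;
        return $field =~ /$regex/ ? 1 : 0;
    }
    if ($pattern =~ /^(.*)%$/s) {
        return index($field, $1) == 0 ? 1 : 0;
    }
    return $pattern eq $field ? 1 : 0;
}

# Hypothetical in-memory stand-in for get_attributes.
sub filter_attributes {
    my ($tuples, $objectID, $key, @values) = @_;
    my @ids  = ref($objectID) eq 'ARRAY' ? @$objectID : ($objectID);
    my @keys = ref($key)      eq 'ARRAY' ? @$key      : ($key);
    my @found;
    TUPLE: for my $tuple (@$tuples) {
        my ($id, $name, @sections) = @$tuple;
        next unless grep { field_matches($_, $id) } @ids;
        next unless grep { field_matches($_, $name) } @keys;
        # Each value criterion is checked against the matching section.
        for my $i (0 .. $#values) {
            my $section = defined $sections[$i] ? $sections[$i] : '';
            next TUPLE unless field_matches($values[$i], $section, 1);
        }
        push @found, $tuple;
    }
    return @found;
}

# Made-up sample table.
my @table = (
    ['fig|100226.1.peg.1004', 'structure',  1, 2],
    ['fig|100226.1.peg.1004', 'structure1', 1, 2],
    ['fig|100226.1.peg.1004', 'structureA', 1, 3],
    ['fig|100226.1.peg.1004', 'virulent',   1, 2],
    ['100226.1',              'PUBMED',     12345],
);

print join(' ', @$_), "\n"
    for filter_attributes(\@table, 'fig|100226.1.peg.1004', 'structure%', 1, 2);
```

Note that the sketch filters values after the ID and key checks, which mirrors the description of values being compared only after retrieval.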

query_attributes

    my @attributeData = $fig->query_attributes($filter, $filterParms);

Return the attribute data based on an SQL filter clause. In the filter clause, the name $object should be used for the object ID, $key should be used for the key name, $subkey for the subkey value, and $value for the value field.

filter

Filter clause in the standard ERDB format, except that the field names are $object for the object ID field, $key for the key name field, $subkey for the subkey field, and $value for the value field. This abstraction enables us to hide the details of the database construction from the user.

filterParms

Parameters for the filter clause.

RETURN

Returns a list of tuples. Each tuple consists of an object ID, a key (with optional subkey), and one or more attribute values.

get_cv_attributes

A simple wrapper around get_attributes to return only those attributes that have meta_data indicating that the key is a controlled vocabulary.

### DEPRECATED ### The controlled vocabulary feature was never used in the old system, and in the new system, ALL the keys are controlled vocabulary.

add_attribute

Add a new key/value pair to something. Something can be a genome id, a peg, an rna, prophage, whatever.

Arguments:

        feature id, this can be a peg, genome, etc,
        key name. This is case sensitive and has the leading and trailing white space removed
        value
        optional URL to add
        boolean to prevent writing to the transaction log. See above

delete_attribute

    $fig->delete_attribute($objectID, $key, @values);

Delete the specified attribute key/value combination from the database.

objectID

ID of the object whose attribute is to be deleted.

key

Attribute key name.

values

One or more values associated with the key. If no values are specified, then all values will be deleted. Otherwise, only a matching value will be deleted.

parse_oid

    my ($type, $id) = FIG::parse_oid($idValue);

Convert an attribute object ID to an object type and an ID applicable to that type. This information can be used to convert an ID string obtained from the get_attributes method to an object name and ID suitable for plugging into the GetEntity method of an ERDB database.

idValue

ID string from the attribute database.

RETURN

Returns a two-element list consisting of the object type and its individual ID.

form_oid

    my $idValue = FIG::form_oid($type, $id);

Convert an object type and ID into an ID string for the attribute database.

type

Object type. This should usually correspond to an entity name in a database. It can only contain letters. This means no digits, spaces, or even underscores.

id

Individual object ID.

RETURN

Returns the string used to represent the object in the attribute database.
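As a rough illustration of how form_oid and parse_oid mirror each other, here is a minimal sketch. The "Type:id" encoding, the type names Feature and Genome, and the sketch_ function names are all assumptions made for this example; the actual string format used by the attribute database is not specified here.

```perl
use strict;
use warnings;

# ASSUMPTION: a "Type:id" encoding, chosen only for illustration.
sub sketch_form_oid {
    my ($type, $id) = @_;
    # The type may contain only letters: no digits, spaces, or underscores
    die "Type must contain only letters: $type\n" unless $type =~ /^[A-Za-z]+\z/;
    return "$type:$id";
}

sub sketch_parse_oid {
    my ($idValue) = @_;
    # Explicitly typed ID (assumed encoding)
    return ($1, $2) if $idValue =~ /^([A-Za-z]+):(.+)\z/s;
    # IDs starting with fig| are treated as features (type name assumed)
    return ('Feature', $idValue) if $idValue =~ /^fig\|/;
    # IDs that are all digits and periods are treated as genomes (type name assumed)
    return ('Genome', $idValue) if $idValue =~ /^[\d.]+\z/;
    return (undef, $idValue);
}

my $oid = sketch_form_oid('Subsystem', 'Histidine Degradation');
my ($type, $id) = sketch_parse_oid($oid);
print "$type / $id\n";
```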

delete_matching_attributes

    my @attributeList = $fig->delete_matching_attributes($objectID, $key, @values);

This method works identically to get_attributes, except that the attributes are deleted as they are retrieved.

change_attribute

    $fig->change_attribute($objectID, $key, \@oldValues, \@newValues);

Change the value of an attribute key/value pair for an object. This is implemented as a delete followed by an insert.

objectID

ID of the genome or feature whose attribute is to be changed. In general, an ID that starts with fig| is treated as a feature ID, and an ID that is all digits and periods is treated as a genome ID. For IDs of other types, this parameter should be a reference to a 2-tuple consisting of the entity type name followed by the object ID.

key

Attribute key name. This corresponds to the name of a field in the database.

oldValues

One or more values identifying the key/value pair to change.

newValues

One or more values to be put in place of the old values.

clean_attribute_key()

## DEPRECATED ## This process is no longer required in the new system.

use $key=$fig->clean_attribute_key($key)

Keys for attributes are used as filenames in the code, and there are limitations on the characters that can be used in the key name. We provide an extended explanation of each key, so the key does not necessarily need to be person-readable.

Keys are not allowed to contain any non-word character (i.e. they must only contain [a-zA-Z0-9] and _).

This method will remove these.

essential

    my $flag = $fig->essential($fid);

Return TRUE if a feature is considered essential and FALSE otherwise. This method provides a uniform method for determining essentiality that will remain consistent during the various overhauls of essentiality. Currently a feature is essential if it has an attribute with the value essential or potential_essential.

fid

ID of the feature to check for essentiality.

RETURN

Returns TRUE if the feature is considered essential, else FALSE.
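The current rule can be sketched against a list of attribute tuples. This is only an illustration of the rule as stated above, not the method's actual code; sketch_is_essential is a hypothetical name and the tuples (including the key names) are made-up data in the (object ID, key, values...) layout.

```perl
use strict;
use warnings;

# Hypothetical helper: TRUE if any tuple carries the value
# "essential" or "potential_essential".
sub sketch_is_essential {
    my @tuples = @_;
    for my $tuple (@tuples) {
        my (undef, undef, @values) = @$tuple;
        return 1 if grep { defined($_) && $_ =~ /^(?:essential|potential_essential)\z/ } @values;
    }
    return 0;
}

# Made-up attribute tuples for one feature
my @attrs = (
    ['fig|83333.1.peg.4', 'some_key',  'gut'],
    ['fig|83333.1.peg.4', 'other_key', 'potential_essential'],
);
print sketch_is_essential(@attrs) ? "essential\n" : "not essential\n";
```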

virulent

    my $flag = $fig->virulent($fid);

Return TRUE if a feature is considered virulent and FALSE otherwise. This method provides a uniform method for determining virulence that will remain consistent during the various overhauls of virulence attributes. Currently a feature is virulent if it has an attribute whose key begins with virulence_associated.

fid

ID of the feature to check for virulence.

RETURN

Returns TRUE if the feature is considered virulent, else FALSE.

Splitting and Joining Attributes "oids"

There was a big problem with attributes being very slow to recover, and having to recover all attributes just to get those for a peg or a genome. The current implementation splits the original ID (oid) into three columns, genome, ftype, and id. The ftype is peg, rna, pp, etc. The id is the feature number. The genome is the genome number.

Hence:

    fig|83333.1.peg.1345 becomes 83333.1, peg, and 1345
    83333.1 becomes 83333.1, '', and ''

To split an oid into an array with three parts: $self->split_attribute_oid($peg);

To join the three parts of a series of results: map {unshift @$_, $self->join_attribute_oid(splice(@$_, 0, 3))} @$res;

This code splices the first three elements of the array, joins them, and then unshifts the result of that join back into the start of the array. Cool, eh?
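To make the split and join behaviour concrete, here is a small self-contained sketch. The sketch_ functions are stand-ins written from the examples above (they are not the methods in this module), and the result rows are made-up data.

```perl
use strict;
use warnings;

# Stand-in for split_attribute_oid, written from the documented examples
sub sketch_split_oid {
    my ($oid) = @_;
    # Feature: fig|genome.ftype.number
    return ($1, $2, $3) if $oid =~ /^fig\|(\d+\.\d+)\.([^.]+)\.(\d+)\z/;
    # Genome: the ftype and id columns are empty strings
    return ($oid, '', '') if $oid =~ /^\d+\.\d+\z/;
    # Unknown: just the id, followed by undef, undef
    return ($oid, undef, undef);
}

# Stand-in for join_attribute_oid
sub sketch_join_oid {
    my ($genome, $ftype, $id) = @_;
    return "fig|$genome.$ftype.$id"
        if defined $ftype && $ftype ne '' && defined $id && $id ne '';
    return $genome;
}

# Made-up result rows: genome, ftype, id, key, value
my $res = [
    ['83333.1', 'peg', '1345', 'essential', 'yes'],
    ['83333.1', '',    '',     'habitat',   'gut'],
];

# Same idiom as above: splice off three columns, join them, unshift the result
unshift @$_, sketch_join_oid(splice(@$_, 0, 3)) for @$res;
print join("\t", @$_), "\n" for @$res;
```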

split_attribute_oid()

use my ($genome, $type, $id)=split_attribute_oid($id);

Splits an id into genome, type, and id if it is a feature, or just genome and '', '' if it is a genome, and just the id and undef, undef if it is not known.

join_attribute_oid()

use my $id=join_attribute_oid($genome, $feature, $id);

Joins an attribute back together after it has been pulled from the mysql database

read_attribute_transaction_log

use: $fig->read_attribute_transaction_log($logfile);

This method reads the transaction_log described in $logfile and enacts the changes described therein. The changes must be one of add, delete, or change.

erase_attribute_entirely

This method will remove any notion of the attribute that you give it. It is different from delete as that just removes a single attribute associated with a peg. This will remove the files and uninstall the attributes from the database so there is no memory of that type of attribute. All of the attribute files are moved to FIG_Tmp/Attributes/deleted_attributes, and so you can recover the data for a while. Still, you should probably use this carefully!

I use this to clean out old PIR superfamily attributes immediately before installing the new correspondence table.

e.g. my $status=$fig->erase_attribute_entirely("structure");

This will return the number of files that were moved to the new location

get_group_keys

    my @keys = $fig->get_group_keys($groupName);

Return all the attribute keys in the named group.

groupName

Name of the group whose keys are desired.

RETURN

Returns a list of the attribute keys in the named group.

get_group_key_info

    my %keys = $fig->get_group_key_info($groupName);

Return the descriptive data for all the attribute keys in the named group.

groupName

Name of the group whose keys are desired. If omitted, then all keys will be returned. This could be expensive, but when it's necessary, it's necessary.

RETURN

Returns a hash mapping each relevant attribute key to an n-tuple containing the attribute relation name, the description, and the 0 or more group names.

get_genome_keys

Get all the keys that apply to genomes and only genomes. This method takes no arguments and returns an array.

get_peg_keys

Get all the keys that apply just to pegs. This method takes no arguments and returns an array.

get_peg_keys_for_genome

Get all the keys that apply just to pegs from a specified genome. This method takes a genome id as an argument and returns an array.

get_genomes_with_attribute

Get a list of all genomes that have a specified attribute. This will search for all genomes that have some attribute.

This will also accept partial matches. Hence to find all genomes that have essentiality data you can do this:

my @genomes=$fig->get_genomes_with_attribute("essential");

This will find Essential_Gene_Sets_Bacterial, essential, etc

key_info

DEPRECATED: in actual fact, no attribute metadata was ever put into the system.

Access a hash of key information. The data that are returned are currently:

    hash key name    what is it                     data type
    single                                          [boolean]
    description      Explanation of key             [free text]
    readonly         whether to allow read/write    [boolean]
    is_cv            attribute is a cv term         [boolean]

Single is a boolean; if it is true, only the last value returned should be used. Note that the other methods will still return all the values, and it is up to the implementer to ensure that only the last value is used.

Explanation is a user-derived explanation that can be free text

If a reference to a hash is provided, along with the key, those values will be set to the attribute_keys file

Returns an empty hash if the key is not provided or doesn't exist.

e.g.

    $fig->key_info($key, \%data);  # set the data
    $data = $fig->key_info($key);  # get the data

This data is stored in a file called $FIG_Config::global/Attributes/attribute_metadata and in a database called attribute_metadata. The data is strictly on a last in last out basis, so that if a datapoint is changed, the last datapoint in the database or file is returned. At the moment I am not coding the ability to edit data.

The method takes the following arguments

key

The key to look for or add data to.

$data

A reference to a hash containing the new data to add to the database. If provided this will cause the database to be updated

$nowrite

Do not write the new data to the attributes_metadata file. This is mainly used by load_attributes to prevent a circular read/write condition.

update_attributes_metadata()

This method exists solely to update the attributes metadata file and make sure that it is in the right format. This method can probably be deleted in a while, but it needs to be run on all machines with attributes data before then!

It is only called if an old attributes metadata file is found.

The method returns the filename where the data is now stored.

get_values

Get all the values that we know about

Without any arguments:

Returns a reference to a hash, where the key is the type of feature (peg, genome, rna, prophage, etc), and the value is a reference to a hash where the key is the value and the value is the number of occurrences.

e.g. print "There are " , $fig->get_values()->{'peg'}->{'100'}, " keys with the value 100 in the database\n";

With a single argument:

The argument is assumed to be the type (rna, peg, genome, etc).

With two arguments:

The first argument is the type (rna, peg, genome, etc), and the second argument is the key.

In each case it will return a reference to a hash. E.g.

        $fig->get_values(); # will get all values
        $fig->get_values('peg'); # will get all values for pegs
        $fig->get_values('peg', 'structure'); # will get all values for pegs with attribute structure
        $fig->get_values(undef, 'structure'); # will get all values for anything with that attribute
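For the no-argument form, the nested hash can be walked as in the following sketch. The $values structure here is mock data shaped like the documented return value (type, then value, then count); in real use it would come from $fig->get_values().

```perl
use strict;
use warnings;

# Mock data in the documented shape: type => { value => count }
my $values = {
    peg    => { '100' => 3, 'essential' => 12 },
    genome => { 'gut' => 2 },
};

# List every (type, value, count) triple in a stable order
for my $type (sort keys %$values) {
    for my $value (sort keys %{ $values->{$type} }) {
        printf "%s\t%s\t%d\n", $type, $value, $values->{$type}{$value};
    }
}

# Look up one count, defaulting to 0 when the value was never seen
my $count = $values->{peg}{'100'} || 0;
print "There are $count keys with the value 100\n";
```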

guess_value_format

There are occasions where I want to know what a value is for a key. I have three scenarios right now:

 1. strings
 2. numbers
 3. percentiles ( a type of number, I know)

In these cases, I may want to know something about them and do something interesting with them. This will try to guess what the values are for a given key so that you can try to limit what people add. At the moment this is pure guesswork. I suppose we could put some restrictions on t/v pairs, but I don't feel like it.

This method will return a reference to an array. If the element is a string there will only be one element in that array, the word "string". If the value is a number, there will be three elements, the word "float" in position 0, and then the minimum and maximum values. You can figure out if it is a percent :-)
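The documented return shape can be illustrated with a minimal sketch. This is a guess at one reasonable way to do the classification, not the method's actual logic; sketch_guess_format is a hypothetical name, and the rule "string if any value is non-numeric" is an assumption.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use List::Util qw(min max);

# Hypothetical helper: classify a list of values for one key
sub sketch_guess_format {
    my @values = @_;
    # ASSUMPTION: any non-numeric value (or no values at all) means "string"
    return ['string'] if !@values || grep { !looks_like_number($_) } @values;
    # Otherwise report "float" plus the minimum and maximum
    return ['float', min(@values), max(@values)];
}

my $guess = sketch_guess_format(12.5, 99, 0.3);
print join(', ', @$guess), "\n";

# The caller decides whether a float looks like a percent,
# for example by checking that the range lies within 0..100
my $maybe_percent = $guess->[0] eq 'float' && $guess->[1] >= 0 && $guess->[2] <= 100;
```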

attribute_location

This is just an internal method to find the appropriate location of the attributes file depending on whether it is a peg, an rna, or a genome or whatever.

add_cv_term

Add a controlled vocabulary term to a peg. Pass in the user, the peg, the vocab name, the termId, and the term (see the paragraph after the example). Returns an error string if there is a problem; otherwise returns nothing.

   my $status = $fig->add_cv_term( "master:EdF",
                                   "fig|9606.3.peg.26823", "MyVocab", "1234", "A thing of wonder.");
   if ($status) {print "error adding cv term: $status\n";}

Controlled vocabulary is read-only text associated with a peg. Each is a triple, namely (vocab name, termId, term text). The termId is an id that is used in the particular vocabulary and the term text is the actual term. For example, the GO has the term "U12-type nuclear mRNA branch site recognition" with termId GO:0000371. Thus, the triplet is (GO, GO:0000371, "U12-type nuclear mRNA branch site recognition"). Don't be confused by the GO: in GO:0000371. We don't add the GO:. That's just what GO decided to do.

termIds can not have ';' in them.

This routine encapsulates our present implementation via attributes.

search_cv_file

Search a controlled vocabulary file for desired text. Pass the name of the CV, e.g., "GO" or "HUGO", and the search text, and get back a reference to a list of results. Each result is a line from the file, and so is a tab-separated representation of the triplet (CV_name, CV_id, CV_text).

The match is a case-insensitive substring search.

    sub search_cv_file {
        my ($self, $cv, $search_term) = @_;
        my $file = $FIG_Config::global . "/CV/cv_search_" . $cv . ".txt";
        # Three-argument open with a lexical handle
        open(my $lookup, '<', $file) or do {
            print STDERR "Search could not find vocabulary file, $file\n";
            return;
        };
        my @lines = <$lookup>;
        chomp @lines;
        close($lookup);

        # Case-insensitive match; $search_term is used as a pattern
        my @grep_results = grep { /$search_term/i } @lines;
        return [@grep_results];
    }

Indexing Features and Functional Roles

search_index

    my ($pegs,$roles) = $fig->search_index($pattern, $non_word_search, $user);

Find all pegs and roles that match a search pattern. The syntax of $pattern is deliberately left undefined so that we can change the underlying technology, but a single word or phrase should work.

pattern

A search pattern. In general, the pattern is a single word or phrase that is expected to occur somewhere in a functional role, attribute key, or attribute value.

non_word_search (optional)

If specified, the pattern will be interpreted as a string instead of a series of words.

user (optional)

If specified, the name of the current user. That user's annotation will be given precedence when the functional role is determined.

RETURN

Returns a 2-tuple. The first element is a reference to a list of features. For each feature, there is a tuple consisting of the (0) feature ID, (1) the organism name (genus and species), (2) the aliases, (3) the functional role, and (4) the relevant annotator. The second element in the returned tuple is a reference to a list of functional roles. All the roles and features in the lists must match the pattern in some way.

choose_function

    my ($who, $function) = $fig->choose_function($user, @funcs);

Choose the best functional role from a list of role/user tuples. If a user is specified, we look for one by that user. If that doesn't work, we look for one by a master user. If THAT doesn't work, we take the first one.

user

The name of the current user. If no user is active, specify either undef or a null string.

funcs

List of functional roles. Each role is represented by a 2-tuple consisting of the user name followed by the role description.
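The selection order described above (current user, then a master user, then the first entry) can be sketched as follows. This is an illustration rather than the method's code; sketch_choose_function is a hypothetical name, and treating any user name of the form "master" or "master:..." as a master user is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the documented fallback order
sub sketch_choose_function {
    my ($user, @funcs) = @_;
    # 1. An assignment made by the current user, if one is active
    if (defined $user && $user ne '') {
        for my $f (@funcs) {
            return @$f if $f->[0] eq $user;
        }
    }
    # 2. An assignment made by a master user
    #    (ASSUMPTION: "master" or a name beginning with "master:")
    for my $f (@funcs) {
        return @$f if $f->[0] =~ /^master(?::|\z)/;
    }
    # 3. Otherwise the first assignment, or an empty list if there are none
    return @funcs ? @{ $funcs[0] } : ();
}

# Made-up user/role tuples
my @funcs = (
    ['alice',      'Histidine kinase'],
    ['master:EdF', 'Sensor histidine kinase'],
);
my ($who, $function) = sketch_choose_function('bob', @funcs);
print "$who: $function\n";
```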

auto_assign

usage: $assignment = &FIG::auto_assign($peg,$seq)

This returns an automated assignment for $peg. $seq is optional; if it is not present, then it is assumed that similarities already exist for $peg. $assignment is set to either

    Function
or
    Function\tW

if it is felt that the assertion is pretty weak.
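A caller will usually want to separate the function from the weak-assertion marker. A minimal sketch, assuming only the two documented forms of the return value; parse_assignment is a hypothetical helper and the function names are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: split an assignment into (function, is_weak)
sub parse_assignment {
    my ($assignment) = @_;
    # The optional marker follows a single tab
    my ($function, $flag) = split /\t/, $assignment, 2;
    my $weak = (defined $flag && $flag eq 'W') ? 1 : 0;
    return ($function, $weak);
}

for my $assignment ("Histidine kinase", "Histidine kinase\tW") {
    my ($function, $weak) = parse_assignment($assignment);
    print $function, ($weak ? " (weak)" : ""), "\n";
}
```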

Protein Families

In the protein families we have our own concept of an id that I have called a cid. This is entirely internal and does not map to any known database except our own; however, it is used to store the correspondence between different protein families. Therefore, to find out what family any protein is in you need to convert that protein to a cid. You can start with a KEGG, COG, TIGR, SP, GI, or FIG id, and get a cid back. From there, you can find out what other proteins that cid maps to, and what families that protein is also in.

all_protein_families

usage: @all = $fig->all_protein_families

Returns a list of the ids of all of the protein families currently defined.

families_for_protein

    my @families = $fig->families_for_protein($peg);

Return a list of all the families containing the specified protein.

peg

ID of the PEG representing the protein in question.

RETURN

Returns a list of the IDs of the families containing the protein.

proteins_in_family

    my @proteins = $fig->proteins_in_family($family);

Return a list of every protein in a family.

family

ID of the relevant protein family.

RETURN

Returns a list of all the proteins in the specified family.

family_function

    my $func = $fig->family_function($family);

Returns the putative function of all of the pegs in a protein family. Remember, we are defining "protein family" as a set of homologous proteins that have the same function.

family

ID of the relevant protein family.

RETURN

Returns the name of the function assigned to the members of the specified family.

sz_family

    my $n = $fig->sz_family($family);

Returns the number of proteins in a family.

family

ID of the relevant protein family.

RETURN

Returns the number of proteins in the specified family.

ext_sz_family

usage: $n = $fig->ext_sz_family($family)

Returns the number of external IDs in $family.

all_cids

usage: @all_cids=$fig->all_cids();

Returns a list of all the ids we know about.

ids_in_family

usage: @pegs = $fig->ids_in_family($family)

Returns a list of the cids in $family.

in_family

usage: @families = $fig->in_family($cid)

Returns an array containing the families containing a cid.

ext_ids_in_family

usage: @exts = $fig->ext_ids_in_family($family)

Returns a list of the external ids in an external family name.

ext_in_family

usage: @ext_families = $fig->ext_in_family($id)

Returns an array containing the external families containing an id. The ID is the one from the original database (e.g. pfam|PB129746)

families_by_source

use: my @families = $fig->families_by_source('fig');

This uses SQL to look up all the families that have a partial match to the argument supplied. It should be quicker than getting all families and parsing out the ones you want since it is done at the db level.

number_of_cids

use: my $number=$fig->number_of_cids

The number_of_ methods here all use SQL queries to count how many of each thing there are. This method just returns the number of cids

number_of_families

use: my $number=$fig->number_of_families("fig");

This uses an SQL count method to count the number of families that match the given source. This should be a lot quicker than retrieving all families and then looping through them.

number_of_proteins_in_families

use: my $number=$fig->number_of_proteins_in_families("fig", "distinct");

This uses an SQL count to count the number of proteins in families that match a given source. If distinct is true each protein will only be counted once, else the total number will be returned.

prot_to_cid

Convert a protein to a global ID.

    my $cid = $fig->prot_to_cid($proteinid);

$proteinid can be a FIG ID, a SP, tigr, or one of many other IDs

returns "" if not known

cid_to_prots

Convert an internal ID to the proteins that map to that ID.

    my @proteins = $fig->cid_to_prots($cid);

family_by_function

Get a list of families that have a partial match to a provided function.

E.g. my @families=$fig->family_by_function("histidine")

will return histidine kinase, histidine phosphatase, etc etc etc

Abstract Set Routines

KEGG methods

all_compounds

    my @compounds = $fig->all_compounds();

Return a list containing all of the KEGG compounds.

names_of_compound

    my @names = $fig->names_of_compound($cid);

Returns a list containing all of the names assigned to the specified KEGG compound. The list will be ordered as given by KEGG.

cid

ID of the desired compound.

RETURN

Returns a list of names for the specified compound.

ids_of_compound

usage: @ids = $fig->ids_of_compound

Returns a list containing all of the ids assigned to the KEGG compounds. The list will be ordered as given by KEGG.

ids_of_compound_like_name

usage: @ids = $fig->ids_of_compound_like_name($name)

Returns a list containing all of the ids assigned to the KEGG compounds that match $name. The list will be ordered as given by KEGG.

comp2react

    my @rids = $fig->comp2react($cid);

Returns a list containing all of the reaction IDs for reactions that take $cid as either a substrate or a product.

valid_reaction_id

    my $flag = $fig->valid_reaction_id($rid);

Returns true iff the specified ID is a valid reaction ID.

This will become important as we include non-KEGG reactions

rid

Reaction ID to test.

RETURN

Returns TRUE if the reaction ID is in the data store, else FALSE.

cas

    my $cas = $fig->cas($cid);

Return the Chemical Abstract Service (CAS) ID for the compound, if known.

cid

ID of the compound whose CAS ID is desired.

RETURN

Returns the CAS ID of the specified compound, or an empty string if the CAS ID is not known or does not exist.

cas_to_cid

    my $cid = $fig->cas_to_cid($cas);

Return the compound id (cid), given the Chemical Abstract Service (CAS) ID.

cas

CAS ID of the desired compound.

RETURN

Returns the ID of the compound corresponding to the specified CAS ID, or an empty string if the CAS ID is not in the data store.

all_reactions

    my @rids = $fig->all_reactions();

Return a list containing all of the KEGG reaction IDs.

reversible

    my $flag = $fig->reversible($rid);

Return TRUE if the specified reaction is reversible. A reversible reaction has no main direction. The connector is symbolized by <=> instead of =>.

rid

ID of the relevant reaction.

RETURN

Returns TRUE if the specified reaction is reversible, else FALSE. If the reaction does not exist, returns TRUE.

reaction_direction

    my $rev = $fig->reaction_direction($rid);

Returns an array of triplets mapping from reactions in the context of maps to reversibility.

rid

ID of the relevant reaction.

RETURN

Return B if the reaction proceeds in both directions, L if it proceeds from right to left, or R if it proceeds from left to right (by convention the "substrates" are on the left and the "products" are on the right).

reaction2comp

    my @tuples = $fig->reaction2comp($rid, $which, $paths);

Return the substrates or products for a reaction. In any event (i.e., whether you ask for substrates or products), you get back a list of 3-tuples. Each 3-tuple will contain

    [$cid,$stoich,$main]

Stoichiometry indicates how many copies of the compound participate in the reaction. It is normally numeric, but can be things like "n" or "(n+1)". $main is 1 iff the compound is considered "main" or "connectable".

rid

ID of the reaction whose compounds are desired.

which

TRUE if the products (right side) should be returned, FALSE if the substrates (left side) should be returned.

paths

Optional list of paths to check whether compound is "main"

RETURN

Returns a list of 3-tuples. Each tuple contains the ID of a compound, its stoichiometry, and a flag that is TRUE if the compound is one of the main participants in the reaction. If paths are specified, the flag indicates whether the compound is main in any of the specified paths.
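The 3-tuples are easy to turn into a readable reaction string. The sketch below works on mock tuples in the documented [$cid,$stoich,$main] layout; sketch_format_reaction is a hypothetical helper, the compound IDs are made up, and this is not how displayable_reaction is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: render substrates and products as one line
sub sketch_format_reaction {
    my ($substrates, $products, $reversible) = @_;
    my $side = sub {
        join ' + ', map {
            my ($cid, $stoich) = @$_;
            # Stoichiometry may be non-numeric, e.g. "n" or "(n+1)"
            $stoich eq '1' ? $cid : "$stoich $cid";
        } @{ $_[0] };
    };
    # Reversible reactions use <=> as the connector, others use =>
    return join ' ', $side->($substrates), ($reversible ? '<=>' : '=>'), $side->($products);
}

# Mock tuples: [compound ID, stoichiometry, main flag]
my @substrates = (['C00001', '1', 0], ['C00002', '2', 1]);
my @products   = (['C00003', 'n', 1]);

print sketch_format_reaction(\@substrates, \@products, 1), "\n";

# Keep only the "main" (connectable) compounds
my @main = map { $_->[0] } grep { $_->[2] } @substrates, @products;
print "main: @main\n";
```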

catalyzed_by

    my @ecs = $fig->catalyzed_by($rid);

Return the ECs (roles) that are reputed to catalyze the reaction. Note that we are currently just returning the ECs that KEGG gives. We need to handle the incompletely specified forms (e.g., 1.1.1.-), but we do not do it yet.

rid

ID of the reaction whose catalyzing roles are desired.

RETURN

Returns the IDs of the roles that catalyze the reaction.

catalyzes

    my @ecs = $fig->catalyzes($role);

Returns the reaction IDs of the reactions catalyzed by the specified role (normally an EC).

role

ID of the role whose reactions are desired.

RETURN

Returns a list containing the IDs of the reactions catalyzed by the role.

displayable_reaction

    my $displayString = $fig->displayable_reaction($rid);

Returns a string giving the displayable version of a reaction.

all_maps

    my @maps = $fig->all_maps();

Return all of the KEGG maps in the data store.

ec_to_maps

    my @maps = $fig->ec_to_maps($ec);

Return the set of maps that contain a specific functional role. The role can be specified by an EC number or a full-blown role ID.

ec

The EC number or role ID of the role whose maps are desired.

RETURN

Returns a list of the IDs for the maps that contain the specified role.

role_to_maps

This is an alternate name for ec_to_maps.

map_to_ecs

    my @ecs = $fig->map_to_ecs($map);

Return the set of functional roles (usually ECs) that are contained in the functionality depicted by a map.

map

ID of the KEGG map whose roles are desired.

RETURN

Returns a list of EC numbers for the roles in the specified map.

map_name

    my $name = $fig->map_name($map);

Return the descriptive name covering the functionality depicted by the specified map.

map

ID of the map whose description is desired.

RETURN

Returns the descriptive name of the map, or an empty string if no description is available.

neighborhood_of_role

usage: @roles = $fig->neighborhood_of_role($role)

Returns a list of functional roles that we consider to be "the neighborhood" of $role.

roles_of_function

    my @roles = $fig->roles_of_function($func);

Returns a list of the functional roles implemented by the specified function. This method parses the role data out of the function name, and does not require access to the database.

func

Name of the function whose roles are to be parsed out.

RETURN

Returns a list of the roles performed by the specified function.

protein_subsystem_to_roles

    my $roles = $fig->protein_subsystem_to_roles($peg, $subsystem);

Return the roles played by a particular PEG in a particular subsystem. If the protein is not part of the subsystem, an empty list will be returned.

peg

ID of the protein whose role is desired.

subsystem

Name of the relevant subsystem.

RETURN

Returns a reference to a list of the roles performed by the specified PEG in the specified subsystem.

is_BRC_genome

$fig->is_BRC_genome($genome) returns true if $genome is a BRC genome

is_NMPDR_genome

$fig->is_NMPDR_genome($genome) returns true if $genome is an NMPDR genome

seqs_with_role

    my @pegs = $fig->seqs_with_role($role,$who);

Return a list of the pegs that implement $role. If $who is not given, it defaults to "master". The system returns all pegs with an assignment made by either "master" or $who (if it is different from the master) that implement $role. Note that this includes pegs for which the "master" annotation disagrees with that of $who: the master's annotation implements $role and $who's does not.

seqs_with_roles_in_genomes

usage: $result = $fig->seqs_with_roles_in_genomes($genomes,$roles,$made_by)

This routine takes a pointer to a list of genomes ($genomes) and a pointer to a list of roles ($roles) and looks up all of the sequences that connect to those roles according to either the master assignments or those made by $made_by. Again, you will get assignments for which the "master" assignment connects, but the $made_by does not.

A hash is returned. The keys to the hash are genome IDs for which at least one sequence was found. $result->{$genome} will itself be a hash, assuming that at least one sequence was found for $genome. $result->{$genome}->{$role} will be set to a pointer to a list of 2-tuples. Each 2-tuple will contain [$peg,$function], where $function is the one for $made_by (which may not be the one that connected).
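The nested result can be flattened into rows as in the sketch below. The $result structure here is mock data built to match the description above (genome, then role, then a list of [$peg,$function] pairs); in real use it would be the value returned by seqs_with_roles_in_genomes.

```perl
use strict;
use warnings;

# Mock data in the documented shape
my $result = {
    '83333.1' => {
        'Histidine kinase' => [
            ['fig|83333.1.peg.10', 'Histidine kinase'],
            ['fig|83333.1.peg.42', 'Sensor protein'],
        ],
    },
};

# Flatten to (genome, role, peg, function) rows
my @rows;
for my $genome (sort keys %$result) {
    for my $role (sort keys %{ $result->{$genome} }) {
        for my $pair (@{ $result->{$genome}{$role} }) {
            my ($peg, $function) = @$pair;
            push @rows, [$genome, $role, $peg, $function];
        }
    }
}
print join("\t", @$_), "\n" for @rows;
```

Note that the second row's function differs from the role it is filed under, which is the situation the text describes: the function shown is the one for $made_by, not necessarily the one that connected.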

largest_clusters

usage: @clusters = $fig->largest_clusters($roles,$user)

This routine can be used to find the largest clusters containing some of the designated set of roles. A list of clusters is returned. Each cluster is a pointer to a list of pegs.

Bidirectional Best Hits

best_bbh_candidates

usage: @candidates = $fig->best_bbh_candidates($genome,$cutoff,$requested,$known)

This routine returns a list of up to $requested candidates from $genome. A candidate is a BBH against one of the PEGs in genomes from the list given by @$known. Each entry in the list is a 3-tuple:

    [CandidatePEG,KnownBBH,Pscore]

best_bbh_candidates_additional

usage: @candidates = $fig->best_bbh_candidates_additional($genome,$cutoff,$requested,$known)

This routine returns a list of up to $requested candidates from $genome. A candidate is a BBH against one of the PEGs in genomes from the list given by @$known. The method collects additional information from the similarities and is used in the subsystem extension. Each entry in the list is a 10-tuple:

    [CandidatePEG,KnownBBH,Pscore,fraction, b1, e1, b2, e2, ln1, ln2]

DNA Sequences

extract_seq

usage: $seq = &FIG::extract_seq($contigs,$loc)

This is just a little utility routine that I have found convenient. It assumes that $contigs is a hash that contains IDs as keys and sequences as values. $loc must be of the form

       Contig_Beg_End

where Contig is the ID of one of the sequences; Beg and End give the coordinates of the sought subsequence. If Beg > End, it is assumed that you want the reverse complement of the subsequence. This routine plucks out the subsequence for you.
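The behaviour described above can be sketched in a few lines. This is an illustration, not the routine's actual code; sketch_extract_seq is a hypothetical name, coordinates are assumed to be 1-based and inclusive, and the contigs are made-up sequences.

```perl
use strict;
use warnings;

# Hypothetical stand-in for extract_seq
sub sketch_extract_seq {
    my ($contigs, $loc) = @_;
    # The contig ID may itself contain underscores, so anchor on the last two numbers
    my ($contig, $beg, $end) = $loc =~ /^(.+)_(\d+)_(\d+)\z/;
    return undef unless defined $contig && exists $contigs->{$contig};
    my $dna = $contigs->{$contig};
    my ($lo, $hi) = $beg <= $end ? ($beg, $end) : ($end, $beg);
    # ASSUMPTION: 1-based, inclusive coordinates
    return undef if $lo < 1 || $hi > length($dna);
    my $seq = substr($dna, $lo - 1, $hi - $lo + 1);
    if ($beg > $end) {
        # Beg > End: return the reverse complement
        $seq = reverse $seq;
        $seq =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $seq;
}

# Made-up contigs
my %contigs = (c1 => 'AACCGGTT', 'NC_000913' => 'ATGCAT');
print sketch_extract_seq(\%contigs, 'c1_1_4'), "\n";
print sketch_extract_seq(\%contigs, 'c1_4_1'), "\n";
```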

contigs_of

    my @contig_ids = $fig->contigs_of($genome);

Returns a list of all of the contigs occurring in the designated genome.

genome

ID of the genome whose contigs are desired.

RETURN

Returns a list of the IDs for the contigs occurring in the specified genome.

number_of_contigs

usage: $n=$fig->number_of_contigs($genome)

This uses the SQL count function to count the number of contigs. It should be a lot faster than pulling all the contigs and counting them.

In fact, it causes about a 10-fold increase in speed! Compare fig n_contigs and fig number_of_contigs

all_contigs

usage: @contig_ids = $fig->all_contigs($genome)

Returns a list of all of the contigs occurring in the designated genome.

contig_ln

usage: $n = $fig->contig_ln($genome,$contig)

Returns the length of $contig from $genome.

get_dna_seq

    my $seq = $fig->get_dna_seq($fid);

Returns the DNA sequence for an FID

fid

FIG identifier of the feature whose sequence is desired

RETURN

DNA sequence

dna_seq

usage: $seq = $fig->dna_seq($genome,@locations)

Returns the concatenated subsequences described by the list of locations. Each location must be of the form

    Contig_Beg_End

where Contig must be the ID of a contig for genome $genome. If Beg > End the location describes a stretch of the complementary strand.

Taxonomy

taxonomy_of

usage: $taxonomy = $fig->taxonomy_of($genome_id)

Returns the taxonomy of the specified genome. Gives the taxonomy down to genus and species.

get_taxonomy_id_of

usage: $taxonomyID = $fig->get_taxonomy_id_of($genome_id)

Returns the taxonomy ID of the specified genome. If no taxonomy ID is found the genome id without ".\d+" suffix will be returned.
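
The fallback rule amounts to stripping the version suffix from the genome ID. A minimal sketch of that one step follows; fallback_taxonomy_id is a hypothetical name, and the lookup of a stored taxonomy ID is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating only the documented fallback:
# a genome ID such as "83333.1" yields "83333" when no taxonomy ID is stored.
sub fallback_taxonomy_id {
    my ($genome_id) = @_;
    (my $tax_id = $genome_id) =~ s/\.\d+$//;
    return $tax_id;
}

print fallback_taxonomy_id('83333.1'), "\n";
```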

set_taxonomy_id_for

usage: $taxonomyID = $fig->set_taxonomy_id_for($genome_id)

Sets the taxonomy ID for the specified genome.

taxonomy_list

usage: $taxonomy = $fig->taxonomy_list()

Returns the taxonomy list of all organisms in a hash ref. Gives the taxonomy down to genus and species.

is_bacterial

usage: $fig->is_bacterial($genome)

Returns true iff the genome is bacterial.

is_archaeal

usage: $fig->is_archaeal($genome)

Returns true iff the genome is archaeal.

is_prokaryotic

usage: $fig->is_prokaryotic($genome)

Returns true iff the genome is prokaryotic.

is_eukaryotic

usage: $fig->is_eukaryotic($genome)

Returns true iff the genome is eukaryotic.

is_viral

usage: $fig->is_viral($genome)

Returns true iff the genome is viral.

is_plasmid

usage: $fig->is_plasmid($genome)

Returns true iff the genome is marked as being a plasmid.

is_environmental

usage: $fig->is_environmental($genome)

Returns true if the genome is from an environmental sample.

sort_genomes_by_taxonomy

usage: @genomes = $fig->sort_genomes_by_taxonomy(@list_of_genomes)

This routine is used to sort a list of genome IDs to put them into taxonomic order.

crude_estimate_of_distance

usage: $dist = $fig->crude_estimate_of_distance($genome1,$genome2)

There are a number of places where we need estimates of the distance between two genomes. This routine will return a value between 0 and 1, where a value of 0 means "the genomes are essentially identical" and a value of 1 means "the genomes are in different major groupings" (the groupings are archaea, bacteria, euks, and viruses). The measure is extremely crude.

sort_fids_by_taxonomy

usage: @sorted_by_taxonomy = $fig->sort_fids_by_taxonomy(@list_of_fids)

Sorts a list of feature IDs based on the taxonomies of the genomes that contain the features.

Literature Methods

active_subsystems

    my $ssHash = $fig->active_subsystems($genome, $allFlag);

Get all the subsystems in which a genome is present. The return value is a hash which maps each subsystem name to the code for the variant used by the specified genome.

genome

ID of the genome whose subsystems are desired.

allFlag (optional)

If TRUE, all subsystems are returned, with unknown variants marked by a variant code of -1 and iffy variants marked by a code of 0. If FALSE or omitted, only subsystems in which the variant is definitively known are returned. The default is FALSE.
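
The effect of allFlag can be pictured as a filter over a hash of variant codes. The sketch below is an assumption about shape only: filter_variants and the subsystem names are made up, and codes are compared as strings because variant codes are not always plain numbers.

```perl
use strict;
use warnings;

# Hypothetical filter over a { subsystem name => variant code } hash.
# Codes -1 (unknown) and 0 (iffy) are dropped unless $allFlag is true.
sub filter_variants {
    my ($variants, $allFlag) = @_;
    my %kept;
    for my $name (keys %$variants) {
        my $code = $variants->{$name};
        next if !$allFlag && ($code eq '-1' || $code eq '0');
        $kept{$name} = $code;
    }
    return \%kept;
}

# Made-up subsystem names and codes, for illustration only.
my %variants = ('Subsystem A' => '1', 'Subsystem B' => '0', 'Subsystem C' => '-1');
my $known = filter_variants(\%variants);
my $all   = filter_variants(\%variants, 1);
print join(', ', sort keys %$known), "\n";
print join(', ', sort keys %$all), "\n";
```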

Subsystem Methods

is_experimental_subsystem

Returns whether a subsystem is experimental, which is the opposite of usable.

is_private_subsystem

Returns whether a subsystem is private, meaning that it cannot be exported. This is just the opposite of exchangeable.

nmpdr_subsystem

Gets and sets whether the subsystem should be published with the NMPDR. Specifically writes a file called NMPDR in the subsystem directory.

Use:

    $fig->nmpdr_subsystem($ssa, 1);    # to set it as an nmpdr subsystem
    $fig->nmpdr_subsystem($ssa, -1);   # to set it as NOT an nmpdr subsystem
    $fig->nmpdr_subsystem($ssa);       # to test whether it is an nmpdr subsystem

distributable_subsystem

Gets and sets whether the subsystem is freely distributable and should be included in new releases.

Use:

    $fig->distributable_subsystem($ssa, 1);    # to set it as a distributable subsystem
    $fig->distributable_subsystem($ssa, -1);   # to set it as NOT a distributable subsystem
    $fig->distributable_subsystem($ssa);       # to test whether it is a distributable subsystem

all_subsystems

    my @names = $fig->all_subsystems();

Return a list of all of the subsystems in the data store.

all_usable_subsystems

    my @names = $fig->all_usable_subsystems();

Return a list of all of the subsystems in the data store that are "usable", that is, not experimental or deleted.

Use the subsystem information cache if valid.

index_subsystems

Run indexing on one or more subsystems. If no subsystems are specified, everything is reindexed. Otherwise only the specified subsystems are indexed. Note that this method just launches index_subsystems as a background job. Returns the process ID of the child job.

    $pid = $fig->index_subsystems("Alkanesulfonates Utilization");   # do only Alkanesulfonates Utilization
    $pid = $fig->index_subsystems(@ss);                              # do subsystems in @ss
    $pid = $fig->index_subsystems();                                 # do all subsystems

perform_subsystem_salvage

    my $glist = [['273035.1', '273035.4']];
    my $pmap = { 'fig|273035.1.peg.1' => 'fig|273035.4.peg.4', ... };
    $fig->perform_subsystem_salvage($glist, $pmap);

For each subsystem in this SEED, perform a subsystem salvage operation for each old-genome / new-genome pair in $glist. This operation will determine if the old genome exists in the subsystem. If it does, the new genome is added to the subsystem, and we attempt to map the pegs from the cells in the old subsystem's row to the new subsystem. If all pegs map, we copy the variant code for the genome. If some pegs do not map, we prepend a * to the variant code before copying.
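
The variant-code rule can be shown on a single spreadsheet row. This is a sketch under stated assumptions, not the salvage code itself: salvage_row is a hypothetical name, a row is modelled as a list of cells, and each cell as a list of peg IDs.

```perl
use strict;
use warnings;

# Hypothetical sketch of the variant-code rule only.
# $row:     array ref of cells; each cell is an array ref of old peg IDs.
# $pmap:    hash ref mapping old peg ID => new peg ID.
# $variant: variant code of the old genome.
sub salvage_row {
    my ($row, $pmap, $variant) = @_;
    my $all_mapped = 1;
    my @new_row;
    for my $cell (@$row) {
        my @mapped = grep { defined } map { $pmap->{$_} } @$cell;
        $all_mapped = 0 if @mapped < @$cell;    # at least one peg was lost
        push @new_row, \@mapped;
    }
    my $new_variant = $all_mapped ? $variant : "*$variant";
    return (\@new_row, $new_variant);
}

my $pmap = { 'fig|273035.1.peg.1' => 'fig|273035.4.peg.4' };
my ($new_row, $new_variant) =
    salvage_row([ ['fig|273035.1.peg.1'], ['fig|273035.1.peg.2'] ], $pmap, '1');
print "$new_variant\n";    # peg.2 has no mapping, so the code is starred
```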

all_constructs

Hmmm...

subsystem_version

 my $version=subsystem_version($subsystem_name)

Returns the current version of the subsystem.

subsystem_classification

Get or set the classification of the subsystem. Added by RAE in response to the changes made on the SEED wiki.

If a reference to an array is supplied, it is saved as the new classification of the subsystem. Regardless, the current classification is returned as a reference to an array. A reference to an empty array is returned if a valid subsystem is not supplied, or if no classification is known.

The classification is stored as a tab-separated list of things in $subsys/CLASSIFICATION. There is no control over what the things are.
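
The storage format is simple enough to sketch. Only the file name CLASSIFICATION and the tab-separated layout come from the description; the helper names, the single-line layout, and the temporary directory standing in for a subsystem directory are assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helpers; the real method combines get and set in one call.
sub write_classification {
    my ($subsys_dir, $classes) = @_;
    open(my $fh, '>', "$subsys_dir/CLASSIFICATION") or return 0;
    print $fh join("\t", @$classes), "\n";
    close($fh);
    return 1;
}

sub read_classification {
    my ($subsys_dir) = @_;
    # A missing file means no classification is known: return an empty list.
    open(my $fh, '<', "$subsys_dir/CLASSIFICATION") or return [];
    my $line = <$fh>;
    close($fh);
    return [] unless defined $line;
    chomp $line;
    return [ split /\t/, $line ];
}

my $dir = tempdir(CLEANUP => 1);    # stands in for a subsystem directory
write_classification($dir, ['Made-up class', 'Made-up subclass']);
print join(' / ', @{ read_classification($dir) }), "\n";
```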

all_subsystem_classifications

    my @classifications = $fig->all_subsystem_classifications();

Return a list of all the subsystem classifications. Each element in the list will contain a main subsystem class and a basic subsystem class. The resulting list enables us to determine easily what the three-level subsystem tree would look like.

subsystem_curator

usage: $curator = $fig->subsystem_curator($subsystem_name)

Return the curator of a subsystem.

subsystem_info

Returns the number of diagrams of the passed subsystem.

subsystems_for_genome

usage: @subsystems = $fig->subsystems_for_genome($genome, $all)

Return the list of subsystems in which the genome has been entered.

@subsystems is a list of subsystem names.

It will only return those subsystems with a variant code other than 0 or -1, unless the $all argument is "true" (in which case all subsystems are returned).

If $all is 2 then it will return all subsystems with a variant code other than -1.

subsystem_genomes

usage: $genomes = $fig->subsystem_genomes($subsystem_name, $all)

Return the list of genomes in the subsystem.

$genomes is a list of tuples (genome_id, name)

Unless $all is set to true, it will only return those genomes with a variant code other than 0 or -1.

readSpreadsheetForGenomes

    my $genomeList = $fig->readSpreadsheetForGenomes($fileName, $all);

Read the genomes from a specific subsystem file. This allows the client to get the genome data for a backup subsystem.

fileName

Name of the subsystem spreadsheet file.

all

If TRUE, all genomes will be read. Otherwise, only those genomes with a specific variant code (i.e. not 0 or -1) will be returned.

RETURN

Returns a reference to a list of 2-tuples, each consisting of a genome ID and the genome's name.

get_subsystem

    my $subsysObject = $fig->get_subsystem($name, $force_load);

Return a subsystem object for manipulation of the named subsystem. If the subsystem does not exist, an undefined value will be returned.

name

Name of the desired subsystem.

force_load

TRUE to reload the subsystem from the data store even if it is already cached in memory, else FALSE.

RETURN

Returns a blessed object that allows access to subsystem data, or an undefined value if the subsystem does not exist.

clear_subsystem_cache

    $fig->clear_subsystem_cache();

Delete all subsystems from the subsystem cache. This is not normally needed, because the cache is kept fairly small. However, in cases where all of the subsystems are needed, the cache grows by more than a gigabyte, and because the subsystems point back to the FIG object, the memory is not cleaned up properly. Calling this method before you release the FIG object removes that problem.

subsystem_to_roles

    my @roles = $fig->subsystem_to_roles($subsysID);

Return a list of the roles for the specified subsystem.

subsysID

Name (ID) of the subsystem whose roles are to be listed.

RETURN

Returns a list of role IDs.

install_subsystem_directory_on_server

Install the given local subsystem directory on the SEED at the URL provided. If authentication is required, the given username and password will be used.

Uses an HTTP POST of the tarfile of the contents of the local directory to the install_subsystem_dir.cgi CGI script.

subsystems_for_peg

Return the list of subsystems and roles that this peg appears in. Returns an array. Each item in the array is a reference to a tuple of subsystem and role. If the last argument ($noaux) is "true", only non-auxiliary roles will be returned.

subsystems_for_peg_complete

Return the list of subsystems that this peg appears in. Returns an array. Each item in the array is a reference to a tuple of subsystem, role, variant and is_auxiliary.

subsystems_for_pegs_complete

Return the list of subsystems, roles and variants that the pegs appear in. Returns a hash keyed by peg. Each item in the hash is a reference to a tuple of subsystem, role and variant. If the last argument ($include_aux) is "true", auxiliary roles will also be returned.

subsystems_roles

Return the list of subsystems and roles for every peg in subsystems. Returns an array. Each item in the array is a reference to a triple of subsystem, role, and peg.

subsystems_for_role

Return a list of subsystems, roles, and proteins containing a given role.

Returns an array. Each item in the array is a reference to a triple of subsystem, role, and peg.

subsystems_for_ec

Return a list of subsystems, roles, and proteins containing an EC number.

Returns an array. Each item in the array is a reference to a triple of subsystem, role, and peg.

assigned_pegs_in_subsystems

Return a list of [peg, function, subsystem, role in subsystem].

assigned_pegs_not_in_ss

Return all pegs with non-hypothetical assignments that are not in a subsystem.

assigned_pegs

Return a list of [peg, function, subsystem, role in subsystem] for every non-hypothetical protein, whether or not it is in a subsystem.

subsystem_roles

Return a list of all roles present in locally-installed subsystems. The return is a hash keyed on role name with each value a list of subsystem names.

get_genome_subsystem_count

    my $num_subsystems = $fig->get_genome_subsystem_count($genomeID);

Return the number of subsystems of the genome identified by $genomeID.

genomeID

ID of the genome whose number of subsystems is to be returned.

RETURN

Returns the number of subsystems.

get_all_subsystem_pegs

    my @pegData = $fig->get_all_subsystem_pegs($genomeID);

Return the subsystems, roles, and variant codes for all features in the specified genome. Unlike get_genome_subsystem_data, this method returns all pegs, regardless of the variant code.

genomeID

ID of the relevant genome.

RETURN

Returns a hash that maps each subsystem ID to a list of 3-tuples, each consisting of a role ID, a peg ID, and a variant code.

get_genome_subsystem_data

    my $roleList = $fig->get_genome_subsystem_data($genomeID);

Return the roles and pegs for a genome's participation in subsystems. The subsystem name, role ID, and feature ID will be returned for each of the genome's subsystem-related PEGs.

genomeID

ID of the genome whose PEG breakdown is desired.

RETURN

Returns a reference to a list of 3-tuples. Each tuple consists of a subsystem name, a role ID, and a feature ID.

get_genome_stats

    my ($gname,$szdna,$pegs,$rnas,$taxonomy) = $fig->get_genome_stats($genomeID);

Return basic statistics about a genome.

genomeID

ID of the relevant genome.

RETURN

Returns a 5-tuple containing the genome name, number of base pairs, number of PEG features, number of RNA features, and the taxonomy string.

get_genome_assignment_data

    my $assignList = $fig->get_genome_assignment_data($genomeID);

Return the functional assignments and pegs for a genome. The feature ID and assigned function will be returned for each of the genome's PEGs.

genomeID

ID of the genome whose PEG breakdown is desired.

RETURN

Returns a list of 2-tuples. Each tuple consists of a peg ID and its master functional assignment.

get_valid_cache_file

If the given cache file (the name is relative to the FIG cache directory) exists and is less than a day old, open it and return a filehandle. The one-day limit is currently fixed and should be parameterized at some point.
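
A minimal sketch of the freshness test, with the age limit as an argument. The name open_if_fresh and the $max_age_days parameter are hypothetical, and the path is used as given rather than resolved against the FIG cache directory.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical sketch; $max_age_days defaults to the one-day limit.
sub open_if_fresh {
    my ($path, $max_age_days) = @_;
    $max_age_days = 1 unless defined $max_age_days;
    return unless -f $path;
    # -M gives the age of the file in days, measured from script start time.
    return unless (-M $path) < $max_age_days;
    open(my $fh, '<', $path) or return;
    return $fh;
}

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/example.cache";    # made-up cache file name
open(my $out, '>', $file) or die "Cannot write $file: $!";
print $out "cached data\n";
close($out);

if (my $fh = open_if_fresh($file)) {
    print "cache hit: ", scalar(<$fh>);
    close($fh);
}
```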

add_dlit

    $rc = $fig->add_dlit( 
                          -status   => 'D',       # required 
                          -peg      => $peg,      # or -md5 => $md5,  # one is required 
                          -pubmed   => $pubmed,   # required 
                          -curator  => 'RossO',   # required 
                          -go       => '',        # default = '' 
                          -override => 1);        # default = 0

This adds a dlit tuple. The currently supported arguments are

    -status =>          ' '  for not curated 
                        'D'  for dlit (direct literature on role) 
                        'G'  for genome data (propagates to all ' ' entries for this article) 
                        'N'  for not relevant 
                        'R'  for relevant, but not dlit
    -md5    =>          supply an md5 hash code for the peg, not the id.
    -peg    =>          the peg being connected to literature.  This peg will
                        be treated as a representative of the set that have the
                        same protein sequence.
    -pubmed =>          pubmed ID (all numeric, but stored as string)
    -curator =>         curator making the assertion (30 char max)
    -go     =>          an optional list of 3-character codes separated by commas
    
    -override =>        0 -> if there is an existing tuple, ignore this request 
                        1 -> if there is an existing tuple, replace it

The returned value will be

                        0 -> the tuple was not inserted 
                        1 -> the tuple was inserted

sub add_dlit {
    my( $self, @parms ) = @_;
    if (! $self->table_exists('dlits')) { system "load_dlits"; }

    my %parms = @parms;        #  Previous code clobbered the defaults
    $parms{-go}       ||= '';  #  Moved default here
    $parms{-override} ||=  0;  #  Moved default here
    #  Check for required parameters
    return 0 if ! $parms{-status};
    return 0 if ! ( $parms{-peg} || $parms{-md5} );
    return 0 if ! $parms{-pubmed};
    return 0 if ! $parms{-curator};
    my $status   = $parms{-status};
    my $peg      = $parms{-peg};
    my $md5      = $peg ? $self->md5_of_peg($peg) : lc $parms{-md5};
    my $pubmed   = $parms{-pubmed};
    my $curator  = $parms{-curator};
       $curator  =~ s/^master://i;     # Strip master from the recorded curator
    my $go       = $parms{-go};
    my $override = $parms{-override};  # Moved here to collect initializations
    my $rdbH = $self->db_handle;
    my $db_resp =
             $rdbH->SQL( "SELECT  status, md5_hash, pubmed, curator, go_code 
                          FROM dlits
                          WHERE ((md5_hash = '$md5') and (pubmed = '$pubmed'))"
                       );
    my $delete;
    if (@$db_resp == 1)
    {
        #  Default is no clobber except uncurated (i.e., $status eq ' ') -- GJO
        if ( ( $db_resp->[0]->[0] ne ' ' ) && ( ! $override ) ) { return 0 }
        
        $rdbH->SQL( "DELETE
                     FROM dlits
                     WHERE ((md5_hash = '$md5') and (pubmed = '$pubmed'))"
                  );
        $delete = join( "\t", 'delete', @{$db_resp->[0]} ) . "\n";
    }
    my $rc =  $rdbH->SQL( "INSERT
                           INTO dlits ( status,md5_hash,pubmed,curator,go_code ) 
                           VALUES ( '$status','$md5','$pubmed','$curator','$go' )"
                        );
    # Add logging
    if ( $rc )
    {
        &verify_dir( "$FIG_Config::data/Dlits" );
        if ( open LOG, ">>$FIG_Config::data/Dlits/dlits.log" )
        {
            print LOG $delete if $delete;
            print LOG join( "\t", 'insert', $status, $md5, $pubmed, $curator, $go ), "\n";
            close LOG;
        }
    }
    if ( $rc && ( $status eq "G") )
    {
        #  Only overwrite ' '  status with 'G' status -- GJO
        #  Update the curator, too -- GJO
        $rc = $rdbH->SQL( "UPDATE dlits
                           SET status = 'G', curator = '$curator'
                           WHERE ( pubmed = '$pubmed' ) AND ( status = ' ' )"
                        );
    }
    return $rc;
}
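
The queries in add_dlit place $md5, $pubmed, $curator and $go directly inside quoted SQL literals, so a caller may want to check the arguments first. The following is a sketch of such a check and is not part of FIG.pm: dlit_args_ok is a hypothetical name, and the 32-hex-digit form of the md5 code is an assumption. The status codes, the 30-character curator limit and the 3-character go codes come from the argument list above.

```perl
use strict;
use warnings;

# Hypothetical pre-check. It accepts only values that cannot break out of
# a single-quoted SQL literal.
sub dlit_args_ok {
    my (%parms) = @_;

    my $status = $parms{-status};
    return 0 unless defined $status && $status =~ /^[ DGNR]$/;
    return 0 unless $parms{-peg} || $parms{-md5};
    return 0 unless defined $parms{-pubmed} && $parms{-pubmed} =~ /^\d+$/;

    # Assumption: an md5 hash code is written as 32 hexadecimal digits.
    return 0 if defined $parms{-md5} && $parms{-md5} !~ /^[0-9a-fA-F]{32}$/;

    my $curator = $parms{-curator};
    return 0 unless defined $curator && length $curator;
    (my $name = $curator) =~ s/^master://i;    # same stripping as add_dlit
    return 0 if length($name) > 30 || $name =~ /['\\]/;

    # Optional list of 3-character codes separated by commas.
    my $go = defined $parms{-go} ? $parms{-go} : '';
    return 0 unless $go =~ /^(?:\w{3}(?:,\w{3})*)?$/;
    return 1;
}

# Made-up argument values, for illustration only.
print dlit_args_ok(-status => 'D', -md5 => '0' x 32,
                   -pubmed => '12345', -curator => 'RossO'), "\n";
```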

dlit_status

    $rc = $fig->dlit_status( 
                          -md5      => $md5,      # or -peg, one is required 
                          -peg      => $peg,      # or -md5, one is required 
                          -pubmed   => $pubmed,   # required 
                          );

This returns the current status code of a dlit, or undefined. The currently supported arguments are

    -md5    =>          supply an md5 hash code for the peg, not the id.
    -peg    =>          the peg being connected to literature.  This peg will
                        be treated as a representative of the set that have the
                        same protein sequence.
    -pubmed =>          pubmed ID (all numeric, but stored as string)

The returned value will be

    $status                               called in scalar context
    ( $status_code, $curator, $go_code )  called in list context

sub dlit_status {
    my( $self, @parms ) = @_;
    if (! $self->table_exists('dlits')) { system "load_dlits"; }

    my %parms = @parms;        #  Previous code clobbered the defaults
    #  Check for required parameters
    if ( ! ( $parms{-peg} || $parms{-md5} ) || ! $parms{-pubmed} )
    {
        return wantarray ? () : undef;
    }
    
    my $peg    = $parms{-peg};
    my $md5    = $peg ? $self->md5_of_peg($peg) : lc $parms{-md5};
    my $pubmed = $parms{-pubmed};
    my $rdbH = $self->db_handle;
    my $db_resp = $rdbH->SQL( "SELECT  status, curator, go_code 
                               FROM    dlits
                               WHERE ((md5_hash = '$md5') and (pubmed = '$pubmed'))"
                            );
    return $db_resp && @$db_resp ? ( wantarray ? @{$db_resp->[0]} : $db_resp->[0]->[0] )
                                 : ( wantarray ? () : undef );
}

all_dlits

    $dlits = $fig->all_dlits();

Returns a reference to an array of all current dlit data.

The returned value is

    [ [ status, md5_hash, pubmed, curator, go_code ], ... ]

sub all_dlits {
    my($self) = @_;
    my $rdbH = $self->db_handle;

    my $db_resp = $rdbH->SQL( "SELECT * FROM dlits" );
    return [ sort { $a->[1] cmp $b->[1] }  #  Sorted by protein
             @$db_resp
           ];
}

all_dlits_status

    $dlits = $fig->all_dlits_status( $status );

Returns a reference to an array of all current dlit data with the specified status.

The returned value is

    [ [ status, md5_hash, pubmed, curator, go_code ], ... ]

sub all_dlits_status {
    my( $self, $status ) = @_;
    my $rdbH = $self->db_handle;

    my $db_resp = $rdbH->SQL( "SELECT * FROM dlits where status = '$status'" );
    return [ sort { $a->[1] cmp $b->[1] }  #  Sorted by protein
             @$db_resp
           ];
}

export_dlits

    $rc = $fig->export_dlits(); 
    $rc = $fig->export_dlits( $file );

Writes all current dlit data to $FIG_Config::data/Dlits/dlits, or to a specified file.

The returned value is 1 on success, or 0 on failure.

sub export_dlits {
    my ( $self, $file ) = @_;
    my $rdbH = $self->db_handle;

    $file ||= "$FIG_Config::data/Dlits/dlits";
    #  Query before opening, so a failed query does not truncate the file
    my $db_resp = $rdbH->SQL( "SELECT * FROM dlits" );
    $db_resp || return 0;
    open( DLITS, ">$file" ) || return 0;
    foreach my $x ( @$db_resp ) { print DLITS join( "\t", @$x ), "\n" }
    close(DLITS);
    return 1;
}

add_title

    $rc = $fig->add_title( $pubmed_id, $title )

Add a pubmed title to the database. If the pubmed_id is not already present, the id and title are added, and the return code reflects the success or failure of the add. If the pubmed_id is already defined and the titles match, there is no change, and the return code is 2. If the id exists and the title is different, no change is made, and the return code is 0. To change an existing title, use:

    $rc = $fig->update_title( $pubmed_id, $title )

The returned values are:

    0  attempting to change a title, or failure; 
    1  successful addition of a new title; or 
    2  existing and new titles are the same
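
The return codes of add_title and update_title can be compared side by side with an in-memory hash standing in for the database table. This is a sketch of the documented return values only. The function names, the explicit $titles argument, and the example ID and titles are made up.

```perl
use strict;
use warnings;

# $titles is a hash ref (pubmed ID => title) standing in for the database.
sub add_title_sketch {
    my ($titles, $id, $title) = @_;
    return 0 unless defined $id && defined $title;
    if (exists $titles->{$id}) {
        # Same title: nothing to do (2). Different title: refuse (0).
        return $titles->{$id} eq $title ? 2 : 0;
    }
    $titles->{$id} = $title;
    return 1;
}

sub update_title_sketch {
    my ($titles, $id, $title) = @_;
    return 0 unless defined $id && defined $title;
    return 2 if exists $titles->{$id} && $titles->{$id} eq $title;
    $titles->{$id} = $title;    # add a new title or replace the old one
    return 1;
}

my %titles;
print add_title_sketch(\%titles, '12345', 'First title'), "\n";       # new entry
print add_title_sketch(\%titles, '12345', 'Second title'), "\n";      # refused
print update_title_sketch(\%titles, '12345', 'Second title'), "\n";   # changed
```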

update_title

    $rc = $fig->update_title( $pubmed_id, $title )

Add or change a pubmed title in the database. If the pubmed_id is not already present, the id and title are added. If the id exists and the title is different, the title is changed. In both cases the return code reflects the success or failure of the operation. If the pubmed_id is already defined and the titles match, there is no change, and the return code is 2.

The returned values are:

    0  on failure; 
    1  successful addition or change of a title; or 
    2  existing and new titles are the same

get_title

    $title = $fig->get_title( $pubmed_id )

Get the title for a literature (pubmed) ID.

Returned value:

    $title   upon success
    undef    upon failure

all_titles

    $titles = $fig->all_titles()

Get all pubmed_id, title pairs.

Returned value:

    [ [ id, title ], ... ]   upon success
    []                       upon failure

PEG Translations

tough_search($pegs, $seq_of, $tran_peg, $sought)

pegs

Not used.

seq_of

Hash from peg to peg sequence.

tran_peg

Hash into which translated pegs are placed.

sought

Hash keyed on the list of pegs we're looking for.

find_genome_by_content

Find a genome given the number of contigs, number of nucleotides, and checksum. We pass in a potential name for the genome as a quick starting check.

Links

fid_links

    my @links = $fig->fid_links($fid);

Return a list of hyperlinks to web resources about a specified feature.

fid

ID of the feature whose hyperlinks are desired.

RETURN

Returns a list of raw HTML strings representing hyperlinks to web pages relating to the specified feature.

fids_with_link_to

    my @links = $fig->fids_with_link_to("text");

Return a list of tuples of [fid, link], where text is a free-text string that is matched against the URL. You can use this to get all the links that point to PIR, for example to identify all proteins that are members of PIR superfamilies.

text

A free-text match to the URL. The match is made using the SQL "like" command, so try to be as specific as possible.

RETURN

Returns a list of [fid, link] tuples.

Search Database

Searches the database for objects that match the query string in some way.

Returns a list of results if the query is ambiguous, or a unique identifier otherwise.

Peg Searches and Similarities

Some routines for dealing with peg search and similarities.

This is code lifted from pom.cgi and reformatted for more general use.

Find the given role in the given (via CGI params) organism.

We do this by finding a list of pegs that are annotated to have this role in other organisms that are "close enough" to our organism.

We then find pegs in this organism that are similar to these pegs.

Utility Methods

is_ec

    my $flag = FIG::is_ec($role);

Return TRUE if the specified role is an EC number, else FALSE. This can be used to determine whether a role is specified via a role ID or the role's EC number.

role

Role ID or EC number to check.

RETURN

Returns TRUE if the specified role specification is an EC number, and FALSE if it is a true role ID.
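
A test of this kind is typically a single pattern match. The sketch below is not the FIG.pm pattern. It assumes an EC number has four dot-separated fields, where the first is a number and each later field is a number or a dash for "unspecified".

```perl
use strict;
use warnings;

# Hypothetical check under the stated assumption about the EC number format.
sub looks_like_ec {
    my ($role) = @_;
    return $role =~ /^\d+\.(?:\d+|-)\.(?:\d+|-)\.(?:\d+|-)$/ ? 1 : 0;
}

for my $role ('1.1.1.1', '2.7.-.-', 'Alcohol dehydrogenase') {
    printf "%-22s %s\n", $role, looks_like_ec($role) ? 'EC number' : 'role ID';
}
```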

run_in_background

Background job support.

To turn a script into a background job, invoke $fig->run_in_background($coderef). This will cause $coderef to be invoked as a background job. This means its output will be written to $FIG_Config::data/Global/background_jobs/<pid>, and that the job shows up in, and can be killed from, the SEED control panel.
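
The general pattern is a fork in which the child redirects its output to a file named after its own process ID. The sketch below shows that pattern only, on a Unix-like system. run_in_background_sketch and the $job_dir argument are hypothetical, and registration with the control panel is not shown.

```perl
use strict;
use warnings;
use POSIX ();
use File::Temp qw(tempdir);

# Hypothetical sketch: fork a child, send its output to <job_dir>/<pid>,
# run the code reference, and return the child's PID to the parent.
sub run_in_background_sketch {
    my ($coderef, $job_dir) = @_;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    return $pid if $pid;                 # parent: hand back the child's PID

    # Child: $$ is now the child's own PID.
    open(STDOUT, '>', "$job_dir/$$") or POSIX::_exit(1);
    open(STDERR, '>&', \*STDOUT)     or POSIX::_exit(1);
    $coderef->();
    close(STDOUT);                       # flush before the hard exit
    POSIX::_exit(0);                     # skip cleanup inherited from the parent
}

my $dir = tempdir(CLEANUP => 1);         # stands in for the background_jobs directory
my $pid = run_in_background_sketch(sub { print "indexing done\n" }, $dir);
waitpid($pid, 0);                        # a real caller would carry on instead
print "job $pid wrote ", -s "$dir/$pid", " bytes\n";
```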

External Interface Methods

This section contains the functionality introduced by the interface with GenDB. The first two functions simply register when GenDB has a version of the genome, so we can set links to it when displaying PEGs.

has_genome

usage: has_genome("GenDB",$genome)

Invoking this routine just records that GenDB has a copy of the genome designated by $genome.

dropped_genome

usage: dropped_genome("GenDB",$genome)

Invoking this routine just records that GenDB should no longer be viewed as having a copy of the genome designated by $genome.

link_to_system

usage: $url = link_to_system("GenDB",$fid) # usually $fid is a peg, but it can be other types of features, as well

This routine is used to get a URL that can be used to "flip" from one system to the other. If the feature is unknown to the system, undef should be returned.

Feature Update Methods

The following routines support alteration of features

delete_feature

usage: $fig->delete_feature($user,$fid)

Invoking this routine deletes the feature designated by $fid.

add_feature

    my $fid = $fig->add_feature($user,$genome,$type,$location,$aliases,$translation,$fid);

Invoking this routine adds the feature, returning a new (generated) $fid. It is also possible to specify the feature ID, which is recommended if the feature is to be permanent. (In order to do this the ID needs to be allocated from the clearinghouse machine.) The translation is optional and only applies to PEGs.

genome

ID of the genome to which the feature belongs.

type

Type of the feature (peg, rna, etc.)

location

Location of the feature, in the form of a comma-delimited list of location specifiers. These are of the form contig_begin_end, where contig is the ID of a contig, and begin and end are the starting and stopping offsets of the location. These offsets are 1-based, and depending on the strand, the beginning offset could be larger than the ending offset.

aliases

A comma-delimited list of alias names for the feature.

translation (optional)

The protein translation of the feature, if it is a peg.

fid (optional)

The ID to give to the new feature. If this parameter is omitted, an ID will be generated automatically.

RETURN

Returns the new feature's ID if successful, or undef if an error occurred.
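
The location format accepted by add_feature can be unpacked as follows. This is an illustrative parser, not FIG.pm code: parse_location is a hypothetical name, the contig ID in the example is made up, and the strand is inferred from the order of the two 1-based offsets as described above.

```perl
use strict;
use warnings;

# Hypothetical parser for a comma-delimited list of contig_begin_end specifiers.
sub parse_location {
    my ($location) = @_;
    my @segments;
    for my $spec (split /,/, $location) {
        # The greedy (.+) lets the contig ID contain underscores.
        my ($contig, $beg, $end) = $spec =~ /^(.+)_(\d+)_(\d+)$/
            or die "Bad location specifier: $spec\n";
        push @segments, {
            contig => $contig,
            begin  => $beg,
            end    => $end,
            strand => ($beg <= $end ? '+' : '-'),
            length => abs($end - $beg) + 1,
        };
    }
    return \@segments;
}

for my $seg (@{ parse_location('contig_1_190_255,contig_1_400_300') }) {
    print "$seg->{contig} $seg->{strand} $seg->{length}\n";
}
```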

clearinghouse_next_feature_id

    my $val = $fig->clearinghouse_next_feature_id($genome, $type)

Return the next feature ID that would be allocated by the clearinghouse for the given genome and feature type.

clearinghouse_register_metagenome_taxon_id

    my $tax = $fig->clearinghouse_register_metagenome_taxon_id($username, $genome_name)

Register a new taxon id for the MG-RAST metagenome server.

clearinghouse_register_subsystem_id

    my $tax = $fig->clearinghouse_register_subsystem_id($ss_name);

Return a subsystem's short ID. Short IDs are maintained at a special clearinghouse web site. If the subsystem does not yet have a short ID, a new one will be assigned by the clearinghouse and returned.

ss_name

Name of the subsystem whose ID is desired.

RETURN

ID of the desired subsystem.

clearinghouse_lookup_subsystem_by_id

    my $tax = $fig->clearinghouse_lookup_subsystem_by_id($ss_name)

Register a subsystem id for the given subsystem name. Returns the existing id if already present.

clearinghouse_register_features

    my $val = $fig->clearinghouse_register_features($genome, $type, $num)

Register $num new features of type $type on genome $genome. Returns the starting index for the new features.

call_start

usage: $fig->call_start($genome,$loc,$translation,$against)

This routine can be invoked to produce an estimate of the correct start, given a location in a genome believed to be a protein-encoding gene, along with a set of PEGs that are believed to be orthologs. If called in a list context, it returns a list containing

    a string representing the estimated start location
    a confidence measure (better than 0.2 seems to be pretty solid)
    a new translation

If called in a scalar context, it returns its best prediction of the start.

pick_gene_boundaries

usage: $fig->pick_gene_boundaries($genome,$loc,$translation)

This routine can be invoked to expand a region of similarity to potential gene boundaries. It does not try to find the best start, but only the one that is first after the beginning of the ORF. It returns a list containing the predicted location and the expanded translation. Thus, you might use

    ($new_loc,$new_tran) = $fig->pick_gene_boundaries($genome,$loc,$tran);
    $recalled = $fig->call_start($genome,$new_loc,$new_tran,\@others);

to get the location of a recalled gene (in, for example, the process of correcting a frameshift).

change_location_of_feature

usage: $fig->change_location_of_feature($fid,$location,$translation)

Invoking this routine changes the location of the feature. The $translation argument is optional (and applies only to PEGs).

The routine returns 1 on success and 0 on failure.

genome_to_gg

Render a genome's contig as GenoGraphics objects.

Markup Helper Methods

This section contains the methods used to read and write Markup data. The markup data associates labels with sections of a feature's translation.

In the SEED, Markup data is stored in a separate file for each marked feature in the feature type subdirectory for an organism. So, for example, the PEG markups for fig|83333.1.peg.4 would be in the file

    FIG/Data/Organisms/83333.1/peg/markup4.tbl

The file is stored in tab-separated form. Each line contains the following fields

start

1-based offset into the translation of the first amino acid to mark

len

number of amino acids to mark

label

label identifying the type of markup

Reading and writing these tiny files is extremely fast, but they do have more overhead than would be expected if the data were stored in a single flat file managed by pointers from the FIG database. If that approach becomes desirable, then only this section of FIG.pm needs to be changed.

ReadMarkups

    my $marks = $fig->ReadMarkups($fid);

Read the markup data for the specified feature. The markings are returned as a list of triples. Each triple contains the start location of a markup, the length of the markup, and the label.

fid

ID of the feature whose markups are to be read.

RETURN

Returns a reference to a list of 3-tuples. Each list element will consist of the starting offset of the markup (1-based), the length of the markup, and the label. All values are expressed in terms of distance into the protein translation of the feature.

WriteMarkups

    $fig->WriteMarkups($fid, \@marks);

Write out the markups for the specified feature. If the markup file for the specified feature does not exist, it will be created. If it does exist, it will be completely overwritten.

fid

ID of the feature whose markups are to be written

marks

Reference to a list of markups. Each markup is in the form of a 3-tuple consisting of the 1-based offset to the start of the markup, the length of the markup, and the markup label. The offset and length are specified in terms of the protein translation string.
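
The tab-separated markup format can be read and written with very little code. The sketch below uses stand-ins that take a file name instead of a feature ID. The function names, the label values and the temporary file are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for WriteMarkups: overwrite the file completely.
sub write_markups_sketch {
    my ($file, $marks) = @_;
    open(my $fh, '>', $file) or die "Cannot write $file: $!";
    for my $mark (@$marks) {
        print $fh join("\t", @$mark), "\n";    # start, len, label
    }
    close($fh);
}

# Hypothetical stand-in for ReadMarkups: return a reference to a list of triples.
sub read_markups_sketch {
    my ($file) = @_;
    my @marks;
    open(my $fh, '<', $file) or return \@marks;    # no file means no markups
    while (my $line = <$fh>) {
        chomp $line;
        my ($start, $len, $label) = split /\t/, $line, 3;
        push @marks, [$start, $len, $label];
    }
    close($fh);
    return \@marks;
}

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/markup4.tbl";
write_markups_sketch($file, [ [1, 20, 'signal'], [45, 8, 'motif'] ]);   # made-up labels
print join(' ', @$_), "\n" for @{ read_markups_sketch($file) };
```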

_MarkupFileName

    my $name = FIG::_MarkupFileName($fid);

Return the name of the file containing the markup data for the specified feature.

fid

ID of the feature whose markup file is desired.

RETURN

Returns the full path of the file containing the feature markups for the feature desired.
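A minimal sketch of how such a path could be derived from a feature ID, based only on the example path shown in this section. The function markup_file_name and its $organisms_dir parameter are hypothetical; the real method presumably takes its base directory from the FIG configuration.

```perl
use strict;
use warnings;

# Hypothetical helper; $organisms_dir stands in for wherever the
# Organisms tree lives on a given SEED instance.
sub markup_file_name {
    my ($organisms_dir, $fid) = @_;
    # A feature ID looks like fig|<genome>.<type>.<number>.
    my ($genome, $type, $num) = $fid =~ /^fig\|(\d+\.\d+)\.([^.]+)\.(\d+)$/
        or return;    # not a recognizable feature ID
    return "$organisms_dir/$genome/$type/markup$num.tbl";
}

print markup_file_name("FIG/Data/Organisms", "fig|83333.1.peg.4"), "\n";
```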

UserData Helper Methods

This section contains the methods used to implement UserData access. User data is stored in a subdirectory given by the user's name under the Users directory in the Global directory tree. In other words, the data for the default user basic would be at $FIG_Config::global/Users/basic.

In each directory, the capabilities.tbl file contains the capability data and the preferences.tbl file contains the preferences. Currently, preferences are stored in a single file, but if performance becomes a problem we may split them by category.

Each of these files has two columns of data-- a key and a value. In the preferences file the key is a hierarchical construct with the pieces separated by colons, and the value is essentially a free-format string understood only by the application. In the capabilities file the key is a group name, and the value is an access level-- RW (full access), RO (read-only access), or NO (no access).

Group names and key names are not allowed to contain white space. Tabs are used to separate them from the value strings or access levels. The value strings for preferences cannot contain tabs or new-lines. A backslash escape mechanism will be used to allow tabs and new-lines to be specified in the preference values.
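The exact escape sequences are not spelled out here. As a sketch, assume \t for a tab, \n for a new-line, and \\ for a literal backslash; escape_value and unescape_value are hypothetical names.

```perl
use strict;
use warnings;

# Assumed escape scheme: \\ for backslash, \t for tab, \n for new-line.
sub escape_value {
    my ($value) = @_;
    $value =~ s/\\/\\\\/g;    # backslashes first, so later escapes stay intact
    $value =~ s/\t/\\t/g;
    $value =~ s/\n/\\n/g;
    return $value;
}

sub unescape_value {
    my ($text) = @_;
    my %plain = (t => "\t", n => "\n", '\\' => '\\');
    # One left-to-right pass, so an escaped backslash followed by "t"
    # is not mistaken for an escaped tab.
    $text =~ s/\\([tn\\])/$plain{$1}/g;
    return $text;
}

my $value  = "line one\nline two\twith a tab";
my $record = join("\t", 'notes:welcome', escape_value($value));
print "$record\n";
```

After escaping, the only tab in the record is the one separating the key from the value, which keeps the file line-oriented.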

The files are sorted by key, to make updates easier.

The special Security_Default subdirectory is used to track the default security options for each secure object. The object's security group and default level are specified in a file whose name is formed by appending the object ID to the object type with an extension of "tbl". So, for example, the file containing the security default information for Genome 83333.1 would be

    $FIG_Config::global/Users/Security_Default/Genome_83333.1.tbl

Each of these is a tiny file with the group name and default access level for that organism or subsystem. The two fields of the file are tab-separated, and any new-line character at the end is ignored.
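The naming rule and the record layout can be sketched as follows. Both helpers are hypothetical, and $global_dir stands in for $FIG_Config::global so the example runs without a SEED installation.

```perl
use strict;
use warnings;

# Hypothetical helper: build the security-default file name from the
# object type and ID, following the naming rule described above.
sub object_capability_file {
    my ($global_dir, $object_type, $object_id) = @_;
    return "$global_dir/Users/Security_Default/${object_type}_$object_id.tbl";
}

# Hypothetical helper: split the file's single record into group and
# level, ignoring any new-line at the end.
sub parse_default_record {
    my ($text) = @_;
    $text =~ s/\r?\n\z//;
    my ($group, $level) = split /\t/, $text, 2;
    return ($group, $level);
}

print object_capability_file('/vol/seed/Global', 'Genome', '83333.1'), "\n";
my ($group, $level) = parse_default_record("Curators\tRO\n");
print "group=$group level=$level\n";
```

The directory /vol/seed/Global and the group name Curators are made-up values.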

GetDefault

    my ($group, $level) = $fig->GetDefault($objectID, $objectType);

Return the group name and default access level for the specified object.

objectID

ID of the object whose capabilities data is desired.

objectType

Type of the object whose capabilities data is desired. This should be expressed as a Sprout entity name. Currently, the only types supported are Genome and Subsystem.

RETURN

Returns a two-element list. The first element is the name of the group to which the object belongs; the second is the default access level (RW, RO, or NO). If the object is not found, an empty list should be returned.

GetPreferences

    my $preferences = $fig->GetPreferences($userID, $category);

Return a map of preference keys to values for the specified user in the specified category.

userID

ID of the user whose preferences are desired.

category (optional)

Name of the category whose preferences are desired. If omitted, all preferences should be returned.

RETURN

Returns a reference to a hash mapping each preference key to a value. The keys are fully-qualified; in other words, the category name is included. It is acceptable for the hash to contain key-value pairs outside the category. In other words, if it's easier for you to read the entire preference set into memory, you can return that one set every time this method is called without worrying about the extra keys.
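A caller that wants only one category can filter the returned hash itself. The sketch below assumes the category is the leading piece of the key, followed by a colon; preferences_in_category is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper. Assumption: a fully-qualified key starts with
# the category name followed by a colon.
sub preferences_in_category {
    my ($all, $category) = @_;
    return +{ %$all } unless defined $category;
    my $prefix = "$category:";
    my %subset = map  { ($_ => $all->{$_}) }
                 grep { index($_, $prefix) == 0 } keys %$all;
    return \%subset;
}

my $all = {
    'color:bg'  => 'white',
    'color:fg'  => 'black',
    'page:size' => '50',
};
my $colors = preferences_in_category($all, 'color');
print join(", ", sort keys %$colors), "\n";
```

Including the colon in the prefix keeps a category such as color from matching keys in a category named colors.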

GetCapabilities

    my $level = $fig->GetCapabilities($userID);

Return a map of group names to access levels (RW, RO, or NO) for the specified user.

userID

ID of the user whose access level is desired.

RETURN

Returns a reference to a hash mapping group names to the user's access level for that group.

AllowsUpdates

    my $flag = $fig->AllowsUpdates();

Return TRUE if this access object supports updates, else FALSE. If the access object does not support updates, none of the SetXXXX methods will be called.

SetDefault

    $fig->SetDefault($objectID, $objectType, $group, $level);

Set the group and default access level for the specified object.

objectID

ID of the object whose access level and group are to be set.

objectType

Type of the relevant object. This should be expressed as a Sprout entity name. Currently, only Genome and Subsystem are supported.

group

Name of the group to which the object will belong. A user's access level for this group will override the default access level.

level

Default access level. This is the access level used for users who do not have an explicit capability specified for the object's group.
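The interaction between a user's capabilities and an object's default can be sketched as a single lookup. The helper effective_level is hypothetical and only illustrates the override rule stated above.

```perl
use strict;
use warnings;

# Hypothetical helper: an explicit capability for the object's group
# wins; otherwise the object's default level applies.
sub effective_level {
    my ($capabilities, $group, $default_level) = @_;
    my $level = $capabilities->{$group};
    return defined $level ? $level : $default_level;
}

my %capabilities = (Curators => 'RW');    # made-up capability data
print effective_level(\%capabilities, 'Curators', 'RO'), "\n";
print effective_level(\%capabilities, 'Visitors', 'NO'), "\n";
```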

SetCapabilities

    $fig->SetCapabilities($userID, \%groupLevelMap);

Set the access levels for the specified user in the specified groups.

userID

ID of the user whose capabilities are to be updated.

groupLevelMap

Reference to a hash that maps group names to access levels. The legal access levels are RW (read-write), RO (read-only), and NO (no access). An undefined value for the access level indicates the default level should be used for that group. The map will not replace all of the user's capability data; instead, it overrides existing data, with the undefined values indicating the specified group should be deleted from the list.
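The override semantics can be sketched on in-memory hashes. The helper apply_capability_map is hypothetical, and rejecting unknown levels is an assumption rather than documented behavior.

```perl
use strict;
use warnings;

# Hypothetical helper: overlay $changes on $current. An undefined level
# deletes the group; groups not mentioned in $changes are left alone.
sub apply_capability_map {
    my ($current, $changes) = @_;
    my %result = %$current;
    for my $group (keys %$changes) {
        my $level = $changes->{$group};
        if (!defined $level) {
            delete $result{$group};
        } elsif ($level =~ /^(?:RW|RO|NO)$/) {
            $result{$group} = $level;
        } else {
            # Assumption: anything other than RW, RO, or NO is an error.
            die "Invalid access level '$level' for group $group\n";
        }
    }
    return \%result;
}

my $updated = apply_capability_map(
    { Curators => 'RW', Visitors => 'RO' },
    { Visitors => undef, Annotators => 'RO' },
);
print join(", ", map { "$_=$updated->{$_}" } sort keys %$updated), "\n";
```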

SetPreferences

    $fig->SetPreferences($userID, \%preferenceMap);

Set the preferences for the specified user.

userID

ID of the user whose preferences are to be updated.

preferenceMap

Reference to a hash that maps each preference key to its value. The keys should be fully-qualified (that is, they should include the category name). A preference key mapped to an undefined value will use the default preference value for that key. The map will not replace all of the user's preference data; instead, it overrides existing data, with the undefined values indicating the specified preference should be deleted from the list.

CleanupUserData

    $fig->CleanupUserData();

Release any data being held in memory for use by the UserData object.

UserData Utilities

GetObjectCapabilityFile

    my $fileName = FIG::_GetObjectCapabilityFile($objectType, $objectID);

This is an internal method that computes the name of the file containing the default group and access data for a specified object. It returns the file name.

to_structured_english

    my ($ev_code_list, $subsys_list, $english_string) = $fig->to_structured_english($fig, $peg, $escape_flag);

Create a structured English description of the evidence codes for a PEG, in either HTML or text format. In addition to the structured text, we also return the subsystems and evidence codes for the PEG in list form.

peg

ID of the protein or feature whose evidence is desired.

escape_flag

TRUE if the output text should be HTML, else FALSE

RETURN

Returns a three-element list. The first element is a reference to a list of evidence codes, the second is a list of the subsystems containing the PEG, and the third is the readable text description of the evidence.

GetUserDataDirectory

    my $directoryName = FIG::_GetUserDataDirectory($userName);

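Return the name of the directory containing the user's preference and capability data. Under the current policy a missing directory is created automatically; undef is reserved for a future policy under which an unknown user name is treated as invalid.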

userName

Name of the user whose directory is desired.

RETURN

Returns the name of the user's preference/capability directory. If the user does not exist, the directory will be created automatically. If this policy is changed, return undef to indicate an invalid user name.

GetUserDataFile

    my %userData = FIG::_GetUserDataFile($userID, $type, $prefix);

Create a hash from the user data file of the specified type. The user data file contains two tab-delimited fields. The first field will be read in as the key of the hash and the second as the data value. The file must be sorted, and only records beginning with the character string in $prefix will be put in the hash.

userID

Name of the user whose preference or capability data is desired.

type

Type of file desired: preferences or capabilities.

RETURN

Returns a hash containing all the key/value pairs in the user file of the specified type. If the file is not found, an empty hash is returned.
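One reason to require a sorted file is that a prefix scan can stop early. The sketch below assumes plain byte-wise ordering and reads from an already-open handle; read_with_prefix is a hypothetical helper, and whether the real method stops early is not stated here. Values are returned raw, with unescaping left out.

```perl
use strict;
use warnings;

# Hypothetical reader: $handle is open on a key-sorted, two-column,
# tab-delimited file.
sub read_with_prefix {
    my ($handle, $prefix) = @_;
    $prefix = '' unless defined $prefix;
    my %data;
    while (my $line = <$handle>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        if (index($key, $prefix) == 0) {
            $data{$key} = $value;
        } elsif ($key gt $prefix) {
            last;    # sorted input: no later key can start with the prefix
        }
    }
    return %data;
}

my $text = "color:bg\twhite\ncolor:fg\tblack\npage:size\t50\n";
open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";
my %colors = read_with_prefix($in, 'color:');
close($in);
print join(", ", map { "$_=$colors{$_}" } sort keys %colors), "\n";
```

An empty or omitted prefix matches every key, so the whole file is loaded.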

ProcessUpdates

    FIG::_ProcessUpdates($fileName, \%map);

Apply the specified updates to a key-value file. The records in the key-value file must be sorted. If a key in the map matches a key in the file, the file's key value is replaced. If a key in the map is not found in the file, it is added. If a key in the map is found in the file and it has an undefined value in the map, then the key is deleted.

fileName

Name of the file to be updated.

map

Reference to a hash mapping keys to values. The keys may not contain any whitespace. The value will be escaped before it is written.
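The update rules amount to a merge of two key-sorted sequences. The following sketch works on in-memory lists of [key, value] pairs rather than on a file, and merge_updates is a hypothetical name. It assumes string ordering for the keys, and that a new key with an undefined value is simply not added.

```perl
use strict;
use warnings;

# Hypothetical in-memory version: $records is a key-sorted list of
# [key, value] pairs standing in for the file's contents.
sub merge_updates {
    my ($records, $updates) = @_;
    my @pending = sort keys %$updates;
    my @result;
    for my $record (@$records) {
        my ($key, $value) = @$record;
        # Emit new keys that sort before the current record.
        while (@pending && $pending[0] lt $key) {
            my $new = shift @pending;
            push @result, [$new, $updates->{$new}] if defined $updates->{$new};
        }
        if (@pending && $pending[0] eq $key) {
            # Replace the value, or drop the record when the update is undef.
            my $hit = shift @pending;
            push @result, [$key, $updates->{$hit}] if defined $updates->{$hit};
        } else {
            push @result, [$key, $value];
        }
    }
    # Whatever is left sorts after every existing record.
    for my $new (@pending) {
        push @result, [$new, $updates->{$new}] if defined $updates->{$new};
    }
    return \@result;
}

my $merged = merge_updates(
    [['color:bg', 'white'], ['color:fg', 'black'], ['page:size', '50']],
    { 'color:fg' => 'blue', 'font:size' => '12', 'page:size' => undef },
);
print join("\t", @$_), "\n" for @$merged;
```

Because both inputs are consumed in key order, the output stays sorted and each record is visited only once.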

GetInputKVRecord

    my ($key, $value) = FIG::_GetInputKVRecord($handle);

Read a key/value pair from the specified input file. If we are at end-of-file the key returned will be the Tracer::EOF constant. The key and value are separated by a tab. The value will be unescaped if it exists.

handle

Open handle for the input file.

RETURN

Returns a two-element list. The first element will be the first field of the input record; the second element will be the second field. If we are at end-of-file, the first element will be the Tracer::EOF constant.

PutOutputKVRecord

    FIG::_PutOutputKVRecord($handle, $key, $value);

Write a key-value pair to the output file. The value will automatically be escaped. A tab will be used to separate the fields.

handle

Open output file handle.

key

First field to put in the output record.

value

Value field to put in the output record. It will automatically be escaped. If it is undefined, the method will have no effect. An undefined value therefore serves as a deleted-line marker.
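The end-of-file and deleted-line conventions of the two record methods can be sketched with in-memory file handles. EOF_MARK is a hypothetical stand-in for the Tracer::EOF constant, whose actual value is not given here, and escaping is left out to keep the example short.

```perl
use strict;
use warnings;
use constant EOF_MARK => "\0EOF";    # hypothetical stand-in for Tracer::EOF

# Hypothetical writer: an undefined value writes nothing, which is how
# a key gets dropped from the output.
sub put_kv_record {
    my ($handle, $key, $value) = @_;
    return unless defined $value;
    print $handle "$key\t$value\n";
    return;
}

# Hypothetical reader: returns the end-of-file marker as the key once
# the input is exhausted.
sub get_kv_record {
    my ($handle) = @_;
    my $line = <$handle>;
    return (EOF_MARK, undef) unless defined $line;
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;
    return ($key, $value);
}

my $buffer = '';
open(my $out, '>', \$buffer) or die "Cannot open in-memory file: $!";
put_kv_record($out, 'color:bg', 'white');
put_kv_record($out, 'color:fg', undef);    # skipped, so the key vanishes
put_kv_record($out, 'page:size', '50');
close($out);

open(my $in, '<', \$buffer) or die "Cannot open in-memory file: $!";
while (1) {
    my ($key, $value) = get_kv_record($in);
    last if $key eq EOF_MARK;
    print "$key => $value\n";
}
close($in);
```

Returning a sentinel key instead of an empty list lets a merge loop compare keys without a separate end-of-file test.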

scenario_directory

    FIG->scenario_directory($organism);

Returns the scenario directory of an organism. If the organism is 'All', returns the directory containing all possible paths through scenarios.

$organism

The seed-taxonomy id of the organism, e.g. 83333.1, or 'All'.

get_scenario_info

    FIG->get_scenario_info(@subsystem_names);

Returns a reference to a hash containing the scenario information for the specified subsystems. The hash keys are subsystem names. The hash values are hashes keyed by subsystem name, whose values are yet more hashes. The keys of these innermost hashes are the strings "input_compounds", "output_compound", "map_ids", "additional_reactions" and "ignore reaction"; the values are references to lists of KEGG ids. If a subsystem has no scenarios, no hash entry is created for that subsystem.

@subsystem_names

A list of subsystem names.
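To make the three-level shape of the returned structure concrete, here is a sketch that builds a made-up instance and walks it. Every name and ID in the sample is a placeholder, and ids_for_field is a hypothetical helper, not a FIG method.

```perl
use strict;
use warnings;

# Made-up sample with the nesting described above: outer key, middle
# key, then the five field names mapping to lists of IDs.
my $info = {
    'Example_Subsystem' => {
        'Example_Entry' => {
            'input_compounds'      => ['C00031'],
            'output_compound'      => ['C00022'],
            'map_ids'              => ['00010'],
            'additional_reactions' => [],
            'ignore reaction'      => [],
        },
    },
};

# Hypothetical helper: collect every ID stored under one field name,
# across all outer and middle keys.
sub ids_for_field {
    my ($data, $field) = @_;
    my @ids;
    for my $outer (sort keys %$data) {
        for my $middle (sort keys %{ $data->{$outer} }) {
            push @ids, @{ $data->{$outer}{$middle}{$field} || [] };
        }
    }
    return @ids;
}

print join(",", ids_for_field($info, 'input_compounds')), "\n";
```

A subsystem without scenarios has no outer key at all, so callers should test with exists before descending.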

FIG::Job module

init_das

Initialize a DAS data query object.