This is the main object for access to the SEED data store. The data store itself is a combination of flat files and a database. The flat files can be moved easily between systems and the database rebuilt as needed.
A reduced set of this object's functions is available via the SFXlate object. The SFXlate object uses a single database to represent all its genomic information. It provides a much smaller capability for updating the data, and eliminates all similarities except for bidirectional best hits.
The key to making the FIG system work is proper configuration of the
FIG_Config.pm file. This file contains names and URLs for the key
directories as well as the type and login information for the database.
FIG was designed to operate as a series of peer instances. Each instance is updated independently by its owner, and the instances can be synchronized using a process called a peer-to-peer update. The terms SEED instance and peer are used more-or-less interchangeably.
The POD documentation for this module is still in progress, and is provided on an AS IS basis without warranty. If you have a correction and you're not a developer, EMAIL the details to bruce@gigabarb.com and I'll fold it in.
NOTE: The usage example for each method specifies whether it is static
FIG::something
or dynamic
$fig->something
If the method is static and has no parameters (FIG::something()) it can
also be invoked dynamically. This is a general artifact of the
way Perl implements object-oriented programming.
We save the DB handle, cache taxonomies, and put a few other odds and ends in the FIG object. We expect users to invoke these services using the object $fig constructed using:
use FIG;
my $fig = new FIG;
$fig is then used as the basic mechanism for accessing FIG services. It is, of course, just a hash that is used to retain/cache data. The most commonly accessed item is the DB filehandle, which is accessed via $self->db_handle.
We cache genus/species expansions, taxonomies, distances (very crudely estimated) between genomes, and a variety of other things.
my $fig = FIG->new();
This is the constructor for a FIG object. It uses no parameters. If tracing
has not yet been turned on, it will be turned on here. The tracing type and
level are specified by the configuration variables $FIG_Config::trace_levels
and $FIG_Config::trace_type. These defaults can be overridden using the
environment variables Trace and TraceType, respectively.
my $value = $fig->CacheTrick($self, $field => $evalString);
This is a helper method used to create simple field caching in another object. If the named field is found in $self, then it will be returned directly. Otherwise, the eval string will be executed to compute the value. The value is then cached in the $self object so it can be retrieved easily when needed. Use this method to make a FIG data-access object more like an object created by PPO or ERDB.
Hash or blessed object containing the cached fields.
Name of the field desired.
String that can be evaluated to compute the field value.
Returns the value of the desired field.
Returns GO term for GO number from go_number_to_term table in database
my $dbh = $fig->db_handle;
Return the handle to the internal DBrtns object. This allows direct access to the database methods.
my $x = $fig->cached($name);
Return a reference to a hash containing transient data. If no hash exists with the specified name, create an empty one under that name and return it.
The idea behind this method is to allow clients to cache data in the FIG object for later use. (For example, a method might cache feature data so that it can be retrieved later without using the database.) This facility should be used sparingly, since different clients may destroy each other's data if they use the same name.
Name assigned to the cached data.
Returns a reference to a hash that is permanently associated with the specified name. If no such hash exists, an empty one will be created for the purpose.
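The named-cache behavior described above can be sketched as follows. This is a minimal illustration, not the FIG implementation: the CacheDemo package, the _cache key, and the feature_data name are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the FIG object: a blessed hash, as the text describes.
package CacheDemo;

sub new { return bless {}, shift }

# Return the hash stored under $name, creating an empty one on first use.
sub cached {
    my ($self, $name) = @_;
    $self->{_cache}{$name} //= {};
    return $self->{_cache}{$name};
}

package main;

my $obj = CacheDemo->new;
my $features = $obj->cached('feature_data');
$features->{'example.key'} = 'example value';

# A second call with the same name returns the same hash reference,
# which is why two clients sharing a name can overwrite each other's data.
my $again = $obj->cached('feature_data');
print "same hash\n" if $again == $features;
```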
my $name = $fig->get_system_name;
Returns seed, indicating that this object is using the SEED
database. The same method on an SFXlate object will return sprout.
The destructor releases the database handle.
my $sameFlag = FIG::same_seqs($s1, $s2);
Return TRUE if the specified protein sequences are considered equivalent and FALSE otherwise. The sequences should be presented in nr-analysis form, which is in reverse order and upper case with the stop codon omitted.
The sequences will be considered equivalent if the shorter one matches the initial portion of the longer one and is no more than 30% smaller. Since the sequences are in nr-analysis form, equivalent start portions mean that the sequences have the same tail. The importance of the tail is that the stop point of a PEG is easier to find than the start point, so a same tail means that the two sequences are equivalent except for the choice of start point.
First protein sequence, reversed and with the stop codon removed.
Second protein sequence, reversed and with the stop codon removed.
Returns TRUE if the two protein sequences are equivalent, else FALSE.
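The equivalence rule can be sketched as below. This is one reading of the description, not the FIG code itself: the name same_seqs_sketch is hypothetical, and "no more than 30% smaller" is interpreted as the shorter sequence having at least 70% of the longer one's length.

```perl
use strict;
use warnings;

# Both inputs are assumed to be in nr-analysis form (reversed, upper case,
# stop codon removed), so a shared prefix here means a shared tail.
sub same_seqs_sketch {
    my ($s1, $s2) = @_;
    my ($short, $long) = length($s1) <= length($s2) ? ($s1, $s2) : ($s2, $s1);
    # Assumed reading of the 30% rule: shorter length >= 70% of longer length.
    return 0 if length($short) < 0.7 * length($long);
    # The shorter sequence must match the start of the longer one.
    return index($long, $short) == 0 ? 1 : 0;
}

print same_seqs_sketch('ABCDEFGHIJ', 'ABCDEFGH') ? "equivalent\n" : "different\n";
```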
$fig->is_locked_fid($fid);
Returns 1 if and only if $fid is locked.
$fig->lock_fid($user,$fid);
Sets a lock on annotations for $fid.
$fig->unlock_fid($user,$fid);
Removes the lock on annotations for $fid.
$fig->delete_genomes(\@genomes);
Delete the specified genomes from the data store. This requires making system calls to move and delete files.
my $ok = $fig->add_genome($genomeF, $force, $skipnr);
Add a new genome to the data store. A genome's data is kept in a directory by itself, underneath the main organism directory. This method essentially moves genome data from an external directory to the main directory and performs some indexing tasks to integrate it.
Name of the directory containing the genome files. This should be a fully-qualified directory name. The last segment of the directory name should be the genome ID.
This will ignore errors thrown by verify_genome_directory. This is bad, and you should never do it, but I am in the situation where I need to move a genome from one machine to another, and although I trust the genome I can't.
We don't always want to add the proteins into the nr database, for example with a metagenome that has been called by blastx. This will just skip appending the proteins into the NR file.
Returns TRUE if successful, else FALSE.
my ($mode, @genomes) = FIG::parse_genome_args(@args);
Extract a list of genome IDs from an argument list. If the argument list is empty, return all the genomes in the data store.
This is a function that is performed by many of the FIG command-line utilities. The
user has the option of specifying a list of specific genome IDs or specifying none
in order to get all of them. If your command requires additional arguments in the
command line, you can still use this method if you shift them out of the argument list
before calling. The $mode return value will be all if the user asked for all of
the genomes or some if the user specified a list of IDs. This is useful to know if,
for example, we are loading a table. If we're loading everything, we can delete the
entire table; if we're only loading some genomes, we must delete them individually.
This method uses the genome directory rather than the database because it may be used before the database is ready.
List of genome IDs. If all genome IDs are to be processed, then this list should be empty.
Returns a list. The first element of the list is all if the user is asking for all
the genome IDs and some otherwise. The remaining elements of the list are the
desired genome IDs.
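The all/some convention can be sketched as follows. The function name and the $allGenomes parameter are hypothetical; the latter stands in for the scan of the genome directory that the real method performs.

```perl
use strict;
use warnings;

# $allGenomes is a hypothetical stand-in for scanning the genome directory.
sub parse_genome_args_sketch {
    my ($allGenomes, @args) = @_;
    return ('all', @$allGenomes) if !@args;   # no IDs given: process everything
    return ('some', @args);                   # explicit list of IDs
}

my @known = ('83333.1', '224308.1');
my ($mode, @genomes) = parse_genome_args_sketch(\@known);

# A table loader can pick its deletion strategy from $mode.
my $strategy = $mode eq 'all' ? 'drop and re-create the table'
                              : 'delete the rows for each listed genome';
print "$mode: $strategy\n";
```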
$fig->reload_table($mode, $table, $flds, $xflds, $fileName, $keyList, $keyName);
Reload a database table from a sequential file. If $mode is all, the table
will be dropped and re-created. If $mode is some, the data for the individual
items in $keyList will be deleted before the table is loaded. Thus, the load
process is optimized for the type of reload.
all if we are reloading the entire table, some if we are only reloading
specific entries.
Name of the table to reload.
String defining the table columns, in SQL format. In general, this is a
comma-delimited set of field specifiers, each specifier consisting of the
field name followed by the field type and any optional qualifiers (such as
NOT NULL or DEFAULT); however, it can be anything that would appear
between the parentheses in a CREATE TABLE statement. The order in which
the fields are specified is important, since it is presumed to be the
order in which they appear in the load file.
Reference to a hash that describes the indexes. The hash is keyed by index name.
The value is the index's field list. This is a comma-delimited list of field names
in order from most significant to least significant. If a field is to be indexed
in descending order, its name should be followed by the qualifier DESC. For
example, the following $xflds value will create two indexes, one for name followed
by creation date in reverse chronological order, and one for ID.
{ name_index => "name, createDate DESC", id_index => "id" }
Fully-qualified name of the file containing the data to load. Each line of the file must correspond to a record, and the fields must be arranged in order and tab-delimited. If the file name is omitted, the table is dropped and re-created but not loaded.
Reference to a list of the IDs for the objects being reloaded. This parameter is
only used if $mode is some.
Name of the key field containing the IDs in the keylist. If omitted, genome is
assumed.
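To show how the parameters fit together, the sketch below turns them into the SQL statements that each mode implies. It is an illustration only: the function name, the table name, and the exact statement text are assumptions, the quoting is deliberately naive, and the real method also loads the data file.

```perl
use strict;
use warnings;

# Hypothetical helper: list the statements each reload mode would need.
sub reload_statements_sketch {
    my ($mode, $table, $flds, $xflds, $keyList, $keyName) = @_;
    $keyName ||= 'genome';                    # documented default key field
    my @sql;
    if ($mode eq 'all') {
        # Whole-table reload: drop, re-create, then rebuild each index.
        push @sql, "DROP TABLE $table";
        push @sql, "CREATE TABLE $table ($flds)";
        for my $idx (sort keys %$xflds) {
            push @sql, "CREATE INDEX $idx ON $table ($xflds->{$idx})";
        }
    } else {
        # Partial reload: delete only the rows for the listed keys.
        for my $key (@$keyList) {
            push @sql, "DELETE FROM $table WHERE $keyName = '$key'";
        }
    }
    return @sql;
}

my @stmts = reload_statements_sketch(
    'all', 'example_table',
    'id varchar(32) NOT NULL, name varchar(64), createDate int',
    { name_index => "name, createDate DESC", id_index => "id" },
);
print "$_\n" for @stmts;
```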
FIG::enqueue_similarities(\@fids);
Queue the passed Feature IDs for similarity computation. The actual computation is performed by create_sim_askfor_pool. The queue is a persistent text file in the global data directory, and this method essentially writes new IDs on the end of it.
Reference to a list of feature IDs.
Creates a similarity computation request from the queued similarities and the current NR.
We keep track of the exported requests in case one gets lost.
$fig->create_sim_askfor_pool($chunk_size);
Creates an askfor pool, which is a snapshot of the current NR and similarity queue. This process clears the old queue.
The askfor pool needs to keep track of which sequences need to be calculated, which have been handed out, etc. To simplify this task we chunk the sequences into fairly small numbers (20k characters) and allocate work on a per-chunk basis. We make use of the relational database to keep track of chunk status as well as the seek locations into the file of sequence data. The initial creation of the pool involves indexing the sequence data with seek offsets and lengths and populating the sim_askfor_index table with this information and with initial status information.
Number of features to put into a processing chunk. The default is 15.
my ($nrPath, $fasta) = $fig->get_sim_work();
Get the next piece of sim computation work to be performed. Returned are the path to the NR and a string containing the fasta data.
$fig->sim_work_done($pool_id, $chunk_id, $out_file);
Declare that the work in pool_id/chunk_id has been completed, and output written to the pool directory (get_sim_work gave it the path).
The ID number of the pool containing the work that just completed.
The ID number of the chunk completed.
The file into which the work was placed.
$fig->schedule_sim_pool_postprocessing($pool_id);
Schedule a job to do the similarity postprocessing for the specified pool.
ID of the pool whose similarity postprocessing needs to be scheduled.
$fig->postprocess_computed_sims($pool_id);
Set up to reduce, reformat, and split the similarities in a given pool. We build a pipe to this pipeline:
reduce_sims peg.synonyms 300 | reformat_sims nr | split_sims dest prefix
Then we put the new sims in the pool directory, and then copy to NewSims.
ID of the pool whose similarities are to be post-processed.
@pools = $fig->get_active_sim_pools();
Return a list of the pool IDs for the sim processing queues that have entries awaiting computation.
my @clusterList = $fig->compute_clusters(\@pegList, $subsystem, $distance);
Partition a list of PEGs into sections that are clustered close together on the genome. The basic algorithm used builds a graph connecting PEGs to other PEGs close by them on the genome. Each connected subsection of the graph is then separated into a cluster. Singleton clusters are thrown away, and the remaining ones are sorted by length. All PEGs in the incoming list should belong to the same genome, but this is not a requirement. PEGs on different genomes will simply find themselves in different clusters.
Reference to a list of PEG IDs.
Subsystem object for the relevant subsystem. This parameter is not used, but is required for compatibility with Sprout.
The maximum distance between PEGs that makes them considered close. If omitted, the distance is 5000 bases.
Returns a list of lists. Each sub-list is a cluster of PEGs.
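The graph-based partitioning can be sketched as follows. This is not the FIG implementation: the input format (a hash mapping each PEG ID to a contig name and a single position) is hypothetical, and sorting the clusters largest first is an assumption about what "sorted by length" means.

```perl
use strict;
use warnings;

# %$loc maps a PEG ID to [contig, position]; both are hypothetical inputs.
sub compute_clusters_sketch {
    my ($loc, $distance) = @_;
    $distance = 5000 unless defined $distance;
    my @pegs = sort keys %$loc;

    # Build the graph: connect PEGs on the same contig within $distance bases.
    my %adj;
    for my $i (0 .. $#pegs) {
        for my $j ($i + 1 .. $#pegs) {
            my ($p, $q) = ($loc->{ $pegs[$i] }, $loc->{ $pegs[$j] });
            next unless $p->[0] eq $q->[0];
            next unless abs($p->[1] - $q->[1]) <= $distance;
            push @{ $adj{ $pegs[$i] } }, $pegs[$j];
            push @{ $adj{ $pegs[$j] } }, $pegs[$i];
        }
    }

    # Each connected component of the graph becomes one cluster.
    my (%seen, @clusters);
    for my $start (@pegs) {
        next if $seen{$start}++;
        my @stack = ($start);
        my @members;
        while (defined(my $peg = pop @stack)) {
            push @members, $peg;
            push @stack, grep { !$seen{$_}++ } @{ $adj{$peg} || [] };
        }
        push @clusters, [sort @members] if @members > 1;   # drop singletons
    }

    # Assumed ordering: largest cluster first.
    my @sorted = sort { @$b <=> @$a } @clusters;
    return @sorted;
}

my %loc = (
    pegA => ['contig1', 100],  pegB => ['contig1', 3000],
    pegC => ['contig1', 7000], pegD => ['contig1', 50000],
    pegE => ['contig2', 100],  pegF => ['contig2', 200],
);
my @clusterList = compute_clusters_sketch(\%loc, 5000);
print join(',', @$_), "\n" for @clusterList;
```

Note that pegA and pegC are more than 5000 bases apart but still land in the same cluster, because both are close to pegB.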
my ($total_entries, $n_finished, $n_assigned, $n_unassigned) = $fig->get_sim_pool_info($pool_id);
Return information about the given sim pool.
Pool ID of the similarity processing queue whose information is desired.
Returns a four-element list. The first is the number of features in the queue; the second is the number of features that have been processed; the third is the number of features that have been assigned to a processor, and the fourth is the number of features left over.
my $result = FIG::get_local_hostname();
Return the local host name for the current processor. The name may be stored in a configuration file, or we may have to get it from the operating system.
my $name = FIG::get_hostname_by_adapter();
Return the local host name for the current network environment.
my $id = FIG::get_seed_id();
Return the Universally Unique ID for this SEED instance. If one does not exist, it will be created.
my ($name, $id, $inst, $email, $parent_id, $description) = FIG::get_release_info();
Return the current data release information.
The release info comes from the file FIG/Data/RELEASE. It is formatted as:
<release-name> <unique id> <institution> <contact email> <unique id of data release this release derived from> <description>
For instance:
----- SEED Data Release, 09/15/2004. 4148208C-1DF2-11D9-8417-000A95D52EF6 ANL/FIG olson@mcs.anl.gov
Test release. -----
If no RELEASE file exists, this routine will create one with a new unique ID. This lets a peer optimize the data transfer by being able to cache ID translations from this instance.
my $title = $fig->Title();
Return the title of this database. For SEED, this will return SEED, for Sprout it will return NMPDR, and so forth.
my $realFig = $fig->FIG();
Return this object. This method is provided for compatibility with SFXlate.
my $date = $fig->get_peer_last_update($peer_id);
Return the timestamp from the last successful peer-to-peer update with the given peer. If the specified peer has made updates, comparing this timestamp to the timestamp of the updates can tell you whether or not the updates have been integrated into your SEED data store.
We store this information in FIG/Data/Global/Peers/<peer-id>.
Universally Unique ID for the desired peer.
Returns the date/time stamp for the last peer-to-peer update performed with the identified SEED instance.
$fig->set_peer_last_update($peer_id, $time);
Manually set the update timestamp for a specified peer. This informs the SEED that you have all of the assignments and updates from a particular SEED instance as of a certain date.
Remove any extra spaces from input fields. This will (currently) remove ^\s, \s$, and concatenate multiple spaces into one.
my $input=$fig->clean_spaces($cgi->param('input'));
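The three clean-up steps can be sketched with plain substitutions. The function name is hypothetical, and collapsing every whitespace run (not only literal spaces) is an assumption.

```perl
use strict;
use warnings;

sub clean_spaces_sketch {
    my ($s) = @_;
    $s =~ s/^\s+//;       # remove leading whitespace
    $s =~ s/\s+$//;       # remove trailing whitespace
    $s =~ s/\s+/ /g;      # collapse each remaining run into one space
    return $s;
}

print '[', clean_spaces_sketch("  hello   world \n"), "]\n";
```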
my $url = FIG::cgi_url();
or
my $url = $fig->cgi_url();
Return the URL for the CGI script directory.
my $url = FIG::top_link();
Return the relative URL for the top of the CGI script directory.
We determine this based on the SCRIPT_NAME environment variable, falling back to FIG_Config::cgi_base if necessary.
my $url = FIG::temp_url();
Return the URL of the temporary file directory.
my $url2 = FIG::plug_url($url);
or
my $url2 = $fig->plug_url($url);
Change the domain portion of a URL to point to the current domain. This essentially relocates URLs into the current environment.
URL to relocate.
Returns a new URL with the base portion converted to the current operating host.
If the URL does not begin with http://, the URL will be returned unmodified.
my $text = $fig->file_read($fileName);
or
my @lines = $fig->file_read($fileName);
or
my $text = FIG::file_read($fileName);
or
my @lines = FIG::file_read($fileName);
Read an entire file into memory. In a scalar context, the file is returned
as a single text string with line delimiters included. In a list context, the
file is returned as a list of lines, each line terminated by a line
delimiter. (For a method that automatically strips the line delimiters,
use Tracer::GetFile.)
Fully-qualified name of the file to read.
In a list context, returns a list of the file lines. In a scalar context, returns a string containing all the lines of the file with delimiters included.
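The context-sensitive return can be sketched with wantarray. The function name is hypothetical, and the sketch only supports the plain-function calling style; a temporary file is created so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

sub file_read_sketch {
    my ($fileName) = @_;
    open(my $fh, '<', $fileName) or die "Cannot open $fileName: $!";
    my @lines = <$fh>;                        # line delimiters are kept
    close $fh;
    # List context gets the lines; scalar context gets one joined string.
    return wantarray ? @lines : join('', @lines);
}

# Build a small temporary file to read back.
my ($tmp, $tmpName) = tempfile(UNLINK => 1);
print {$tmp} "first\nsecond\n";
close $tmp;

my @lines = file_read_sketch($tmpName);       # list context
my $text  = file_read_sketch($tmpName);       # scalar context
print scalar(@lines), " lines, ", length($text), " characters\n";
```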
my $text = $fig->file_head($fileName, $count);
or
my @lines = $fig->file_head($fileName, $count);
or
my $text = FIG::file_head($fileName, $count);
or
my @lines = FIG::file_head($fileName, $count);
Read a portion of a file into memory. In a scalar context, the file portion is returned as a single text string with line delimiters included. In a list context, the file portion is returned as a list of lines, each line terminated by a line delimiter.
Fully-qualified name of the file to read.
Number of lines to read from the file. If omitted, 1 is assumed. If the
non-numeric string * is specified, the entire file will be read.
In a list context, returns a list of the desired file lines. In a scalar context, returns a string containing the desired lines of the file with delimiters included.
my $min = FIG::min(@x);
or
my $min = $fig->min(@x);
Return the minimum numeric value from a list.
List of numbers to process.
Returns the numeric value of the list entry possessing the lowest value. Returns
undef if the list is empty.
my $max = FIG::max(@x);
or
my $max = $fig->max(@x);
Return the maximum numeric value from a list.
List of numbers to process.
Returns the numeric value of the list entry possessing the highest value. Returns
undef if the list is empty.
my $flag = FIG::between($x, $y, $z);
or
my $flag = $fig->between($x, $y, $z);
Determine whether or not $y is between $x and $z.
First edge number.
Number to examine.
Second edge number.
Return TRUE if the number $y is between the numbers $x and $z. The check is inclusive (that is, if $y is equal to $x or $z the function returns TRUE), and the order of $x and $z does not matter. If $x is lower than $z, then the return is TRUE if $x <= $y <= $z. If $z is lower, then the return is TRUE if $x >= $y >= $z.
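The inclusive, order-independent check can be sketched as follows (the function name is hypothetical):

```perl
use strict;
use warnings;

sub between_sketch {
    my ($x, $y, $z) = @_;
    # Order the two edges so the caller does not have to.
    my ($lo, $hi) = $x <= $z ? ($x, $z) : ($z, $x);
    return ($lo <= $y && $y <= $hi) ? 1 : 0;
}

print between_sketch(10, 5, 1) ? "between\n" : "outside\n";
```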
my $code = FIG::get_organism_info_from_ncbi( $taxonomyID );
For a given taxonomy ID, returns a hash containing the scientific name, genetic code, synonyms, and lineage.
my $code = FIG::standard_genetic_code();
Return a hash containing the standard translation of nucleotide triples to proteins. Methods such as translate can take a translation scheme as a parameter. This method returns the default translation scheme. The scheme is implemented as a reference to a hash that contains nucleotide triplets as keys and has protein letters as values.
my $aa_seq = &FIG::translate($dna_seq, $code, $fix_start);
Translate a DNA sequence to a protein sequence using the specified genetic code.
If $fix_start is TRUE, will translate an initial TTG or GTG code to
M. (In the standard genetic code, these two combinations normally translate
to L and V, respectively.)
DNA sequence to translate. Note that the DNA sequence can only contain known nucleotides.
Reference to a hash specifying the translation code. The hash is keyed by nucleotide triples, and the value for each key is the corresponding protein letter. If this parameter is omitted, the standard_genetic_code will be used.
TRUE if the first triple is to get special treatment, else FALSE. If TRUE,
then a value of TTG or GTG in the first position will be translated to
M instead of the value specified in the translation code.
Returns a string resulting from translating each nucleotide triple into a protein letter.
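The translation loop and the start-codon fix can be sketched as below. This is not the FIG implementation: the function name is hypothetical, the code table is a five-entry excerpt rather than the full standard genetic code, and mapping unknown triples to X is an assumption.

```perl
use strict;
use warnings;

sub translate_sketch {
    my ($dna, $code, $fix_start) = @_;
    $dna = uc $dna;
    my $protein = '';
    # Walk the sequence one complete triple at a time.
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my $triple = substr($dna, $i, 3);
        # Assumption: triples missing from the table become X.
        $protein .= exists $code->{$triple} ? $code->{$triple} : 'X';
    }
    # Special handling of an initial TTG or GTG.
    if ($fix_start && $dna =~ /^(?:TTG|GTG)/) {
        substr($protein, 0, 1, 'M');
    }
    return $protein;
}

# Tiny excerpt of a translation table, for illustration only.
my %code = (ATG => 'M', TTG => 'L', GTG => 'V', AAA => 'K', GGC => 'G');

print translate_sketch('TTGAAAGGC', \%code, 0), "\n";
print translate_sketch('TTGAAAGGC', \%code, 1), "\n";
```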
my $dnaR = FIG::reverse_comp($dna);
or
my $dnaR = $fig->reverse_comp($dna);
Return the reverse complement of the specified DNA sequence.
NOTE: for extremely long DNA strings, use rev_comp, which allows you to pass the strings around in the form of pointers.
DNA sequence whose reverse complement is desired.
Returns the reverse complement of the incoming DNA sequence.
my $dnaRP = FIG::rev_comp(\$dna);
or
my $dnaRP = $fig->rev_comp(\$dna);
Return the reverse complement of the specified DNA sequence. The DNA sequence is passed in as a string reference rather than a raw string for performance reasons. If this is unnecessary, use reverse_comp, which processes strings instead of references to strings.
Reference to the DNA sequence whose reverse complement is desired.
Returns a reference to the reverse complement of the incoming DNA sequence.
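The relationship between the two methods can be sketched as follows. The function names are hypothetical, and this version only complements A, C, G, and T; handling of ambiguity codes is left out.

```perl
use strict;
use warnings;

# Reference-in, reference-out version: avoids copying long strings at the call.
sub rev_comp_sketch {
    my ($dnaRef) = @_;
    my $rev = reverse $$dnaRef;               # scalar context reverses the string
    $rev =~ tr/ACGTacgt/TGCAtgca/;            # complement each base
    return \$rev;
}

# Plain-string convenience wrapper around the reference version.
sub reverse_comp_sketch {
    my ($dna) = @_;
    return ${ rev_comp_sketch(\$dna) };
}

my $dna = 'AACG';
print reverse_comp_sketch($dna), "\n";
```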
FIG::verify_dir($dir);
or
$fig->verify_dir($dir);
Ensure that the specified directory exists. If it must be created, the permissions will
be set to 0777.
FIG::run($cmd);
or
$fig->run($cmd);
Run a command. If the command fails, the error will be traced.
FIG::run_gathering_output($cmd, @args);
or
$fig->run_gathering_output($cmd, @args);
Run a command, gathering the output. This is similar to the backtick operator, but it does not invoke the shell. Note that the argument list must be explicitly passed one command line argument per argument to run_gathering_output.
If the command fails, the error will be traced.
($exitcode, $signal, $msg) = &FIG::interpret_error_code($rc);
Determine if the given result code was due to a process exiting abnormally or by receiving a signal.
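A result code of the kind Perl leaves in $? after system or a closed pipe can be decoded as sketched below. The function name and message wording are hypothetical; the bit layout (low seven bits for the signal, high byte for the exit status) is the standard one for $?.

```perl
use strict;
use warnings;

sub interpret_error_code_sketch {
    my ($rc) = @_;
    my $signal   = $rc & 127;                 # low 7 bits: terminating signal
    my $exitcode = $rc >> 8;                  # high byte: exit status
    if ($signal) {
        return (undef, $signal, "Terminated by signal $signal");
    }
    return ($exitcode, undef, "Exited with code $exitcode");
}

my ($exitcode, $signal, $msg) = interpret_error_code_sketch(256);
print "$msg\n";
```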
FIG::augment_path($dirName);
Add a directory to the system path.
This method adds a new directory to the front of the system path. It looks in the configuration file to determine whether this is Windows or Unix, and uses the appropriate separator.
Name of the directory to add to the path.
my ($seq_id, $seq_pointer, $comment) = FIG::read_fasta_record(\*FILEHANDLE);
or
my ($seq_id, $seq_pointer, $comment) = $fig->read_fasta_record(\*FILEHANDLE);
Read and parse the next logical record of a FASTA file. A FASTA logical record
consists of multiple lines of text. The first line begins with a > symbol
and contains the sequence ID followed by an optional comment. (NOTE: comments
are currently deprecated, because not all tools handle them properly.) The
remaining lines contain the sequence data.
This method uses a trick to smooth its operation: the line terminator character
is temporarily changed to \n> so that a single read operation brings in
the entire logical record.
Open handle of the FASTA file. If not specified, STDIN is assumed.
If we are at the end of the file, returns undef. Otherwise, returns a
three-element list. The first element is the sequence ID, the second is
a pointer to the sequence data (that is, a string reference as opposed to
a string), and the third is the comment.
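The record-separator trick can be sketched as follows. This is a simplified illustration, not the FIG code: the function name is hypothetical, and the example reads from an in-memory filehandle so that it runs on its own.

```perl
use strict;
use warnings;

sub read_fasta_record_sketch {
    my ($fh) = @_;
    $fh = \*STDIN unless defined $fh;
    local $/ = "\n>";                         # one read now spans a whole record
    my $record = <$fh>;
    return unless defined $record;            # end of file
    chomp $record;                            # strips a trailing "\n>" if present
    $record =~ s/^>//;                        # the first record keeps its own ">"
    my ($header, @seqLines) = split /\n/, $record;
    my ($id, $comment) = split /\s+/, $header, 2;
    my $seq = join('', @seqLines);
    $seq =~ s/\s+//g;
    return ($id, \$seq, defined $comment ? $comment : '');
}

my $fasta = ">seq1 first comment\nACGT\nTTGA\n>seq2\nGGCC\n";
open(my $in, '<', \$fasta) or die "Cannot open in-memory file: $!";
my @records;
while (my ($id, $seqP, $comment) = read_fasta_record_sketch($in)) {
    push @records, [$id, $$seqP, $comment];
    print "$id: ", length($$seqP), " residues\n";
}
close $in;
```

Because the separator is restored automatically by local, other code that reads line by line is unaffected once the function returns.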
FIG::display_id_and_seq($id_and_comment, $seqP, $fh);
Display a fasta ID and sequence to the specified open file. This method is designed to work well with read_fasta_record and rev_comp, because it takes as input a string pointer rather than a string. If the file handle is omitted it defaults to STDOUT.
The output is formatted into a FASTA record. The first line of the output is
preceded by a > symbol, and the sequence is split into 60-character
chunks displayed one per line. Thus, this method can be used to produce
FASTA files from data gathered by the rest of the system.
The sequence ID and (optionally) the comment from the sequence's FASTA record. The ID
Reference to a string containing the sequence. The sequence is automatically formatted into 60-character chunks displayed one per line.
Open file handle to which the ID and sequence should be output. If omitted,
\*STDOUT is assumed.
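The 60-character output format can be sketched as below. The function name is hypothetical, and the example writes to an in-memory filehandle so the result can be inspected.

```perl
use strict;
use warnings;

sub display_id_and_seq_sketch {
    my ($id_and_comment, $seqP, $fh) = @_;
    $fh = \*STDOUT unless defined $fh;
    print {$fh} ">$id_and_comment\n";         # FASTA header line
    # Emit the sequence in 60-character chunks, one per line.
    for (my $i = 0; $i < length($$seqP); $i += 60) {
        print {$fh} substr($$seqP, $i, 60), "\n";
    }
}

my $seq = ('A' x 60) . ('C' x 60) . ('G' x 10);
my $out = '';
open(my $outFh, '>', \$out) or die "Cannot open in-memory file: $!";
display_id_and_seq_sketch('seq1 example comment', \$seq, $outFh);
close $outFh;
print $out;
```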
FIG::display_seq(\$seqP, $fh);
or
$fig->display_seq(\$seqP, $fh);
Display a fasta sequence to the specified open file. This method is designed to work well with read_fasta_record and rev_comp, because it takes as input a string pointer rather than a string. If the file handle is omitted it defaults to STDOUT.
The sequence is split into 60-character chunks displayed one per line for readability.
Reference to a string containing the sequence.
Open file handle to which the sequence should be output. If omitted,
STDOUT is assumed.
FIG::flatten_dumper( $perl_ref_or_object_1, ... );
or
$fig->flatten_dumper( $perl_ref_or_object_1, ... );
Takes a list of perl references or objects, and "flattens" their Data::Dumper() output so that it can be printed on a single line.
my $enzymatic_function = $fig->ec_name($ec);
Returns the enzymatic name corresponding to the specified enzyme code.
Code number for the enzyme whose name is desired. The code number is actually
a string of digits and periods (e.g. 1.2.50.6).
Returns the name of the enzyme specified by the indicated code, or a null string if the code is not found in the database.
my @roles = $fig->all_roles;
Return a list of the known roles. Currently, this is a list of the enzyme codes and names.
The return value is a list of list references. Each element of the big list contains an enzyme code (EC) followed by the enzymatic name.
my $expanded_ec = $fig->expand_ec($ec);
Expands "1.1.1.1" to "1.1.1.1 - alcohol dehydrogenase" or something like that.
FIG::clean_tmp();
Delete temporary files more than two days old.
We store temporary files in $FIG_Config::temp. There are specific classes of files that are created and should be saved for at least a few days. This routine can be invoked to clean out those that are over two days old.
my @genome_ids = $fig->genomes($complete, $restrictions, $domain);
Return a list of genome IDs. If called with no parameters, all genome IDs in the database will be returned.
Genomes are assigned ids of the form X.Y where X is the taxonomic id maintained by NCBI for the species (not the specific strain), and Y is a sequence digit assigned to this particular genome (as one of a set with the same genus/species). Genomes also have versions, but that is a separate issue.
TRUE if only complete genomes should be returned, else FALSE.
TRUE if only restriction genomes should be returned, else FALSE.
Name of the domain from which the genomes should be returned. Possible values are
Bacteria, Virus, Eukaryota, unknown, Archaea, and
Environmental Sample. If no domain is specified, all domains will be
eligible.
Returns a list of all the genome IDs with the specified characteristics.
my $info = $fig->genome_info();
Return an array reference of information from the genome table.
This will return an array reference of genome table entries. All entries of the table will be returned. The columns will be the following:
genome, gname, szdna, maindomain, pegs, rnas, complete, taxonomy
my $flag = $fig->is_complete($genome);
Return TRUE if the genome with the specified ID is complete, else FALSE.
ID of the relevant genome.
Returns TRUE if there is a complete genome in the database with the specified ID, else FALSE.
my $flag = $fig->is_genome($genome);
Return TRUE if the specified genome exists, else FALSE.
ID of the genome to test.
Returns TRUE if a genome with the specified ID exists in the data store, else FALSE.
$fig->assert_genomes(gid, gid, ...);
Assert that the given list of genomes does exist, and allow is_genome() to succeed for them.
This is used in FIG-based computations in the context of the RAST genome-import code, so that genomes that currently exist only in RAST are treated as present for the purposes of FIG.pm-based code.
my ($arch, $bact, $euk, $vir, $env, $unk) = $fig->genome_counts($complete);
Count the number of genomes in each domain. If $complete is TRUE, only complete genomes will be included in the counts.
TRUE if only complete genomes are to be counted, FALSE if all genomes are to be counted
A six-element list containing the number of genomes in each of six categories-- Archaea, Bacteria, Eukaryota, Viral, Environmental, and Unknown, respectively.
my $domain = $fig->genome_domain($genome_id);
Find the domain of a genome.
ID of the genome whose domain is desired.
Returns the name of the genome's domain (archaea, bacteria, etc.), or undef if
the genome is not in the database.
my $num_pegs = $fig->genome_pegs($genome_id);
Return the number of protein-encoding genes (PEGs) for a specified genome.
ID of the genome whose PEG count is desired.
Returns the number of PEGs for the specified genome, or undef if the genome
is not indexed in the database.
my $num_rnas = $fig->genome_rnas($genome_id);
Return the number of RNA-encoding genes for a genome, or undef if $genome_id is not indexed in the genome database.
ID of the genome whose RNA count is desired.
Returns the number of RNAs for the specified genome, or undef if the genome
is not indexed in the database.
my $szdna = $fig->genome_szdna($genome_id);
Return the number of DNA base-pairs in a genome's contigs.
ID of the genome whose base-pair count is desired.
Returns the number of base pairs in the specified genome's contigs, or undef
if the genome is not indexed in the database.
my $version = $fig->genome_version($genome_id);
Return the version number of the specified genome.
Versions are incremented for major updates. They are put in as major updates of the form 1.0, 2.0, ...
Users may do local "editing" of the DNA for a genome, but when they do, they increment the digits to the right of the decimal. Two genomes remain comparable only if the versions match identically. Hence, minor updating should be committed only by the person/group responsible for updating that genome.
We can, of course, identify which genes are identical between any two genomes (by matching the DNA or amino acid sequences). However, the basic intent of the system is to support editing by the main group issuing periodic major updates.
ID of the genome whose version is desired.
Returns the version number of the specified genome, or undef if the genome is not in
the data store or no version number has been assigned.
my $md5sum = $fig->genome_md5sum($genome_id);
Returns the MD5 checksum of the specified genome.
The checksum of a genome is defined as the checksum of its signature file. The signature file consists of tab-separated lines, one for each contig, ordered by the contig id. Each line contains the contig ID, the length of the contig in nucleotides, and the MD5 checksum of the nucleotide data, with uppercase letters forced to lower case.
The checksum is indexed in the database. If you know a genome's checksum, you can use the genome_with_md5sum method to find its ID in the database.
ID of the genome whose checksum is desired.
Returns the specified genome's checksum, or undef if the genome is not in the
database.
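The checksum-of-signature idea can be sketched with the core Digest::MD5 module. This follows the description above but is not guaranteed to reproduce FIG's stored values: the function name, the input hash, and the exact line layout (tab-separated fields, one newline per contig) are assumptions.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# $contigs is a hypothetical hash ref: contig ID => nucleotide string.
sub genome_md5sum_sketch {
    my ($contigs) = @_;
    my $signature = '';
    for my $id (sort keys %$contigs) {        # one line per contig, ordered by ID
        my $dna = lc $contigs->{$id};         # upper case forced to lower case
        $signature .= join("\t", $id, length($dna), md5_hex($dna)) . "\n";
    }
    return md5_hex($signature);               # checksum of the signature text
}

my $sum = genome_md5sum_sketch({ contig1 => 'ACGTACGT', contig2 => 'ggcc' });
print "$sum\n";
```

Because the case is normalized before hashing, two copies of a genome that differ only in letter case produce the same checksum.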
my $genome = $fig->genome_with_md5sum($cksum);
Find a genome with the specified checksum.
The MD5 checksum is computed from the content of the genome (see genome_md5sum). This method can be used to determine if a genome already exists for a specified content.
Checksum to use for searching the genome table.
The ID of a genome with the specified checksum, or undef if no such genome exists.
my $cksum = $fig->contig_md5sum($genome, $contig);
Return the MD5 checksum for a contig. The MD5 checksum is computed from the content of the contig. This method retrieves the checksum stored in the database. The checksum can be compared to the checksum of an external contig as a cheap way of seeing if they match.
ID of the genome containing the contig.
ID of the relevant contig.
Returns the checksum of the specified contig, or undef if the contig is not in the
database.
my $cksum = $fig->md5_of_peg( $peg );
Return the MD5 checksum for a peg. The MD5 checksum is computed from the uppercase sequence of the protein. This method retrieves the checksum stored in the database.
FIG ID of the peg.
Returns the checksum of the specified peg as a hex string, or undef if
the peg is not in the database.
my $rep_id = $fig->get_representative_genome($id);
Return the representative genome of the set that $id is in.
ID of the genome used for set lookup.
Returns the representative genome of the set that $id is in, or undef if $id is not found.
sub get_representative_genome {
my($self, $id) = @_;
my $repH;
if (! ($repH = $self->{_repG})) {
my @tab = map { [split(/\t/,$_)] } `cat $FIG_Config::data/Global/genome.sets`;
my $x = shift @tab;
while ($x)
{
my $set = $x->[0];
my $repG = $x->[1];
while ($x && ($x->[0] eq $set))
{
$repH->{$x->[1]} = $repG;
$x = shift @tab;
}
}
$self->{_repG} = $repH;
}
return $repH->{$id};
}
my @pegs = $fig->pegs_with_md5( $md5 );
Return all pegs with sequence matching the checksum. Thus,
my @pegs = $fig->pegs_with_md5( $fig->md5_of_peg( $peg ) );
produces all pegs with sequence identical to the query peg.
The md5 checksum as a hex string (32 characters).
Returns the list of pegs matching the given md5 checksum.
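One way to picture this lookup is an index from checksum to peg IDs. The sketch below builds such an index in memory; the %protein_of table, its sequences, the genome ID 99999.1, and the name pegs_with_md5_sketch are hypothetical stand-ins for the database, not part of FIG.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for the database: peg ID => protein sequence.
my %protein_of = (
    'fig|83333.1.peg.1' => 'MKRISTTITTTITITTGNGAG',
    'fig|83333.1.peg.2' => 'MRVLKFGGTSVANAERFLRV',
    'fig|99999.1.peg.7' => 'mkristtitttitittgngag',    # same protein, lower case
);

# Index pegs by the MD5 of the uppercase sequence, following the
# description of md5_of_peg.
my %pegs_by_md5;
for my $peg (sort keys %protein_of) {
    push @{ $pegs_by_md5{ md5_hex(uc $protein_of{$peg}) } }, $peg;
}

# Return every peg whose sequence has the given checksum.
sub pegs_with_md5_sketch {
    my ($md5) = @_;
    return @{ $pegs_by_md5{$md5} || [] };
}

print join(', ', pegs_with_md5_sketch(md5_hex('MKRISTTITTTITITTGNGAG'))), "\n";
```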
my @fids = $fig->prots_with_md5( $md5 );
Return all proteins with sequence matching the checksum, including non-FIG IDs.
The md5 checksum as a hex string (32 characters).
Returns the list of protein ids matching the given md5 checksum.
my $gs = $fig->genus_species($genome_id);
Return the genus, species, and possibly also the strain of a specified genome.
This method converts a genome ID into a more recognizable species name. The species name is stored directly in the genome table of the database. Essentially, if the strain is present in the database, it will be returned by this method, and if it's not present, it won't.
ID of the genome whose name is desired.
Returns the scientific species name associated with the specified ID, or undef if the
ID is not in the database.
my $gs = $fig->set_genus_species($genome_id, $genus_species_strain);
Sets the contents of the GENOME file of the specified genome ID.
Does not (currently) update the relational DB.
ID of the genome whose name is to be set.
The new biological name that will correspond to the genome_id.
Returns 1 if the write was successful, and undef if write fails.
my $org = $fig->org_of($prot_id);
Return the genus/species name of the organism containing a protein. Note that in this context protein is not a certain string of amino acids but a protein encoding region on a specific contig.
For a FIG protein ID (e.g. fig|134537.1.peg.123), the organism and strain
information is always available. In the case of external proteins, we can usually
determine an organism, but not anything more precise than genus/species (and
often not that). When the organism name is not present, a null string is returned.
Protein or feature ID.
Returns the displayable scientific name (genus, species, and strain) of the organism containing
the identified PEG. If the name is not available, returns a null string. If the PEG is not found,
returns undef.
my $genomeID = $fig->orgid_of_orgname($genomeName);
Return the ID of the genome corresponding to the specified organism name, or a null string if the genome is not found.
Name of the organism, consisting of the organism's genus, species, and unique characterization, separated by spaces.
Returns the genome ID number for the named organism, or an empty string if the genome is not found.
my $genomeName = $fig->orgname_of_orgid($genomeID);
Return the name of the genome corresponding to the specified organism ID.
ID of the relevant genome.
Returns the name of the organism, consisting of the organism's genus, species, and unique characterization, separated by spaces, or a null string if the genome is not found.
my ($gs, $domain) = $fig->genus_species_domain($genome_id);
Returns a genome's genus and species (and strain if that has been properly recorded) in a printable form, along with its domain. This method is similar to genus_species, except it also returns the domain name (archaea, bacteria, etc.).
ID of the genome whose species and domain information is desired.
Returns a two-element list. The first element is the species name and the second is the domain name.
my $web_color = FIG::domain_color($domain);
Return the web color string associated with a specified domain. The colors are extremely subtle (86% luminance), so they absolutely require a black background. Archaea are slightly cyan, bacteria are slightly magenta, eukaryota are slightly yellow, viruses are slightly silver, environmental samples are slightly gray, and unknown or invalid domains are pure white.
Name of the domain whose color is desired.
Returns a web color string for the specified domain (e.g. #FFDDFF for
bacteria).
my ($org, $color) = $fig->org_and_domain_of($prot_id);
Return the best-guess organism name and the domain HTML color string for the organism containing a protein. In the case of external proteins, we can usually determine an organism, but not anything more precise than genus/species (and often not that).
Relevant protein or feature ID.
Returns a two-element list. The first element is the displayable organism name, and the second is an HTML color string based on the domain (see domain_color).
Return a list of the IDs of genomes whose names match a partial genus.
For example, partial_genus_matching("Listeria") returns the IDs of all genomes whose names begin with Listeria. A second argument restricts the result to complete genomes, as in partial_genus_matching("Listeria", 1).
my $abbreviated_name = FIG::abbrev($genome_name);
or
my $abbreviated_name = $fig->abbrev($genome_name);
Abbreviate a genome name to 10 characters or less.
For alignments and such, it is very useful to be able to produce an abbreviation of genus/species. That's what this does. Note that multiple genus/species might reduce to the same abbreviation, so be careful (disambiguate them, if you must).
The abbreviation is formed from the first three letters of the genus name followed by the first three letters of the species name and then the next four nonblank characters.
The name to abbreviate.
An abbreviated version of the specified name.
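The abbreviation rule can be sketched as follows, reading it as a three-letter genus prefix, a three-letter species prefix, and then whatever nonblank characters fit in the remaining four positions. The function name abbrev_sketch is hypothetical, and the real FIG::abbrev may punctuate its output differently; this only illustrates the ten-character scheme.

```perl
use strict;
use warnings;

# Sketch: genus prefix + species prefix + following nonblank characters,
# truncated to at most 10 characters.
sub abbrev_sketch {
    my ($name) = @_;
    my ($genus, $species, @rest) = split ' ', $name;
    my $abbr = substr($genus   // '', 0, 3)
             . substr($species // '', 0, 3)
             . join('', @rest);
    return substr($abbr, 0, 10);
}

print abbrev_sketch('Escherichia coli K12'), "\n";    # prints "EsccolK12"
```

Since only three letters of genus and species survive, distinct organisms can collapse to the same abbreviation, which is why the caller is asked to disambiguate.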
my $wikipedia_link = $fig->wikipedia_link($genome_name);
Check if Wikipedia has a page about this genome. If so, return its URL.
The genome to find.
The URL of the Wikipedia page.
my $organism_directory = $fig->organism_directory($genome_id);
Get the directory that contains the organism data. This is just like the FIGV version.
The id of the organism, e.g. 83333.1.
A string containing the path to the organism directory.
my $name = ncbi_contig_description($contig_id);
Looks up the NCBI description line for this contig identifier. Values are cached in the directory $FIG_Config::var/ncbi_contigs.
my $type = FIG::ftype($fid);
or
my $type = $fig->ftype($fid);
Returns the type of a feature, given the feature ID. This just amounts to lifting it out of the feature ID, since features have IDs of the form
fig|x.y.f.n
where x.y is the genome ID, f is the type of feature, and n is an integer that is unique within the genome/type.
FIG ID of the feature whose type is desired.
Returns the feature type (e.g. peg, rna, pi, or pp), or undef if the
feature ID is not a FIG ID.
my $genome_id = $fig->genome_of($fid);
or
my $genome_id = FIG::genome_of($fid);
Return the genome ID from a feature ID.
ID of the feature whose genome ID is desired.
If the feature ID is a FIG ID, returns the genome ID embedded inside it; otherwise, it
returns undef.
my ($genome_id, $peg_number) = FIG::genome_and_peg_of($fid);
or
my ($genome_id, $peg_number) = $fig->genome_and_peg_of($fid);
Return the genome ID and peg number from a feature ID.
ID of the feature whose genome and PEG number are desired.
Returns the genome ID and peg number associated with a feature if the feature
is represented by a FIG ID, else undef.
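Because the genome ID, feature type, and feature number are all embedded in the fig|x.y.f.n form, ftype, genome_of, and genome_and_peg_of amount to pattern matching on the ID. The following is a minimal sketch of that idea; parse_fig_id is a hypothetical name, not a FIG function.

```perl
use strict;
use warnings;

# Split a FIG feature ID of the form fig|x.y.f.n into its parts.
# Returns an empty list when the ID does not have that shape.
sub parse_fig_id {
    my ($fid) = @_;
    return $fid =~ /^fig\|(\d+\.\d+)\.([^.]+)\.(\d+)$/ ? ($1, $2, $3) : ();
}

my ($genome, $type, $number) = parse_fig_id('fig|83333.1.peg.123');
print "$genome $type $number\n";    # prints "83333.1 peg 123"
```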
my @sorted_by_fig_id = sort { FIG::by_fig_id($a,$b) } @fig_ids;
Compare two feature IDs.
This function is designed to assist in sorting features by ID. The sort is by genome ID followed by feature type and then feature number.
First feature ID.
Second feature ID.
Returns a negative number if the first parameter is smaller, zero if both parameters are equal, and a positive number if the first parameter is greater.
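A comparator with the stated ordering might look like the sketch below. It is an assumption-laden illustration rather than the FIG code: by_fig_id_sketch is a hypothetical name, the two parts of the genome ID are compared numerically, and IDs that are not FIG IDs fall back to a string comparison.

```perl
use strict;
use warnings;

# Order by genome ID (numerically, part by part), then feature type,
# then feature number.
sub by_fig_id_sketch {
    my ($x, $y) = @_;
    my ($gx1, $gx2, $tx, $nx) = $x =~ /^fig\|(\d+)\.(\d+)\.([^.]+)\.(\d+)$/;
    my ($gy1, $gy2, $ty, $ny) = $y =~ /^fig\|(\d+)\.(\d+)\.([^.]+)\.(\d+)$/;
    # Assumed fallback for IDs that are not in FIG form.
    return $x cmp $y unless defined $nx && defined $ny;
    return $gx1 <=> $gy1 || $gx2 <=> $gy2 || $tx cmp $ty || $nx <=> $ny;
}

my @ids = ('fig|83333.1.peg.10', 'fig|83333.1.rna.2', 'fig|83333.1.peg.9');
print join("\n", sort { by_fig_id_sketch($a, $b) } @ids), "\n";
```

Comparing the feature number numerically is what places peg.9 before peg.10; a plain string sort would reverse them.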
my @sorted_by_location = sort { FIG::by_locus($a,$b) } @locations;
Compare two locations.
This function is designed to assist in sorting features by location. The sort is by contig ID, followed by left boundary, then by right boundary, then by strand.
First location.
Second location.
Returns a negative number if the first location is to the left of the second, zero if both locations are identical, and a positive number if the first location is to the right of the second.
my @sorted_by_genome_id = sort { FIG::by_genome_id($a,$b) } @genome_ids;
Compare two genome IDs.
This function is designed to assist in sorting genomes by ID.
First genome ID.
Second genome ID.
Returns a negative number if the first parameter is smaller, zero if both parameters are equal, and a positive number if the first parameter is greater.
my $feature = $fig->next_feature( \%options );
Locate the next feature (optionally filtered by type) in a contig. The start position for the search can be defined by supplying genome, contig and position, or by supplying a feature id. Feature locations are defined by their midpoint. If a fid is supplied with contig and position, the latter are used to resolve ambiguities in the desired segment of a feature with a complex location.
Options:
after => $fid after => \@fids
Id(s) of features that should precede the returned feature. This is a local operation, and is only meant to resolve features that are otherwise tied in location.
contig => $contig
Name of contig of features.
exclude => $id exclude => \@ids
Id(s) of features to exclude. Note that features listed with the 'after' option are also excluded (and that is most commonly the desired behavior).
fid => $fid
Alternative to supplying a location. It is possible to supply a fid and contig and position, which allows disambiguating the desired segment of a feature with a complex location.
genome => $genome
Name of genome of features.
position => $position
Feature midpoint must be >= $position. Note that this can be any multiple of 1/2. If the supplied value is negative, the position is taken from the right end of the contig.
type => $type type => \@types
Type(s) of desired feature (default is any type).
Feature id or undef.
my $feature = $fig->previous_feature( \%options );
Locate the previous feature (optionally filtered by type) in a contig. The start position for the search can be defined by supplying genome, contig and position, or by supplying a feature id. Feature locations are defined by their midpoint. If a fid is supplied with contig and position, the latter are used to resolve ambiguities in the desired segment of a feature with a complex location.
Options:
before => $fid before => \@fids
Id(s) of features that should follow the returned feature. This is a local operation, and is only meant to resolve features that are otherwise tied in location.
contig => $contig
Name of contig of features.
exclude => $id exclude => \@ids
Id(s) of features to exclude. Note that features listed with the 'before' option are also excluded (and that is most commonly the desired behavior).
fid => $fid
Alternative to supplying a location. It is possible to supply a fid and contig and position, which allows disambiguating the desired segment of a feature with a complex location.
genome => $genome
Name of genome of features.
position => $position
Feature midpoint must be <= $position. Note that this can be any multiple of 1/2. If the supplied value is negative, the position is taken from the right end of the contig.
type => $type type => \@types
Type(s) of desired feature (default is any type).
Feature id or undef.
my ($features_in_region, $beg1, $end1) = $fig->genes_in_region($genome, $contig, $beg, $end, $size_limit);
Locate features that overlap a specified region of a contig. This includes features that begin or end outside that region, just so long as some part of the feature can be found in the region of interest.
It is often important to be able to find the genes that occur in a specific region on a chromosome. This routine is designed to provide this information. It returns all genes that overlap positions from $beg through $end in the specified contig.
The $size_limit parameter limits the search process. It is presumed that no features are longer than the
specified size limit. A shorter size limit means you'll miss some features; a longer size limit significantly
slows the search process. For prokaryotes, a value of 10000 (the default) seems to work best.
ID of the genome containing the relevant contig.
ID of the relevant contig.
Position of the first base pair in the region of interest.
Position of the last base pair in the region of interest.
Maximum allowable size for a feature. If omitted, 10000 is assumed.
Returns a three-element list. The first element is a reference to a list of the feature IDs found. The second element is the position of the leftmost base pair in any feature found. This may be well before the region of interest begins or it could be somewhere inside. The third element is the position of the rightmost base pair in any feature found. Again, this can be somewhere inside the region or it could be well to the right of it.
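The trade-off behind the size limit is easier to see in code. The sketch below assumes, purely for illustration, that the search only examines features whose left end lies within $size_limit base pairs before the region; the real query may be organized differently. The feature table format and the name genes_in_region_sketch are hypothetical.

```perl
use strict;
use warnings;

# $features is a reference to a list of [id, left, right] triples.
sub genes_in_region_sketch {
    my ($features, $beg, $end, $size_limit) = @_;
    $size_limit = 10000 unless defined $size_limit;
    my (@found, $min, $max);
    for my $feature (@$features) {
        my ($id, $left, $right) = @$feature;
        # Assumed search window: candidates must start no more than
        # $size_limit before the region and no later than its end.
        next if $left < $beg - $size_limit || $left > $end;
        next if $right < $beg;                 # ends before the region starts
        push @found, $id;
        $min = $left  if !defined $min || $left  < $min;
        $max = $right if !defined $max || $right > $max;
    }
    return (\@found, $min, $max);
}
```

Under that assumption, a feature running from position 1 to 30000 is reported for the region 1000 to 2200, but missed for the region 20000 to 21000 with the default limit, because its left end lies more than 10000 base pairs before the region.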
my ( [ $contig, $beg, $end ], ... ) = $fig->regions_spanned( $loc );
or
my ( [ $contig, $beg, $end ], ... ) = FIG::regions_spanned( $loc );
The location of a feature in a scalar context is
contig_b1_e1, contig_b2_e2, ... [one contig_b_e for each segment]
This routine takes as input a scalar location in the above form and reduces it to one or more regions spanned by the gene. This involves combining regions in the location string that are on the same contig and going in the same direction. Unlike boundaries_of, which returns one region in which the entire gene can be found, regions_spanned handles wrapping through the origin, features split over contigs and exons that are not ordered nicely along the chromosome (ugly but true).
The location string for a feature.
Returns a list of list references. Each inner list contains a contig ID, a starting position, and an ending position. The starting position may be numerically greater than the ending position (which indicates a backward-traveling gene). It is guaranteed that the entire feature is covered by the regions in the list.
my @regions = FIG::filter_regions( $contig, $min, $max, @regions );
or
my $regions = FIG::filter_regions( $contig, $min, $max, @regions );
or
my @regions = FIG::filter_regions( $contig, $min, $max, \@regions );
or
my $regions = FIG::filter_regions( $contig, $min, $max, \@regions );
Filter a list of regions to those that overlap a specified section of a
particular contig. Region definitions correspond to those produced
by regions_spanned. That is, [contig,beg,end].
In the function call, either $contig or $min and $max can be
undefined (permitting anything). So, for example,
my @regions = FIG::filter_regions(undef, 1, 5000, $regionList);
would return all regions in $regionList that overlap the first
5000 base pairs in any contig. Conversely,
my @regions = FIG::filter_regions('NC_003904', undef, undef, $regionList);
would return all regions on the contig NC_003904.
ID of the contig whose regions are to be passed by the filter, or undef
if the contig doesn't matter.
Leftmost position of the region used for filtering. Only regions which contain
at least one base pair at or beyond this position will be passed. A value
of undef is equivalent to zero.
Rightmost position of the region used for filtering. Only regions which contain
at least one base pair on or before this position will be passed. A value
of undef is equivalent to the length of the contig.
A list of regions, or a reference to a list of regions. Each region is a reference to a three-element list, the first element of which is a contig ID, the second element of which is the start position, and the third element of which is the ending position. (The ending position can be before the starting position if the region is backward-traveling.)
In a scalar context, returns a reference to a list of the filtered regions. In a list context, returns the list itself.
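The calling conventions above (list or list reference in, list or list reference out) can be sketched with wantarray. This is a standalone illustration under the stated parameter rules, not the FIG source; filter_regions_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Regions are [contig, beg, end]; beg may exceed end for
# backward-traveling regions.
sub filter_regions_sketch {
    my ($contig, $min, $max, @regions) = @_;
    # Accept a reference to a list of regions as well as the list itself.
    # A region's first element is a plain string, so an empty array or a
    # reference in that slot means a list reference was passed.
    if (@regions == 1 && (!@{ $regions[0] } || ref($regions[0][0]))) {
        @regions = @{ $regions[0] };
    }
    my @kept = grep {
        my ($c, $beg, $end) = @$_;
        my ($left, $right) = $beg <= $end ? ($beg, $end) : ($end, $beg);
           (!defined $contig || $c eq $contig)
        && (!defined $min    || $right >= $min)
        && (!defined $max    || $left  <= $max);
    } @regions;
    return wantarray ? @kept : \@kept;
}
```

Normalizing each region to a left and right coordinate first means backward-traveling regions are filtered by the same two comparisons as forward ones.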
my @features = $fig->close_genes($fid, $dist);
Return all features within a certain distance of a specified other feature.
This method is a quick way to get genes that are near another gene. It calls boundaries_of to get the boundaries of the incoming gene, then passes the region computed to genes_in_region.
So, for example, if the specified $dist is 500, the method would select
a region that extends 500 base pairs to either side of the boundaries for
the gene $fid, and pass it to genes_in_region for analysis. The
features returned would be those that overlap the selected region. Note
that the flaws inherent in genes_in_region are also inherent in this
method: if a feature is more than 10000 base pairs long, it may not
be caught even though it has an overlap in the specified region.
ID of the relevant feature.
Desired maximum distance.
Returns a list of feature IDs for genes that overlap or are close to the boundaries for the specified incoming feature.
my ($left_fid, $right_fid) = $fig->adjacent_genes($fid, $dist);
Return the IDs of the genes immediately to the left and right of a specified feature.
This method gets a list of the features within the specified distance of the incoming feature (using close_genes), and then chooses the two closest to the feature found. If the incoming feature is on the + strand, these are features to the left and the right. If the incoming feature is on the - strand, the features will be returned in reverse order.
ID of the feature whose neighbors are desired.
Maximum permissible distance to the neighbors.
Returns a two-element list containing the IDs of the features on either side of the incoming feature.
Compute a rough estimate of "similarity" between genomes using the following algorithm:
1. You need at least five "genes" from each genome (let's work with incomplete as well as complete). You get these by
a. Taking up to 5 of the "universal genes"
b. supplemented by genes (starting from 1) that are greater than 300 aa
2. For each gene from the set consider the set of similarities for it.
For each match that covers over 200 aa of the gene,
if the % identity > 70, count a "too-similar{Genome2}"
else count a "not-too-similar{Genome2}"
For each Genome2, if the "too-similar{Genome2}" count > "not-too-similar{Genome2}" count,
the Genome1-Genome2 matches are too similar.
else, they are not
Used for filtering candidate PCHs in remove_clustered_pchs2.pl.
Hash where the keys are the annotations for the universal proteins to be used in the similarity computation.
Minimum length of similarity match required to be considered for genome similarity.
Number of genes to consider for the computation.
List of lists of the form [genome2, is-similar, count of too-similar hits, count of not-too-similar hits]
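Step 2 of the algorithm is essentially a per-genome vote. The sketch below shows only that voting step, assuming the similarity hits for the sampled genes have already been gathered; the input format, the name classify_genomes, and the genome IDs in the example are hypothetical.

```perl
use strict;
use warnings;

# $hits is a reference to a list of
# [genome2, matched_length_aa, percent_identity] triples.
sub classify_genomes {
    my ($hits, $min_len, $max_ident) = @_;
    my (%too, %not_too);
    for my $hit (@$hits) {
        my ($genome2, $len, $ident) = @$hit;
        next unless $len > $min_len;    # match must cover enough of the gene
        if ($ident > $max_ident) { $too{$genome2}++ }
        else                     { $not_too{$genome2}++ }
    }
    my %seen = (%too, %not_too);
    # One [genome2, is-similar, too-similar count, not-too-similar count]
    # entry per genome that had at least one qualifying hit.
    return map {
        my ($t, $n) = ($too{$_} || 0, $not_too{$_} || 0);
        [ $_, ($t > $n ? 1 : 0), $t, $n ];
    } sort keys %seen;
}

my @verdicts = classify_genomes(
    [ ['111.1', 250, 90], ['111.1', 300, 80], ['111.1', 260, 50], ['222.1', 250, 40] ],
    200, 70,
);
print join("\t", @$_), "\n" for @verdicts;
```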
my $loc = $fig->feature_location($fid);
or
my @loc = $fig->feature_location($fid);
Return the location of a feature. The location consists
of a list of (contigID, begin, end) triples encoded
as strings with an underscore delimiter. So, for example,
NC_002755_100_199 indicates a location starting at position
100 and extending through 199 on the contig NC_002755. If
the location goes backward, the start location will be higher
than the end location (e.g. NC_002755_199_100).
In a scalar context, this method returns the locations as a comma-delimited string
NC_002755_100_199,NC_002755_210_498
In a list context, the locations are returned as a list
(NC_002755_100_199, NC_002755_210_498)
ID of the feature whose location is desired.
Returns the locations of a feature, either as a comma-delimited string or a list.
my $contigID = $fig->contig_of($location);
Return the ID of the contig containing a location.
This method only works with SEED-style locations (contigID_beg_end).
For more comprehensive location parsing, use the Location object.
A SEED-style location (contigID_beg_end), or a comma-delimited list
of SEED-style locations. In the latter case, only the first location in the list will
be processed.
Returns the contig ID from the first location in the incoming string.
my $beg = $fig->beg_of($location);
Return the beginning point of a location.
This method only works with SEED-style locations (contigID_beg_end).
For more comprehensive location parsing, use the Location object.
A SEED-style location (contigID_beg_end), or a comma-delimited list
of SEED-style locations. In the latter case, only the first location in the list will
be processed.
Returns the beginning point from the first location in the incoming string.
my $end = $fig->end_of($location);
Return the ending point of a location.
This method only works with SEED-style locations (contigID_beg_end).
For more comprehensive location parsing, use the Location object.
A SEED-style location (contigID_beg_end), or a comma-delimited list
of SEED-style locations. In the latter case, only the first location in the list will
be processed.
Returns the ending point from the first location in the incoming string.
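Contig IDs may themselves contain underscores (NC_002755, for instance), so a SEED-style location cannot simply be split on underscores. A minimal sketch of the parsing shared by contig_of, beg_of, and end_of, using the hypothetical helper parse_first_location:

```perl
use strict;
use warnings;

# Parse the first SEED-style location in a comma-delimited string.
# The greedy (.+) leaves only the last two numeric fields for the
# begin and end points, so underscores in the contig ID are preserved.
sub parse_first_location {
    my ($location) = @_;
    my ($first) = split /,/, $location;
    return () unless defined $first;
    return $first =~ /^(.+)_(\d+)_(\d+)$/ ? ($1, $2, $3) : ();
}

my ($contig, $beg, $end) = parse_first_location('NC_002755_199_100,NC_002755_90_10');
print "$contig $beg $end\n";    # prints "NC_002755 199 100"
```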
my $dna = $fig->upstream_of($peg, $upstream, $coding);
Return the DNA immediately upstream of a feature. This method contains code lifted from
the upstream.pl script.
ID of the feature whose upstream DNA is desired.
Number of base pairs considered upstream.
Number of base pairs inside the feature to be included in the upstream region.
Returns the DNA sequence upstream of the feature's begin point and extending into the coding region. Letters inside a feature are in upper case and inter-genic letters are in lower case. A hyphen separates the true upstream letters from the coding region.
my $strand = $fig->strand_of($location);
Return the strand (+ or -) of a location.
This method only works with SEED-style locations (contigID_beg_end).
For more comprehensive location parsing, use the Location object.
A comma-delimited list of SEED-style locations (contigID_beg_end).
Returns + if the list describes a forward-oriented location, and - if the list
describes a backward-oriented location.
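Since a backward location has a begin point higher than its end point, the strand follows from a single comparison. The sketch below assumes the orientation is read from the first location in the list; strand_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Return '+' or '-' for a SEED-style location list, or undef if the
# first location cannot be parsed.
sub strand_sketch {
    my ($location) = @_;
    my ($beg, $end) = $location =~ /^[^,]+_(\d+)_(\d+)(?:,|$)/ or return undef;
    return $beg <= $end ? '+' : '-';
}

print strand_sketch('NC_002755_199_100'), "\n";    # prints "-"
```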
my $contigID = $fig->find_contig_with_checksum($genome, $checksum);
Find a contig in the given genome with the given checksum.
This method is useful for determining if a particular contig has already been recorded for the given genome. The checksum is computed from the contig contents, so a matching checksum indicates that the contigs may have the same content.
ID of the genome whose contigs are to be examined.
Checksum value for the desired contig.
Returns the ID of a contig in the given genome that has the caller-specified checksum,
or undef if no such contig exists.
my $checksum = $fig->contig_checksum($genome, $contig);
or
my @checksum = $fig->contig_checksum($genome, $contig);
Return the checksum of the specified contig. The checksum is computed from the contig's content in a parallel process. The process returns a space-delimited list of numbers. The numbers can be split into a real list if the method is invoked in a list context. For b
Read a single contig from the contigs file.
usage: ($contig,$beg,$end) = $fig->boundaries_of($loc)
The location of a feature in a scalar context is
contig_b1_e1,contig_b2_e2,... [one contig_b_e for each exon]
This routine takes as input such a location and reduces it to a single description of the entire region containing the gene.
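As a rough sketch of this reduction, the example below takes the begin point of the first segment and the end point of the last, and gives up when the two lie on different contigs. That reading is an assumption made for illustration, and boundaries_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Reduce contig_b1_e1,contig_b2_e2,... to a single (contig, beg, end).
sub boundaries_sketch {
    my ($loc) = @_;
    my @segments;
    for my $part (split /,/, $loc) {
        push @segments, [$1, $2, $3] if $part =~ /^(.+)_(\d+)_(\d+)$/;
    }
    return () unless @segments;
    my ($first, $last) = @segments[0, -1];
    return () if $first->[0] ne $last->[0];    # spans more than one contig
    return ($first->[0], $first->[1], $last->[2]);
}

print join('_', boundaries_sketch('NC_002755_100_199,NC_002755_210_498')), "\n";
```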
\@regions = $fig->boundaries_of_2( $location );
@regions = $fig->boundaries_of_2( $location );
Locations can be a list of intervals (contig_beg_end), but the intervals
need not be on a single contig, contiguous, or in a consistent orientation
(e.g., a feature that wraps from the end to the beginning of a genome, or a
trans-spliced protein). This function defines a region of a gene as a sequence
of parts of the location that are on the same contig, in the same orientation,
and with end points that progress along the contig in the same direction as
the individual parts. This function is a generalization of boundaries_of().
The latter function returns undef if the first and last contigs are not
the same, and returns a location spanning nearly the entire contig if the
location spans the origin.
contig1_beg1_end1,contig2_beg2_end2,...
( [contig, beg, end], [contig, beg, end], ... )
where consecutive location intervals with a common contig, direction, and
consistent direction of progression along the contig are merged. The vast
majority of genes will be reduced to a single region, which is that
returned by boundaries_of(). That is, most of the time:
boundaries_of( $loc )
is the same as
@{ boundaries_of_2( $loc )->[0] || [] }
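The merging rule can be sketched as a single pass over the intervals. This is an illustration of the rule as described, not the FIG source; merge_regions_sketch is a hypothetical name and the contig name c in the comments is a placeholder.

```perl
use strict;
use warnings;

# Merge consecutive intervals that share a contig and a direction and
# that keep progressing along the contig in that direction.
sub merge_regions_sketch {
    my ($loc) = @_;
    my @regions;
    for my $part (split /,/, $loc) {
        my ($contig, $beg, $end) = $part =~ /^(.+)_(\d+)_(\d+)$/ or next;
        my $dir  = $beg <= $end ? 1 : -1;
        my $prev = $regions[-1];
        if (   $prev
            && $prev->[0] eq $contig
            && ($prev->[1] <= $prev->[2] ? 1 : -1) == $dir
            && ($beg - $prev->[2]) * $dir > 0)    # still moving the same way
        {
            $prev->[2] = $end;                    # extend the current region
        }
        else {
            push @regions, [$contig, $beg, $end];
        }
    }
    return wantarray ? @regions : \@regions;
}
```

With this rule, c_9900_10000,c_1_200 (a feature wrapping through the origin) stays as two regions, while c_500_400,c_390_300 collapses to the single backward region [c, 500, 300].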
my $featureList = $fig->all_features_detailed($genome);
Returns a list of all features in the designated genome, with their location, alias, and type information included. This is used in the GenDB import and Sprout load to speed up the process.
Deleted features are not returned!
ID of the genome whose features are desired.
Returns a reference to a list of tuples. Each tuple consists of four elements: (1) the feature ID, (2) the feature location (as a comma-delimited list of location specifiers), (3) the feature aliases (as a comma-delimited list of named aliases), and (4) the feature type.
my $featureList = $fig->all_features_detailed($genome, $min, $max, $contig);
Returns a list of all features in the designated genome, with various useful information included.
Deleted features are not returned!
ID of the genome whose features are desired.
If specified, the minimum contig location of interest. Features not entirely to the right of this location are ignored.
If specified, the maximum contig location of interest. Features not entirely to the left of this location are ignored.
If specified, the contig of interest. Features not on this contig are ignored.
Returns a reference to a list of tuples. Each tuple consists of nine elements: (1) the feature ID, (2) the feature location (as a comma-delimited list of location specifiers), (3) the feature aliases (as a comma-delimited list of named aliases), (4) the feature type, (5) the leftmost index of the feature's first location, (6) the rightmost index of the feature's last location, (7) the current functional assignment, (8) the user who made the assignment, and (9) the quality of the assignment (which is usually an empty string).
my @fidList = $fig->all_features($genome,$type);
Returns a list of all feature IDs of a specified type in the designated genome. You would usually use just
$fig->pegs_of($genome) or
$fig->rnas_of($genome)
which simply invoke this routine.
ID of the genome whose features are desired.
Type of feature desired (peg, rna, etc.). If omitted, all features are returned.
Returns a list of the IDs for the desired features.
usage: $fig->pegs_of($genome)
Returns a list of all PEGs in the specified genome. Note that order is not specified.
usage: $fig->rnas_of($genome)
Returns a list of all RNAs for the given genome.
usage: @aliases = $fig->feature_aliases($fid) OR
$aliases = $fig->feature_aliases($fid)
Returns a list of aliases (gene IDs, arbitrary numbers assigned by authors, etc.) for the feature. These must come from the tbl files, so add them there if you want to see them here.
In a scalar context, the aliases come back with commas separating them.
my @aliases = $fig->uniprot_aliases($fid)
OR
my $aliases = $fig->uniprot_aliases($fid)
Return the uniprot aliases (SwissProt, TREMBL and UniProt) for a PEG.
The aliases returned may be from a different organism than the organism of the input feature $fid.
A call to get_corresponding_ids is done first and will return the same-sequence same-genome ids. If none are found, mapped_prot_ids is called which will give the same-sequence ids.
If you need to know which form of alias is being returned, call these methods directly.
Only one id is returned for every accession found. Example 1: If both uni|Q8FLC2 and uni|Q8FLC2_ECOL6 are found in the database, only uni|Q8FLC2 will be returned. Example 2: If sp|P75616 and uni|P75616 are found in the database, only sp|P75616 will be returned. The order of preference here is sp before tr before uni.
Feature ID of the PEG whose aliases are desired.
Depending on the context of the call, either a list of aliases (sp, tr and uni) is returned, or a comma-separated string. If no aliases are found, the empty list or string will be returned.
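The one-ID-per-accession rule can be sketched as a ranking problem. The example assumes that the accession is the text after the prefix up to an optional underscore suffix, so that uni|Q8FLC2_ECOL6 shares the accession Q8FLC2; that parsing rule and the name dedupe_uniprot_aliases are assumptions, not taken from the FIG source.

```perl
use strict;
use warnings;

# Preference order: sp before tr before uni.
my %rank = (sp => 0, tr => 1, uni => 2);

# Keep the most preferred alias for each accession.
sub dedupe_uniprot_aliases {
    my @aliases = @_;
    my %best;
    for my $alias (@aliases) {
        my ($prefix, $acc, $suffix) = $alias =~ /^(sp|tr|uni)\|([^_|]+)(_\S+)?$/ or next;
        # Lower score wins; the bare accession beats the suffixed form.
        my $score = $rank{$prefix} * 2 + (defined $suffix ? 1 : 0);
        $best{$acc} = [$score, $alias]
            if !$best{$acc} || $score < $best{$acc}[0];
    }
    return map { $best{$_}[1] } sort keys %best;
}

print join(', ', dedupe_uniprot_aliases(
    'uni|Q8FLC2_ECOL6', 'uni|Q8FLC2', 'uni|P75616', 'sp|P75616')), "\n";
```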
my $hash = $fig->uniprot_aliases_bulk(\@fids, $no_del_check);
Return a hash mapping the specified feature IDs to lists of their uniprot aliases.
A list of FIG feature IDs.
If TRUE, deleted feature IDs will not be removed from the feature ID list before processing. The default is FALSE, which means deleted feature IDs will be removed before processing.
Returns a hash mapping each feature ID to a list of its uniprot aliases.
Convert an alias to a db_xref. This uses the BRC format db_xref, which is a conglomeration of NCBI, GO, and BioMoby.
This method will return a correctly formatted db_xref if the argument is one of our currently recognized formats, otherwise it returns undef.
This example code should provide the functions you want
foreach my $alias ($fig->feature_aliases($peg)) {
    if (my $dbxref = $fig->rewrite_db_xrefs_brc($alias)) {
        print "The dbxref is $dbxref\n";
    } else {
        print "The alias is $alias\n";
    }
}
For a list of approved dbxrefs, see http://www.brc-central.org/cgi-bin/brc-central/dbxref_list.cgi
usage: $peg = $fig->by_alias($alias)
Returns a FIG id if the alias can be converted. Right now we convert aliases of the form NP_* (RefSeq IDs), gi|* (GenBank IDs), sp|* (Swiss Prot), uni|* (UniProt), kegg|* (KEGG) and maybe a few more
usage: $peg = $fig->by_raw_alias($alias)
Returns all FIG ids having the given alias. Unlike by_alias, we do not attempt any kind of normalization. I'm not sure this function is needed, but by_alias searches only in ext_alias table whereas here I'm searching in the features table. ext_alias does not have all external aliases which is keeping my code from working. In particular, it lacks EnsemblGene. It would be nice to combine these two functions. -Ed
sub by_raw_alias {
my($self,$alias) = @_;
my($rdbH,$relational_db_response);
my ($peg);
$rdbH = $self->db_handle;
if (($relational_db_response = $rdbH->SQL("SELECT id FROM features WHERE aliases LIKE \'%,$alias,%\'")) && (@$relational_db_response > 0)) {
if (@$relational_db_response == 1) {
$peg = $relational_db_response->[0]->[0];
return wantarray() ? ($peg) : $peg;
} elsif (wantarray()) {
return map { $_->[0] } @$relational_db_response;
}
}
return wantarray() ? () : "";
}
sub to_alias {
my($self,$fid,$type) = @_;
my @aliases = $self->feature_aliases($fid);
if ($type)
{
@aliases = grep { $_ =~ /^$type\|/ } @aliases;
}
if (wantarray())
{
return @aliases;
}
elsif (@aliases > 0)
{
return $aliases[0];
}
else
{
return "";
}
}
usage: $fig->possibly_truncated($feature_id) or $fig->possibly_truncated($genome, $loc)
Returns the empty string if the feature or location is not near either end of a contig.
Returns 'stop' if the feature or location is on the 'plus' strand and near the end of a contig, or is on the 'minus' strand and near the beginning of the contig.
Returns 'start' if the feature or location is on the 'plus' strand and near the beginning of a contig, or is on the 'minus' strand and near the end of the contig.
Possibly truncated STOPs have return priority over possibly truncated STARTs.
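The four cases and the priority rule can be summarized in a short sketch. The text does not say how close to a contig end counts as "near", so the $near parameter below is an assumed threshold, and truncation_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Classify a location (beg, end) on a contig of length $contig_len.
# Returns 'stop', 'start', or the empty string.
sub truncation_sketch {
    my ($beg, $end, $contig_len, $near) = @_;
    my $plus = $beg <= $end;
    my ($left, $right) = $plus ? ($beg, $end) : ($end, $beg);
    my $near_beginning = $left <= $near;
    my $near_end       = $contig_len - $right < $near;
    # Check for a truncated stop first, since stops take priority.
    return 'stop'  if ($plus && $near_end)       || (!$plus && $near_beginning);
    return 'start' if ($plus && $near_beginning) || (!$plus && $near_end);
    return '';
}

print truncation_sketch(9100, 9950, 10000, 300), "\n";    # prints "stop"
```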
my $fs = $fig->possible_frameshift($peg);
Returns a pointer to a list of the form [ContigName,BegOfRegionContaining,EndOfContainingRegion,DNAofContaining,TemplatePEGid] if a possible frameshift is found,
or boolean FALSE otherwise.
Merge two HSPs unless their overlap or separation is too large.
RETURNS: Merged boundaries if merger succeeds, and undef if merger fails.
my ($gnum, $pnum) = $fig->map_peg_to_ids($peg);
Map a peg ID to a pair of numbers describing that peg.
In order to conserve storage and increase performance for some operations (the functional coupling computation, for instance), we provide a mechanism by which a full peg (of the form fig|X.Y.peg.Z) is mapped to a pair of integers: a genome number and a PEG index. We maintain a table genome_mapping that retains the mapping between genome ID and local genome number. No effort is expended to ensure this mapping is at all coherent between SEED instances; this is purely a local mechanism for performance enhancement.
ID of the peg to be mapped.
A pair of numbers ($gnum, $pnum)
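The mapping can be pictured with a minimal sketch. This is not the actual implementation: the hash %genome_mapping stands in for the genome_mapping table, and the name sketch_map_peg_to_ids is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the genome_mapping table.
my %genome_mapping;
my $next_gnum = 1;

# Map a peg ID of the form fig|X.Y.peg.Z to (genome number, PEG index).
sub sketch_map_peg_to_ids {
    my ($peg) = @_;
    my ($genome, $pnum) = $peg =~ /^fig\|(\d+\.\d+)\.peg\.(\d+)$/
        or return;                       # not a well-formed peg ID
    # Assign the next free local number the first time a genome is seen.
    $genome_mapping{$genome} //= $next_gnum++;
    return ($genome_mapping{$genome}, $pnum);
}

my ($gnum, $pnum) = sketch_map_peg_to_ids('fig|83333.1.peg.1345');
print "$gnum $pnum\n";
```

Because the numbers are handed out in order of first use, two SEED instances will generally assign different genome numbers to the same genome, which is why the mapping is only meaningful locally.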
my @coupled_to = $fig->abstract_coupled_to($peg);
Return a list of functionally coupled PEGs.
ID of the protein encoding group whose functionally-coupled proteins are desired.
Returns a list of 4-tuples, each consisting of the ID of a coupled PEG, a score, a "type" which indicates the method that produced the score, and "extra data" in the form of a pointer to a list. If there are no PEGs functionally coupled to the incoming PEG, it will return an empty list. If the PEG data is not present, it will return an empty list.
my @coupled_to = $fig->coupled_to($peg);
Return a list of functionally coupled PEGs.
The new form of coupling and evidence computation is based on precomputed data. The old form took minutes to dynamically compute things when needed. The old form still works if the directory Data/CouplingData is not present. If it is present, it is assumed to contain comprehensive coupling data in the form of precomputed scores and PCHs.
If Data/CouplingData is present, this routine returns a list of 2-tuples [Peg,Sc]. It
returns the empty list if the peg is not coupled. It returns undef if Data/CouplingData
is not there.
ID of the protein encoding group whose functionally-coupled proteins are desired.
Returns a list of 2-tuples, each consisting of the ID of a coupled PEG and a score. If
there are no PEGs functionally coupled to the incoming PEG, it will return an empty
list. If the PEG data is not present, it will return undef.
usage: @evidence = $fig->coupling_evidence($peg1,$peg2)
The new form of coupling and evidence computation is based on precomputed data. The old form took minutes to dynamically compute things when needed. The old form still works if the directory Data/CouplingData is not present. If it is present, it is assumed to contain comprehensive coupling data in the form of precomputed scores and PCHs.
If Data/CouplingData is present, this routine returns a list of 3-tuples [Peg3,Peg4,Rep]. Here, Peg3 is similar to Peg1, Peg4 is similar to Peg2, and Rep == 1 iff this is a "representative pair". That is, we take all pairs and create a representative set in which each pair is not "too close" to any other pair in the representative set. Think of "too close" as being roughly 95% identical at the DNA level. This keeps (usually) a single pair from a set of different genomes from the same species.
It returns the empty list if the peg is not coupled. It returns undef if Data/CouplingData is not there.
usage: @coupling_data = $fig->coupling_and_evidence($fid,$bound,$sim_cutoff,$coupling_cutoff,$keep_record)
A computation of couplings and evidence starts with a given peg and produces a list of 3-tuples. Each 3-tuple is of the form
[Score,CoupledToFID,Evidence]
Evidence is a list of 2-tuples of FIDs that are close in other genomes (producing a "pair of close homologs" of [$peg,CoupledToFID]). The maximum score for a single PCH is 1, but "Score" is the sum of the scores for the entire set of PCHs.
NOTE: once the new version of precomputed coupling is installed (i.e., when Data/CouplingData is filled with the precomputed relations), the parameters on computing evidence are ignored.
If $keep_record is true, the system records the information, asserting coupling for each of the pairs in the set of evidence, and asserting a pin from the given $fid through all of the PCH entries used in forming the score.
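A short sketch of how a caller might walk the returned 3-tuples. The sample data is made up in the documented [Score,CoupledToFID,Evidence] shape, and summarize_coupling is a hypothetical helper, not part of FIG.

```perl
use strict;
use warnings;

# Made-up sample data in the documented shape [Score, CoupledToFID, Evidence];
# the IDs and scores are illustrative only.
my @coupling_data = (
    [ 2, 'fig|83333.1.peg.10', [ [ 'fig|1.1.peg.5', 'fig|1.1.peg.6' ],
                                 [ 'fig|2.1.peg.8', 'fig|2.1.peg.9' ] ] ],
    [ 1, 'fig|83333.1.peg.12', [ [ 'fig|3.1.peg.1', 'fig|3.1.peg.2' ] ] ],
);

# Summarize each tuple as [CoupledToFID, Score, number of PCHs].
sub summarize_coupling {
    my @tuples = @_;
    return map { [ $_->[1], $_->[0], scalar @{ $_->[2] } ] } @tuples;
}

for my $row (summarize_coupling(@coupling_data)) {
    printf "%s score=%s pchs=%d\n", @$row;
}
```

Since a single PCH contributes at most 1, the score of an entry can never exceed the number of evidence pairs it carries.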
usage: $fig->add_chr_clusters_and_pins($peg,$hits)
The system supports retaining data relating to functional coupling. If a user computes evidence once and then saves it with this routine, data relating to both "the pin" and the "clusters" (in all of the organisms supporting the functional coupling) will be saved.
$hits must be a pointer to a list of 3-tuples of the sort returned by $fig->coupling_and_evidence.
usage: $fig->translatable($prot_id)
The system takes any number of sources of protein sequences as input (and builds an nr for the purpose of computing similarities). For each of these input fasta files, it saves (in the DB) a filename, seek address and length so that it can go get the translation if needed. This routine simply returns true iff info on the translation exists.
usage: $len = $fig->translation_length($prot_id)
The system takes any number of sources of protein sequences as input (and builds an nr for the purpose of computing similarities). For each of these input fasta files, it saves (in the DB) a filename, seek address and length so that it can go get the translation if needed. This routine returns the length of a translation. This does not require actually retrieving the translation.
my $translation = $fig->get_translation($prot_id);
The system takes any number of sources of protein sequences as input (and builds an nr for the purpose of computing similarities). For each of these input fasta files, it saves (in the DB) a filename, seek address and length so that it can go get the translation if needed. This routine returns the stored protein sequence of the specified PEG feature.
ID of the feature (PEG) whose translation is desired.
Returns the protein sequence string for the specified feature.
usage: @mapped = $fig->mapped_prot_ids($prot)
This routine is at the heart of maintaining synonyms for protein sequences. The system determines which protein sequences are "essentially the same". These may differ in length (presumably due to miscalled starts), but the tails are identical (and the heads are not "too" extended). Anyway, the set of synonyms is returned as a list of 2-tuples [Id,length] sorted by length.
my @id_list = $fig->get_corresponding_ids($id, $with_type_info);
Return a list of the identifiers that correspond to the given identifier, based on the PIR id correspondence table.
Identifier to look up.
Pass a true value here to return tuples [id, source-type, link-information] instead of identifiers.
A list of identifiers if $with_type_info not true; a list of tuples [id, source-type, link-information] otherwise.
my $function = $fig->function_of($id, $user);
or
my @functions = $fig->function_of($id);
In a scalar context, returns the most recently-determined functional assignment of a specified feature by a particular user. In a list context, returns a list of 2-tuples, each consisting of a user ID followed by a functional assignment by that user. In this case, the list contains all the functional assignments for the feature.
ID of the relevant feature.
ID of the user whose assignment is desired (scalar context only)
Returns the most recent functional assignment by the given user in scalar context, and a list of functional assignments in list context. Each assignment in the list context is a 2-tuple of the form [$user, $assignment].
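The context-dependent return can be illustrated with Perl's wantarray. This is a sketch under stated assumptions: sketch_function_of is a hypothetical stand-in that takes only a user name, the assignments are made up, and returning an empty string when the user has no assignment is a guess rather than documented behavior.

```perl
use strict;
use warnings;

# Made-up assignments as [$user, $assignment] pairs.
my @assignments = (
    [ 'master', 'DNA polymerase III alpha subunit' ],
    [ 'alice',  'DNA polymerase' ],
);

# Hypothetical stand-in for function_of: a list of pairs in list context,
# a single assignment string for one user in scalar context.
sub sketch_function_of {
    my ($user) = @_;
    return @assignments if wantarray;
    my ($hit) = grep { $_->[0] eq $user } @assignments;
    return $hit ? $hit->[1] : '';
}

my $single = sketch_function_of('alice');     # scalar context
my @all    = sketch_function_of();            # list context
print "$single\n", scalar(@all), "\n";
```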
my $functionHash = $fig->function_of_bulk(\@fids, $no_del_check);
Return a hash mapping the specified proteins to their master functional assignments.
Reference to a list of feature IDs.
If TRUE, then deleted features will not be removed from the list. The default is FALSE, which means deleted features will be removed from the list.
Returns a reference to a hash mapping feature IDs to their main functional assignments.
usage: $function = $fig->translated_function_of($peg,$user)
You get just the translated function.
usage: $translated_func = $fig->translate_function($func)
Translates a function based on the function.synonyms table.
usage: $fig->assign_function($peg,$user,$function,$confidence)
Assigns a function. Note that a confidence can be included (and should be, if it is unusual). Assignments are now logged in the annotation file by assign_function.
New sims code.
This code takes advantage of a network similarity server if it is available.
We gather sims in the following manner:
If a local sims directory exists, gather the raw sims for our peg.
If dynamic sims are available, gather the raw sims from there as well.
Do an initial pruning of these raw sims, based on the conditions
passed in to the sims call.
Locally expand these sims.
If we are using network sims, retrieve them now, and add to the local sims set.
Do a final pruning of this set of sims, and sort.
usage: @sims = $fig->osims($peg,$maxN,$maxP,$select,$max_expand, $filters)
Returns a list of similarities for $peg such that
there will be at most $maxN similarities,
each similarity will have a P-score <= $maxP, and
$select gives processing instructions:
"raw" means that the similarities will not be expanded (by far fastest option)
"fig" means return only similarities to fig genes
"all" means that you want all the expanded similarities.
"figx" means expand until the maximum number of fig sims
By "expanded", we refer to taking a "raw similarity" against an entry in the non-redundant protein collection, and converting it to a set of similarities (one for each of the proteins that are essentially identical to the representative in the nr).
Each entry in @sims is a reference to an array. These are the values in each array position:
0. The query peg
1. The similar peg
2. The percent id
3. Alignment length
4. Mismatches
5. Gap openings
6. The start of the match in the query peg
7. The end of the match in the query peg
8. The start of the match in the similar peg
9. The end of the match in the similar peg
10. E value
11. Bit score
12. Length of query peg
13. Length of similar peg
14. Method
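For readability, a caller may prefer named fields over positions 0 through 14. The following is a sketch only: the field names and the helper sim_to_hash are hypothetical, and the sample record is made up.

```perl
use strict;
use warnings;

# Field names in the documented order, positions 0 through 14.
my @SIM_FIELDS = qw(
    query_peg similar_peg percent_id align_len mismatches gap_openings
    query_start query_end similar_start similar_end
    e_value bit_score query_len similar_len method
);

# Convert one positional similarity record into a hash keyed by field name.
sub sim_to_hash {
    my ($sim) = @_;
    my %h;
    @h{@SIM_FIELDS} = @$sim;
    return \%h;
}

# A made-up record, for illustration only.
my $sim = [ 'fig|83333.1.peg.1', 'fig|100226.1.peg.9', 87.5, 200, 25, 0,
            1, 200, 3, 202, 1e-50, 310, 220, 225, 'blastp' ];
my $h = sim_to_hash($sim);
print "$h->{similar_peg} $h->{percent_id} $h->{bit_score}\n";
```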
my @bbhList = $fig->bbhs($peg, $cutoff);
Return a list of the bi-directional best hits relevant to the specified PEG.
ID of the feature whose bidirectional best hits are desired.
Similarity cutoff. If omitted, 1e-10 is used.
Returns a list of 3-tuples. The first element of each tuple is the best-hit PEG; the second element is the score. A lower score indicates a better match. The third element is the bit score for the pair, normalized to the length of the protein.
my $bbhHash = $fig->bbh_list($genomeID, \@featureList);
Return a hash mapping the features in a specified list to their bidirectional best hits on a specified target genome.
(Modeled after the Sprout call of the same name.)
ID of the genome from which the best hits should be taken.
List of the features whose best hits are desired.
Returns a reference to a hash that maps the IDs of the incoming features to the best hits on the target genome.
usage: $dir = $fig->get_figfams_data($mydir)
usage: $dir = &FIG::get_figfams_data($mydir)
Returns the Figfams data directory to use. If $mydir is passed, use that value. Otherwise see if $FIG_Config::FigfamsData is defined, and use that. Otherwise default to $FIG_Config::data/FigfamsData.
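The lookup order can be sketched as follows. The hash %config is a hypothetical stand-in for the FIG_Config settings, the paths are made up, and sketch_figfams_dir is not the real routine.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the FIG_Config settings; the paths are made up.
my %config = ( data => '/vol/seed/FIGdisk/FIG/Data', FigfamsData => undef );

# Resolve the directory: explicit argument, then the configured
# FigfamsData value, then the default under the data directory.
sub sketch_figfams_dir {
    my ($mydir) = @_;
    return $mydir                 if defined $mydir;
    return $config{FigfamsData}   if defined $config{FigfamsData};
    return "$config{data}/FigfamsData";
}

print sketch_figfams_dir(), "\n";
print sketch_figfams_dir('/tmp/ff'), "\n";
```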
usage: @sims = $fig->dsims($id,$seq,$maxN,$min_nbsc)
Returns a list of similarities for $seq against PEGs from FIGfams such that
there will be at most $maxN similarities, and
each similarity will have a normalized bit-score >= $min_nbsc
The "dsims" or "dynamic sims" are not precomputed. They are computed using a heuristic which is much faster than blast, but misses some similarities. Essentially, you have an "index" of representative sequences; a quick blast is done against it, and if there are any hits, these are used to indicate which sub-databases to blast against. This implies that the p-scores are fairly meaningless; use the normalized bit-scores ($sim->nbsc) instead.
usage: @pegs = $fig->in_cluster_with($peg)
Returns the set of pegs that are thought to be clustered with $peg (on the chromosome).
usage: $fig->add_chromosomal_clusters($file)
The given file is supposed to contain one predicted chromosomal cluster per line (either comma or tab separated pegs). These will be added (to the extent they are new) to those already in $FIG_Config::global/chromosomal_clusters.
usage: $fig->in_pch_pin_with($peg)
Returns the set of pegs that are believed to be "pinned" to $peg (in the sense that PCHs occur containing these pegs over significant phylogenetic distances).
usage: $fig->add_pch_pins($file)
The given file is supposed to contain one set of pinned pegs per line (either comma or tab separated pegs). These will be added (to the extent they are new) to those already in $FIG_Config::global/pch_pins.
my $okFlag = $fig->add_annotation($fid, $user, $annotation, $time_made);
Add an annotation to a feature.
ID of the feature to be annotated.
Name of the user making the annotation.
Text of the annotation.
Time of the annotation, in seconds since the epoch. If omitted, the current time is used.
Returns 1 if successful, 0 if any of the parameters are invalid or an error occurs.
my ($n_added, $badList) = $fig->add_annotation_batch($file);
Install a batch of annotations.
File containing annotations.
Returns the number of annotations successfully added in $n_added. If annotations failed, they are returned in $badList as a tuple [$peg, $error_msg, $entry].
usage: @annotations = $fig->merged_related_annotations($fids)
The set of annotations of a set of PEGs ($fids) is returned as a list of 4-tuples. Each entry in the list is of the form [$fid,$timestamp,$user,$annotation].
my @annotations = $fig->feature_annotations($fid, $rawtime);
Return a list of the specified feature's annotations. Each entry in the list returned is a 4-tuple containing the feature ID, time stamp, user ID, and annotation text. These are exactly the values needed to add the annotation using add_annotation, though in a different order.
ID of the feature whose annotations are to be listed.
If TRUE, the times will be returned as PERL times (seconds since the epoch); otherwise, they will be returned as formatted time strings.
Returns a list of 4-tuples, one per annotation. Each tuple is of the form ($fid, $timeStamp, $user, $annotation) where $fid is the feature ID, $timeStamp is the time the annotation was made, $user is the name of the user who made the annotation, and $annotation is the text of the annotation.
my @annotations = $fig->read_all_annotations($genomeID);
Return a list of the specified genome's annotations. Each entry in the list returned is a 4-tuple containing the feature ID, time stamp, user ID, and annotation text. The values are read directly from the annotation flat file without resorting to the database.
ID of the genome whose annotations are to be read.
Returns a list of 4-tuples, one per annotation. Each tuple is of the form ($fid, $timeStamp, $user, $annotation) where $fid is the feature ID, $timeStamp is the time the annotation was made, $user is the name of the user who made the annotation, and $annotation is the text of the annotation.
my $annoString = FIG::read_annotation_record($fileHandle);
Read an annotation record from the specified file handle. Will return the
annotation record if successful, and undef if end-of-file is read. An
annotation record consists of multiple lines of text separated by a
line containing a double-slash //.
The file handle from which to read the record.
Returns either the entire annotation record (without the double-slash) or
undef, indicating end-of-file. Null records will not be returned.
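A minimal sketch of the record-reading behavior described above. The helper sketch_read_record is hypothetical, the sample record contents are made up, and returning a final record that lacks a trailing // is an assumption of the sketch.

```perl
use strict;
use warnings;

# Read lines up to the next line consisting of "//" and return them as one
# string; skip null records; return undef at end-of-file.
sub sketch_read_record {
    my ($fh) = @_;
    my $record = '';
    while (defined(my $line = <$fh>)) {
        if ($line =~ m{^//\s*$}) {
            return $record if $record ne '';   # null records are skipped
            next;
        }
        $record .= $line;
    }
    return $record ne '' ? $record : undef;    # last record may lack "//"
}

# Illustrative input held in memory; real annotation files are on disk.
my $text = "fig|1.1.peg.1\n100\nalice\nnote one\n//\n//\nfig|1.1.peg.2\n200\nbob\nnote two\n//\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
my @records;
while (defined(my $rec = sketch_read_record($fh))) {
    push @records, $rec;
}
close($fh);
print scalar(@records), "\n";
```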
usage: $date = $fig->parse_date(date-string)
Parse a date string, returning seconds-since-the-epoch, or undef if the date did not parse.
Accepted formats include an integer, which is assumed to be seconds-since-the-epoch and is just returned; MM/DD/YYYY; or a date that can be parsed by the routines in the Date::Parse module.
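A sketch of the first two cases, using only the core Time::Local module. The name sketch_parse_date is hypothetical, and the Date::Parse fallback is deliberately left out, so anything that is neither an integer nor MM/DD/YYYY yields undef here.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# Handles only the integer and MM/DD/YYYY formats; the Date::Parse
# fallback described in the text is not part of this sketch.
sub sketch_parse_date {
    my ($date) = @_;
    return $date if $date =~ /^\d+$/;              # already epoch seconds
    if (my ($mm, $dd, $yyyy) = $date =~ m{^(\d{1,2})/(\d{1,2})/(\d{4})$}) {
        # timelocal dies on impossible dates, so trap that and return undef.
        my $t = eval { timelocal(0, 0, 0, $dd, $mm - 1, $yyyy) };
        return $t;
    }
    return undef;
}

my @lt = localtime(sketch_parse_date('03/15/2006'));
printf "%04d-%02d-%02d\n", $lt[5] + 1900, $lt[4] + 1, $lt[3];
```

The sketch interprets MM/DD/YYYY as local midnight; whether the real routine uses local time or UTC is not stated in the text above.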
Extract a list of assignments from an annotations package as created by annotations_made_fast. Assumes that the user and date filtering was done by the annotations extraction, so all this has to do is to sort the lists of annotations by date and grab the latest one.
Return value is a list of tuples [$peg, $assignment, $date, $who].
usage: @annotations = $fig->annotations_made($genomes, $who, $date)
Return the list of annotations on the genomes in @$genomes made by $who after $date.
Each returned annotation is of the form [$fid,$timestamp,$user,$annotation].
The attribute system automatically detects whether you are using a local attribute database, a remote attribute server, or the SEED data store. For details on the new attribute system see the documentation for the CustomAttributes module.
Because of the enormous number of attributes in the system (1.5 million and growing), the old system, which combined a database table and flat file data stores, has become too slow for live SEEDs. It is maintained for small test SEEDs, such as what you might have running on a local PC. Be aware, however, that not all functions of the old system work in the new system, and vice versa. You can get a more accurate test system by linking to the test attribute server. Simply place
$attrURL = "http://nmpdr-1.nmpdr.org/next/FIG/AttribXMLRPC.cgi";
in your FIG_Config file. This server contains old data that can be mangled without let or hindrance. To connect to the real server, use
$attrURL = "http://nmpdr-1.nmpdr.org/next/FIG/AttribXMLRPC.cgi";
but be aware that any changes you make will automatically be migrated to all the production SEEDs.
There are several base attribute methods:
get_attributes, add_attribute, delete_attribute, change_attribute
There are also methods for more complex things:
get_keys, get_values, guess_value_format
By default all keys are case sensitive, and all keys have leading and trailing white space removed. Keys cannot contain anything but [a-zA-Z0-9_] (or things matched by \w).
Attributes are not on a 1:1 correlation, so a single key can have several values.
Most attribute files are stored in the genome-specific directories. These are in Organisms/nnnnn.n/Attributes for the organisms, and Organisms/nnnnn.n/Features/*/Attributes for the features. Attributes can also be stored in Global/Attributes where they will be loaded, but users are discouraged from doing this since there will be no packaging and sharing of those attributes. Global should be reserved for those attributes that are calculated on a database-wide instance. There are several "special" files that we are using:
1. Definition files
These are the raw text files stored in the appropriate locations (Organisms/nnnnn.n/Attributes, Organisms/nnnnn.n/Features/*/Attributes, and Global/Attributes). The files should consist of ONLY feature, key, value, and optional URL. Any other columns will be ignored and not loaded into the database.
2. Global/Attributes/attribute_keys
This contains the definition of the attribute keys. There are currently 3 defined columns although others may be added and this file can contain lines of an arbitrary length.
3. Global/Attributes/transaction_log, Organisms/nnnnnn.n/Attributes/transaction_log, and Organisms/nnnnnn.n/Features/*/Attributes/transaction_log
These are the transaction logs that contain any modifications to the data. In general the data is loaded from a single definition file that is not modified by the software. Any changes to the attributes are made in the database and then written to the transaction log. The transaction log has the following columns:
1. command. This can be one of ADD/DELETE/CHANGE
2. feature. The feature id to be modified
3. key. The key to be modified
4. old value. The original value of the key
5. old url. The original URL
6. new value. The new value for the key. Ignored if the key is deleted.
7. new url. The new value for the URL. Ignored if the key is deleted.
Note that the old value and old url are optional. If they are not provided ALL instances of the key will be affected.
Notice also that the old file assigned_attributes is no longer used. This is replaced by the transaction log.
Finally, in the parsing of all files any line beginning with a pound sign is ignored as a comment.
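A sketch of reading one transaction log line into its seven columns. The column separator is not stated above, so tab separation is an assumption, and the helper name and hash keys are hypothetical.

```perl
use strict;
use warnings;

# Parse one transaction-log line into a hash, or return nothing for
# comments and blank lines. Tab separation is an assumption of this sketch.
sub sketch_parse_log_line {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/ || $line !~ /\S/;
    my %rec;
    @rec{qw(command feature key old_value old_url new_value new_url)}
        = split /\t/, $line, -1;
    $rec{command} = uc $rec{command};
    return unless $rec{command} =~ /^(?:ADD|DELETE|CHANGE)$/;
    return \%rec;
}

my $rec = sketch_parse_log_line("CHANGE\tfig|83333.1.peg.4\tstructure\told\t\tnew\t\n");
print "$rec->{command} $rec->{key} $rec->{new_value}\n";
```

Lines that start with a pound sign are dropped before any splitting happens, which matches the comment rule above.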
A method, read_attribute_transaction_log, is provided to read the transaction_logs and implement the changes therein. In each of the methods add_attribute, delete_attribute, and change_attribute there is an optional boolean that can be set to prevent writing of the transaction_log. The read_attribute_transaction_log reads the log and then adds/changes/deletes the records as appropriate. Without this boolean there is a circular reference.
Get attributes requires one of four keys: fid (which can be genome, peg, rna, or other id, or a reference to a list of ids), key, value, url
It will find any attribute that has the characteristics that you request, and for each match it will return a four-tuple of [fid, key, value, url].
You can request an E. coli key like this $fig->get_attributes('83333.1');
You can request a peg id like this:
$fig->get_attributes($peg);
$fig->get_attributes("fig|83333.1.peg.4");
You can request any structure key like this $fig->get_attributes(undef, 'structure');
You can request any url like this $fig->get_attributes(undef, undef, undef, 'http://pir.georgetown.edu/sfcs-cgi/new/pirclassif.pl?id=SF001547');
NOTE: If there are no attributes an empty array will be returned. You need to check for this and not assume that it will be undef.
my @attributeList = $fig->get_attributes($objectID, $key, @values);
In the database, attribute values are sectioned into pieces using a splitter value specified in the constructor (new). This is not a requirement of the attribute system as a whole, merely a convenience for the purpose of these methods. If a value has multiple sections, each section is matched against the corresponding criterion in the @values list.
This method returns a series of tuples that match the specified criteria. Each tuple
will contain an object ID, a key, and one or more values. The parameters to this
method therefore correspond structurally to the values expected in each tuple. In
addition, you can ask for a generic search by suffixing a percent sign (%) to any
of the parameters. So, for example,
my @attributeList = $attrDB->GetAttributes('fig|100226.1.peg.1004', 'structure%', 1, 2);
would return something like
['fig|100226.1.peg.1004', 'structure', 1, 2]
['fig|100226.1.peg.1004', 'structure1', 1, 2]
['fig|100226.1.peg.1004', 'structure2', 1, 2]
['fig|100226.1.peg.1004', 'structureA', 1, 2]
Use of undef in any position acts as a wild card (all values). You can also specify
a list reference in the ID column. Thus,
my @attributeList = $attrDB->GetAttributes(['100226.1', 'fig|100226.1.%'], 'PUBMED');
would get the PUBMED attribute data for Streptomyces coelicolor A3(2) and all its features.
In addition to values in multiple sections, a single attribute key can have multiple values, so even
my @attributeList = $attrDB->GetAttributes($peg, 'virulent');
which has no wildcard in the key or the object ID, may return multiple tuples.
Value matching in this system works very poorly, because of the way multiple values are stored. For the object ID and key name, we create queries that filter for the desired results. For the values, we do a comparison after the attributes are retrieved from the database. As a result, queries that filter only on value end up reading the entire attribute table to find the desired results.
ID of object whose attributes are desired. If the attributes are desired for multiple
objects, this parameter can be specified as a list reference. If the attributes are
desired for all objects, specify undef or an empty string. Finally, you can specify
attributes for a range of object IDs by putting a percent sign (%) at the end.
Attribute key name. A value of undef or an empty string will match all
attribute keys. If the values are desired for multiple keys, this parameter can be
specified as a list reference. Finally, you can specify attributes for a range of
keys by putting a percent sign (%) at the end.
List of the desired attribute values, section by section. If undef
or an empty string is specified, all values in that section will match. A
generic match can be requested by placing a percent sign (%) at the end.
In that case, all values that match up to and not including the percent sign
will match. You may also specify a regular expression enclosed
in slashes. All values that match the regular expression will be returned. For
performance reasons, only values have this extra capability.
Returns a list of tuples. The first element in the tuple is an object ID, the second is an attribute key, and the remaining elements are the sections of the attribute value. All of the tuples will match the criteria set forth in the parameter list.
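The value-matching rules described for this method (wild card, trailing percent sign, regular expression in slashes, exact value) can be sketched as a single predicate. The name sketch_value_matches is hypothetical and the real matching code may differ.

```perl
use strict;
use warnings;

# Decide whether one value section satisfies one criterion, following the
# rules described above. A sketch only; the real matching code may differ.
sub sketch_value_matches {
    my ($value, $pattern) = @_;
    return 1 if !defined $pattern || $pattern eq '';      # wild card
    if ($pattern =~ m{^/(.+)/$}) {                        # /regex/
        my $re = $1;
        return $value =~ /$re/ ? 1 : 0;
    }
    if ($pattern =~ /^(.*)%$/) {                          # generic prefix
        my $prefix = $1;
        return index($value, $prefix) == 0 ? 1 : 0;
    }
    return $value eq $pattern ? 1 : 0;                    # exact match
}

print sketch_value_matches('structure1', 'structure%'), "\n";
```

A predicate like this has to run on every retrieved row, which is consistent with the warning that filtering on value alone reads the whole attribute table.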
my @attributeData = $ca->query_attributes($filter, $filterParms);
Return the attribute data based on an SQL filter clause. In the filter clause,
the name $object should be used for the object ID, $key should be used for
the key name, $subkey for the subkey value, and $value for the value field.
Filter clause in the standard ERDB format, except that the field names are $object for
the object ID field, $key for the key name field, $subkey for the subkey field,
and $value for the value field. This abstraction enables us to hide the details of
the database construction from the user.
Parameters for the filter clause.
Returns a list of tuples. Each tuple consists of an object ID, a key (with optional subkey), and one or more attribute values.
A simple wrapper around get_attributes to return only those attributes that have meta_data indicating that the key is a controlled vocabulary.
### DEPRECATED ### The controlled vocabulary feature was never used in the old system, and in the new system, ALL the keys are controlled vocabulary.
Add a new key/value pair to something. Something can be a genome id, a peg, an rna, prophage, whatever.
Arguments:
feature id, this can be a peg, genome, etc,
key name. This is case sensitive and has the leading and trailing white space removed
value
optional URL to add
boolean to prevent writing to the transaction log. See above
$fig->delete_attribute($objectID, $key, @values);
Delete the specified attribute key/value combination from the database.
ID of the object whose attribute is to be deleted.
Attribute key name.
One or more values associated with the key. If no values are specified, then all values will be deleted. Otherwise, only a matching value will be deleted.
my ($type, $id) = FIG::parse_oid($idValue);
Convert an attribute object ID to an object type and an ID applicable to that type.
This information can be used to convert an ID string obtained from the get_attributes
method to an object name and ID suitable for plugging into the GetEntity method
of an ERDB database.
ID string from the attribute database.
Returns a two-element list consisting of the object type and its individual ID.
my $idValue = FIG::form_oid($type, $id);
Convert an object type and ID into an ID string for the attribute database.
Object type. This should usually correspond to an entity name in a database. It can only contain letters. This means no digits, spaces, or even underscores.
Individual object ID.
Returns the string used to represent the object in the attribute database.
my @attributeList = $fig->delete_matching_attributes($objectID, $key, @values);
This method works identically to get_attributes, except that the attributes are deleted as they are retrieved.
$fig->change_attribute($objectID, $key, \@oldValues, \@newValues);
Change the value of an attribute key/value pair for an object. This is implemented as a delete followed by an insert.
ID of the genome or feature to which the attribute is to be changed. In general, an ID that
starts with fig| is treated as a feature ID, and an ID that is all digits and periods
is treated as a genome ID. For IDs of other types, this parameter should be a reference
to a 2-tuple consisting of the entity type name followed by the object ID.
Attribute key name. This corresponds to the name of a field in the database.
One or more values identifying the key/value pair to change.
One or more values to be put in place of the old values.
clean_attribute_key()
## DEPRECATED ## This process is no longer required in the new system.
use $key=$fig->clean_attribute_key($key)
Keys for attributes are used as filenames in the code, and there are limitations on the characters that can be used in the key name. We provide an extended explanation of each key, so the key does not necessarily need to be person-readable.
Keys are not allowed to contain any non-word character (i.e. they must only contain [a-zA-Z0-9] and _).
This method will remove these.
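A minimal sketch of such a cleanup, assuming that offending characters are simply dropped. The name sketch_clean_key is hypothetical and the deprecated method may have differed in detail.

```perl
use strict;
use warnings;

# Trim surrounding white space, then drop every character that \w does not
# match. A sketch; the deprecated method may have differed in detail.
sub sketch_clean_key {
    my ($key) = @_;
    $key =~ s/^\s+|\s+$//g;
    $key =~ s/\W//g;
    return $key;
}

print sketch_clean_key("  PIR super-family #1 "), "\n";   # PIRsuperfamily1
```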
my $flag = $fig->essential($fid);
Return TRUE if a feature is considered essential and FALSE otherwise. This method
provides a uniform method for determining essentiality that will remain consistent
during the various overhauls of essentiality. Currently a feature is essential
if it has an attribute with the value essential or potential_essential.
ID of the feature to check for essentiality.
Returns TRUE if the feature is considered essential, else FALSE.
my $flag = $fig->virulent($fid);
Return TRUE if a feature is considered virulent and FALSE otherwise. This method
provides a uniform method for determining virulence that will remain consistent
during the various overhauls of virulence attributes. Currently a feature is virulent
if it has an attribute whose key begins with virulence_associated.
ID of the feature to check for virulence.
Returns TRUE if the feature is considered virulent, else FALSE.
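The rule stated for virulent (an attribute whose key begins with virulence_associated) can be sketched against a list of attribute tuples. The tuples below are made up in the [fid, key, value, url] shape, and sketch_virulent is a hypothetical stand-in that searches this list rather than the attribute database.

```perl
use strict;
use warnings;

# Made-up attribute tuples in the [fid, key, value, url] shape.
my @attributes = (
    [ 'fig|83333.1.peg.4', 'virulence_associated_pathogenicity_island', 'PAI-1', '' ],
    [ 'fig|83333.1.peg.5', 'structure', '1abc', '' ],
);

# TRUE (1) if the feature has any key starting with virulence_associated.
sub sketch_virulent {
    my ($fid) = @_;
    my @hits = grep { $_->[0] eq $fid && $_->[1] =~ /^virulence_associated/ }
               @attributes;
    return @hits ? 1 : 0;
}

print sketch_virulent('fig|83333.1.peg.4'), sketch_virulent('fig|83333.1.peg.5'), "\n";
```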
There was a big problem with attributes being very slow to recover, and having to recover all attributes just to get those for a peg or a genome. The current implementation splits the original ID (oid) into three columns, genome, ftype, and id. The ftype is peg, rna, pp, etc. The id is the feature number. The genome is the genome number.
Hence:
fig|83333.1.peg.1345 becomes 83333.1, peg, and 1345
83333.1 becomes 83333.1, '', and ''
To split an oid into an array with three parts:
$self->split_attribute_oid($peg);
To join the three parts of a series of results: map {unshift @$_, $self->join_attribute_oid(splice(@$_, 0, 3))} @$res;
This code splices the first three elements of the array, joins them, and then unshifts the result of that join back into the start of the array. Cool, eh?
split_attribute_oid()
use my ($genome, $type, $id)=split_attribute_oid($id);
Splits an ID into genome, type, and id if it is a feature; into the genome, '', and '' if it is a genome; and into the ID, undef, and undef if the kind of ID is not known.
join_attribute_oid()
use my $id=join_attribute_oid($genome, $feature, $id);
Joins an attribute ID back together after it has been pulled from the MySQL database.
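The split and join steps can be sketched as a pair of plain functions. These are hypothetical stand-ins written from the description above, not the module's own code, and the sample result row is made up.

```perl
use strict;
use warnings;

# Split an object ID into (genome, type, id) as described above.
sub sketch_split_oid {
    my ($oid) = @_;
    return ($1, $2, $3) if $oid =~ /^fig\|(\d+\.\d+)\.(\w+)\.(\d+)$/;  # feature
    return ($oid, '', '') if $oid =~ /^\d+\.\d+$/;                      # genome
    return ($oid, undef, undef);                                        # unknown
}

# Rebuild the original ID from the three columns.
sub sketch_join_oid {
    my ($genome, $type, $id) = @_;
    return "fig|$genome.$type.$id" if defined $type && $type ne '';
    return $genome;
}

# Collapse the first three columns of each result row back into one ID.
my $res = [ [ '83333.1', 'peg', '1345', 'structure', '1abc' ] ];
unshift @$_, sketch_join_oid(splice(@$_, 0, 3)) for @$res;
print join(',', @{ $res->[0] }), "\n";
```

The last statement is the splice-and-unshift idiom from the text, written as a statement modifier instead of a map in void context.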
use: $fig->read_attribute_transaction_log($logfile);
This method reads the transaction_log described in $logfile and enacts the changes described therein. The changes must be one of add, delete, or change.
This method will remove any notion of the attribute that you give it. It is different from delete as that just removes a single attribute associated with a peg. This will remove the files and uninstall the attributes from the database so there is no memory of that type of attribute. All of the attribute files are moved to FIG_Tmp/Attributes/deleted_attributes, and so you can recover the data for a while. Still, you should probably use this carefully!
I use this to clean out old PIR superfamily attributes immediately before installing the new correspondence table.
e.g. my $status=$fig->erase_attribute_entirely("structure");
This will return the number of files that were moved to the new location
my @keys = $fig->get_group_keys($groupName);
Return all the attribute keys in the named group.
Name of the group whose keys are desired.
Returns a list of the attribute keys in the named group.
my %keys = $fig->get_group_key_info($groupName);
Return the descriptive data for all the attribute keys in the named group.
Name of the group whose keys are desired. If omitted, then all keys will be returned. This could be expensive, but when it's necessary, it's necessary.
Returns a hash mapping each relevant attribute key to an n-tuple containing the attribute relation name, the description, and the 0 or more group names.
Get all the keys that apply to genomes and only genomes. This method takes no arguments and returns an array.
Get all the keys that apply just to pegs. This method takes no arguments and returns an array.
Get all the keys that apply just to pegs from a specified genome. This method takes a genome id as an argument and returns an array.
Get a list of all genomes that have a specified attribute. This will search for all genomes that have some attribute.
This will also accept partial matches. Hence to find all genomes that have essentiality data you can do this:
my @genomes=$fig->get_genomes_with_attribute("essential");
This will find Essential_Gene_Sets_Bacterial, essential, etc
DEPRECATED: in actual fact, no attribute metadata was ever put into the system.
Access a hash of key information. The data that are returned are currently:
hash key name    what is it                     data type
single                                          [boolean]
description      Explanation of key             [free text]
readonly         whether to allow read/write    [boolean]
is_cv            attribute is a cv term         [boolean]
Single is a boolean. If it is true, only the last value returned should be used. Note that the other methods will still return all the values; it is up to the implementer to ensure that only the last value is used.
Explanation is a user-derived explanation that can be free text
If a reference to a hash is provided, along with the key, those values will be set to the attribute_keys file
Returns an empty hash if the key is not provided or doesn't exist
e.g.
$fig->key_info($key, \%data); # set the data
$data=$fig->key_info($key); # get the data
This data is stored in a file called $FIG_Config::global/Attributes/attribute_metadata and in a database called attribute_metadata. The data is strictly on a last in last out basis, so that if a datapoint is changed, the last datapoint in the database or file is returned. At the moment I am not coding the ability to edit data.
The method takes the following arguments
The key to look for or add data to.
A reference to a hash containing the new data to add to the database. If provided this will cause the database to be updated
Do not write the new data to the attributes_metadata file. This is mainly used by load_attributes to prevent a circular read/write condition.
update_attributes_metadata()This method exists solely to update the attributes metadata file and make sure that it is in the right format. This method can probably be deleted in a while, but it needs to be run on all machines with attributes data before then!
It is only called if an old attributes metadata file is found.
The method returns the filename where the data is now stored.
Get all the values that we know about
Without any arguments:
Returns a reference to a hash, where the key is the type of feature (peg, genome, rna, prophage, etc), and the value is a reference to a hash where the key is the value and the value is the number of occurrences
e.g. print "There are ", $fig->get_values->{'peg'}->{'100'}, " keys with the value 100 in the database\n";
With a single argument:
The argument is assumed to be the type (rna, peg, genome, etc).
With two arguments:
The first argument is the type (rna, peg, genome, etc), and the second argument is the key.
In each case it will return a reference to a hash. E.g.
$fig->get_values(); # will get all values
$fig->get_values('peg'); # will get all values for pegs
$fig->get_values('peg', 'structure'); # will get all values for pegs with attribute structure
$fig->get_values(undef, 'structure'); # will get all values for anything with that attribute
There are occasions where I want to know what a value is for a key. I have three scenarios right now:
1. strings
2. numbers
3. percentiles (a type of number, I know)
In these cases, I may want to know something about them and do something interesting with them. This will try and guess what the values are for a given key so that you can try and limit what people add. At the moment this is pure guess work, although I suppose we could put some restrictions on t/v pairs I don't feel like.
This method will return a reference to an array. If the element is a string there will only be one element in that array, the word "string". If the value is a number, there will be three elements, the word "float" in position 0, and then the minimum and maximum values. You can figure out if it is a percent :-)
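The return convention described above can be sketched as follows. This is a guess at the idea rather than the actual implementation: guess_value_kind_sketch is a hypothetical name, and the rule "every value looks numeric" is an assumption about how the guessing might be done.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use List::Util qw(min max);

# Hypothetical helper: classify the values seen for one key.
# Returns ['string'] or ['float', $min, $max].
sub guess_value_kind_sketch {
    my (@values) = @_;
    # Any non-numeric value (or no values at all) means we call it a string.
    return ['string'] if !@values || grep { !looks_like_number($_) } @values;
    return ['float', min(@values), max(@values)];
}

# A caller can work out for itself whether a float range looks like a percentage.
my $kind = guess_value_kind_sketch(12.5, 80, 99.9);
if ($kind->[0] eq 'float' && $kind->[1] >= 0 && $kind->[2] <= 100) {
    print "numeric, possibly a percentage\n";
}
```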
This is just an internal method to find the appropriate location of the attributes file depending on whether it is a peg, an rna, or a genome or whatever.
Add a controlled vocabulary term to a peg. Pass in the peg, the vocab name, the termId, and the term (see next paragraph). Returns an error string if there is a problem, else returns nothing.
my $status = $fig->add_cv_term( "master:EdF",
"fig|9606.3.peg.26823", "MyVocab", "1234", "A thing of wonder.");
if ($status) {print "error adding cv term: $status\n";}
Controlled vocabulary is read-only text associated with a peg. Each is a triple, namely (vocab name, termId, term text). The termId is an id that is used in the particular vocabulary and the term text is the actual term. For example, the GO has the term "U12-type nuclear mRNA branch site recognition" with termId GO:0000371. Thus, the triplet is (GO, GO:0000371, "U12-type nuclear mRNA branch site recognition"). Don't be confused by the GO: in GO:0000371. We don't add the GO:. That's just what GO decided to do.
termIds can not have ';' in them.
This routine encapsulates our present implementation via attributes.
Search a controlled vocabulary file for desired text. Pass the name of the CV, e.g., "GO" or "HUGO", and the search term, and get back a reference to a list of results. Each result is a line from the file, and so is a tab-separated representation of the triplet (CV_name, CV_id, CV_text).
Case insensitive, substring.
sub search_cv_file
{
    my ($self, $cv, $search_term) = @_;
    my $file = $FIG_Config::global."/CV/cv_search_".$cv.".txt";
    # Use a lexical filehandle and three-argument open.
    my $fh;
    if (! open($fh, '<', $file) ) {
        print STDERR "Search could not find vocabulary file, $file\n";
        return;
    }
    my @lines = <$fh>;
    close($fh);
    chomp @lines;
    # Case-insensitive substring match; \Q...\E keeps any regex
    # metacharacters in the search term literal.
    my @grep_results = grep(/\Q$search_term\E/i, @lines);
    return [@grep_results];
}
################################# Indexing Features and Functional Roles ####################################
my ($pegs,$roles) = $fig->search_index($pattern, $non_word_search, $user);
Find all pegs and roles that match a search pattern. The syntax of $pattern is deliberately left undefined so that we can change the underlying technology, but a single word or phrase should work.
A search pattern. In general, the pattern is a single word or phrase that is expected to occur somewhere in a functional role, attribute key, or attribute value.
If specified, the pattern will be interpreted as a string instead of a series of words.
If specified, the name of the current user. That user's annotation will be given precedence when the functional role is determined.
Returns a 2-tuple. The first element is a reference to a list of features. For each feature, there is a tuple consisting of the (0) feature ID, (1) the organism name (genus and species), (2) the aliases, (3) the functional role, and (4) the relevant annotator. The second element in the returned tuple is a reference to a list of functional roles. All the roles and features in the lists must match the pattern in some way.
my ($who, $function) = $fig->choose_function($user, @funcs);
Choose the best functional role from a list of role/user tuples. If a user is specified, we look for one by that user. If that doesn't work, we look for one by a master user. If THAT doesn't work, we take the first one.
The name of the current user. If no user is active, specify either undef or
a null string.
List of functional roles. Each role is represented by a 2-tuple consisting of the user name followed by the role description.
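The selection order described above (the named user first, then a master user, then whatever comes first) can be sketched like this. choose_function_sketch is a hypothetical stand-in, and treating any user name that begins with "master" as a master user is an assumption, not something the text states.

```perl
use strict;
use warnings;

# Hypothetical stand-in for choose_function.
# @funcs is a list of [user, role] tuples; returns (user, role).
sub choose_function_sketch {
    my ($user, @funcs) = @_;
    return () unless @funcs;
    my $pick;
    # 1. Prefer an assignment made by the current user, if there is one.
    if (defined $user && length $user) {
        ($pick) = grep { $_->[0] eq $user } @funcs;
    }
    # 2. Otherwise fall back to a master user (assumed to start with "master").
    ($pick) = grep { $_->[0] =~ /^master\b/ } @funcs unless $pick;
    # 3. Otherwise take the first tuple in the list.
    $pick ||= $funcs[0];
    return @$pick;
}

my @funcs = (['alice', 'Role A'], ['master', 'Role B'], ['bob', 'Role C']);
my ($who, $function) = choose_function_sketch('bob', @funcs);
print "$who: $function\n";
```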
usage: $assignment = &FIG::auto_assign($peg,$seq)
This returns an automated assignment for $peg. $seq is optional; if it is not present, then it is assumed that similarities already exist for $peg. $assignment is set to either
Function
or
Function\tW
if it is felt that the assertion is pretty weak.
In the protein families we have our own concept of an id that I have called a cid. This is entirely internal and does not map to any known database except our own; however, it is used to store the correspondence between different protein families. Therefore, to find out what family any protein is in you need to convert that protein to a cid. You can start with a KEGG, COG, TIGR, SP, GI, or FIG id, and get a cid back. From there, you can find out what other proteins that cid maps to, and what families that protein is also in.
usage: @all = $fig->all_protein_families
Returns a list of the ids of all of the protein families currently defined.
my @families = $fig->families_for_protein($peg);
Return a list of all the families containing the specified protein.
ID of the PEG representing the protein in question.
Returns a list of the IDs of the families containing the protein.
my @proteins = $fig->proteins_in_family($family);
Return a list of every protein in a family.
ID of the relevant protein family.
Returns a list of all the proteins in the specified family.
my $func = $fig->family_function($family);
Returns the putative function of all of the pegs in a protein family. Remember, we are defining "protein family" as a set of homologous proteins that have the same function.
ID of the relevant protein family.
Returns the name of the function assigned to the members of the specified family.
my $n = $fig->sz_family($family);
Returns the number of proteins in a family.
ID of the relevant protein family.
Returns the number of proteins in the specified family.
usage: $n = $fig->ext_sz_family($family)
Returns the number of external IDs in $family.
usage: @all_cids=$fig->all_cids();
Returns a list of all the ids we know about.
usage: @pegs = $fig->ids_in_family($family)
Returns a list of the cids in $family.
usage: @families = $fig->in_family($cid)
Returns an array containing the families containing a cid.
usage: @exts = $fig->ext_ids_in_family($family)
Returns a list of the external ids in an external family name.
usage: @ext_families = $fig->ext_in_family($id)
Returns an array containing the external families containing an id. The ID is the one from the original database (e.g. pfam|PB129746)
use: my @families = $fig->families_by_source('fig');
This uses SQL to look up all the families that have a partial match to the argument supplied. It should be quicker than getting all families and parsing out the ones you want since it is done at the db level.
use: my $number=$fig->number_of_cids
The number_of_ methods here all use SQL queries to count how many of each thing there are. This method just returns the number of cids
use: my $number=$fig->number_of_families("fig");
This uses an SQL count method to count the number of families that match the given source. This should be a lot quicker than retrieving all families and then looping through them.
use: my $number=$fig->number_of_proteins_in_families("fig", "distinct");
This uses an SQL count to count the number of proteins in families that match a given source. If distinct is true each protein will only be counted once, else the total number will be returned.
Convert a protein to a global ID
my $cid=$fig->prot_to_cid($proteinid)
$proteinid can be a FIG ID, a SP, tigr, or one of many other IDs
returns "" if not known
Convert an internal ID to the proteins that map to that ID.
my @proteins=$fig->cid_to_prots($cid);
Get a list of families that have a partial match to a provided function.
E.g. my @families=$fig->family_by_function("histidine")
will return histidine kinase, histidine phosphatase, etc etc etc
my @compounds = $fig->all_compounds();
Return a list containing all of the KEGG compounds.
my @names = $fig->names_of_compound($cid);
Returns a list containing all of the names assigned to the specified KEGG compound. The list will be ordered as given by KEGG.
ID of the desired compound.
Returns a list of names for the specified compound.
usage: @ids = $fig->ids_of_compound
Returns a list containing all of the ids assigned to the KEGG compounds. The list will be ordered as given by KEGG.
usage: @ids = $fig->ids_of_compound_like_name($name)
Returns a list containing all of the ids assigned to the KEGG compounds that match $name. The list will be ordered as given by KEGG.
my @rids = $fig->comp2react($cid);
Returns a list containing all of the reaction IDs for reactions that take $cid as either a substrate or a product.
my $flag = $fig->valid_reaction_id($rid);
Returns true iff the specified ID is a valid reaction ID.
This will become important as we include non-KEGG reactions
Reaction ID to test.
Returns TRUE if the reaction ID is in the data store, else FALSE.
my $cas = $fig->cas($cid);
Return the Chemical Abstract Service (CAS) ID for the compound, if known.
ID of the compound whose CAS ID is desired.
Returns the CAS ID of the specified compound, or an empty string if the CAS ID is not known or does not exist.
my $cid = $fig->cas_to_cid($cas);
Return the compound id (cid), given the Chemical Abstract Service (CAS) ID.
CAS ID of the desired compound.
Returns the ID of the compound corresponding to the specified CAS ID, or an empty string if the CAS ID is not in the data store.
my @rids = $fig->all_reactions();
Return a list containing all of the KEGG reaction IDs.
my $flag = $fig->reversible($rid);
Return TRUE if the specified reaction is reversible. A reversible reaction has no main
direction. The connector is symbolized by <=> instead of =>.
ID of the relevant reaction.
Returns TRUE if the specified reaction is reversible, else FALSE. If the reaction does not exist, returns TRUE.
my $rev = $fig->reaction_direction($rid);
Returns an array of triplets mapping from reactions in the context of maps to reversibility.
ID of the relevant reaction.
Return B if the reaction proceeds in both directions, L if it proceeds from right
to left, or R if it proceeds from left to right (by convention the "substrates"
are on the left and the "products" are on the right).
my @tuples = $fig->reaction2comp($rid, $which, $paths);
Return the substrates or products for a reaction. In any event (i.e., whether you ask for substrates or products), you get back a list of 3-tuples. Each 3-tuple will contain
[$cid,$stoich,$main]
Stoichiometry indicates how many copies of the compound participate in the reaction. It is normally numeric, but can be things like "n" or "(n+1)". $main is 1 iff the compound is considered "main" or "connectable".
ID of the reaction whose compounds are desired.
TRUE if the products (right side) should be returned, FALSE if the substrates (left side) should be returned.
Optional list of paths to check whether compound is "main"
Returns a list of 3-tuples. Each tuple contains the ID of a compound, its stoichiometry, and a flag that is TRUE if the compound is one of the main participants in the reaction. If paths are specified, the flag indicates whether the compound is main in any of the specified paths.
my @ecs = $fig->catalyzed_by($rid);
Return the ECs (roles) that are reputed to catalyze the reaction. Note that we are currently just returning the ECs that KEGG gives. We need to handle the incompletely specified forms (e.g., 1.1.1.-), but we do not do it yet.
ID of the reaction whose catalyzing roles are desired.
Returns the IDs of the roles that catalyze the reaction.
my @ecs = $fig->catalyzes($role);
Returns the reaction IDs of the reactions catalyzed by the specified role (normally an EC).
ID of the role whose reactions are desired.
Returns a list containing the IDs of the reactions catalyzed by the role.
my $displayString = $fig->displayable_reaction($rid)
Returns a string giving the displayable version of a reaction.
my @maps = $fig->all_maps();
Return all of the KEGG maps in the data store.
my @maps = $fig->ec_to_maps($ec);
Return the set of maps that contain a specific functional role. The role can be specified by an EC number or a full-blown role ID.
The EC number or role ID of the role whose maps are desired.
Returns a list of the IDs for the maps that contain the specified role.
This is an alternate name for ec_to_maps.
my @ecs = $fig->map_to_ecs($map);
Return the set of functional roles (usually ECs) that are contained in the functionality depicted by a map.
ID of the KEGG map whose roles are desired.
Returns a list of EC numbers for the roles in the specified map.
my $name = $fig->map_name($map);
Return the descriptive name covering the functionality depicted by the specified map.
ID of the map whose description is desired.
Returns the descriptive name of the map, or an empty string if no description is available.
usage: @roles = $fig->neighborhood_of_role($role)
Returns a list of functional roles that we consider to be "the neighborhood" of $role.
my @roles = $fig->roles_of_function($func);
Returns a list of the functional roles implemented by the specified function. This method parses the role data out of the function name, and does not require access to the database.
Name of the function whose roles are to be parsed out.
Returns a list of the roles performed by the specified function.
my $roles = $fig->protein_subsystem_to_roles($peg, $subsystem);
Return the roles played by a particular PEG in a particular subsystem. If the protein is not part of the subsystem, an empty list will be returned.
ID of the protein whose role is desired.
Name of the relevant subsystem.
Returns a reference to a list of the roles performed by the specified PEG in the specified subsystem.
$fig->is_BRC_genome($genome)
returns true if $genome is a BRC genome
$fig->is_NMPDR_genome($genome)
returns true if $genome is an NMPDR genome
my @pegs = $fig->seqs_with_role($role,$who);
Return a list of the pegs that implement $role. If $who is not given, it defaults to "master". The system returns all pegs with an assignment made by either "master" or $who (if it is different than the master) that implement $role. Note that this includes pegs for which the "master" annotation disagrees with that of $who, the master's implements $role, and $who's does not.
usage: $result = $fig->seqs_with_roles_in_genomes($genomes,$roles,$made_by)
This routine takes a pointer to a list of genomes ($genomes) and a pointer to a list of roles ($roles) and looks up all of the sequences that connect to those roles according to either the master assignments or those made by $made_by. Again, you will get assignments for which the "master" assignment connects, but the $made_by does not.
A hash is returned. The keys to the hash are genome IDs for which at least one sequence was found. $result->{$genome} will itself be a hash, assuming that at least one sequence was found for $genome. $result->{$genome}->{$role} will be set to a pointer to a list of 2-tuples. Each 2-tuple will contain [$peg,$function], where $function is the one for $made_by (which may not be the one that connected).
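To make the nesting concrete, the following sketch walks a structure of the shape described above. The sample data is invented purely for illustration, and flatten_role_hits_sketch is a hypothetical helper rather than part of FIG.

```perl
use strict;
use warnings;

# Invented sample shaped like the documented result:
# $result->{$genome}->{$role} = [ [$peg, $function], ... ]
my $result = {
    '83333.1' => {
        'Histidine kinase' => [
            ['fig|83333.1.peg.10', 'Histidine kinase'],
            ['fig|83333.1.peg.42', 'hypothetical protein'],
        ],
    },
};

# Hypothetical helper: flatten the nested hash into [genome, role, peg, function] rows.
sub flatten_role_hits_sketch {
    my ($hits) = @_;
    my @rows;
    for my $genome (sort keys %$hits) {
        for my $role (sort keys %{ $hits->{$genome} }) {
            for my $hit (@{ $hits->{$genome}{$role} }) {
                my ($peg, $function) = @$hit;
                push @rows, [$genome, $role, $peg, $function];
            }
        }
    }
    return @rows;
}

print join("\t", @$_), "\n" for flatten_role_hits_sketch($result);
```

The second sample row shows the case the text warns about: the peg is listed under a role even though the function reported for it does not mention that role.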
usage: @clusters = $fig->largest_clusters($roles,$user)
This routine can be used to find the largest clusters containing some of the designated set of roles. A list of clusters is returned. Each cluster is a pointer to a list of pegs.
usage: @candidates = $fig->best_bbh_candidates($genome,$cutoff,$requested,$known)
This routine returns a list of up to $requested candidates from $genome. A candidate is a BBH against one of the PEGs in genomes from the list given by @$known. Each entry in the list is a 3-tuple:
[CandidatePEG,KnownBBH,Pscore]
usage: @candidates = $fig->best_bbh_candidates_additional($genome,$cutoff,$requested,$known)
This routine returns a list of up to $requested candidates from $genome. A candidate is a BBH against one of the PEGs in genomes from the list given by @$known. The method collects additional information from the similarities and is used in the subsystem extension. Each entry in the list is a 10-tuple:
[CandidatePEG,KnownBBH,Pscore,fraction, b1, e1, b2, e2, ln1, ln2]
usage: $seq = &FIG::extract_seq($contigs,$loc)
This is just a little utility routine that I have found convenient. It assumes that $contigs is a hash that contains IDs as keys and sequences as values. $loc must be of the form
Contig_Beg_End
where Contig is the ID of one of the sequences; Beg and End give the coordinates of the sought subsequence. If Beg > End, it is assumed that you want the reverse complement of the subsequence. This routine plucks out the subsequence for you.
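A minimal sketch of this kind of extraction is shown below. extract_seq_sketch is a hypothetical stand-in, not the FIG routine, and it assumes 1-based inclusive coordinates and plain ACGT sequence, neither of which the text spells out.

```perl
use strict;
use warnings;

# Hypothetical stand-in for extract_seq.
# $contigs maps contig ID => sequence; $loc has the form Contig_Beg_End.
sub extract_seq_sketch {
    my ($contigs, $loc) = @_;
    # Greedy match so that contig IDs containing underscores still parse.
    my ($contig, $beg, $end) = $loc =~ /^(.+)_(\d+)_(\d+)$/
        or return;
    my $dna = $contigs->{$contig};
    return unless defined $dna;
    my ($lo, $hi) = $beg <= $end ? ($beg, $end) : ($end, $beg);
    return if $lo < 1 || $hi > length($dna);
    # Assumed convention: coordinates are 1-based and inclusive.
    my $seq = substr($dna, $lo - 1, $hi - $lo + 1);
    if ($beg > $end) {
        # Beg > End asks for the reverse complement.
        $seq = reverse $seq;
        $seq =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $seq;
}

my %contigs = (c1 => 'AACCGGTT');
print extract_seq_sketch(\%contigs, 'c1_1_4'), "\n";
print extract_seq_sketch(\%contigs, 'c1_4_1'), "\n";
```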
my @contig_ids = $fig->contigs_of($genome);
Returns a list of all of the contigs occurring in the designated genome.
ID of the genome whose contigs are desired.
Returns a list of the IDs for the contigs occurring in the specified genome.
usage: $n=$fig->number_of_contigs($genome)
This uses the SQL count function to count the number of contigs. It should be a lot faster than pulling all the contigs and counting them.
In fact, it causes about a 10-fold increase in speed! Compare fig n_contigs and fig number_of_contigs
usage: @contig_ids = $fig->all_contigs($genome)
Returns a list of all of the contigs occurring in the designated genome.
usage: $n = $fig->contig_ln($genome,$contig)
Returns the length of $contig from $genome.
my $seq = $fig->get_dna_seq($fid);
Returns the DNA sequence for an FID
FIG identifier of the feature whose sequence is desired
DNA sequence
usage: $seq = $fig->dna_seq($genome,@locations)
Returns the concatenated subsequences described by the list of locations. Each location must be of the form
Contig_Beg_End
where Contig must be the ID of a contig for genome $genome. If Beg > End the location describes a stretch of the complementary strand.
usage: $taxonomy = $fig->taxonomy_of($genome_id)
Returns the taxonomy of the specified genome. Gives the taxonomy down to genus and species.
usage: $taxonomyID = $fig->get_taxonomy_id_of($genome_id)
Returns the taxonomy ID of the specified genome. If no taxonomy ID is found the genome id without ".\d+" suffix will be returned.
usage: $taxonomyID = $fig->set_taxonomy_id_for($genome_id)
Sets the taxonomy id for genome.
usage: $taxonomy = $fig->taxonomy_list()
Returns the taxonomy list of all organisms in a hash ref. Gives the taxonomy down to genus and species.
usage: $fig->is_bacterial($genome)
Returns true iff the genome is bacterial.
usage: $fig->is_archaeal($genome)
Returns true iff the genome is archaeal.
usage: $fig->is_prokaryotic($genome)
Returns true iff the genome is prokaryotic
usage: $fig->is_eukaryotic($genome)
Returns true iff the genome is eukaryotic
usage: $fig->is_viral($genome)
Returns true iff the genome is viral
usage: $fig->is_plasmid($genome)
Returns true iff the genome is marked as being a plasmid
usage: $fig->is_environmental($genome)
Returns true if the genome is from an environmental sample
usage: @genomes = $fig->sort_genomes_by_taxonomy(@list_of_genomes)
This routine is used to sort a list of genome IDs to put them into taxonomic order.
usage: $dist = $fig->crude_estimate_of_distance($genome1,$genome2)
There are a number of places where we need estimates of the distance between two genomes. This routine will return a value between 0 and 1, where a value of 0 means "the genomes are essentially identical" and a value of 1 means "the genomes are in different major groupings" (the groupings are archaea, bacteria, euks, and viruses). The measure is extremely crude.
usage: @sorted_by_taxonomy = $fig->sort_fids_by_taxonomy(@list_of_fids)
Sorts a list of feature IDs based on the taxonomies of the genomes that contain the features.
my $ssHash = $fig->active_subsystems($genome, $allFlag);
Get all the subsystems in which a genome is present. The return value is a hash which maps each subsystem name to the code for the variant used by the specified genome.
ID of the genome whose subsystems are desired.
If TRUE, all subsystems are returned, with unknown variants marked by a variant
code of -1 and iffy variants marked by a code of 0. If FALSE or omitted,
only subsystems in which the variant is definitively known are returned. The
default is FALSE.
This states whether a subsystem is experimental, which is the opposite of usable.
This states whether a subsystem is private, meaning that it cannot be exported. This is just the opposite of exchangeable.
Gets and sets whether the subsystem should be published with the NMPDR. Specifically writes a file called NMPDR in the subsystem directory.
Use:
$fig->nmpdr_subsystem($ssa, 1); # to set it as an nmpdr subsystem
$fig->nmpdr_subsystem($ssa, -1); # to set it as NOT an nmpdr subsystem
$fig->nmpdr_subsystem($ssa); # to test whether it is an nmpdr subsystem
Gets and sets whether the subsystem is freely distributable and should be included in new releases.
Use:
$fig->distributable_subsystem($ssa, 1); # to set it as a distributable subsystem
$fig->distributable_subsystem($ssa, -1); # to set it as NOT a distributable subsystem
$fig->distributable_subsystem($ssa); # to test whether it is a distributable subsystem
my @names = $fig->all_subsystems();
Return a list of all of the subsystems in the data store.
my @names = $fig->all_usable_subsystems();
Return a list of all of the subsystems in the data store that are "usable", that is, not experimental or deleted.
Use the subsystem information cache if valid.
Run indexing on one or more subsystems. If no subsystems are defined we will reindex the whole thing. Otherwise we will only index the defined subsystem. Note that this method just launches index_subsystems as a background job. Returns the job of the child process.
$pid=$fig->index_subsystems("Alkanesulfonates Utilization"); # do only Alkanesulfonates Utilization
$pid=$fig->index_subsystems(@ss); # do subsystems in @ss
$pid=$fig->index_subsystems(); # do all subsystems
my $glist = [['273035.1', '273035.4']];
my $pmap = { 'fig|273035.1.peg.1' => 'fig|273035.4.peg.4', ... };
$fig->perform_subsystem_salvage($glist, $pmap);
For each subsystem in this SEED, perform a subsystem salvage operation for each old-genome / new-genome pair in $glist. This operation will determine if the old genome exists in the subsystem. If it does, the new genome is added to the subsystem, and we attempt to map the pegs from the cells in the old subsystem's row to the new subsystem. If all pegs map, we copy the variant code for the genome. If all cells did not map, we prepend a * to the variant code before copying.
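The per-row rule at the end of that description (copy the variant code if every peg maps, otherwise prepend a *) can be sketched as below. salvage_row_sketch and its arguments are hypothetical and ignore how a real subsystem row is organised into cells.

```perl
use strict;
use warnings;

# Hypothetical helper: map one old-genome row onto the new genome.
# $old_pegs is a list of peg IDs from the old row, $variant is the old variant
# code, and $pmap maps old peg IDs to new peg IDs.
sub salvage_row_sketch {
    my ($old_pegs, $variant, $pmap) = @_;
    my @new_pegs;
    my $unmapped = 0;
    for my $peg (@$old_pegs) {
        if (exists $pmap->{$peg}) {
            push @new_pegs, $pmap->{$peg};
        }
        else {
            $unmapped++;
        }
    }
    # Flag incomplete mappings by prepending * to the variant code.
    my $new_variant = $unmapped ? "*$variant" : $variant;
    return (\@new_pegs, $new_variant);
}

my $pmap = { 'fig|273035.1.peg.1' => 'fig|273035.4.peg.4' };
my ($pegs, $variant) =
    salvage_row_sketch(['fig|273035.1.peg.1', 'fig|273035.1.peg.2'], '1', $pmap);
print "variant $variant with ", scalar(@$pegs), " mapped peg(s)\n";
```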
Hmmm...
my $version=subsystem_version($subsystem_name)
Returns the current version of the subsystem.
Get or set the classification of the subsystem. Added by RAE in response to the changes made on seed wiki. If a reference to an array is supplied it is saved as the new classification of the subsystem. Regardless, the current classification is returned as a reference to an array. There is no control over what the things are. Returns a reference to an empty array if a valid subsystem is not supplied, or if no classification is known.
The classification is stored as a \t separated list of things in $subsys/CLASSIFICATION. There is no control over what the things are.
my @classifications = $fig->all_subsystem_classifications();
Return a list of all the subsystem classifications. Each element in the list will contain a main subsystem class and a basic subsystem class. The resulting list enables us to determine easily what the three-level subsystem tree would look like.
usage: $curator = $fig->subsystem_curator($subsystem_name)
Return the curator of a subsystem.
Returns the number of diagrams of the passed subsystem.
usage: @subsystems = $fig->subsystems_for_genome($genome, $all)
Return the list of subsystems in which the genome has been entered.
@subsystems is a list of subsystem names.
It will only return those genomes with a variant code other than 0 or -1, unless the $all argument is "true" (in which case all subsystems are returned).
If $all is 2 then it will return all subsystems with a variant code other than -1.
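The three filtering modes can be sketched as a small predicate. keep_variant_sketch is hypothetical, and checking for the value 2 before the general "true" case is an assumption made so that both rules above can hold at once.

```perl
use strict;
use warnings;

# Hypothetical predicate: should a subsystem with this variant code be returned?
sub keep_variant_sketch {
    my ($variant, $all) = @_;
    $all = 0 unless defined $all;
    # $all == 2: everything except variant code -1.
    return ($variant ne '-1' ? 1 : 0) if $all eq '2';
    # Any other true value: everything.
    return 1 if $all;
    # Default: only variant codes other than 0 and -1.
    return ($variant ne '0' && $variant ne '-1') ? 1 : 0;
}

# Invented sample: subsystem name => variant code.
my %variant_of = ('Subsystem A' => '1', 'Subsystem B' => '0', 'Subsystem C' => '-1');
for my $all (0, 1, 2) {
    my @kept = grep { keep_variant_sketch($variant_of{$_}, $all) } sort keys %variant_of;
    print "all=$all: @kept\n";
}
```

Variant codes are compared as strings here because they are not always plain numbers.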
usage: $genomes = $fig->subsystem_genomes($subsystem_name, $all)
Return the list of genomes in the subsystem.
$genomes is a list of tuples (genome_id, name)
Unless $all is set to true, it will only return those genomes with a variant code other than 0 or -1.
my $genomeList = $fig->readSpreadsheetForGenomes($fileName, $all);
Read the genomes from a specific subsystem file. This allows the client to get the genome data for a backup subsystem.
Name of the subsystem spreadsheet file.
If TRUE, all genomes will be read. Otherwise, only those genomes with a specific variant code (i.e. not 0 or -1) will be returned.
Returns a reference to a list of 2-tuples, each consisting of a genome ID and the genome's name.
my $subsysObject = $fig->get_subsystem($name, $force_load);
Return a subsystem object for manipulation of the named subsystem. If the subsystem does not exist, an undefined value will be returned.
Name of the desired subsystem.
TRUE to reload the subsystem from the data store even if it is already cached in memory, else FALSE.
Returns a blessed object that allows access to subsystem data, or an undefined value if the subsystem does not exist.
$fig->clear_subsystem_cache();
Delete all subsystems from the subsystem cache. This is not normally needed, because the cache is kept fairly small. However, in cases where all of the subsystems are needed, the cache grows by more than a gigabyte, and because the subsystems point back to the FIG object, the memory is not cleaned up properly. Calling this method before you release the FIG object removes that problem.
my @roles = $fig->subsystem_to_roles($subsysID);
Return a list of the roles for the specified subsystem.
Name (ID) of the subsystem whose roles are to be listed.
Returns a list of role IDs.
Install the given local subsystem directory on the SEED at the URL provided. If authentication is required, the given username and password will be used.
Uses an HTTP POST of the tarfile of the contents of the local directory to the install_subsystem_dir.cgi CGI script.
Return the list of subsystems and roles that this peg appears in. Returns an array. Each item in the array is a reference to a tuple of subsystem and role. If the last argument ($noaux) is "true", only roles playing non-auxiliary roles will be returned.
Return the list of subsystems that this peg appears in. Returns an array. Each item in the array is a reference to a tuple of subsystem, role, variant and is_auxiliary.
Return the list of subsystems, roles and variants that the pegs appear in. Returns a hash keyed by peg. Each item in the hash is a reference to a tuple of subsystem, role and variant. If the last argument ($include_aux) is "true", also roles playing auxiliary roles will be returned.
Return the list of subsystems and roles that this peg appears in. Returns an array. Each item in the array is a reference to a tuple of subsystem and role. If the last argument ($noaux) is "true", only roles playing non-auxiliary roles will be returned.
Return the list of subsystems and roles for every peg in subsystems. Returns an array. Each item in the array is a reference to a three-ple of subsystem, role, and peg.
Return a list of subsystems, roles, and proteins containing a given role
Returns an array. Each item in the array is a reference to a three-ple of subsystem, role, and peg.
Return a list of subsystems, roles, and proteins containing an EC number.
Returns an array. Each item in the array is a reference to a three-ple of subsystem, role, and peg.
Return list of [peg, function, ss, role in ss].
Return all pegs with non-hypothetical assignments that are not in ss.
Return list of [peg, function, ss, role in ss] for every non-hypo protein regardless of being in ss
Return a list of all roles present in locally-installed subsystems. The return is a hash keyed on role name with each value a list of subsystem names.
my $num_subsystems = $fig->get_genome_subsystem_count($genomeID);
Return the number of subsystems of the genome identified by $genomeID.
ID of the genome whose number of subsystems is to be returned.
Returns the number of subsystems.
my @pegData = $fig->get_all_subsystem_pegs($genomeID);
Return the subsystems, roles, and variant codes for all features in the specified genome. Unlike get_genome_subsystem_data, this method returns all pegs, regardless of the variant code.
ID of the relevant genome.
Returns a hash that maps each subsystem ID to a list of 3-tuples, each consisting of a role ID, a peg ID, and a variant code.
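The documented return shape (subsystem ID mapped to a list of role/peg/variant 3-tuples) can be walked as in the sketch below. The data is hand-built for illustration and does not come from a SEED instance; the peg IDs, the second subsystem name, and the helper subsystems_by_peg are made up for this example.

```perl
use strict;
use warnings;

# Hand-built stand-in for the documented return shape:
# subsystem ID => list of [ role ID, peg ID, variant code ].
my %pegData = (
    'Histidine Degradation' => [
        [ 'Histidine ammonia-lyase (EC 4.3.1.3)', 'fig|83333.1.peg.100', '1' ],
        [ 'Urocanate hydratase (EC 4.2.1.49)',    'fig|83333.1.peg.101', '1' ],
    ],
    'Made-up Subsystem' => [
        [ 'Made-up role', 'fig|83333.1.peg.100', '-1' ],
    ],
);

# Hypothetical helper: invert the structure into
# peg ID => list of subsystems the peg appears in (in sorted order).
sub subsystems_by_peg {
    my ($pegData) = @_;
    my %byPeg;
    for my $subsystem ( sort keys %$pegData ) {
        for my $tuple ( @{ $pegData->{$subsystem} } ) {
            my ( $role, $peg, $variant ) = @$tuple;
            push @{ $byPeg{$peg} }, $subsystem;
        }
    }
    return \%byPeg;
}

my $byPeg = subsystems_by_peg( \%pegData );
for my $peg ( sort keys %$byPeg ) {
    print "$peg\t", join( ', ', @{ $byPeg->{$peg} } ), "\n";
}
```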
my $roleList = $fig->get_genome_subsystem_data($genomeID);
Return the roles and pegs for a genome's participation in subsystems. The subsystem name, role ID, and feature ID will be returned for each of the genome's subsystem-related PEGs.
ID of the genome whose PEG breakdown is desired.
Returns a pointer to a list of 3-tuples. Each tuple consists of a subsystem name, a role ID, and a feature ID.
my ($gname,$szdna,$pegs,$rnas,$taxonomy) = $fig->get_genome_stats($genomeID);
Return basic statistics about a genome.
ID of the relevant genome.
Returns a 5-tuple containing the genome name, number of base pairs, number of PEG features, number of RNA features, and the taxonomy string.
my $pegList = $fig->get_genome_assignment_data($genomeID);
Return the functional assignments and pegs for a genome. The feature ID and assigned function will be returned for each of the genome's PEGs.
ID of the genome whose PEG breakdown is desired.
Returns a list of 2-tuples. Each tuple consists of a peg ID and its master functional assignment.
If the given cache file (name is relative to the FIG cache directory) exists and is less than a day old, open it and return a filehandle. (The one-day limit is hard-coded; parameterize this sometime!)
$rc = $fig->add_dlit(
-status => 'D', # required
-peg => $peg, # or -md5 => $md5, # one is required
-pubmed => $pubmed, # required
-curator => 'RossO', # required
-go => '', # default = ''
-override => 1); # default = 0
This adds a dlit tuple. The currently supported arguments are:
-status => ' ' for not curated
           'D' for dlit (direct literature on role)
           'G' for genome data (propagates to all ' ' entries for this article)
           'N' for not relevant
           'R' for relevant, but not dlit
-md5 => supply an md5 hash code for the peg, not the id.
-peg => the peg being connected to literature. This peg will
        be treated as a representative of the set that have the
        same protein sequence.
-pubmed => pubmed ID (all numeric, but stored as string)
-curator => curator making the assertion (30 char max)
-go => an optional list of 3-character codes separated by commas
-override => 0 -> if there is an existing curated tuple, ignore this request
                  (an existing tuple with status ' ' is still replaced)
             1 -> if there is an existing tuple, replace it
The returned value will be
0 -> the tuple was not inserted
1 -> the tuple was inserted
=cut
sub add_dlit {
my( $self, @parms ) = @_;
if (! $self->table_exists('dlits')) { system "load_dlits"; }
my %parms = @parms; # Previous code clobbered the defaults
$parms{-go} ||= ''; # Moved default here
$parms{-override} ||= 0; # Moved default here
# Check for required parameters
return 0 if ! $parms{-status};
return 0 if ! ( $parms{-peg} || $parms{-md5} );
return 0 if ! $parms{-pubmed};
return 0 if ! $parms{-curator};
my $status = $parms{-status};
my $peg = $parms{-peg};
my $md5 = $peg ? $self->md5_of_peg($peg) : lc $parms{-md5};
my $pubmed = $parms{-pubmed};
my $curator = $parms{-curator};
$curator =~ s/^master://i; # Strip master from the recorded curator
my $go = $parms{-go};
my $override = $parms{-override}; # Moved here to collect initializations
my $rdbH = $self->db_handle;
my $db_resp =
$rdbH->SQL( "SELECT status, md5_hash, pubmed, curator, go_code
FROM dlits
WHERE ((md5_hash = '$md5') and (pubmed = '$pubmed'))"
);
my $delete;
if (@$db_resp == 1)
{
# Default is no clobber except uncurated (i.e., $status eq ' ') -- GJO
if ( ( $db_resp->[0]->[0] ne ' ' ) && ( ! $override ) ) { return 0 }
$rdbH->SQL( "DELETE
FROM dlits
WHERE ((md5_hash = '$md5') and (pubmed = '$pubmed'))"
);
$delete = join( "\t", 'delete', @{$db_resp->[0]} ) . "\n";
}
my $rc = $rdbH->SQL( "INSERT
INTO dlits ( status,md5_hash,pubmed,curator,go_code )
VALUES ( '$status','$md5','$pubmed','$curator','$go' )"
);
# Add logging
if ( $rc )
{
&verify_dir( "$FIG_Config::data/Dlits" );
if ( open LOG, ">>$FIG_Config::data/Dlits/dlits.log" )
{
print LOG $delete if $delete;
print LOG join( "\t", 'insert', $status, $md5, $pubmed, $curator, $go ), "\n";
close LOG;
}
}
if ( $rc && ( $status eq "G") )
{
# Only overwrite ' ' status with 'G' status -- GJO
# Update the curator, too -- GJO
$rc = $rdbH->SQL( "UPDATE dlits
SET status = 'G', curator = '$curator'
WHERE ( pubmed = '$pubmed' ) AND ( status = ' ' )"
);
}
return $rc;
}
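The replacement rule used by add_dlit (an existing tuple blocks the insert unless its status is ' ' or -override is set) can be isolated as a small predicate. This is only a sketch: may_replace is a hypothetical helper that is not part of FIG.pm, and it takes the existing status directly instead of querying the dlits table.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the check in add_dlit.
# $existing_status is undef when no tuple is stored for the
# (md5_hash, pubmed) pair, otherwise the stored status code.
sub may_replace {
    my ( $existing_status, $override ) = @_;
    return 1 if ! defined $existing_status;    # nothing stored yet
    return 1 if $existing_status eq ' ';       # uncurated entries never block
    return $override ? 1 : 0;                  # curated entries need -override
}

printf "%-22s %s\n", 'no tuple',             may_replace( undef, 0 );
printf "%-22s %s\n", 'uncurated',            may_replace( ' ',   0 );
printf "%-22s %s\n", 'curated, no override', may_replace( 'D',   0 );
printf "%-22s %s\n", 'curated, override',    may_replace( 'D',   1 );
```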
$rc = $fig->dlit_status(
-md5 => $md5, # or -peg, one is required
-peg => $peg, # or -md5, one is required
-pubmed => $pubmed, # required
);
This returns the current status code of a dlit, or undef if there is none. The currently supported arguments are:
-md5 => supply an md5 hash code for the peg, not the id.
-peg => the peg being connected to literature. This peg will
        be treated as a representative of the set that have the
        same protein sequence.
-pubmed => pubmed ID (all numeric, but stored as string)
The returned value will be
$status                              when called in scalar context
( $status_code, $curator, $go_code ) when called in list context
=cut
sub dlit_status {
my( $self, @parms ) = @_;
if (! $self->table_exists('dlits')) { system "load_dlits"; }
my %parms = @parms; # Previous code clobbered the defaults
# Check for required parameters
if ( ! ( $parms{-peg} || $parms{-md5} ) || ! $parms{-pubmed} )
{
return wantarray ? () : undef;
}
my $peg = $parms{-peg};
my $md5 = $peg ? $self->md5_of_peg($peg) : lc $parms{-md5};
my $pubmed = $parms{-pubmed};
my $rdbH = $self->db_handle;
my $db_resp = $rdbH->SQL( "SELECT status, curator, go_code
FROM dlits
WHERE ((md5_hash = '$md5') and (pubmed = '$pubmed'))"
);
return $db_resp && @$db_resp ? ( wantarray ? @{$db_resp->[0]} : $db_resp->[0]->[0] )
: ( wantarray ? () : undef );
}
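The context-sensitive return in dlit_status relies on Perl's wantarray. The sketch below reproduces just that part with a hand-built row in place of a database response; status_like is a hypothetical name used only for this illustration.

```perl
use strict;
use warnings;

# $row stands in for one database row: [ status, curator, go_code ].
# In list context the whole row is returned, in scalar context only the
# status; a missing row gives the empty list or undef respectively.
sub status_like {
    my ($row) = @_;
    return $row ? ( wantarray ? @$row : $row->[0] )
                : ( wantarray ? ()    : undef );
}

my $row = [ 'D', 'RossO', '' ];

my $status = status_like($row);                     # scalar context
my ( $code, $curator, $go ) = status_like($row);    # list context
my @none = status_like(undef);                      # no row, list context

print "scalar: $status\n";
print "list:   $code, $curator, '$go'\n";
print "missing row gives ", scalar(@none), " values\n";
```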
$dlits = $fig->all_dlits();
Returns a reference to an array of all current dlit data.
The returned value is
[ [ status, md5_hash, pubmed, curator, go_code ], ... ]
=cut
sub all_dlits {
my($self) = @_;
my $rdbH = $self->db_handle;
my $db_resp = $rdbH->SQL( "SELECT * FROM dlits" );
return [ sort { $a->[1] cmp $b->[1] } # Sorted by protein
@$db_resp
];
}
$dlits = $fig->all_dlits_status( $status );
Returns a reference to an array of all current dlit data with the specified status code.
The returned value is
[ [ status, md5_hash, pubmed, curator, go_code ], ... ]
=cut
sub all_dlits_status {
my( $self, $status ) = @_;
my $rdbH = $self->db_handle;
my $db_resp = $rdbH->SQL( "SELECT * FROM dlits where status = '$status'" );
return [ sort { $a->[1] cmp $b->[1] } # Sorted by protein
@$db_resp
];
}
$rc = $fig->export_dlits();
$rc = $fig->export_dlits( $file );
Writes all current dlit data to $FIG_Config::data/Dlits/dlits, or to a specified file.
The returned value is 1 on success, or 0 on failure.
=cut
sub export_dlits {
my ( $self, $file ) = @_;
my $rdbH = $self->db_handle;
$file ||= "$FIG_Config::data/Dlits/dlits";
# Query before opening, so a failed query does not truncate an existing file
my $db_resp = $rdbH->SQL( "SELECT * FROM dlits" );
$db_resp || return 0;
open( my $dlits_fh, '>', $file ) || return 0;
foreach my $x ( @$db_resp ) { print $dlits_fh join( "\t", @$x ), "\n" }
close( $dlits_fh );
return 1;
}
$rc = $fig->add_title( $pubmed_id, $title )
Add a pubmed title to the database. If the pubmed_id is not already present, the id and title are added, and the return code reflects the success or failure of the add. If the pubmed_id is already defined and the titles match, there is no change, and the return code is 2. If the id exists and the title is different, no change is made, and the return code is 0. To change an existing title, use:
$rc = $fig->update_title( $pubmed_id, $title )
The returned values are:
0 attempting to change a title, or failure;
1 successful addition of a new title; or
2 existing and new titles are the same
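The three return codes of add_title can be illustrated with an in-memory hash standing in for the title table. This is a sketch of the documented behavior only; add_title_sketch and %titles are hypothetical, and the real method works against the database.

```perl
use strict;
use warnings;

my %titles;    # in-memory stand-in for the pubmed title table

# Returns 1 when a new title is added, 2 when the same title is already
# stored, and 0 when the call would change an existing title (or fails).
sub add_title_sketch {
    my ( $pubmed_id, $title ) = @_;
    return 0 unless defined $pubmed_id && defined $title;
    if ( exists $titles{$pubmed_id} ) {
        return $titles{$pubmed_id} eq $title ? 2 : 0;
    }
    $titles{$pubmed_id} = $title;
    return 1;
}

my $first  = add_title_sketch( '12345', 'A made-up title' );
my $repeat = add_title_sketch( '12345', 'A made-up title' );
my $change = add_title_sketch( '12345', 'A different title' );
print "first=$first repeat=$repeat change=$change\n";
```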
$rc = $fig->update_title( $pubmed_id, $title )
Add or change a pubmed title in the database. If the pubmed_id is not already present, the id and title are added. If the pubmed_id is already defined and the titles match, there is no change, and the return code is 2. If the id exists and the title is different, the stored title is replaced. The return code reflects the success or failure of the add or change.
The returned values are:
0 on failure;
1 successful addition or change of a title; or
2 existing and new titles are the same
$title = $fig->get_title( $pubmed_id )
Get a title for a literature id
Returned value:
$title upon success
undef upon failure
[ [ id, title ], ... ] = $fig->all_titles()
Get all pubmed_id, title pairs
Returned value:
[ [ id, title ], ... ] upon success
[] upon failure
$pegs - not used
$seq_of - hash from peg to peg sequence
$tran_peg - hash into which translated pegs are placed
$sought - hash keyed on the list of pegs we're looking for
Find a genome given the number of contigs, number of nucleotides, and checksum. We pass in a potential name for the genome as a quick starting check.
my @links = $fig->fid_links($fid);
Return a list of hyperlinks to web resources about a specified feature.
ID of the feature whose hyperlinks are desired.
Returns a list of raw HTML strings representing hyperlinks to web pages relating to the specified feature.
my @links = $fig->fids_with_link_to("text");
Return a list of tuples of [fid, link], where text is a free-text string that is matched against the URL. You can use this to get all the links that point to PIR, for example, to identify all proteins that are members of PIR superfamilies.
A free-text match to the URL. The match is made using the SQL "like" command, so try to be as specific as possible.
Returns a list of tuples of [fid, link].
Searches the database for objects that match the query string in some way.
Returns a list of results if the query is ambiguous or a unique identifier otherwise.
Some routines for dealing with peg search and similarities.
This is code lifted from pom.cgi and reformatted for more general use.
Find the given role in the given (via CGI params) organism.
We do this by finding a list of pegs that are annotated to have this role in other organisms that are "close enough" to our organism.
We then find pegs in this organism that are similar to these pegs.
my $flag = FIG::is_ec($role);
Return TRUE if the specified role is an EC number, else FALSE. This can be used to determine whether a role is specified via a role ID or the role's EC number.
Role ID or EC number to check.
Returns TRUE if the specified role specification is an EC number, and FALSE if it is a true role ID.
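A check of this kind can be written as a single pattern match. The pattern below is an assumption made for illustration (four dot-separated fields, the first numeric and the rest either numeric or a dash for an unspecified level); the test actually applied by FIG::is_ec may differ, and looks_like_ec is a hypothetical name.

```perl
use strict;
use warnings;

# Assumed EC number shape: N.N.N.N, where any field after the first
# may be "-" to mark an unspecified level (for example 1.1.1.-).
sub looks_like_ec {
    my ($role) = @_;
    return ( $role =~ /^\d+\.(?:\d+|-)\.(?:\d+|-)\.(?:\d+|-)$/ ) ? 1 : 0;
}

for my $role ( '4.3.1.3', '1.1.1.-', 'Histidine ammonia-lyase' ) {
    printf "%-25s %s\n", $role, looks_like_ec($role) ? 'EC number' : 'role ID';
}
```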
Background job support.
If one wants to turn a script into a background job, invoke $fig->run_in_background($coderef). This will cause $coderef to be invoked as a background job. Its output will be written to $FIG_Config::data/Global/background_jobs/<pid>, and the job shows up in, and can be killed from, the SEED control panel.
This section contains the functionality introduced by the interface with GenDB. The initial two functions simply register when GenDB has a version of the genome (so we can set links to it when displaying PEGs):
usage: has_genome("GenDB",$genome)
Invoking this routine just records that GenDB has a copy of the genome designated by $genome.
usage: dropped_genome("GenDB",$genome)
Invoking this routine just records that GenDB should no longer be viewed as having a copy of the genome designated by $genome.
usage: $url = link_to_system("GenDB",$fid) # usually $fid is a peg, but it can be other types of features, as well
This routine is used to get a URL that can be used to "flip" from one system to the other. If the feature is unknown to the system, undef should be returned.
The following routines support alteration of features
usage: $fig->delete_feature($user,$fid)
Invoking this routine deletes the feature designated by $fid.
my $fid = $fig->add_feature($user,$genome,$type,$location,$aliases,$translation,$fid);
Invoking this routine adds the feature, returning a new (generated) $fid. It is also possible to specify the feature ID, which is recommended if the feature is to be permanent. (In order to do this the ID needs to be allocated from the clearinghouse machine.) The translation is optional and only applies to PEGs.
ID of the genome to which the feature belongs.
Type of the feature (peg, rna, etc.)
Location of the feature, in the form of a comma-delimited list of location specifiers.
These are of the form contig_begin_end, where contig is the ID
of a contig, and begin and end are the starting and stopping offsets of the
location. These offsets are 1-based, and depending on the strand, the beginning
offset could be larger than the ending offset.
A comma-delimited list of alias names for the feature.
The protein translation of the feature, if it is a peg.
The ID to give to the new feature. If this parameter is omitted, an ID will be generated automatically.
Returns the new feature's ID if successful, or undef if an error occurred.
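The location format described above (comma-delimited contig_begin_end specifiers with 1-based offsets) can be taken apart as in this sketch. parse_location is a hypothetical helper, not a FIG.pm method, and the contig ID in the example is only illustrative. The strand is inferred from the offsets, on the assumption that a beginning offset larger than the ending offset means the reverse strand.

```perl
use strict;
use warnings;

# Split a location string into [ contig, begin, end, strand ] records.
# A contig ID may itself contain underscores, so the two trailing
# numeric fields are anchored at the end of each specifier.
sub parse_location {
    my ($location) = @_;
    my @parts;
    for my $spec ( split /,/, $location ) {
        my ( $contig, $begin, $end ) = $spec =~ /^(.+)_(\d+)_(\d+)$/
            or die "Bad location specifier: $spec\n";
        push @parts, [ $contig, $begin, $end, $begin <= $end ? '+' : '-' ];
    }
    return @parts;
}

for my $part ( parse_location('NC_000913_190_255,NC_000913_5683_5234') ) {
    print join( "\t", @$part ), "\n";
}
```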
my $val = $fig->clearinghouse_next_feature_id($genome, $type)
Return the next feature ID that would be allocated by the clearinghouse for the given genome and feature type.
my $tax = $fig->clearinghouse_register_metagenome_taxon_id($username, $genome_name)
Register a new taxon id for the MG-RAST metagenome server.
my $tax = $fig->clearinghouse_register_subsystem_id($ss_name);
Return a subsystem's short ID. Short IDs are maintained at a special clearinghouse web site. If the subsystem does not yet have a short ID, a new one will be assigned by the clearinghouse and returned.
Name of the subsystem whose ID is desired.
ID of the desired subsystem.
my $tax = $fig->clearinghouse_lookup_subsystem_by_id($ss_name)
Register a subsystem id for the given subsystem name. Returns the existing id if already present.
my $val = $fig->clearinghouse_register_features($genome, $type, $num)
Register $num new features of type $type on genome $genome. Returns the starting index for the new features.
usage: $fig->call_start($genome,$loc,$translation,$against)
This routine can be invoked to produce an estimate of the correct start, given a location in a genome believed to be a protein-encoding gene, along with a set of PEGs that are believed to be orthologs. If called in a list context, it returns a list containing
a string representing the estimated start location
a confidence measure (better than 0.2 seems to be pretty solid)
a new translation
If called in a scalar context, it returns its best prediction of the start.
usage: $fig->pick_gene_boundaries($genome,$loc,$translation)
This routine can be invoked to expand a region of similarity to potential gene boundaries. It does not try to find the best start, but only the one that is first after the beginning of the ORF. It returns a list containing the predicted location and the expanded translation. Thus, you might use
($new_loc,$new_tran) = $fig->pick_gene_boundaries($genome,$loc,$tran);
$recalled = $fig->call_start($genome,$new_loc,$new_tran,\@others);
to get the location of a recalled gene (in, for example, the process of correcting a frameshift).
usage: $fig->change_location_of_feature($fid,$location,$translation)
Invoking this routine changes the location of the feature. The $translation argument is optional (and applies only to PEGs).
The routine returns 1 on success and 0 on failure.
Render a genome's contig as GenoGraphics objects.
This section contains the methods used to read and write Markup data. The markup data associates labels with sections of a feature's translation.
In the SEED, Markup data is stored in a separate file for each marked feature
in the feature type subdirectory for an organism. So, for example, the
PEG markups for fig|83333.1.peg.4 would be in the file
FIG/Data/Organisms/83333.1/peg/markup4.tbl
The file is stored in tab-separated form. Each line contains the following fields
1-based offset into the translation of the first amino acid to mark
number of amino acids to mark
label identifying the type of markup
Reading and writing these tiny files is extremely fast, but they do have more overhead than would be expected if the data were stored in a single flat file managed by pointers from the FIG database. If that approach becomes desirable, then only this section of FIG.pm needs to be changed.
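The markup record layout (offset, length, and label, separated by tabs) is simple enough to handle with split. The sketch below parses records from in-memory lines instead of a markup file and applies one of them to a made-up translation; parse_markups and marked_residues are hypothetical helpers, not FIG.pm methods.

```perl
use strict;
use warnings;

# Parse markup lines (offset, length, label; tab-separated) into 3-tuples.
sub parse_markups {
    my (@lines) = @_;
    my @marks;
    for my $line (@lines) {
        chomp $line;
        next if $line eq '';
        my ( $offset, $length, $label ) = split /\t/, $line, 3;
        push @marks, [ $offset, $length, $label ];
    }
    return \@marks;
}

# Extract the marked residues. Offsets are 1-based, substr is 0-based.
sub marked_residues {
    my ( $translation, $mark ) = @_;
    my ( $offset, $length ) = @$mark;
    return substr( $translation, $offset - 1, $length );
}

my $marks = parse_markups( "3\t4\tmotif\n", "8\t2\tsite\n" );
my $translation = 'MKTAYIAKQR';    # made-up protein sequence
for my $mark (@$marks) {
    print "$mark->[2]: ", marked_residues( $translation, $mark ), "\n";
}
```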
my $marks = $fig->ReadMarkups($fid);
Read the markup data for the specified feature. The markings are returned as a list of triples. Each triple contains the start location of a markup, the length of the markup, and the label.
ID of the feature whose markups are to be read.
Returns a reference to a list of 3-tuples. Each list element will consist of the starting offset of the markup (1-based), the length of the markup, and the label. All values are expressed in terms of distance into the protein translation of the feature.
$fig->WriteMarkups($fid, \@marks);
Write out the markups for the specified feature. If the markup file for the specified feature does not exist, it will be created. If it does exist, it will be completely overwritten.
ID of the feature whose markups are to be written
Reference to a list of markups. Each markup is in the form of a 3-tuple consisting of the 1-based offset to the start of the markup, the length of the markup, and the markup label. The offset and length are specified in terms of the protein translation string.
my $name = FIG::_MarkupFileName($fid);
Return the name of the file containing the markup data for the specified feature.
ID of the feature whose markup file is desired.
Returns the full path of the file containing the feature markups for the feature desired.
This section contains the methods used to implement UserData access. User data is
stored in a subdirectory given by the user's name under the Users directory
in the Global directory tree. In other words, the data for the default user
basic would be at $FIG_Config::global/Users/basic.
In each directory, the capabilities.tbl file contains the capability data and
the preferences.tbl file contains the preferences. Currently, preferences are
stored in a single file, but if performance becomes a problem we may split them
by category.
Each of these files has two columns of data-- a key and a value. In the preferences
file the key is a hierarchical construct with the pieces separated by colons, and
the value is essentially a free-format string understood only by the application. In
the capabilities file the key is a group name, and the value is an access level--
RW (full access), RO (read-only access), or NO (no access).
Group names and key names are not allowed to contain white space. Tabs are used to separate them from the value strings or access levels. The value strings for preferences cannot contain tabs or new-lines. A backslash escape mechanism will be used to allow tabs and new-lines to be specified in the preference values.
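A backslash escape of this kind might look like the sketch below. The exact escape sequences are an assumption (\t for tab, \n for new-line, and \\ for a literal backslash), and escape_value and unescape_value are hypothetical names; the mechanism actually used for the preference files may differ.

```perl
use strict;
use warnings;

# Assumed scheme: backslash becomes \\, tab becomes \t, new-line becomes \n.
# Backslashes are doubled first so they cannot be confused with escapes.
sub escape_value {
    my ($value) = @_;
    $value =~ s/\\/\\\\/g;
    $value =~ s/\t/\\t/g;
    $value =~ s/\n/\\n/g;
    return $value;
}

# Undo the escaping in one left-to-right pass, so that an escaped
# backslash followed by "t" or "n" is not mistaken for a tab or new-line.
my %UNESCAPE = ( 't' => "\t", 'n' => "\n", '\\' => '\\' );
sub unescape_value {
    my ($value) = @_;
    $value =~ s/\\([tn\\])/$UNESCAPE{$1}/g;
    return $value;
}

my $raw     = "first line\nsecond\tline with a \\n that is not a new-line";
my $escaped = escape_value($raw);
print "$escaped\n";
print unescape_value($escaped) eq $raw ? "round trip ok\n" : "round trip FAILED\n";
```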
The files are sorted by key, to make updates easier.
The special Security_Default subdirectory is used to track the default security
options for each secure object. The object's security group and default level
are specified in a file whose name is formed by appending the object ID to the
object type with an extension of "tbl". So, for example, the file containing the
security default information for Genome 83333.1 would be
$FIG_Config::global/Users/Security_Default/Genome_83333.1.tbl
Each of these is a tiny file with the group name and default access level for that organism or subsystem. The two fields of the file are tab-separated, and any new-line character at the end is ignored.
my ($group, $level) = $fig->GetDefault($objectID, $objectType);
Return the group name and default access level for the specified object.
ID of the object whose capabilities data is desired.
Type of the object whose capabilities data is desired. This should be expressed
as a Sprout entity name. Currently, the only types supported are Genome
and Subsystem.
Returns a two-element list. The first element is the name of the group
to which the object belongs; the second is the default access level
(RW, RO, or NO). If the object is not found, an empty list
should be returned.
my $preferences = $fig->GetPreferences($userID, $category);
Return a map of preference keys to values for the specified user in the specified category.
ID of the user whose preferences are desired.
Name of the category whose preferences are desired. If omitted, all preferences should be returned.
Returns a reference to a hash mapping each preference key to a value. The keys are fully-qualified; in other words, the category name is included. It is acceptable for the hash to contain key-value pairs outside the category. In other words, if it's easier for you to read the entire preference set into memory, you can return that one set every time this method is called without worrying about the extra keys.
my $level = $fig->GetCapabilities($userID);
Return a map of group names to access levels (RW, RO, or NO) for the
specified user.
ID of the user whose access level is desired.
Returns a reference to a hash mapping group names to the user's access level for that group.
my $flag = $fig->AllowsUpdates();
Return TRUE if this access object supports updates, else FALSE. If the access object does not support updates, none of the SetXXXX methods will be called.
$fig->SetDefault($objectID, $objectType, $group, $level);
Set the group and default access level for the specified object.
ID of the object whose access level and group are to be set.
Type of the relevant object. This should be expressed as a Sprout entity name.
Currently, only Genome and Subsystem are supported.
Name of the group to which the object will belong. A user's access level for this group will override the default access level.
Default access level. This is the access level used for users who do not have an explicit capability specified for the object's group.
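The interaction between a user's capabilities and an object's default level can be sketched as a simple lookup. effective_level is a hypothetical helper and the group names are made up; the sketch only assumes the rule stated above, that an explicit level for the object's group overrides the default.

```perl
use strict;
use warnings;

# $capabilities: group name => access level (RW, RO, or NO) for one user.
# An explicit entry for the object's group wins; otherwise the object's
# default level applies.
sub effective_level {
    my ( $capabilities, $group, $default ) = @_;
    my $explicit = $capabilities->{$group};
    return defined $explicit ? $explicit : $default;
}

my %caps = ( annotators => 'RW' );    # made-up group name

my $explicit_level = effective_level( \%caps, 'annotators', 'RO' );
my $fallback_level = effective_level( \%caps, 'guests',     'NO' );
print "annotators: $explicit_level\n";
print "guests:     $fallback_level\n";
```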
$fig->SetCapabilities($userID, \%groupLevelMap);
Set the access levels for the specified user for the specified groups.
ID of the user whose capabilities are to be updated.
Reference to a hash that maps group names to access levels. The legal
access levels are RW (read-write), RO (read-only), and NO (no
access). An undefined value for the access level indicates the default
level should be used for that group. The map will not replace all of
the user's capability data; instead, it overrides existing data, with
the undefined values indicating the specified group should be deleted
from the list.
$fig->SetPreferences($userID, \%preferenceMap);
Set the preferences for the specified user.
ID of the user whose preferences are to be updated.
Reference to a hash that maps each preference key to its value. The keys should be fully-qualified (that is, they should include the category name). A preference key mapped to an undefined value will use the default preference value for that key. The map will not replace all of the user's preference data; instead, it overrides existing data, with the undefined values indicating the specified preference should be deleted from the list.
$fig->CleanupUserData();
Release any data being held in memory for use by the UserData object.
my $fileName = FIG::_GetObjectCapabilityFile($objectType, $objectID);
This is an internal method that computes the name of the file containing the default group and access data for a specified object. It returns the file name.
my ($ev_code_list, $subsys_list, $english_string) = $fig->to_structured_english($fig, $peg, $escape_flag);
Create a structured English description of the evidence codes for a PEG, in either HTML or text format. In addition to the structured text, we also return the subsystems and evidence codes for the PEG in list form.
ID of the protein or feature whose evidence is desired.
TRUE if the output text should be HTML, else FALSE
Returns a three-element list. The first element is a reference to a list of evidence codes, the second is a list of the subsystems containing the peg, and the third is the readable text description of the evidence.
my $directoryName = FIG::_GetUserDataDirectory($userName);
Return the name of the directory containing the user's preference and capability
data. If the user does not have a directory, return undef.
Name of the user whose directory is desired.
Returns the name of the user's preference/capability directory. If the user does
not exist, the directory will be created automatically. If this policy is changed,
return undef to indicate an invalid user name.
my %userData = FIG::_GetUserDataFile($userID, $type, $prefix);
Create a hash from the user data file of the specified type. The user data file contains two tab-delimited fields. The first field will be read in as the key of the hash and the second as the data value. The file must be sorted, and only records beginning with the character string in $prefix will be put in the hash.
Name of the user whose preference or capability data is desired.
Type of file desired: preferences or capabilities.
Returns a hash containing all the key/value pairs in the user file of the specified type. If the file is not found, will return an empty hash.
FIG::_ProcessUpdates($fileName, \%map);
Apply the specified updates to a key-value file. The records in the key-value file must be sorted. If a key in the map matches a key in the file, the file's key value is replaced. If a key in the map is not found in the file, it is added. If a key in the map is found in the file and it has an undefined value in the map, then the key is deleted.
Name of the file to be updated.
Reference to a hash mapping keys to values. The keys may not contain any whitespace. The value will be escaped before it is written.
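The update rules (replace a matching key, add a missing key, delete a key whose new value is undefined) are sketched below on an in-memory list of records. Note that this is a simplification: it loads everything into a hash and re-sorts, whereas the description above implies processing a sorted file. apply_updates is a hypothetical helper, and value escaping is left out.

```perl
use strict;
use warnings;

# $records: reference to a list of [ key, value ] pairs.
# $map:     key => new value, where undef marks the key for deletion.
# Returns a new list of [ key, value ] pairs sorted by key.
sub apply_updates {
    my ( $records, $map ) = @_;
    my %merged = map { ( $_->[0] => $_->[1] ) } @$records;
    for my $key ( keys %$map ) {
        if ( defined $map->{$key} ) {
            $merged{$key} = $map->{$key};    # replace or add
        }
        else {
            delete $merged{$key};            # undef means delete
        }
    }
    return [ map { [ $_, $merged{$_} ] } sort keys %merged ];
}

my $updated = apply_updates(
    [ [ 'a', 1 ], [ 'b', 2 ], [ 'c', 3 ] ],
    { b => undef, c => 9, d => 4 },
);
print join( "\t", @$_ ), "\n" for @$updated;
```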
my ($key, $value) = FIG::_GetInputKVRecord($handle);
Read a key/value pair from the specified input file. If we are at end-of-file
the key returned will be the Tracer::EOF constant. The key and value are
separated by a tab. The value will be unescaped if it exists.
Open handle for the input file.
Returns a two-element list. The first element will be the first field of the
input record; the second element will be the second field. If we are at
end-of-file, the first element will be the Tracer::EOF constant.
FIG::_PutOutputKVRecord($handle, $key, $value);
Write a key-value pair to the output file. The value will automatically be escaped. A tab will be used to separate the fields.
Open output file handle.
First field to put in the output record.
Value field to put in the output record. It will automatically be escaped. If it is undefined, the method will have no effect. An undefined value therefore serves as a deleted-line marker.
FIG->scenario_directory($organism);
Returns the scenario directory of an organism. If the organism is 'All', returns the directory containing all possible paths through scenarios.
The seed-taxonomy id of the organism, e.g. 83333.1, or 'All'.
FIG->scenario_directory(@subsystem_names)
Returns a reference to a hash containing the scenario information for the specified subsystems. The hash keys are subsystem names, the hash values are hashes keyed by subsystem name and with yet more hashes as values. The keys to these hashes are the strings "input_compounds", "output_compound", "map_ids", "additional_reactions" and "ignore reaction", values are references to lists of KEGG ids. If a subsystem has no scenarios, no hash entry is created for that subsystem.
A list of subsystem names.
Initialize a DAS data query object.