How to Find a Specific Gene or Protein to Study

Searching with words that describe or label the sequence

Simple keyword searching

The initial search option, presented in the banner of every page as a text box with a "Go" button, is a keyword search against the text of the data records. It therefore suffers from the limitations common to all keyword searches, such as misspellings and synonyms. Most genes and gene products can be described by several text strings. In this example, we will try to find an enzyme in the folate biosynthesis pathway that has several common names but one specific EC number. The gene that encodes the target enzyme has been named by several groups working on different organisms.

Use your favorite strategy to compose a keyword search in the search box in the NMPDR banner (open it in a new window to keep this tutorial in view). Some of the terms listed below return no hits, while others return hundreds; neither outcome is useful. A new search form is presented at the bottom of the search results table so that you may revise your search. As with all keyword searches, only an appropriate subset of the terms listed below will return the record of interest. Please note that gene names are not curated at this time, so a search for lacZ, for example, may not return all instances of beta-galactosidase. (Use the back button on your browser to resume this tutorial.)

Keywords

  • 7,8-dihydro-6-hydroxymethylpterin-pyrophosphokinase
  • hydroxymethylpterin pyrophosphokinase
  • HPPK
  • pyrophosphokinase
  • sulD
  • folK
  • folate biosynthesis
  • EC 2.7.6.3


Quick search

Keywords can include gene IDs (gi|16802272), gene names (folK), EC numbers (2.7.6.3), genus (Vibrio), species (vulnificus), words contained in subsystem names (synthesis), functional assignments (pyrophosphokinase), and subsystem classes (cofactors). You may also use attributes like iedb, virulence, and essential. A list of protein encoding genes that match all of the keywords will be returned.

To search for genes matching only some of the keywords, surround the optional words with parentheses. For example, the query 2.7.6.3 4.1.2.25 would match only bifunctional genes associated with both EC numbers 2.7.6.3 and 4.1.2.25, while (2.7.6.3) (4.1.2.25) would match the bifunctional genes as well as all single-function genes with either of those EC numbers. Put a minus sign in front of a keyword to exclude genes that match it. For example, pyrophosphokinase -2-amino-4-hydroxy-6-hydroxymethyldihydropteridine would match all pyrophosphokinases acting on substrates other than 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine.
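The matching rules described above can be sketched in a few lines of Python. This is only an illustration and not the NMPDR implementation: the function names are hypothetical, matching is done by plain case-insensitive substring search, and the behavior when required and optional terms are mixed in one query (here, at least one optional term must match) is an assumption. The annotation strings are illustrative examples, not records copied from the database.

```python
def parse_query(query):
    """Split a query into required, optional (parenthesized) and excluded (-prefixed) terms."""
    required, optional, excluded = [], [], []
    for token in query.split():
        if token.startswith("(") and token.endswith(")"):
            optional.append(token[1:-1].lower())
        elif token.startswith("-") and len(token) > 1:
            # Only a leading minus excludes; hyphens inside a chemical name are kept.
            excluded.append(token[1:].lower())
        else:
            required.append(token.lower())
    return required, optional, excluded


def matches(record_text, query):
    """Return True if the record text satisfies the query (assumed semantics, see lead-in)."""
    text = record_text.lower()
    required, optional, excluded = parse_query(query)
    if any(term in text for term in excluded):
        return False
    if not all(term in text for term in required):
        return False
    if optional and not any(term in text for term in optional):
        return False
    return True


# Illustrative annotation strings (assumed, not taken from NMPDR records).
HPPK_ONLY = "2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase (EC 2.7.6.3)"
BIFUNCTIONAL = "Dihydroneopterin aldolase (EC 4.1.2.25) / " + HPPK_ONLY
OTHER_KINASE = "Ribose-phosphate pyrophosphokinase (EC 2.7.6.1)"

for record in (HPPK_ONLY, BIFUNCTIONAL, OTHER_KINASE):
    print(matches(record, "(2.7.6.3) (4.1.2.25)"), record)
```

A substring match is deliberately naive: a real search engine would tokenize the record text, so that, for instance, 2.7.6.3 does not also match a longer EC number that merely begins with the same digits.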

Restricting keyword search to selected organisms or subsystems

There are several ways to limit the scope of a keyword search to organisms of interest to you. First, you may simply include the organism genus and/or species and/or strain name among the keywords entered in the simple keyword search box. Try, for example, searching for EC 2.7.6.3 listeria. Please note that if you enter only the full name and strain of a sequenced organism without any additional search words, the search will return its Organism Overview page (try, for example, Listeria monocytogenes EGD-e).

Second, if you start on one of the NMPDR organism summary pages, simple keyword searches are automatically limited to that group of organisms. Try, for example, searching for EC 2.7.6.3 from the Campylobacter page, which is directly accessible from the home page or through the table of NMPDR organisms on the NMPDR Organisms page.

Third, from the menu of supporting organisms on the NMPDR Organisms page, you may select any single organism and go to its overview page, which links to a Browser that includes a searchable and sortable table of all features in the genome. The overview page also provides direct links to tables of features that have or have not been included in subsystems by NMPDR curators.

Advanced Search

Finally, the Advanced Search form has a menu of genomes for limiting your keyword search and a menu of subsystems that may be used to restrict your keyword search.

In the form, genomes are grouped with the NMPDR focus organisms listed first, followed by the Archaea (blue), Bacteria (pink), and Eukarya (yellow). Within groups, genomes are alphabetized. Select a single genome directly by clicking on its name in the list box. To select multiple genomes, hold down the CTRL key while clicking. To select a range of genomes, hold down the SHIFT key while clicking. Selected genomes appear in the box below the buttons as they are selected.

It is also possible to select all genomes whose name includes text you type into the form. For example, if you type pneumoniae into the box and click the "Select genomes containing" button, all genomes that contain "pneumoniae" in the name will be selected, including species of Streptococcus and Chlamydophila, as well as Mycoplasma hyopneumoniae. You can also type an NCBI taxonomy ID into the box: 171101 will select Streptococcus pneumoniae R6.

Three buttons control bulk selection: Select All selects all the genomes, Clear All de-selects all the genomes, and Select NMPDR selects all the NMPDR focus genomes.

Searching the sequence data directly

BLAST -- Sequence alignment searching

The BLAST family of tools uses local sequence alignments to search for matching sequences in the database. BLAST uses a DNA or amino acid sequence as the query term instead of one or more keywords.

Suppose you did not know the EC number for our example enzyme, HPPK, and a search with your first choice of common name returned no usable results. However, you have the amino acid sequence of the E. coli version:

>E.coli K12 HPPK 
MTVAYIAIGSNLASPLEQVNAALKALGDIPESHILTVSSFYRTPPLGPQDQPDYLNAAVA 
LETSLAPEELLNHTQRIELQQGRVRKAERWGPRTLDLDIMLFGNEVINTERLTVPHYDMK 
NRGFMLWPLFEIAPELVFPDGEMLRQILHTRAFDKLNKW

Copy the sequence above and paste it into the sequence box on the Sequence Search page. Since this is an amino acid sequence, set the tool to blastp. From the scrolling menu, choose any organism of interest to BLAST against. Multiple genomes may be selected by holding down the CTRL or SHIFT key as you click. Buttons are also provided for selecting all NMPDR focus genomes or all of the supporting genomes. Click the button labeled "BLAST." The table of BLAST results is ranked by score, with the most significant hits at the top. The top entry in the table is most likely to be the target protein.

You may also use a nucleotide sequence to find your gene of interest:

>E.coli K12 HPPK gene
atgacagtggcgtatattgccataggcagcaatctggcctctccgctggagcaggtcaat
gctgccctgaaagcattaggcgatatccctgaaagccacattcttaccgtttcttcgttt
taccgcaccccaccgctggggccgcaagatcaacccgattacttaaacgcagccgtggcg
ctggaaacctctcttgcacctgaagagctactcaatcacacacagcgtattgaattgcag
caaggtcgcgtccgcaaagctgaacgctggggaccacgcacgctggatctcgacatcatg
ctgtttggtaatgaagtgataaatactgaacgcctgaccgttccgcactacgatatgaag
aatcgtggatttatgctgtggccgctgtttgaaatcgcgccggagttggtgtttcctgat
ggggagatgttgcgtcaaatcttacatacaagagcatttgacaaattaaacaaatggtaa

If you are interested in finding many orthologs of the query sequence, select the blastx tool, which translates the nucleotide sequence and compares the result to protein sequences in the database to find matching genes.
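To make the translation step concrete, here is a minimal Python sketch that translates a nucleotide sequence in its first reading frame using the standard genetic code. It is an illustration only: blastx itself translates the query in all six reading frames before comparing it to protein sequences, and the function name `translate` is a hypothetical helper, not part of any NMPDR tool.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(dna):
    """Translate frame 1 of a DNA sequence, stopping at the first stop codon."""
    dna = "".join(dna.split()).upper().replace("U", "T")
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")  # X = unrecognized codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 30 nucleotides of the E. coli K12 HPPK gene shown above.
print(translate("atgacagtggcgtatattgccataggcagc"))  # MTVAYIAIGS
```

The printed peptide is the start of the HPPK amino acid sequence given earlier, which is why a blastx search with the gene and a blastp search with the protein lead to the same database entries.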

If you want to find the data page for the exact sequence you entered, select the blastn tool, which matches the query (input) nucleotide sequence against nucleotide sequences in the database. Because the nucleotide alphabet has only four characters and the genetic code is degenerate, blastn finds shorter matching sequences than blastx will find with the same query.

Scan -- Sequence pattern, or motif, searching

Protein motifs

Another way to search for proteins or genes is to make use of known sequence patterns, or motifs, that are characteristic of a functional group of proteins. For example, a signature of HPPK enzymes has been defined by ProSite as this: [KRHD]-x-[GA]-[PSAE]-R-x(2)-D-[LIV]-D-[LIVM](2). In the text of a journal article, such a sequence is more commonly written as, for example: (KRHD)X(GA)(PSAE)RXXD(LIV)D(LIVM)(LIVM).

The abstract instruction conveyed by the pattern is, "One of either lysine or arginine or histidine or aspartate, followed by any single amino acid, followed by either glycine or alanine, then one of these four, then arginine, then any two amino acids, then aspartate, then one of these three, then aspartate, then one of these four, then one of the same four again." All of the following three examples of protScan patterns convey the same instruction:

any(KRHD) x any(GA) any(PSAE) RxxD any(LIV) D any(LIVM) any(LIVM)
any(KRHD) 1...1 any(GA) any(PSAE) R 2...2 D any(LIV) D any(LIVM) any(LIVM)
(K | (R | (H | D))) X (G | A) (P | (S | (A | E))) RXXD (L | (I | V)) D (L | (I | (V | M))) (L | (I | (V | M)))

The word "any" must be placed directly against the opening parenthesis, with no space between them, to indicate a choice among the residues in the set. The tool is not sensitive to the case of amino acid letters. A space should separate elements of the pattern. The letter "X" is the wild card and specifies any of the 20 amino acids. A run of arbitrary amino acids may also be indicated by the minimum and maximum number required, separated by three dots. For example, both "XX" and "2...2" mean any two amino acids, whereas "2...4" means any two, three, or four amino acids. The third way to indicate a choice is with nested parentheses and the symbol "|", commonly used as "or" in computer science. This symbol is neither a lower-case letter L nor an upper-case letter i. It is sometimes called a pipe, and is usually typed as "SHIFT \" on the keyboard.

Try copying any of the three patterns into the sequence box on the Sequence Search page. Since this is an amino acid pattern, select protScan from the tool menu. Use the genomes list to select organisms to search, then click the Scan button. Please note that a ProSite pattern must be translated into one of the three forms recognized by protScan, which does NOT recognize the ProSite syntax.
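The translation from ProSite syntax can be automated for simple patterns. The Python sketch below is an illustration under stated assumptions, not an NMPDR tool: the function names are hypothetical, and only the ProSite elements used in the HPPK signature are handled (single residues, bracketed sets, x, and numeric repeats). Exclusion sets in braces and the < and > anchors are rejected. One function emits the "any(...)" form described above; the other emits an ordinary Python regular expression so the pattern can be tried on a local sequence.

```python
import re

# One ProSite element: x, one residue, or a [set], with an optional (n) or (n,m) repeat.
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?$")


def parse_prosite(pattern):
    """Yield (core, low, high) for each dash-separated element of a simple pattern."""
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported ProSite element: {element!r}")
        core, low, high = match.groups()
        low = int(low) if low else 1
        high = int(high) if high else low
        yield core, low, high


def prosite_to_protscan(pattern):
    """Rewrite a simple ProSite pattern in the 'any(...)' and 'n...m' notation."""
    parts = []
    for core, low, high in parse_prosite(pattern):
        if core == "x":
            parts.append(f"{low}...{high}")
        elif low != high:
            raise ValueError("variable repeats are only handled for x in this sketch")
        else:
            token = f"any({core[1:-1]})" if core.startswith("[") else core
            parts.extend([token] * low)
    return " ".join(parts)


def prosite_to_regex(pattern):
    """Rewrite a simple ProSite pattern as a Python regular expression."""
    parts = []
    for core, low, high in parse_prosite(pattern):
        core = "." if core == "x" else core
        if (low, high) == (1, 1):
            parts.append(core)
        elif low == high:
            parts.append(f"{core}{{{low}}}")
        else:
            parts.append(f"{core}{{{low},{high}}}")
    return "".join(parts)


HPPK_SIGNATURE = "[KRHD]-x-[GA]-[PSAE]-R-x(2)-D-[LIV]-D-[LIVM](2)"
print(prosite_to_protscan(HPPK_SIGNATURE))

# A stretch of the E. coli HPPK protein sequence shown earlier in this tutorial.
fragment = "KAERWGPRTLDLDIMLFGN"
hit = re.search(prosite_to_regex(HPPK_SIGNATURE), fragment)
print(hit.group())  # RWGPRTLDLDIM
```

For the HPPK signature, the first function reproduces the second of the three protScan patterns listed above, with every wild-card run written in the "n...m" form.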

DNA patterns

Nucleic acid patterns may be used as input with the dnaScan tool. Pattern rules for spacing are similar to those for amino acid patterns. For a complete description of how to format complex patterns, such as hairpin loops, please see the article, Search Pattern. Limited options in degenerate positions are indicated using the IUB standard ambiguity code:

Code   Nucleotides
M      A or C (aMino)
R      A or G (puRine)
W      A or T (Weak, 2 H-bonds)
S      C or G (Strong, 3 H-bonds)
Y      C or T (pYrimidine)
K      G or T (Keto)
V      A or C or G (not T; V > T)
H      A or C or T (not G; H > G)
D      A or G or T (not C; D > C)
B      C or G or T (not A; B > A)
N      A or C or G or T
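The ambiguity codes in the table map directly onto character classes, which gives a quick way to check a degenerate pattern against a sequence on your own computer. The Python sketch below is an illustration only: it produces a standard regular expression, not dnaScan syntax, and the function name `iub_to_regex` and the example pattern are hypothetical.

```python
import re

# IUB ambiguity codes from the table above.
IUB_CODES = {
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def iub_to_regex(pattern):
    """Expand each ambiguity code into a regular-expression character class."""
    parts = []
    for base in pattern.upper():
        if base in "ACGT":
            parts.append(base)
        elif base in IUB_CODES:
            parts.append("[" + IUB_CODES[base] + "]")
        else:
            raise ValueError(f"not an IUB nucleotide code: {base!r}")
    return "".join(parts)


regex = iub_to_regex("GAYNRT")  # made-up pattern, for illustration
print(regex)  # GA[CT][ACGT][AG]T
print(bool(re.fullmatch(regex, "GACTAT")))  # True
```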

Results

Search results are presented on a page with two tables and a search form. The download options table at the top provides several different ways to save the search results to your local computer. The bottom table presents the features that match your search parameters. A form for running the same type of search with different terms or parameters is provided at the bottom of the page.

Download options

The search results table may be saved or downloaded to your local computer in several formats.
  • To save a URL that will let you repeat the same search in the future, e.g. after a data update, use the right mouse button or control-click to bookmark or copy the link called "Repeat" in the first row of the table.
  • Download the nucleotide sequences of the genes found by your search by clicking the download button in the corresponding row of the table to create a text file containing all DNA sequences in FASTA format. You may elect to append upstream and downstream sequences flanking each gene by typing a number in the box. Leave the box empty to save the coding sequences without flanking sequence.
  • Download the amino acid sequences of the proteins found by your search by clicking the download button in the corresponding row of the table to create a text file containing all protein sequences in FASTA format.
  • Download the search results table as a tab-delimited text file, which may be opened in Excel, by clicking the download button in the corresponding row of the table. The viewer buttons are not included in the saved table. To save the results table with links to viewer pages, perform your search in Internet Explorer, copy the table, then paste it into a new workbook in Excel.
  • Download the search results in XML format by clicking the download button in the corresponding row of the table.
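Once a FASTA file has been downloaded, it can be read with a few lines of code. The Python sketch below is a minimal reader for the standard FASTA layout (a header line beginning with ">" followed by one or more sequence lines). The function name `read_fasta` is hypothetical, and the exact header text that NMPDR writes into its download files is not assumed here; the example reuses the header and the first residues of the HPPK sequence from this tutorial.

```python
def read_fasta(lines):
    """Return a dict mapping each FASTA header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """>E.coli K12 HPPK
MTVAY
IAIGS
"""
sequences = read_fasta(example.splitlines())
for name, sequence in sequences.items():
    print(name, len(sequence))
```

To read a downloaded file instead of the inline example, pass an open file object, for example `read_fasta(open(path))` with `path` set to wherever you saved the download.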

Viewing options

In the top table on the results page, there is a button for viewing the search results in a sortable, expandable table. Click this button to view the resulting features in a table to which columns may be added using a pair of controlling lists at the top of the page. All possible parameters may be selected as columns in your table; however, not all features have data for every listed parameter. Further, some of the parameters describe organisms while others describe proteins. For a full explanation of this table, see interactive table.

The results table displays all features that match all the search terms and parameters. Different features are displayed in each row of the table. Columns of the table provide the database ID of the feature, name of the organism, functional annotation of the feature, and a link to the subsystem(s) for those features that have been included in a subsystem by a curator.

When results are found for more than one organism, the NMPDR focus organisms (if any) are listed before supporting organisms, in alphabetical order. Within the same organism, results are listed in order of their database ID numbers (fig| ...).

To have another ID number listed in the database ID column, use the form at the bottom of the page to re-run the same search, but use the drop-down list to select an Identifier Type, for example, Locus Tag.

To add a column containing all ID types, or aliases, use the form at the bottom of the page to re-run the same search, but use the check box to select "Show alias links." If there is an alias type you are particularly interested in and you aren't sure what it is called, e.g. those that start with "gi|", enter that into the box, and that type of ID will be first in the list of aliases if it is available.

Summary

The gene product that you want to study may be located in the NMPDR by searching for one or more text strings in a keyword search, or by searching directly for the protein or nucleic acid sequence using BLAST or Scan. The results of these searches are presented in a table with links to a new NMPDR Viewer environment, which provides tools for comparative analysis with other genomes. Sequence search results are presented with a link to a Context viewer, which localizes the pattern match within the chromosomal environment.

-- Leslie Mc Neil - 02 Dec 2008

Topic revision: r17 - 04 Mar 2009 - 18:39:38 - Leslie Mc Neil
 
Notice to NMPDR Users - The NMPDR BRC contract has ended and bacterial data from NMPDR has been transferred to PATRIC (http://www.patricbrc.org), a new consolidated BRC for all NIAID category A-C priority pathogenic bacteria. NMPDR was a collaboration among researchers from the Computation Institute of the University of Chicago, the Fellowship for Interpretation of Genomes (FIG), Argonne National Laboratory, and the National Center for Supercomputing Applications (NCSA) at the University of Illinois. NMPDR is funded by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract HHSN266200400042C. Banner images are copyright © Dennis Kunkel.