percolator
Usage:
crux percolator [options] <pin>
Description:
Percolator is a semi-supervised learning algorithm that dynamically learns to separate target from decoy peptide-spectrum matches (PSMs). The algorithm is described in this article:
Lukas Käll, Jesse Canterbury, Jason Weston, William Stafford Noble and Michael J. MacCoss. "Semi-supervised learning for peptide identification from shotgun proteomics datasets." Nature Methods. 4(11):923-925, 2007.
Percolator requires as input two collections of PSMs, one set derived from matching observed spectra against real ("target") peptides, and a second derived from matching the same spectra against "decoy" peptides. The output consists of ranked lists of PSMs, peptides and proteins. Peptides and proteins are assigned two types of statistical confidence estimates: q-values and posterior error probabilities.
The features used by Percolator to represent each PSM are summarized here.
Percolator also includes code from Fido, which performs protein-level inference. The Fido algorithm is described in this article:
Oliver Serang, Michael J. MacCoss and William Stafford Noble. "Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data." Journal of Proteome Research. 9(10):5346-5357, 2010.
Crux includes code from Percolator. Crux Percolator differs from the stand-alone version of Percolator in the following respects:
- In addition to the native Percolator XML file format, Crux Percolator supports additional input file formats (SQT, PepXML, tab-delimited text) and output file formats (PepXML, mzIdentML, tab-delimited text).
- To maintain consistency with the rest of the Crux commands, Crux Percolator uses different parameter syntax than the stand-alone version of Percolator.
- Like the rest of the Crux commands, Crux Percolator writes its files to an output directory, logs all standard error messages to a log file, and is capable of reading parameters from a parameter file.
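As a small illustration of the Crux-style parameter syntax, the sketch below assembles a command line of the form given under Usage. The helper name build_percolator_command, the file name search.target.txt and the directory name percolator-results are hypothetical; the flags themselves are taken from the Options section of this page. The sketch only builds the argument list and does not run Crux.

```python
import shlex

def build_percolator_command(pin, **options):
    """Assemble a `crux percolator [options] <pin>` argument list.

    Keyword names use underscores in place of hyphens (a convention of this
    sketch, not of Crux); booleans are rendered as T or F.
    """
    argv = ["crux", "percolator"]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            value = "T" if value else "F"
        argv += [flag, str(value)]
    argv.append(pin)  # the input file is the final positional argument
    return argv

argv = build_percolator_command(
    "search.target.txt",              # hypothetical input file
    output_dir="percolator-results",  # hypothetical output directory
    test_fdr=0.01,
    protein=True,
)
print(shlex.join(argv))
```

If Crux is installed and on the PATH, the resulting list could be passed to subprocess.run(argv, check=True).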
Input:
pin
– A collection of target and decoy peptide-spectrum matches (PSMs). Input may be in one of five formats: PIN, SQT, pepXML, Crux tab-delimited text, or a list of files (when list-of-files=T). Note that if the input is provided as SQT, pepXML, or Crux tab-delimited text, then a PIN file will be generated in the output directory prior to execution.
Decoy PSMs can be provided to Percolator in two ways: either as a separate file or embedded within the same file as the target PSMs. Percolator will first search for decoy PSMs in a separate file. The decoy file name is constructed from the target name by replacing "target" with "decoy". For example, if search.target.txt is provided as input, then Percolator will search for a corresponding file named search.decoy.txt. If no decoy file is found, then Percolator will assume that the given input file contains a mix of target and decoy PSMs. Within this file, decoys are identified using a prefix (specified via --decoy-prefix) on the protein name.
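The decoy lookup described above can be sketched roughly as follows. This is an illustration rather than Crux's actual code: the function names are hypothetical, and it assumes that only the first occurrence of "target" in the file name (not in the directory path) is replaced, which the text does not specify.

```python
from pathlib import Path
from typing import Optional

def decoy_path_for(target_path: str) -> Optional[Path]:
    """Derive the decoy file name by replacing "target" with "decoy".

    Assumption of this sketch: only the first occurrence in the file name
    is replaced, and directory components are left untouched.
    """
    path = Path(target_path)
    if "target" not in path.name:
        return None
    return path.with_name(path.name.replace("target", "decoy", 1))

def is_decoy_protein(protein_name: str, decoy_prefix: str = "decoy_") -> bool:
    """Flag a protein as a decoy when its name starts with the decoy prefix."""
    return protein_name.startswith(decoy_prefix)

def locate_decoys(target_path: str) -> str:
    """Report which of the two decoy sources would apply to this input."""
    candidate = decoy_path_for(target_path)
    if candidate is not None and candidate.exists():
        return "separate file: " + str(candidate)
    return "mixed file: decoys flagged by protein-name prefix"

print(decoy_path_for("search.target.txt"))
print(is_decoy_protein("decoy_P12345"))
```

The default prefix decoy_ matches the documented default of --decoy-prefix.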
Output:
The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:
percolator.target.proteins.txt
– a tab-delimited file containing the target protein matches. See here for a list of the fields.
percolator.decoy.proteins.txt
– a tab-delimited file containing the decoy protein matches. See here for a list of the fields.
percolator.target.peptides.txt
– a tab-delimited file containing the target peptide matches. See here for a list of the fields.
percolator.decoy.peptides.txt
– a tab-delimited file containing the decoy peptide matches. See here for a list of the fields.
percolator.target.psms.txt
– a tab-delimited file containing the target PSMs. See here for a list of the fields.
percolator.decoy.psms.txt
– a tab-delimited file containing the decoy PSMs. See here for a list of the fields.
percolator.params.txt
– a file containing the name and value of all parameters for the current operation. Not all parameters in the file may have been used in the operation. The resulting file can be used with the --parameter-file option for other crux programs.
percolator.pep.xml
– a file containing the PSMs in pepXML format. This file can be used as input to some of the tools in the Trans-Proteomic Pipeline.
percolator.mzid
– a file containing the protein, peptide, and spectrum matches in mzIdentML format.
percolator.log.txt
– a log file containing a copy of all messages that were printed to standard error.
Options:
- percolator options
--c-pos <float>
– Penalty for mistakes made on positive examples. If this value is not specified, then it is set via cross validation over the values {0.1, 1, 10}, selecting the value that yields the largest number of PSMs identified at the q-value threshold set via the --test-fdr parameter. Default = 0.01
--c-neg <float>
– Penalty for mistakes made on negative examples. This parameter requires that --c-pos be set explicitly; otherwise, --c-neg will have no effect. If not specified, then this value is set by cross validation over {0.1, 1, 10}. Default = 0
--train-fdr <float>
– False discovery rate threshold to define positive examples in training. Default = 0.01
--test-fdr <float>
– False discovery rate threshold used in selecting hyperparameters during internal cross-validation and for reporting the final results. Default = 0.01
--maxiter <integer>
– Maximum number of iterations for training. Default = 10
--quick-validation T|F
– Quicker execution by reduced internal cross-validation. Default = false
--default-direction <string>
– In its initial round of training, Percolator uses one feature to induce a ranking of PSMs. By default, Percolator will select the feature that produces the largest set of target PSMs at a specified FDR threshold (cf. --train-fdr). This option allows the user to specify which feature is used for the initial ranking, using the name as a string from this table. The name can be preceded by a hyphen (e.g. "-XCorr") to indicate that a lower value is better. Default = <empty>
--unitnorm T|F
– Use unit normalization (i.e., linearly rescale each PSM's feature vector to have a Euclidean length of 1), instead of standard deviation normalization. Default = false
--test-each-iteration T|F
– Measure performance on test set each iteration. Default = false
--override T|F
– By default, Percolator will examine the learned weights for each feature, and if the weight appears to be problematic, then Percolator will discard the learned weights and instead employ a previously trained, static score vector. This switch allows this error checking to be overridden. Default = false
--percolator-seed <string>
– When given an unsigned integer value, seeds the random number generator with that value. When given the string "time", seeds the random number generator with the system time. Default = 1
--klammer T|F
– Use retention time features calculated as in "Improving tandem mass spectrum identification using peptide retention time prediction across diverse chromatography conditions" by Klammer AA, Yi X, MacCoss MJ and Noble WS. (Analytical Chemistry. 2007 Aug 15;79(16):6111-8.). Default = false
--only-psms T|F
– Do not remove redundant peptides; keep all PSMs and exclude peptide-level probability. Default = false
--post-processing-tdc T|F
– Use target-decoy competition to assign q-values and PEPs. Default = false
--post-processing-qvality T|F
– Replace the target-decoy competition with the method qvality to assign q-values and PEPs. Note that this option only has an effect if the input PSMs are from separate target and decoy searches. Default = false
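For readers unfamiliar with target-decoy competition, the following is a generic textbook-style sketch of how q-values can be derived from competing target and decoy PSMs. It is not Percolator's implementation: the function name, the input layout, and the plain decoys/targets FDR estimate are assumptions of this example, ties are not treated specially, and PEP estimation is omitted.

```python
def tdc_qvalues(psms):
    """Estimate q-values by target-decoy competition.

    psms: iterable of (spectrum_id, score, is_decoy); higher score is better.
    Returns {spectrum_id: q_value} for spectra whose best match is a target.
    """
    # Competition: keep only the best-scoring match for each spectrum.
    best = {}
    for spectrum_id, score, is_decoy in psms:
        if spectrum_id not in best or score > best[spectrum_id][0]:
            best[spectrum_id] = (score, is_decoy)

    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)

    # FDR estimate at each rank: decoys seen so far / targets seen so far.
    fdrs = []
    targets = decoys = 0
    for _, (_, is_decoy) in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this rank or any lower-scoring rank.
    qvalues = {}
    running_min = float("inf")
    for (spectrum_id, (_, is_decoy)), fdr in zip(reversed(ranked), reversed(fdrs)):
        running_min = min(running_min, fdr)
        if not is_decoy:
            qvalues[spectrum_id] = min(running_min, 1.0)
    return qvalues

example = [
    ("s1", 5.0, False), ("s1", 2.0, True),
    ("s2", 4.0, False), ("s2", 1.0, True),
    ("s3", 1.5, False), ("s3", 3.0, True),   # the decoy wins this spectrum
    ("s4", 2.5, False), ("s4", 0.5, True),
]
print(tdc_qvalues(example))
```

In this toy input, spectrum s3 is won by its decoy match and therefore receives no target q-value.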
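The two normalization schemes named under --unitnorm can be contrasted with a short sketch. Unit normalization follows the definition given in the option text (each PSM's feature vector is rescaled to Euclidean length 1). For standard deviation normalization, this example assumes the usual meaning, namely that each feature is shifted to zero mean and scaled to unit standard deviation across all PSMs; the page does not spell this out, and the function names are hypothetical.

```python
import math

def unit_normalize(features):
    """Rescale one PSM's feature vector (a row) to Euclidean length 1."""
    length = math.sqrt(sum(x * x for x in features))
    if length == 0.0:
        return list(features)  # an all-zero vector cannot be rescaled
    return [x / length for x in features]

def stdev_normalize(rows):
    """Standardize each feature (a column) across all PSMs.

    Assumed meaning of "standard deviation normalization": subtract the
    per-feature mean and divide by the per-feature standard deviation.
    """
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stdevs = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

print(unit_normalize([3.0, 4.0]))
print(stdev_normalize([[1.0, 10.0], [3.0, 30.0]]))
```

The practical difference is the axis: unit normalization works on one PSM at a time, whereas standard deviation normalization needs statistics computed over the whole set of PSMs.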
- Fido options
--protein T|F
– Use the Fido algorithm to infer protein probabilities. Must be true to use any of the Fido options. Default = false
--fido-alpha <float>
– Specify the probability with which a present protein emits an associated peptide. Set by grid search (see --fido-gridsearch-depth parameter) if not specified. Default = 0
--fido-beta <float>
– Specify the probability of the creation of a peptide from noise. Set by grid search (see --fido-gridsearch-depth parameter) if not specified. Default = 0
--fido-gamma <float>
– Specify the prior probability that a protein is present in the sample. Set by grid search (see --fido-gridsearch-depth parameter) if not specified. Default = 0
--fido-protein-level-pi0 T|F
– Use pi_0 value when calculating empirical q-values. Default = false
--fido-empirical-protein-q T|F
– Estimate empirical p-values and q-values for proteins using target-decoy analysis. Default = false
--fido-gridsearch-depth <integer>
– Set depth of the grid search for alpha, beta and gamma estimation. The values considered, for each possible value of the --fido-gridsearch-depth parameter, are as follows:
- 0: alpha = {0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.5}; beta = {0.0, 0.01, 0.15, 0.025, 0.035, 0.05, 0.1}; gamma = {0.1, 0.25, 0.5, 0.75}.
- 1: alpha = {0.01, 0.04, 0.09, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.025, 0.035, 0.05}; gamma = {0.1, 0.25, 0.5}.
- 2: alpha = {0.01, 0.04, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.030, 0.05}; gamma = {0.1, 0.5}.
- 3: alpha = {0.01, 0.04, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.030, 0.05}; gamma = {0.5}.
Default = 0
--fido-gridsearch-mse-threshold <float>
– Q-value threshold that will be used in the computation of the MSE and ROC AUC score in the grid search. Default = 0.05
--fido-fast-gridsearch <float>
– Apply the specified threshold to PSM, peptide and protein probabilities to obtain a faster estimate of the alpha, beta and gamma parameters. Default = 0
--fido-protein-truncation-threshold <float>
– To speed up inference, proteins for which none of the associated peptides has a probability exceeding the specified threshold will be assigned probability = 0. Default = 0.01
--fido-split-large-components T|F
– Approximate the posterior distribution by allowing large graph components to be split into subgraphs. The splitting is done by duplicating peptides with low probabilities. Splitting continues until the number of possible configurations of each subgraph is below 2^18. Default = false
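To give a feel for how --fido-gridsearch-depth trades thoroughness for speed, the sketch below tallies the number of (alpha, beta, gamma) combinations at each depth. The value lists are copied verbatim from the option description above. That every combination is evaluated (a full Cartesian product) is an assumption of this example; the page does not describe how Fido traverses the grid.

```python
from itertools import product

# Candidate values per depth, copied verbatim from --fido-gridsearch-depth.
FIDO_GRIDS = {
    0: {"alpha": [0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.5],
        "beta": [0.0, 0.01, 0.15, 0.025, 0.035, 0.05, 0.1],
        "gamma": [0.1, 0.25, 0.5, 0.75]},
    1: {"alpha": [0.01, 0.04, 0.09, 0.16, 0.25, 0.36],
        "beta": [0.0, 0.01, 0.15, 0.025, 0.035, 0.05],
        "gamma": [0.1, 0.25, 0.5]},
    2: {"alpha": [0.01, 0.04, 0.16, 0.25, 0.36],
        "beta": [0.0, 0.01, 0.15, 0.030, 0.05],
        "gamma": [0.1, 0.5]},
    3: {"alpha": [0.01, 0.04, 0.16, 0.25, 0.36],
        "beta": [0.0, 0.01, 0.15, 0.030, 0.05],
        "gamma": [0.5]},
}

def grid_points(depth):
    """List every (alpha, beta, gamma) triple for the given depth."""
    grid = FIDO_GRIDS[depth]
    return list(product(grid["alpha"], grid["beta"], grid["gamma"]))

for depth in sorted(FIDO_GRIDS):
    print(depth, len(grid_points(depth)))
```

Under that assumption the grid shrinks from 196 combinations at depth 0 to 25 at depth 3.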
- Input and output
--fileroot <string>
– The fileroot string will be added as a prefix to all output file names. Default = <empty>
--output-dir <string>
– The name of the directory where output files will be created. Default = crux-output
--overwrite T|F
– Replace existing files if true or fail when trying to overwrite a file if false. Default = false
--txt-output T|F
– Output a tab-delimited results file to the output directory. Default = true
--pout-output T|F
– Output a Percolator pout.xml format results file to the output directory. Default = false
--mzid-output T|F
– Output an mzIdentML results file to the output directory. Default = false
--pepxml-output T|F
– Output a pepXML results file to the output directory. Default = false
--feature-file-out T|F
– Output the computed features in tab-delimited text format. Default = false
--list-of-files T|F
– Specify that the search results are provided as lists of files, rather than as individual files. Default = false
--parameter-file <string>
– A file containing parameters. See the parameter documentation page for details. Default = <empty>
--feature-file-in T|F
– When set to T, interpret the input file as a PIN file. Default = false
--decoy-xml-output T|F
– Include decoys (PSMs, peptides, and/or proteins) in the XML output. Default = false
--decoy-prefix <string>
– Specifies the prefix of the protein names that indicate a decoy. Default = decoy_
--output-weights T|F
– Output final weights to a file named "percolator.weights.txt". Default = false
--init-weights <string>
– Read initial weights from the given file (one per line). Default = <empty>
--verbosity <integer>
– Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30
--top-match <integer>
– Specify the number of matches to report for each spectrum. Default = 5
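The cumulative behaviour of --verbosity described above can be pictured with a small lookup. The level numbers and labels come from the option text; the function name is hypothetical, and treating in-between values (for example 25) as thresholds is an assumption of this sketch rather than documented behaviour.

```python
# Verbosity levels and message categories, as listed under --verbosity.
VERBOSITY_LEVELS = [
    (0, "fatal errors"),
    (10, "non-fatal errors"),
    (20, "warnings"),
    (30, "information on the progress of execution"),
    (40, "more progress information"),
    (50, "debug info"),
    (60, "detailed debug info"),
]

def messages_printed(verbosity):
    """Return the message categories shown at the given verbosity.

    Each level includes everything printed at lower levels.
    """
    return [label for level, label in VERBOSITY_LEVELS if level <= verbosity]

print(messages_printed(30))  # the documented default
```

At the default of 30, this yields the first four categories, from fatal errors through progress information.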