Running a Simple Search Using Tide and Percolator

Now that you have your environment set up and the two input files in your working directory, you can conduct the search. The search process compares each spectrum in demo.ms2 to peptides (subsequences of the proteins) derived from small-yeast.fasta and stored in a peptide index directory, yeast-index/. Peptides whose precursor mass is close to that of the observed spectrum are scored against that spectrum, and the top scores are reported in the output. To conduct the search, we first create the peptide index using tide-index and then execute the search using tide-search.

$ crux tide-index small-yeast.fasta yeast-index

While generating the peptide index, you will see output like this:
```
INFO: CPU: pyrrolysine.gs.washington.edu
INFO: Crux version: 2.1
INFO: Fri Feb  5 11:24:02 PST 2016
COMMAND LINE: ./crux tide-index small-yeast.fasta yeast-index
INFO: Running tide-index...
INFO: Writing results to output directory 'yeast-index'.
INFO: Reading small-yeast.fasta and computing unmodified peptides...
INFO: Writing decoy fasta...
INFO: Reading proteins
INFO: Precomputing theoretical spectra...
INFO: Elapsed time: 0.0973 s
INFO: Finished crux tide-index.
INFO: Return Code:0
```
This command produces the peptide index in yeast-index and also produces a directory crux-output containing the following files:
1. tide-index.decoy.fasta – a set of decoy proteins, derived from the proteins in the input set,
2. tide-index.params.txt – a record of all the parameters used to build the index, and
3. tide-index.log.txt – a log file containing a copy of all the messages printed to the screen during indexing.
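
For intuition about what the index contains: the peptides are obtained by digesting each protein in silico. The sketch below applies the conventional trypsin rule (cleave after K or R unless the next residue is P), with no missed cleavages and no length limits. It is only an illustration, with a hypothetical function name and a toy sequence; it is not the tide-index implementation and does not reflect its default settings.

```python
def tryptic_peptides(protein):
    """Split a protein sequence into fully tryptic peptides.

    Assumed rule: cleave after K or R, except when the next residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep whatever follows the last cleavage site (the protein C-terminus).
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


# Toy sequence, not a real protein record.
print(tryptic_peptides("RNFLETVELQVGLKN"))  # ['R', 'NFLETVELQVGLK', 'N']
```

A real digest would also apply minimum and maximum peptide lengths and mass limits, which is why very short fragments such as the single residues above would not normally appear in an index.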
With the index built, you can now run the search:

$ crux tide-search --compute-sp T demo.ms2 yeast-index

While the search is running, you will see output like this:

```
INFO: CPU: pyrrolysine.gs.washington.edu
INFO: Crux version: 2.1
INFO: Fri Feb  5 11:24:23 PST 2016
COMMAND LINE: ./crux tide-search --compute-sp T demo.ms2 yeast-index
INFO: Running tide-search...
INFO: Reading index yeast-index
INFO: Converting demo.ms2 to spectrumrecords format
INFO: Reading spectra file crux-output/demo.ms2.spectrumrecords.tmp
INFO: Sorting spectra
INFO: Running search
INFO: Time per spectrum-charge combination: 0.002318 s.
INFO: Average number of candidates per spectrum-charge combination: 15.204820
INFO: Elapsed time: 0.389 s
INFO: Finished crux tide-search.
INFO: Return Code:0
```

The crux-output directory now contains four new files containing the search results:

tide-search.target.txt – search results in tab-delimited format.
tide-search.decoy.txt – search results from a decoy database in tab-delimited format.
tide-search.params.txt – a record of all the parameters used in the search.
tide-search.log.txt – a log file containing a copy of all the messages printed to the screen during the search.

Note that the peptide-spectrum matches (PSMs) in tide-search.target.txt are sorted by the precursor m/z value associated with the spectrum. If you want to see which PSMs got the highest XCorr scores, sort the file in descending order of that column:

$ crux sort-by-column --column-type real --ascending F crux-output/tide-search.target.txt "xcorr score" > crux-output/tide-search.target.sort.txt

The first lines of the resulting sorted output file should look like this:

file	scan	charge	spectrum precursor m/z	spectrum neutral mass	peptide mass	delta_cn	sp score	sp rank	xcorr score	xcorr rank	b/y ions matched	b/y ions total	distinct matches/spectrum	sequence	cleavage type	protein id	flanking aa
demo.ms2	85	3	497.618	1489.83	1488.82	0.936932	2430.56	1	5.20757	1	27	48	3	NFLETVELQVGLK	trypsin-full-digest	YGL135W(27)	RN
demo.ms2	118	3	1031.94	3092.8	3093.41	0.949698	2383.16	1	4.76204	1	44	108	2	ELESAAYDHAEPVQPEDAPQDIANDELK	trypsin-full-digest	YGL009C(500)	KD
demo.ms2	156	3	1032.44	3094.3	3093.41	0.938358	1929.77	1	4.7476	1	43	108	2	ELESAAYDHAEPVQPEDAPQDIANDELK	trypsin-full-digest	YGL009C(500)	KD
demo.ms2	18	3	1032.4	3094.18	3093.41	0.902056	1732.23	1	4.48933	1	40	108	2	ELESAAYDHAEPVQPEDAPQDIANDELK	trypsin-full-digest	YGL009C(500)	KD
demo.ms2	11	2	745.269	1488.52	1488.82	0.822851	2816.6	1	4.44855	1	21	24	4	NFLETVELQVGLK	trypsin-full-digest	YGL135W(27)	RN
demo.ms2	53	2	745.749	1489.48	1488.82	0.935648	2797.78	1	4.39828	1	21	24	3	NFLETVELQVGLK	trypsin-full-digest	YGL135W(27)	RN
demo.ms2	38	3	1032.39	3094.15	3093.41	0.975741	1460.28	1	4.39627	1	37	108	2	ELESAAYDHAEPVQPEDAPQDIANDELK	trypsin-full-digest	YGL009C(500)	KD
demo.ms2	42	3	1032.26	3093.76	3093.41	1.0216	1509.47	1	4.35676	1	38	108	2	ELESAAYDHAEPVQPEDAPQDIANDELK	trypsin-full-digest	YGL009C(500)	KD
demo.ms2	111	3	1033.02	3096.04	3093.41	0.952727	1366.71	1	4.3498	1	36	108	2	ELESAAYDHAEPVQPEDAPQDIANDELK	trypsin-full-digest	YGL009C(500)	KD
demo.ms2	62	2	745.859	1489.7	1488.82	0.913711	2766.65	1	4.34057	1	21	24	3	NFLETVELQVGLK	trypsin-full-digest	YGL135W(27)	RN
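
The "spectrum neutral mass" and "peptide mass" columns reflect the precursor-mass comparison that decides which peptides are scored against a spectrum. As a minimal sketch (not Crux's code), the neutral mass can be recovered from the precursor m/z and charge by removing one proton mass per charge. The function names are hypothetical, and the 3.0 Da window is an illustrative assumption rather than a statement about the tolerance tide-search actually applies.

```python
PROTON_MASS = 1.00727646688  # mass of a proton, in daltons


def neutral_mass(precursor_mz, charge):
    """Convert an observed precursor m/z and charge state to a neutral mass."""
    return (precursor_mz - PROTON_MASS) * charge


def is_candidate(spectrum_mass, peptide_mass, window=3.0):
    """Return True if the peptide mass is within +/- window of the spectrum mass.

    The default window is an assumed value chosen for illustration only.
    """
    return abs(spectrum_mass - peptide_mass) <= window


# Scan 85 from the table: precursor m/z 497.618 at charge 3.
mass = neutral_mass(497.618, 3)
print(f"{mass:.2f}")                # 1489.83, matching the table
print(is_candidate(mass, 1488.82))  # True: NFLETVELQVGLK falls inside the window
print(is_candidate(mass, 3093.41))  # False: far too heavy for this spectrum
```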

The final step is to post-process the search results using Percolator. Each spectrum has been compared to many peptides and we would like to return only the best match for each spectrum. We also expect that some fraction of the spectra will not be identifiable as peptides (due to chemical noise, multiple peptides co-eluting, poor fragmentation, etc.). The analysis step filters out those spectra and ranks the matches by quality.


$ crux percolator crux-output/tide-search.target.txt

While the analysis is running, you will see output like this:

```
INFO: CPU: pyrrolysine.gs.washington.edu
INFO: Crux version: 2.1
INFO: Fri Feb  5 11:25:51 PST 2016
COMMAND LINE: ./crux percolator crux-output/tide-search.target.txt
INFO: Reading file crux-output/tide-search.target.txt
INFO: Running make-pin
INFO: Parsing crux-output/tide-search.target.txt
INFO: Parsing crux-output/tide-search.decoy.txt
INFO: There are 690 target matches and 690 decoys
INFO: Finished make-pin.
INFO: Percolator version 2.09, Build Date Sep 22 2015 14:29:11
INFO: Copyright (c) 2006-9 University of Washington. All rights reserved.
INFO: Written by Lukas Käll (lukall@u.washington.edu) in the
INFO: Department of Genome Sciences at the University of Washington.
INFO: Issued command:
INFO: percolator -r crux-output/percolator.target.peptides.txt -v 2 -P decoy_ --seed 1 -p 0.01 -n 0 --trainFDR 0.01 --testFDR 0.01 --maxiter 10 -s crux-output/make-pin.pin
INFO: Started Fri Feb  5 11:25:52 2016
INFO:  on pyrrolysine.gs.washington.edu
INFO: Hyperparameters selectionFdr=0.01, Cpos=0.01, Cneg=0, maxNiter=10
INFO: Reading Tab delimited input from datafile crux-output/make-pin.pin
INFO: Features:
INFO: lnrSp deltCn XCorr Sp IonFrac PepLen Charge1 Charge2 Charge3 enzN enzC enzInt lnNumSP dM absdM
INFO: Train/test set contains 690 positives and 690 negatives, size ratio=1 and pi0=1
INFO: selecting cneg by cross validation
INFO: Selected feature number 4 as initial search direction, could separate 39 positives in that direction
INFO: Selected feature number 3 as initial search direction, could separate 38 positives in that direction
INFO: Selected feature number 3 as initial search direction, could separate 30 positives in that direction
INFO: Estimating 53 over q=0.01 in initial direction
INFO: Reading in data and feature calculation took 0.03 cpu seconds or 0 seconds wall time
INFO: ---Training with Cpos=0.01, Cneg selected by cross validation, fdr=0.01
INFO: Iteration 1 :	After the iteration step, 53 target PSMs with q<0.01 were estimated by cross validation
INFO: Iteration 2 :	After the iteration step, 53 target PSMs with q<0.01 were estimated by cross validation
INFO: Iteration 3 :	After the iteration step, 53 target PSMs with q<0.01 were estimated by cross validation
INFO: Iteration 4 :	After the iteration step, 53 target PSMs with q<0.01 were estimated by cross validation
INFO: Iteration 5 :	After the iteration step, 53 target PSMs with q<0.01 were estimated by cross validation
INFO: Iteration 6 :	After the iteration step, 53 target PSMs with q<0.01 were estimated by cross validation
INFO: Iteration 7 :	After the iteration step, 53 target PSMs with q<0.01 were estimated by cross validation
INFO: Iteration 8 :	After the iteration step, 53 target PSMs with q<0.01 were estimated by cross validation
INFO: Iteration 9 :	After the iteration step, 53 target PSMs with q<0.01 were estimated by cross validation
INFO: Iteration 10 :	After the iteration step, 53 target PSMs with q<0.01 were estimated by cross validation
INFO: Obtained weights (only showing weights of first cross validation set)
INFO: # first line contains normalized weights, second line the raw weights
INFO: lnrSp	deltCn	XCorr	Sp	IonFrac	PepLen	Charge1	Charge2	Charge3	enzN	enzC	enzInt	lnNumSP	dM	absdM	m0
INFO: 0.0352	0.0014	0.1465	0.2956	-0.0130	0.0224	0.0051	-0.0025	-0.0020	0.0000	0.0000	-0.0001	-0.0049	-0.0055	0.0169	-0.9248
INFO: 0.0580	0.0069	0.2096	0.0010	-0.0922	0.0052	0.0136	-0.0052	-0.0053	0.0000	0.0000	-0.0006	-0.0077	-0.0059	0.0300	-1.1931
INFO: After all training done, 56 target PSMs with q<0.0100 were found when measuring on the test set
INFO: Found 56 target PSMs scoring over 1.0000% FDR level on testset
INFO: Merging results from 3 datasets
INFO: Target Decoy Competition yielded 116 target PSMs and 50 decoy PSMs
INFO: Tossing out "redundant" PSMs keeping only the best scoring PSM for each unique peptide.
INFO: Calibrating statistics - calculating q values
INFO: Merged list gives 9 peptides over q=0.0100
INFO: Calibrating statistics - calculating Posterior error probabilities (PEPs)
INFO: Processing took 0.67 cpu seconds or 0 seconds wall time
INFO: Elapsed time: 0.933 s
INFO: Finished crux percolator.
INFO: Return Code:0
```

The crux-output directory will now contain eight new files:

percolator.target.psms.txt – a list of peptide-spectrum matches (PSMs), ranked by quality,
percolator.target.peptides.txt – a list of peptides, ranked by quality,
percolator.decoy.psms.txt – a ranked list of decoy PSMs,
percolator.decoy.peptides.txt – a ranked list of decoy peptides,
percolator.pout.xml – a single XML output file containing all of the Percolator results,
make-pin.pin.xml – an intermediate XML format file that is used by Percolator,
percolator.params.txt – parameter file, and
percolator.log.txt – log file.
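
Because the PSM and peptide files are tab-delimited text with a header row, they can also be filtered with general-purpose tools. The following is a minimal sketch using Python's standard csv module; it assumes the header contains a column named "percolator q-value" (as in this tutorial's sample output), and the function name and the 0.01 threshold are illustrative choices rather than part of Crux.

```python
import csv


def confident_psms(handle, max_q=0.01):
    """Return the rows whose 'percolator q-value' is at most max_q.

    `handle` is any open text file (or file-like object) holding
    tab-delimited Percolator output with a header line.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["percolator q-value"]) <= max_q]


# Example usage (requires the tutorial's output file to exist):
# with open("crux-output/percolator.target.psms.txt", newline="") as handle:
#     for row in confident_psms(handle):
#         print(row["scan"], row["sequence"], row["percolator score"])
```

Each returned row is a dictionary keyed by the column names in the header, so any other column can be read the same way.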

As before, you might want to sort the Percolator output files, this time by the "percolator score" column:


$ crux sort-by-column --column-type real --ascending F crux-output/percolator.target.psms.txt "percolator score" > crux-output/percolator.target.psms.sort.txt

The beginning of the resulting percolator.target.psms.sort.txt file will look like this:

file	file_idx	scan	charge	spectrum precursor m/z	spectrum neutral mass	peptide mass	percolator score	percolator rank	percolator q-value	total matches/spectrum	sequence	cleavage type	protein id	flanking aa
crux-output/tide-search.target.txt	1	118	3	1031.9407	3092.8000	3095.2366	8.6190662	1	0	2	ELESAAYDHAEPVQPEDAPQDIANDELK	trypsin-full-digest	YGL009C	KD
crux-output/tide-search.target.txt	1	26	2	692.3773	1382.7400	1382.4467	7.0481362	2	0	6	TASEFDSAIAQDK	trypsin-full-digest	YLR043C	KL
crux-output/tide-search.target.txt	1	53	2	745.7473	1489.4800	1489.7318	6.7469611	3	0	3	NFLETVELQVGLK	trypsin-full-digest	YGL135W	RN
crux-output/tide-search.target.txt	1	62	2	745.8572	1489.7000	1489.7318	6.6489625	4	0	3	NFLETVELQVGLK	trypsin-full-digest	YGL135W	RN
crux-output/tide-search.target.txt	1	146	2	692.6772	1383.3400	1382.4467	6.6027999	5	0	5	TASEFDSAIAQDK	trypsin-full-digest	YLR043C	KL
crux-output/tide-search.target.txt	1	131	2	745.8473	1489.6801	1489.7318	6.5627294	6	0	3	NFLETVELQVGLK	trypsin-full-digest	YGL135W	RN
crux-output/tide-search.target.txt	1	50	2	651.2873	1300.5601	1301.4160	6.3757763	7	0	10	LDVDELGDVAQK	trypsin-full-digest	YLR043C	KN
crux-output/tide-search.target.txt	1	42	3	1032.2606	3093.7600	3095.2366	5.8047724	8	0	2	ELESAAYDHAEPVQPEDAPQDIANDELK	trypsin-full-digest	YGL009C	KD
crux-output/tide-search.target.txt	1	90	3	1032.0006	3092.9800	3095.2366	5.4775882	9	0	2	ELESAAYDHAEPVQPEDAPQDIANDELK	trypsin-full-digest	YGL009C	KD
crux-output/tide-search.target.txt	1	111	3	1033.0206	3096.0400	3095.2366	5.4364419	10	0	2	ELESAAYDHAEPVQPEDAPQDIANDELK	trypsin-full-digest	YGL009C	KD

In this output, the PSMs are ranked by "percolator score," with higher scores indicating a higher-quality match. The associated statistical confidence estimate is reported as a "percolator q-value," defined as the minimal false discovery rate threshold at which the match is deemed significant. In the list above, all of the matches have q-values of 0, meaning that they are highly significant. The meanings of the remaining columns are described here. Note that when you run Percolator on your own computer, the results may differ somewhat from the ones reported here, because Percolator randomly subdivides the data in a cross-validation scheme (described in detail here).
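
To make the q-value definition concrete, here is a simplified sketch of how q-values can be estimated from target and decoy scores. It estimates the false discovery rate at each score threshold as the number of decoys divided by the number of targets at or above that threshold, then takes the smallest such estimate over all more permissive thresholds. This is not Percolator's implementation, which involves cross-validation and further calibration; the function name and the toy scores are made up for illustration.

```python
def q_values(target_scores, decoy_scores):
    """Estimate a q-value for each target score, returned in input order."""
    labeled = [(s, True) for s in target_scores]
    labeled += [(s, False) for s in decoy_scores]
    # Best score first; on ties, decoys sort ahead of targets (conservative).
    labeled.sort(key=lambda pair: (-pair[0], pair[1]))

    # FDR estimate at each target = decoys seen so far / targets seen so far.
    fdr_at_target = []
    targets = decoys = 0
    for score, is_target in labeled:
        if is_target:
            targets += 1
            fdr_at_target.append((score, min(1.0, decoys / targets)))
        else:
            decoys += 1

    # q-value = smallest FDR at this threshold or any more permissive one.
    q_by_score = {}
    best = 1.0
    for score, fdr in reversed(fdr_at_target):
        best = min(best, fdr)
        q_by_score[score] = best
    return [q_by_score[s] for s in target_scores]


# Toy scores, not taken from the demo data.
toy_targets = [8.6, 7.0, 6.7, 2.0, 1.0]
toy_decoys = [3.0, 0.5]
print(q_values(toy_targets, toy_decoys))  # [0.0, 0.0, 0.0, 0.2, 0.2]
```

In the toy example, the three targets that outscore every decoy receive a q-value of 0, which mirrors the top-ranked PSMs in the table above.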