Main

An unintended consequence of whole-genome sequencing has been the birth of large-scale proteomics. What drives proteomics is the ability to use mass spectrometry data of peptides as an 'address' or 'zip code' to locate proteins in sequence databases. Two mass spectrometry methods are used to identify proteins by database search methods. The first method uses a molecular weight fingerprint measured from a protein digested with a site-specific protease1,2,3,4,5. A second method uses tandem mass spectra derived from individual peptides of a digested protein6,7 (Fig. 1). Because each tandem mass spectrum represents an independent and verifiable piece of data, this approach to database searching has the ability to identify proteins in mixtures, enabling a rapid and comprehensive approach for the analysis of protein complexes and other complicated mixtures of proteins6,8,9,10,11,12. New biology has been discovered based on fast and accurate protein identification13,14,15,16,17,18. As tandem mass spectral protein identification has proliferated, it has become increasingly important to understand the rationale of individual database search algorithms, their relative strengths and weaknesses, and the mathematics used to match sequence to spectrum.

Figure 1: Overview of the protein identification process.
figure 1

A protein mixture is digested, and the resulting peptides are analyzed by MS/MS to obtain experimental spectra. Search programs find database candidate sequences whose theoretical spectra are compared to the experimental spectrum. The best match (highest-scoring candidate sequence) defines the identified database peptide and the corresponding database protein. Validation software then determines whether the peptide and protein identifications are true or false.

In this review we discuss the prevailing fragmentation models, spectral preprocessing, methods to match tandem mass spectra to sequences and several approaches to matching tandem mass spectra of peptides whose exact sequences may not be present in the database. Space limitations restrict a detailed description of all algorithms in this rapidly expanding field. Also, some algorithms are proprietary, and thus, details on how they work are unknown. This review should supplement and update earlier reviews on database search algorithms19,20,21,22,23,24.

Peptide fragmentation and data preprocessing

In tandem mass spectrometry (MS/MS), gas phase peptide ions undergo collision-induced dissociation (CID) with molecules of an inert gas such as helium or argon25. Other methods of dissociation have been developed, such as electron capture dissociation (ECD), surface induced dissociation (SID) and electron transfer dissociation (ETD), but gas-phase CID is the most widely used in commercial tandem mass spectrometers. The dissociation pathways are strongly dependent on the collision energy, but the vast majority of instruments use low-energy CID (<100 eV)26. At low collision energies fragmentation mainly occurs along the peptide backbone bonds, whereas at higher energies fragments generated by amino acid side-chains are observed25,27. At low-energy CID, conditions normally used in triple quadrupole, quadrupole time of flight and ion trap (both linear and Paul) mass spectrometers, b-ions, y-ions and neutral losses of water and ammonia dominate the mass spectrum (Box 1). Fragmentation patterns are also strongly dependent on the chemical and physical properties of the amino acids and sequences of the peptide28,29. Most algorithms assume that peptides preferentially fragment into b- and y-ions. The distribution of intensities between b- and y-ions is the subject of intensive studies, and this distribution can vary by type of instrument (for example, ion trap as compared to Q-TOF), but this information is not yet fully exploited in most matching models. A mobile proton model30 has been proposed to explain intensity patterns observed in MS/MS, and Zhang has developed a theoretical model that predicts fragment ion intensities well31. As with any measurement process, tandem mass spectra may have some level of uncertainty. The accuracy of the mass-to-charge ratio (m/z) and the mass resolving power are limited, electronic and chemical noise may be present and ion signals may fluctuate as a result of changes in the concentrations of peptides entering the ion source. Given a precursor m/z value and a list of fragment ions, the goal is to match these to an amino acid sequence within the measurement and fragmentation uncertainty of the mass spectrometer (Fig. 1).

Some of the methods described below are used in our laboratory and illustrate a general approach for the analysis of large data sets. One source of uncertainty in analyzing tandem mass spectra of multiply charged ions is determination of the charge state of the precursor peptide, which is critical to accurately calculating a peptide's molecular weight. A highly accurate molecular weight measurement or calculation can be very effective in restricting search results. Methods to determine the charge state of ions involve deconvolution or determination of the m/z spread of isotope peaks32. On low-resolution instruments, where these methods are ineffective, MS/MS of peptides with charge states higher than 1 are typically searched twice, once calculating a molecular weight assuming that the charge state is +2 and the second time assuming the charge state is +3. Based on observations by Dancik and colleagues33 that complementary fragment ions (N-terminal and C-terminal fragment ions) can be used to improve molecular weight calculations, our group and others used a variation of this approach to determine peptide ion charge state34,35. In good-quality tandem mass spectra there are numerous complimentary ions; thus, if the precursor ion is assumed to be doubly charged, the complementary ions present in the spectrum should sum to this molecular weight. If the ion is triply charged, then the complementary fragment ions will sum to this molecular weight. Both situations are tested, and the molecular weight calculation that accounts for the most complementary ions is assumed to be the correct charge state. In addition to the above method, ion traps can also be used to perform a narrow mass range scan at resolutions sufficient to determine the charge state, but this necessitates an additional scan and reasonably abundant signal36. The newer linear ion traps with higher scan speeds can accommodate the high-resolution scan without decreasing the efficiency of data acquisition.

Most large-scale tandem mass spectrometry data is acquired using automated methods such as data-dependent data acquisition, which triggers MS/MS based on ion abundance. When the ion abundance level to trigger MS/MS is set just above the background noise level, MS/MS data is almost continuously acquired. Distinguishing data acquired from nonpeptide ions or identifying poor-quality MS/MS of peptides can potentially lower false-positive rates, as these spectra should not correctly match to peptide sequences, but most algorithms will still attempt a match to a sequence. Peptide ions selected for MS/MS at very low signal levels will produce spectra with poor signal-to-noise ratios and often incomplete sequence ions. Methods have been devised to sort the good from the bad spectra34,37,38. Recently, Bern et al. presented two different algorithms for assessing spectral quality prior to a database search: a binary classifier, which predicts whether or not the search engine will be able to make an identification, and a statistical regression, which predicts a more universal quality metric, independent of the database search program39. A quadratic discriminant analysis, a classical machine learning algorithm, was trained on a data set of manually validated good and bad spectra. Each spectrum is assigned a 'feature vector,' including number of peaks, total intensity and relative normalized intensity of the peaks that (i) differ by the mass of an amino acid, (ii) differ by a mass of 18 Da (mass of a water molecule) (iii) add up to the mass of the precursor ion and (iv) have associated isotope peaks. We report that the parameters with the best discriminant power are the relative normalized intensity of the peaks that differ by the mass of an amino acid and the relative normalized intensity of complimentary ions.

Review of database search algorithms

The goal of a tandem mass spectral database search is to identify the best sequence match to the spectrum. For tandem mass spectra with good signal-to-noise ratio and uniform fragmentation, it is reasonably straightforward to identify the correct sequence match. In situations where a tandem mass spectrum is of poorer quality or the peptide ion undergoes unusual fragmentation, an analysis may benefit from the use of multiple search algorithms. A number of algorithms and scoring models have been developed to assess the likelihood of a match. They can show different selectivity and sensitivity at the edge of good spectral quality, and some programs have enough flexibility to permit the use of different types of MS/MS data or modification patterns6,7,40,41,42,43,44,45,46,47,48,49,50,51,52,53. Four basic approaches have been developed to model matches to sequences: descriptive, interpretative, stochastic and probability-based modeling (Box 2). Some of these programs can be accessed through websites (Table 1), but most are run on local computers to allow large-scale analyses (Box 3).

Table 1 Database search programs and websites where information about these programs can be obtained

Descriptive models for database searching

SEQUEST is an example of a program that uses a descriptive model for peptide fragmentation and correlative matching to a tandem mass spectrum6. It uses a two-tiered scoring scheme to assess the quality of the match between the spectrum and amino acid sequence from a database. The first score calculated, the preliminary score (Sp), is an empirically derived score that restricts the number of sequences analyzed in the correlation analysis. Sp sums the peak intensity of fragment ions matching the predicted sequence ions and accounts for the continuity of an ion series and the length of a peptide. The original Sp score is:

where the first term in the product is the sum of ion abundances of all matched peaks, m is the number of matches, β is a 'reward' for each consecutive match of an ion series (for example, 0.075), ρ is a 'reward' for the presence of an immonium ion (for example, 0.15) and L is the number of all theoretical ions of an amino acid sequence.

The second score is a cross-correlation of the experimental and theoretical spectra. This score is referred to as XCorr. The theoretical spectrum is generated from the predicted fragment ions, the b- and y-ions for each of the sequences. In the theoretical spectrum the main ion series products are assigned an abundance of 50, a window of 1 atomic mass unit around the main fragment ions is assigned intensity 25, and water and ammonia losses are assigned intensity of 10. The theoretical and normalized experimental spectra are cross-correlated to obtain similarities between the spectra. First, a cross-correlation of the two discrete data sets, experimental data (E) and theoretical spectrum (T), is taken:

The correlation is processed and averaged to remove the periodic noise in the interval of (−75 to 75). In addition to the preliminary and cross-correlations scores, SEQUEST produces another important quantity, normalized difference of Xcorr values between the best sequence and each of the other sequences. This value, ΔCn, is important in distinguishing the best match from other lower-scoring matches. That is, ΔCn is useful to determine the uniqueness of the match. If a match is reasonably unique to a sequence, the ΔCn value will be large (>0.1). XCorr is independent of database size and reflects the quality of the match between spectrum and sequence, whereas ΔCn is database dependent and reflects the quality of the match relative to near misses.

The cross-correlation score is a sensitive measure. However, like other measures based on additive features, it is dependent on peptide mass, charge state and spectral quality. Thus it has been observed that larger peptides score higher than similar-quality smaller peptides. Very dense (potentially noisy) spectra can have high cross-correlation scores. To address these issues, a few modifications have been made to the cross-correlation score. To normalize XCorr for spectral noise and peptide size, the XCorr value is divided by auto-correlation of the experimental spectrum or by the square root of the products of auto-correlations of experimental and theoretical spectra54,55. A statistical confidence can then be readily derived from the normalized cross-correlation scores. SEQUEST has been shown to have good sensitivity and flexibility and is applicable to data generated by different types of mass spectrometers56,57.

Other programs in this group include SONAR52 and SALSA49. To determine the quality of the spectral match, SONAR uses a dot product of experimental and theoretical spectra. The dot product is the zero shift cross-correlation. SALSA scores the correspondence between the experimental fragment ion series and theoretical ion series regardless of their absolute position on the m/z axis. A virtual ruler is used with the relative separations of ions fixed and then superimposed on the experimental mass spectrum by aligning the first ion in the ion series to the fragment ion with the highest experimentally determined m/z. SALSA is adept at identifying peptides with unanticipated modifications or amino acids.

Interpretative models for database searching

PeptideSearch7 is a program based on the assumption that in tandem mass spectra there is a continuous series of fragment ions that are clearly identifiable as a short amino acid sequence (Fig. 2). A search engine has been fashioned using the partial sequence by dividing every candidate sequence into three parts: region 1 of unknown mass, region 2, containing the sequence tag and another region 3. The sequence ions associated with the sequence tag can be from the b- or y-ion series (defined in Box 1). Both possibilities are equally likely and must be considered by the algorithm. The assumption about the directionality of the partial sequence leads to the determination of masses of the region 1 (m1) and 3 (m3). For example, if the sequence ions are assumed to be b-ions, then m1 is simply the mass-to-charge ratio of the smallest ion in the series, whereas m3 is the difference between peptide mass and the mass-to-charge ratio of the largest ion in the series. The algorithm searches the database for sequences using the information from regions 1 (m1), 2 (partial sequence tag) and 3 (m3), as well as information from the peptide molecular weight, the protease specificity and mass accuracy. A sequence match is scored by computing the random probability match of each region and the N- and C-terminal amino acids expected from protease cleavage specificity. For the sequence tag it is assumed that the probabilities of all amino acids are equally likely. In this case, unique mass amino acids have a probability of 1/20, whereas amino acids with the same mass (within a specified accuracy) will have higher probability of N × 1/20, where N is the number of amino acids with the same mass. Thus, for amino acids leucine and isoleucine, or glutamic acid and lysine, this probability will be 1/10, whereas for glycine it is 1/20. The probability of randomly matching regions 1 and 3 are equal and set to the ratio of mass accuracy to the average molecular weight of amino acid residues, 1/110 for a unit mass accuracy. Also, a probability is assigned to the amino acids at the cleavage sites. Complete tryptic cleavage of a protein results in peptides terminating in one of two amino acids—lysine or arginine. Therefore, if a sequence is tryptic, the random match probability is multiplied by 1/100, and if it is half tryptic by 1/10. Combining the probabilities of all regions and cleavage probabilities, a probability that a sequence match is random is set:

Figure 2: Simplified representation of an MS/MS spectrum for the peptid e IYEVEGMR.
figure 2

The b-ion ladder is shown in red and the y-ion ladder in blue. Distances between peaks on the horizontal mass-to-charge (m/z) axis can be used to infer partial sequences of the peptide. This example shows how the partial sequence YEV can be inferred from the y-ion ladder.

The probability of a nonrandom match in a database with N amino acids would then be

In the above formula the random match probability is multiplied by 2 to account for the fact that the direction of the partial sequence is not known. As it is seen from the formula for a nonrandom match, the identification is dependent on the size of the database. In general, the larger the database, the longer the sequence tag should be for higher confidence matches.

One of the advantages of the approach taken by PeptideSearch is that it is an error-tolerant algorithm. There could be a difference between the mass of the peptide as predicted in the database and the actual peptide. For example, the database sequence could have been predicted incorrectly, or there could have been a mutation or a post-translational modification in the peptide represented by the mass spectrum. The molecular weights of these altered peptides would not agree with molecular weights predicted from the database and they will not be identified by direct searches. PeptideSearch suggests using partial information from a combination of any two out of three regions, to identify peptides and map the region of alteration. If the origin of the sample is known, then the database search can be restricted to a smaller number of known proteins (for example, a particular species), in which case searches with short partial sequences (one or two amino acids) can yield reliable results.

The scoring model of PeptideSearch assumes the probability of each amino acid occurring is the same, 1/20. It is known, however, that amino acids occur in a database at different frequencies; for example, leucine occurs eight times more frequently than tryptophan). Furthermore, the probability scores are calculated using an incomplete model. The sum of probabilities of all database sequences is not 1: the probabilities for mass accuracy and enzyme specificity are arbitrary, as they represent a different probability space, and longer sequences naturally produce smaller probabilities. Despite these problems with the scoring model, sequence tagging is a useful approach for database searching.

Other implementations of the PeptideSearch algorithm are MS-Seq, by Clauser et al., and the recently developed program GutenTag42,47. In GutenTag, the partial sequence is generated automatically before the search by using empirically derived knowledge of specific amino acid contributions to fragment ion intensities in the spectrum. The program compares the sequences derived from the database search to the tandem mass spectrum using a dot product. GutenTag uses a fragmentation model for peptides that is based on the empirical observations and works well for doubly charged peptides derived from trypsin digestion. Sequence tagging approaches are useful for the identification of peptides with unknown modifications, amino acid sequence variations or sequence errors. Manual sequence interpretation can also lead to answers from spectra with unusual fragmentation that does not fit an algorithm's fragmentation model.

Stochastic models for database searching

Stochastic methods are based on probability estimates for peptide fragmentation and the subsequent generation of tandem mass spectra. In SCOPE43, one of the early algorithms in this category, the MS/MS spectrum generation is modeled by a two-step stochastic process: fragmentation and measurement. The first step, fragmentation, enumerates all the possible fragmentation patterns of a peptide, and it determines the empirical probabilities associated with the pattern. The second step, measurement, generates tandem mass spectra from the fragments obtained in the first step, according to the distribution of the instrument measurement error.

The first step, the fragmentation probability estimation, allows the incorporation of physical and chemical properties of a peptide in the scoring process. A trivial fragmentation probability model would be, for example, to consider that the b- and y-ions of a peptide have a 100% chance of appearing in the MS/MS spectrum, the a-ions have a 50% chance and all the other possible fragment ions have a zero chance. Of course, the actual fragmentation process is more complex, and a more representative model can be derived from the analysis of large databases of manually curated spectra or by experienced mass spectrometrists.

Assuming that a large database of manually curated spectra is available, it is straightforward to generate a list of the observed fragment ions for each peptide. From here, the probability of appearance of each cleavage event in the tandem mass spectrum can be estimated by counting the number of times it is observed. In addition, using data mining techniques on this fragment data set, it is also possible to find those properties of a peptide that lead to significantly more cleavage events than would otherwise be expected.

Once the fragmentation probabilities have been estimated using one of the methods above, or a combination of the two, the second step of the algorithm is to compute the probability that a fragmentation pattern results in a tandem mass spectrum measurement. In other words, the probability of observing a collection of spectral peaks given a particular peptide fragmentation pattern is computed. The stochastic approach to this problem is to model the distribution of the measured m/z ratios, either as normal or uniform distributions centered at the m/z ratio for the predicted fragment.

Formally, the two-step process used by SCOPE can be described as follows: for a peptide p, a fragmentation pattern F and a tandem mass spectrum S, the first step of the algorithm estimates Pr(F|p), the probability of obtaining the fragmentation pattern F from the collision-induced dissociation of peptide p. The second step of the algorithm determines ψ (S|F, p), the probability of fragmentation pattern F to generate spectrum S. Finally, the probability of obtaining the spectrum S from peptide p is computed combining the two steps:

where T̄(p) is the fragment space, which contains all of the fragmentation patterns theoretically generated from peptide p. SCOPE implements a dynamic programming algorithm to efficiently compute the above formula and reports the peptide p that maximizes the score, along with its corresponding P-value.

Stochastic models to determine the best fit between a tandem mass spectrum and a sequence have been used in other programs. The program OLAV uses an approach similar to the one used in SCOPE, except that some simplifications are made to avoid overexpansion of terms51. OLAV uses the maximum likelihood ratio as the identification score. The likelihood ratio is the ratio of probabilities from two alternative hypotheses, that peptide identification is valid and the alternative that the identification is random. A program proprietary to Waters, MassLynx, uses Markovian chains to compute the statistical significance of a match, but a detailed description of the model used by the program is not available58. Most stochastic models require the use of training sets to determine the coefficients used to model features of the tandem mass spectrum. The magnitude of the coefficients may vary with the type of mass spectrometer used or other performance characteristics of the instrument, such as calibration. Therefore, the performance of these methods may be expected to depend on experimental settings, instruments used and limitations of the training set.

Statistical and probability models for database searching

This group of methods uses models based on empirically generated fragment ion probabilities45,48,51. In these methods no a priori determined probabilities are used. They generate a model that relates the sequences to a spectrum and determine the peptide identification score from this model. Thus, in the simplest models the frequencies of matches of b- and y-ions are determined and used to calculate a probability of sequence identification determined by the product of probabilities of its fragment matches. Several variations of this approach have been implemented in database searching algorithms43,45,48,51. Mascot41 uses a model analogous to the one previously developed for identifying proteins from their peptide mass fingerprint3. Mascot may also use some empirical observations about fragment intensities and ion series continuity. The actual description of the model is not available in peer-reviewed literature and therefore we are not able to describe this algorithm in detail, even though it is one of the most widely used database search programs.

Recently, a group of database search algorithms have been implemented that use collective properties of database sequences to calculate the probability that a sequence match is a random event. Thus, we have proposed to divide all database fragment ions into two groups: matches and misses46. Then, we assume that a hypergeometric probability models the frequencies of database peptides based on the number of matches. According to this model a probability that a peptide match is a random event is predicted from the hypergeometric probability of choosing K1 matches (number of matches of a peptide) in N1 trials (the number of fragment ions of the peptide) from a pool of fragments consisting of N fragments (number of all database fragments) K of which are matches (number of matches of all fragment ions to a spectrum). The hypergeometric probability of this event is:

The probability of a peptide being a random match to the tandem mass spectrum is defined in the space that comprises all peptides whose mass match the mass of the precursor peptide. The significance of a peptide match is determined as a type I error of the null hypothesis—all fragment matches are random. OMSSA, a recently developed database search algorithm, uses a similar approach, where the peptide matches are modeled after the Poisson distribution53. Database search algorithms based on the number of matches trend to spectral quality owing to the fact that a match to a background peak and a match to a sequence ion are not distinguishable. Statistical models produce a statistical confidence for a match between the spectrum and database sequences. This confidence is based on the frequency of fragment ions in the database itself, and the probability a spectrum is a random match rather then the closeness of fit to a fragment model.

Conclusions

Automated analysis of tandem mass spectra is a critical process for new analytical strategies such as 'shotgun proteomics'. As tandem mass spectrometers have improved, the acquisition of hundreds of thousands of spectra has become not uncommon, and thus, accurate approaches to identify and validate sequence matches will make this method all the more powerful. Although a variety of algorithms have been demonstrated to provide accurate matches between tandem mass spectra and sequences, all suffer from an inability to provide verifiable matches to poor-quality spectra. Reliable and sensitive methods to assess spectral quality and assign quality indices to spectra will be critical for decreasing computational load and lowering false-positive rates (Box 4). Most algorithms are very accurate for peptides that follow general rules of fragmentation, but a subset of amino acid sequences and more highly charged peptide ions deviate from these rules; thus, a better understanding of relationships between peptide sequences and fragment ion intensity will assist in designing better models for matching spectra to sequences. Additional studies to better understand the strengths and weaknesses of the various algorithms will help to design algorithms with better sensitivity and selectivity.