A general question in molecular physiology is how to identify candidate protein kinases corresponding to a known or hypothetical phosphorylation site in a protein of interest. It is generally recognized that the amino acid sequence surrounding the phosphorylation site provides information that is relevant to identification of the cognate protein kinase. Here, we present a mass spectrometry-based method for profiling the target specificity of a given protein kinase as well as a computational tool for the calculation and visualization of the target preferences. The mass spectrometry-based method identifies sites phosphorylated in response to in vitro incubation of protein mixtures with active recombinant protein kinases followed by standard phosphoproteomic methodologies. The computational tool, called “PhosphoLogo,” uses an information-theoretic algorithm to calculate position-specific amino acid preferences and anti-preferences from the mass-spectrometry data (http://helixweb.nih.gov/PhosphoLogo/). The method was tested using protein kinase A (catalytic subunit α), revealing the well-known preference for basic amino acids in positions −2 and −3 relative to the phosphorylated amino acid. It also provides evidence for a preference for amino acids with a branched aliphatic side chain in position +1, a finding compatible with known crystal structures of protein kinase A. The method was also employed to profile target preferences and anti-preferences for 15 additional protein kinases with potential roles in regulation of epithelial transport: CK2, p38, AKT1, SGK1, PKCδ, CaMK2δ, DAPK1, MAPKAPK2, PKD3, PIM1, OSR1, STK39/SPAK, GSK3β, Wnk1, and Wnk4.
- protein kinase A
- glycogen synthase kinase
The function of a given protein can be regulated by a variety of mechanisms including posttranslational modifications. One of the most prevalent posttranslational modifications involved in molecular control is phosphorylation. A change in phosphorylation, mediated by kinases and phosphatases, can result in alteration of molecular function by changing the three-dimensional configuration of the molecule or by introducing binding sites for regulator proteins. A common objective in molecular physiology is to predict, based on the amino acid sequence of a given protein, what protein kinase is responsible for a given phosphorylation event. For example, if a particular serine is phosphorylated and the amino acids located two positions and three positions upstream (in the NH2-terminal direction) are arginines, then a reasonable hypothesis is that protein kinase A may be responsible for the phosphorylation (R-R-X-S motif).
Among the 21,000 protein-coding genes in the genome, just over 500 code for protein kinases (21). A general question is: For a given phosphorylation site in a particular protein, how can we predict which protein kinase is responsible for the phosphorylation event? To answer this question, it would be useful to know the substrate target preferences for each of the kinases coded by the genome; i.e., among all protein kinases, can unique maps (motifs) be identified which allow each phosphorylation site in a particular protein to be mapped to a specific kinase? Previous efforts to generate such maps (known as “sequence logos”) have utilized two chief approaches: 1) generation of substrate target logos from curated databases (2), and 2) generation of substrate target logos from peptide arrays (24, 35). Thus far, target sequence logos have been generated for only a minority of protein kinases. Here, we introduce two new tools to further address this problem. First, we introduce a method that uses protein mass spectrometry to identify sites phosphorylated from in vitro incubation of dephosphorylated, denatured proteins with active, recombinant protein kinases. This approach uses phosphoproteomic methodologies that we have published in prior studies (9, 10). Second, we introduce a new software tool called “PhosphoLogo” that uses information theory to describe target-sequence preferences from either mass spectrometry-generated data or from phosphopeptide data sets downloaded from curated databases. PhosphoLogo also identifies disfavored amino acids in particular positions relative to the phosphorylated amino acid. We then utilize the mass spectrometry approach to profile target preferences for protein kinase A and 15 other protein kinases that play potential roles in regulation of epithelial transport (CK2, p38, AKT1, SGK, PKCδ, CaMK2δ, DAPK1, MAPKAPK2, PKD3, PIM1, OSR1, STK39/SPAK, GSK3β, Wnk1, and Wnk4). 
The resulting knowledge of protein kinase preferences and anti-preferences may be useful both in systems biology studies of signaling networks and in gaining insight into protein kinase-substrate structural interactions.
Animal procedures were approved by the National Heart, Lung, and Blood Institute Institutional Animal Care and Use Committee (protocol H-0110). Tissues from kidneys, liver, brain, and the small intestine from rapidly euthanized rats were separately isolated on ice and homogenized (Omni International Homogenizer, 15-s pulse, 15 s × 5) in 1 × Complete Mini Protease Inhibitor Cocktail (Roche, Mannheim, Germany). Protein concentrations were measured using the BCA method (Thermo Fisher Scientific, Waltham, MA). Samples were stored at −80°C.
Protein (250 μg) from the homogenates was combined (equal amounts from each tissue) with 5 μl (2,000 units) λ-phosphatase protein (New England BioLabs, Cambridge, MA) in 1 × MnCl2, and 1 × Complete Mini Protease Inhibitor Cocktail (Roche), and brought up in NEBuffer for Protein MetalloPhosphatases (New England BioLabs) to a volume of 43 μl. Samples were incubated on a thermomixer at 30°C for 20 h. To denature proteins in the mixture and inactivate λ-phosphatase, samples were heated (65°C for 1 h on a thermomixer) followed by removal of MnCl2 via buffer exchange into 2.5 × kinase reaction mix, containing one part EDTA-free 10 × Complete Mini Protease Inhibitor Cocktail (Roche), one part 10 × Kinase Buffer (500 mM Tris·HCl, 100 mM MgCl2, New England BioLabs), and two parts H2O. The buffer exchange was performed using 10 kDa-cutoff Amicon Ultra-0.5 ml Centrifugal Filters (Millipore, Billerica, MA) spun at 14,000 g for 2 × 20 min. Retentate was brought up to a volume of 60 μl with 2.5 × kinase reaction mix. In preliminary experiments, results were compared with and without the 65°C heat denaturation step. In these experiments, unheated samples underwent chemical inhibition of phosphatases (50 mM sodium fluoride and 10 mM sodium orthovanadate, New England BioLabs). In preliminary studies, the λ-phosphatase treatment reduced the baseline detection of phosphopeptides by LC-MS/MS analysis from 20.3% [of all peptides identified after immobilized metal affinity chromatography (IMAC) enrichment] down to 4.3%.
Samples were combined with the desired concentration of the kinase of interest, 2.67 mM ATP (Cell Signaling Technology, Danvers, MA), and EDTA-free 1 × Complete Mini Protease Inhibitor Cocktail (Roche). Samples were brought up to a volume of 150 μl in H2O to achieve a 1 × kinase reaction mix concentration (50 mM Tris·HCl, 10 mM MgCl2) and a rat-tissue protein concentration of 6 μg/μl. All kinases were active, purified recombinant proteins purchased commercially. The optimal kinase concentrations during the incubations were empirically determined for each kinase. These concentrations were 1.0 μM for PKA (New England BioLabs); 0.9 μM for CK2 (New England BioLabs), p38 (Cell Signaling Technology), AKT1 (Cell Signaling Technology), SGK1 (CarnaBio USA, Natick, MA), PKCδ (Cell Signaling Technology), MAPKAPK2 (SignalChem), PKD3 (Cell Signaling Technology), PIM1 (SignalChem), and GSK3β (CarnaBio USA); 0.6 μM for CaMK2δ (SignalChem) and DAPK1 (SignalChem); and 0.4 μM for STLK3 (STK39/SPAK; CarnaBio USA), OSR1 (CarnaBio USA), Wnk1 (CarnaBio USA), and Wnk4 (CarnaBio USA). All other reactant concentrations were held constant for the incubations, except that a solution of 0.03 μg/μl calmodulin, 1 mM Tris pH 7.3, and 0.5 mM CaCl2 was added to the reaction mix for the Ca2+/calmodulin-dependent kinases, CaMK2δ and DAPK1. Samples were incubated at 30°C for 24 h. To halt the kinase reaction and denature the protein, 450 μl of 8 M urea buffer [0.48 g of urea, 50 μl of 1 M Tris pH 8, 15 μl of 5 M NaCl, 10 μl of EDTA-free 100 × Halt Protease and Phosphatase Inhibitor Cocktail (Thermo Fisher Scientific), and 925 μl H2O] was added to each 150-μl sample.
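The dilution arithmetic behind the reaction mix can be checked with a short calculation. The sketch below is illustrative only: the function and variable names are assumptions, and the stock concentrations and volumes are those stated in the text (10 × Kinase Buffer at 500 mM Tris·HCl and 100 mM MgCl2; 60 μl of 2.5 × mix brought to 150 μl).

```python
def dilute(stock_conc, stock_volume, final_volume):
    """Concentration after dilution, from C1 * V1 = C2 * V2 (any consistent volume unit)."""
    return stock_conc * stock_volume / final_volume

# 10x Kinase Buffer as stated in the text.
TRIS_10X_MM, MGCL2_10X_MM = 500.0, 100.0

# 2.5x mix = 1 part 10x inhibitor + 1 part 10x Kinase Buffer + 2 parts H2O,
# so each 10x stock is diluted 1 part in 4.
tris_2_5x = dilute(TRIS_10X_MM, 1, 4)     # 125 mM
mgcl2_2_5x = dilute(MGCL2_10X_MM, 1, 4)   # 25 mM

# 60 ul of retentate in 2.5x mix is brought to 150 ul for the kinase reaction.
tris_final = dilute(tris_2_5x, 60, 150)     # 50 mM, the stated 1x value
mgcl2_final = dilute(mgcl2_2_5x, 60, 150)   # 10 mM, the stated 1x value
fold_final = dilute(2.5, 60, 150)           # 1.0, i.e. a 1x reaction mix
```

The computed final concentrations agree with the 1 × values quoted for the reaction (50 mM Tris·HCl, 10 mM MgCl2).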
Phosphopeptide digestion, enrichment, LC-MS/MS analysis, and identification.
Following the kinase incubation, the proteins underwent a standard phosphoproteomic analysis utilizing Ga3+ IMAC phosphopeptide enrichment, and LC-MS/MS analysis as described by Hoffert et al. (9). Tryptic peptides were analyzed on an Eksigent nanoflow LC system (Dublin, CA) connected to an LTQ Orbitrap Velos mass spectrometer (Thermo Fisher Scientific) equipped with a nano-electrospray ion source. Peptides were loaded onto a peptide trap cartridge (Agilent Technologies, Palo Alto, CA) at a flow rate of 6 μl/min. The trapped peptides were then fractionated with a reversed-phase PicoFrit column (New Objective, Woburn, MA) using a linear gradient of 5–35% acetonitrile in 0.1% formic acid. The gradient time was 45 min at a flow rate of 0.25 μl/min. Precursor masses (MS1 scans) were acquired in the Orbitrap, and fragmented product masses (MS2 scans) were acquired either in the linear ion trap using collision-induced dissociation (CID) or in the Orbitrap using higher-energy collision-induced dissociation (HCD).
Two search algorithms were used to identify peptide ions from the mass spectra, viz. SEQUEST running on Proteome Discoverer software (version 1.2, Thermo Fisher Scientific) and InsPecT (version 20081014). We used a concatenated database containing both the forward and reversed complement of the Rat Refseq Database (National Center for Biotechnology Information, released on October 6, 2010, 29,392 entries) which included a list of common contaminating proteins from other species. Precursor ion tolerance was 25 ppm, while fragment ion tolerance was 1.0 Da for CID and 0.05 Da for HCD. Up to three missed trypsin cleavage sites were allowed. Static modifications included carbamidomethylation of cysteine (+57.021 Da). Variable modifications included oxidation of methionine (+15.995 Da), phosphorylation of serine, threonine, or tyrosine (+79.966 Da), and deamidation of asparagine and glutamine (+0.984 Da). Known contaminant ions were excluded. Data sets were filtered to include <1% false discovery hits (estimated based on target-decoy analysis). Phosphorylation site assignment was made using in-house software (dynamic programming algorithm currently under review) for the SEQUEST results and Phosphate Localization Score (PLS) for the InsPecT results (PLS > 7). The search results from the two algorithms were merged and filtered for phosphopeptides. To eliminate ambiguous identifications, the two algorithms had to yield the same peptide identifications including modifications if the spectrum was interpreted by both algorithms. The in-house software ProMatch (29) was used to identify the protein corresponding to each peptide sequence. For peptide sequences that matched to more than one protein, only a single protein was used, with matches to known proteins preferred over matches to predicted proteins as described by Tchapyjnikov et al. (29). 
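The target-decoy filtering step can be illustrated with a minimal sketch. It is not the procedure implemented in Proteome Discoverer or InsPecT; it assumes that each peptide-spectrum match carries a single score (higher is better) and that the false discovery rate is estimated as the number of decoy hits divided by the number of target hits above the cutoff, which is one common convention. All names are hypothetical.

```python
def fdr_score_threshold(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose accepted set has estimated FDR <= max_fdr.

    psms: iterable of (score, is_decoy) pairs; a higher score means a better match.
    FDR is estimated as (# decoy hits) / (# target hits) at or above the cutoff
    (an assumed convention). Returns None if no cutoff satisfies the bound.
    """
    targets = decoys = 0
    best = None
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the most permissive cutoff that still meets the FDR bound.
        if targets and decoys / targets <= max_fdr:
            best = score
    return best
```

Matches scoring at or above the returned cutoff would be retained; ties between scores are not treated specially in this sketch.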
Using the protein accession number, peptide sequences were extended using in-house software PTM Centralizer (URL: http://helixweb.nih.gov/ESBL/PtmCentralizer/) to include the desired number of amino acids in positions surrounding the phosphorylated site. If a phosphopeptide was identified from multiple spectra, that peptide was included only once on the peptide list for the subsequent logo generation step.
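The centering and de-duplication steps can be sketched as follows. This is not the PTM Centralizer code; the function names, the 0-based site index, and the use of a padding character for windows that run past a protein terminus are all assumptions made for illustration.

```python
def centered_window(protein_seq, site_index, flank=6, pad="_"):
    """Return the 2*flank + 1 residues centered on the 0-based site_index.

    Positions that run past either protein terminus are filled with `pad`
    so that every window has the same length (the padding is an assumption).
    """
    if not 0 <= site_index < len(protein_seq):
        raise ValueError("site_index outside protein sequence")
    left = protein_seq[max(0, site_index - flank):site_index]
    right = protein_seq[site_index + 1:site_index + 1 + flank]
    return left.rjust(flank, pad) + protein_seq[site_index] + right.ljust(flank, pad)


def unique_windows(windows):
    """Keep each centered sequence once, preserving first-seen order."""
    return list(dict.fromkeys(windows))
```

For example, `centered_window("MKRRASLPG", 5, flank=3)` returns `"RRASLPG"`, with the phosphorylated serine in the middle position.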
Background subtraction and logo generation.
Endogenous “background” phosphopeptides identified in the no-kinase controls were pooled for all experiments into a single, comprehensive list (Supplemental Data Set S1 at http://helixweb.nih.gov/PhosphoLogo/Data_File_S1.xls). Any background phosphopeptides present in the list of identified phosphopeptides for a particular kinase sample were removed. Using the PhosphoLogo software described below, substrate preference motifs were generated using the final list of same-length, centered, unique, background-subtracted phosphopeptide sequences.
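A minimal sketch of the list preparation described above is given below. It assumes that background matching is done by exact comparison of the centered sequences, which the text does not state explicitly; the function name is hypothetical.

```python
def prepare_logo_input(kinase_peptides, background_peptides):
    """Return unique, background-subtracted, same-length centered peptides.

    Background matching is by exact string identity of the centered
    sequence (an assumption for this sketch).
    """
    background = set(background_peptides)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    kept = [p for p in dict.fromkeys(kinase_peptides) if p not in background]
    if len({len(p) for p in kept}) > 1:
        raise ValueError("peptides must all have the same length")
    return kept
```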
PhosphoLogo software implementation.
To represent kinase sequence preference motifs, we developed software called PhosphoLogo that uses sequence logos based on information content, as first introduced by Schneider and Stephens (26). Using the list of peptide sequences for the kinase of interest, the observed probability P(a,i) of finding a particular amino acid a in position i relative to the phosphorylated residue is calculated, where positive positions represent amino acids on the COOH-terminal side of the phosphorylated residue. The information content IC(a,i) of the amino acid a at the position i compares the observed probability P(a,i) to the background probability Pref(a) (26, 27).
The reference probabilities, Pref(a), for the amino acids can be user-specified, calculated from the input sequences, or assigned to the predefined rat or human proteome-wide probabilities. Note that if an amino acid is favored in a particular position, the IC(a,i) value will be positive; if the amino acid is disfavored, the IC(a,i) value will be negative.
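The calculation can be sketched in a few lines. The sketch assumes the relative-entropy form IC(a,i) = P(a,i) · log2[P(a,i)/Pref(a)], which is consistent with the sign behavior described above (positive when favored, negative when disfavored) but should be checked against the cited references (26, 27). It is not the PhosphoLogo source code, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def position_probabilities(peptides):
    """P(a,i): observed fraction of residue a in each column of equal-length peptides."""
    n = len(peptides)
    width = len(peptides[0])
    return [
        {aa: count / n for aa, count in Counter(p[i] for p in peptides).items()}
        for i in range(width)
    ]


def information_content(p_obs, p_ref):
    """Assumed form: IC(a,i) = P(a,i) * log2(P(a,i) / Pref(a)); 0 when P(a,i) = 0."""
    return p_obs * log2(p_obs / p_ref) if p_obs > 0 else 0.0
```

With this form, a residue observed at twice its reference probability contributes a positive value, and one observed at half its reference probability contributes a negative value.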
PhosphoLogo displays logos representing the favored amino acids in each position and thus only takes into account the positive IC(a,i) values [see below for a treatment of “anti-logos” representing the amino acids with negative IC(a,i) values]. A column of characters at each position represents the overrepresented residues. Each residue's size is proportional to its IC(a,i) value, and the total column height is proportional to the sum of the favored residues' IC(a,i) values. The characters are ordered from top to bottom in order of decreasing IC(a,i) values.
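The layout rule for a single logo column can be written compactly. The sketch takes precomputed IC(a,i) values as input, so it makes no assumption about how they were calculated; the function name and return shape are hypothetical.

```python
def logo_column(ic_by_residue):
    """Return (stack, total_height) for one logo position.

    ic_by_residue maps residue -> IC(a,i). Only positive values are drawn;
    the stack is ordered top to bottom by decreasing IC, and the column
    height is the sum of the favored residues' IC values.
    """
    favored = [(aa, ic) for aa, ic in ic_by_residue.items() if ic > 0]
    favored.sort(key=lambda item: item[1], reverse=True)
    return favored, sum(ic for _, ic in favored)
```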
The software also allows the user to select from a variety of representation options. The “adjust modified reference probabilities” option accounts for the fact that only certain residues are found in the phosphorylated (or otherwise modified) position, and thus the probability of observing one of these residues in the zero position is greater than its genome-wide probability. Selecting this option causes the individual background probabilities of the residues observed in position zero to be normalized by the sum of the background probabilities of only the residues found in this position. Thus, for a phosphorylated residue (S, T, or Y), the adjusted reference probabilities are given by P′ref(a) = Pref(a) / [Pref(S) + Pref(T) + Pref(Y)] for a = S, T, or Y, where P′ref(a) denotes the adjusted reference probability.
This renormalization scales the IC value of the phosphorylated position to correctly represent the true information content at that position. The “grouping” option allows the user to group amino acids based on default charge or size groupings, or based on user-defined groups. This option is useful for discerning motif trends if a particular kinase has a preference for a certain type of amino acid (side chain charge, polarity, or size).
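The position-zero renormalization amounts to dividing each allowed residue's background probability by the sum over the allowed residues. A minimal sketch follows; the function name is hypothetical and any probabilities passed to it in an example are illustrative, not proteome values.

```python
def adjust_reference(p_ref, allowed=("S", "T", "Y")):
    """Renormalize reference probabilities over the residues allowed at position 0."""
    total = sum(p_ref[aa] for aa in allowed)
    return {aa: p_ref[aa] / total for aa in allowed}
```

By construction the adjusted probabilities sum to 1 over S, T, and Y, so a data set in which S, T, and Y occur at position zero in proportion to their background frequencies carries no information at that position.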
PhosphoLogo calculates a χ2-statistic for each amino acid in each position and displays in the sequence logo only the statistically significant amino acids in each position. To calculate the statistic for a given amino acid in a particular position, 2 × 2 contingency tables for the observed and expected frequencies for observing the given amino acid (vs. any other amino acid) in a particular position (vs. all other positions excluding the phosphorylated position zero) are calculated based on the observed input frequencies. The χ2-statistic is defined as the sum of the square of the difference between the observed and the expected value divided by the expected value for each of the four pairs of entries in the two tables. A P-value is calculated from the χ2-statistic. The P-value significance level α is user-defined; only amino acids with a P-value less than α are displayed in the logo.
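The test can be sketched as below. This is one reading of the description, not the PhosphoLogo implementation: the 2 × 2 table is residue vs. any other residue, crossed with the position of interest vs. all other non-phosphorylated positions; expected counts come from the table margins; no continuity correction is applied; and the P-value uses the χ2 distribution with 1 degree of freedom. Column indices are 0-based string indices, and all names are hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(peptides, residue, position_index, center_index):
    """Chi-square test (1 df) for `residue` at one column vs. all other columns.

    The phosphorylated column (center_index) is excluded from the
    "other positions" tally. Returns (chi2, p_value).
    """
    in_pos = sum(p[position_index] == residue for p in peptides)
    not_in_pos = len(peptides) - in_pos
    others = [p[j] for p in peptides for j in range(len(p))
              if j not in (position_index, center_index)]
    in_other = sum(aa == residue for aa in others)
    not_in_other = len(others) - in_other

    observed = [[in_pos, not_in_pos], [in_other, not_in_other]]
    total = in_pos + not_in_pos + in_other + not_in_other
    row = [sum(r) for r in observed]
    col = [observed[0][k] + observed[1][k] for k in range(2)]

    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row[r] * col[c] / total
            if expected > 0:
                chi2 += (observed[r][c] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2))
```

A residue would then be displayed only if its P-value falls below the user-defined α.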
PhosphoLogo also includes two other visualization options: a relative frequency plot and a disfavored residue sequence logo, or “anti-logo” (1). The relative frequency plot displays the characters observed at each position, with character height proportional to the observed frequency. The anti-logo (with a default light gray background color) displays only the disfavored residues and uses a modified version of the information content formula, relying on the complements of the observed and reference residue probabilities:
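One form consistent with this description, assuming the base formula is the relative-entropy term P(a,i) · log2[P(a,i)/Pref(a)], replaces each probability with its complement: [1 − P(a,i)] · log2{[1 − P(a,i)] / [1 − Pref(a)]}. The sketch below illustrates that assumed form only; it is not taken from the PhosphoLogo source code, and the function name is hypothetical.

```python
from math import log2


def anti_information_content(p_obs, p_ref):
    """Assumed complement form: (1 - P) * log2((1 - P) / (1 - Pref)).

    Positive when the residue is underrepresented (P < Pref),
    negative when it is overrepresented.
    """
    q_obs, q_ref = 1.0 - p_obs, 1.0 - p_ref
    return q_obs * log2(q_obs / q_ref) if q_obs > 0 else 0.0
```

Under this form, a residue that never appears at a position receives the largest anti-logo value available for its reference probability.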
The PhosphoLogo program was written in Java. We have made it available as a web application (URL: http://helixweb.nih.gov/PhosphoLogo/).
We present a new MS-based method to profile protein kinases based on their preferences for particular amino acids at positions surrounding the phosphorylated residues of target proteins. The method consists of incubating the kinase of interest with a heterogeneous mixture of dephosphorylated, denatured proteins from multiple rat tissues followed by large-scale phosphoproteomic profiling to identify phosphopeptides (see methods and Fig. 1). After data processing and background removal, the resulting phosphopeptide list serves as input to PhosphoLogo, software we created that uses an information-theoretic algorithm and χ2-significance filtering to visualize kinase substrate preferences as a single sequence logo (Fig. 2). Each residue's height in the logo is proportional to how overrepresented that residue is relative to its background frequency in that particular position.
PKA analysis and method evaluation.
We optimized the method using cAMP-dependent protein kinase A catalytic subunit α (referred to as “PKA” throughout) as our model kinase, which was used to carry out in vitro phosphorylation in a protein mixture from rat kidney, brain, liver, and small intestine. After subtracting background peptides (i.e., endogenous phosphopeptides that remain after phosphatase treatment), we identified 934 unique in vitro substrate peptides for PKA (Supplemental Data Set S2 at http://helixweb.nih.gov/PhosphoLogo/Data_File_S2.xls) that we used as input to PhosphoLogo to create a sequence logo (Fig. 3A). For all logos presented, residues in the phosphorylated position are assigned reference probabilities to account for the fact that only S, T, and Y are found in this position. Labeling the phosphorylated position as position 0, the information content in positions −5 to +5 exceeds that of the more distant positions consistent with the known range of interaction between substrate peptides and the PKA binding cleft based on structural studies (17, 20, 37).
To provide a statistical filter, we incorporated χ2-testing in PhosphoLogo, which calculates a P-value for each amino acid in each position and displays the resulting P-values as a line graph (Fig. 3B). In PhosphoLogo, the user defines a significance threshold (α) by which to filter the residue P-values. We set α = 0.05/20 = 0.0025 to correct for multiple testing (Bonferroni criterion, 20 amino acids).
Figure 3A shows the resulting χ2-filtered logo for PKA (α = 0.05/20 = 0.0025 to correct for multiple testing using Bonferroni criterion for 20 amino acids), shortened to 13 positions. Note PKA's preferences for basic amino acids (R and K) at −3 and −2, glycine (G) at −1, aliphatic amino acids (L, I, V, M) at +1 and +2, and acidic amino acids (D and E) at +3. Comparing this sequence logo to a compilation of PKA's preferences in the Human Protein Reference Database (HPRD, URL: http://www.hprd.org/serine_motifs) reveals that preferences for the basic amino acids at −3 and −2, aliphatic amino acids at +1 and +2, and acidic amino acids at +3 had been identified previously, but the glycine preference at −1 had not been reported. We also compared our PKA logo to logos generated from data produced using two other approaches: compilation of known peptide substrates from literature and combinatorial peptide array assays. Comparing the logo generated from our MS data (Fig. 4A) to a logo generated using PhosphoLogo from data downloaded from PhosphoSitePlus (URL: http://www.phosphosite.org/substrateSearchViewAction.do?id=1021&type=Protein), consisting of 959 peptide substrates (Fig. 4B and Supplemental Data Set S3 at http://helixweb.nih.gov/PhosphoLogo/Data_File_S3.xls), reveals that the two logos are quite similar. However, the −3 and −2 basic amino acid preferences in the logo generated from the PhosphoSitePlus data are exaggerated relative to all other positions, compared with the logo from our MS data. In contrast to logos from our method and the compilation method, the PKA logo generated using a combinatorial peptide array assay (Fig. 4C) from the NetPhorest Kinase Motif Atlas (23) shows only PKA's preferences for basic residues at the −3 and −2 positions. The alternative preference for K or R at position −2 is well documented in the literature (see HPRD: http://www.hprd.org/browse/serine_motifs).
The dependence of kinase specificity on residue charge or size was explored using PhosphoLogo's ability to group amino acids based on these characteristics. We applied the two grouping options to our PKA data and filtered with α = 0.05/4 = 0.0125 (to correct for multiple testing using Bonferroni criterion for 4 amino acid groups). The charge-grouped logo (Fig. 5A) further emphasizes PKA's strong preference for basic residues at the −3 and −2 positions, strong preference for aliphatic residues at +1, and weaker preferences for basic residues at −5 and −4, aliphatic residues at +2, and acidic residues at +3 to +5. The size-grouped logo (Fig. 5B) shows PKA's preference for large residues at −3 and −2, a small residue at −1, and mid-sized residues in positions downstream from the phosphorylated residue. The biological significance of these observed preferences in terms of PKA's crystal structure is addressed in the discussion.
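Grouping and the corresponding Bonferroni threshold can be sketched as follows. The four-way grouping shown is hypothetical: the text states that four groups were used but does not list the default PhosphoLogo group memberships, so the assignment below is an assumption for illustration only.

```python
from collections import Counter

# Hypothetical four-way grouping; the actual default PhosphoLogo groups
# are not listed in the text.
CHARGE_GROUPS = {
    "basic": "KRH",
    "acidic": "DE",
    "aliphatic": "AVLIM",
    "other": "STYNQCGPFW",
}
RESIDUE_TO_GROUP = {aa: name for name, members in CHARGE_GROUPS.items() for aa in members}


def grouped_frequencies(peptides, position_index):
    """Observed fraction of each residue group at one column (0-based index)."""
    counts = Counter(RESIDUE_TO_GROUP[p[position_index]] for p in peptides)
    return {group: count / len(peptides) for group, count in counts.items()}


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni criterion."""
    return alpha / n_tests
```

With four groups, `bonferroni_alpha(0.05, 4)` gives the 0.0125 threshold used here, and with 20 individual amino acids it gives 0.0025.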
In addition to the information content logo, another output of PhosphoLogo is the frequency plot, which shows the observed relative frequencies of each amino acid (or group) in each position. The frequency plot for PKA (Fig. 5C) shows the same overrepresented residues as the information content logo, but also emphasizes the fact that while particular amino acids are favored in certain positions, the presence of a given residue in any given position is not an absolute requirement for a particular peptide to be phosphorylated (see discussion).
PhosphoLogo can also generate a disfavored residue logo, or “anti-logo,” which is calculated using the complements of the residues' observed and reference probabilities (see methods). Several studies have concluded that the lack of particular amino acids in fixed positions may be as important in determining protein kinase substrate specificity as the presence of other amino acids in those positions (1, 5, 31, 39). Thus, the disfavored-residue information may aid in eliminating particular kinase families when trying to identify the protein kinase responsible for a specific phosphorylation event. The filtered anti-logo for PKA (Fig. 5D) shows that acidic and certain aliphatic residues are disfavored in the −3 to −1 positions and that basic residues are disfavored in the +1 to +5 positions, and it confirms that threonines and tyrosines are disfavored acceptors of PKA phosphorylation.
CK2 and p38 analyses.
We tested whether the method was more broadly applicable by profiling kinases from two other major protein kinase classes: acidophilic and proline-directed. Specifically, we profiled casein kinase II (CK2, acidophilic), consisting of a constitutively active α2β2 heterotetramer, and p38α (proline-directed). Like those of PKA, their substrate preferences have already been reported in the literature based on other methodologies. Using the MS-based method, we identified 335 unique phosphorylation sites targeted by CK2 (Supplemental Data Set S4 at http://helixweb.nih.gov/PhosphoLogo/Data_File_S4.xls). The phosphopeptide sequences were analyzed using PhosphoLogo, with the resulting χ2-filtered sequence logo shown in Fig. 6A (α = 0.0025). The CK2 logo shows the expected preference for acidic amino acids (D and E) at the +1 and +3 positions, and also shows a weak G/R preference at −3 and an R/Q preference at −6. The logo generated in PhosphoLogo using PhosphoSitePlus's 538 peptide substrates (Fig. 6B) also shows the strong acidic preferences at the +1 and +3 positions and weaker preferences at other positions. In contrast, the Kinase Motif Atlas CK2 logo generated using data from a combinatorial array assay (23) (Fig. 6C) shows the acidic preferences at surrounding positions, but no other preferences.
Using the MS-based method, we identified 587 unique phosphorylation sites targeted by p38 (Supplemental Data Set S5 at http://helixweb.nih.gov/PhosphoLogo/Data_File_S5.xls). The phosphopeptide sequences were analyzed in PhosphoLogo, resulting in the filtered logo in Fig. 6D (α = 0.0025). The logo shows the expected proline-directed motif, with a strong proline (P) preference at +1 and a weaker proline preference at −2. There is also a preference for alanine (A) at +1, although it is weaker than that for proline, which may have ramifications for mutational analysis of kinase targets. The logo shows that p38 preferentially phosphorylates threonine (T) residues. According to HPRD, there is prior evidence for the proline preference at +1 and the preferential phosphorylation of threonine, but no prior evidence for the alanine preference at +1. The logo from PhosphoSitePlus's 164 known peptide substrates (Fig. 6E) shows a strong proline preference at the +1 position and a weak aliphatic preference at −1. However, the logo derived from PhosphoSitePlus data does not capture p38's preference for alanine at +1 or proline at −2, or p38's preferential phosphorylation of threonine residues. We also compared our p38 logo to the logo from the Kinase Motif Atlas (23) (Fig. 6F). This logo shows p38's proline preferences at −2 and +1 and a very weak aliphatic preference at −1. In addition, specificity of MAP kinases is in part determined by the presence of docking motifs that occur at variable distances from the phosphorylation site (6). A search for these targeting motifs [either F-X-F-P (6) or (R/K)-X-X-X-X-L-X-L (36)] in the list of proteins found to be phosphorylated by p38 showed their presence in 122 out of 587 target proteins (Supplemental Data Set S5 at http://helixweb.nih.gov/PhosphoLogo/Data_File_S5.xls).
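A docking-motif search of this kind can be expressed with two regular expressions, where X is taken to mean any single amino acid. The sketch below is an illustration with hypothetical function names; it is not the search procedure actually used to obtain the reported counts.

```python
import re

# F-X-F-P and (R/K)-X-X-X-X-L-X-L, with X matching any one residue.
DOCKING_PATTERNS = {
    "F-X-F-P": re.compile(r"F.FP"),
    "(R/K)-X-X-X-X-L-X-L": re.compile(r"[RK].{4}L.L"),
}


def docking_motifs(protein_seq):
    """Return the names of the docking motifs found anywhere in protein_seq."""
    return [name for name, pattern in DOCKING_PATTERNS.items()
            if pattern.search(protein_seq)]


def count_with_docking(proteins):
    """Number of sequences that contain at least one docking motif."""
    return sum(bool(docking_motifs(seq)) for seq in proteins)
```

Because both patterns are short, some matches are expected by chance in any large protein set, so a count from such a search is best interpreted against a suitable background.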
In additional experiments, we tested the effect of heat denaturation of proteins on the logos obtained for PKA, CK2, and p38 (see methods). In the absence of heat denaturation, the logos obtained were nearly identical, despite a lower yield of phosphorylated peptides (data not shown). The logos were also essentially unaffected by changes in the type of residue reference probability used. While the logos presented in results were generated from reference probabilities calculated from the input peptide sequences, nearly identical logos were generated using rat proteome or equiprobable reference probabilities (data not shown), pointing to the robustness of the method, particularly when a large number of input peptide sequences is used.
Analyses of additional kinases.
Following the initial PKA, CK2, and p38 experiments, we profiled 13 additional protein kinases representative of various classes (Fig. 7 and Supplemental Data Set S6 at http://helixweb.nih.gov/PhosphoLogo/Data_File_S6.xlsx). These kinases were selected based on their expression in the renal distal nephron and the collecting duct and their potential involvement in the vasopressin-regulated signaling network resolved by Hoffert et al. (9). The kinases consisted of members of the AGC family (AKT1, SGK, and PKCδ), CAMK family (CaMK2δ, DAPK1, MAPKAPK2, PKD3, and PIM1), STE family (OSR1 and STK39 [also known as SPAK]), CMGC family (GSK3β), and Wnk family (Wnk1 and Wnk4).
The relationship among these kinases is shown by the dendrogram in Fig. 7, which maps the kinases according to sequence similarities in their kinase domains. This clustering effectively delineates the represented kinase classes, as is evident from the proximity of kinases within the same class on the dendrogram. The substrate preferences commonly associated with the various kinase classes (e.g., basophilic region upstream of phosphosite in AGC class and proline in +1 position relative to phosphosite in CMGC class) appear in many of the corresponding logos generated with PhosphoLogo. Interestingly, GSK3β, a kinase generally viewed as requiring a phosphorylated residue four amino acids COOH-terminal to the target serine (S-X-X-X-pS, where pS is a phosphoserine) for site-specific phosphorylation (8), displays a proline-directed motif similar to p38, the other kinase represented in the CMGC class (see discussion). Although the logos are most often looked to for kinase substrate information, the anti-logos may also provide additional criteria for kinase substrate specificity, such as the disfavored basic residues downstream of the phosphorylation site in many kinases in the AGC and CAMK classes, including AKT1, SGK, CaMK2δ, DAPK1, MAPKAPK2, and PIM1 and, as previously described, PKA. Reference data extracted from PhosphoSitePlus indicating in vivo/in vitro evidence for many of the kinase-substrate pairs found in this study are also reported in Supplemental Data Set S6 (http://helixweb.nih.gov/PhosphoLogo/Data_File_S6.xlsx).
Here, we introduce a new methodology using protein mass spectrometry-based phosphoproteomic techniques and specially designed software to identify substrate positional amino acid preferences for individual protein kinases. Another group, Huang et al. (12), described a similar mass spectrometry technique but sought to identify candidate substrates on a smaller scale (61 PKA substrate peptides and 12 PKG substrate peptides) rather than to systematically determine kinase preference motifs. We profiled 16 well-studied protein kinases and compared the resulting motifs to motifs determined by other methods to confirm the validity of the mass spectrometry-based method. We created custom software, PhosphoLogo, to visualize the motifs derived from the MS-identified phosphopeptides. We have made PhosphoLogo available as a web application (URL: http://helixweb.nih.gov/PhosphoLogo/).
From this study, we conclude that the described methodology reproduces the general characteristics of the established substrate preference motif for PKA, as well as the other kinases, and complements other profiling approaches previously used in obtaining target sequence preferences (see below). The newly introduced online software tool, PhosphoLogo, allowed us to calculate both position-dependent amino acid preferences and “anti-preferences” for multiple protein kinases with potential roles in regulation of epithelial transport. The introduction of anti-preference profiling, which reveals disfavored amino acids in particular positions, is anticipated to expand our ability to predict which protein kinase is responsible for a given phosphorylation event beyond the present level. However, a crucial caveat remains: because of the potential overlap in target preferences among members of the 518-member protein kinase family, the preferences and anti-preferences derived from this mass spectrometry-based approach are unlikely to allow for unequivocal protein kinase assignment from a single target sequence alone. Nevertheless, protein kinase substrate preference profiling using this approach is likely to be useful both in systems biology studies of signaling networks and in gaining insight into protein kinase-substrate structural interactions. The remainder of the discussion explores these general conclusions in greater detail.
A complement to existing profiling approaches.
The approach introduced in this paper adds to previously existing techniques for profiling protein kinase substrate specificities, namely, combinatorial peptide array assays (24, 35) and curation of kinase peptide substrates as determined by individual reductionist studies (2). Our new approach and the two previous approaches could be viewed as complementary; each has its own advantages and limitations. Based on our analysis of well-studied protein kinases (PKA, CK2, and p38), it appears that the three approaches identify similar motifs for a given kinase. However, our method has several key advantages: 1) scalable sensitivity, which allows potentially greater resolution versus combinatorial peptide array assays by using diverse tissue samples as starting material, 2) the lack of dependence on radiolabeling as used in combinatorial peptide array assays, 3) the lack of expectation bias seen with the curation method, 4) the ability to carry out statistical filtering providing a quantitative means of recognizing true signal amid background noise, 5) the use of more representative substrates (a heterogeneous mixture of proteins instead of synthetic peptides) that could result in a more physiologically relevant motif, and 6) the unbiased identification of novel candidate in vivo substrates for follow-up studies. On the other hand, our method has potential limitations relative to the other methods, namely, the associated mass spectrometry costs and required expertise. Also, our method could potentially incorporate biases owing to selectivity of the chromatography used to isolate phosphopeptides (IMAC vs. TiO2) or owing to the generation of tryptic peptides too short or too long to detect. These biases could potentially be reduced by varying the chromatography or the protease used. In general, any method used for kinase substrate motif profiling will be limited by the lack of availability of many kinases in active form. Furthermore, kinase incubation assays may need to be optimized for less efficient or more selective kinases to achieve sufficient signal-to-noise.
PhosphoLogo features providing additional preference information.
The PhosphoLogo software uses an information-theoretic algorithm to determine the favored amino acids in each position around the phosphorylated residue and plots them in sequence logo format. This program is in some ways similar to enoLOGOS (32), which also uses an information-theoretic algorithm. However, to deal specifically with kinase target profiling data in this study, we have implemented the following features not present in enoLOGOS: 1) versatile grouping of amino acids according to charge or side-chain size to probe the basis of site dependency; 2) independent assignment of amino acid background frequency at position 0, the phosphorylation site, accounting for the fact that phosphoproteomic site assignment programs typically allow assignment of phosphorylation only to a serine, threonine, or tyrosine; 3) statistical filtering of data using χ²-testing; and 4) generation of information-based disfavored-residue “anti-logos” to visualize underrepresented amino acids in each position.
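The idea of position-specific preferences and anti-preferences can be illustrated with a minimal Python sketch. It is not the PhosphoLogo algorithm, whose exact formula is not reproduced here; it simply computes a log2 ratio of observed to background frequency at each position, with a separate background at position 0. The function name `log_enrichment`, the pseudocount, the uniform flanking background, and the toy windows are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_enrichment(sequences, backgrounds, pseudocount=0.5):
    """Return one {residue: log2(observed / background)} dict per position.

    Positive values suggest a preferred residue (logo); negative values
    suggest a disfavored residue (anti-logo). Residues whose background
    frequency is zero at a position are skipped at that position.
    """
    n = len(sequences)
    profile = []
    for pos, bg in enumerate(backgrounds):
        counts = Counter(seq[pos] for seq in sequences)
        allowed = [aa for aa in AMINO_ACIDS if bg.get(aa, 0.0) > 0.0]
        denom = n + pseudocount * len(allowed)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / denom) / bg[aa])
            for aa in allowed
        })
    return profile


# Toy aligned windows covering positions -3..+1 (hypothetical, not study data).
windows = ["RRASL", "RRGSV", "RKGSI", "KRASL"]
uniform = {aa: 1 / 20 for aa in AMINO_ACIDS}        # assumed flanking background
site_only = {"S": 1 / 3, "T": 1 / 3, "Y": 1 / 3}    # assumed position-0 background
backgrounds = [uniform, uniform, uniform, site_only, uniform]

profile = log_enrichment(windows, backgrounds)
print(profile[0]["R"] > 0, profile[0]["P"] < 0)
```

In a real analysis the flanking background would presumably come from the proteome or from the input protein mixture rather than a uniform distribution, and statistical filtering would be applied before any value is interpreted.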
Assignment of kinase to a phosphorylation event.
A major objective of this study was to use our mass spectrometry-based approach to generate substrate motif logos that can be used for prediction of the protein kinase responsible for phosphorylation at a given site in a particular protein. Such information has been a key component of research aimed at identifying signaling networks at a systems biology level (18, 19, 23, 34). However, as already pointed out in prior studies (18), in most cases additional types of information are needed to unequivocally assign a protein kinase to a phosphorylation event, viz. factors that predict whether or not a kinase and its putative substrate are colocalized and can physically interact in the same subcompartment of the same cells. As an extreme example, if a given protein kinase is not expressed in the cell of interest, it can be eliminated as a candidate to phosphorylate a protein in that cell, even if the site is a perfect match with the derived target preference. [Rat inner medullary collecting ducts appear to express only 200 of the 518 protein kinases coded by the genome (30).] Beyond this, a protein kinase present only in a given subcellular compartment is not a good candidate to phosphorylate a protein limited to another compartment. Hence, to assign kinases to a particular phosphorylation network like that described for vasopressin-signaling by Hoffert et al. (9), it will be necessary to develop cell-specific databases of protein-protein associations derived from immunocytochemical localization studies and/or large-scale MS-based binding interaction studies. The computational resource called NetworKIN (19) offers one way to integrate localization information with kinase substrate specificities, although not in a cell type-specific manner.
Part of the problem is that a given phosphorylation site can often be phosphorylated by multiple kinases. A familiar example is the transcription factor CREB (28), which can be phosphorylated at the regulatory site Ser133 by at least three different basophilic kinases: protein kinase A, p90 ribosomal S6 kinase, and Ca2+/calmodulin-dependent protein kinases. Another example derives from the data presented in this paper: Ser256 of the vasopressin-regulated water channel aquaporin-2 is known to be phosphorylated by protein kinase A, but in this study it could also be phosphorylated by two other basophilic kinases, Akt1 and protein kinase Cδ (Supplemental Data Set 6 at http://helixweb.nih.gov/PhosphoLogo/Data_File_S6.xlsx).
An additional factor limiting our ability to assign a protein kinase to a particular phosphorylation site is evident from our mass spectrometry data: none of the kinases tested showed a high level of fidelity. Although the sequences surrounding most of the phosphorylation sites resulting from incubation with a given kinase generally overlap the sequence preference logo reported, few of them correspond exactly to the preference logo at every amino acid position (Supplemental Data Set 6 at http://helixweb.nih.gov/PhosphoLogo/Data_File_S6.xlsx). An important example is PKA. The oft-quoted target sequence preference for PKA is R-R-X-S. In this study we confirm previous observations (summarized at http://www.hprd.org/serine_motifs) that there is an alternative preference for a different basic (positively charged) amino acid, lysine (K), at position −2. An example that is relevant to the problem of transport regulation in the kidney is the demonstration that the bumetanide-sensitive Na-K-Cl cotransporter of the thick ascending limb (Slc12a1) is phosphorylated by PKA at Ser874, a site with a lysine at position −2 (7).
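The difference between an exact motif match and a relaxed one can be made concrete with a short Python sketch. The two regular expressions encode only the R-R-X-S motif and the lysine alternative at position −2 mentioned above; the function name `fraction_matching` and the toy windows are hypothetical and do not reproduce data from this study.

```python
import re

STRICT = re.compile(r"RR.S")      # R-R-X-S
RELAXED = re.compile(r"R[RK].S")  # also allows K at position -2


def fraction_matching(windows, pattern):
    """Fraction of windows whose residues at -3..0 match the pattern.

    Each window is assumed to start at position -3 relative to the
    phosphorylated residue.
    """
    hits = sum(1 for w in windows if pattern.match(w))
    return hits / len(windows)


# Hypothetical windows covering positions -3..+1.
toy_windows = ["RRASL", "RKGSI", "KRASL", "LRGSV"]
print(fraction_matching(toy_windows, STRICT))
print(fraction_matching(toy_windows, RELAXED))
```

Applied to a real list of identified sites, a comparison of this kind would give a rough sense of how many sites a rigid consensus sequence misses.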
Again, these findings are consistent with conclusions made by Linding et al. (18) indicating that motif specificity accounts for only a fraction of the information that determines which kinase phosphorylates a given substrate. The addition of information about “anti-preferences,” amino acids disfavored in particular positions, may improve the predictions that can be made, but it remains unlikely that specific protein kinases can be assigned to all phosphorylation sites in a given cell type without added localization information.
Nevertheless, when trying to determine the kinase responsible for a given phosphorylation event, the number of possibilities can be narrowed down to particular protein kinase subfamilies based on the general characteristics of the target sequence logo. Most of the serine-threonine kinases fall into three main groups (Fig. 7): basophilic kinases, which target sequences with basic amino acids (arginines and lysines) upstream from the phosphorylation site; acidophilic kinases, which target sequences with acidic amino acids (glutamic acid or aspartic acid) downstream from the phosphorylation site; and proline-directed kinases, which target sequences with prolines in position +1 and/or −2 from the phosphorylated amino acid (7, 16, 25). Here, basophilic kinases correspond to members of the AGC family or the calmodulin-dependent (CAMK) family of kinases, proline-directed kinases are usually members of the MAP kinase or cyclin-dependent kinase families (CMGC group), and acidophilic kinases are generally members of the casein kinase family. Beyond this, observing a proline in the +1 position tends to exclude AGC and CAMK kinases (39), an observation supported by our analysis of disfavored amino acids in this position for PKA, a member of the AGC kinase family (Fig. 5D).
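The three-group narrowing described above can be sketched as a simple rule-based function in Python. This is only an illustration under stated assumptions: the function name `coarse_class`, the positions inspected (−3, −2, and +1 to +3), and the order in which the rules are applied are choices made here, not rules derived in this study. Proline at +1 is tested first because it tends to exclude the basophilic AGC and CAMK kinases.

```python
BASIC = set("RK")
ACIDIC = set("DE")


def coarse_class(window, center):
    """Assign a sequence window to a broad kinase class.

    `window` is the amino acid sequence and `center` is the index of the
    phosphorylated residue within it.
    """
    def at(offset):
        # Residue at the given offset from the site, or "" if out of range.
        i = center + offset
        return window[i] if 0 <= i < len(window) else ""

    if at(1) == "P":
        return "proline-directed"
    if at(-2) in BASIC or at(-3) in BASIC:
        return "basophilic"
    if any(at(k) in ACIDIC for k in (1, 2, 3)):
        return "acidophilic"
    return "unclassified"


# Hypothetical windows, for illustration only.
print(coarse_class("LRRASLG", 4))
print(coarse_class("PLSPEKS", 2))
print(coarse_class("AGSDEEE", 2))
```

A classification of this kind only narrows the candidates to a subfamily; it does not identify an individual kinase.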
One interesting finding with regard to the general classification of kinases is that glycogen synthase kinase 3β (GSK3β) was classified with the proline-directed group of kinases, both in terms of the dendrogram relating its sequence to other kinases and in terms of the derived target sequence preference logo (Fig. 7). This proline-directed nature contrasts with the target sequence preference exemplified by GSK3β-dependent phosphorylation of β-catenin or glycogen synthase in which GSK3β phosphorylates serines (or threonines) four amino acids upstream from another phosphoserine (S/T-X-X-X-pS/pT where pS/pT is a previously phosphorylated serine or threonine) resulting in a cascade of phosphorylation events (4). A list of the substrates of GSK3β on a curated database (PhosphositePlus, http://www.phosphosite.org/substrateSearchViewAction.do?id=988&type=Protein) revealed a large number of targets with proline in position +1 consistent with our results. It seems possible that the substrate specificity can switch from S/T-X-X-X-pS/pT to S/T-P depending on phosphorylation of GSK3β at one of the several known phosphorylation sites in the molecule (33). A logical possibility would be Tyr216, the tyrosine in the equivalent position to that required for activity of mitogen-activated protein (MAP) kinases. In MAP kinases the equivalent site is the target for regulation by MAP kinase kinases (MEK homologs) (13).
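The two GSK3β target patterns discussed above, S/T-X-X-X-pS/pT and S/T-P, can be distinguished with a few lines of Python. The sketch below is illustrative only: the function name `gsk3_modes` and the convention of writing already phosphorylated residues in lowercase are assumptions introduced here, and the example sequences are hypothetical.

```python
def gsk3_modes(seq, i):
    """List the GSK3β-style patterns that the residue at index i could fit.

    Lowercase 's' or 't' marks a residue that is already phosphorylated
    (an illustrative convention, not one taken from the study).
    """
    modes = []
    if seq[i] not in "ST":
        return modes
    # Primed pattern: phosphoserine or phosphothreonine four residues downstream.
    if i + 4 < len(seq) and seq[i + 4] in "st":
        modes.append("primed")
    # Proline-directed pattern: proline at position +1.
    if i + 1 < len(seq) and seq[i + 1] == "P":
        modes.append("proline-directed")
    return modes


# Hypothetical sequences, for illustration only.
print(gsk3_modes("ASVPPsPG", 1))
print(gsk3_modes("AASPLK", 2))
```

A site can in principle satisfy both patterns at once, which is why the function returns a list rather than a single label.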
Two additional protein kinases whose target sequence preference logos have proline in position +1 are the WNK proteins, Wnk1 and Wnk4. In contrast to GSK3β, these two kinases do not show a high degree of sequence similarity to the classical proline-directed kinases (MAP kinases and cyclin-dependent kinases).
Chemical and structural properties of protein kinases.
Aside from its possible role in systems biology studies of signaling pathways, protein kinase substrate preference profiling is useful for investigation of the properties of the protein kinases themselves. The presence of amino acid positional preferences is indicative of steric and charge constraints at the binding cleft of the protein kinase molecule. A good example derives from an examination of the target sequence preference found for protein kinase A (Fig. 4) vis-à-vis the published three-dimensional structure of murine protein kinase A (38) [Protein Data Bank (PDB): 1ATP] (illustrated in Fig. 8). Here, we have mapped the substrate preferences determined in this study to the structure of the PKA catalytic subunit with the docked inhibitor peptide from the regulatory subunit of PKA. PKA has polar and charged surface groups (oxygens colored red) that form favorable electrostatic interactions with the peptide's arginines at −6, −3, and −2 (nitrogens blue). The large aliphatic residue in position +1 fits into a hydrophobic pocket on the kinase. The preferred acidic amino acid in position +3 (oxygens red) interacts favorably with lysine (blue) on the kinase [see Johnson et al. (17) Table 4, which lists these interacting PKA surface residues]. Profiling PKA using our method demonstrated a preference for glycine at the −1 position, which to our knowledge is a novel observation. A glycine preference was also seen for two other basophilic kinases, viz. SGK and PKCδ (Fig. 7). The enhanced peptide chain flexibility bestowed by glycine likely helps to ensure optimal positioning of the phospho-acceptor serine at the zero position for phosphate transfer as well as positioning of the nearby up- and downstream peptide residues. Masterson et al. (22) concluded that formation of the kinase-peptide-ATP complex is entropically driven (determined using a peptide with alanine at P-1); presumably a glycine would also promote a “dynamically loose” complex and the mechanism of “conformational selection rather than an induced-fit mechanism.” In fact, many independent structural studies have demonstrated the importance of substrate flexibility for phosphorylation to occur (31). Evidence that further supports our observation is that a glycine is also present at the −1 position relative to the pseudophospho-acceptor site in the pseudophosphorylation motif of PKA type I-α and -β regulatory subunits (RRGAI and RRGGV, respectively, pseudophospho-acceptor positions underlined. Note the overall similarity of these sequences to the derived preference motif in this study). The pseudophosphorylation motif is known to naturally occupy the catalytic site of PKA catalytic subunits during their inactive state. [Similar pseudosubstrate or autophosphorylated inhibitor peptides are present in PKCα (RKGAL) (11) and in CaMK2α or δ (HRQETVD) (3), respectively]. Overall, our sequence preference motif for PKA is not only consistent with other kinase profiling methods, but is consistent with constraints derived from the known PKA structure.
To recapitulate, we have developed a mass spectrometry-based approach to determine the positional amino acid target substrate preferences for particular protein kinases. To support the method, we have developed PhosphoLogo, a computer program that summarizes site-specific preferences in terms of sequence logos and anti-logos. The results presented are consistent with known preferences and provide more positional information than was previously available for the 16 protein kinases profiled. Because the conditions for the kinase incubation must be optimized for each protein kinase of interest, it may be impractical to carry out kinome-wide profiling in a single broad study with a standard set of conditions. However, applied progressively across the kinome, the method promises to aid substantially in elucidation of signaling networks and investigation of protein kinase-peptide interactions.
No conflicts of interest, financial or otherwise, are declared by the author(s).
J.D., R.G., J.D.H., M.A.K., and T.P. conception and design of the research; J.D. and R.G. performed the experiments; J.D., R.G., D.B., F.S., P.J.S., M.A.K., and T.P. analyzed the data; J.D., R.G., D.B., F.S., J.D.H., P.J.S., M.A.K., and T.P. interpreted the results of the experiments; J.D., R.G., D.B., F.S., P.J.S., and T.P. prepared the figures; J.D., M.A.K., and T.P. drafted the manuscript; J.D., D.B., J.D.H., P.J.S., M.A.K., and T.P. edited and revised the manuscript; J.D., R.G., D.B., F.S., J.D.H., P.J.S., M.A.K., and T.P. approved the final version of the manuscript.
A portion of the data in this paper was presented as part of the Davson Lecture of the American Physiological Society at the Experimental Biology 2012 Meeting in San Diego, California. We thank Guanghui Wang of the National Heart, Lung, and Blood Institute Proteomics Core Facility (Marjan Gucek, Director) for assistance with the mass spectrometry. Steven Shaw provided valuable early advice regarding protein kinase specificities. Thomas Schneider provided advice on construction of the sequence logo algorithm. Chou-Long Huang provided reagents and advice on Wnk1 and Wnk4 kinases. The study was carried out in the intramural program of the National Heart, Lung and Blood Institute (Project ZO1-HL001285, M. A. Knepper) and the Center for Information Technology (Project CT000265-16, P. J. Steinbach).
1 This article is the topic of an Editorial Focus by Ewout J. Hoorn and Marcel E. Meima (10a).