BioInfoBank Institute

The RPSP program detects signal peptides in proteins. The method is based on neural networks trained on short sequence fragments of proteins extracted from the Swiss-Prot database. The RPSP is able to deal separately with prokaryotic and eukaryotic sequences.

Rapid Prediction of Signal Peptides - RPSP

Paste your amino acid sequence here:

or upload a FASTA format file:

This sequence is of prokaryotic / eukaryotic / undetermined origin.

E-mail results to this address (optional): E-mail subject:

Examples

Download RPSP

Training Datasets

Click here to download the training datasets.

The protein sequence information used to create the datasets was acquired from the Swiss-Prot database. Both the positive set, containing sequences with experimentally verified signal peptides, and the negative set, containing sequences of cytoplasmic and nuclear proteins, were split into two subsets: eukaryotic and prokaryotic sequences. All unusual, controversial, and extremely short or long sequences were excluded from the training datasets. Redundancy in the datasets was then reduced to the 60% sequence identity level using the CD-HIT clustering tool. Because the negative sets were much bigger than the positive sets, the negative datasets were reduced to approximately the size of the positive datasets to avoid bias during training and testing of the neural networks. All datasets were split in a 5:1 proportion into training and testing sets.
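The balancing and 5:1 split described above can be sketched as follows. This is a minimal illustration, not the code used to build the RPSP datasets: the function name `balance_and_split`, the random downsampling strategy, and the fixed seed are all assumptions, and the CD-HIT redundancy reduction is presumed to have been run beforehand.

```python
import random


def balance_and_split(positives, negatives, ratio=5, seed=0):
    """Downsample negatives to the positive set size, then split each set ratio:1.

    Hypothetical helper: random downsampling and shuffling are assumed here;
    the source does not say how sequences were selected.
    """
    rng = random.Random(seed)

    # Reduce the (larger) negative set to the size of the positive set.
    if len(negatives) > len(positives):
        negatives = rng.sample(list(negatives), len(positives))

    def split(items):
        items = list(items)
        rng.shuffle(items)
        # With ratio=5, one sixth of the data goes to the test set.
        n_test = len(items) // (ratio + 1)
        return items[n_test:], items[:n_test]

    pos_train, pos_test = split(positives)
    neg_train, neg_test = split(negatives)
    return (pos_train, neg_train), (pos_test, neg_test)


if __name__ == "__main__":
    pos = [f"pos_{i}" for i in range(60)]
    neg = [f"neg_{i}" for i in range(600)]
    (p_tr, n_tr), (p_te, n_te) = balance_and_split(pos, neg)
    print(len(p_tr), len(n_tr), len(p_te), len(n_te))
```

Splitting the positive and negative sets separately keeps both classes at the same 5:1 proportion in the training and testing sets.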

Supplementary Materials

About

The RPSP program facilitates rapid identification of signal peptides in proteins. The predictor is based on neural networks trained on short sequence fragments of proteins extracted from the Swiss-Prot database. The method is able to deal separately with prokaryotic and eukaryotic sequences. Its accuracy is comparable to that of other prediction tools. Because of its high speed and portability, the method can easily be applied to genome-wide data sets.

  • The RPSP program uses two feed-forward, multi-layer neural networks trained with the back-propagation learning algorithm. The first network determines whether a given residue belongs to the signal peptide. Its input is a symmetric sliding window of 27 amino acids for eukaryotes and 19 amino acids for prokaryotes. The output layer is a single neuron that provides the S-score of the prediction. The second neural network recognizes cleavage sites (the first residue of the mature protein, i.e. position +1). Its input is an asymmetric sliding window of 20+4 residues for eukaryotes and 21+3 residues for prokaryotes. The single neuron in its output layer provides the C-score of the prediction. Discrimination (signal peptide / non-signal peptide) and cleavage site prediction are based on the S-score and the C-score.
  • The performance analysis was conducted on an independent test set that was not used during the learning procedure. The results are comparable with those of other tools such as SignalP 3.0 (Bendtsen et al., 2004). An additional advantage of RPSP is its high efficiency in predicting signal peptides in protein sequences without specifying their source. On a random sequence dataset, the neural networks designed for both eukaryotic and prokaryotic proteins gave much better results than the neural networks trained only on eukaryotes or only on prokaryotes.
  • The method is very fast. It analyzes our full benchmark set of 959 proteins in about 2 seconds on a Linux machine with a 2 GHz CPU and 512 MB of RAM.
  • Availability: The RPSP web service is integrated into the GRDB2 gene relational database (ver. 2.0), and a local version can be downloaded from the web page.
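
The sliding windows that feed the two networks described in the list above can be sketched as follows. This is a rough illustration under stated assumptions, not the RPSP implementation: the helper names are hypothetical, positions beyond the sequence termini are padded with `X` (the source does not say how termini are handled), and "20+4" is read as 20 residues upstream of the candidate +1 position plus 4 residues starting at +1.

```python
PAD = "X"  # assumed padding symbol for positions outside the sequence


def window(seq, center, left, right, pad=PAD):
    """Return `left` residues before `center`, the residue at `center`,
    and `right` residues after it, padding where the window leaves the sequence."""
    return "".join(
        seq[i] if 0 <= i < len(seq) else pad
        for i in range(center - left, center + right + 1)
    )


def symmetric_windows(seq, size):
    """One window per residue, centred on it (input to the S-score network)."""
    if size % 2 == 0:
        raise ValueError("a symmetric window needs an odd size")
    half = size // 2
    return [window(seq, i, half, half) for i in range(len(seq))]


def cleavage_windows(seq, before, after):
    """One window per candidate +1 position (input to the C-score network).

    Assumed layout: `before` residues upstream of position i, then `after`
    residues starting at i itself.
    """
    return [window(seq, i, before, after - 1) for i in range(len(seq))]


if __name__ == "__main__":
    seq = "MKTAYIAKQR"  # made-up fragment, for illustration only
    s_windows = symmetric_windows(seq, 27)    # eukaryotic S-score window
    c_windows = cleavage_windows(seq, 20, 4)  # eukaryotic C-score window
    print(len(s_windows[0]), len(c_windows[0]))
```

For prokaryotic sequences the same helpers would be called with a window size of 19 and a 21+3 layout. Each window would still have to be encoded numerically before being passed to a network; the source does not describe that encoding.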

This work was supported by the EC within the BioSapiens (LHSG-CT-2003-503265) and GeneFun (LSHG-CT-2004-503567) 6FP projects, and by the Polish Ministry of Education and Science (PBZ-MNiI-2/1/2005 and 2P05A00130).