Signal peptides (SPs) are short peptide chains located at the N-terminus of proteins, primarily responsible for guiding newly synthesized proteins to their correct cellular compartments or directing their secretion out of the cell. In Bacillus subtilis, the secretion efficiency conferred by the signal peptide is crucial for the production of recombinant proteins. Although several studies have explored the factors that influence signal peptide secretion efficiency, efficient and accurate tools for predicting the secretion efficiency of a given signal peptide have been lacking. SecEff-Pred is an automated machine learning-based web server designed to predict the secretion efficiency of signal peptides in Bacillus subtilis. The SecEff-Pred model demonstrates excellent performance on the α-amylase dataset, achieving an accuracy of 84.83%, with AUROC and recall values of 91.11% and 86.33%, respectively. Additionally, SecEff-Pred achieves reasonable classification performance when predicting signal peptide secretion efficiency for alkaline xylanase and esterase.
The model was trained using AutoGluon, an open-source automated machine learning (AutoML) framework developed and maintained by Amazon Web Services. It provides a comprehensive suite of fully automated machine learning tools that handle tasks such as feature engineering, model selection, and hyperparameter tuning. For each input amino acid sequence, AutoGluon first prepends a CLS token. The sequence is then passed through an embedding layer, which converts each amino acid into its corresponding embedding vector, with positional encoding added to preserve the positional information within the sequence. These embedding vectors are fed into a Transformer encoder module, where multi-head self-attention mechanisms and feed-forward neural networks extract a feature representation for each amino acid. After encoding, the final hidden state of the CLS token is taken as a comprehensive representation of the entire sequence. Finally, this CLS hidden state is fed into a Multi-Layer Perceptron (MLP) classifier.
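As a minimal sketch, such a model could be trained with AutoGluon's MultiModalPredictor, which detects a text column and fine-tunes a pretrained Transformer backbone automatically. The file names, the column names (sequence, label), and the time limit below are illustrative assumptions, not the authors' actual configuration:

```python
import pandas as pd
from autogluon.multimodal import MultiModalPredictor

# Hypothetical CSV with a "sequence" column of signal-peptide amino acid
# strings and a binary "label" column (1 = high secretion efficiency).
train_df = pd.read_csv("sp_train.csv")

# MultiModalPredictor recognizes the text column; feature engineering,
# model selection, and hyperparameter tuning are handled automatically.
predictor = MultiModalPredictor(label="label", problem_type="binary")
predictor.fit(train_df, time_limit=3600)  # cap the AutoML search at one hour

# Score held-out signal peptides and report the metrics used in the article.
test_df = pd.read_csv("sp_test.csv")
print(predictor.predict_proba(test_df))
print(predictor.evaluate(test_df, metrics=["accuracy", "roc_auc", "recall"]))
```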
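The encoder pipeline itself can be illustrated with a short PyTorch sketch. The vocabulary size, embedding dimension, layer count, and head count below are illustrative assumptions rather than the model's actual hyperparameters:

```python
import torch
import torch.nn as nn

AA_VOCAB = 21                    # 20 amino acids + a [CLS] token (assumed sizes)
CLS_ID, D_MODEL, MAX_LEN = 0, 128, 64

class SPClassifier(nn.Module):
    """Sketch of the described pipeline: [CLS] prepending, token plus
    positional embeddings, a Transformer encoder, and an MLP head."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(AA_VOCAB, D_MODEL)   # amino-acid embeddings
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)    # learned positional encoding
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlp = nn.Sequential(nn.Linear(D_MODEL, 64), nn.ReLU(),
                                 nn.Linear(64, 2))       # binary MLP classifier

    def forward(self, tokens):                           # tokens: (batch, seq_len)
        cls = torch.full((tokens.size(0), 1), CLS_ID,
                         dtype=torch.long, device=tokens.device)
        x = torch.cat([cls, tokens], dim=1)              # prepend [CLS]
        pos = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        h = self.encoder(self.tok_emb(x) + self.pos_emb(pos))  # attention + FFN
        return self.mlp(h[:, 0])                         # final [CLS] hidden state

logits = SPClassifier()(torch.randint(1, AA_VOCAB, (8, 30)))  # 8 peptides, length 30
print(logits.shape)                                           # torch.Size([8, 2])
```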
The accompanying article is in preparation.