ProAll-D: protein allergen detection using long short term memory - a deep learning approach

Background: An allergic reaction is an overreaction of the immune system to a previously encountered, typically benign molecule, frequently a protein. Allergic reactions can result in rashes, itching, mucous membrane swelling, asthma, coughing, and other unusual symptoms. A wide range of principles and methods have been applied in bioinformatics to anticipate allergies. The sequence similarity approach based on FAO/WHO criteria has a very low positive predictive value, which makes it difficult to predict possible allergens.

Method: This work proposes a deep learning model, LSTM (Long Short-Term Memory), to overcome the limitations of traditional approaches and of lower-performing machine learning models in predicting the allergenicity of dietary proteins. A total of 2,427 allergens and 2,427 non-allergens from a variety of sources, including the Central Science Laboratory and the NCBI, were used. The data were divided 80:20 for training and testing. All techniques were implemented in Python. Five E-descriptors were used to describe the protein sequences of allergens and non-allergens: E1 (hydrophilic character of peptides), E2 (length), E3 (propensity to form helices), E4 (abundance and dispersion), and E5 (propensity of beta strands). The variable-length protein sequences were converted to a uniform length using the ACC transformation. A total of eight machine learning techniques were considered.

Results: The accuracy was 64.14% for Gaussian Naive Bayes, 49.2% for the Radius Neighbors Classifier, 85.8% for the Bagging Classifier, 76.9% for AdaBoost, 76.13% for Linear Discriminant Analysis, 84.2% for Quadratic Discriminant Analysis, 90% for the Extra Tree Classifier, and 91.5% for LSTM.

Conclusion: With an AUC value of 91.5%, the LSTM is regarded as the best model for predicting allergens. A web server called ProAll-D has been created that successfully identifies novel allergens using the LSTM approach.
Users can use the link https://doi.org/10.17632/tjmt97xpjf.1 to access the ProAll-D server and data.


Fig 1. Raw data in fasta file format
The figure above shows protein sequences in FASTA format. The ">" symbol marks the beginning of each new sequence record. The text that follows ">" gives the protein accession number and the scientific name of the protein.
Because each sequence is spread over multiple lines, it has to be converted to a single line. A Linux command was used for this conversion.
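The same conversion can also be sketched in Python. This is a minimal illustration and not the Linux command used in this work; the function name linearize_fasta and the example records (headers and sequences) are made up for the sketch.

```python
def linearize_fasta(lines):
    """Join wrapped sequence lines so each record has one header and one sequence.

    Returns a list of (header, sequence) tuples; the leading ">" is removed
    from the header.
    """
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # one wrapped piece of the current sequence
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration only (not taken from the dataset).
example = [
    ">EXAMPLE_1 hypothetical protein [Example species]",
    "MKTAYIAKQR",
    "QISFVKSHFS",
    ">EXAMPLE_2 another hypothetical protein",
    "ARN",
]

for header, sequence in linearize_fasta(example):
    print(f">{header}")
    print(sequence)
```

Each record is printed as one header line followed by exactly one sequence line, which is the single-line form needed for the later steps.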

Methods
The E-descriptor values derived by Venkatarajan et al. [23] were used to describe the features of the proteins.

ACC transformation
Auto Cross Covariance (ACC) includes both autocovariance and cross-covariance. Here, the five E-descriptors and the ACC transformation are used to convert each amino acid sequence into a sequence of numbers so that classification algorithms can be applied to it. As an example, consider the sequence ARN, which has a length of 3. Each amino acid is first replaced by its five E-descriptor values. The ACC transformation is then applied to this E-descriptor sequence using the respective formulas for autocovariance and cross-covariance.
The autocovariance is calculated between values of the same E-descriptor; for example, the autocovariance between E1 and E1 is denoted AC11. The lag value is then appended, which in our case ranges from 1 to the length of the shortest sequence, i.e. from 1 to 5. Thus, the autocovariance between E1 and E1 at lag = 1 is denoted AC111, the autocovariance between E1 and E1 at lag = 2 is denoted AC112, the autocovariance between E2 and E2 at lag = 1 is denoted AC221, and so on. The cross-covariance is calculated between different E-descriptors, so the cross-covariance values are denoted AC121, AC131, AC145, AC431, etc.
Once the autocovariance and cross-covariance values for an amino acid sequence are combined, the sequence is said to be ACC transformed. These ACC values are the attributes on which the classification algorithms are applied.
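The transformation can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the text does not give the formulas, so the code uses a commonly used form in which each term is the lagged product of two descriptor values averaged over the n - lag available positions. The function name acc_transform is illustrative, and the descriptor values in the example are placeholders, not the published E-descriptor values.

```python
def acc_transform(descriptors, max_lag):
    """Compute auto- and cross-covariance terms for one sequence.

    descriptors: one row per residue, each row holding d descriptor values.
    Returns a dict keyed like the text's notation, e.g. "AC121" for
    descriptor 1 vs descriptor 2 at lag 1.
    Assumption: each term is sum(E_j[i] * E_k[i + lag]) / (n - lag).
    """
    n = len(descriptors)
    d = len(descriptors[0])
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    features = {}
    for j in range(d):
        for k in range(d):
            for lag in range(1, max_lag + 1):
                total = sum(
                    descriptors[i][j] * descriptors[i + lag][k]
                    for i in range(n - lag)
                )
                # j == k gives an autocovariance, j != k a cross-covariance.
                features[f"AC{j + 1}{k + 1}{lag}"] = total / (n - lag)
    return features


# Placeholder descriptor rows for the sequence ARN (NOT the real E1..E5 values).
toy_arn = [
    [1.0, 0.0, 2.0, -1.0, 0.5],   # A
    [0.5, 1.0, -1.0, 0.0, 2.0],   # R
    [-1.0, 2.0, 0.0, 1.0, 1.0],   # N
]

# A length-3 sequence only supports lags 1 and 2.
features = acc_transform(toy_arn, max_lag=2)
print(len(features), features["AC111"], features["AC121"])
```

With five descriptors and a maximum lag of L, every sequence yields 5 x 5 x L values regardless of its length, which is how the transformation produces fixed-length inputs for the classifiers.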

Fig 6. Epoch of LSTM Algorithm
Compared with the classification methods above, LSTM performed well. The model is a network that uses two activation functions, softmax and ReLU, with RMSprop as the optimizer. The network was trained for 100 epochs with a batch size of 40.
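For readers unfamiliar with the two activation functions named here, the following pure-Python sketch shows what each one computes. It is not the model code: the text does not state which layers use which activation, and the input values below are made up. A common arrangement, assumed here only for illustration, is ReLU in hidden layers and softmax at the output to give class probabilities.

```python
import math


def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return max(0.0, x)


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up values, for illustration only.
hidden = [relu(v) for v in [-1.5, 0.0, 2.0]]
probs = softmax([0.3, 1.2])  # two scores, e.g. non-allergen vs allergen

print(hidden)
print(probs)
```

The larger of the two scores receives the larger probability, so the predicted class is simply the position of the highest softmax output.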

WEB SERVER (ProAll-D)
A web server, ProAll-D, has been developed to predict potential allergens using the LSTM algorithm. It is built with the Python Django framework, which is fast and user-friendly. The functioning of the web server, as mentioned in Methods, is described in detail in this section.

Fig 7. Command for executing the webserver
First, the path of the web app folder should be copied and pasted into cmd. The allergen_GUI folder contains a Python file named "manage.py"; the second command executes it.
Once these commands have been executed, the local host link (http://127.0.0.1:8000/) is displayed. Copy and paste this link into the browser.

Fig 8. Interface of ProAll-D
There are three sections: Home, Datasets, and Method Description. In the Home section, the user enters a protein sequence in one-letter code, and the model predicts whether the entered sequence is allergenic or non-allergenic. The Datasets section contains the data used in this research in FASTA format. The Method Description section gives the user a brief description of the methodologies considered.

Fig 9. Dataset Section
The Dataset section consists of two links that lead to the allergen and non-allergen data used in the current research. The Method Description section provides a brief description of the entire process.

Fig 10. Working of ProAll-D
The user enters the protein sequence in one-letter code in the Home section, and the model predicts whether the entered sequence is allergenic or non-allergenic. If the user enters a character other than those of the 20 naturally occurring amino acids, a message reports an undefined character in the entered sequence.
Here the entered character is "Z", which is not one of the amino acid characters, so an undefined-character message is returned.
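A check of this kind can be sketched in Python. This is a hypothetical illustration, not the ProAll-D server code: the function names and the wording of the message are assumptions, and only the rule itself (accept the 20 standard one-letter codes, flag anything else) comes from the text.

```python
# One-letter codes of the 20 naturally occurring amino acids.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_undefined_characters(sequence):
    """Return the sorted characters in the input that are not valid codes."""
    cleaned = "".join(sequence.split()).upper()  # drop whitespace, ignore case
    return sorted({ch for ch in cleaned if ch not in VALID_AMINO_ACIDS})


def check_sequence(sequence):
    """Return "OK" or a message naming the undefined characters."""
    undefined = find_undefined_characters(sequence)
    if undefined:
        # Hypothetical message text; the server's exact wording may differ.
        return "Undefined character(s): " + ", ".join(undefined)
    return "OK"


print(check_sequence("ARN"))
print(check_sequence("ARNZ"))
```

Running the check before the ACC transformation avoids looking up descriptor values for a character that has none.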