The present study introduces Hierarchically Self-Adaptive PSO (HSAPSO), a novel variant that dynamically adjusts inertia weight and learning coefficients during optimization. By embedding hierarchical learning mechanisms, HSAPSO enables faster convergence and enhanced generalization across data folds. While previous methods offer substantial gains through static or semi-adaptive optimization, HSAPSO provides a fully adaptive strategy that ensures superior convergence with fewer iterations and greater robustness. Its offline optimization structure mitigates the computational overhead typically associated with metaheuristics, focusing on improving generalization and stability in test-time deployment. Collectively, the proposed optSAE + HSAPSO framework represents a methodological advancement over existing strategies by combining deep feature extraction with adaptive metaheuristic optimization, resulting in high accuracy, efficient inference, and resilient model behavior in real-world drug discovery scenarios.
Figure 1 presents the essential stages of the proposed framework for drug design, which combines advanced machine learning methodologies with sophisticated optimization strategies to address drug classification challenges. The pipeline begins with protein data preprocessing for quality assurance, followed by dual-feature extraction that leverages contextual embeddings (e.g., ProtBERT) alongside evolutionary features.
A SAE handles classification, optimized through a Hierarchically Self-Adaptive PSO (HSAPSO) algorithm, which adapts dynamically for robust parameter tuning.
To classify potential druggable proteins, protein data was collected from the DrugBank and Swiss-Prot databases, which include comprehensive collections of protein sequences (see Table 1). The dataset under study consists of 2543 protein sequences, comprising 1224 druggable target proteins from DrugBank and 1319 non-target proteins from Swiss-Prot. These databases focus on proteins that act as drug targets in the human body, with classification based on various features, including structure, function, and biological interactions. The raw data consists of amino acid sequences stored as textual strings, varying in length and complexity, without any predefined numerical dimensions. These sequences were then transformed into numerical features suitable for machine learning models through feature extraction methods. Key features from the protein sequences were derived using approaches such as dipeptide composition, which encodes each sequence into a numerical vector representation for subsequent analysis.
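For illustration, a minimal sketch of the dipeptide-composition encoding is given below; the 400 dimensions correspond to all ordered pairs of the 20 standard amino acids, and the function name and toy sequence are illustrative rather than taken from the exact implementation used in this study.

```python
from itertools import product
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered amino-acid pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}

def dipeptide_composition(sequence: str) -> np.ndarray:
    """Return the 400-dimensional dipeptide-frequency vector of a protein sequence."""
    vec = np.zeros(len(DIPEPTIDES))
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    for pair in pairs:
        if pair in INDEX:  # skip pairs containing non-standard residues
            vec[INDEX[pair]] += 1
    return vec / max(len(pairs), 1)  # normalize by the number of dipeptides

# Example with a short toy sequence
features = dipeptide_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(features.shape)  # (400,)
```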
To construct a robust and unbiased dataset for predicting potential druggable proteins, we carefully selected 1319 non-target proteins from Swiss-Prot. Given that Swiss-Prot contains over 20,000 human proteins, it was essential to filter this large dataset to retain only high-quality and biologically relevant proteins. One major challenge was redundancy in protein sequences, where many entries shared high sequence similarity, potentially introducing bias into machine learning models. To address this, we applied Cluster Database at High Identity with Tolerance (CD-HIT), a widely used clustering algorithm, to eliminate redundant sequences and retain unique representative proteins. In this study, CD-HIT was used with a fixed identity threshold of 90%, ensuring removal of highly similar sequences while preserving diversity. This preprocessing step helped mitigate data leakage and improved generalization.
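As a minimal sketch of this redundancy-removal step, assuming the cd-hit executable is available on the system path and that the candidate sequences reside in a hypothetical swissprot_candidates.fasta file, the 90% identity filtering could be scripted as:

```python
import subprocess

# Cluster sequences at 90% identity and keep one representative per cluster.
# -c: sequence identity threshold; -n: word size recommended for thresholds >= 0.7.
subprocess.run(
    [
        "cd-hit",
        "-i", "swissprot_candidates.fasta",    # hypothetical input file
        "-o", "swissprot_nonredundant.fasta",  # representatives written here
        "-c", "0.9",
        "-n", "5",
    ],
    check=True,
)
```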
This redundancy-removal step significantly improves computational efficiency and ensures meaningful learning from distinct protein sequences. Additionally, many proteins in Swiss-Prot lack complete annotations or validated functional data, making them unsuitable for inclusion in predictive modeling. To enhance the reliability of the dataset, we excluded poorly characterized proteins and retained those with sufficient biological information. Another key factor in the selection process was dataset balancing, as imbalanced datasets can lead to biased classification models. To ensure fair and unbiased learning, we kept the numbers of druggable proteins (1224) and non-target proteins (1319) closely matched, yielding an approximately balanced dataset. This selection approach follows best practices in bioinformatics and computational drug discovery to construct a high-quality, non-redundant, and biologically informative dataset, facilitating accurate and generalizable predictions.
Moreover, the classification of druggable versus non-druggable targets was determined using established biochemical and pharmacological criteria, supported by curated datasets and experimental validation. Druggable targets are defined as proteins or biomolecules that exhibit high binding affinity with small molecules and have been experimentally validated in databases like DrugBank and ChEMBL. These targets typically demonstrate Kd, Ki, or IC50 values in the nanomolar to micromolar range, indicating their potential for therapeutic intervention. Conversely, non-druggable targets are proteins that lack significant binding interactions with known bioactive compounds or have shown poor ligandability in high-throughput screening assays. The dataset was constructed by integrating information from experimentally validated drug-target interactions (DTIs), ensuring that each classification is based on real-world pharmacological data. Additionally, computational filters, such as molecular docking scores and structural ligandability assessments, were used to improve the reliability of this classification.
Protein data preprocessing is a critical step to ensure the quality and relevance of input data before entering the feature extraction phase. The process begins with data collection from reliable protein databases such as UniProt, PDB, or other curated repositories. These sources provide detailed structural and functional information about proteins, including sequences, annotations, and three-dimensional configurations. The raw data often contains inconsistencies, missing values, or redundant information, which must be addressed through cleaning and normalization. Cleaning involves removing duplicate entries, filling missing information using imputation techniques, and correcting errors in sequences. Normalization ensures that data from diverse sources are standardized, enabling compatibility across analytical tools. For example, sequence lengths are truncated or padded to uniform sizes to maintain consistency during processing.
Once the raw data is cleaned and standardized, additional preprocessing techniques such as encoding and transformation are applied to make the data machine-readable. For sequence-based analyses, amino acid sequences are often transformed into numerical representations using encoding schemes like one-hot encoding, position-specific scoring matrices (PSSMs), or embeddings generated by models like ProtBERT. Structural data may undergo geometric transformations or be converted into graph-based representations, where nodes represent atoms or residues, and edges indicate bonds or interactions. These representations capture both spatial and sequential information. Moreover, noise reduction techniques such as low-pass filtering or Principal Component Analysis (PCA) are employed to eliminate irrelevant variations, preserving critical features while reducing dimensionality. By ensuring the data is accurately prepared and adequately transformed, the preprocessing phase sets the stage for effective and meaningful feature extraction in the subsequent steps.
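As an illustration of the encoding step, a minimal one-hot encoding sketch with truncation and zero-padding to a uniform length is shown below; the maximum length of 500 residues is an illustrative choice, not a value prescribed by this study.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str, max_len: int = 500) -> np.ndarray:
    """Encode a protein sequence as a (max_len, 20) one-hot matrix,
    truncating or zero-padding to a uniform length."""
    mat = np.zeros((max_len, len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence[:max_len]):
        if aa in AA_INDEX:  # non-standard residues are left as all-zero rows
            mat[pos, AA_INDEX[aa]] = 1.0
    return mat

encoded = one_hot_encode("MKTAYIAKQR")
print(encoded.shape)  # (500, 20)
```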
The post-processing steps and fundamental components of the workflow, including the integration of key optimization strategies, are further detailed in Fig. 2, which provides a comprehensive representation of the core algorithms and methodologies employed in the optSAE + HSAPSO framework.
The proposed feature extraction method combines contextual embeddings from pre-trained models with evolutionary features derived from multiple sequence alignments (MSAs). This dual approach captures both global context and local evolutionary constraints, providing a comprehensive feature set for downstream machine learning tasks. Pre-trained models, such as ProtBERT and ESM, transform protein sequences into high-dimensional embeddings. These embeddings encode the sequential and contextual relationships between amino acids. Given a protein sequence S = (A_1, A_2, ..., A_n), where A_i represents the i-th amino acid, the model produces an embedding matrix:

E = f_pre(S), with E ∈ R^(n×d)

Here, E is the embedding matrix, d is the embedding dimension, and f_pre denotes the pre-trained model. Each row e_i captures the contextual representation of amino acid A_i. The embeddings are then averaged across the sequence to produce a fixed-size vector:

e_avg = (1/n) Σ_{i=1}^{n} e_i
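A minimal sketch of obtaining such a mean-pooled sequence embedding is given below, assuming the publicly available Rostlab/prot_bert checkpoint from Hugging Face; this is illustrative and not necessarily identical to the embedding pipeline used in this study.

```python
import torch
from transformers import BertModel, BertTokenizer

# ProtBERT expects amino acids separated by spaces.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "M K T A Y I A K Q R"  # toy example; real sequences are much longer

with torch.no_grad():
    inputs = tokenizer(sequence, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state  # (1, n_tokens, 1024)
    # Average over residue tokens, dropping the [CLS] and [SEP] tokens.
    e_avg = hidden[0, 1:-1].mean(dim=0)

print(e_avg.shape)  # torch.Size([1024])
```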
This sequence-wide representation captures global properties of the protein. Evolutionary features highlight conserved regions critical to protein function. MSAs align the target protein sequence with homologous sequences to calculate conservation metrics. Given an MSA M, where M_ij represents the j-th residue of the i-th aligned sequence, the conservation score for position j is calculated as:

C_j = - Σ_{k=1}^{m} p_jk log(p_jk)

where p_jk is the frequency of amino acid k at position j, and m is the total number of amino acid types. This entropy-based score measures sequence variability, with lower values indicating higher conservation. The conservation scores are normalized and concatenated into a feature vector:

C = [C_1, C_2, ..., C_L], where L is the alignment length.
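The per-position entropy computation can be sketched as follows, assuming the MSA is supplied as a list of equal-length aligned strings; gap characters are ignored in this illustrative version.

```python
import math
from collections import Counter

def conservation_scores(msa):
    """Per-column Shannon entropy of a list of equal-length aligned sequences.
    Lower values indicate higher conservation."""
    n_cols = len(msa[0])
    scores = []
    for j in range(n_cols):
        column = [seq[j] for seq in msa if seq[j] != "-"]  # ignore gaps
        counts = Counter(column)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        scores.append(entropy)
    return scores

msa = ["MKTAYI", "MKSAYI", "MKTGYI"]  # toy alignment
print(conservation_scores(msa))
```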
To leverage both contextual embeddings and evolutionary insights, the feature vectors are integrated through concatenation:

F = e_avg ⊕ C

Here, ⊕ denotes vector concatenation. The combined feature vector F captures both global context and local evolutionary constraints. Although attention mechanisms are commonly used to weight heterogeneous features, in this work we adopted a simple concatenation strategy to merge the learned representations of protein and drug features. This design choice ensured computational efficiency and interpretability, while maintaining strong predictive performance. A weighted variant of the fusion can be written as:

F_w = α · e_avg ⊕ (1 - α) · C

where α is a learnable parameter that balances the contributions of contextual and evolutionary features.
Moreover, to ensure uniform scaling and reduce redundancy, the integrated feature vector is normalized and subjected to dimensionality reduction using Principal Component Analysis (PCA):

F_reduced = PCA(Normalize(F))
This transformation minimizes noise while preserving the most informative dimensions for predictive tasks. The final reduced feature vector is used as input for downstream machine learning models. Its compact yet rich representation enhances the efficiency and accuracy of predictions, such as drug-target interactions or protein function annotation. This advanced feature extraction framework combines the strengths of contextual embeddings and evolutionary metrics to produce a robust representation of protein sequences. By integrating global and local features, the approach captures the intricate patterns necessary for identifying druggable proteins and provides a scalable solution for large datasets.
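A minimal sketch of the fusion, scaling, and PCA steps is shown below; the feature dimensionalities and the number of retained components (256) are illustrative assumptions rather than values reported in this study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Illustrative stand-ins: 2543 proteins, 1024-d contextual embeddings
# and 400-d evolutionary/conservation features.
contextual = rng.normal(size=(2543, 1024))
evolutionary = rng.random(size=(2543, 400))

fused = np.concatenate([contextual, evolutionary], axis=1)  # F = e_avg ⊕ C
fused = MinMaxScaler().fit_transform(fused)                 # uniform scaling
reduced = PCA(n_components=256).fit_transform(fused)        # keep top components

print(reduced.shape)  # (2543, 256)
```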
The use of a Stacked Autoencoder (SAE) in the second stage for feature extraction and classification is driven by its ability to handle high-dimensional, complex data efficiently. Unlike a plain Autoencoder (AE), which consists of a single encoding and decoding layer, the proposed SAE features multiple encoding and decoding layers, allowing it to learn hierarchical feature representations. SAEs excel at reducing dimensionality while preserving critical information, capturing non-linear relationships, and learning structured representations essential for drug discovery tasks. A key characteristic of SAEs is their dual functionality: they act as unsupervised feature extractors, compressing inputs into informative latent representations, and as the basis for supervised classification once an output layer is attached.
Additionally, SAEs demonstrate robustness to noisy or incomplete data, ensuring reliable performance in real-world biological datasets. This two-stage approach enhances feature representation quality and boosts classification accuracy by leveraging both hierarchical feature learning and optimization techniques. Following feature extraction, the encoded features are utilized in a classification model. In this framework, a SAE is employed for feature learning and dimensionality reduction before classification. Notably, the term optSAE refers to a structurally enhanced version of the base SAE architecture. While the original SAE employs a standard three-layer encoding/decoding scheme, optSAE was designed with a deeper architecture and additional latent capacity. This refined structure was established prior to hyperparameter tuning and aims to improve representational learning. Subsequently, its core hyperparameters -- including latent dimension size, dropout rate, and learning rate -- were optimized using HSAPSO, enabling superior convergence and generalization performance.
The proposed SAE consists of three encoding layers (256, 128, and 64 neurons) and three decoding layers (64, 128, and 256 neurons). Each layer learns progressively abstract feature representations, enabling the model to effectively capture intricate drug-related patterns. This deep architecture significantly differs from a standard autoencoder, which typically consists of only a single encoding and decoding transformation. The encoding layers successively compress the input data into a latent representation, which retains the most informative features while discarding redundant noise. The decoding layers then attempt to reconstruct the original input from this compressed space, ensuring that the learned feature embeddings maintain essential structural and functional characteristics. Thus, the SAE undergoes a two-stage training process: unsupervised pretraining of the encoder-decoder layers, followed by supervised fine-tuning with a classification head.
During the pretraining phase, the encoding layers progressively compress the input feature vector x into a latent representation z using the following transformation:

z = f(W_e x + b_e)

where W_e represents the encoder's weight matrix, b_e is the bias vector, and f is the activation function (e.g., ReLU or sigmoid). The decoding layers aim to reconstruct x from the latent representation z to ensure that the extracted features retain relevant information:

x̂ = g(W_d z + b_d)

where W_d represents the decoder's weight matrix, b_d is the bias vector, and g is an activation function such as sigmoid. The reconstruction error, minimized during unsupervised pretraining, is defined as:

L_recon = (1/m) Σ_{i=1}^{m} ‖x_i - x̂_i‖²
where m is the number of training samples, x_i denotes the i-th input sample, and x̂_i represents its reconstruction. Once pretraining is complete, the encoder's output (64-dimensional latent representation) is not directly used for classification. Instead, it is passed through a fully connected dense layer with two neurons, where a softmax function is applied:

ŷ = softmax(z')

where z' = f(W z + b) is the transformed latent representation after passing through the final dense layer, W ∈ R^(2×64) is its weight matrix, and b ∈ R^2 is the corresponding bias vector. This ensures that the classifier operates on a two-dimensional space, which aligns with the binary classification task (drug-target vs. non-target). The supervised fine-tuning process minimizes the cross-entropy loss:

L_class = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{2} y_ij log(ŷ_ij)

where y_ij is the one-hot encoded label for class j of sample i and ŷ_ij is the corresponding predicted probability. The total loss function integrates both reconstruction loss (for feature learning) and classification loss (for predictive accuracy), balanced by λ as a weighting factor:

L_total = L_recon + λ L_class
The parameters θ are optimized using gradient-based methods, such as stochastic gradient descent (SGD), to minimize L_total:

θ ← θ - η ∇_θ L_total

where θ represents the weights and biases of the encoder, decoder, and classifier in the SAE, and η is the learning rate. These parameters are updated iteratively using SGD, ensuring effective learning and improved model performance.
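A compact sketch of the described architecture and combined objective is given below, written in PyTorch; the input dimension of 400, the weighting factor λ = 0.5, and the single-step training snippet are illustrative assumptions, and the joint optimization shown here is a simplification of the two-stage pretraining and fine-tuning procedure.

```python
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    """Encoder 400->256->128->64, mirrored decoder, and a two-way softmax head."""
    def __init__(self, in_dim=400):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, in_dim), nn.Sigmoid(),
        )
        self.classifier = nn.Linear(64, 2)  # softmax applied inside CrossEntropyLoss

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

model = StackedAutoencoder()
recon_loss, class_loss = nn.MSELoss(), nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
lam = 0.5  # illustrative weighting factor λ

x = torch.rand(32, 400)         # toy batch of dipeptide features
y = torch.randint(0, 2, (32,))  # toy binary labels
x_hat, logits = model(x)
loss = recon_loss(x_hat, x) + lam * class_loss(logits, y)  # L_recon + λ L_class
loss.backward()
optimizer.step()
```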
While gradient-based methods are effective in optimizing network parameters, they do not address the selection of optimal hyperparameters, which significantly influences the performance of the SAE model. The SAE architecture consists of multiple encoding and decoding layers, requiring fine-tuning of several critical hyperparameters. Table 2 summarizes the key hyperparameters considered for optimization.
Simultaneously optimizing all SAE parameters is computationally prohibitive and often results in inefficient training. A more effective approach is to concentrate on the key hyperparameters that significantly impact model convergence and generalization. This focused optimization reduces the search space's dimensionality, accelerating convergence. Furthermore, it enhances the model's generalization by mitigating overfitting caused by suboptimal configurations. Finally, it minimizes computational overhead, as hyperparameter tuning is limited to the training phase and does not affect inference-time efficiency. To systematically explore this high-dimensional hyperparameter space, we propose Hierarchically Self-Adaptive PSO (HSAPSO), an enhanced extension of standard PSO that integrates multi-layered adaptation mechanisms, ensuring both efficient exploration and stable convergence.
HSAPSO is specifically engineered for hyperparameter optimization, a distinct process from model weight optimization performed by gradient-based methods like SGD. Unlike SGD, which iteratively updates network weights during training, HSAPSO operates before training begins. It dynamically selects optimal hyperparameter values -- such as the number of neurons per layer, learning rate, dropout rate, and latent dimension -- to maximize model performance. By automating this selection, HSAPSO eliminates the need for manual tuning, ensuring the model is optimally configured from the start. This leads to improved convergence efficiency and enhanced overall generalization.
HSAPSO introduces three key innovations over standard PSO, making it a highly effective approach for hyperparameter optimization. Hierarchical adaptation dynamically balances exploration and exploitation across different learning stages, ensuring an optimal search trajectory. Fitness-based dynamic subgrouping clusters particles based on similarity in fitness scores, facilitating localized refinement and improving search efficiency. Self-adaptive parameter tuning continuously adjusts hyperparameter search behavior in response to real-time feedback, allowing the algorithm to adapt to varying optimization landscapes. These advancements collectively enable HSAPSO to efficiently optimize SAE hyperparameters, significantly enhancing model performance and convergence stability. Therefore, HSAPSO operates at three hierarchical levels, integrating local adaptation, subgroup-level learning, and global search refinement:
v_i(t+1) = w v_i(t) + c_1 r_1 (p_i - x_i(t)) + c_2 r_2 (g - x_i(t)) + c_3 r_3 (s_i - x_i(t))
x_i(t+1) = x_i(t) + v_i(t+1)

Here, w is the inertia weight that controls the influence of a particle's previous velocity, while c_1, c_2, c_3 are the cognitive, social, and subgroup learning factors, respectively. The random factors r_1, r_2, r_3 introduce stochasticity, and p_i, g, s_i represent the personal, global, and subgroup best positions. Including the subgroup term s_i provides an additional layer of guidance, enhancing convergence. HSAPSO introduces dynamic subgrouping, where particles form clusters based on similarities in fitness or spatial proximity. Subgroup formation enables localized refinement and independent evolution within clusters. Subgroup leaders periodically communicate with the global swarm to exchange information. Particles are grouped according to the pairwise fitness distance:

d_ij = |F_i - F_j|
where F_i and F_j represent the fitness values of particles i and j, respectively. This process clusters particles with comparable fitness values. Subsequently, clustering algorithms, such as k-means or density-based methods, are employed to create dynamic subgroups. These subgroups then evolve independently, with their respective leaders updated based on their local best fitness. This dynamic subgrouping mechanism preserves diversity within the swarm, effectively preventing premature convergence, and guarantees that distinct regions of the search space are thoroughly explored. The inertia weight is decreased linearly over the course of the search:

w(Iter) = w_max - (w_max - w_min) · (Iter / MaxIter)
Here, w_max and w_min are the maximum and minimum inertia weights, while MaxIter and Iter represent the total and current iterations. This mechanism shifts from exploration to exploitation over time, enhancing convergence and stability. The algorithm employs hierarchical fitness memory, storing solutions at three levels: global (G_best), subgroup (S_best), and temporal (T_best). The hierarchical best position is defined as:

P_hier = α G_best + β S_best + γ T_best

where α, β, γ are dynamically adjusted weights based on optimization performance, and T_best represents the best position over specific time windows. This multi-tier memory prevents information loss and ensures effective guidance throughout optimization stages.
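A simplified sketch of these update rules is given below; it implements the linearly decaying inertia weight and the personal, global, and subgroup terms, but replaces the fitness-based clustering and hierarchical memory with a rank-based subgroup split for brevity, so it should be read as an illustration rather than the full HSAPSO algorithm.

```python
import numpy as np

def hsapso_sketch(objective, dim, bounds, n_particles=30, max_iter=50,
                  c1=2.0, c2=2.0, c3=1.5, w_max=0.9, w_min=0.4, n_groups=3):
    rng = np.random.default_rng(0)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), np.array([objective(p) for p in x])
    gbest = pbest[pbest_f.argmin()].copy()

    for it in range(max_iter):
        w = w_max - (w_max - w_min) * it / max_iter  # linear inertia decay
        # Group particles by fitness rank (stand-in for fitness-based clustering).
        groups = np.array_split(np.argsort(pbest_f), n_groups)
        sbest = np.empty_like(x)
        for g in groups:
            sbest[g] = pbest[g[pbest_f[g].argmin()]]  # subgroup leader position
        r1, r2, r3 = rng.random((3, n_particles, dim))
        v = (w * v + c1 * r1 * (pbest - x)
                   + c2 * r2 * (gbest - x)
                   + c3 * r3 * (sbest - x))
        x = np.clip(x + v, lo, hi)
        f = np.array([objective(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest, pbest_f.min()

# Example: minimize a sphere function over a 4-d "hyperparameter" space.
best, best_f = hsapso_sketch(lambda p: float(np.sum(p ** 2)), dim=4, bounds=(-1.0, 1.0))
print(best_f)
```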
To validate the effectiveness of HSAPSO, we conducted extensive experiments comparing its performance against several established hyperparameter optimization methods. These included Grid Search, a brute-force approach that exhaustively evaluates all possible hyperparameter combinations; Bayesian Optimization, which employs probabilistic modeling to guide the search process efficiently; and Standard PSO, which lacks the hierarchical adaptation mechanisms of HSAPSO. Additionally, we compared SGD with manually chosen hyperparameters to SGD with HSAPSO-optimized hyperparameters, evaluating their impact on model convergence and performance. The results demonstrated that HSAPSO significantly accelerates convergence and improves final model accuracy, consistently outperforming other optimization techniques in both efficiency and generalization.
HSAPSO is particularly well-suited for optimizing SAE hyperparameters due to its ability to dynamically balance exploration and exploitation, ensuring efficient navigation of the hyperparameter space. Its hierarchical learning mechanism enables it to handle high-dimensional search spaces effectively, allowing it to identify optimal configurations with greater precision. Furthermore, its adaptive subgrouping strategy prevents overfitting and premature convergence by maintaining population diversity throughout the optimization process. These attributes make HSAPSO a highly robust and scalable hyperparameter optimization method, particularly advantageous for complex applications such as drug discovery, where fine-tuning deep learning architectures is crucial for achieving reliable predictive performance.
The scalability, adaptability, and precision of HSAPSO further reinforce its effectiveness in optimizing SAE parameters. Its dynamic subgrouping and hierarchical learning mechanisms enable efficient management of high-dimensional search spaces, while its ability to incorporate environmental feedback ensures adaptation to the SAE's loss landscape. Additionally, the use of hierarchical fitness memory and subgroup refinement enhances convergence toward optimal configurations. By systematically addressing challenges such as overfitting, premature convergence, and computational complexity, HSAPSO offers a robust and computationally efficient solution for hyperparameter optimization, making it particularly well-suited for high-stakes applications such as drug design.
The implementation of the proposed framework begins with data preprocessing, where protein sequences from DrugBank and Swiss-Prot were standardized for consistency. The dataset includes 2543 protein sequences, comprising 1224 druggable targets and 1319 non-targets. The sequences, initially stored as plain amino acid strings, were processed and transformed into numerical feature vectors using dipeptide composition, yielding 400-dimensional vectors that capture the frequency of amino acid pairs. These features were normalized using min-max scaling to improve training efficiency.

The dataset was initially partitioned into 90% training and 10% testing subsets, maintaining a balanced distribution of druggable and non-druggable proteins. To enhance model robustness and avoid overfitting, we employed k-fold cross-validation exclusively within the training set. After empirical evaluation of values of k ranging from 5 to 10, we selected 6-fold cross-validation as the optimal setting, offering a strong balance between computational efficiency, performance stability, and generalization. In this setup, the training partition was divided into six equally sized folds; in each iteration, five folds (75% of the total data) were used for training and one fold (15% of the total data) served as a temporary validation set. This rotating validation scheme ensured that every sample participated in both training and validation, providing a more reliable estimate of model performance on unseen data and improving the consistency of hyperparameter tuning outcomes. Following cross-validation, the final model was retrained on the full 90% training set using the best hyperparameter configuration and then evaluated on the fixed 10% test set to assess generalization performance. In addition, Extreme Gradient Boosting (XGBoost, version 1.7.6; https://xgboost.ai) and a Stacked Autoencoder (SAE) implemented in MATLAB (version R2023b; https://www.mathworks.com/products/matlab.html) were used to implement both our proposed model and the baseline approaches for comparative evaluation.
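A minimal sketch of this partitioning scheme (stratified 90/10 split followed by 6-fold cross-validation inside the training portion) is shown below; the feature matrix and labels are stand-ins for the actual dipeptide features.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.random((2543, 400))            # stand-in for dipeptide features
y = np.array([1] * 1224 + [0] * 1319)  # 1224 druggable, 1319 non-target

# 90% train / 10% held-out test, preserving class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

# 6-fold cross-validation within the training set only.
skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=42)
for fold, (tr_idx, val_idx) in enumerate(skf.split(X_train, y_train), start=1):
    X_tr, X_val = X_train[tr_idx], X_train[val_idx]
    y_tr, y_val = y_train[tr_idx], y_train[val_idx]
    # ... fit the model on (X_tr, y_tr), validate on (X_val, y_val) ...
    print(f"fold {fold}: train={len(tr_idx)}, val={len(val_idx)}")
```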
Classification was conducted using the SAE, optimized through HSAPSO. The SAE architecture consisted of three encoding layers with 256, 128, and 64 neurons, a bottleneck layer of 32 neurons, and symmetrical decoding layers. ReLU activation was applied in the hidden layers, with dropout (rate: 0.3) to prevent overfitting, and a softmax output layer for classification. HSAPSO dynamically optimized the hyperparameters over the following search ranges: the number of neurons in the encoding layers (128-512), the learning rate (0.0001-0.01), the dropout rate (0.1-0.5), and the latent dimension (16-64). The final optimized configuration comprised encoding layers of 256, 128, and 64 neurons, a learning rate of 0.001, and a dropout rate of 0.3. HSAPSO was configured with a population of 30 particles over 50 iterations, with cognitive, social, and subgroup learning factors of 2.0, 2.0, and 1.5, respectively, and a dynamically decaying inertia weight (0.9 to 0.4). Training was conducted on a high-performance system with an NVIDIA RTX 3080 GPU and Python-based tools, with early stopping to ensure optimal performance. This robust pipeline achieved highly accurate and efficient classification of druggable proteins.
Notably, all models were trained using early stopping based on validation loss, with a patience threshold of 10 epochs. While a maximum of 100 training epochs was allowed, convergence often occurred well before this limit. This approach ensured efficient training and helped prevent overfitting, particularly for deeper models such as optSAE.
Moreover, we employed SGD as our primary optimization method, prioritizing its stability and well-established convergence properties in deep network training. While adaptive optimizers such as Adam and AdamW offer faster initial convergence, they typically incur a greater computational cost per iteration, especially in large-scale deep learning models. Given the computational demands of optimizing a stacked autoencoder, SGD was selected to promote efficient memory utilization and avoid over-reliance on adaptive moment estimation, which can sometimes hinder generalization in deep architectures. Nevertheless, we performed supplementary experiments comparing SGD with AdamW, and the findings are detailed in subsequent sections.
Two-class analysis criteria are employed to relate predictions to error estimates. Accuracy, sensitivity, and specificity serve as the primary metrics for evaluating the method's performance. These metrics are calculated from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) produced by the classifier.
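For reference, the standard definitions of these metrics in terms of the confusion-matrix counts can be written as a small helper; the function name and example counts are illustrative.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard two-class metrics computed from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall / true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

print(classification_metrics(tp=120, tn=115, fp=10, fn=12))
```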
In evaluating whether a sample can be advanced as a potential drug, additional performance criteria include the detection ratio, the false alarm ratio, and the balance between these two metrics. Other considerations, such as efficiency estimation, are equally important and include execution speed, responsiveness, and error tolerance. Together, these factors assess the algorithm's ability to correctly classify samples with drug-conversion potential.
The detection ratio is calculated as the proportion of correctly identified druggable samples, validated by expert opinion across various laboratory conditions. Such metrics are considered essential for assessing the performance of the proposed model and are consistent with those used in similar studies. These criteria provide a robust foundation for evaluating the efficiency and reliability of the proposed approach.
The performance of the proposed framework was evaluated using standard classification metrics, including accuracy, sensitivity, specificity, and F1-score, to ensure a comprehensive assessment of its predictive capabilities. These metrics were calculated based on true positive, true negative, false positive, and false negative rates, providing a detailed analysis of the model's ability to classify druggable and non-druggable proteins effectively. Additionally, to assess computational efficiency, the time complexity of the algorithm was analyzed. The computational complexity of the proposed framework is primarily concentrated in the training phase, encompassing feature extraction, SAE training, and HSAPSO optimization. For N training samples, the cost of feature extraction using dipeptide composition scales linearly with the number of sequences and their average length.
Together, the total training complexity ensures robust optimization and efficient learning, while the lightweight testing phase, applied to the held-out test samples, involves minimal computation.
To quantitatively evaluate the variability in accuracy, the coefficient of variation (CV), a normalized measure of dispersion, is calculated alongside the standard deviation (σ). The equations are:

σ = sqrt( (1/n) Σ_{i=1}^{n} (A_i - Ā)² ),  CV = σ / Ā

where A_i is the accuracy for each trial, Ā is the mean accuracy, and n is the number of trials. Additionally, to assess the dispersion further, the range of accuracies (R) and mean absolute deviation (MAD) are computed as follows:

R = A_max - A_min,  MAD = (1/n) Σ_{i=1}^{n} |A_i - Ā|
where A_max and A_min are the maximum and minimum accuracies, respectively. To complement these metrics, the variance ratio test is used to compare the variability of this model with competing methods. The formula is:

VR = σ²_proposed / σ²_baseline

where σ_proposed and σ_baseline represent the standard deviations of the proposed model and a baseline model, respectively. Moreover, to ensure an objective and well-calibrated evaluation of the model, threshold values for accuracy, sensitivity, and specificity were determined through a combination of empirical analysis and statistical optimization. Initially, a default decision threshold of 0.5 was applied, aligning with standard binary classification practices, where predictions with a probability ≥ 0.5 were labeled as druggable proteins and those < 0.5 as non-druggable. To refine this threshold, Receiver Operating Characteristic (ROC) curve analysis was performed, and the optimal cutoff was identified using Youden's Index (J = Sensitivity + Specificity - 1), ensuring a balanced trade-off between sensitivity and specificity. Further robustness validation was conducted using five-fold cross-validation, where the selected threshold was assessed across multiple dataset partitions to confirm its consistency and stability. This systematic approach not only optimized classification performance but also mitigated potential biases arising from dataset imbalance, reinforcing the reliability and generalizability of the proposed model.
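As an illustration, the ROC-based threshold selection via Youden's Index and the dispersion statistics defined above can be computed as follows; the arrays of labels, predicted probabilities, and trial accuracies are illustrative stand-ins.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_prob):
    """Decision threshold maximizing J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    j = tpr - fpr  # Youden's J for each candidate threshold
    return thresholds[np.argmax(j)]

def dispersion_stats(accuracies):
    """Standard deviation, coefficient of variation, range, and MAD of trial accuracies."""
    a = np.asarray(accuracies, dtype=float)
    sigma = a.std()
    return {"sigma": sigma,
            "cv": sigma / a.mean(),
            "range": a.max() - a.min(),
            "mad": np.abs(a - a.mean()).mean()}

# Toy usage with illustrative values.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.4, 0.8, 0.6, 0.9, 0.3])
print(youden_threshold(y_true, y_prob))
print(dispersion_stats([0.96, 0.95, 0.97, 0.96, 0.95]))
```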