AI is changing the way cyclic peptides are discovered, replacing the traditional paradigm of empirical screening with a predictive, data-driven approach. The conventional route to a cyclic peptide lead is to generate large libraries of compounds, by biological display technologies or combinatorial chemistry, and then screen them in a low-throughput functional assay. This can take several months, samples only a tiny fraction of the available chemical space, and carries significant time and material costs. The AI approach inverts this workflow: machine learning models trained on large datasets of peptide sequences, structures, and biological activities predict which sequences and topologies are most likely to bind a particular target, resist proteases, and permeate cells. These predictions then guide the rational design and synthesis of a smaller, more focused set of compounds for experimental validation, greatly accelerating discovery, compressing timelines from months or years to weeks, and reducing the experimental burden of synthesising and testing inactive compounds. AI can also be combined with molecular docking, generative models, and active learning to create a virtuous cycle in which each prediction guides the synthesis of a new set of compounds, each experimental result updates the model, and each iteration expands the chemical space that can be explored. The AI approach is therefore not just a faster way of screening but a creative tool that can propose novel scaffolds, predict optimal cyclisation sites, and help navigate the complex trade-off between potency and druglikeness.
Fig. 1 Conceptual framework of the AI-driven bioactive peptide discovery pipeline for metabolic diseases.1,2
AI-directed screening can be viewed as a shift away from the paradigm of exhaustive experimental screening. Traditionally, peptide libraries are created (via phage display, split-and-pool synthesis, or other encoded-bead approaches) and put through multiple rounds of affinity selection. These methods are powerful, but library sizes are finite, propagation bias can be introduced, and screening depth is often limited by the throughput of Sanger sequencing. The key advantage of AI-directed screening is that it imposes no intrinsic limit on library size, nor does it require physical synthesis of peptides before a read-out. Instead, it analyses existing databases of peptides, protein targets, and known activities, and generalises from them to predict novel peptides with a high likelihood of binding a target of interest. This is particularly useful for cyclic peptides, where sequence, ring size, and the crosslink all make non-linear contributions to bioactivity. The computational methods used to model peptides are well suited to detecting and exploiting these intricate relationships: latent motifs that correlate with target activity can be extracted from the data, while residue combinations that lead to aggregation or metabolic liabilities can be penalised during training. AI also encompasses an array of complementary techniques: generative algorithms can suggest entirely new starting points outside known natural or synthetic peptides, reinforcement learning can optimise peptides against a multi-parameter fitness function (potency, selectivity, permeability), and active learning can direct users to the peptides that should be synthesised next, based on where the models are least certain. The result is a synergistic approach in which computation and experiment work in parallel to identify clinical candidates rapidly and efficiently.
Virtual peptide library design refers to a combined computational pipeline that brings together structure-based and sequence-based design in a single predictive model. In this view, the two design methods are not isolated processes but complementary levels of a layered inference system: structure-based design supplies geometric information and topological constraints from the target, sequence-based mining derives latent chemical principles from data, and machine learning algorithms combine the two to rank synthetic candidates. Virtual library design thus creates information-rich in silico collections of peptide macrocycles, each scored for binding affinity, conformational stability, and synthetic accessibility, allowing the space of possible molecules to be navigated with predictive intelligence rather than brute force. The pipeline is flexible and scales with the type and amount of data available for each target: when a high-resolution protein structure is available, the scoring function is dominated by docking simulations; when only activity readouts exist, sequence representations from deep neural networks are used instead. This adaptivity makes the methods easy to apply across target classes and at any stage of a project. The output, an ensemble of ranked, synthesizable candidates, turns the search for a peptide lead into a focused optimisation problem: experimental resources are applied only to the most promising sequences, the design-make-test cycle is kept as tight as possible, and failure is made cheap and fast in silico rather than expensive at the bench. This way of working not only expedites the identification of potent cyclic peptides; it also enables a systematic study of the structure–property relationships that drive their function, progressively building a body of knowledge that increases the predictive accuracy of future virtual campaigns.
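As a minimal illustration of this layered scoring, the sketch below combines a hypothetical docking score with a hypothetical sequence-model score into a single ranking; the 60/40 weighting, score conventions, and toy sequences are assumptions for illustration, not a prescribed protocol.

```python
# Minimal sketch of layered scoring for a virtual macrocycle library.
# `dock_scores` and `seq_scores` are hypothetical stand-ins for a docking
# pipeline and a trained sequence model; the weighting is illustrative.
import numpy as np

def rank_virtual_library(peptides, dock_scores, seq_scores, w_dock=0.6):
    """Combine structure- and sequence-based scores into one ranking.

    dock_scores: lower is better (e.g. predicted binding energy, kcal/mol)
    seq_scores:  higher is better (e.g. predicted activity probability)
    """
    dock = np.asarray(dock_scores, dtype=float)
    seq = np.asarray(seq_scores, dtype=float)
    # Normalise each score to [0, 1]; flip docking so higher is better.
    dock_n = (dock.max() - dock) / (dock.max() - dock.min() + 1e-9)
    seq_n = (seq - seq.min()) / (seq.max() - seq.min() + 1e-9)
    combined = w_dock * dock_n + (1.0 - w_dock) * seq_n
    order = np.argsort(combined)[::-1]  # best candidates first
    return [(peptides[i], combined[i]) for i in order]

# Example: three candidate cyclic peptides with toy scores.
ranked = rank_virtual_library(
    ["cyclo(GFPWR)", "cyclo(AKDYL)", "cyclo(PGWFV)"],
    dock_scores=[-9.2, -6.1, -8.4],
    seq_scores=[0.71, 0.35, 0.88],
)
```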
Structure-based design of cyclic peptide libraries takes advantage of atomic-resolution information about a protein target to position macrocyclic scaffolds within binding pockets or across protein–protein interaction surfaces. This rationally focused approach has the potential to optimize both affinity and selectivity while limiting off-target liabilities. Molecular docking algorithms power the process, sampling millions of cyclic peptide conformers and scoring their complementarity to the target surface with physics-based or machine-learned potentials. Ring conformer generation algorithms are key to the success of structure-based design because the bioactive conformation of a cyclic peptide often differs from its solution-averaged structure. Conformational sampling strategies include high-temperature molecular dynamics simulations to escape energy barriers, Monte Carlo torsion-angle perturbations that sample ring-pucker space, and knowledge-based approaches that graft fragments from known cyclic peptide crystal structures onto new scaffolds. Generative models, such as variational autoencoders trained on ring conformers from the Protein Data Bank, learn a continuous latent representation of ring geometry that can be interpolated between known topologies to create novel macrocycles predicted to adopt low-energy conformations. The generated conformers can then be clustered by backbone RMSD and representative structures selected for docking, allowing the search space to include canonical β-hairpin and α-helical mimetics as well as exotic knot-like topologies.
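A minimal sketch of this conformer-sampling-and-clustering step is shown below, using RDKit's ETKDG embedding as a stand-in for the MD and Monte Carlo strategies described above; the cyclo(Gly5) SMILES, conformer count, and 1.0 Å clustering cutoff are illustrative assumptions.

```python
# Sketch: macrocycle conformer sampling and RMSD clustering with RDKit.
# The SMILES encodes cyclo(Gly5), chosen only for brevity.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

mol = Chem.AddHs(Chem.MolFromSmiles("O=C1CNC(=O)CNC(=O)CNC(=O)CNC(=O)CN1"))

# ETKDGv3 includes torsion preferences suited to macrocyclic rings.
params = AllChem.ETKDGv3()
params.useMacrocycleTorsions = True  # macrocycle-aware torsion sampling
params.randomSeed = 42
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=200, params=params)
AllChem.MMFFOptimizeMoleculeConfs(mol)  # relax each conformer

# Pairwise heavy-atom RMSD (condensed matrix), then Butina clustering.
heavy = Chem.RemoveHs(mol)
rms = AllChem.GetConformerRMSMatrix(heavy, prealigned=False)
clusters = Butina.ClusterData(rms, len(conf_ids), distThresh=1.0,
                              isDistData=True)

# One representative per cluster (the centroid) would be passed to docking.
representatives = [cluster[0] for cluster in clusters]
print(f"{len(conf_ids)} conformers -> {len(clusters)} clusters")
```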
Machine learning methods have also been used in the sequence-based design of cyclic peptide libraries. Models can be trained to predict activity from the primary amino-acid sequence alone, without any knowledge of the target structure, making it possible to run a discovery campaign against proteins with unknown or intrinsically disordered structures. Deep learning models can learn predictive features directly from a peptide activity database, mined from phage display selections, high-throughput binding screens, or the literature, where each data point is a cyclic peptide sequence paired with a measurement of its affinity and physico-chemical properties. Convolutional neural networks and recurrent neural networks (such as LSTMs) learn local motifs and long-range sequence dependencies associated with activity, while transformer models treat the peptide sequence as a language, capturing the contextual relationships between residues that are specific to binding a given target. Quantitative structure–activity relationship (QSAR) descriptors, such as the hydrophobic moment, amphipathicity, and predicted secondary structure, can be fed to gradient-boosted decision trees that rank candidate sequences for synthesis by their predicted probability of interacting with the target. Sequence enrichment can also be read directly from the model predictions: an overrepresentation of, for example, aromatic residues at positions i and i+4 for an α-helical mimetic, or a high occurrence of proline-glycine dipeptides at the apex of a β-turn mimetic library. Active learning schemes can further optimise discovery: an initial model prediction selects a small, information-rich subset of the virtual library for synthesis, the resulting binding data is fed back to retrain the model, and each subsequent round of predictions becomes more accurate, yielding a small number of high-confidence leads from only a few experiments. Sequence-based design is also particularly useful for developing cyclic peptides that are cell-permeable or proteolytically stable: a model can be trained on a dataset containing permeability or stability measurements, and sequences that aggregate in aqueous solution or are susceptible to proteolytic cleavage are penalised during training. By using machine learning to interpret the large body of existing peptide data, sequence-based design of cyclic peptides can be substantially accelerated.
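The sketch below illustrates the descriptor-plus-gradient-boosting idea with scikit-learn; the toy sequences, labels, and the choice of three simple descriptors (Kyte-Doolittle hydropathy, net charge, aromatic fraction) are assumptions for illustration only, and a real campaign would use measured affinities from a screen.

```python
# Hedged sketch: simple sequence descriptors feeding a gradient-boosted
# classifier. Training data and labels here are invented toy examples.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

KD = {  # Kyte-Doolittle hydropathy values
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
    'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
    'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
    'Y': -1.3, 'V': 4.2}

def descriptors(seq):
    """Per-peptide QSAR-style features: hydropathy, charge, aromaticity."""
    hydro = np.mean([KD[a] for a in seq])
    charge = sum(seq.count(a) for a in 'KR') - sum(seq.count(a) for a in 'DE')
    aromatic = sum(seq.count(a) for a in 'FWY') / len(seq)
    return [hydro, charge, aromatic, len(seq)]

train_seqs = ["GFPWRL", "AKDYLE", "PGWFVK", "DDEEGS", "WLRFIK", "GSGSGS"]
train_labels = [1, 0, 1, 0, 1, 0]  # 1 = binder (toy labels)

X = np.array([descriptors(s) for s in train_seqs])
model = GradientBoostingClassifier(random_state=0).fit(X, train_labels)

# Rank unseen candidates by predicted binding probability.
candidates = ["WFPKRL", "GSDEGA"]
probs = model.predict_proba([descriptors(s) for s in candidates])[:, 1]
```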
Machine learning has changed the landscape of cyclic peptide screening data analysis, transforming phage display and mRNA display experiment outputs that are noisy, sparse, and high-dimensional into predictive models that generalise beyond the assayed library. The primary task is to take experimental enrichments that often suffer from sparsity and biases, and create training sets that encode real structure–activity relationships. In contrast to small-molecule screening where each compound is a fixed entity, peptide libraries encode information about dynamic relationships such as positional dependencies of residues, turn propensities that affect ring closure, and confounding biases from amplification steps that select for fast-growing clones regardless of binding affinity. Feature extraction must be performed at multiple levels to translate the linear amino-acid string into a numeric representation that captures physicochemical anisotropy, spatial architecture, and synthetic accessibility while filtering out experimental noise. This translation is non-trivial: simplistic one-hot encoding of residues treats them as independent tokens without considering cooperativity, which is critical for cyclic peptide binding, while overly complex descriptors may lead to overfitting to peculiarities of the training set. The challenge of feature engineering is to find a balance between expressive power and statistical robustness, ensuring that the model learns motifs that are transferable to unseen sequences rather than memorizing noise.
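To make this encoding trade-off concrete, the sketch below contrasts plain one-hot encoding with an assumed, illustrative augmentation that adds i/i+4 residue-pair indicators: cooperativity becomes expressible, but the feature count grows sharply, which is exactly the overfitting risk noted above.

```python
# Illustration of the encoding trade-off: plain one-hot treats residues
# as independent tokens, while pairwise features capture cooperativity
# at the cost of many more dimensions (greater overfitting risk).
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AA)}

def one_hot(seq, length):
    """(length x 20) one-hot matrix, flattened; independent tokens only."""
    x = np.zeros((length, 20))
    for pos, aa in enumerate(seq):
        x[pos, IDX[aa]] = 1.0
    return x.ravel()  # length * 20 features

def with_pair_features(seq, length):
    """Append i/i+4 residue-pair indicators (helical-face cooperativity);
    dimensionality grows from L*20 to L*20 + (L-4)*400."""
    base = one_hot(seq, length)
    pairs = np.zeros((length - 4, 400))
    for i in range(len(seq) - 4):
        pairs[i, IDX[seq[i]] * 20 + IDX[seq[i + 4]]] = 1.0
    return np.concatenate([base, pairs.ravel()])

x_simple = one_hot("GFPWRLAK", 8)             # 160 features
x_paired = with_pair_features("GFPWRLAK", 8)  # 160 + 1600 features
```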
The data generated by phage or mRNA display can then be used to train a model. First, the sequencing reads are collated by round of selection and each peptide is mapped to its enrichment profile: a vector of counts at each round describing how its frequency changes from the starting library through each wash and elution. This profile is typically not used directly but transformed into one or more sets of features that can be interpreted in terms of sequence composition and selection dynamics. A popular choice is to align the enriched sequences and calculate the log-odds of each residue at each position relative to the naive pool, generating a position-specific scoring matrix (PSSM); the resulting alignment reveals the consensus motif enriched in the pool. For cyclic peptides, the alignment must account for the ring, so that identical or similar peptides are not artificially offset, which could, for example, cause a turn-inducing residue to be missed. The enrichment profiles can also be used to calculate summary physicochemical features such as the hydrophobic moment, net charge, amphipathicity, and predicted secondary structure propensity of each peptide. These lower-dimensional features can be used directly to train a model, or the sequence alignments can be visualised to give an overview of the selection. Such datasets suffer from extreme class imbalance, which must be accounted for during pre-processing. Although the whole library often comprises fewer than 10^9 different clones, after a typical selection only 0.001–0.01% of these are real binders; the remaining 99.99–99.999% are either non-interactors or fast-replicating parasites that swamp the sequencing data. A model trained naively on such an imbalanced dataset will default to predicting the majority class, achieving over 99% accuracy while being completely useless for hit discovery. Oversampling of the enriched sequences, undersampling of the naive pool, and cost-sensitive learning, in which the penalty for a false negative is higher than for a false positive, can all be used to address the imbalance. Data cleaning to remove technical artefacts, such as PCR duplicates, low-quality reads, and frameshifts that produce truncated peptides, is also important.
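A minimal sketch of the PSSM step follows, assuming head-to-tail cyclic peptides and using a lexicographically minimal rotation as one simple way to align sequences around the ring; the toy pools and pseudocount are illustrative.

```python
# Sketch of the enrichment analysis: residue log-odds of the selected
# pool versus the naive pool, with a canonical rotation so that cyclic
# permutations of the same peptide are not offset in the alignment.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AA)}

def canonical_rotation(seq):
    """Map a head-to-tail cyclic sequence to its lexicographically
    smallest rotation so identical macrocycles align identically."""
    return min(seq[i:] + seq[:i] for i in range(len(seq)))

def log_odds_pssm(selected, naive, length, pseudocount=1.0):
    """Position-specific log-odds matrix (length x 20)."""
    def freqs(pool):
        counts = np.full((length, 20), pseudocount)
        for s in pool:
            s = canonical_rotation(s)
            for pos, aa in enumerate(s):
                counts[pos, IDX[aa]] += 1
        return counts / counts.sum(axis=1, keepdims=True)
    return np.log2(freqs(selected) / freqs(naive))

selected = ["GFPWRL", "FPWRLG", "GFPWKL"]  # toy round-3 sequences
naive = ["GASDEL", "LKDEGA", "PGSTVK"]     # toy starting library
pssm = log_odds_pssm(selected, naive, length=6)
# Positive entries mark residues enriched by selection at that position.
```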
Predictive models for hit enrichment during cyclic peptide screening range from interpretable ensembles to deep architectures, trading off accuracy, interpretability, and resource requirements. Random forests have become a reliable baseline in this setting, generating an ensemble of decision trees trained on bootstrapped training sets and taking a consensus vote on predictions. Feature importance scores extracted from the ensemble help to identify which residues and positions contribute most to enrichment, providing intuitive feedback for analogue design. For example, if a cysteine residue at position 2 in a disulfide-constrained library is assigned a high importance weight, the crosslink is likely required to form the correct binding geometry and should be maintained or replaced with a non-reducible analogue, whereas positions with low importance weights may be substituted with cheaper or more functional amino acids. The non-linearity of the model and the absence of explicit feature engineering also allow it to capture higher-order epistatic residue couplings, which are often instrumental in defining the shape of cyclic peptide binding epitopes. Convolutional neural networks (CNNs) explicitly model local residue motifs by building a hierarchical representation from elementary signal features in the input sequence. The peptide sequence is treated as a one-dimensional signal along which sets of convolutional filters slide to detect motifs; a filter may, for example, learn to recognise consecutive aromatic residues, charge clusters, or turn-inducing dipeptides that are predictive of binding. Subsequent layers combine these lower-level features into longer-range patterns, mirroring the hierarchical nature of secondary structure elements in cyclic peptides. The most recent class of peptide models is transformer-based, modelling the sequence with a self-attention mechanism. In contrast to CNNs, transformers have no fixed receptive field but compute pairwise attention scores between every residue and all others, effectively learning, for example, that residue 1 is important for binding because of its relationship with residues 10 and 15. This is especially relevant for macrocyclic scaffolds, where non-contiguous parts of the chain become neighbours upon cyclisation.
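A minimal PyTorch sketch of the CNN idea follows: one-hot peptides are treated as 20-channel 1D signals, with circular padding as one assumed way to respect the ring topology; the filter count, motif length, and toy batch are illustrative.

```python
# Minimal 1D-CNN sketch for motif detection on cyclic peptide sequences.
# Circular padding reflects that cyclisation makes the first and last
# residues neighbours.
import torch
import torch.nn as nn

class PeptideCNN(nn.Module):
    def __init__(self, n_filters=64, motif_len=3):
        super().__init__()
        # padding_mode="circular" wraps the convolution around the ring.
        self.conv = nn.Conv1d(20, n_filters, kernel_size=motif_len,
                              padding=motif_len // 2,
                              padding_mode="circular")
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # strongest motif match per filter
            nn.Flatten(),
            nn.Linear(n_filters, 1),  # enrichment / binding logit
        )

    def forward(self, x):             # x: (batch, 20, ring_length)
        return self.head(self.conv(x))

model = PeptideCNN()
batch = torch.zeros(8, 20, 10)        # 8 decapeptides, one-hot channels
batch[:, 0, :] = 1.0                  # toy input: poly-Ala rings
logits = model(batch)                 # (8, 1) predicted binding logits
```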
AI, and the machine learning models trained to perform specific tasks such as hit discovery and optimisation, has disrupted traditional drug discovery methods. Historically, hit discovery was a linear process with limited feedback loops: synthesis was followed by biological assay of the library and analysis of the collected data. Now, AI-integrated wet-lab screening uses ML models to form an iterative design, synthesise, test, and learn cycle. Achieving this requires tight integration between experimental discovery processes and computational prediction, placing the experimental data into an automated feedback loop that directly informs future experiments. Data from all stages of discovery, including reagent batch IDs, reaction and purification conditions and yields, raw HPLC chromatograms and LC-MS traces, and final assay results, must be digitised and stored in the cloud. This enables predictive quality control, such as flagging outlier reaction and purification batches, and informs future experiments, for example by changing the composition of plates during screening. The wet-lab feedback can be used to retrain the ML models (online machine learning), either continuously or at specified intervals. The iterative, data-driven hit discovery and optimisation cycle is ongoing, with experimental data used both to further train the models and to guide future experiments. As a result, not only do discovery cycles shorten, but the hit rate increases, since the models can begin making predictions while an experiment is only partially complete.
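A sketch of the online-learning update is shown below, assuming a scikit-learn model that supports partial_fit and a fixed-length featuriser (any of the descriptor schemes shown earlier would do); the simulated plates stand in for digitised assay batches.

```python
# Sketch of online learning: the model is updated incrementally as each
# assay batch is digitised, rather than being retrained from scratch.
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(random_state=0)

def on_new_assay_batch(X_batch, y_batch):
    """Called whenever a plate of results lands in the data store."""
    model.partial_fit(X_batch, y_batch)  # incremental update, no full retrain

# Simulated weekly plates of (features, measured activity) arriving over time.
rng = np.random.default_rng(0)
for week in range(4):
    X = rng.normal(size=(96, 4))         # one 96-well plate, 4 features
    y = X @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(0, 0.1, 96)
    on_new_assay_batch(X, y)
```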
Closed-loop optimisation formalises the iterative training and evaluation workflow inherent to AI-driven screening. It is defined by a closed feedback loop of synthesis, screening, model training, and prediction: a virtual library, usually containing millions of compounds, is scored by a trained model; the compounds are prioritised by score, with a diversity-selection step ensuring that chemically diverse regions of the library are covered; and a subset of the top-scoring predictions is chosen for synthesis and experimental validation. The selected compounds are sent to automated synthesisers, yielding crude macrocycles that are purified and assayed in parallel against the target of interest, producing binding affinities (KD values), kinetic off-rates, and functional readouts as ground-truth labels. These data are reweighted and incrementally added to the training dataset using an online learning scheme, updating the model parameters, often on only a subset of the data at a time. A common scheme down-weights examples where predictions were far from the experimental values and up-weights those where model and experiment agree. The updated model is then re-scored against the remaining virtual library, shifting its recommendations towards areas of chemical space complementary to previously identified hits. The process is repeated over multiple cycles; with each loop, the number of validated hits should increase and the model's prediction accuracy should improve. Each cycle learns from both positive and negative examples: experimental failures, such as sequences that failed to cyclise or showed poor solubility, provide important negative feedback, mapping out synthetic dead zones so that the model learns to propose compounds outside these regions in future rounds. In addition, domain experts review the model's predictions for synthetic accessibility and target druggability, providing valuable expert feedback.
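The skeleton below sketches one such cycle with generic scikit-learn components; the synthesize_and_assay stub, the 1/(1+error) reweighting, and the batch size of 24 are illustrative assumptions rather than a validated protocol.

```python
# Skeleton of one closed-loop cycle, including agreement-based
# reweighting: examples where model and experiment disagree strongly
# are down-weighted before retraining. The wet-lab step is a stub.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def synthesize_and_assay(X):
    """Stand-in for synthesis + assay: returns simulated activities."""
    rng = np.random.default_rng(1)
    return X @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(0, 0.2, len(X))

def closed_loop_cycle(model, X_virtual, X_train, y_train, w_train, k=24):
    # 1. Score the remaining virtual library; pick the top-k for synthesis.
    picks = np.argsort(model.predict(X_virtual))[::-1][:k]
    X_new = X_virtual[picks]

    # 2. Experimental ground truth for the selected compounds.
    y_new = synthesize_and_assay(X_new)

    # 3. Reweight: down-weight examples the model got badly wrong.
    w_new = 1.0 / (1.0 + np.abs(model.predict(X_new) - y_new))

    # 4. Append to the training set, retrain, drop picks from the library.
    X_train = np.vstack([X_train, X_new])
    y_train = np.concatenate([y_train, y_new])
    w_train = np.concatenate([w_train, w_new])
    model.fit(X_train, y_train, sample_weight=w_train)
    return model, np.delete(X_virtual, picks, axis=0), X_train, y_train, w_train

# Seed data, then a few cycles.
rng = np.random.default_rng(0)
X_lib = rng.normal(size=(5000, 4))
X_tr = rng.normal(size=(48, 4))
y_tr = synthesize_and_assay(X_tr)
w_tr = np.ones(len(y_tr))
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr, sample_weight=w_tr)
for _ in range(3):
    model, X_lib, X_tr, y_tr, w_tr = closed_loop_cycle(model, X_lib, X_tr, y_tr, w_tr)
```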
The active learning loop is the process by which the library is refined: based on model feedback, it iteratively chooses the most informative subset of compounds to synthesise next, maximising discovery while minimising the number of compounds that must be made. At each step, instead of synthesising only the sequences with the best predicted scores, those with the highest uncertainty are also selected. Uncertainty may reflect low prediction confidence from a single model or disagreement within an ensemble of models. Including these high-uncertainty sequences in the next synthesis batch removes as much ambiguity about the structure–activity relationship as possible per experiment. After each round of synthesis and screening, the model is retrained and re-queried for the next batch, iterating until the target performance is reached or no further improvement is possible. Another strategy is to vary the focus of each round: after a hit from one chemotype, the model can be asked to propose more diverse sequences in the next round, whereas once a near-optimal lead is found, the closest analogues (for example, N-methylation at a particular position or a change in ring size) can be explored to fine-tune activity.
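A sketch of ensemble-based uncertainty sampling follows, using per-tree disagreement in a random forest as the uncertainty signal and a simple explore/exploit split; the pool, features, and batch sizes are invented for illustration.

```python
# Sketch of uncertainty-driven acquisition: the next synthesis batch
# mixes high-uncertainty picks (explore) with high-score picks (exploit).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_next_batch(model, X_pool, n_explore=12, n_exploit=12):
    # Per-tree predictions give an ensemble disagreement estimate.
    per_tree = np.stack([t.predict(X_pool) for t in model.estimators_])
    mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)

    explore = np.argsort(std)[::-1][:n_explore]   # most uncertain
    explore_set = set(explore.tolist())
    by_score = np.argsort(mean)[::-1]             # best predicted
    exploit = [i for i in by_score if i not in explore_set][:n_exploit]
    return np.concatenate([explore, np.array(exploit, dtype=int)])

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 4))
y_train = X_train[:, 0] - 0.5 * X_train[:, 1] + rng.normal(0, 0.1, 60)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

X_pool = rng.normal(size=(2000, 4))
batch_idx = select_next_batch(model, X_pool)  # indices to synthesise next
```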
AI-designed cyclic peptides have been successfully used to target protein–protein interactions (PPIs). Protein–protein interfaces are often large, flat, and solvent-exposed, making them less amenable to traditional small-molecule drug discovery. Cyclic peptides can surmount many of these challenges because their structural pre-organization provides an extended, well-defined surface that can recapitulate the shape and charge complementarity of an interfacial epitope. Structural data (X-ray crystallography, cryo-EM, or NMR) are used to define the "hot spots" and secondary-structure motifs (α-helices, β-strands, or flexible loops) that contribute the most binding free energy. For example, hydrocarbon stapling can pre-organize a cyclic peptide into the desired helical conformation, with side chains positioned to complement the topology of the target groove; disulfide or lactam bridges can lock the loop geometry without disturbing essential hydrogen bonds; ring size and topology can be computationally optimized, using tools such as RosettaRemodel and flex ddG, to minimize the entropic cost of binding; and surface complementarity can be improved by mutating peripheral residues to enhance electrostatic or hydrophobic interactions. This computational strategy has been used to optimize cyclic peptides derived from native dimerization loops: through computational protein–protein interface design and optimization, cyclic peptides based on PHGDH dimerization loops were obtained that selectively bound their target interfaces with nanomolar affinities. Computational approaches have also been used to discover antimicrobial macrocycles. In one study, a genetic algorithm was used to design and evolve mixed peptide-peptoid sequences to be structurally similar to a natural product macrocycle, the lipopeptide antibiotic polymyxin B2 (PMB2). PMB2 is a last-line, membrane-targeting antimicrobial peptide used clinically against multidrug-resistant (MDR) bacterial infections, but it is known to cause kidney toxicity. Evolving peptide-peptoid macrocycles by shape and pharmacophore similarity identified compounds with antimicrobial activity comparable to PMB2 and reduced toxicity. The algorithm enumerated over 14.9 million chemically distinct macrocycles from over 42,000 distinct monomer combinations. Experimental validation by X-ray crystallography and NMR spectroscopy confirmed the reliability of the computational designs and supported this approach for sampling the vast chemical space of macrocyclic natural products.
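The sketch below shows the generic genetic-algorithm machinery (selection, crossover, mutation) with a toy string-similarity fitness standing in for the 3D shape and pharmacophore scoring used in the study; the reference sequence and GA hyperparameters are invented for illustration.

```python
# Generic GA sketch: sequences evolve under mutation and crossover
# toward a similarity fitness. A toy string-identity score replaces the
# study's shape/pharmacophore similarity so the loop is self-contained.
import random

AA = "ACDEFGHIKLMNPQRSTVWY"
REFERENCE = "TKFLKFLKVL"  # hypothetical reference macrocycle sequence

def fitness(seq):
    """Toy stand-in for shape/pharmacophore similarity scoring."""
    return sum(a == b for a, b in zip(seq, REFERENCE)) / len(REFERENCE)

def mutate(seq, rate=0.1):
    return "".join(random.choice(AA) if random.random() < rate else a
                   for a in seq)

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

random.seed(0)
pop = ["".join(random.choice(AA) for _ in range(10)) for _ in range(200)]
for gen in range(50):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:50]                           # truncation selection
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(150)]
    pop = parents + children
best = max(pop, key=fitness)
```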
AI, when combined with automation and high-throughput screening, creates a powerful cyclic peptide discovery ecosystem. AI models can be trained on large datasets generated from high-throughput screening to identify features and patterns predictive of high-affinity binders; automated synthesis platforms can rapidly produce large libraries of cyclic peptides; and high-throughput screening provides the data needed to train and refine the models. This closed-loop system accelerates discovery by continuously refining the peptide libraries based on AI feedback. For example, automated microwave synthesizers can assemble linear precursors overnight, and microfluidic cyclization modules can produce milligram quantities of macrocycles the following day. Real-time data from UPLC-MS and SPR are stored in a cloud-based LIMS, and medicinal chemists can accept or reject analogues within the same week, enabling fast turnaround for iterative lead optimization. In one study, initial hits were evolved into optimized leads within just a few rounds. Cloud-based peptide discovery platforms are also being developed to facilitate collaborative research and streamlined workflows. These platforms allow researchers to access and share large datasets, computational models, and experimental protocols from anywhere in the world, and by using cloud computing resources, complex simulations and analyses can be run without local high-performance computing infrastructure. Cloud platforms also enable interdisciplinary teams of chemists, biologists, and data scientists to work together seamlessly. In a recent study, cloud-based tools were used to design cyclic peptides against proteins such as MDM2 and PD-L1; the designed sequences had lower binding energies and higher pLDDT scores than the native peptides, validating the computational approach. Such platforms not only improve the efficiency of peptide discovery but also democratize access to advanced computational tools for researchers in both academic and industrial settings.
References