An Ising model in RNA folding (part 1)

\[\newcommand{\NP}{\textsf{NP}} \newcommand{\APX}{\textsf{APX}} \newcommand{\sP}{\textsf{#P}} \newcommand{\A}{\textsf{A}} \newcommand{\T}{\textsf{T}} \newcommand{\G}{\textsf{G}} \newcommand{\C}{\textsf{C}} \newcommand{\U}{\textsf{U}}\]

This is part 1/3 of my Bachelor thesis during my first year at Ulm, carried out at Hamilton Institute, Maynooth University and under supervision of Prof. Damien Woods, titled “Computational complexity of an Ising model in pseudoknotted nucleic acid folding”, to be published.

1. Introduction

Since the early studies on nucleic acids, there has been much effort going into harvesting its power. With RNAs, we have been achieved new ways to make vaccine (Pardi et al., 2018), edit genome (Deltcheva et al., 2011), and make nanostructured devices (missing reference) to control cell fate (Shibata et al., 2017). With DNA, we have found treatments for some chronic disease (Sussman et al., 2024), done computations (Fan et al., 2020), and stored data (Ceze et al., 2019). And yet, there is so much more that we do not know about them. We now know that at least $76\%$ of human genome is transcribed, yet only $1.2\%$ of which encodes proteins (Lee et al., 2019), and most of the rest, i.e. non-coding RNAs, have their functions yet to be completed unveiled (Statello et al., 2021), and applications to be discovered. For example, it is only recently that tRNAs, most commonly known as encoding of amino acid sequences, can assemble into a DNA replicator (Kühnlein et al., 2021). Given how relatively recent these achievements are, it is expected that there is more potential to explore.

Notwithstanding, it is not fair to say that these successes are entirely product of this century, as they are built upon decades of studying nucleic acids, e.g. how they form, how their structures are related to their functions, how to predict their structures, and how to design nucleic acids that form a particular target structure.

We now understand that they start out as sequences of nucleic acids, also known as bases, e.g. $\A$, $\T$, $\G$, $\C$ for DNA, and $\A$, $\U$, $\G$, $\C$ for RNA, lying on phosphate backbones. The hydrogen bonds between the bases form what is known as secondary structure. Many of such structural elements form recognisable three-dimensional motifs such as helices, major and minor grooves, and quadruplexes, together forming what is called a tertiary structure. Finally, several such structures can, in some situations, self-assemble into larger complexes called quaternary structures.

On the one hand, advanced in nucleic acid sequencing have allowed exponential growth in the number of nucleic acid sequences we determined (Katz et al., 2022). On the other hand, the next step is significantly more difficult. We know that the structure of a nucleic acid also plays critical roles in gene regulation (Mandal & Breaker, 2004), protein-DNA recognition (Rohs et al., 2009), DNA replication (Sims & Benz, 1980), and other processes. Moreover, the sequence alone does not determine a nucleic acid’s functionalities, as there have been evidences of RNA polymers having the same function yet not sharing the same sequence (missing reference). Yet despite much progress, even determining secondary structure remains arduous, time-consuming, and technically demanding from an experimental point of view (von Löhneysen et al., 2024).

An alternative approach is introducing a thermodynamic model which associates to each secondary structure a free energy, and the ones minimising free energy are considered most favourable, thus most likely. The models can be simple as maximising the number of paired nucleotides (Nussinov & Jacobson, 1980), or complicated as a function of over 7600 parameters (Mathews et al., 1999), but they all consider secondary structures as mere pairs of bases respecting some pairing, e.g. Watson-Crick pairing or wobble pairing.

But, these models are certainly not sufficient for practical purposes. One of the reasons is that they consider few geometric constraints, e.g. the minimum length of hairpins in case of nearest neighbour model (Mathews et al., 1999), which is problematic because in order for two bases to pair, they must be close and align with each other. A predicted secondary structure by these models thus might not be realistic; this problem cannot be ignored, as it is known that these phenomena do affect the ways wherein nucleic acids can bind together (Wang, 1979), thus one must take it into consideration when designing nucleic acids (missing reference).

One can certainly consider a more detailed thermodynamic model with more geometric constraints, eventually achieving a physically and chemically realistic, if not accurate model of nucleic acids. The possible constraints are the strength of chemical bonds, volume exclusion, i.e. two atoms or bases cannot occupy the same position in space, and persistent length (Hagerman, 1988) i.e. two bases sufficiently far apart can be oriented independently in space. A model then can be simple as deriving how exactly bases can pair (Kimchi et al., 2019), to an abstract 3D model of nucleic acids such as oxDNA (Sengar et al., 2021), to molecular dynamics simulation using force field (Liebl & Zacharias, 2023), but they all have speed as a bottleneck.

The problem is then how to find a good trade-off between performance and accuracy, preferably augmenting existing thermodynamic models with some notions of geometric constraints whilst maintaining tractability of MFE problem. The goal of this work is thus two-fold.

Firstly, we augment two energy models by allowing nucleotides to have angles with respect to the backbone, and to simplify this notion, each base is oriented either up or down, analogous to the spin in Ising model. Note that this is the only similarity with Ising model in statistical physics, and in particular, we have neither conditions on the spins on the boundary, nor notion of magnetisation. The energy models are base-pair counting by Nussinov (Nussinov & Jacobson, 1980), chosen for its simplicity, and base-pair stacking by Lyngsø (Lyngsø, 2004), chosen for its featuring of stack, which is an essential feature of nearest neighbour model (Mathews et al., 1999).

Secondly, we study the computational tractability of problems associated with these models, namely the problem of calculating the energy of a secondary structure, and that of finding the minimum free energy. In particular, we show that in almost all cases, these models are computationally hard, suggesting the same if more geometric constraints are ever considered.

As we already noted, whilst we have learned a lot about geometry of nucleic acids (missing reference), little effort has been made to take this into account in de novo thermodynamic prediction of structures. In particular, no work has been done on Ising-like model for nucleic acids, although similar models have been considered for proteins (Istrail & Lam, 2009), where $\NP$-hardness has been shown for various models, whilst $\APX$- and $\sP$-hardness have not been considered. However, the majority of these protein folding models are lattice-based, e.g. HP model, which differ significantly from nucleic acid secondary structure model, thus rendering these results inapplicable.

Beyond Ising-like models, there have been some attempts to augment current energy models with some geometric constraints. For example, in nature, RNA folds as it is transcribed, a process called co-transcriptional folding, which has been shown to play an important role in its structure formation (Lai et al., 2013). To this end, Proctor and Meyer augment Turner model to incorporate this effect (Proctor & Meyer, 2013). In a different direction, Deigan et al. (Deigan et al., 2009) augments Turner model with affinity of each nucleotide as determined experimentally. One should note, however, that these attempts are to refine Turner model with additional parameters, and none of them significantly change the model. In particular, as one starts out with Zuker’s algorithm, tractability is not a question.

1.2. Organisation

The next section gives quick reminder of essential notions related to the RNA folding problem. Then, Section 3 introduces our Ising model and derives energy functions analogous to what was considered by Nussinov and Jacobson (Nussinov & Jacobson, 1980), and by Lyngsø (Lyngsø, 2004). Section 4 introduces two relevant algorithmic problems and states the main computational complexity results. Subsequent sections give proofs or outlines thereof, before Section 7 ends with some remarks.

2. Preliminaries

2.1. Nucleotide alphabet

We consider in this work a general notion of nucleotides (also called bases), where each of them is a character in an alphabet $\Sigma = \{s_i, t_i \mid 1 \leq i \leq N\}$, and the only allowed pairings are between $s_i$ and $t_i$.

For instance, Watson-Crick pairing correspond to $N = 4$, $\Sigma = \{\A, \T, \C, \G\}$, and $s_1 = \A$, $t_1 = \T$, $s_2 = \C$ and $t_2 = \G$. Note that in case of RNA, we also have wobble pairs, i.e. $\U$ can pair with either $\G$ or $\A$, but we shall omit them in this work for simplicity.

2.2. Secondary structure

A nucleic acid can be encoded as a string $q = \overline{q_1, q_2, \cdots, q_n}\in \Sigma^*$, which may admit a secondary structure $S = \{(\ell_i, r_i)\}_{i = 1}^s \in \{1, 2, \cdots, n\}^2$, who then must satisfy the following conditions in order to be considered as admissible.

For all $1 \leq i \leq s$, we have that $\ell_i < r_i$. Moreover, $(q_{\ell_i}, q_{r_i}) = (s_j, t_j)$ or $(q_{\ell_i}, q_{r_i}) = (t_j, s_j)$ for some $1 \leq j \leq N$.
A nucleotide can pair with at most one other, i.e. for all $1 \leq i, j \leq s$, if $\ell_i = \ell_j$ or $r_i = r_j$, then $i = j$. For future convenience, we denote by $P = \{i \mid \exists j, (i, j) \in S \lor (j, i) \in S\}$ the positions of paired nucleotides.

2.3. Base-pair counting model and base-pair stacking model

Base-pair counting model, hereinafter abbreviated as BPC, was the first to be considered in RNA folding problem, in the paper by Nussinov and Jacobson (Nussinov & Jacobson, 1980), where $\Delta’_1(S) = -|S|$.

On the other hand, base-pair stacking model, hereinafter abbreviated as BPS, was introduced by Lyngsø (Lyngsø, 2004), where he showed that finding MFE in this model is $\NP$-hard. Here, the energy function is defined as $\Delta’_2(S) = -|\{(i, j) \in S \mid (i+1, j-1) \in S\}|$. Intuitively speaking, two bonds of the form $(i, j)$ and $(i+1, j-1)$ form what is called a stack, and this model aims at maximising the number of stacks.

Stack is an essential feature of celebrated Turner model (Mathews et al., 1999), thus BPS model can be considered not only as a simplification but also a precursor thereof, and any negative computational complexity result for it strongly suggests the same for Turner model, which is our motivation.

3. Ising model for nucleic acids

Analogous to Ising model in physics, we first need a notion of spin. Given a sequence $q$ of length $n$, the nucleotide at position $i$ is assigned a discrete variable $\sigma_i \in \{1, -1\}$. The collection of such assignment is called a spin configuration $\sigma = (\sigma_i)_{i = 1}^n \in \{1, -1\}^n$.

Next, a bond should be symmetric, meaning that if there is a pair $(\ell_i, r_i) \in S$, then it should make no differences whether we have $q_{\ell_i} = s_j$ and $q_{r_i} = t_j$, or $q_{\ell_i} = t_j$ and $q_{r_i} = s_j$. Thus like in the original Ising model, if a bond is possible, we favour the bond between nucleotides of opposite spins. To put it rigorously, we assign an energy $\alpha \sigma_i \sigma_j$ to the pair $(i, j) \in S$, where $\alpha > 0$ is an absolute constant.

On the other hand, we also prefer that nucleotides in proximity and paired with some nucleotides not necessary amongst themselves. In other words, if the bases at position $i$ and $i+1$ are paired with those of position $j$ and $k$ respectively, i.e. $(i, j), (i+1, k) \in S$ for some $j$ and $k$, then we prefer that $\sigma_i = \sigma_{i+1}$, and thus we assign an energy $-\sigma_i \sigma_{i+1}$.

The motivation for this is less clear (aside from making the MFE problems non-trivial), but there is some from chemistry. In reality, the bond between two nucleotides is the hydrogen bond between the corresponding nitrogenous bases, which are attached to phosphate group. These groups are linked by ester bond, together forming the backbone of nucleic acid. As such, two consecutive paired nucleotides $i$ and $i+1$ cannot form arbitrary dihedral angles (Hartmann & Lavery, 1996). In the special cases where $i$ is paired with $j$ whilst $i+1$ is paired with $j-1$, it might be a part of a helix structure, thus have their angles fixed.

On the other hand, we do not consider $\sigma_i\sigma_{i+1}$ for all $i$, because the interaction above is, to some degree, local. In particular, if we see nucleotides as vectors from the phosphate group to the base, then two nucleotides of distance larger than persistent length have their directions uncorrelated (Hagerman, 1988). Thus we choose the aforementioned definition to abstractise this effect.

Finally, we introduce the following energy function as sum of the aforementioned contributions

\[\Delta_1(S, \sigma) = \alpha \sum_{(i, j) \in S} \sigma_i \sigma_j - \sum_{i, i+1 \in P} \sigma_i \sigma_{i+1}.\]

One can see it, and in particular the first term, as an analogy to $\Delta'_1$ in BPC model, by writing $\Delta'_1 (S) = \sum_{(i, j) \in S} (-1)$. Thus we call this Ising BPC model. It is now clear that $\alpha$ plays the role of ratio between the strength of two interaction, the choice of which we delay until later, as there is no reason to suggest any particular value.

In this fashion, it is straightforward to define an analogy, which we call Ising BPS model, to $\Delta'_2$ in BPS model as follow

\[\Delta_2(S, \sigma) = \alpha \sum_{(i, j) \in S \mid (i+1, j-1) \in S} \sigma_i \sigma_j - \sum_{i, i+1 \in P} \sigma_i \sigma_{i+1}.\]

4. Statements of results

Given our two energy functions, it is straightforward to define the problem of finding minimum free energy.

Problem $\textsf{MFE}_{\Delta}$ ($\Delta \in \{\Delta_1, \Delta_2\}$). Given $\alpha$ and a nucleotide sequence $q$ as input, output $\min_{S, \sigma} \Delta(S, \sigma)$.

Another interesting problem is given a secondary structure $S$, evaluating the free energy of $S$. In the classical setting, the energy is entirely determined by $S$, whose evaluation is done in polynomial time. However, in our new setting, the energy also depends on the choice of spin configuration $\sigma$, and evaluating the energy of $S$ amounts to an optimisation problem over all possible choices of $\sigma$, which a priori is non-trivial. We thus introduce it as a separate problem.

Problem $\textsf{SPIN}_{\Delta}$ ($\Delta \in \{\Delta_1, \Delta_2\}$). Given $\alpha$, a nucleotide sequence $q$, and a secondary structure $S$ as input, output $\min_{S, \sigma} \Delta(S, \sigma)$ if $S \in \mathcal{S}$, and $\infty$ otherwise.

Our results are summarise in the following theorems.

Theorem 1. $\textsf{SPIN}_{\Delta_1}$ and $\textsf{SPIN}_{\Delta_2}$ are $\sP$- and $\APX$-hard.

Theorem 2. $\textsf{MFE}_{\Delta_1}$ is $\APX$-hard and in $|\Sigma|-W[1]$. When $\Sigma$ is unbounded, it is $\sP$-hard.

Theorem 3. $\textsf{MFE}_{\Delta_2}$ is $\APX$-hard, $\NP$-hard when $\Sigma$ is bounded, and $\sP$-hard when it is unbounded.

References

Sussman, C., Liberatore, R. A., & Drozdz, M. M. (2024). Delivery of DNA-Based Therapeutics for Treatment of Chronic Diseases. Pharmaceutics, 16(4), 535. https://doi.org/10.3390/pharmaceutics16040535
von Löhneysen, S., Mörl, M., & Stadler, P. F. (2024). Limits of experimental evidence in RNA secondary structure prediction. Frontiers in Bioinformatics, 4. https://doi.org/10.3389/fbinf.2024.1346779
Liebl, K., & Zacharias, M. (2023). The development of nucleic acids force fields: From an unchallenged past to a competitive future. Biophysical Journal, 122(14), 2841–2851. https://doi.org/10.1016/j.bpj.2022.12.022
Katz, K., Shutov, O., Lapoint, R., Kimelman, M., Brister, J. R., & O’Sullivan, C. (2022). The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Research, 50(D1), D387–D390. https://doi.org/10.1093/nar/gkab1053
Kühnlein, A., Lanzmich, S. A., & Braun, D. (2021). tRNA sequences can assemble into a replicator. ELife, 10. https://doi.org/10.7554/eLife.63431
Statello, L., Guo, C.-J., Chen, L.-L., & Huarte, M. (2021). Gene regulation by long non-coding RNAs and its biological functions. Nature Reviews Molecular Cell Biology, 22(2), 96–118. https://doi.org/10.1038/s41580-020-00315-9
Sengar, A., Ouldridge, T. E., Henrich, O., Rovigatti, L., & Šulc, P. (2021). A Primer on the oxDNA Model of DNA: When to Use it, How to Simulate it and How to Interpret the Results. Frontiers in Molecular Biosciences, 8. https://doi.org/10.3389/fmolb.2021.693710
Fan, D., Wang, J., Wang, E., & Dong, S. (2020). Propelling DNA Computing with Materials’ Power: Recent Advancements in Innovative DNA Logic Computing Systems and Smart Bio‐Applications. Advanced Science, 7(24). https://doi.org/10.1002/advs.202001766
Kimchi, O., Cragnolini, T., Brenner, M. P., & Colwell, L. J. (2019). A Polymer Physics Framework for the Entropy of Arbitrary Pseudoknots. Biophysical Journal, 117(3), 520–532. https://doi.org/10.1016/j.bpj.2019.06.037
Lee, H., Zhang, Z., & Krause, H. M. (2019). Long Noncoding RNAs and Repetitive Elements: Junk or Intimate Evolutionary Partners? Trends in Genetics, 35(12), 892–902. https://doi.org/10.1016/j.tig.2019.09.006
Ceze, L., Nivala, J., & Strauss, K. (2019). Molecular digital data storage using DNA. Nature Reviews Genetics, 20(8), 456–466. https://doi.org/10.1038/s41576-019-0125-3
Pardi, N., Hogan, M. J., Porter, F. W., & Weissman, D. (2018). mRNA vaccines — a new era in vaccinology. Nature Reviews Drug Discovery, 17(4), 261–279. https://doi.org/10.1038/nrd.2017.243
Shibata, T., Fujita, Y., Ohno, H., Suzuki, Y., Hayashi, K., Komatsu, K. R., Kawasaki, S., Hidaka, K., Yonehara, S., Sugiyama, H., Endo, M., & Saito, H. (2017). Protein-driven RNA nanostructured devices that function in vitro and control mammalian cell fate. Nature Communications, 8(1), 540. https://doi.org/10.1038/s41467-017-00459-x
Proctor, J. R., & Meyer, I. M. (2013). C o F old : an RNA secondary structure prediction method that takes co-transcriptional folding into account. Nucleic Acids Research, 41(9), e102–e102. https://doi.org/10.1093/nar/gkt174
Lai, D., Proctor, J. R., & Meyer, I. M. (2013). On the importance of cotranscriptional RNA structure formation. RNA, 19(11), 1461–1473. https://doi.org/10.1261/rna.037390.112
Deltcheva, E., Chylinski, K., Sharma, C. M., Gonzales, K., Chao, Y., Pirzada, Z. A., Eckert, M. R., Vogel, J., & Charpentier, E. (2011). CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature, 471(7340), 602–607. https://doi.org/10.1038/nature09886
Deigan, K. E., Li, T. W., Mathews, D. H., & Weeks, K. M. (2009). Accurate SHAPE-directed RNA structure determination. Proceedings of the National Academy of Sciences, 106(1), 97–102. https://doi.org/10.1073/pnas.0806929106
Istrail, S., & Lam, F. (2009). Combinatorial Algorithms for Protein Folding in Lattice Models: A Survey of Mathematical Results. Communications in Information and Systems, 9(4), 303–346. https://doi.org/10.4310/CIS.2009.v9.n4.a2
Rohs, R., West, S. M., Sosinsky, A., Liu, P., Mann, R. S., & Honig, B. (2009). The role of DNA shape in protein–DNA recognition. Nature, 461(7268), 1248–1253. https://doi.org/10.1038/nature08473
Mandal, M., & Breaker, R. R. (2004). Gene regulation by riboswitches. Nature Reviews Molecular Cell Biology, 5(6), 451–463. https://doi.org/10.1038/nrm1403
Lyngsø, R. B. (2004). Complexity of Pseudoknot Prediction in Simple Models (pp. 919–931). https://doi.org/10.1007/978-3-540-27836-8_77
Mathews, D. H., Sabina, J., Zuker, M., & Turner, D. H. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology, 288(5), 911–940. https://doi.org/10.1006/jmbi.1999.2700
Hartmann, B., & Lavery, R. (1996). DNA structural forms. Quarterly Reviews of Biophysics, 29(4), 309–368. https://doi.org/10.1017/S0033583500005874
Hagerman, P. J. (1988). Flexibility of DNA. Annual Review of Biophysics and Biophysical Chemistry, 17(1), 265–286. https://doi.org/10.1146/annurev.bb.17.060188.001405
Sims, J., & Benz, E. W. (1980). Initiation of DNA replication by the Escherichia coli dnaG protein: evidence that tertiary structure is involved. Proceedings of the National Academy of Sciences, 77(2), 900–904. https://doi.org/10.1073/pnas.77.2.900
Nussinov, R., & Jacobson, A. B. (1980). Fast algorithm for predicting the secondary structure of single-stranded RNA. Proceedings of the National Academy of Sciences, 77(11), 6309–6313. https://doi.org/10.1073/pnas.77.11.6309
Wang, J. C. (1979). Helical repeat of DNA in solution. Proceedings of the National Academy of Sciences, 76(1), 200–203. https://doi.org/10.1073/pnas.76.1.200

-- NGUYEN Doan-Dai , on September 12th, 2024

An Ising model in RNA folding (part 1)

1. Introduction

1.1. Related works

1.2. Organisation

2. Preliminaries

2.1. Nucleotide alphabet

2.2. Secondary structure

2.3. Base-pair counting model and base-pair stacking model

3. Ising model for nucleic acids

4. Statements of results

References

Some other posts

Private feature learning for single-index models (part 3)

Private feature learning for single-index models (part 2)

Private feature learning for single-index models (part 1)