Transgene Sequence Design Technology

Tomoshi KAMEDA
Senior Researcher
Artificial Intelligence Research Center,
National Institute of Advanced Industrial Science and Technology (AIST)

What This Technology Enables

It enables controlling (increasing or decreasing) protein expression levels by modifying genetic sequences.

Example of ApplicationUseful aromatic compounds

Introduction to This Technology

In bio-based material production using microorganisms, it is sometimes performed to introduce heterologous genes into the target microorganism in order to have the microorganism produce proteins it does not naturally possess. In that case, to improve the production level of the target protein, the process of properly designing the DNA sequence of the transgene (codon optimization) is important. Conventional research on codon optimization is conducted on microorganisms used for research purposes, which are easy to handle in experiments, such as E. coli. For microorganisms for industrial purposes, such as actinomycetes, there has been no established method for codon optimization. We have developed a new codon optimization method by extracting rules through information analyses, of large-scale protein production experiment data owned by the National Institute of Advanced Industrial Science and Technology (AIST), and demonstrated its effectiveness in Rhodococcus sp.¹ This method can be applied to material production in various hosts other than actinomycetes, and we have actually confirmed its effectiveness. In addition, since designed genetic sequences contain mutations only at the head of the original sequence, they can be synthesized at a low experimental cost (Fig. 1).

In this study,¹ we analyzed information on protein production experimental data in Rhodococcus sp. owned by AIST. The data contained information on the gene sequences of 204 genes, with a link to their protein production levels. By calculating various sequence feature values from the genetic sequences, the correlation between protein expression levels and sequence feature values was evaluated. As a result, the stability of mRNA secondary structures at the head of the sequence and the sequence feature values used as the Codon Adaptation Index (CAI) showed a high correlation with protein production levels. Based on this result, we have developed a new codon optimization method that designs sequences to prevent the formation of secondary structures and also to have high CAI values. In this method, first, for the original genetic sequence, all combinations of DNA sequences in which sequences for amino acids of the protein are identical and only codon usage patterns are modified are computationally generated. Then, for each sequence generated, the stability of the mRNA secondary structure at the head of the sequence and the CAI are calculated. Using the obtained values, the optimum sequence, which has the CAI greater than the specified threshold and is least likely to form a secondary structure, is sought. In general, there are a huge number of DNA sequences with a modified codon usage pattern for a certain genetic sequence. Therefore, such a search task requires a very long calculation time and is not practical. This time, the information analysis of the data owned by AIST has revealed that the head of genetic sequences is particularly important for protein expression levels. Based on this finding, we were able to accelerate the calculation by narrowing the search range to only the head of sequences and to realize this method. Thus, we have succeeded in developing a novel codon optimization method, by closely linking experimental data acquisition and information analysis.

In order to verify the effectiveness of this method, we designed the sequences of 12 genes, and introduced the designed sequences into Rhodococci to evaluate protein production levels. For each gene, 6 types of sequences that improve protein production levels and 3 types of sequences that decrease protein production levels were designed by variously changing the CAI threshold setting, and these sequences were compared with 9 wild-type sequences. Each comparison was repeated three times in a large-scale verification (12 genes × 10 sequences × 3 times = 360 experiments). As a result, we succeeded in improving the production levels of 9 genes (75%) in comparison with the wild-type sequences, and also identified the optimum threshold setting for the CAI (Fig. 2). In particular, regarding 5 genes whose wild-type sequences lead to low protein production levels, an improvement in production levels by the sequence design was observed in all genes (100%) (Fig. 3). This result indicates that this method is particularly effective for proteins that are difficult to produce. Regarding the sequences considered to decrease protein production levels, the production level decreased in all the 12 genes, as expected. In addition, possessing mutations only at the head of the designed sequences compared to the wild type has another advantage of this method (Fig. 1). This allows designed sequences to be synthesized not by a high-cost method, such as full-length gene synthesis, but by a very simple and inexpensive method, using only PCR with primers carrying mutation. Furthermore, this method can be applied to various microorganisms other than Rhodococcus sp. by adjusting the CAI to that of the target microorganism, and we have actually confirmed its effectiveness.

図1．情報技術による遺伝子配列設計でタンパク質生産量を向上

図2．遺伝子配列設計によるタンパク質生産量向上の有効性検証

図3．野生型配列でのタンパク質生産量が少ない遺伝子の配列設計による改善例

Technology Introduction

Transgene Sequence Design Technology

What This Technology Enables

Introduction to This Technology

References