Transcriptome Assembly And Gene Discovery Of Cistanche Deserticola Fleshy Stem-Ⅰ

Background

Cistanche deserticola is a completely non-photosynthetic parasitic plant of great medicinal value that is mainly distributed in the deserts of Northwest China. Its dried fleshy stem is an important tonic in traditional Chinese medicine, used mainly to improve male sexual function and strengthen immunity, but few mechanistic studies have been conducted, partly because genomic and transcriptomic resources are lacking.

Results

In this study, we performed deep transcriptome sequencing of the fleshy stem of C. deserticola, and about 80 million reads were generated using Illumina paired-end sequencing on the HiSeq 2000 platform. Using the Trinity assembler, we obtained 95,787 transcript sequences with lengths ranging from 200 bp to 15,698 bp, an average length of 950 bases, and an N50 length of 1,519 bases. A total of 63,957 transcripts were identified as actively expressed (FPKM ≥ 0.5), of which 30,098 were annotated with gene descriptions or gene ontology terms by sequence similarity analyses against several public databases (UniProt, Nr and Nt at NCBI, and KEGG). Furthermore, we identified key enzyme genes involved in the biosynthesis of lignin and phenylethanoid glycosides (PhGs), which are known to be the primary active ingredients. Four genes encoding phenylalanine ammonia-lyase (PAL), the first key enzyme in lignin and PhG biosynthesis, were identified based on sequence comparison and phylogenetic analysis. Two biosynthesis pathways of PhGs were also proposed for the first time.

Conclusions

In all, we completed a global analysis of the C. deserticola fleshy stem transcriptome using RNA-seq technology. A collection of enzyme genes related to the biosynthesis of lignin and phenylethanoid glycosides was identified from the assembled and annotated transcripts, and the PAL gene family was also predicted. The sequence data from this study will provide a valuable resource for future research on phenylethanoid glycoside biosynthesis and for functional genomic studies in this important medicinal plant.

Introduction

Cistanche is a worldwide genus of perennial desert plants in the Orobanchaceae family. C. deserticola is a completely non-photosynthetic holoparasitic species that usually grows underground. It parasitizes the roots of the psammophyte Haloxylon ammodendron (Chenopodiaceae), which mainly inhabits deserts and semi-deserts owing to its high tolerance to drought and salinity. C. deserticola shows strong resistance to harsh environmental conditions and is mainly distributed in Northwest China, especially in Inner Mongolia, Gansu, and Xinjiang. It has been considered an endangered wild species in recent years because of increased human consumption. C. deserticola is often called desert ginseng and is commonly known as desert-broomrape, and its dried fleshy stem has been used extensively as an important traditional tonic in China and Japan for many years. It was first recorded in Shen Nong Ben Cao Jing (Dictionary of Chinese Materia Medica, 1977) approximately 1800 years ago and is regarded as one of the main sources of the Chinese medicinal herb Cistanche.

Extracts of C. deserticola possess a wide range of medicinal functions, including improving sexual function, tonifying the kidneys, protecting the liver, and enhancing memory, as well as aperient, immunomodulatory, antioxidative, anti-inflammatory, and antiviral activities. The major bioactive components of C. deserticola are phenylethanoid glycosides (PheGs, PhGs). To date, more than 20 phenylethanoid glycosides have been isolated from the succulent stem of C. deserticola. Among them, acteoside and echinacoside are the two main components with significant pharmacological activities, and they are documented as the quality standards for C. deserticola in the Chinese Pharmacopoeia (2005 and 2010 editions). The three chemical components of PhGs are an organic acid, a saccharide, and a phenylethanoid; however, the details of the phenylethanoid biosynthetic pathways in C. deserticola remain poorly understood.

Despite the commercial and medicinal importance of C. deserticola, genomic and transcriptomic data for this species are very limited. No ESTs are available in the NCBI database, and no complete genome information is available for this species apart from the chloroplast genome sequence. The limited transcriptomic data hinder the study of PhG biosynthetic mechanisms. RNA-seq technology can generate sequences of the expressed parts of the targeted genome and identify genes [18] using NGS platforms (such as Applied Biosystems SOLiD, Illumina HiSeq, and Roche 454). It is becoming increasingly popular for de novo transcriptome assembly because it is a cost-effective and powerful approach with high resolution and a broad dynamic range, and it is particularly well suited to detecting low-abundance transcripts. Because of these advantages, RNA-seq is especially attractive for non-model organisms with limited genetic resources. However, no detailed RNA-seq study of the C. deserticola transcriptome has been reported.

In this study, we globally sequenced the stem transcriptome of C. deserticola using the Illumina HiSeq 2000 platform and obtained 7.9 Gb of raw data. Through assembly and annotation, we mined the genes involved in PhG biosynthesis and the genes responsible for the entire lignin biosynthesis pathway. Our RNA-seq analysis generated the first C. deserticola consensus transcriptome and provided new insights toward a comprehensive understanding of the medicinal value of C. deserticola. Additionally, the method described here can be widely applied to profile transcriptomes and facilitate the discovery of genes involved in the biosynthesis of specific medicinal components in other medicinal plants with very limited genomic resources.

Materials and Methods

Plant material collection

The fresh succulent stem of C. deserticola in the excavation stage was collected from a plant base in BayanHot City of Alxa League in Inner Mongolia in northwestern China. The collecting permit was obtained from the owner (HongKui CongRong Group) of the plant base. The voucher specimen was deposited in the Core Genomic Facility at the Beijing Institute of Genomics, Chinese Academy of Sciences. After cleaning, the succulent stem tissues were cut into small pieces, immediately frozen in liquid nitrogen, and stored at -80°C until further processing.

RNA extraction, cDNA library construction, and Illumina sequencing

Total RNA was extracted from the succulent stem using TRIzol Reagent (Invitrogen Inc., California, USA) according to the manufacturer's instructions. The resulting samples were treated with DNase I to remove any genomic DNA. Extracted RNAs were quantified using an Agilent 2100 Bioanalyzer (Agilent Technologies) and checked for integrity using denaturing agarose gel electrophoresis with ethidium bromide staining. RNA samples with A260/A280 ratios between 1.9 and 2.1, RNA 28S:18S ratios higher than 1.0, and RNA integrity numbers (RINs) ≥ 8.5 were used in subsequent analyses.

The RNA-seq libraries were generated using Illumina TruSeq RNA Sample Preparation Kits. Poly(A)+ RNA was isolated from total RNA using Dynal oligo(dT)25 beads according to the manufacturer's instructions. Following purification, a fragmentation buffer was added to break the mRNA into short fragments. First-strand cDNA was synthesized using these short fragments as templates, along with SuperScript III reverse transcriptase and N6 random hexamer primers. Second-strand cDNA was then synthesized using buffer, dNTPs, RNase H, and DNA polymerase I. The resulting double-stranded cDNA was subjected to end repair using T4 DNA polymerase, DNA polymerase I Klenow fragment, and T4 polynucleotide kinase, and ligated to adapters using T4 DNA ligase. Adaptor-ligated fragments were purified using a QIAquick PCR extraction kit and eluted with EB buffer. After analysis by agarose gel electrophoresis, suitable fragments were selected as templates for PCR amplification. The resulting cDNA library was sequenced on an Illumina HiSeq 2000 system.

Transcripts de novo assembly and gene expression quantification

Raw reads generated from sequencing were cleaned by removing the adaptor sequences (ATCTCGTATGCCGTC) using an in-house method. We then carried out a stringent low-quality filtering process. First, bases with a Phred quality score lower than 20 were trimmed from the 3' end of the sequence until a base of higher quality (≥ 20) was reached. If the trimmed read was shorter than 50 bp, it was discarded. Second, reads were further filtered by the criterion that 70% of the bases in a read have high-quality scores (≥ 20). Third, only paired-end reads were used for assembly. De novo transcript assembly was conducted using Trinity release_20130216 [30], which consists of three successive software modules: Inchworm, Chrysalis, and Butterfly. The assembly parameters were set as follows: --seqType fq --JM 300G --min_contig_length 200 --CPU 20 --inchworm_cpu 20 --bflyCPU 20.
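The in-house filtering code is not given in the text, so the following is only a minimal Python sketch of the three filtering rules as described above. It assumes Phred+33 quality encoding and reads the 70% rule as "at least 70% of bases have Q ≥ 20"; the function names and constants are hypothetical and are not taken from the authors' pipeline.

```python
# Illustrative sketch only; not the authors' in-house filter.
PHRED_OFFSET = 33       # assumption: Phred+33 encoded quality strings
MIN_QUAL = 20           # quality cutoff stated in the text
MIN_LEN = 50            # minimum read length after trimming (bp)
MIN_HIGH_QUAL_FRACTION = 0.70  # assumed meaning of the "70% of bases" rule


def trim_3prime(seq, qual):
    """Trim bases with Phred < MIN_QUAL from the 3' end.

    Trimming stops at the first base (scanning from the 3' end)
    whose quality is >= MIN_QUAL.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - PHRED_OFFSET < MIN_QUAL:
        end -= 1
    return seq[:end], qual[:end]


def passes_filters(seq, qual):
    """Apply the length rule and the high-quality-fraction rule."""
    if len(seq) < MIN_LEN:
        return False
    high = sum(1 for q in qual if ord(q) - PHRED_OFFSET >= MIN_QUAL)
    return high / len(seq) >= MIN_HIGH_QUAL_FRACTION


def filter_pair(read1, read2):
    """Return the trimmed (seq, qual) pair, or None if either mate fails.

    Dropping the whole pair keeps only intact paired-end reads.
    """
    t1 = trim_3prime(*read1)
    t2 = trim_3prime(*read2)
    if passes_filters(*t1) and passes_filters(*t2):
        return t1, t2
    return None
```

Dropping a pair when either mate fails is one way to satisfy the "only paired-end reads" rule; the original method may have handled orphaned mates differently.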

To quantify transcript abundance, the sequenced paired-end reads were re-aligned to the assembled transcripts using a script in Trinity. Mapped reads were used for quantification with RSEM (RNA-Seq by Expectation Maximization) software. Gene or isoform abundance was represented by the fragments per kilobase of transcript per million mapped fragments (FPKM) value, and transcripts with an FPKM value equal to or larger than 0.5 were defined as expressed.
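The FPKM definition can be illustrated with a short Python sketch. This is a simplification: RSEM estimates expected fragment counts and uses an effective transcript length, whereas the hypothetical helper below uses plain counts and the raw transcript length. The 0.5 cutoff is the expression threshold reported in the Results.

```python
# Simplified FPKM illustration; RSEM's own calculation differs in detail.
EXPRESSION_CUTOFF = 0.5  # FPKM threshold reported in the Results


def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def is_expressed(fragments, length_bp, total_mapped_fragments):
    """True if the transcript reaches the expression cutoff."""
    return fpkm(fragments, length_bp, total_mapped_fragments) >= EXPRESSION_CUTOFF
```

For example, 100 fragments on a 1,000 bp transcript in a library of one million mapped fragments corresponds to an FPKM of 100.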

Functional annotation of expressed transcripts

There are no gene annotation sets for C. deserticola except for the chloroplast genome [1]. We annotated the expressed transcripts by comparing them to the GenBank Nt, GenBank Nr, and TAIR10_pep_20101214_updated datasets separately using the BLAST program (E ≤ 1e-20). In addition, all expressed transcripts were translated into potential proteins according to ORF prediction by TransDecoder, and conserved domains were predicted based on the Pfam database.
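The E-value filter can be sketched in Python as follows. The sketch assumes the BLAST results are available in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, E-value in column 11, bit score in column 12); the text does not state which output format was used. Keeping one best hit per query by bit score is also an assumption, and the function name is hypothetical.

```python
# Illustrative sketch; the output format and best-hit rule are assumptions.
E_VALUE_CUTOFF = 1e-20  # cutoff stated in the text


def best_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} for hits passing the cutoff.

    When a query has several passing hits, the one with the highest
    bit score is kept.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > cutoff:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

A transcript would then count as annotated if it has a passing hit in any of the three subject databases.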

Gene Ontology and KEGG pathway annotation

Based on sequence similarity alignment to the UniProt database, the Gene Ontology (GO) annotation of all assembled transcripts was obtained using an association file downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz. GO term clustering of expressed genes was conducted using custom scripts, and we annotated genes at the fourth level for the CC, BP, and MF categories separately.

KEGG pathway information was assigned for all predicted protein sequences using the online tool KAAS (KEGG Automatic Annotation Server) [34]. Sequences in FASTA format were submitted to KAAS, and the resulting files containing all pathway information related to the C. deserticola stem transcriptome were downloaded. Gene data sets from 13 plant organisms in KEGG were used for annotation with the BBH (bi-directional best hit) method.
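The idea behind a bi-directional best hit can be shown with a small Python sketch. It illustrates only the reciprocal-hit principle; KAAS applies its own scoring and thresholds, which are not reproduced here. The function name and the input dictionaries are hypothetical.

```python
# Conceptual sketch of reciprocal best hits; not the KAAS implementation.
def bidirectional_best_hits(query_to_ref, ref_to_query):
    """Return query/reference pairs that are each other's best hit.

    query_to_ref maps each query sequence to its best hit in the
    reference set; ref_to_query maps each reference sequence to its
    best hit among the queries.
    """
    return {
        query: ref
        for query, ref in query_to_ref.items()
        if ref_to_query.get(ref) == query
    }
```

A query whose best reference hit points back to a different query is dropped, which makes the assignment more conservative than a one-directional best hit.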

RT-qPCR analysis

After digestion with DNase I, approximately 5 μg of total RNA was converted into first-strand cDNA by reverse transcription with oligo(dT)15 primers and the GoScript Reverse Transcription System (Promega). The cDNA products were then diluted 10-fold with nuclease-free deionized water before use as templates in real-time PCR. Specific cDNAs were amplified with the GoTaq 2-Step RT-qPCR system (Promega) in a volume of 20 μL. PCR amplification was performed at an annealing temperature of 60°C on the 7500 Real-Time PCR Detection System (Applied Biosystems) according to the manufacturer's instructions. Relative transcript abundances were calculated by the comparative cycle threshold method with gene "comp10579_c0" as an internal standard, using the 7500 Manager software.
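The comparative cycle threshold calculation can be sketched in Python as below. The sketch uses the common 2^-ΔCt and 2^-ΔΔCt forms and assumes roughly 100% amplification efficiency; the exact settings used in the instrument software are not given in the text, and the function names are hypothetical.

```python
# Illustrative comparative Ct sketch; assumes ~100% PCR efficiency.
def relative_abundance(ct_target, ct_reference):
    """Abundance of a target relative to the internal standard (2^-dCt)."""
    return 2.0 ** -(ct_target - ct_reference)


def fold_change(ct_target_sample, ct_ref_sample,
                ct_target_calibrator, ct_ref_calibrator):
    """Fold change of a target between a sample and a calibrator (2^-ddCt)."""
    ddct = ((ct_target_sample - ct_ref_sample)
            - (ct_target_calibrator - ct_ref_calibrator))
    return 2.0 ** -ddct
```

For instance, a target that crosses the threshold two cycles later than the internal standard has a relative abundance of 0.25.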

Primer pairs for RT-qPCR were designed using the online software Primer3 (http://primer3.ut.ee/) and are listed in the S1 Dataset.

Results

RNA sequencing and de novo transcriptome assembly of C. deserticola fleshy stem

The stem of C. deserticola has been used extensively as an important traditional tonic in China and Japan for many years. To obtain a global overview of gene expression in the C. deserticola fleshy stem, we collected C. deserticola stem samples from the same plant base in 2013 and 2014. Total RNAs were extracted and poly(A)+ RNAs were purified for constructing paired-end RNA-seq libraries. A total of 79,433,734 and 86,019,176 paired-end reads, corresponding to nearly 8 billion and 8.6 billion bases of sequence, were obtained on the Illumina HiSeq 2000 sequencing platform for the 2013 and 2014 samples, respectively (Table 1). After removing adaptor sequences and filtering low-quality reads (see details in Methods), 64,831,040 high-quality paired-end reads from the 2013 sample were used for de novo transcriptome assembly. Using the Trinity sequence assembler [30], 51,719 genes and 95,787 transcript sequences were generated, with transcript lengths ranging from 200 bp to 15,698 bp. The average length of the assembled transcripts is 950 bases and the N50 length is 1,519 bases. The length distribution showed that 57.32% of the assembled transcripts were about 500 bp or longer (Fig 1A). High-quality paired-end reads from the 2014 sample were mapped to the assembled transcriptome. In addition, the number of transcripts per assembled gene varied: 69% of genes had one expressed isoform, while 31% of genes expressed two or more transcripts (Fig 1B).
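The N50 statistic reported above can be computed from a list of transcript lengths. The following is a minimal Python sketch using the usual definition (the length of the transcript at which the cumulative length, counted from the longest transcript down, first reaches half of the total assembly length); the function name is hypothetical.

```python
# Minimal N50 sketch using the standard definition.
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")
```

Because long transcripts contribute more to the cumulative length, N50 is typically larger than the mean, as with the 1,519-base N50 and 950-base average reported here.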

Expression quantification and functional annotation of assembled transcripts

Gene or transcript abundance was quantified using the RSEM package, in which the sequenced reads were re-aligned to the assembled gene or transcript sequences using Bowtie, and the mapped reads were used for quantification. The FPKM value for each gene or transcript was calculated, and we identified 63,957 and 52,857 actively expressed transcripts (FPKM ≥ 0.5) in the C. deserticola fleshy stem samples from 2013 and 2014, respectively. A total of 44,776 transcripts (70.01% of the 2013 sample, 84.71% of the 2014 sample) were expressed in both replicates, and the correlation of their expression data (Pearson correlation coefficient: 0.91979) is shown in S1 Fig. The raw sequencing data have been deposited in the NCBI SRA database (accession numbers: SRX857402 and SRX858938). We used the expressed genes identified in the 2013 sample for further analysis. Functional annotation information for all expressed transcripts was obtained using two methods. First, all expressed transcripts were aligned to known nucleotide (GenBank Nt) and peptide sequence databases (GenBank Nr and Arabidopsis peptide) separately using the BLAST algorithm. Out of 63,957 expressed transcripts, 29,220 (45.7%) were annotated and showed homology to sequences in any of the three subject databases with an E-value cutoff of 1e-20. Second, the candidate coding regions of all expressed transcript sequences were predicted using TransDecoder software, and the longest ORF for each transcript was used for the Pfam domain search. As a result, 21,358 (33.4%) transcripts were annotated based on the Pfam database. Overall, 30,098 (47.1%) transcripts were significantly matched to known genes in the public databases by combining the two methods. The complete list of expressed transcripts with functional annotation is provided in the supplemental data (S2 Dataset).
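As a quick consistency check, the percentages quoted above follow directly from the reported transcript counts. The Python sketch below only restates that arithmetic; the variable and function names are hypothetical.

```python
# Arithmetic check of the reported percentages, using counts from the text.
EXPRESSED_2013 = 63_957
EXPRESSED_2014 = 52_857
SHARED = 44_776
BLAST_ANNOTATED = 29_220
PFAM_ANNOTATED = 21_358
COMBINED_ANNOTATED = 30_098


def percent(part, whole, digits=2):
    """Percentage of `part` in `whole`, rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


shared_2013 = percent(SHARED, EXPRESSED_2013)               # share of 2013 set
shared_2014 = percent(SHARED, EXPRESSED_2014)               # share of 2014 set
blast_pct = percent(BLAST_ANNOTATED, EXPRESSED_2013, 1)     # BLAST annotation
pfam_pct = percent(PFAM_ANNOTATED, EXPRESSED_2013, 1)       # Pfam annotation
combined_pct = percent(COMBINED_ANNOTATED, EXPRESSED_2013, 1)
```

The combined figure is only slightly higher than the BLAST-only figure, which indicates that most Pfam-annotated transcripts also had a BLAST hit.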

We surveyed the 20 most highly expressed transcripts (Table 2), which correspond to 18.99% of all sequencing reads, and found that most of them are genes that respond to abiotic stress stimuli. The most highly expressed gene encodes a dehydrin (DHN), a member of a class of hydrophilic and thermostable stress proteins with a high number of charged amino acids that belong to the Group II Late Embryogenesis Abundant (LEA) family. Three different dehydrin transcripts (comp28713_c0_seq1/2/4) were detected as highly expressed in fleshy stems and may be involved in protecting cells from damage caused by drought stress. Other stress-related genes, such as those encoding heat shock protein, pathogen-related protein, and metallothionein, were also highly expressed, which may be related to the plant's harsh environment. Additionally, some constitutive genes, including the 26S ribosomal RNA gene (comp22329_c2_seq1), auxin-repressed/dormancy-associated protein (comp20999_c0_seq1), and ADP-ribosylation factor (comp20499_c0_seq1), were also highly transcribed.
