KidneyNetwork: Using Kidney-derived Gene Expression Data To Predict And Prioritize Novel Genes Involved in Kidney Disease Ⅰ
Aug 07, 2023
Genetic testing in patients with suspected hereditary kidney disease may not reveal the genetic cause for the disorder as potentially pathogenic variants can reside in genes that are not yet known to be involved in kidney disease. We have developed Kidney Network, which utilizes tissue-specific expression to inform candidate gene prioritization specifically for kidney diseases. KidneyNetwork is a novel method constructed by integrating a kidney RNA-sequencing co-expression network of 878 samples with a multi-tissue network of 31,499 samples. It uses expression patterns and established gene-phenotype associations to predict which genes could be related to what (disease) phenotypes in an unbiased manner. We applied KidneyNetwork to rare variants in exome sequencing data from 13 kidney disease patients without a genetic diagnosis to prioritize candidate genes. KidneyNetwork can accurately predict kidney-specific gene functions and (kidney disease) phenotypes for disease-associated genes. The intersection of prioritized genes with genes carrying rare variants in a patient with kidney and liver cysts identified ALG6 as a plausible candidate gene. We strengthen this plausibility by identifying ALG6 variants in several cystic kidney and liver disease cases without alternative genetic explanations. We present KidneyNetwork, a publicly available kidney-specific co-expression network with optimized gene-phenotype predictions for kidney disease phenotypes.

We designed an easy-to-use online interface that allows clinicians and researchers to use gene expression and co-regulation data and gene-phenotype connections to accelerate advances in hereditary kidney disease diagnosis and research.
TRANSLATIONAL STATEMENT: Genetic testing in patients with suspected hereditary kidney disease may not reveal the genetic cause for the patient’s disorder. Potentially pathogenic variants can reside in genes not yet known to be involved in kidney disease, making it difficult to interpret the relevance of these variants. This reveals a clear need for methods to predict the phenotypic consequences of genetic variation in an unbiased manner. Here we describe KidneyNetwork, a tool that utilizes tissue-specific expression to predict kidney-specific gene functions. Applying KidneyNetwork to a group of undiagnosed cases identified ALG6 as a candidate gene in cystic kidney and liver disease. In summary, KidneyNetwork can aid the interpretation of genetic variants and can therefore be of value in translational nephrogenesis and help improve the diagnostic yield in kidney disease patients.

CLICK HERE TO GET CISTANCHE FOR KIDNEY DISEASE
INTRODUCTION
Genetic testing in patients with suspected hereditary kidney disease can reveal causative pathogenic variants in kidney-related genes. However, in many cases, a genetic cause cannot yet be detected. Pathogenic variants in known kidney-related genes are detected in approximately 10–30% of genetically tested patients with chronic kidney disease of any cause [1–3]. However, these percentages are likely underestimations of the number of patients with a monogenic cause as variants in genes not yet implicated in kidney disease will go unnoticed. Potentially harmful variants can reside in these genes, which makes it difficult to prioritize and interpret the relevance of these variants. Therefore, in the current era of genomic medicine, one of the main challenges after a negative diagnostic result in known genes is to detect and prioritize new candidate genes with potentially pathogenic variants that can explain the patient’s disease [4].
RNA-sequencing data can be used to predict candidate disease genes [5]. We recently developed GeneNetwork and the GeneNetwork-Assisted Diagnostic Optimization (GADO) method to prioritize new candidate disease genes based on RNAsequencing data [6]. The idea behind this method is that certain rare disorders can be caused by variants in several genes. While these genes are different, they usually have similar biological functions. When studying gene expression data from a large number of samples, these disease genes usually show strong co-expression [6]. Thus, if there are other genes that are strongly co-expressed with known rare disease genes, it is possible that variants in these other genes can also cause the same disease.
For this kind of tool to work optimally, the co-expression information should be as accurate as possible. For GADO, we built a gene co-expression network based on publicly available RNAsequencing datasets from many different tissues and used this network to predict which genes might be causing rare diseases. These predictions were trained using the human phenotype ontology (HPO) database [7]. In the HPO database, genes are assigned to phenotypes ‒ called HPO-terms ‒ that are based on gene-disease annotations and disease symptoms present in the OMIM [8] and Orphanet [9] databases. By integrating the information from the HPO database with the gene co-expression network, we could calculate prediction scores for each gene per HPO term. Together, these scores constitute GeneNetwork. GADO then prioritizes genes by combining an input list of HPO terms that describe the patient’s phenotype with a list of genes with possible deleterious variants from that patient. The prioritization of the gene list is based on the combined gene prediction scores for the input HPO terms [6].
Because we observed that GeneNetwork’s prediction performance for kidney-related HPO phenotypes was limited, we sought to improve prediction by developing a kidney-specific network. We did this by using 878 kidney RNA-sequencing samples that we enriched with an existing dataset of 31,499 samples from other tissues [6]. By developing a new prediction algorithm that can weigh the information that is present within both datasets we improved performance for kidney-related pathways. In this paper, we present the resulting KidneyNetwork, a co-expression network that can be used to accurately predict gene-phenotype associations of genes unknown for kidney-related HPO terms. As proof of principle, we applied KidneyNetwork to exome sequencing data from a group of patients with previously unresolved kidney diseases.
METHODS
To improve the prediction of kidney-related phenotypes, we collected kidney-derived RNA-sequencing data, updated GeneNetwork with more recent reference databases and improved statistical analyses, followed by integration of tissue-specific information.
Datasets in KidneyNetwork
RNA-sequencing data from selected kidney samples of several origins, including primary, tumor and fetal tissue were combined with an existing dataset of multi-tissue RNA-sequencing used as the foundation for our previously described GeneNetwork [6] (Table S1, S2). We chose to include the multi-tissue dataset for two reasons. First, we needed a sufficient number of samples to build a baseline network. Second, we wanted to preserve expression that is specifific to several, or all, kidney cell types but not to other tissues. We did this because gene-phenotype scores are based on differences in expression between samples; if all genes have high (or low) expression in all the samples included in the analysis, they will not add sufficient information to the prediction algorithm. The multi-tissue dataset of human RNA-sequencing samples used to develop GeneNetwork was re-used and processed as described previously [6]. After pre-processing, this dataset contained 31,499 samples and 56,435 genes.
3,194 Kidney-derived RNA-sequenced samples were downloaded from the European Nucleotide Archive (ENA) and the Genotype-Tissue Expression (GTEx) Project (Note S1). Preprocessing of the kidney dataset was done similarly to the multi-tissue dataset [6] (Note S2, Note S3). After sample and gene selection, 58,283 genes and 878 kidney samples remained. We investigated the remaining 878 RNA-sequencing samples using the UMAP clustering algorithm (Note S4).
HPO filtering. For the construction of KidneyNetwork, we used gene-phenotype associations from the HPO database [7] version 1268. In the HPO database, annotation of genes to HPO-defined phenotypes is based on the gene-disease annotations in the OMIM [8] morbid map (downloaded March 26, 2018) and the Orphanet [9] “en_product6.xml” file version 1.3.1. Gene‒disease annotations in these databases can be based on several factors, including statistical associations and large-scale copy number variations. We wanted to train KidneyNetwork using only genes for which the link between gene and the rare disease is well established. Therefore, we excluded the multigenic syndromes, since it is often not clear which of the genes in the copy number variants contribute to which phenotypes. We also excluded mere susceptibility genes (Note S5).
Expression normalization. After sample and gene quality control (QC), the expression matrix of the remaining samples and genes was log2- transformed and gene counts were normalized using DESeq following the median of ratios method. We then corrected the gene expression data for covariates (Note S6).

Decomposition
After filtering and QC of the entire dataset, the next step was to perform a decomposition to calculate the eigenvectors of the dataset (Note S7). For both GeneNetwork and the gene regulatory network based on kidney-derived data, we defined the optimal number of components (Note S8). The first 165 eigenvectors for GeneNetwork and the first 170 eigenvectors for the kidney-derived data were identified and merged into a larger matrix containing all 335 eigenvectors.
Gene‒HPO-term score calculation The gene-phenotype score calculation was done in several steps (Fig. S5). First, we performed a logistic regression using the combined eigenvectors and the gene‒phenotype annotations file as input. We used the resulting β values and the eigenvector scores to calculate a gene log-odds score for every gene in every eigenvector (Note S9).

To avoid overfitting of the gene log-odds-scores of already annotated genes, we applied a leave-one-out cross-validation approach (Note S10). The log odds were subsequently translated to gene z-scores using a permuted null distribution for each phenotype (Note S11).
To determine prediction accuracy, we calculated the area under the ROC curve (AUC). The AUC was calculated per HPO term using the predicted gene z-scores and known annotations. The significance of the predictions was calculated using the two-sided Mann-Whitney rank test. After Bonferroni correction, a prediction was considered signifificant at p < 0.05.
Comparison of prediction performance
We compared the prediction performance of four distinct networks: (1) the original GeneNetwork, (2) the updated GeneNetwork, (3) the kidneyspecifific gene regulatory network based solely on kidney-derived samples, and finally (4) KidneyNetwork, which combines the latter two. The quality of the HPO predictions made by these networks was assessed based on the AUC for each kidney-related phenotype (Table S3). Improved quality of a network was defined as improved prediction accuracy for kidney-related terms that were signifificantly predicted in each comparison of two networks and by an increased number of signifificantly predicted kidney-related terms. The significance of improvement in prediction accuracy of one network versus another was assessed using the DeLong test [10] integrated in the pROC R package [11].
Application of KidneyNetwork to 13 patients with suspected hereditary kidney disease One of the applications of KidneyNetwork is to prioritize candidate genes in patients with unsolved kidney disease. To evaluate this clinical application, we used KidneyNetwork to prioritize candidate genes for patients with various kidney diseases using the GADO method [6]. GADO combines the gene prediction z-scores rendered through KidneyNetwork for a given set of HPO terms. Genes with a combined z-score ≥5 for the unique set of HPO terms associated with each patient were considered potential candidate genes for that patient.

Fig. 1 UMAP visualization of the kidney-derived expression data. 878 samples were grouped into three main clusters: healthy primary tissue (middle and bottom), developmental samples (left) and renal cell carcinoma (RCC) samples (right). On the left side of the figure, the clustering of pluripotent stem cell (PSC)-derived podocytes and PSC-derived organoids with primary fetal samples and nephron progenitor cells can be seen. On the right side, RCC samples cluster close to proximal tubule samples, and the RCC cluster closest to healthy primary tissue samples consists of nonclear cell RCC (nccRCC) samples. In the middle and at the bottom, healthy primary kidney samples cluster based on their tissue of origin.
The 13 patients included in the study were all suspected to have monogenic kidney disease but had no genetic diagnosis (Note S12). HPOterms were assigned to these cases based on their phenotype. For each patient, the complete exome sequencing data were analyzed using CAPICE [12] to identify potentially pathogenic variants. Genes containing variants with a gnomAD Popmax filtering AF [13] <0.005 and a recall ≥99%, corresponding with a mild CAPICE cut-off of ≥0.0027, were considered interesting candidates.
Overlapping the genes identified by the KidneyNetwork integration in GADO with those identified by CAPICE resulted in a list of genes for each patient. These genes and variants in these genes were manually reviewed by a nephrogenesis expert panel (AMvE, LRC, NVAMK) for their pathogenetic potential based on population metrics, prediction tools, available literature, and segregation (Note S13). For the resulting candidate gene, additional patients carrying variants in the same gene were identified via collaborators and the 100,000 Genomes Project [14]. Also, the GeneMatcher tool [15] was used and yielded no additional patients through February 15th, 2023.
Identifification of additional patients The previously described unsolved polycystic kidney and liver disease cohort [16] was used to assess rare variants (Note S14). We used Fisher’s exact test to compare the frequency of identified variant(s) to the European subset of non-Finnish Europeans in the gnomAD database [17]. Furthermore, we used the 100,000 Genomes Project [14] for the identifification of additional patients based on the identified variant(s) (Note S15).

RESULTS
Data retrieval and sample clustering We selected 878 kidney samples (Fig. S2), which we clustered and plotted using the UMAP algorithm (Fig. 1). Generally, the data clusters into three main clusters: primary non-tumor kidney data, kidney developmental samples, and proximal tubule, glomerulus and renal cell carcinoma (RCC) samples.
KidneyNetwork improves gene-phenotype predictions First, we updated GeneNetwork with the updated HPO database (Fig. S6) and optimized the gene network building pipeline (Fig. S7). These changes yielded an improvement in the general GeneNetwork compared to the previous version (Fig. S8). We then used the improved pipeline to build the kidney-specific gene regulatory network. As expected, given the small sample size, this version of the kidney-specific network performed less well than GeneNetwork (Fig. S9). Subsequently, combining GeneNetwork and the kidney specifific gene co-expression network into KidneyNetwork yielded our best results for kidney-related HPOterms (Fig. 2A; Table S5). The prediction AUC, precision, sensitivity and f1-scores for each predicted pathway are provided (Table S6)
Two examples of improved kidney-related HPO-terms are hypomagnesemia and tubulointerstitial abnormality (Fig. 2B). Visualization of these phenotypes in density plots shows higher prioritization z-scores for known disease-related genes compared to non-annotated genes. For unknown genes, the higher the prediction z-score, the more likely they are to be a candidate disease gene. Visualizing the gene interaction networks of known disease genes based on the prediction scores again shows the increase in the number and strength of interactions obtained using KidneyNetwork compared to GeneNetwork.

Fig. 2 KidneyNetwork performs better for kidney-related HPO terms than the updated GeneNetwork. A 27% of kidney-related phenotypes are predicted signifificantly better using KidneyNetwork, as compared to GeneNetwork. B Density plots of the gene prediction scores within two of the most improved phenotypes, hypomagnesemia, and tubulointerstitial abnormality, show higher prediction values for the genes annotated for the phenotype and also predict potential unknown candidate genes. The networks predicted using KidneyNetwork show more and stronger correlations between the annotated genes than the networks predicted using GeneNetwork.
We also saw an increase in the number of signifificant predicted kidney-related HPO-terms for KidneyNetwork (n = 71) compared to GeneNetwork (n = 63). This led us to hypothesize that KidneyNetwork predicts kidney-related terms with higher accuracy overall and is therefore capable of predicting more kidney-related phenotypes with higher significance. A paired t-test shows that overall, the HPO AUC score was signifificantly better for KidneyNetwork versus GeneNetwork (mean AUC: 0.76 versus 0.74; t-test p-value: 4.5 × 10−8 ). This result suggests that KidneyNetwork predicts more kidneyspecifific HPO-terms with a higher prediction accuracy than GeneNetwork.
KidneyNetwork prioritizes ALG6 as a candidate disease gene in patients with kidney cysts and liver cysts To examine the clinical utility of KidneyNetwork, we prioritized genes for 13 patients with a suspected hereditary kidney disease but no genetic diagnosis and intersected these with genes containing potentially pathogenic variants. The resulting gene lists contained 1‒4 candidate genes for 9 of the 13 patients (Table S7). In one patient (SAMPLE6), manual curation of this list identified ALG6 (ALG6 alpha-1,3-glucosyltransferase) as a potential candidate gene to explain the patient’s kidney and liver cysts (Fig. 3). The combined z-score for ALG6 for the imputed HPOterms was signifificant in KidneyNetwork after multiple testing correction (z = 5.43). This gene would have been missed if we had used GeneNetwork: there ALG6 did not reach the significance threshold of z-score ≥5.

ALG6 as a candidate gene for patients with kidney and liver cysts The ALG6 variant c.680 + 2 T > G carried by SAMPLE6 is heterozygous. This is a known pathogenic splice site variant that results in congenital disorder of glycosylation (CDG) type Ic when pathogenic variants are present on both alleles [18, 19]. ALG6 strongly resembles ALG8 which has been implicated in kidney and liver cyst phenotypes [20], and according to KidneyNetwork, ALG6 and ALG8 are highly co-regulated (z-score = 8.59).
Given this biological plausibility, we queried a cohort of 120 unrelated cases of polycystic kidney and liver disease for rare variants, MAF < 0.001, in ALG6. This cohort is minorly updated since it was previously described and has been excluded by exome sequencing analysis for loss of function mutations or reported pathogenic non-truncating variants in PKD1, PKD2, PRKCSH, SEC63, GANAB, ALG8, ALG9, SEC61B, PKHD1, or DNAJB11 [16]. Three unrelated cases (YU372, YU378, YU481) carried rare ALG6 variants; each had the same ALG6 c.257 + 5 G > A noncanonical splice variant known to be pathogenic for ALG6-CDG and splice-altering in vitro [19, 21]. Despite a shared mutation, these three cases each report no known affected family members were enrolled from different states across the United States, and are unrelated to the best limit of detection using the VCFtools relatedness2 algorithm with Relatedness_PHI < 0.005.
Given the representation of this variant in three cases of European ancestry in this phenotypically-defifined cohort, we compared its frequency in the European subset of cases (n = 105) to non-Finnish Europeans in gnomAD [17] with coverage at this position (n = 64,466) [17]. In the patient cohort 3 out of 210 alleles contained this variant, while in gnomAD, a cohort unselected with regards to kidney or liver cyst burden, it was found in 121 of 128,932 alleles. This approximately 10-fold enrichment is statistically signifificant by Fisher's exact test, p = 0.0011. This mutation was also recurrent in cases of ALG6-CDG [19].
We also investigated the 100,000 Genomes Project dataset [14] and contacted collaborators who identified three additional data for application to kidney diseases. A signifificant proportion of patients with a suspected genetic kidney disease remain without a genetic diagnosis, as lists of disease genes for many conditions are incomplete. Identifying which genes are involved in kidney disease is essential for improving the diagnostic yield in kidney disease patients and for studying disease pathogenesis to approach treatment avenues. Establishing novel disease genes requires careful biological validation. Implicating genes worthy of such investigations is critical. Application of KidneyNetwork in conjunction with WES or GWAS data by nephrologists, clinical geneticists, or researchers will help each of these groups to participate in gene implication. KidneyNetwork combines a co-expression network based on a kidney sample dataset with the previously published multi-tissue dataset used to build GeneNetwork. Combining the datasets into KidneyNetwork improved phenotype predictions related to kidney disease when compared to networks based on the two datasets separately. As proof of principle, we show that the candidate gene list for the combined phenotype of kidney and liver cysts generated by KidneyNetwork prioritized a manageable list of candidate genes from a long list of genes containing rare variants in our patient with this phenotype.
We also investigated the 100,000 Genomes Project dataset [14] and contacted collaborators which identified three additional patients with kidney and/or liver cysts carrying a heterozygous potentially deleterious variant in ALG6, without an alternative genetic explanation.

Fig. 3 KidneyNetwork incorporated in the GADO method in SAMPLE6, a patient with renal and hepatic cysts. 89 candidate genes out of all genes were prioritized by KidneyNetwork using GADO, based on the HPO-terms “Renal cysts” (HP:0000107) and “Hepatic cysts” (HP:0001407). Exome-sequencing data interpretation method CAPICE yielded 322 genes containing potentially pathogenic variants in the patient’s exome sequencing data. When overlapping these gene lists three genes were identified that met the selection criteria, one being ALG6.
In total, we identified seven patients with known splice site variants that were reported to be disease-causing in severely affected CDG patients upon homozygosity or compoundheterozygosity and one patient with a likely pathogenic splice site variant (Table 1). In contrast to the severely affected ALG6-CDG patients (presenting with multi-organ involvement including developmental delay and multiple neurological symptoms), our patients presented with a phenotype of multiple kidney cysts and/ or liver cysts (Fig. 4). While PCLD can be extensive, the kidney phenotype seems to be mild with no eGFR decline reported despite advanced age (i.e. one patient is in her thirties, the others are between 45 to 80 years old). Furthermore, we found that the ALG6 variant segregated in a few family members that were also affected (Table 1; Fig. 4).
Supportive Service:
Email:wallence.suen@wecistanche.com
Whatsapp/Tel:+86 15292862950
Shop:
https://www.xjcistanche.com/cistanche-shop






