Building PANDA and LIONESS Regulatory Networks from GTEx Gene Expression Data in R

Introduction

In this vignette, we will demonstrate the functionalities of netZooR.

PANDA

PANDA Overview

PANDA (Passing Attributes between Networks for Data Assimilation) is a method for constructing gene regulatory networks. It uses message passing to find congruence between 3 different data layers: protein-protein interaction (PPI), gene expression, and transcription factor (TF) motif data.

More details can be found in the published paper https://doi.org/10.1371/journal.pone.0064832.

Running a single PANDA analysis

Load some libraries. We use the data.table library for reading in large datasets as it is more efficient.

library(netZooR)
library(data.table)
install.packages("visNetwork",repos = "http://cran.us.r-project.org")
#> 
#> The downloaded binary packages are in
#>  /var/folders/dc/zj7mjqq93034_xd2wszf4jjw0000gn/T//RtmpHoGYyJ/downloaded_packages
library(visNetwork) # to visualize the networks

# point R to your python 3 installation. Make sure that this is the installation that has all the required python libraries (numpy, scipy, etc) installed. netZooR uses a python implementation of PANDA under the hood.
#use_python("/usr/bin/python3") # for example, you can check the installation with py_config()

The previous command is necessary to bind R to Python since we are calling PANDA from Python because netZooPy has an optimized implementation of PANDA. Check this tutorial for an example using a pure R implementation of PANDA. Now we locate our ppi and motif priors. The ppi represents physical interactions between transcription factor proteins, and is an undirected network. The motif prior represents putative regulation events where a transcription factor binds in the promotor of a gene to regulate its expression, as predicted by the presence of transcription factor binding motifs in the promotor region of the gene. The motif prior is thus a directed network linking transcription factors to their predicted gene targets. These are small example priors for the purposes of demonstrating this method.

The ppi and motif priors are available in our AWS public bucket, and can be downloaded into current working directory.

Let’s download and take a look at the priors:

options(timeout=600) # set timeout for file download
download.file("https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/motif_GTEx.txt","motif_GTEx.txt")
download.file("https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/ppi_GTEx.txt","ppi_GTEx.txt")
motif <- read.table("./motif_GTEx.txt") 
ppi <- read.table("./ppi_GTEx.txt")
ppi[1:5,]
#>    V1   V2    V3
#> 1 AHR  AHR 1.000
#> 2 AHR ALX1 0.179
#> 3 AHR ALX3 0.179
#> 4 AHR ALX4 0.179
#> 5 AHR   AR 0.811
motif[1:5,]
#>    V1              V2 V3
#> 1 AHR ENSG00000004897  1
#> 2 AHR ENSG00000005339  1
#> 3 AHR ENSG00000006071  1
#> 4 AHR ENSG00000006194  1
#> 5 AHR ENSG00000006283  1

Now we locate out expression data.

As example, We will use a portion of the GTEx (Genotype-Tissue Expression) version 7 RNA-Seq data, read in the expression data and the list of LCL samples. Then parse the expression data.

We can either

downlaod the file GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct from from https://gtexportal.org/home/datasets or in our AWS bucket and place it in the folder “expressionData”. We will initially use the LCL RNA-seq data to create a regulatory network for this cell line. Later, we will also generate a regulatory network for whole blood for comaprison.

Here, we use the expression data and sample ids file copy from our AWS bucket.

download.file("https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct","GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct")
# load the GTEx expression matrix
expr <- fread("./GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct", header = TRUE, skip = 2, data.table = TRUE)
# remove the transcript ids so that the genes match the gene ids in the tf-motif prior
expr$Name<-sub("\\.[0-9]","", expr$Name)

#downlooad and load the sample ids of LCL samples
download.file("https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/LCL_samples.txt","LCL_samples.txt")
lcl_samples <-fread("./LCL_samples.txt", header = FALSE, data.table=FALSE)

#select the columns of the expression matrix corresponding to the LCL samples.
lcl_expr <- expr[,union("Name",intersect(c(lcl_samples[1:149,]),colnames(expr))), with=FALSE]

#determine the number of non-NA/non-zero rows in the expression data. This is to be able to ensure that PANDA will have enough values in the vectors to calculate pearson correlations between gene expression profiles in the construction of the gene co-exression prior.
zero_na_counts <- apply(lcl_expr, MARGIN = 1, FUN = function(x) length(x[(!is.na(x)| x!=0) ]))

#maintain only genes with at least 20 valid gene expression entries
clean_data <- lcl_expr[zero_na_counts > 20,]

#write the cleaned expression data to a file, ready to be passed as an argument to the PANDA algorithm.
write.table(clean_data, file = "pandaExprLCL.txt", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE)

Download our pre-processed pandaExprLCL.txt from our AWS S3 Bucket.

download.file("https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/pandaExprLCL.txt","pandaExprLCL.txt")

Now we run PANDA, pointing it to the parsed expression data, motif prior and ppi prior.

panda_results_LCL <- pandaPy(expr_file = "./pandaExprLCL.txt" , motif_file = "./motif_GTEx.txt", ppi_file = "./ppi_GTEx.txt", modeProcess="legacy", remove_missing = TRUE)
#> Loading motif data ...
#>   Elapsed time: 0.02 sec.
#> Loading expression data ...
#>   Elapsed time: 0.26 sec.
#> Loading PPI data ...
#> Number of PPIs: 86609
#>   Elapsed time: 0.02 sec.
#> Remove expression not in motif:
#>    55689 rows removed from the initial 56202
#> Remove motif not in expression data:
#>    58955 rows removed from the initial 103400
#> Remove ppi not in motif:
#>    0 rows removed from the initial 86609
#> new case
#> Calculating coexpression network ...
#>   Elapsed time: 0.00 sec.
#> Creating motif network ...
#>   Elapsed time: 0.01 sec.
#> Creating PPI network ...
#>   Elapsed time: 0.03 sec.
#> legacy ./motif_GTEx.txt ./pandaExprLCL.txt ./ppi_GTEx.txt False True False
#> Normalizing networks ...
#>   Elapsed time: 0.02 sec.
#> Saving expression matrix and normalized networks ...
#>   Elapsed time: 0.00 sec.
#> Running PANDA algorithm ...
#> step: 0, hamming: 0.9678742295275564
#> step: 1, hamming: 0.398740485692637
#> step: 2, hamming: 0.28448582363703867
#> step: 3, hamming: 0.2345843939954457
#> step: 4, hamming: 0.2042075476717615
#> step: 5, hamming: 0.18118973861477952
#> step: 6, hamming: 0.1614704894054021
#> step: 7, hamming: 0.14360782392336247
#> step: 8, hamming: 0.1271104933911665
#> step: 9, hamming: 0.11185402126446312
#> step: 10, hamming: 0.09783344721350776
#> step: 11, hamming: 0.08507016957437542
#> step: 12, hamming: 0.07356780340154848
#> step: 13, hamming: 0.06330116711146468
#> step: 14, hamming: 0.05421852200951187
#> step: 15, hamming: 0.04624764701825898
#> step: 16, hamming: 0.039301895859622696
#> step: 17, hamming: 0.03328718090872966
#> step: 18, hamming: 0.028107462583496577
#> step: 19, hamming: 0.023668722559827687
#> step: 20, hamming: 0.01988152451165294
#> step: 21, hamming: 0.016662707998748628
#> step: 22, hamming: 0.013936416334063792
#> step: 23, hamming: 0.011634447928678703
#> step: 24, hamming: 0.009696184613626214
#> step: 25, hamming: 0.008068282125831978
#> step: 26, hamming: 0.006704156831568275
#> step: 27, hamming: 0.005563429715892798
#> step: 28, hamming: 0.0046113098212006735
#> step: 29, hamming: 0.003817972519694833
#> step: 30, hamming: 0.0031579704664075493
#> step: 31, hamming: 0.002609676049734357
#> step: 32, hamming: 0.0021547746826681815
#> step: 33, hamming: 0.0017778073470498855
#> step: 34, hamming: 0.0014657623367399968
#> step: 35, hamming: 0.00120771534642688
#> step: 36, hamming: 0.0009945161790090437
#> Running panda took: 1.70 seconds!

Let’s take a look at the results. The output contains a list of three data frames:

data frame containing the regulatory network (bipartite graph) with edge weights representing the “likelihood” that a transcription factor binds the promotor of and regulates the expression of a gene.
data frame odf the in-degrees of genes (sum of the weights of inbound edges around a gene)
data frame of the out-degrees of TFs (sum of the weights of outbound edges around a TF)

# the bipartite regulatory network
regNetLCL <- panda_results_LCL$panda
regNetLCL[1:5,]
#>     TF            Gene Motif       Score
#> 1  AHR ENSG00000049245     0 -0.66273059
#> 2 AIRE ENSG00000049245     1  3.91520334
#> 3 ALX1 ENSG00000049245     0  0.04430445
#> 4 ALX3 ENSG00000049245     0  0.09652412
#> 5 ALX4 ENSG00000049245     0 -0.06595998

# gene in-degree
inDegreeLCL <- panda_results_LCL$indegree
head(inDegreeLCL)
#>            Target Target_Score
#> 1 ENSG00000000005    -47.74480
#> 2 ENSG00000000419    -11.80864
#> 3 ENSG00000000457     38.85765
#> 4 ENSG00000000938     30.19761
#> 5 ENSG00000001036    -56.41923
#> 6 ENSG00000001084    292.77857

# TF out-degree
outDegreeLCL <- panda_results_LCL$outdegree
head(outDegreeLCL)
#>   Regulator Regulator_Score
#> 1       AHR      -172.47026
#> 2      AIRE       115.74408
#> 3      ALX1        83.45379
#> 4      ALX3       139.81888
#> 5      ALX4       119.41043
#> 6        AR      -146.19404

Run another PANDA analysis on Whole Blood Samples

Like the LCL expression data in previous section, we can either download the raw data and process;

#### skip this part if you already did same process in LCL expression data section
download.file("https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct","GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct")
# load the GTEx expression matrix
expr <- fread("./GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct", header = TRUE, skip = 2, data.table = TRUE)
# remove the transcript ids so that the genes match the gene ids in the tf-motif prior
expr$Name<-sub("\\.[0-9]","", expr$Name)
#####

#load the sample ids of Whole Blood samples
download.file("https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/WholeBlood_samples.txt","WholeBlood_samples.txt")
wblood_samples <-fread("./WholeBlood_samples.txt", header = FALSE, data.table=FALSE)

#select the columns of the expression matrix corresponding to the LCL samples.
wblood_expr <- expr[,union("Name",intersect(c(wblood_samples[1:149,]),colnames(expr))), with=FALSE]

#determine the number of non-NA/non-zero rows in the expression data. This is to be able to ensure that PANDA will have enough values in the vectors to calculate pearson correlations between gene expression profiles in the construction of the gene co-exression prior.
zero_na_counts_wblood <- apply(wblood_expr, MARGIN = 1, FUN = function(x) length(x[(!is.na(x)| x!=0) ]))

#maintain only genes with at least 20 valid gene expression entries
clean_data_wb <- wblood_expr[zero_na_counts_wblood > 20,]

#write the cleaned expression data to a file, ready to be passed as an argument to the PANDA algorithm.
write.table(clean_data_wb, file = "pandaExprWholeBlood.txt", sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE)

or download the whole blood expression data directly from AWS Bucket.

download.file("https://netzoo.s3.us-east-2.amazonaws.com/netZooR/tutorial_datasets/pandaExprWholeBlood.txt","pandaExprWholeBlood.txt")

#run PANDA
panda_results_wblood <- pandaPy(expr_file = "./pandaExprWholeBlood.txt" , motif_file = "./motif_GTEx.txt", ppi_file = "./ppi_GTEx.txt", modeProcess="legacy", remove_missing = TRUE)
#> Use the legacy mode to pre-process the input dataset and keep only the matched TFs or Genes
#> Loading motif data ...
#>   Elapsed time: 0.02 sec.
#> Loading expression data ...
#>   Elapsed time: 0.07 sec.
#> Loading PPI data ...
#> Number of PPIs: 86609
#>   Elapsed time: 0.02 sec.
#> Remove expression not in motif:
#>    55689 rows removed from the initial 56202
#> Remove motif not in expression data:
#>    58955 rows removed from the initial 103400
#> Remove ppi not in motif:
#>    0 rows removed from the initial 86609
#> new case
#> Calculating coexpression network ...
#>   Elapsed time: 0.00 sec.
#> Creating motif network ...
#>   Elapsed time: 0.01 sec.
#> Creating PPI network ...
#>   Elapsed time: 0.03 sec.
#> legacy ./motif_GTEx.txt ./pandaExprWholeBlood.txt ./ppi_GTEx.txt False True False
#> Normalizing networks ...
#>   Elapsed time: 0.02 sec.
#> Saving expression matrix and normalized networks ...
#>   Elapsed time: 0.01 sec.
#> Running PANDA algorithm ...
#> step: 0, hamming: 1.0690849805252594
#> step: 1, hamming: 0.4691397610863155
#> step: 2, hamming: 0.33036461786634547
#> step: 3, hamming: 0.26806926123263264
#> step: 4, hamming: 0.23042570772644422
#> step: 5, hamming: 0.20256536152502197
#> step: 6, hamming: 0.17929462709712923
#> step: 7, hamming: 0.15865735441197737
#> step: 8, hamming: 0.13989696240746843
#> step: 9, hamming: 0.12274921075135972
#> step: 10, hamming: 0.10712627123983814
#> step: 11, hamming: 0.09299263928203307
#> step: 12, hamming: 0.08031383413612993
#> step: 13, hamming: 0.0690361089114833
#> step: 14, hamming: 0.05908483808770143
#> step: 15, hamming: 0.05036857084130915
#> step: 16, hamming: 0.04278429695791332
#> step: 17, hamming: 0.036223979927264414
#> step: 18, hamming: 0.030579252521705833
#> step: 19, hamming: 0.025745150525848724
#> step: 20, hamming: 0.021622615090325713
#> step: 21, hamming: 0.018120119978282237
#> step: 22, hamming: 0.015154398422204278
#> step: 23, hamming: 0.012650788145759574
#> step: 24, hamming: 0.010543041908389054
#> step: 25, hamming: 0.008772952126201682
#> step: 26, hamming: 0.0072897621090774025
#> step: 27, hamming: 0.006049506006422676
#> step: 28, hamming: 0.0050143227470869035
#> step: 29, hamming: 0.004151769096928692
#> step: 30, hamming: 0.0034341712803001153
#> step: 31, hamming: 0.002838014367091252
#> step: 32, hamming: 0.0023433884769824866
#> step: 33, hamming: 0.0019334886365366035
#> step: 34, hamming: 0.0015941705108742305
#> step: 35, hamming: 0.0013135594305309585
#> step: 36, hamming: 0.0010817088184296895
#> step: 37, hamming: 0.0008903048282875369
#> Running panda took: 1.64 seconds!
#> ...Finish PANDA...

install.packages("visNetwork",repos = "http://cran.us.r-project.org",dependencies=TRUE)
library(visNetwork)
edges <- head(panda_results_wblood$panda[order(panda_results_wblood$panda$Score,decreasing = TRUE),], 500)
colnames(edges) <- c("from","to","motif","force")
nodes <- data.frame(id = unique(as.vector(as.matrix(edges[,c(1,2)]))))
nodes$group <- ifelse(nodes$id %in% edges$from, "TF", "gene")

net <- visNetwork(nodes, edges, width = "100%")
net <- visGroups(net, groupname = "TF", shape = "square",
                     color = list(background = "teal", border="black"))
net <- visGroups(net, groupname = "gene", shape = "dot",       
                     color = list(background = "gold", border="black"))
visLegend(net, main="Legend", position="right", ncol=1)

LIONESS

LIONESS (Linear Interpolation to Obtain Network Estimates for Single Samples) is a method for creating sample-specific networks. When applied to a PANDA regulatory network, the result is a set of gene regulatory networks, one for each sample in the gene expression dataset. More information on LIONESS can be found in the published paper: https://doi.org/10.1016/j.isci.2019.03.021

Running LIONESS with netZoo is simple, and very similar to running PANDA:

lionessLCL <- lionessPy(expr_file = "pandaExprLCL.txt" , motif_file = "./motif_GTEx.txt", ppi_file = "./ppi_GTEx.txt", modeProcess="legacy", remove_missing = TRUE)
#> Use the legacy mode to pre-process the input dataset and keep only the matched TFs or Genes

The result is a data frame in which the first colum contains TFs, the second column contains genes and each subsequent column contains the edge weight for that particular TF-gene pair in a particular sample.

lionessLCL[1:5,1:10]
#>     tf            gene            1            2            3            4
#> 1  AHR ENSG00000049245 0.068203.... -1.01318.... -0.11908.... -0.64714....
#> 2 AIRE ENSG00000049245 3.405967.... 3.606517.... 1.341021.... 3.962872....
#> 3 ALX1 ENSG00000049245 -1.56464.... -0.65310.... -2.58347.... 0.069443....
#> 4 ALX3 ENSG00000049245 -2.26576.... 0.437433.... -1.84254.... 0.169395....
#> 5 ALX4 ENSG00000049245 -2.30018.... 0.112392.... -1.83354.... 0.015880....
#>              5            6            7            8
#> 1 -4.72954.... -0.17908.... -0.54619.... -0.96282....
#> 2 10.14026.... 2.654841.... 4.577900.... 3.730394....
#> 3 1.548765.... -0.63456.... 0.829474.... 2.063450....
#> 4 7.957915.... -0.44872.... 0.339952.... 1.848870....
#> 5 5.195770.... -0.37209.... 0.187463.... 2.238460....