Constructing genotype-specific gene regulatory networks with EGRET
Deborah Weighill
2020-11-24
Source: vignettes/EGRET_toy_example.Rmd
Introduction
EGRET is a method for constructing individual-specific gene regulatory networks (GRNs) that take into account the genotype of the individual in question. EGRET combines multiple lines of evidence (see Figure 1 below) to predict the effect of an individual’s mutations on TF-to-gene edges and to construct a complete, individual-specific bipartite GRN. TF motifs are used to construct a prior bipartite network recording the presence or absence of TF motifs in the promoter regions of genes. This prior serves as an initial “guess” as to which TFs bind within the promoter regions of, and thus potentially regulate the expression of, which genes. The prior is then modified to account for individual-specific genetic information using the individual’s genotype, publicly available eQTL data, and computational predictions of the effects of variants on TF binding from QBiC [1].
For a given individual and a given prior edge connecting TF i to gene j, the edge weight is penalized if the individual has a genetic variant meeting three conditions: (1) the individual carries an alternate allele at a location within a TF binding motif in the promoter region of a gene, (2) the variant is an eQTL affecting the expression of the gene adjacent to the promoter, and (3) the variant is predicted by QBiC to affect the binding of the TF corresponding to the motif at that location. Each of these data types is essential for accurately capturing variant-derived regulatory disruptions. The altered prior is then integrated with gene expression data and protein-protein interaction information to refine the edge weights using the PANDA message-passing framework [2]. The message-passing algorithm uses the logic that if two genes are co-expressed, they are more likely to be co-regulated and thus more likely to be regulated by a similar set of TFs; conversely, if two proteins physically interact, they are more likely to bind promoter regions as a complex and thus more likely to regulate the expression of a similar set of genes. The result is an individual- and tissue-specific GRN that takes into account the genotype of the individual in question.
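The three-condition rule above can be sketched in a few lines of R. This is a minimal illustration, not EGRET's internal code: the column names (`snp`, `alt_count`, `effect`, and so on) are chosen here for readability, the last row of each table is hypothetical, and the penalty formula at the end is an assumption made for illustration only.

```r
# eQTLs lying in TF motifs within promoters. The first three rows use values
# from the toy eQTL file; the last row is hypothetical.
qtl <- data.frame(
  snp  = c("chr1_40563700", "chr16_68573287", "chr1_6296238", "chr2_12345"),
  tf   = c("ALX3", "GATA4", "ETS2", "TF_X"),
  gene = c("ENSG00000131238", "ENSG00000184939", "ENSG00000225077", "GENE_Y"),
  beta = c(-0.645817, 0.515920, -0.736242, 0.9),
  stringsAsFactors = FALSE
)

# Number of alternate alleles the individual carries at each variant
# (1|0 -> 1, 1|1 -> 2, 0|0 -> 0). The last row is hypothetical.
geno <- data.frame(
  snp       = c("chr1_40563700", "chr16_68573287", "chr1_6296238", "chr2_12345"),
  alt_count = c(1L, 2L, 2L, 0L),
  stringsAsFactors = FALSE
)

# QBiC-predicted effects on TF binding. The last row is hypothetical.
qbic <- data.frame(
  snp    = c("chr1_40563700", "chr16_68573287", "chr1_6296238", "chr2_12345"),
  tf     = c("ALX3", "GATA4", "ETS2", "TF_X"),
  gene   = c("ENSG00000131238", "ENSG00000184939", "ENSG00000225077", "GENE_Y"),
  effect = c(-0.72773, -2.56780, -0.59758, -1.2),
  stringsAsFactors = FALSE
)

# Inner joins keep only variants present in both the eQTL and QBiC tables,
# which covers conditions (2) and (3).
m <- merge(qtl, geno, by = "snp")
m <- merge(m, qbic, by = c("snp", "tf", "gene"))

# Condition (1): at least one alternate allele; QBiC effect must be disruptive.
m$penalize <- m$alt_count > 0 & m$effect < 0

# ASSUMPTION: one plausible way to scale the penalty by the evidence.
m$modifier <- ifelse(m$penalize, abs(m$beta * m$effect * m$alt_count), 0)

print(m[, c("tf", "gene", "alt_count", "penalize", "modifier")])
```

The hypothetical `TF_X` to `GENE_Y` edge is left untouched because the individual carries no alternate allele there, even though the variant is an eQTL with a predicted effect on binding.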
EGRET has been integrated into the netZooR package.

Install/load netZooR
If you do not have netZooR installed, you can install it from the development branch as follows:
#install.packages("devtools")
#devtools::install_github("netZoo/netZooR@devel")
Load the netZooR package:
library(netZooR)
EGRET input data
Get the example data sets
First download the example datasets:
system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/unittest_datasets/EGRET/toy_qbic.txt")
system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/unittest_datasets/EGRET/toy_genotype.vcf")
system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/unittest_datasets/EGRET/toy_motif_prior.txt")
system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/unittest_datasets/EGRET/toy_expr.txt")
system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/unittest_datasets/EGRET/toy_ppi_prior.txt")
system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/unittest_datasets/EGRET/toy_eQTL.txt")
system("curl -O  https://netzoo.s3.us-east-2.amazonaws.com/netZooR/unittest_datasets/EGRET/toy_map.txt")
Read in each of the data types.
qbic <- read.table("toy_qbic.txt", header = FALSE)
vcf <- read.table("toy_genotype.vcf", header = FALSE, sep = "\t", stringsAsFactors = FALSE, colClasses = c("character", "numeric", "character", "character", "character", "character", "character", "character", "character", "character"))
motif <- read.table("toy_motif_prior.txt", sep = "\t", header = FALSE)
expr <- read.table("toy_expr.txt", header = FALSE, sep = "\t", row.names = 1)
ppi <- read.table("toy_ppi_prior.txt", header = FALSE, sep = "\t")
qtl <- read.table("toy_eQTL.txt", header = FALSE)
nameGeneMap <- read.table("toy_map.txt", header = FALSE)
Let’s take a look at each of the inputs for EGRET:
Motif prior
The motif prior is a bipartite network represented as a three-column data frame. Each row represents an edge in the bipartite graph, with column 1 holding the source TF, column 2 the target gene, and column 3 the edge weight. The edge weight records the presence (edge weight = 1) or absence (edge weight = 0) of the motif corresponding to the TF in column 1 in the promoter region of the gene in column 2. Note that, to make TF nodes easy to distinguish from gene nodes, we name TFs by their TF name and genes by their Ensembl ID.
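To make the edge-list representation concrete, the sketch below converts a three-column prior into a TF-by-gene matrix. It is only an illustration: the six rows are the first rows of the toy motif prior, and the column names `tf`, `gene` and `weight` are labels chosen here, not names required by EGRET.

```r
# First six rows of the toy motif prior, with illustrative column names.
motif_edges <- data.frame(
  tf     = rep("AHR", 6),
  gene   = c("ENSG00000000419", "ENSG00000131238", "ENSG00000184939",
             "ENSG00000225077", "ENSG00000225697", "ENSG00000254184"),
  weight = c(0, 0, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

tfs   <- unique(motif_edges$tf)
genes <- unique(motif_edges$gene)

# Start from an all-zero TF x gene matrix and fill in the listed edges.
prior <- matrix(0, nrow = length(tfs), ncol = length(genes),
                dimnames = list(tfs, genes))
prior[cbind(motif_edges$tf, motif_edges$gene)] <- motif_edges$weight

print(prior)
```

A cell equal to 1 marks a TF whose motif occurs in the promoter of that gene, so this is the bipartite adjacency matrix the edge list encodes.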
head(motif)
#>    V1              V2 V3
#> 1 AHR ENSG00000000419  0
#> 2 AHR ENSG00000131238  0
#> 3 AHR ENSG00000184939  0
#> 4 AHR ENSG00000225077  0
#> 5 AHR ENSG00000225697  1
#> 6 AHR ENSG00000254184  0
Gene expression
The gene expression data consist of gene expression measurements (in this case TPMs from GTEx, https://gtexportal.org/home/datasets) across several individuals. They are represented as a data frame with rows corresponding to genes and columns corresponding to samples/individuals. The row names of the data frame should be the gene names (here, Ensembl IDs).
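As a rough illustration of how such a genes-by-samples table yields co-expression information, the sketch below computes pairwise Pearson correlations between genes. It uses only the first seven sample columns of the toy expression data, and it uses Pearson correlation as a generic co-expression measure; this is not a statement of how `runEgret` computes similarity internally.

```r
genes <- c("ENSG00000225077", "ENSG00000225697",
           "ENSG00000254184", "ENSG00000000419")

# First seven samples (V2..V8) of the toy expression data, one row per gene.
expr_sub <- matrix(
  c(0.2971,  0.2846,  1.141,  0.1843,  0.000,  0.05798,  0.2158,
    12.31,   16.54,   23.05,  15.47,   15.40,  14.58,    26.55,
    4.63,    2.159,   1.752,  3.884,   3.031,  1.119,    1.388,
    107.8,   100.6,   102.5,  91.34,   98.96,  93.49,    101.6),
  nrow = 4, byrow = TRUE,
  dimnames = list(genes, paste0("V", 2:8))
)

# cor() correlates columns, so transpose to get a gene x gene matrix.
coexpr <- cor(t(expr_sub))
print(round(coexpr, 2))
```

Because the row names carry the gene identifiers, the resulting matrix can be matched directly against the gene column of the motif prior.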
head(expr)
#>                       V2       V3      V4      V5     V6       V7       V8
#> ENSG00000225077   0.2971   0.2846   1.141  0.1843  0.000  0.05798   0.2158
#> ENSG00000225697  12.3100  16.5400  23.050 15.4700 15.400 14.58000  26.5500
#> ENSG00000254184   4.6300   2.1590   1.752  3.8840  3.031  1.11900   1.3880
#> ENSG00000000419 107.8000 100.6000 102.500 91.3400 98.960 93.49000 101.6000
#>                       V9    V10      V11      V12      V13     V14    V15
#> ENSG00000225077   0.2398  0.588   0.8401   0.3979   0.5166  0.5822  1.071
#> ENSG00000225697  15.2900 16.850  11.6900  15.8800  20.1300 20.3000 15.870
#> ENSG00000254184   1.6610  2.085   1.5290   4.7240   6.4270  5.0950  5.216
#> ENSG00000000419 113.5000 97.960 119.4000 109.5000 106.5000 96.1200 93.490
#>                     V16     V17      V18     V19      V20     V21     V22
#> ENSG00000225077   1.127   1.478   0.4451  0.4858  0.05161  0.3869  0.1505
#> ENSG00000225697  16.490  18.070  16.6800 20.8900 18.22000 19.2900  8.0150
#> ENSG00000254184   2.431   6.536   1.4430  1.4420  3.24300  1.6770  2.2760
#> ENSG00000000419 109.700 108.000 108.2000 99.9000 85.07000 86.7900 90.0500
#>                    V23     V24     V25      V26     V27     V28     V29
#> ENSG00000225077  0.592  0.5435  0.4381   0.3502  0.2018  0.3623  0.3682
#> ENSG00000225697 16.880 14.3600 10.6900  18.0300 13.5000 16.8800 23.9000
#> ENSG00000254184  4.467  3.7200  1.1110   4.3410  2.3460  3.6920  1.7830
#> ENSG00000000419 87.740 78.2700 98.1600 120.0000 91.9300 99.4400 72.2100
#>                      V30    V31     V32      V33     V34      V35      V36
#> ENSG00000225077  0.02132  0.537  0.3532   0.5425  0.3832   0.2543   0.4901
#> ENSG00000225697 10.23000  9.273 14.4900  13.0300 15.6300  12.1700  17.2500
#> ENSG00000254184  2.25800  2.391  4.9570   2.1820  2.1490   1.1500   5.5350
#> ENSG00000000419 85.47000 96.650 98.1600 109.7000 98.3200 119.1000 112.8000
#>                     V37     V38      V39     V40      V41      V42    V43
#> ENSG00000225077  0.3868  0.9251   0.5506  0.7583   0.6189   0.5083  1.092
#> ENSG00000225697 15.2500 15.9900  12.4800 14.2400  13.0500  11.1400 20.520
#> ENSG00000254184  3.8270  0.9408   1.8190  4.0410   1.1260   1.4920  2.400
#> ENSG00000000419 84.8600 79.2100 120.8000 81.0800 117.3000 111.6000 99.170
#>                     V44    V45     V46      V47      V48      V49     V50
#> ENSG00000225077  0.2399  0.216  0.1678   0.5535   0.4013   0.8003  0.1167
#> ENSG00000225697 14.8500 16.140 13.0300  14.6200  13.7400  18.6100 20.3900
#> ENSG00000254184  1.2660  1.166  2.3430   1.3400   6.4070   1.4990  6.9480
#> ENSG00000000419 86.6900 87.660 85.9000 112.6000 110.1000 112.4000 97.6500
#>                     V51     V52      V53      V54     V55     V56     V57
#> ENSG00000225077  0.2041  0.3714   0.4667  0.09785  0.3234  0.5766  0.2789
#> ENSG00000225697 29.9200 16.8600  12.4700 20.98000 15.5100 14.1200 17.3700
#> ENSG00000254184  3.9180  1.6030   4.2750  3.80800  2.1790  1.8850  1.2880
#> ENSG00000000419 92.9700 83.0500 107.3000 82.69000 92.9400 85.1800 97.9400
#>                     V58     V59     V60     V61    V62     V63     V64    V65
#> ENSG00000225077  0.2551  0.2882  0.8827  0.3583  0.515  0.2286  0.2668  0.056
#> ENSG00000225697 17.1500 10.9700 22.5500 17.3300 16.140 23.4900 19.9500 21.510
#> ENSG00000254184  1.2810  1.4260  5.3920  1.5840  1.465  2.9400  3.8630  1.676
#> ENSG00000000419 88.8000 80.1500 97.2900 91.3400 89.260 89.8000 88.9800 87.040
#>                      V66     V67     V68     V69    V70     V71    V72     V73
#> ENSG00000225077   0.8874  0.2598  0.1168  0.3319  0.896  0.3382  0.225  0.4426
#> ENSG00000225697  22.9000 27.5400 16.6000 22.1500 24.570 25.3100 16.820 21.5100
#> ENSG00000254184   1.6540  1.2210  4.0060  4.9990  1.116  1.2790  1.765  2.3060
#> ENSG00000000419 103.9000 79.9500 71.9600 87.3200 82.640 69.6100 87.550 85.7400
#>                    V74     V75     V76     V77     V78      V79      V80
#> ENSG00000225077  0.361  0.1127  0.1499  0.2601  0.4474  0.05797   0.4357
#> ENSG00000225697 13.370 16.0600 23.9300 18.7500 14.8600 10.36000  18.0100
#> ENSG00000254184  1.202  5.4350  1.9430  1.2170  1.1510  2.25100   4.2730
#> ENSG00000000419 81.560 87.8100 83.2700 81.9000 75.8600 71.91000 100.6000
#>                      V81    V82     V83     V84      V85     V86    V87     V88
#> ENSG00000225077  0.06602  0.000  0.4624  0.1525  0.09708  0.9362  0.334  0.1625
#> ENSG00000225697 17.62000 13.590 13.0300 16.6100 17.41000 18.9900 12.000 13.5000
#> ENSG00000254184  3.48400  1.479  2.7630  1.7230  1.76500  1.5880  1.322  1.6240
#> ENSG00000000419 96.09000 82.300 81.2600 87.0900 93.72000 86.4500 75.650 76.3000
#>                     V89     V90    V91    V92     V93    V94     V95    V96
#> ENSG00000225077  0.2919  0.3239  0.458  0.421  0.2439  0.441  0.4655  0.000
#> ENSG00000225697 15.4800 17.1700 14.140 13.530 16.3000 15.690 13.5900 19.420
#> ENSG00000254184  0.9428  1.1310  1.284  1.339  1.8400  2.323  1.7330  1.616
#> ENSG00000000419 88.1000 99.8700 80.830 95.850 85.3000 75.690 78.1500 77.850
#>                     V97     V98     V99    V100     V101     V102    V103
#> ENSG00000225077  0.3186  0.3548  0.3561  0.2471   0.4384   0.5253  0.3227
#> ENSG00000225697 17.7500 16.0900  9.3700 16.1800  19.0400  11.5100 18.4500
#> ENSG00000254184  1.5100  8.4720  2.4930  1.6390   1.2330   1.3960  5.2390
#> ENSG00000000419 96.0900 83.9500 99.3800 77.5200 119.9000 103.1000 75.2600
#>                     V104   V105   V106    V107    V108    V109     V110    V111
#> ENSG00000225077  0.06031  0.221  0.354  0.2368  0.1489   1.057  0.04143   0.413
#> ENSG00000225697 18.08000 13.780 13.430 18.0600 12.0400  19.120 14.06000  12.360
#> ENSG00000254184  1.61100  2.542  3.339  1.4330  1.4310   1.513  1.27100   1.151
#> ENSG00000000419 94.30000 75.860 81.830 98.7200 82.9200 104.100 79.95000 104.100
#>                    V112   V113    V114    V115    V116     V117    V118
#> ENSG00000225077  0.3183  0.174  0.1456  0.1846  0.2067   0.1562  0.1587
#> ENSG00000225697 18.4800 16.350 18.4700 12.5500 16.5900  15.2800 27.2400
#> ENSG00000254184  3.2670  1.876  1.0920  2.1620  3.7090   1.1590  1.5440
#> ENSG00000000419 83.3800 94.440 88.2100 98.3100 95.2800 100.6000 95.0900
#>                     V119    V120   V121     V122   V123   V124    V125    V126
#> ENSG00000225077   0.9566  0.6961  0.348   0.7355  0.314  0.252  0.5893  0.2411
#> ENSG00000225697  12.3900 19.0600 12.650  15.1700 19.300 16.770 15.1300 17.6100
#> ENSG00000254184   2.0510  2.0450  1.768   7.5050  3.593  4.007  6.0080  1.2290
#> ENSG00000000419 103.8000 80.6500 90.760 126.1000 96.740 85.190 81.0200 97.3700
#>                    V127    V128    V129     V130    V131
#> ENSG00000225077  0.1284  0.2762  0.4024   0.6954  0.7493
#> ENSG00000225697  8.4670 11.0600 16.3800  18.6600 16.6300
#> ENSG00000254184  1.3850  1.3430  2.0000   3.8730  1.5120
#> ENSG00000000419 73.6300 86.2600 84.8200 105.0000 99.0400
Protein-protein interaction (PPI) data
The PPI prior can be obtained from interaction databases such as STRING (https://string-db.org/). EGRET takes a PPI network of TFs as a data frame in which each row represents an edge, with columns 1 and 2 holding the two TF nodes and column 3 the interaction weight.
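The sketch below turns such an edge list into a symmetric TF-by-TF matrix, which is a convenient way to check that each interaction is recorded in both directions. The rows are the first six rows of the toy PPI file; the column names and the symmetrization step are assumptions made for illustration, since the source does not state whether EGRET requires both directions to be listed.

```r
# First six rows of the toy PPI prior, with illustrative column names.
ppi_edges <- data.frame(
  tf1    = c("AHR", "AHR", "AHR", "AHR", "ALX3", "ALX3"),
  tf2    = c("AHR", "ALX3", "GATA2", "GATA4", "AHR", "ALX3"),
  weight = c(1.000, 0.179, 0.194, 0.150, 0.179, 1.000),
  stringsAsFactors = FALSE
)

tfs <- sort(unique(c(ppi_edges$tf1, ppi_edges$tf2)))

# Fill a TF x TF matrix from the edge list (missing pairs stay at 0).
ppi_mat <- matrix(0, nrow = length(tfs), ncol = length(tfs),
                  dimnames = list(tfs, tfs))
ppi_mat[cbind(ppi_edges$tf1, ppi_edges$tf2)] <- ppi_edges$weight

# ASSUMPTION: treat interactions as undirected, so mirror any edge that was
# listed in one direction only, and give every TF a self-interaction of 1.
ppi_mat <- pmax(ppi_mat, t(ppi_mat))
diag(ppi_mat) <- 1

print(ppi_mat)
```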
head(ppi)
#>     V1    V2    V3
#> 1  AHR   AHR 1.000
#> 2  AHR  ALX3 0.179
#> 3  AHR GATA2 0.194
#> 4  AHR GATA4 0.150
#> 5 ALX3   AHR 0.179
#> 6 ALX3  ALX3 1.000
eQTL data
The eQTL data consist of eQTL variants that lie within a TF motif in the promoter region of the eGene. They are passed to EGRET as a data frame with the following columns: (1) TF corresponding to the motif in which the eQTL variant resides, (2) eGene adjacent to the promoter, (3) position of the eQTL variant, (4) chromosome on which the eQTL variant and eGene reside, and (5) beta value for the eQTL association. The eQTL data should come from the same cell type/tissue as the gene expression data and can be obtained from databases such as GTEx (https://gtexportal.org/home/datasets).
head(qtl)
#>      V1              V2       V3    V4        V5
#> 1  ALX3 ENSG00000131238 40563700  chr1 -0.645817
#> 2 GATA4 ENSG00000184939 68573287 chr16  0.515920
#> 3  ETS2 ENSG00000225077  6296238  chr1 -0.736242
#> 4 GATA2 ENSG00000225697 48671834  chr3  0.442918
#> 5 HOXB8 ENSG00000254184 72209959  chr7  1.365800
Individual genotype
The genotype data for the individual in question should be loaded as a VCF file. Columns of the VCF used include column 1 (chromosome), column 2 (variant position), column 4 (reference allele), column 5 (alternate allele) and column 10 (genotype).
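The genotype column encodes the two alleles as `0` (reference) or a non-zero index (alternate), separated by `|` (phased) or `/` (unphased). A minimal sketch of turning that field into an alternate-allele count is shown below; the helper name `count_alt_alleles` is hypothetical and is not part of netZooR.

```r
# Hypothetical helper: count alternate alleles in VCF GT strings.
count_alt_alleles <- function(gt) {
  alleles <- strsplit(gt, "[|/]")
  # Anything other than "0" (reference) or "." (missing) counts as alternate.
  vapply(alleles, function(a) sum(!a %in% c("0", ".")), integer(1))
}

# Genotype column (V10) of the toy VCF.
gt <- c("1|0", "1|1", "1|1", "1|1", "0|1")
alt_count <- count_alt_alleles(gt)
print(alt_count)
```

A count of 0 means the individual is homozygous for the reference allele, in which case the variant gives no reason to alter the corresponding prior edge.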
head(vcf)
#>      V1       V2 V3 V4 V5 V6   V7
#> 1  chr1 40563700  .  T  C  . PASS
#> 2 chr16 68573287  .  T  C  . PASS
#> 3  chr1  6296238  .  T  G  . PASS
#> 4  chr3 48671834  .  T  C  . PASS
#> 5  chr7 72209959  .  T  G  . PASS
#>                                                                                      V8
#> 1 MTD=cgi,bwa_freebayes,bwa_platypus,bwa_gatk,cortex,isaac_strelka;KM=11.95;KFP=0;KFF=0
#> 2  MTD=cgi,bwa_freebayes,bwa_platypus,bwa_gatk,cortex,isaac_strelka;KM=8.42;KFP=0;KFF=0
#> 3             MTD=isaac_strelka,bwa_freebayes,bwa_platypus,bwa_gatk;KM=8.96;KFP=0;KFF=0
#> 4        MTD=cgi,bwa_freebayes,bwa_platypus,isaac_strelka,bwa_gatk;KM=13.15;KFP=0;KFF=0
#> 5            MTD=isaac_strelka,bwa_freebayes,bwa_platypus,bwa_gatk;KM=14.29;KFP=0;KFF=0
#>   V9 V10
#> 1 GT 1|0
#> 2 GT 1|1
#> 3 GT 1|1
#> 4 GT 1|1
#> 5 GT 0|1
QBiC predictions
EGRET requires QBiC [1] to be run on the eQTL variants occurring in the individual(s) in question, in order to determine which TFs have binding that is potentially disrupted by the variant at the location of the variant. QBiC uses models trained on protein binding microarray (PBM) data to predict the impact of a given variant on TF binding at that location. Some of QBiC’s models are trained on non-human PBMs, so we require more stringent filtering (p < 1e-20) of QBiC predictions from non-human models. We also require the predicted effect on binding to be negative (i.e. disruption of binding). QBiC predictions are passed to EGRET in a data frame with the following columns: (1) variant, written as chr[num]_position, which occurs within a motif in a promoter, (2) TF predicted by QBiC to be impacted, (3) gene adjacent to the promoter, and (4) QBiC effect on binding. Note that multiple TFs can be predicted to have disrupted binding at a given variant.
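The filtering described above could look roughly like the following. This is a sketch under stated assumptions: the `p_value` and `human_model` columns are hypothetical stand-ins for whatever the raw QBiC output provides, all numeric values are made up for illustration, and no p-value cutoff is applied to human models because none is stated here.

```r
# Hypothetical raw QBiC predictions, before filtering.
raw_qbic <- data.frame(
  snp         = c("chr1_40563700", "chr1_40563700",
                  "chr16_68573287", "chr3_48671834"),
  tf          = c("ALX1", "ALX3", "GATA4", "GATA2"),
  gene        = c("ENSG00000131238", "ENSG00000131238",
                  "ENSG00000184939", "ENSG00000225697"),
  effect      = c(-0.7, 0.4, -2.5, -2.3),      # made-up effects on binding
  p_value     = c(1e-25, 1e-30, 1e-8, 1e-22),  # made-up p-values
  human_model = c(FALSE, TRUE, FALSE, TRUE),   # made-up model origin
  stringsAsFactors = FALSE
)

# Keep predictions that disrupt binding (negative effect) and, for
# non-human models, that also pass the stricter p < 1e-20 threshold.
keep <- raw_qbic$effect < 0 &
  (raw_qbic$human_model | raw_qbic$p_value < 1e-20)

# Reduce to the four columns EGRET expects.
qbic_filtered <- raw_qbic[keep, c("snp", "tf", "gene", "effect")]
print(qbic_filtered)
```

Here the ALX3 row is dropped because its predicted effect is positive, and the GATA4 row is dropped because it comes from a non-human model and fails the stricter threshold.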
head(qbic)
#>               V1    V2              V3       V4
#> 1  chr1_40563700  ALX1 ENSG00000131238 -0.72773
#> 2  chr1_40563700  ALX3 ENSG00000131238 -0.72773
#> 3  chr1_40563700  ALX4 ENSG00000131238 -0.72773
#> 4 chr16_68573287 GATA4 ENSG00000184939 -2.56780
#> 5   chr1_6296238  ETS2 ENSG00000225077 -0.59758
#> 6  chr3_48671834 GATA2 ENSG00000225697 -2.33451
Run EGRET
Set a tag for the EGRET run. The EGRET outputs will be labeled with this tag.
tag <- "my_toy_egret_run"
Call the runEgret function:
runEgret(qtl,vcf,qbic,motif,expr,ppi,nameGeneMap,tag)
#> [1] "Initializing and validating"
#> [1] "Verified sufficient samples"
#> [1] "Normalizing networks..."
#> [1] "Learning Network..."
#> [1] "Using tanimoto similarity"
#> [1] "Initializing and validating"
#> [1] "Verified sufficient samples"
#> [1] "Normalizing networks..."
#> [1] "Learning Network..."
#> [1] "Using tanimoto similarity"
EGRET output
EGRET produces two output GRNs: a genotype-specific “EGRET” network and a genotype-agnostic baseline network (equivalent to a PANDA network). Both are saved as .RData files whose names begin with the run tag.
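Once both networks are loaded, a natural first step is to subtract the baseline from the EGRET network and rank edges by how much they changed. The sketch below does this for two TF rows copied from the `regnetE` and `regnetP` output printed in this section, rather than for the loaded objects themselves; the variable names are illustrative.

```r
genes <- c("ENSG00000000419", "ENSG00000225077",
           "ENSG00000225697", "ENSG00000254184")
tfs <- c("ETS2", "GATA2")

# Two rows of the EGRET network (regnetE), copied from the printed output.
egret_sub <- matrix(
  c( 0.1649549,  2.9397079, -0.7637921, -0.2586327,
    -0.4868319, -0.7719010,  3.4413886, -0.9031585),
  nrow = 2, byrow = TRUE, dimnames = list(tfs, genes)
)

# The same two rows of the baseline PANDA network (regnetP).
panda_sub <- matrix(
  c(0.8361299, 0.8887604, 0.7134097, 0.5956296,
    0.8361299, 0.8887604, 0.7134097, 0.5956296),
  nrow = 2, byrow = TRUE, dimnames = list(tfs, genes)
)

# Edge-wise difference: positive values are edges strengthened in EGRET.
d <- egret_sub - panda_sub

# Reshape to a long table and sort by absolute change.
edge_diff <- data.frame(
  tf   = rownames(d)[row(d)],
  gene = colnames(d)[col(d)],
  diff = as.vector(d),
  stringsAsFactors = FALSE
)
edge_diff <- edge_diff[order(-abs(edge_diff$diff)), ]
print(head(edge_diff, 3))
```

In this subset the largest change is on the GATA2 to ENSG00000225697 edge, followed by ETS2 to ENSG00000225077; both TF-gene pairs appear in the toy eQTL and QBiC tables.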
load("my_toy_egret_run_egret.RData")
load("my_toy_egret_run_panda.RData")
head(regnetE)
#>       ENSG00000000419 ENSG00000225077 ENSG00000225697 ENSG00000254184
#> AHR        -1.7207724      -1.3665641       2.7951309      -2.3822209
#> ALX3       -1.3200757       1.1089799      -1.4428010       1.0835040
#> ETS2        0.1649549       2.9397079      -0.7637921      -0.2586327
#> GATA2      -0.4868319      -0.7719010       3.4413886      -0.9031585
#> GATA4       1.4572034       0.9393168      -2.8361430       1.0380431
#> HOXB8       1.3519842      -1.3298778      -1.8452988       2.3230026
head(regnetP)
#>       ENSG00000000419 ENSG00000225077 ENSG00000225697 ENSG00000254184
#> AHR        -1.8857284      -1.7359682       3.3731547      -2.7983875
#> ALX3       -1.3288906       1.4376131      -1.6112997       1.2258982
#> ETS2        0.8361299       0.8887604       0.7134097       0.5956296
#> GATA2       0.8361299       0.8887604       0.7134097       0.5956296
#> GATA4       1.3594558       1.2339543      -2.9404332       1.1808202
#> HOXB8       1.5258602      -1.4063417      -1.7945491       1.2819613
References
[1] Martin, V., Zhao, J., Afek, A., Mielko, Z. and Gordân, R., 2019. QBiC-Pred: quantitative predictions of transcription factor binding changes due to sequence variants. Nucleic Acids Research, 47(W1), pp. W127-W135. https://doi.org/10.1093/nar/gkz363
[2] Glass, K., Huttenhower, C., Quackenbush, J. and Yuan, G.C., 2013. Passing messages between biological networks to refine predicted interactions. PLoS ONE, 8(5), p. e64832. https://doi.org/10.1371/journal.pone.0064832