Integrating multi-omics for constructing transcriptional regulatory networks

Introduction

Understanding the molecular mechanisms underlying complex diseases or during the course of development has been a major goal in biomedical research. The regulation of the transcriptome, both at the transcriptional and post- transcriptional levels, is a key component of these mechanisms. However, the complexity of the regulatory networks, in particular as it involves multiple epigenetic players, renders the dissection of these mechanisms challenging. In particular, the most common may to construct a transcriptional regulatory network (TRN) involves using known interactions at a certain condition. However, this method is insufficient to explain the dynamics of the TRN in terms of the changes observed between biological conditions. In this vignette, we demonstrate how to use the IntegraTRN package to construct a TRN using multi-omics data.

The package IntegraTRN is designed to integrate differential analysis based on multiple omics data and construct a TRN based on both curated and predicted interactions that underlie the differential expression of genes. In the mean time, the curation of multi-omics data could be difficult, so the package provides a flexibility to allow users to decide what data can be integrated based on their own knowledge. The analysis pipeline can be primarily divided into two parts: 1) differential analysis of omics data; and 2) construction of the TRN. In this vignette, we will demonstrate how to use the package to construct a TRN using transcriptomic, small RNAomic, and chromatin accessibility data.

See help(package = "IntegraTRN") for more information about the package and citation("IntegraTRN") for how to cite the package. To download the package, please use the following command:

require("devtools")
devtools::install_github("j-y26/IntegraTRN", build_vignettes = TRUE)
library("IntegraTRN")

To list all available functions in the package, use the following command:

ls("package:IntegraTRN")

Data preparation

The package has provided datasets for demonstration purposes. The datasets are fully simulated or partially simulated based on real data (Aharon-Yariv et al. 2023). Here is an overview of the basic omics data used in this vignette:

data("RNAseq_heart")  # RNAseq count matrix
data("RNAseq_heart_samples")  # RNAseq sample information
data("smallRNAseq_heart")  # small RNAseq count matrix
data("smallRNAseq_heart_samples")  # small RNAseq sample information

These are RNAseq and small RNAseq data from human fetal heart tissues, as well as the information about the samples.

Please use ?<dataset> to see the details of each dataset.

We will now illustrate how to use the package to construct a TRN using these datasets.

Differential analysis of omics data

The first step of the analysis pipeline is to perform differential analysis separately on each omic data. During the differential analysis, we can retrieve insights into the “changes” that happen during the biological process separately for each omic data. This exploratory analysis is essential to understand what alterations in the transcriptome can explain difference in the phenotype of interest, and what variations in any of the regulatory players could potentially contribute to these changes.

Generate `MOList` object using multi omics data

To get started, we first need to generate a MOList object. The MOList object stands for multi-omics list, which is the core of entire analysis pipeline as it preserves our input data and the results of the differential analysis in a centralized object. The MOList object can be generated using the MOList function.

The MOList constructor has a large flexibility in terms of what input omics data are available. At the minimum, when constructing a MOList object for the first time, we need to provide the RNAseq data and the sample grouping information. For any count-based omics data (RNAseq, small RNAseq, and proteomics), the constructor takes a numeric matrix of counts, where each row represents a gene or a protein, and each column represents a sample. The grouping information can be any vector of the same length as the number of samples, in the same order to the columns of the count matrix. A continuous or a binary variable can be used. However, we need to make sure we are supplying the same grouping criteria for all omics data. Here we create a myMOList using only RNAseq and small RNAseq data, using the age of the samples as the grouping criteria for downstream differential analysis.

myMOList <- MOList(
  RNAseq = RNAseq_heart,
  RNAGroupBy = RNAseq_heart_samples$Age,
  smallRNAseq = smallRNAseq_heart,
  smallRNAGroupBy = smallRNAseq_heart_samples$Age
)
#> RNAseq:       RNAseq data has been added.
#> Small RNAseq: Small RNAseq data has been added.
#> Proteomics:   Optional proteomics data has NOT been provided.
#> ATACseq:      Optional ATAC peaks data has NOT been provided.

As long as a MOList object is constructed with RNAseq data, additional omics data can be added to the existing object using the same syntax. Any existing data can also be replaced by new data using the same syntax.

Specifically for ATACseq data, we need to provide the path to the peak files instead of the count matrix. Since most peak callers output peak files in BED format, we can simply provide a path to a BED file, for each condition. Our package has already provided two simulated peak files in the extdata folder.

atacPeak1Path <- system.file("extdata", "peak1.bed", package = "IntegraTRN")
atacPeak2Path <- system.file("extdata", "peak2.bed", package = "IntegraTRN")

A consensus peak file should be provided for each condition. When biological replicates are used for the ATACseq experiment, the peak files should be merged into one and filtered for blacklisted regions before being provided to the constructor (Amemiya, Kundaje, and Boyle 2019). Here, we simply use the previously created myMOList object to append the ATACseq data.

myMOList <- MOList(myMOList,
  ATACpeak1 = atacPeak1Path,
  ATACpeak2 = atacPeak2Path
)
#> Small RNAseq: Optional small RNAseq data has NOT been provided.
#> Proteomics:   Optional proteomics data has NOT been provided.
#> ATACseq:      ATAC peaks data has been added.

Differential analysis of omics data

Once the MOList object is created, we can perform differential analysis on each omics data separately. However, since differential analysis on count-based data is different that that of the epigenomic data, the process is slightly different.

Differential analysis of count-based omics data

For count-based omics data, we can use the diffOmics function to perform differential analysis. To simplify the process, the diffOmics function essentially performs differential expression analysis on all available count based omics data in the MOList object. Since we can specify to use either DESeq2 or edgeR for the differential analysis, our underlying assumption is that the expression data follows a negative binomial distribution (Love, Huber, and Anders 2014) (Robinson, McCarthy, and Smyth 2010). Furthermore, the differential expression analysis utilizes a generalized linear model (GLM) to allow adjustment on covariates. In particular, the function allows adjustment on the batch effect, which is a common issue in next-generation sequencing (NGS) data.

myMOList <- diffOmics(myMOList,
  rnaseqBatch = RNAseq_heart_samples$Batch,
  program = "DESeq2"
)
#> Performing differential analysis on RNAseq data
#> 
#> Performing differential analysis on smallRNAseq data
#> 
#> Processing peaks...

The batch argument follow the same criteria as the grouping information provided during the construction of the MOList object. Therefore, this argument is not limited to the batch effect, but can be any covariate that we want to adjust for.

# NOT run, just illustrate the usage of the batch argument
myMOList <- diffOmics(myMOList,
  smallRnaBatch = smallRNAseq_heart_samples$Sex,
  program = "edgeR"
)

Exploring differential gene expression analysis

To explore the results of the differential analysis, several visualization methods are provided. First, to see the distribution of differentially expressed mRNAs as well as exploring a optimal cutoff for defining genes as differentially expressed, we can visualize the RNAseq data using a volcano plot. We can specify the cutoffs for the adjusted p-value and the log2 fold change (log2FC) to define differentially expressed genes. Here, we define log2FC > 0.1 and adjusted p-value < 0.05, which is the default, as the cutoffs.

plotVolcano(myMOList, omic = "RNAseq", log2FC = 0.1)

Figure 1. Volcano plot of RNAseq data

The plot also directly indicates the number of up- and down-regulated genes.

How do we efficiently explore the raw differential expression results? The IntegraTRN package has conveniently defined a set of data structures, primarily S4 classes and methods, that allows users to easily explore the results. The previously mentioned MOList object essentially works as a list, which contains differential results stored in a DETag object for each omics data. This means that we can easily retrieve the differential expression results via simple subsetting.

myRNADETag <- myMOList$DERNAseq

A simple snap shot of the DETag object can be directly invoked.

myRNADETag
#> A DETag S4 object
#> Method for differential analysis:  DESeq2 
#> 
#> Number of genes:  5000 
#> Number of genes with adjusted p-value < 0.05:  3223 
#> 
#> A snap shot for differential analysis results:
#>            baseMean log2FoldChange       lfcSE        stat       pvalue
#> B4GALT3   958.13887  -0.0078640979 0.009959527 -0.78960552 4.297582e-01
#> SLC26A11  458.38254  -0.0019483551 0.009851067 -0.19778113 8.432163e-01
#> BEAN1     114.23147  -0.0006515378 0.020920562 -0.03114342 9.751552e-01
#> STAT3    2121.37501  -0.0343187743 0.008212107 -4.17904601 2.927345e-05
#> STEAP3     56.69544   0.0311278206 0.016709689  1.86286052 6.248189e-02
#> PIP5K1C  1552.90544  -0.0978374269 0.012828549 -7.62653887 2.411402e-14
#>                  padj
#> B4GALT3  4.955699e-01
#> SLC26A11 8.747057e-01
#> BEAN1    9.810414e-01
#> STAT3    8.402252e-05
#> STEAP3   9.057972e-02
#> PIP5K1C  3.367881e-13
#> 
#> To access the full results, use the exportDE S4 method

To further obtain the full results of the differential expression analysis, we can use the exportDE function. This function returns a data frame containing the outputs of DESeq2 or edgeR, with choice of using a consistent interface.

exportDE(myRNADETag, original = TRUE) %>%
    head(5)
#>            baseMean log2FoldChange       lfcSE        stat       pvalue
#> B4GALT3   958.13887  -0.0078640979 0.009959527 -0.78960552 4.297582e-01
#> SLC26A11  458.38254  -0.0019483551 0.009851067 -0.19778113 8.432163e-01
#> BEAN1     114.23147  -0.0006515378 0.020920562 -0.03114342 9.751552e-01
#> STAT3    2121.37501  -0.0343187743 0.008212107 -4.17904601 2.927345e-05
#> STEAP3     56.69544   0.0311278206 0.016709689  1.86286052 6.248189e-02
#>                  padj
#> B4GALT3  4.955699e-01
#> SLC26A11 8.747057e-01
#> BEAN1    9.810414e-01
#> STAT3    8.402252e-05
#> STEAP3   9.057972e-02

By having original = TRUE, the data frame remains exactly the same as the DESeq2 output.

Some times, we may also want to rank the genes to perform gene set enrichment analysis (GSEA). (Reimand et al. 2019) To this end, we can make use of the subclass of DETag, which is the TOPTag object. It automatically performs piValue transformation (Reimand et al. 2019) and ranking based on a user-specified criteria. For example, let’s retrieve genes that have at least 0.1 log2 fold change and select the top 200 ranked genes.

myRNATOPTag <- TOPTag(myRNADETag,
  pCutoff = 0.05,
  logFCCutoff = 0.1,
  topGenes = 200,
  direction = "both"
)

To see a snap shot:

myRNATOPTag
#> A TOPTag object.
#> The top 200 differential genes are selected.
#> 
#> A snapshot of the top up-regulated genes:
#>             logFC       pvalue         padj  piValue rank
#> SLC35F1 0.3168227 6.830735e-61 1.707684e-57 56.76759    1
#> JAM2    0.1955125 2.053533e-51 2.566917e-48 47.59059    2
#> FHL2    0.2654241 6.220696e-49 6.220696e-46 45.20616    3
#> FBXW7   0.2408516 6.853848e-44 4.283655e-41 40.36819    4
#> CNOT11  0.1348388 4.780297e-43 2.655720e-40 39.57582    5
#> 
#> A snapshot of the top down-regulated genes:
#>              logFC       pvalue         padj   piValue rank
#> RAI1    -0.1236128 4.680130e-10 3.175122e-09 -8.498240   -1
#> ATP8B2  -0.1022981 4.871540e-10 3.291581e-09 -8.482595   -2
#> SORCS2  -0.1413809 5.456246e-10 3.652106e-09 -8.437457   -3
#> PRRC2B  -0.1897221 5.944744e-10 3.931709e-09 -8.405419   -4
#> SLC47A1 -0.1106573 6.664062e-10 4.384252e-09 -8.358105   -5
#> 
#> Method for differential expression analysis:  DESeq2 
#> Log fold change cutoff:  0.1 
#> Adjusted p-value cutoff:  0.05

Due to the inheritance and object-oriented nature of the S4 class system, we can handle the TOPTag object in the same way as the DETag object.

# Export the desired results with rank information
myTopRNAResults <- exportDE(myRNATOPTag)
# Make numeric values easier to read
myTopRNAResults[, 1:4] <- format(myTopRNAResults[, 1:4], digits = 4, scientific = TRUE)
head(myTopRNAResults)
#>              logFC    pvalue      padj    piValue rank
#> SLC35F1  3.168e-01 6.831e-61 1.708e-57  5.677e+01    1
#> JAM2     1.955e-01 2.054e-51 2.567e-48  4.759e+01    2
#> FHL2     2.654e-01 6.221e-49 6.221e-46  4.521e+01    3
#> FBXW7    2.409e-01 6.854e-44 4.284e-41  4.037e+01    4
#> CNOT11   1.348e-01 4.780e-43 2.656e-40  3.958e+01    5
#> LDHD     2.558e-01 2.245e-41 9.356e-39  3.803e+01    6

Here, a rank of 1 and -1 indicates the most up- and down-regulated gene.

For publication purposes, we can also label some genes that we are interested in on the volcano plot. Let’s first retrieve a few differentially expressed genes

genesToLabel <- exportDE(myRNADETag) %>%
  dplyr::filter(padj < 0.05, abs(logFC) > 0.1) %>%
  dplyr::arrange(desc(logFC)) %>%
  rownames()
# Get an up- and a down- regulated gene
# For example use the 10th gene with highest fold change
genesToLabel <- genesToLabel[c(10, length(genesToLabel) - 10)]

Then we again plot the volcano plot, with some genes labeled. We can make use of the highlight argument to specify the genes to label.

plotVolcano(myMOList, omic = "RNAseq", log2FC = 0.1, highlight = genesToLabel)

Figure 2. Volcano plot of RNAseq data with some genes labelled

Exploring differential small RNA expression analysis

Before moving on to small RNAseq data, we first need to annotate the small RNAs for their types. To do so, we need to provide a data frame of small RNA annotation, as in the name of the transcripts for each type of small RNA. The following is an example:

smallRNAAnnotation <- data.frame(
  transcript = c("hsa-miR-1-3p", "hsa-miR-1-5p", "hsa-miR-2-5p"),
  type = c("miRNA", "miRNA", "miRNA")
)

To annotate the small RNAseq data, we can use the annotateSmallRNA function.

# NOT run, just illustrate the usage of the smallRNAAnnotation argument
myMOList <- annotateSmallRNA(myMOList, anno = smallRNAAnnotation)

We need to make sure that the names of the transcripts are consistent with the names in the count matrix, since different annotation databases may provide different names for the same transcript.

Alternatively, the package internally provides an annotation database for human small RNAs. This database is curated from several small RNA annotation databases, including miRBase, circBase, piRBase, piRNABank, GtRNAdb, GENCODE, and piRNACluster. Since we are dealing with human data, we can make use of this internal database.

myMOList <- annotateSmallRNA(myMOList, anno = "human")

The annotateSmallRNA function will automatically checks for the coverage of each small RNA type and will alert the user if the annotation does not cover all existing transcripts.

Similar to the RNAseq data, we can visualize the distribution of differentially expressed small RNAs using a volcano plot.

plotVolcano(myMOList, omic = "smallRNAseq", log2FC = 0)

Figure 3. Volcano plot of small RNAseq data

Here, we use the default P-value cutoff, which is 0.05. In the mean time, the separately colors the up and down regulated small RNAs, with the number of each annotated respectively.

On the other hand, small RNAs are usually a lot more complex and diverse than mRNAs. Each type of small RNA has its own function and mechanism of action. Therefore, we can also specifically color each type of small RNAs on the volcano plot, using a more specialized function.

By default, the plotVolcanoSmallRNA function generates a volcano plot with each type of small RNA colored. By default, it utilizes the BuPu palette defined in the RColorBrewer package (Neuwirth 2022).

# Plot Volcano plot of small RNAseq data
plotVolcanoSmallRNA(myMOList, log2FC = 0)

Figure 4. Volcano plot of small RNAseq data with each type of small RNA coloured

Alternatively, we can also specify a color scheme using the colScheme argument.

# NOT run, just illustrate the usage of the colScheme argument
plotVolcanoSmallRNA(myMOList, 
                    adjP = 0.05,
                    log2FC = 0, 
                    colScheme = RColorBrewer::brewer.pal(6, "Set1"))

Furthermore, although the function of microRNAs (miRNAs) is well studied, the understanding of other types of small RNAs such as piwi-interacting RNAs (piRNAs) and small nucleolar RNAs (snoRNAs) is still limited. Therefore, we can assess the contribution of each type of small RNA to the differences observed between samples. To do so, we can perform a principal component analysis (PCA) on each type of small RNA separately.

pcaPlotList <- plotSmallRNAPCAs(myMOList)

The plotSmallRNAPCAs function returns a list of PCA plots, one for each type of small RNA. We can check the length of the list to see how many PCA plots are generated.

length(pcaPlotList)
#> [1] 6

In this case, we have 6 PCA plots, one for each small RNA type. We can plot the PCA plot for miRNA using the following command:

pcaPlotList$miRNA

Figure 5. PCA plot of miRNA

Or, we can generate a combined PCA plot for all small RNA types.

pcaPlotList$miRNA + pcaPlotList$circRNA + pcaPlotList$piRNA + pcaPlotList$snoRNA +
    pcaPlotList$snRNA + pcaPlotList$tRNA

Figure 6. Combined PCA plot of all small RNA types

Differential analysis of chromatin accessibility

Analysis of epigenomic data requires several more steps than that of the count-based omics data. In the case of ATACseq data, genomic regions are profiled for their accessibility to the hyperactive Tn5 transposase. This accessibility corresponds to the chromatin state of the region, which can be either open or closed for DNA binding proteins, particularly transcription factors (TFs). Therefore, to analyze differential chromatin accessibility, we need to annotate the peaks with the genes that are nearby. Furthermore, since we are interested in transcriptional regulation, we can parse through the open chromatin regions and identify the TF binding motifs that are enriched in these regions. Differentially enriched motifs can then be used to infer the TFs that are potentially involved in the regulation of the differentially expressed genes.

To do so, we first need additional annotation information. Here, we make use of the annotation database from the TxDb.Hsapiens.UCSC.hg38.knownGene package, which is a TxDb object that contains the genomic coordinates of all known genes in the hg38 genome assembly (Team and Maintainer 2023). We also need to provide a more general annotation database, which is the org.Hs.eg.db package to extract the gene information (Carlson 2023). Finally, scanning the open chromatin regions for TF binding motifs requires identifying the sequence of the open chromatin regions. Here, we use the corresponding genome FASTA database from the BSgenome.Hsapiens.UCSC.hg38 package (Pagès 2023).

# Annotate ATAC Peaks
library("TxDb.Hsapiens.UCSC.hg38.knownGene")
library("org.Hs.eg.db")
library("BSgenome.Hsapiens.UCSC.hg38")
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
annoDb <- "org.Hs.eg.db"
bsgenome <- BSgenome.Hsapiens.UCSC.hg38

Motif enrichment analysis would also require a known set of position weight matrices (PWMs) for all known curated TFs. The PWM is a matrix that contains the probability of each nucleotide (A, T, C, G) at every position of the binding motif. The IntegraTRN package has already provided a list of PWMs from the JASPAR2022 database for vertebrates. We can load the PWMs using the following command:

# Load PWMs from JASPAR
data("jasparVertebratePWM")

This is a PWMatrixList object defined in the TFBSTools package (Tan and Lenhard 2016). This package also provides additional functionality to retrieve PWMs from other databases or tools, such as HOMER etc. Here, we will use our package-provided PWMs from JASPAR2022 (Castro-Mondragon et al. 2022).

The annotateATACPeaksMotif function performs both peak annotation and motif enrichment analysis via a single call. Here, we define the promoter region as the region 3kb upstream and 3kb downstream of the transcription start site.

myMOList <- annotateATACPeaksMotif(myMOList,
  tssRegion = c(-3000, 3000),
  TxDb = txdb,
  annoDb = annoDb,
  bsgenome = bsgenome,
  pwmL = jasparVertebratePWM
)

Exploring differential chromatin accessibility

After performing peak annotation and motif enrichment analysis, we can explore the analysis via several methods.

First, let us take a look at the annotation on the peaks for their genomic features. This pie chart shows the percentage of peaks that are annotated as each genomic feature.

plotATACAnno(myMOList)

Figure 7. Annotation of ATAC peaks

To further access the quality of the ATACseq experiment, we can plot the coverage of the ATACseq peaks on the genome. This plot shows the where the peaks are located in each chromosome.

plotATACCoverage(myMOList)

Figure 8. ATACseq coverage

On the left of the plots shows all differentially enriched motifs.

Beyond globally exploring the ATACseq data with the above plots, we can also inspect the annotation on each individual peak. Once again, the IntegraTRN package utilizes the S4 inheritance system to create a PEAKTag object once motif enrichment analysis is performed. This object implements the generic as.data.frame method, conveniently allowing us to export the results as a data frame to see where each individual peaks are located.

myMotifs <- as.data.frame(myMOList$DEATAC)
head(myMotifs[, 1:8])  # Many annotations, see the first 8 attributes
#>   seqnames   start     end width strand Condition       annotation geneChr
#> 1     chr1 1597754 1597883   130      *         - Promoter (2-3kb)       1
#> 2     chr1 7783291 7783789   499      *         - Promoter (<=1kb)       1
#> 3     chr1 7783791 7784289   499      *         - Promoter (<=1kb)       1
#> 4     chr1 7784291 7784789   499      *         - Promoter (<=1kb)       1
#> 5     chr1 9996206 9996704   499      *         - Promoter (<=1kb)       1
#> 6     chr1 9996706 9997204   499      *         - Promoter (<=1kb)       1

Then we can explore the key part of the analysis, which is the differentially enriched motifs. Typically, the differential motif enrichment can be very stringent especially when the ATACseq libraries are not sequenced very deeply (less than 100 million reads) (Yan et al. 2020). Therefore, we can use a more relaxed cutoff for the p-value. The plotATACMotifHeatmap function by default utilizes adjusted p-value, but when a unadjusted p-value is provided, it will use the unadjusted one and ignore the default. Since our ATACseq data is simulated, here we use a relaxed cutoff.

plotATACMotifHeatmap(myMOList, pValue = 0.01)

Figure 9. Motif enrichment on ATACseq peaks

Constructing the TRN

Exploring the differential analysis results can provide insights into the changes that happen during the biological process. In addition, by exploring these results, have a better understanding of the data and the quality of the experiment. These understanding should allow us to make better decisions on deciding the cutoffs used for selecting genes and transcripts for constructing the TRN.

Now we can move on to the second part of the analysis pipeline, which is to construct the TRN.

Match samples between omics data

However, before directly construct the TRN, some data preprocessing is required. First, the package analyzes predicted interactions by using a tree-based algorithm to effectively predict regulatory interactions based on co-expression between genes and small RNAs (Huynh-Thu et al. 2010). However, this algorithm requires that the samples between the RNAseq and small RNAseq data are matched so that co-expression profiles can be extracted. Therefore, we need to match the samples between the RNAseq and small RNAseq data (Stuart et al. 2011).

We can perform an one-to-one optimal matching between the samples using the matchSamplesRNAsmallRNA function. When we have further information about the samples for RNAseq and small RNAseq, we can supply the information as data frames. The algorithm by default extracts all common variables between the two data frames and use them for matching. Alternatively, we can also specify the variables used for matching using the varMatch argument.

Here, we use all common variables, which are Age and Sex for matching.

myMOList <- matchSamplesRNAsmallRNA(myMOList,
  sampleDFRNAseq = RNAseq_heart_samples,
  sampleDFSmallRNAseq = smallRNAseq_heart_samples
)

Let us take a look at the matched samples.

exportMatchResult(myMOList) %>%
    head(5)  # For the first 5 matches
#>      RNAseq SmallRNAseq
#> 1 sample_48    sample_1
#> 2 sample_51    sample_2
#> 3  sample_5    sample_3
#> 4 sample_49    sample_4
#> 5 sample_10    sample_5

There are a total of 37 matches.

nrow(exportMatchResult(myMOList))
#> [1] 37

Using externally curated interactions

Depending on the biological question, we may want to use externally curated interactions to be part of the TRN. For example, the miRNet tools provides a comprehensive database of miRNA-target interactions, but the information are usually not tissue-specific or all relevant to the biological question (Chang et al. 2020). However, IntegraTRN provides a flexibility to allow users to import externally curated interactions, and during the construction of the TRN, these external interactions are intersected with the predicted interactions as well as the differential genes/enriched motifs to provide a more focused TRN.

Let us load some simulated miRNA-target interactions and TF-target interactions. These data has been provided as part of the package. These interactions are in the adjacency list format, where it is a list of two vectors, regulator and target.

data("miR2Genes")
data("tf2Genes")
myMOList <- loadExtInteractions(myMOList, miR2Genes = miR2Genes, tf2Genes = tf2Genes)

Setting cutoffs for omics data

As mentioned previously, the cutoffs used for selecting genes and transcripts for constructing the TRN are important. Due to the large number of omics data, a separate function is provide for users to set the cutoffs. Here we use the cutoffs defined earlier in the differential analysis.

omiCutoffs <- setOmicCutoffs(
  rnaAdjPval = 0.05,
  rnaLogFC = 0.1,
  rnaTopGenes = 50,
  smallRNAAdjPval = 0.05,
  smallRNALogFC = 0.1,
  smallRNATopGenes = 50,
  atacMotifPval = 0.01
)

This can also be easily checked.

omiCutoffs
#> RNAseq adjusted p-value cutoff: < 0.05 
#> RNAseq log fold change cutoff: > 0.1 
#> RNAseq selecting top 50 DE genes 
#> smallRNAseq adjusted p-value cutoff: < 0.05 
#> smallRNAseq log fold change cutoff: > 0.1 
#> smallRNAseq selecting top 50 DE genes 
#> Proteomics adjusted p-value cutoff: < 0.05 
#> Proteomics log fold change cutoff: > 0 
#> ATACseq motif p-value cutoff: < 0.01 
#> ATACseq motif log fold enrichment cutoff: > 0

Constructing and exploring the TRN

Finally, we can construct the TRN using the constructTRN function. This function contains the logic to deal with different availability of the omics data as well as the availability of external interactions. We can specify whether we want to focus only on the up or down regulated genes, or both. We also need to specify whether we want to use the predicted interactions or not.

Here, we construct a network using both predicted and curated interactions, for all differentially expressed genes that are both up- and down-regulated. When predicted interactions are used, we can make the process faster by using multiple threads, and a random seed can be specified or reproducibility (Huynh-Thu et al. 2010).

myTRNet <- constructTRN(myMOList,
  omiCutoffs = omiCutoffs,
  smallRNAtypes = "all",
  targetDirection = "both",
  predicted = TRUE,
  nthreads = 1,
  seed = 123
)
#> Constructing a combined expression matrix for GENIE3...
#> Criteria for defining the top differentially expressed genes:
#>   log2 fold change cutoff:  0.1 
#>   adjusted p-value cutoff:  0.05 
#>   top 50  genes selected
#> Criteria for defining the top differentially expressed small RNAs:
#>   log2 fold change cutoff:  0.1 
#>   adjusted p-value cutoff:  0.05 
#>   top 50  small RNAs selected
#> Running GENIE3 for predicted interactions...
#>   Number of trees:  1000 
#>   Number of threads:  1 
#>   Tree method:  RF 
#> This may take a while...

We can easily plot the network using the plotNetwork function. We can decide whether we want an interactive network. For large networks, the interactive network provides a better way to explore the interactions (Allaire et al. 2017).

plotNetwork(myTRNet, interactive = TRUE)

Figure 10. The Transcriptional Regulatory Network

When hovering over the nodes, we can see the name of the element. The each type of the element (target gene, miRNA, piRNA, TF etc.) are color coded. The layout of the network is computed by a force-directed algorithm, but we can simply drag the nodes to better visualize the interactions.

We can also plot a static network using the plotNetwork function. But this time, we only plot the up-regulated genes and miRNAs with a more stringent cutoff.

omiCutoffs2 <- omiCutoffs
omiCutoffs2$rnaAdjPval <- 0.01
omiCutoffs2$rnaLogFC <- 0.2

Let us construct a new TRN.

myTRNet2 <- constructTRN(myMOList,
  omiCutoffs = omiCutoffs2,
  smallRNAtypes = "miRNA",
  targetDirection = "up",
  predicted = TRUE
)
#> Constructing a combined expression matrix for GENIE3...
#> Criteria for defining the top differentially expressed genes:
#>   log2 fold change cutoff:  0.2 
#>   adjusted p-value cutoff:  0.01 
#>   top 50  genes selected
#> Criteria for defining the top differentially expressed small RNAs:
#>   log2 fold change cutoff:  0.1 
#>   adjusted p-value cutoff:  0.05 
#>   top 50  small RNAs selected
#> Running GENIE3 for predicted interactions...
#>   Number of trees:  1000 
#>   Number of threads:  1 
#>   Tree method:  RF 
#> This may take a while...

Then we can plot the network as a static figure (Csardi, Nepusz, et al. 2006).

plotNetwork(myTRNet2, vertex.size = 10)

Figure 11. The Transcriptional Regulatory Network of Up-regulated Genes

Some times it is hard to use the default parameters to plot a publication-level network. However, users can customize the network using the igraph package. We can extract the network as an igraph object and performs further customization (Csardi, Nepusz, et al. 2006).

exportIgraph(myTRNet2)
#> IGRAPH 8d8a9d9 DN-B 57 119 -- 
#> + attr: name (v/c), type (v/c)
#> + edges from 8d8a9d9 (vertex names):
#>  [1] hsa-miR-1301-3p ->KLHL40  hsa-mir-183     ->KBTBD12
#>  [3] hsa-miR-323a-3p ->TSPO    hsa-miR-1301-3p ->LRRC14B
#>  [5] hsa-miR-6747-3p ->COX7C   hsa-miR-1910-5p ->NRAP   
#>  [7] hsa-miR-31-5p   ->COX7C   hsa-miR-323a-3p ->DBP    
#>  [9] hsa-miR-323a-3p ->C1QL1   hsa-miR-1301-3p ->CEBPD  
#> [11] hsa-miR-31-5p   ->CEBPD   hsa-miR-323a-3p ->FAU    
#> [13] hsa-miR-323a-3p ->G0S2    hsa-mir-130b    ->LGALS3 
#> [15] hsa-mir-183     ->FAM129A hsa-miR-6747-3p ->CEBPD  
#> + ... omitted several edges

References

Aharon-Yariv, Adar, Yaxu Wang, Abdalla Ahmed, and Paul Delgado-Olguı́n. 2023. “Integrated Small RNA, mRNA and Protein Omics Reveal a miRNA Network Orchestrating Metabolic Maturation of the Developing Human Heart.” BMC Genomics 24 (1): 1–18.

Allaire, JJ, Peter Ellis, Christopher Gandrud, Kevin Kuo, BW Lewis, Jonathan Owen, Kenton Russell, et al. 2017. networkD3: D3 JavaScript Network Graphs from r. https://CRAN.R-project.org/package=networkD3.

Amemiya, Haley M, Anshul Kundaje, and Alan P Boyle. 2019. “The ENCODE Blacklist: Identification of Problematic Regions of the Genome.” Scientific Reports 9 (1): 9354.

Carlson, Marc. 2023. Org.hs.eg.db: Genome Wide Annotation for Human.

Castro-Mondragon, Jaime A, Rafael Riudavets-Puig, Ieva Rauluseviciute, Roza Berhanu Lemma, Laura Turchi, Romain Blanc-Mathieu, Jeremy Lucas, et al. 2022. “JASPAR 2022: The 9th Release of the Open-Access Database of Transcription Factor Binding Profiles.” Nucleic Acids Research 50 (D1): D165–73.

Chang, Le, Guangyan Zhou, Othman Soufan, and Jianguo Xia. 2020. “miRNet 2.0: Network-Based Visual Analytics for miRNA Functional Analysis and Systems Biology.” Nucleic Acids Research 48 (W1): W244–51.

Csardi, Gabor, Tamas Nepusz, et al. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal, Complex Systems 1695 (5): 1–9.

Huynh-Thu, Vân Anh, Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts. 2010. “Inferring Regulatory Networks from Expression Data Using Tree-Based Methods.” PloS One 5 (9): e12776.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (12): 1–21.

Neuwirth, Erich. 2022. RColorBrewer: ColorBrewer Palettes. https://CRAN.R-project.org/package=RColorBrewer.

Pagès, Hervé. 2023. BSgenome: Software Infrastructure for Efficient Representation of Full Genomes and Their SNPs. https://doi.org/10.18129/B9.bioc.BSgenome.

Reimand, Jüri, Ruth Isserlin, Veronique Voisin, Mike Kucera, Christian Tannus-Lopes, Asha Rostamianfar, Lina Wadi, et al. 2019. “Pathway Enrichment Analysis and Visualization of Omics Data Using g: Profiler, GSEA, Cytoscape and EnrichmentMap.” Nature Protocols 14 (2): 482–517.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.

Stuart, Elizabeth A, Gary King, Kosuke Imai, and Daniel Ho. 2011. “MatchIt: Nonparametric Preprocessing for Parametric Causal Inference.” Journal of Statistical Software.

Tan, Ge, and Boris Lenhard. 2016. “TFBSTools: An r/Bioconductor Package for Transcription Factor Binding Site Analysis.” Bioinformatics 32 (10): 1555–56.

Team, Bioconductor Core, and Bioconductor Package Maintainer. 2023. TxDb.hsapiens.UCSC.hg38.knownGene: Annotation Package for TxDb Object(s).

Villanueva, Randle Aaron M, and Zhuo Job Chen. 2019. “Ggplot2: Elegant Graphics for Data Analysis.” Taylor & Francis.

Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Yan, Feng, David R Powell, David J Curtis, and Nicholas C Wong. 2020. “From Reads to Insight: A Hitchhiker’s Guide to ATAC-Seq Data Analysis.” Genome Biology 21: 1–16.

Session information

sessionInfo("IntegraTRN")
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22621)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=C                    LC_CTYPE=English_Canada.utf8   
#> [3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C                   
#> [5] LC_TIME=English_Canada.utf8    
#> 
#> time zone: America/Toronto
#> tzcode source: internal
#> 
#> attached base packages:
#> character(0)
#> 
#> other attached packages:
#> [1] IntegraTRN_0.1.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] fs_1.6.3                                
#>   [2] matrixStats_1.0.0                       
#>   [3] bitops_1.0-7                            
#>   [4] enrichplot_1.22.0                       
#>   [5] DirichletMultinomial_1.44.0             
#>   [6] devtools_2.4.5                          
#>   [7] TFBSTools_1.40.0                        
#>   [8] HDO.db_0.99.1                           
#>   [9] httr_1.4.7                              
#>  [10] RColorBrewer_1.1-3                      
#>  [11] doParallel_1.0.17                       
#>  [12] profvis_0.3.8                           
#>  [13] tools_4.3.1                             
#>  [14] backports_1.4.1                         
#>  [15] utf8_1.2.4                              
#>  [16] R6_2.5.1                                
#>  [17] DT_0.30                                 
#>  [18] sm_2.2-5.7.1                            
#>  [19] lazyeval_0.2.2                          
#>  [20] GetoptLong_1.0.5                        
#>  [21] urlchecker_1.0.1                        
#>  [22] withr_2.5.2                             
#>  [23] prettyunits_1.2.0                       
#>  [24] gridExtra_2.3                           
#>  [25] cli_3.6.1                               
#>  [26] Biobase_2.62.0                          
#>  [27] formatR_1.14                            
#>  [28] datasets_4.3.1                          
#>  [29] scatterpie_0.2.1                        
#>  [30] labeling_0.4.3                          
#>  [31] sass_0.4.7                              
#>  [32] base_4.3.1                              
#>  [33] readr_2.1.4                             
#>  [34] Rsamtools_2.18.0                        
#>  [35] yulab.utils_0.1.0                       
#>  [36] R.utils_2.12.2                          
#>  [37] DOSE_3.28.0                             
#>  [38] vioplot_0.4.0                           
#>  [39] sessioninfo_1.2.2                       
#>  [40] plotrix_3.8-3                           
#>  [41] BSgenome_1.70.1                         
#>  [42] grDevices_4.3.1                         
#>  [43] limma_3.58.1                            
#>  [44] rstudioapi_0.15.0                       
#>  [45] RSQLite_2.3.3                           
#>  [46] generics_0.1.3                          
#>  [47] gridGraphics_0.5-1                      
#>  [48] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 
#>  [49] shape_1.4.6                             
#>  [50] BiocIO_1.12.0                           
#>  [51] gtools_3.9.4                            
#>  [52] dplyr_1.1.3                             
#>  [53] GO.db_3.18.0                            
#>  [54] Matrix_1.6-1.1                          
#>  [55] fansi_1.0.5                             
#>  [56] S4Vectors_0.40.1                        
#>  [57] abind_1.4-5                             
#>  [58] R.methodsS3_1.8.2                       
#>  [59] lifecycle_1.0.4                         
#>  [60] yaml_2.3.7                              
#>  [61] edgeR_4.0.1                             
#>  [62] SummarizedExperiment_1.32.0             
#>  [63] gplots_3.1.3                            
#>  [64] qvalue_2.34.0                           
#>  [65] SparseArray_1.2.0                       
#>  [66] BiocFileCache_2.10.1                    
#>  [67] grid_4.3.1                              
#>  [68] blob_1.2.4                              
#>  [69] promises_1.2.1                          
#>  [70] crayon_1.5.2                            
#>  [71] miniUI_0.1.1.1                          
#>  [72] lattice_0.21-8                          
#>  [73] methods_4.3.1                           
#>  [74] cowplot_1.1.1                           
#>  [75] annotate_1.80.0                         
#>  [76] GenomicFeatures_1.54.1                  
#>  [77] KEGGREST_1.42.0                         
#>  [78] pillar_1.9.0                            
#>  [79] knitr_1.45                              
#>  [80] ComplexHeatmap_2.18.0                   
#>  [81] fgsea_1.28.0                            
#>  [82] GenomicRanges_1.54.1                    
#>  [83] rjson_0.2.21                            
#>  [84] boot_1.3-28.1                           
#>  [85] codetools_0.2-19                        
#>  [86] fastmatch_1.1-4                         
#>  [87] glue_1.6.2                              
#>  [88] ggfun_0.1.3                             
#>  [89] data.table_1.14.8                       
#>  [90] remotes_2.4.2.1                         
#>  [91] Rdpack_2.6.9000                         
#>  [92] vctrs_0.6.4                             
#>  [93] png_0.1-8                               
#>  [94] treeio_1.26.0                           
#>  [95] poweRlaw_0.70.6                         
#>  [96] gtable_0.3.4                            
#>  [97] cachem_1.0.8                            
#>  [98] xfun_0.40                               
#>  [99] rbibutils_2.2.16                        
#> [100] S4Arrays_1.2.0                          
#> [101] mime_0.12                               
#> [102] tidygraph_1.2.3                         
#> [103] pracma_2.4.2                            
#> [104] survival_3.5-5                          
#> [105] iterators_1.0.14                        
#> [106] statmod_1.5.0                           
#> [107] MatchIt_4.5.5                           
#> [108] stabs_0.6-4                             
#> [109] interactiveDisplayBase_1.40.0           
#> [110] ellipsis_0.3.2                          
#> [111] rlemon_0.2.1                            
#> [112] nlme_3.1-162                            
#> [113] utils_4.3.1                             
#> [114] ggtree_3.10.0                           
#> [115] usethis_2.2.2                           
#> [116] bit64_4.0.5                             
#> [117] progress_1.2.2                          
#> [118] filelock_1.0.2                          
#> [119] GENIE3_1.24.0                           
#> [120] GenomeInfoDb_1.38.0                     
#> [121] rprojroot_2.0.4                         
#> [122] bslib_0.5.1                             
#> [123] KernSmooth_2.23-21                      
#> [124] colorspace_2.1-0                        
#> [125] seqLogo_1.68.0                          
#> [126] BiocGenerics_0.48.1                     
#> [127] DBI_1.1.3                               
#> [128] DESeq2_1.42.0                           
#> [129] tidyselect_1.2.0                        
#> [130] processx_3.8.2                          
#> [131] bit_4.0.5                               
#> [132] compiler_4.3.1                          
#> [133] curl_5.1.0                              
#> [134] optmatch_0.10.6                         
#> [135] glmnet_4.1-8                            
#> [136] xml2_1.3.5                              
#> [137] desc_1.4.2                              
#> [138] DelayedArray_0.28.0                     
#> [139] shadowtext_0.1.2                        
#> [140] rtracklayer_1.62.0                      
#> [141] scales_1.2.1                            
#> [142] caTools_1.18.2                          
#> [143] ChIPseeker_1.38.0                       
#> [144] monaLisa_1.8.0                          
#> [145] callr_3.7.3                             
#> [146] rappdirs_0.3.3                          
#> [147] stringr_1.5.0                           
#> [148] digest_0.6.33                           
#> [149] shinyBS_0.61.1                          
#> [150] rmarkdown_2.25                          
#> [151] XVector_0.42.0                          
#> [152] htmltools_0.5.6.1                       
#> [153] pkgconfig_2.0.3                         
#> [154] MatrixGenerics_1.14.0                   
#> [155] highr_0.10                              
#> [156] dbplyr_2.4.0                            
#> [157] fastmap_1.1.1                           
#> [158] rlang_1.1.1                             
#> [159] GlobalOptions_0.1.2                     
#> [160] htmlwidgets_1.6.2                       
#> [161] shiny_1.7.5.1                           
#> [162] jquerylib_0.1.4                         
#> [163] farver_2.1.1                            
#> [164] zoo_1.8-12                              
#> [165] jsonlite_1.8.7                          
#> [166] BiocParallel_1.36.0                     
#> [167] R.oo_1.25.0                             
#> [168] GOSemSim_2.28.0                         
#> [169] RCurl_1.98-1.13                         
#> [170] magrittr_2.0.3                          
#> [171] GenomeInfoDbData_1.2.11                 
#> [172] ggplotify_0.1.2                         
#> [173] patchwork_1.1.3                         
#> [174] munsell_0.5.0                           
#> [175] Rcpp_1.0.11                             
#> [176] ape_5.7-1                               
#> [177] viridis_0.6.4                           
#> [178] chk_0.9.1                               
#> [179] stringi_1.7.12                          
#> [180] ggraph_2.1.0                            
#> [181] zlibbioc_1.48.0                         
#> [182] MASS_7.3-60                             
#> [183] org.Hs.eg.db_3.18.0                     
#> [184] AnnotationHub_3.10.0                    
#> [185] plyr_1.8.9                              
#> [186] pkgbuild_1.4.2                          
#> [187] parallel_4.3.1                          
#> [188] HPO.db_0.99.2                           
#> [189] ggrepel_0.9.4                           
#> [190] CNEr_1.38.0                             
#> [191] Biostrings_2.70.1                       
#> [192] graphlayouts_1.0.2                      
#> [193] splines_4.3.1                           
#> [194] hms_1.1.3                               
#> [195] circlize_0.4.15                         
#> [196] BSgenome.Hsapiens.UCSC.hg38_1.4.5       
#> [197] locfit_1.5-9.8                          
#> [198] ps_1.7.5                                
#> [199] igraph_1.5.1                            
#> [200] reshape2_1.4.4                          
#> [201] biomaRt_2.58.0                          
#> [202] stats4_4.3.1                            
#> [203] pkgload_1.3.3                           
#> [204] TFMPvalue_0.0.9                         
#> [205] BiocVersion_3.18.0                      
#> [206] XML_3.99-0.15                           
#> [207] evaluate_0.23                           
#> [208] BiocManager_1.30.22                     
#> [209] tzdb_0.4.0                              
#> [210] foreach_1.5.2                           
#> [211] tweenr_2.0.2                            
#> [212] httpuv_1.6.11                           
#> [213] networkD3_0.4                           
#> [214] graphics_4.3.1                          
#> [215] tidyr_1.3.0                             
#> [216] purrr_1.0.2                             
#> [217] polyclip_1.10-6                         
#> [218] clue_0.3-65                             
#> [219] ggplot2_3.4.4                           
#> [220] ggforce_0.4.1                           
#> [221] xtable_1.8-4                            
#> [222] restfulr_0.0.15                         
#> [223] tidytree_0.4.5                          
#> [224] MPO.db_0.99.7                           
#> [225] roxygen2_7.2.3                          
#> [226] later_1.3.1                             
#> [227] viridisLite_0.4.2                       
#> [228] TxDb.Hsapiens.UCSC.hg38.knownGene_3.18.0
#> [229] tibble_3.2.1                            
#> [230] aplot_0.2.2                             
#> [231] memoise_2.0.1                           
#> [232] AnnotationDbi_1.64.1                    
#> [233] GenomicAlignments_1.38.0                
#> [234] IRanges_2.36.0                          
#> [235] stats_4.3.1                             
#> [236] cluster_2.1.4

Integrating multi-omics for constructing transcriptional regulatory networks

Introduction

Data preparation

Differential analysis of omics data

Generate MOList object using multi omics data

Differential analysis of omics data

Differential analysis of count-based omics data

Exploring differential gene expression analysis

Exploring differential small RNA expression analysis

Differential analysis of chromatin accessibility

Exploring differential chromatin accessibility

Constructing the TRN

Match samples between omics data

Using externally curated interactions

Setting cutoffs for omics data

Constructing and exploring the TRN

References

Session information

Generate `MOList` object using multi omics data