Integrating multi-omics data is a powerful approach to understand the complex regulatory mechanisms that govern biological processes. Unlike most existing methods that focus on a single type of omics data, incorporating multiple omics data provides a more solid biological foundation for deciphering the key regulatory interactions that are involved in the biological process of interest. However, the analysis of multi-omics data is not trivial. In particular, the logic behind the workflow given different availability of the omics data remains complicated.
Here, in this vignette tutorial, we will demonstrate how to utilize
the functionalities of the IntegraTRN
package to optimize the workflow for constructing a transcriptional
regulatory network (TRN) using multi-omics data. This vignette will also
discusses the logic behind the workflow, informing the user of their
options when constructing a TRN.
We will cover the following topics:
Customizing the differential analysis of omics data
TRN construction with different data availability
A basic workflow of the IntegraTRN
package has been demonstrated in the vignette
Integrating multi-omics for constructing transcriptional regulatory networks
.
It is highly recommended that users explore the basic workflow first
before moving on to this vignette.
See help(package = "IntegraTRN")
for more information
about the package and citation("IntegraTRN")
for how to
cite the package.
The package has provided datasets for demonstration purposes. The datasets are fully simulated or partially simulated based on real data (Aharon-Yariv et al. 2023). Here is an overview of the basic omics data used in this vignette:
data("RNAseq_heart") # RNAseq count matrix
data("RNAseq_heart_samples") # RNAseq sample information
data("smallRNAseq_heart") # small RNAseq count matrix
data("smallRNAseq_heart_samples") # small RNAseq sample information
data("protein_heart") # Proteomics count matrix
data("protein_heart_samples") # Proteomics sample information
data("miR2Genes") # miRNA-target interactions, externally curated
data("tf2Genes") # TF-target interactions, externally curated
data("proteinGeneIDConvert") # Protein ID conversion table
Please use ?<dataset>
to see the details of each
dataset.
ATACseq data are separately provided as peak files. The peak files
are in the BED format, which is a tab-delimited file format for genomic
features. The BED files can be located in the
extdata
folder of the package.
atacPeak1Path <- system.file("extdata", "peak1.bed", package = "IntegraTRN")
atacPeak2Path <- system.file("extdata", "peak2.bed", package = "IntegraTRN")
Let us now begin different ways for optimizing our analytical workflow.
A key component of the package is the data structures it provides. The following are a list of the key S4 classes:
MOList
: An
MOList
object is singly responsible for
storing all the omics data and the results of each step of the analysis.
While the raw data of each omic type are stored internally in the slots,
all analysis results are kept as list elements that allow easy
subsetting and extraction. By exploiting this feature, we can customize
our analysis by extracting and manipulating the results of the
analysis.
DETag
: A
DETag
object is the core S4 class that
stores the results of the differential expression analysis. It provides
the users with a composite structure for exploring the results of the
analysis.
TOPTag
: A
TOPTag
object is a subclass of the
DETag
object. Unlike the
DETag
object, the
TOPtag
object further limits the results
to a subset of genes that we are interested in. It provides an
user-friendly interface for filtering and ranking top differentially
expressed genes/transcripts.
PEAKTag
: Similar to
TOPTag
, a
PEAKTag
object is also a subclass of the
DETag
object. However, it is specifically
designed for handling peak annotation and motif enrichment
analysis.
TRNet
: Finally, as the final
component of the traditional workflow, a
TRNet
object is the core S4 class that
stores the final TRN. Although itself operates as a data storage object,
it can be easily integrated with any third-party tools for further
topological analysis or customized network visualization.
Let us begin with the first step of the analysis pipeline, which is to perform differential analysis on each omics data separately. We begin by exploring how these data structures facilitate flexible and detailed analysis of the results.
We will first construct a MOList
object
using the data we have prepared. We consider the age of the samples as
the grouping criteria for the differential analysis.
myMOList <- MOList(
RNAseq = RNAseq_heart,
RNAGroupBy = RNAseq_heart_samples$Age,
smallRNAseq = smallRNAseq_heart,
smallRNAGroupBy = smallRNAseq_heart_samples$Age,
proteomics = protein_heart,
proteomicsGroupBy = protein_heart_samples$Age,
ATACpeak1 = atacPeak1Path,
ATACpeak2 = atacPeak2Path
)
#> RNAseq: RNAseq data has been added.
#> Small RNAseq: Small RNAseq data has been added.
#> Proteomics: Proteomics data has been added.
#> ATACseq: ATAC peaks data has been added.
With the package’s one-step differential analysis function, we can easily explore the results of the differential analysis.
myMOList <- diffOmics(myMOList, program = "edgeR")
#> Performing differential analysis on RNAseq data
#>
#> Performing differential analysis on smallRNAseq data
#>
#> Performing differential analysis on proteomics data
#>
#> Processing peaks...
In the IntegraTRN
package, several
visualization methods are provided to explore the results of the
differential analysis. These visualization methods are highly efficient
in generating an overview of the results, allowing users to quickly
identify the key changes that happen during the biological process.
However, as biologists who generated the data, we may be interested in
some particular aspects of the results. Therefore, we can further
customize our experience by performing easy extraction of the
results.
To obtain the DETag
object for a
particular omics data, we can simply subset the
MOList
object using the name of the omics
data. For example, to retrieve differential analysis result of the
RNAseq data:
would work just fine. The DETag
class
has implemented several methods for users to easily extract the results.
For example, we can directly export the results as a data frame using
the exportDE
function.
exportDE(myRNADETag) %>%
head(5)
#> logFC pvalue padj
#> SLC35F1 0.3301368 1.435088e-30 7.175439e-27
#> HMGA2 -0.3862208 1.161249e-29 2.463844e-26
#> CDH11 -0.2866885 1.478307e-29 2.463844e-26
#> FHL2 0.2719520 1.650073e-28 2.062591e-25
#> JAM2 0.1827767 4.298455e-27 3.991485e-24
Furthermore, in certain conditions, such as generating a heatmap, we
may want to explore only the differentially expressed genes as well as a
normalized count matrix for visualization purposes. Besides subsetting
the data frame returned by the exportDE
function, we can also further utilize the functionality of its subclass,
the TOPTag
object. As the name suggests,
the TOP
part implies that we could easily
perform filtering to obtain what we consider to be the “top”
differentially expressed genes. For example, we can retrieve the top 20
differentially expressed genes with any fold change, with an adjusted
p-value less than 0.05 considered to be significant.
myRNATOPTag <- TOPTag(myRNADETag,
pCutoff = 0.05,
logFCCutoff = 0,
topGenes = 20,
direction = "both"
)
Similarly, we can also export the results as a data frame.
Furthermore, we can see that a ranking based on
piValue
is also provided.
exportDE(myRNATOPTag) %>%
head(5)
#> logFC pvalue padj piValue rank
#> SLC35F1 0.3301368 1.435088e-30 7.175439e-27 26.14415 1
#> FHL2 0.2719520 1.650073e-28 2.062591e-25 24.68559 2
#> JAM2 0.1827767 4.298455e-27 3.991485e-24 23.39887 3
#> LDHD 0.2818013 1.218915e-26 7.475262e-24 23.12637 4
#> RD3L 0.4080196 1.495052e-26 7.475262e-24 23.12637 5
In case where we want to obtain a normalized count matrix for visualization, such as when plotting a heatmap, we can also do so by a simple function call.
The piValue
here ranks the genes based
on the adjusted p-value and the fold change, which is a standard ranking
for rank-based enrichment analysis. The
piValue
is calculated as follows:
\[piValue = -log_{10}(padj) * sign(logFC)\]
where \(padj\) is the adjusted p-value and \(logFC\) is the log fold change.
We can then make a heatmap using the
ComplexHeatmap
package.
heatmapMatrix <- topRNANormCounts %>%
as.matrix() %>%
t() %>%
scale() %>%
t()
exprRange <- range(heatmapMatrix)
col <- circlize::colorRamp2(
c(exprRange[1], 0, exprRange[2]),
c("blue", "white", "red")
)
ComplexHeatmap::Heatmap(heatmapMatrix,
col = col,
show_column_names = FALSE
)
Figure 1. Heatmap of top 20 differentially expressed genes
Similarly, we can use the list of selected genes to perform gene set
enrichment analysis (GSEA) (Reimand et al.
2019). While a rank based GSEA analysis utilizes all available
genes, we can again utilize the TOPTag
constructor to perform ranking while not filtering any genes. This can
be done by setting a non-selective cutoff. Here, a
topGenes = 1
means that we are selecting all genes that
satisfy the cutoffs.
rankedResult <- TOPTag(myRNADETag,
pCutoff = 1,
logFCCutoff = 0,
topGenes = 1,
direction = "both"
) %>% exportDE()
This can then be used by the users as import to the GSEA analysis.
For most highly controlled experiments, the differential analysis
between two conditions is fairly straightforward. However, in certain
cases, especially for patient-derived samples, a much higher biological
variability may be observed. How could we handle such variability? With
the DETag
class and the list behavior of
the MOList
object, this has been made
possible.
In the following section, we will explore an alternative differential
analysis with multiple covariates. However, any alternative differential
expression analysis can be performed at the user’s discretion, and the
results can be easily integrated into the
MOList
object.
In the case of a population-based cohort, many attributes could be contributing to the biological variability. In our example, we can see that the following attributes are available:
In this case, in the differential expression analysis, we can adjust
for both sex and the batch effect. Again, we use the
edgeR
package for the differential
expression analysis.
# Create the edgeR DGEList object
dgeList <- edgeR::DGEList(
counts = RNAseq_heart,
group = RNAseq_heart_samples$Age
)
# Filter lowly expressed genes
keep <- edgeR::filterByExpr(dgeList)
dgeList <- dgeList[keep, ]
# Perform differential expression analysis
design <- model.matrix(~ RNAseq_heart_samples$Age +
RNAseq_heart_samples$Sex +
RNAseq_heart_samples$Batch)
dgeList <- edgeR::calcNormFactors(dgeList, method = "TMM") # or a compatible method
dgeList <- edgeR::estimateDisp(dgeList, design)
fit <- edgeR::glmQLFit(dgeList, design) # or other suitable models
qlf <- edgeR::glmQLFTest(fit, coef = ncol(RNAseq_heart_samples))
With complex designs, the performance of user-defined differential
expression analysis may be better than the one-step differential
expression analysis provided by the
IntegraTRN
package. As we have obtained
the results of the differential expression analysis, integrating the
results back into the MOList
object is
fairly straightforward.
To do so, we need to construct a DETag
object. The core component of the DETag
object is a result data frame, as the original output of
edgeR
or
DESeq2
. Additionally is a normalized count
matrix.
# Obtaining the DE results and normalized count
deResult <- edgeR::topTags(qlf, n = nrow(dgeList)) %>%
as.data.frame()
normCounts <- edgeR::cpm(dgeList) %>%
as.matrix()
# Construct the DETag object
myNewRNADETag <- DETag(DEResult = deResult, method = "edgeR", normalizedCounts = normCounts)
With the DETag
object, we can then
integrate the results back into the MOList
object.
This would further allow us to perform downstream analysis using the customized differential expression results.
For the purpose of this specific example, we will not replace the original analysis results.
While we have demonstrated the flexibility of the
IntegraTRN
package in case of the RNAseq
data, other omics data can be handled in the same way. With these
flexibility, we can not only explore the differential analysis results
in a more detailed manner, but also customize the analysis that are
specific to our data and biological question of interest.
Once we have optimize the first part of the analysis pipeline, we can then move on to the second part, which is to construct the TRN. In this section, we will discuss how the workflow is impacted by the availability of the omics data, and how users can make use of their best judgement select the most appropriate workflow.
In a simplistic way, network construction by integrating multiple
omics data is a process of intersecting the regulatory interactions to
determine the ones that are biologically relevant to the condition of
interest. Having more omics data provided, the analytical workflow of
the IntegraTRN
package could identify the
most relevant regulatory interactions with much higher confidence.
However, the availability of the omics data is not always guaranteed.
Therefore, the package take different approaches to handle different
scenarios.
In the following sections, we will discuss the contribution of each data type to the TRN construction, and how the workflow make use of the data to construct the TRN. Here, the package focuses on creating a transcription factor - small RNA - mRNA regulatory network. The following sections describe the role of each omics data in the TRN.
Before we begin, to more properly demonstrate the TRN construction workflow, we will downsample the data to reduce the computational time. Here, we randomly select 500 genes from each of the omics data.
set.seed(91711)
# Count-based data
RNAseq_heart <- RNAseq_heart[sample(nrow(RNAseq_heart), 500), ]
smallRNAseq_heart <- smallRNAseq_heart[sample(nrow(smallRNAseq_heart), 500), ]
protein_heart <- protein_heart[sample(nrow(protein_heart), 500), ]
# Chromatin accessibility data
peak1 <- read.table(atacPeak1Path, sep = "\t", header = FALSE)
peak2 <- read.table(atacPeak2Path, sep = "\t", header = FALSE)
peak1 <- peak1[sample(nrow(peak1), 500), ]
peak2 <- peak2[sample(nrow(peak2), 500), ]
set.seed(NULL)
We will then construct a MOList
object
using the downsampled data and perform differential analysis on each
omics data.
myMOList <- MOList(
RNAseq = RNAseq_heart,
RNAGroupBy = RNAseq_heart_samples$Age,
smallRNAseq = smallRNAseq_heart,
smallRNAGroupBy = smallRNAseq_heart_samples$Age,
proteomics = protein_heart,
proteomicsGroupBy = protein_heart_samples$Age,
ATACpeak1 = peak1,
ATACpeak2 = peak2
)
#> RNAseq: RNAseq data has been added.
#> Small RNAseq: Small RNAseq data has been added.
#> Proteomics: Proteomics data has been added.
#> ATACseq: ATAC peaks data has been added.
myMOList <- diffOmics(myMOList)
#> Performing differential analysis on RNAseq data
#>
#> Performing differential analysis on smallRNAseq data
#>
#> Performing differential analysis on proteomics data
#>
#> Processing peaks...
We will then begin discussing TRN construction with each component of the network starting from the targets of the regulatory interactions.
The RNAseq data is the most commonly available omics data. In the case of a transcriptional regulatory network, the RNAseq data is the most direct representation of the transcriptional activity. Therefore, the network employs differentially expressed genes (DEGs) as the regulatory targets of the TRN.
When setting the cutoffs for constructing the TRN, there are several cutoffs to consider. These include the fold change, the adjusted p-value, the top number of genes, and the direction of the change.
setOmicCutoffs(
rnaAdjPval = 0.05,
rnaLogFC = 0.1,
rnaTopGenes = 200
)
#> RNAseq adjusted p-value cutoff: < 0.05
#> RNAseq log fold change cutoff: > 0.1
#> RNAseq selecting top 200 DE genes
#> smallRNAseq adjusted p-value cutoff: < 0.05
#> smallRNAseq log fold change cutoff: > 0
#> smallRNAseq selecting top 100 % of DE genes
#> Proteomics adjusted p-value cutoff: < 0.05
#> Proteomics log fold change cutoff: > 0
#> ATACseq motif adjusted p-value cutoff: < 0.05
#> ATACseq motif log fold enrichment cutoff: > 0
Here, we specifically set the cutoffs for the RNAseq data. While the
setOmicCutoffs
provides default cutoffs
for other omics data, these cutoffs are only considered if the selected
data types are available.
These parameters are directly correlated to the size of the network, and the which genes to be included defines the primary biological scope of the network.
We will demonstrate the impact of these cutoffs on the TRN construction once we start to discuss regulators of the network.
Similar to the RNAseq data, the proteomics data is also a direct representation of the transcriptional activity. However, proteomics provides a much more biologically relevant representation of the transcriptional activity, since the protein level remains a downstream product of the transcriptional and post-transcriptional regulation.
Therefore, the proteomics data are used to further filter the DEGs obtained from the RNAseq data to extract the phenotypically relevant genes. However, an interesting limitation of proteomics experiments is that protein coverage is heavily affected by the sample quality as well as the technology used to perform the experiment. Therefore, during prior exploratory analysis, we may fine-tune the cutoffs for the proteomics data to obtain the most relevant target gene set while minimizing the loss of information due to technical limitations.
For example, when protein coverage is quite high, we could employ a regular cutoff. Here, we limit the network to include only target genes that shows consistent expression changes in both the transcriptomic and proteomic level.
setOmicCutoffs(
rnaAdjPval = 0.05,
rnaLogFC = 0.1,
rnaTopGenes = 200,
proteomicsAdjPval = 0.05,
proteomicsLogFC = 0
)
#> RNAseq adjusted p-value cutoff: < 0.05
#> RNAseq log fold change cutoff: > 0.1
#> RNAseq selecting top 200 DE genes
#> smallRNAseq adjusted p-value cutoff: < 0.05
#> smallRNAseq log fold change cutoff: > 0
#> smallRNAseq selecting top 100 % of DE genes
#> Proteomics adjusted p-value cutoff: < 0.05
#> Proteomics log fold change cutoff: > 0
#> ATACseq motif adjusted p-value cutoff: < 0.05
#> ATACseq motif log fold enrichment cutoff: > 0
However, what if our proteomics experiment did not went very well? Are those data still useful? The answer depends on the biological question of interest. If we have identified our proteins of interest in the proteomics data, we can still construct the network using both the RNAseq and proteomics data. However, the low coverage may negatively impact the differential analysis. We can either customize the differential analysis to be more sensitive to the proteomics data, or we can select genes that show consistent direction of change with the proteins.
setOmicCutoffs(
rnaAdjPval = 0.05,
rnaLogFC = 0.1,
rnaTopGenes = 200,
proteomicsAdjPval = 1,
proteomicsLogFC = 0
)
#> RNAseq adjusted p-value cutoff: < 0.05
#> RNAseq log fold change cutoff: > 0.1
#> RNAseq selecting top 200 DE genes
#> smallRNAseq adjusted p-value cutoff: < 0.05
#> smallRNAseq log fold change cutoff: > 0
#> smallRNAseq selecting top 100 % of DE genes
#> Proteomics adjusted p-value cutoff: < 1
#> Proteomics log fold change cutoff: > 0
#> ATACseq motif adjusted p-value cutoff: < 0.05
#> ATACseq motif log fold enrichment cutoff: > 0
For integrative use of the proteomics and RNAseq data, we need to make sure that a correct conversion between the protein IDs and the gene IDs are available. This can be done by providing a data frame that contains the conversion table.
Setting the conversion can be done any time before the TRN construction.
Since the TRN is a TF - small RNA - mRNA network, having only the mRNA targets are unable to obtain any valid regulatory relationships. Therefore, the TRN construction requires at least one of the other data types to be available.
One of the key post-transcriptional regulatory mechanisms is the small RNA regulation. In particular, microRNAs (miRNAs) are the most well-studied small RNAs. Their regulatory mechanism is well understood, and they are known to be negatively regulating the expression of their target genes by facilitating degradation of the target mRNA or inhibiting the translation of the target mRNA.
Small RNA data can come in two forms: small RNAseq data or miRNA-target interactions that are curated in third-party databases.
The benefit of small RNAseq data is that it provides a direct representation of the small RNA expression, which also cover multiple types of small RNAs. Most of them, except for miRNAs, are not well understood. However, they represent in very high abundance in the cell, and may play a role in the regulation of the transcriptome. Therefore, the small RNAseq data relies on predicted inference of the small RNA - mRNA interactions. Using the normalized count matrix of the RNAseq and the small RNAseq data, the package performs optimal pair matching between the two datasets and utilize a random-forest based classifier to predict and select the top small RNA - mRNA interactions based on their expression patterns.
With small RNAseq data, we can leverage all these tiny details to the package.
# Annotate the small RNAs
myMOList <- annotateSmallRNA(myMOList, anno = "human")
# Perform optimal matching between the RNAseq and small RNAseq data
myMOList <- matchSamplesRNAsmallRNA(myMOList,
sampleDFRNAseq = RNAseq_heart_samples,
sampleDFSmallRNAseq = smallRNAseq_heart_samples
)
# Define a cutoff for RNAseq and small RNAseq
setOmicCutoffs(
rnaAdjPval = 0.05,
rnaLogFC = 0,
rnaTopGenes = 0.2,
smallRNAAdjPval = 0.05,
smallRNALogFC = 0,
smallRNATopGenes = 0.2
)
#> RNAseq adjusted p-value cutoff: < 0.05
#> RNAseq log fold change cutoff: > 0
#> RNAseq selecting top 20 % of DE genes
#> smallRNAseq adjusted p-value cutoff: < 0.05
#> smallRNAseq log fold change cutoff: > 0
#> smallRNAseq selecting top 20 % of DE genes
#> Proteomics adjusted p-value cutoff: < 0.05
#> Proteomics log fold change cutoff: > 0
#> ATACseq motif adjusted p-value cutoff: < 0.05
#> ATACseq motif log fold enrichment cutoff: > 0
On the other hand, miRNA-target interactions are curated from third-party databases. Since miRNAs have been well studied, the curated interactions has a broad coverage of the miRNA - mRNA interactions that have been experimentally validated. However, these interactions are not necessarily relevant to the biological process of interest, or not even related to the tissue/cell type of interest. Therefore, by providing the miRNA-target interactions, the package utilizes the differential expression analysis results and predicted miRNA expression to select the most relevant miRNA - mRNA interactions.
While fetching externally curated miRNA-target interactions is not a direct functionality of the package, it is straightforward as part of the pipeline. With the differential expression results, we can export a list of names of the target genes of interest, which are usually the inputs of most database searches.
geneList <- exportDE(myMOList$DERNAseq) %>%
rownames()
geneList[1:12]
#> [1] "CTSH" "CLMP" "ZNF491" "SORL1" "DOCK9" "TMEM167B"
#> [7] "BOD1" "AFTPH" "BTBD19" "RPGR" "ZNF222" "NOC2L"
Several databases are available for fetching miRNA-target
interactions. For example, miRNet
(Chang et al. 2020) provides a
comprehensive list of miRNA-target interactions with simply fetching
using a list of target gene names. Similarly, additional databases such
as miRDB
(Chen and Wang 2020) can also be searched,
where users can integrate the results from multiple databases to obtain
a more comprehensive, experimentally validated list of miRNA-target
interactions.
Alternatively, with a similar approach, we can also fetch the miRNA-target using differentially expressed miRNAs. This can be done by exporting the differentially expressed miRNAs and use them as input for the database search.
smallRNAs <- getDESmallRNA(myMOList,
padj = 0.05,
log2fc = 0,
type = "miRNA"
)
head(smallRNAs, 5)
#> logFC pvalue padj piValue rank
#> hsa-mir-145 0.2215842 9.707373e-25 1.932544e-22 21.71387 1
#> hsa-miR-499a-5p 0.1936384 1.285506e-24 1.932544e-22 21.71387 2
#> hsa-miR-30d-3p 0.2098948 5.795504e-23 5.227545e-21 20.28170 3
#> hsa-miR-101-3p 0.3008562 1.988497e-22 1.494687e-20 19.82545 4
#> hsa-miR-373-3p 0.2931234 5.555387e-18 2.505480e-16 15.60111 5
#> transcriptType
#> hsa-mir-145 miRNA
#> hsa-miR-499a-5p miRNA
#> hsa-miR-30d-3p miRNA
#> hsa-miR-101-3p miRNA
#> hsa-miR-373-3p miRNA
As an example, we use the multiMiR
package (Ru et al. 2014) to fetch the
miRNA-target interactions from multiple databases, namely
miRTarBase
(Huang
et al. 2020), miRecords
(Xiao et al. 2009), and
TarBase
(Sethupathy, Corda, and Hatzigeorgiou 2006).
mir2geneMultiMiR <- multiMiR::get_multimir(
org = "hsa",
mirna = rownames(smallRNAs),
table = "validated",
summary = FALSE
)
mir2geneMultiMiR <- mir2geneMultiMiR@data
To avoid unnecessary database query, here we load an example of the
results, where we randomly selected 500 interactions from the
multiMiR
results.
Since in the example data we used the hsa
nomenclature
for small RNAs and gene symbols for the target genes, we can extract
only the information relevant to us.
mir2geneMultiMiR <- mir2geneMultiMiR %>%
dplyr::select(mature_mirna_id, target_symbol) %>%
dplyr::rename(
regulator = mature_mirna_id,
target = target_symbol
)
We can then combine the regulatory interactions identified from the target gene centric approach and the miRNA centric approach.
# Combine the two sets of interactions
miR2Genes <- as.data.frame(miR2Genes)
miR2Genes <- rbind(miR2Genes, mir2geneMultiMiR)
# Remove any potential duplications across databases
miR2Genes <- miR2Genes %>%
dplyr::distinct() %>%
as.list()
After obtaining a comprehensive list of miRNA-target interactions, we
can then use the list to construct the TRN. To do so, we need to load
the miRNA-target interactions into the
MOList
object.
As we have discussed earlier, the cutoffs for the RNAseq data as well as whether to select genes that show consistent expression changes in the transcriptomic and proteomic level are important factors that define the primary biological scope of the network. With the small RNA regulators available, we can visualize the impact of these cutoffs on the TRN construction.
Let us define a less stringent cutoff first:
omiCutoffs <- setOmicCutoffs(
rnaAdjPval = 0.05,
rnaLogFC = 0,
rnaTopGenes = 0.5,
smallRNAAdjPval = 0.05,
smallRNALogFC = 0,
smallRNATopGenes = 0.5
)
Let’s construct the TRN without considering the protein expression.
Without protein, the network we obtained is fairly large and complicated to interpret.
Figure 2. TRN constructed with without considering protein expression
Alternatively, we can enforce the network to only include genes that show consistent expression changes in both the transcriptomic and proteomic level.
omiCutoffs <- setOmicCutoffs(
rnaAdjPval = 0.05,
rnaLogFC = 0.1,
rnaTopGenes = 0.5,
smallRNAAdjPval = 0.05,
smallRNALogFC = 0,
smallRNATopGenes = 0.5,
proteomicsAdjPval = 0.05,
proteomicsLogFC = 0
)
The network we obtained is much more stringent in explaining what are the most important regulations that directly explains the phenotypic changes between the biological conditions of interests.
Let’s see the network.
Figure 3. TRN constructed with coherent protein expression
Besides filtering the target genes, we can also limit our scope to the most important miRNA - target gene regulation. Since the biological function of miRNA are known, we could infer further biological insights from a miRNA-target gene centric network.
Let’s see the network.
Figure 4. miRNA-target gene TRN constructed with coherent protein expression
Interestingly, we can easily identify modules at this level of biological scope.
In the end, we can export the network into a third-party tool for further topological analysis or customized network visualization.
The chromatin accessibility data is a direct representation of the transcription factor (TF) binding. Therefore, the chromatin accessibility data is used to further construct the TRN by selecting the TFs that are found to be differentially enriched across the conditions of interest.
ATACseq data is used in conjunction with the TF-target interactions to construct the TRN. Similar to the curated miRNA-target interactions, curated the TF-target interactions also face limitations in terms of relevance to the biological process of interest. Therefore, the package utilizes the differential accessibility analysis with motif enrichment analysis to select the most relevant TF - target interactions.
To do so, we need to load the TF-target interactions into the
MOList
object.
The logic behind integrating ATACseq with TF-target interactions is similar to that of integrating small RNAseq with miRNA-target interactions. However, instead of using tree-based classifier to predict the regulatory interactions between RNAseq and small RNAseq, ATACseq data alone is sufficient to identify enriched transcription factor motifs. Therefore, with the background of the curated TF-target interactions, we can isolate the differentially enriched motifs from the biological conditions and extract regulatory interactions that are relevant to the biological process of interest.
In general, the construction of the TRN is a process of intersecting
the regulatory interactions that are determined to be biologically
relevant to the condition of interest. The workflow of the
IntegraTRN
package is designed to be
flexible to accommodate different availability of the omics data.
However, with the ease of obtaining externally curated interactions, it
is highly recommended that users utilize public databases to obtain a
comprehensive list of interactions to establish a global context. Then,
the package can be used to select the most relevant interactions to the
biological process of interest.
sessionInfo("IntegraTRN")
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22621)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=C LC_CTYPE=English_Canada.utf8
#> [3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C
#> [5] LC_TIME=English_Canada.utf8
#>
#> time zone: America/Toronto
#> tzcode source: internal
#>
#> attached base packages:
#> character(0)
#>
#> other attached packages:
#> [1] IntegraTRN_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] fs_1.6.3
#> [2] matrixStats_1.0.0
#> [3] bitops_1.0-7
#> [4] enrichplot_1.22.0
#> [5] DirichletMultinomial_1.44.0
#> [6] devtools_2.4.5
#> [7] TFBSTools_1.40.0
#> [8] HDO.db_0.99.1
#> [9] httr_1.4.7
#> [10] RColorBrewer_1.1-3
#> [11] doParallel_1.0.17
#> [12] profvis_0.3.8
#> [13] tools_4.3.1
#> [14] backports_1.4.1
#> [15] utf8_1.2.4
#> [16] R6_2.5.1
#> [17] DT_0.30
#> [18] sm_2.2-5.7.1
#> [19] lazyeval_0.2.2
#> [20] GetoptLong_1.0.5
#> [21] urlchecker_1.0.1
#> [22] withr_2.5.2
#> [23] prettyunits_1.2.0
#> [24] gridExtra_2.3
#> [25] cli_3.6.1
#> [26] Biobase_2.62.0
#> [27] formatR_1.14
#> [28] datasets_4.3.1
#> [29] scatterpie_0.2.1
#> [30] sass_0.4.7
#> [31] base_4.3.1
#> [32] readr_2.1.4
#> [33] Rsamtools_2.18.0
#> [34] yulab.utils_0.1.0
#> [35] R.utils_2.12.2
#> [36] DOSE_3.28.0
#> [37] vioplot_0.4.0
#> [38] sessioninfo_1.2.2
#> [39] plotrix_3.8-3
#> [40] BSgenome_1.70.1
#> [41] grDevices_4.3.1
#> [42] limma_3.58.1
#> [43] rstudioapi_0.15.0
#> [44] RSQLite_2.3.3
#> [45] generics_0.1.3
#> [46] gridGraphics_0.5-1
#> [47] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#> [48] shape_1.4.6
#> [49] BiocIO_1.12.0
#> [50] gtools_3.9.4
#> [51] dplyr_1.1.3
#> [52] GO.db_3.18.0
#> [53] Matrix_1.6-1.1
#> [54] fansi_1.0.5
#> [55] S4Vectors_0.40.1
#> [56] abind_1.4-5
#> [57] R.methodsS3_1.8.2
#> [58] lifecycle_1.0.4
#> [59] yaml_2.3.7
#> [60] edgeR_4.0.1
#> [61] SummarizedExperiment_1.32.0
#> [62] gplots_3.1.3
#> [63] qvalue_2.34.0
#> [64] SparseArray_1.2.0
#> [65] BiocFileCache_2.10.1
#> [66] grid_4.3.1
#> [67] blob_1.2.4
#> [68] promises_1.2.1
#> [69] crayon_1.5.2
#> [70] miniUI_0.1.1.1
#> [71] lattice_0.21-8
#> [72] methods_4.3.1
#> [73] cowplot_1.1.1
#> [74] annotate_1.80.0
#> [75] GenomicFeatures_1.54.1
#> [76] KEGGREST_1.42.0
#> [77] pillar_1.9.0
#> [78] knitr_1.45
#> [79] ComplexHeatmap_2.18.0
#> [80] fgsea_1.28.0
#> [81] GenomicRanges_1.54.1
#> [82] rjson_0.2.21
#> [83] boot_1.3-28.1
#> [84] codetools_0.2-19
#> [85] fastmatch_1.1-4
#> [86] glue_1.6.2
#> [87] ggfun_0.1.3
#> [88] data.table_1.14.8
#> [89] remotes_2.4.2.1
#> [90] Rdpack_2.6.9000
#> [91] vctrs_0.6.4
#> [92] png_0.1-8
#> [93] treeio_1.26.0
#> [94] poweRlaw_0.70.6
#> [95] gtable_0.3.4
#> [96] cachem_1.0.8
#> [97] xfun_0.40
#> [98] rbibutils_2.2.16
#> [99] S4Arrays_1.2.0
#> [100] mime_0.12
#> [101] tidygraph_1.2.3
#> [102] pracma_2.4.2
#> [103] survival_3.5-5
#> [104] iterators_1.0.14
#> [105] statmod_1.5.0
#> [106] MatchIt_4.5.5
#> [107] stabs_0.6-4
#> [108] interactiveDisplayBase_1.40.0
#> [109] ellipsis_0.3.2
#> [110] rlemon_0.2.1
#> [111] nlme_3.1-162
#> [112] utils_4.3.1
#> [113] ggtree_3.10.0
#> [114] usethis_2.2.2
#> [115] bit64_4.0.5
#> [116] progress_1.2.2
#> [117] filelock_1.0.2
#> [118] GENIE3_1.24.0
#> [119] GenomeInfoDb_1.38.0
#> [120] rprojroot_2.0.4
#> [121] bslib_0.5.1
#> [122] KernSmooth_2.23-21
#> [123] colorspace_2.1-0
#> [124] seqLogo_1.68.0
#> [125] BiocGenerics_0.48.1
#> [126] DBI_1.1.3
#> [127] DESeq2_1.42.0
#> [128] tidyselect_1.2.0
#> [129] processx_3.8.2
#> [130] bit_4.0.5
#> [131] compiler_4.3.1
#> [132] curl_5.1.0
#> [133] optmatch_0.10.6
#> [134] glmnet_4.1-8
#> [135] xml2_1.3.5
#> [136] desc_1.4.2
#> [137] DelayedArray_0.28.0
#> [138] shadowtext_0.1.2
#> [139] rtracklayer_1.62.0
#> [140] scales_1.2.1
#> [141] caTools_1.18.2
#> [142] ChIPseeker_1.38.0
#> [143] monaLisa_1.8.0
#> [144] callr_3.7.3
#> [145] rappdirs_0.3.3
#> [146] stringr_1.5.0
#> [147] digest_0.6.33
#> [148] shinyBS_0.61.1
#> [149] rmarkdown_2.25
#> [150] XVector_0.42.0
#> [151] htmltools_0.5.6.1
#> [152] pkgconfig_2.0.3
#> [153] MatrixGenerics_1.14.0
#> [154] highr_0.10
#> [155] dbplyr_2.4.0
#> [156] fastmap_1.1.1
#> [157] rlang_1.1.1
#> [158] GlobalOptions_0.1.2
#> [159] htmlwidgets_1.6.2
#> [160] shiny_1.7.5.1
#> [161] jquerylib_0.1.4
#> [162] farver_2.1.1
#> [163] zoo_1.8-12
#> [164] jsonlite_1.8.7
#> [165] BiocParallel_1.36.0
#> [166] R.oo_1.25.0
#> [167] GOSemSim_2.28.0
#> [168] RCurl_1.98-1.13
#> [169] magrittr_2.0.3
#> [170] GenomeInfoDbData_1.2.11
#> [171] ggplotify_0.1.2
#> [172] patchwork_1.1.3
#> [173] munsell_0.5.0
#> [174] Rcpp_1.0.11
#> [175] ape_5.7-1
#> [176] viridis_0.6.4
#> [177] chk_0.9.1
#> [178] stringi_1.7.12
#> [179] ggraph_2.1.0
#> [180] zlibbioc_1.48.0
#> [181] MASS_7.3-60
#> [182] AnnotationHub_3.10.0
#> [183] plyr_1.8.9
#> [184] pkgbuild_1.4.2
#> [185] parallel_4.3.1
#> [186] HPO.db_0.99.2
#> [187] ggrepel_0.9.4
#> [188] CNEr_1.38.0
#> [189] Biostrings_2.70.1
#> [190] graphlayouts_1.0.2
#> [191] splines_4.3.1
#> [192] hms_1.1.3
#> [193] circlize_0.4.15
#> [194] locfit_1.5-9.8
#> [195] ps_1.7.5
#> [196] igraph_1.5.1
#> [197] reshape2_1.4.4
#> [198] biomaRt_2.58.0
#> [199] stats4_4.3.1
#> [200] pkgload_1.3.3
#> [201] TFMPvalue_0.0.9
#> [202] BiocVersion_3.18.0
#> [203] XML_3.99-0.15
#> [204] evaluate_0.23
#> [205] multiMiR_1.24.0
#> [206] BiocManager_1.30.22
#> [207] tzdb_0.4.0
#> [208] foreach_1.5.2
#> [209] tweenr_2.0.2
#> [210] httpuv_1.6.11
#> [211] networkD3_0.4
#> [212] graphics_4.3.1
#> [213] tidyr_1.3.0
#> [214] purrr_1.0.2
#> [215] polyclip_1.10-6
#> [216] clue_0.3-65
#> [217] ggplot2_3.4.4
#> [218] ggforce_0.4.1
#> [219] xtable_1.8-4
#> [220] restfulr_0.0.15
#> [221] tidytree_0.4.5
#> [222] MPO.db_0.99.7
#> [223] roxygen2_7.2.3
#> [224] later_1.3.1
#> [225] viridisLite_0.4.2
#> [226] tibble_3.2.1
#> [227] aplot_0.2.2
#> [228] memoise_2.0.1
#> [229] AnnotationDbi_1.64.1
#> [230] GenomicAlignments_1.38.0
#> [231] IRanges_2.36.0
#> [232] stats_4.3.1
#> [233] cluster_2.1.4