Saying that over 50% of biological data analysis is about exploring and cleaning data is not an exaggeration. In fact, that’s the way it’s supposed to be. Biological data are inherently noisy. Not only biological differences exist between cells and samples, but all sampling and experimental procedures for data generation, introduce variability, errors, and biases into the data that we have to first spot to then clean or correct.
Exploratory Data Analysis (EDA) is a primordial initial step in which we get to know the overall and specific aspects of the data. This is a process where we create tons of plots to uncover and visualize the data structures, patterns, relationships, and outliers, through which we identify data issues, drive hypoteses, and guide subsequent statistical testing.
EDA is not a pre-established path with ordered steps to follow. What you do, what you plot depends on the particularities of the data you have and the questions you aim to answer; it is completely context dependent. Moreover, many exploratory analyses you perform depend on what previous explorations informed, adding to the global understanding of the data. However, there are basic exploratory steps that establish a foundation to move forward in your analyses.
In transcriptomic data analysis, EDA includes exploring the read count distribution, filtering of lowly-expressed genes and poor-quality samples/cells, determine biological and technical sources of gene expression variation, and detect confounding factors.
In this blog post, you will learn how to perform EDA by exploring an example gene expression data set, aiming to provide a methodological but flexible framework for you to guide your own exploratory analysis.
What you’ll learn here
Know the type of exploratory analyses that can be done for RNA-seq data and the rationale behind them.
Understand the insights that can be derived from explorations and how they inform posterior analysis decisions.
Learn of poweful visualization tools to explore your data.
The chosen study can be found in the Sequence Read Archive (SRA) under ID SRP107565, and it’s titled “Multiomics Profiling Establishes the Polypharmacology of FDA-Approved CDK4/6 Inhibitors and the Potential for Differential Clinical Activity” (Marc Hafner et al., 2019).
library(recount3)## Download all available projects in human in recount3human_projects <-available_projects()## Download gene expression data and metadata from SRP107565 studyproj_info <-subset( human_projects, project =="SRP107565"& project_type =="data_sources",recount3_url ="https://sciserver.org/public-data/recount3/data")proj_info
## Create RangedSummarizedExperiment object to handle RNA-seq and sample data rse <-create_rse(proj_info)## Gene expression data in assay(rse): for first 5 genes and 5 samplesassay(rse)[1:5, 1:5]
In this study, the authors performed transcriptomic, proteomic, biochemical, and phenotypic assays to compare the activities of 3 inhibitors of cyclin-dependent kinases 4/6 (CDK4/6): abemaciclib, palbociclib, and ribociclib, which are used to treat hormone receptor-positive breast cancer. For our purposes, only transcriptomic data have been downloaded to be explored.
Specifically, in the transcriptomic experiment, 7 breast cancer human cell lines were drug-treated with 0.3, 1 OR 3 µM of abemaciclib, palbociclib, OR ribociclib (OR untreated = control= drug concentration of 0 µM), and mRNA sequencing was performed after 6 OR 24 hrs of exposure. Including replicates, a total of 216 bulk samples were sequenced.
Processing the sample attributes contained in sra.sample_attributes, we can check the number of samples from each cell line, treated with each drug, at each concentration, and time. Let’s add individual columns in the sample metadata for these variables.
[1] "agent;;Abemaciclib|cell line;;BT20|dose;;0.3 uM|source_name;;BT20 breast cancer cell line|time;;6 hr"
[2] "agent;;Abemaciclib|cell line;;BT20|dose;;0.3 uM|source_name;;BT20 breast cancer cell line|time;;6 hr"
[3] "agent;;Palbociclib|cell line;;BT20|dose;;3 uM|source_name;;BT20 breast cancer cell line|time;;6 hr"
## Divide each string by "|" and extract drugrse$Drug =sapply(rse$sra.sample_attributes, function(x){strsplit(strsplit(x, "\\|")[[1]][1], ";;")[[1]][2]}) %>% unname## Num of samples treated with each drugtable(rse$Drug)
Abemaciclib Control Palbociclib Ribociclib
84 28 52 52
## Num of samples of each cell linerse$Cell_line =sapply(rse$sra.sample_attributes, function(x){strsplit(strsplit(x, "\\|")[[1]][2], ";;")[[1]][2]}) %>% unnametable(rse$Cell_line)
## Num of samples treated with each doserse$Concentration =sapply(rse$sra.sample_attributes, function(x){strsplit(strsplit(x, "\\|")[[1]][3], ";;")[[1]][2]}) %>% unnametable(rse$Concentration)
0 uM 0.3 uM 1 uM 3 uM
28 80 4 104
## Num of samples sequenced after each timerse$Time =sapply(rse$sra.sample_attributes, function(x){strsplit(strsplit(x, "\\|")[[1]][5], ";;")[[1]][2]}) %>% unnametable(rse$Time)
24 hr 6 hr
108 108
Sample variables
In any EDA, we first need to know what the sample variables in the metadata stand for in order to interpret accurately any insights derived from them. A second important consideration is that, even when there are well-established QC metrics used to filter samples, exploring a broader set of them is convenient to find unexpected sample quality differences.
We start by defining the sample-level variables we’ll explore in our EDA.
Main sample attributes
Cell_line: name of breast cancer cell line.
Drug: the drug with which a given cell line was treated.
Concentration: the concentration of the drug administered to the cell line.
Time: drug exposure time for each cell line.
Quality-control metrics
QC metrics are measurements of a sample’s RNA composition and the mapping of its reads to a reference genome, reflecting the integrity of the cells, and the quality of the mRNA extraction, library preparation, sequencing, and read alignment steps.
The following metrics were collected by recount3 through a number of tools.
all_mapped_reads: total number of aligned reads aligned.
uniquely_mapped_reads_%: number of reads that mapped to a single locus, of the total input reads.
uniquely_mapped_reads_number: number of reads that mapped to a single locus.
%_of_reads_mapped_to_multiple_loci: number of reads that mapped to multiple loci, of the input reads.
number_of_reads_mapped_to_multiple_loci: number of reads that mapped to multiple loci.
%_of_reads_unmapped:_other: reads that didn’t map due to no acceptable seed/windows, of all input reads.
number_of_reads_unmapped:_other: number of reads left unmapped due to no acceptable seed/windows.
%_of_reads_unmapped:_too_many_mismatches: Number of reads where best alignment has more mismatches than max allowed number of mismatches divided by number of input reads
number_of_reads_unmapped:_too_many_mismatches: Number of reads where best alignment has more mismatches than max allowed number of mismatches
%_of_reads_unmapped:_too_short: Number of reads where best alignment was shorter than min allowed mapped length divided by number of input reads.
number_of_reads_unmapped:_too_short: Number of reads where best alignment was shorter than min allowed mapped length.
SAMtools(Danecek et al. 2021) statistics for chromosome-specific read mapping. Of special interest are the reads mapping to the mitochondrial and sex chromosomes (see explanation in Step X):
aligned_reads%.chrm: Percent of reads aligning to the mitochondrial genome.
aligned_reads%.chrx: Percent of reads aligning to chromosome X.
aligned_reads%.chry: Precent of reads aligning to chromosome Y.
bc_auc.all_reads_all_bases: Area under coverage (total depth of coverage evaluated at all bases) for all alignments
bc_auc.all_reads_annotated_bases: Area under coverage for all alignments, but only for bases in annotated exons
bc_auc.unique_reads_all_bases: Area under coverage for uniquely aligned reads
bc_auc.unique_reads_annotated_bases: Area under coverage for uniquely aligned reads, but only for bases in annotated exons.
exon_fc_count_all.total: Total number of fragments, including multi-mappers, input to featureCounts
exon_fc_count_all.assigned: Number of fragments, including multi-mappers, assigned by featureCounts to an exon
exon_fc_count_unique.total: Total number of uniquely mapping fragments input to featureCounts
exon_fc_count_unique.assigned: Number of uniquely mapping fragments assigned by featureCounts to an exon.
gene_fc_count_all.total: Total number of fragments, including multi-mappers, input to featureCounts
gene_fc_count_all.assigned: Number of fragments, including multi-mappers, assigned by featureCounts to a gene
gene_fc_count_unique.total: Total number of uniquely mapping fragments input to featureCounts
gene_fc_count_unique.assigned: Number of uniquely mapping fragments assigned by featureCounts to a gene.
For more details on the computation of these metrics please refer to recount3documentation and the manual of each tool.
1. Exploring count data TODO
1.1 Read count distribution TODO
A first thing to plot is the frequency of the raw counts across all genes and samples, before any transformation. This gives us an idea of how sparse the data are, i.e., if a large percentage of the counts are zero.
library("ggplot2")## Raw countsdata <-data.frame(raw_counts =as.vector(assays(rse)$raw_counts))plot <-ggplot(data, aes(x = raw_counts)) +geom_histogram(colour ="black", fill ="lightgray") +labs(x ="Raw counts", y ="Frecuency") +theme_classic()plot +theme(plot.margin =unit(c(2, 4, 2, 4), "cm"))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Percentage of counts that are zerolength(which(as.vector(assays(rse)$raw_counts) ==0)) /length(as.vector(assays(rse)$raw_counts)) *100
[1] 61.05911
More than 60% of the total counts across all genes and bulk samples are 0. Even though this is a large fraction, it is not surprising as we don’t expect most genes to be expressed in every condition; scRNA-seq data is even more sparse. In step 1.3 we’ll deal with lowly-expressed genes …
1.2 Count log-normalization TODO
1.3 Filtering lowly-expressed genes TODO
There are multiple approaches to remove zero- and lowly-expressed genes. Here we’ll adopt the approach of removing genes with less than X CPM in at least X % samples …
3.2 Explore relationships between sample variables
3.3 Identify sample outliers (Boxplots and PCA)
4. Variable associations with gene expression
Identify technical sources of data variability to then reduce the systematic noise from the data.
4.1 Dimensionality reduction
4.2 Correlation between variables
4.2 Partition of gene expression variance
Conclusions
Answering these questions we may be able to formulate hypothesis that could guide future analysis or experiments.
EDA is extremely illuminating when performed comprehensively and you may find an answer to your data issues or lack of signals by looking back to this step.
References
Danecek, Petr, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, et al. 2021. “Twelve Years of SAMtools and BCFtools.”GigaScience 10 (2). https://doi.org/10.1093/gigascience/giab008.
Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. 2012. “STAR: Ultrafast Universal RNA-Seq Aligner.”Bioinformatics 29 (1): 15–21. https://doi.org/10.1093/bioinformatics/bts635.
Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.”Nature Methods 12 (2): 115–21. https://doi.org/10.1038/nmeth.3252.
Liao, Yang, Gordon K. Smyth, and Wei Shi. 2013. “featureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.”Bioinformatics 30 (7): 923–30. https://doi.org/10.1093/bioinformatics/btt656.
Wilks, Christopher, Omar Ahmed, Daniel N. Baker, David Zhang, Leonardo Collado-Torres, and Ben Langmead. 2020. “Megadepth: Efficient Coverage Quantification for BigWigs and BAMs.”http://dx.doi.org/10.1101/2020.12.17.423317.
Wilks, Christopher, Shijie C. Zheng, Feng Yong Chen, Rone Charles, Brad Solomon, Jonathan P. Ling, Eddie Luidy Imada, et al. 2021. “Recount3: Summaries and Queries for Large-Scale RNA-Seq Expression and Splicing.”Genome Biology 22 (1). https://doi.org/10.1186/s13059-021-02533-6.