Exploratory Data Analysis for RNA-seq data

TODO
Author

Daianna Gonzalez-Padilla

Published

December 12, 2024

⚠️ This page is under development.

Introduction

Saying that over 50% of biological data analysis is about exploring and cleaning data is not an exaggeration. In fact, that’s the way it’s supposed to be. Biological data are inherently noisy. Not only biological differences exist between cells and samples, but all sampling and experimental procedures for data generation, introduce variability, errors, and biases into the data that we have to first spot to then clean or correct.

Exploratory Data Analysis (EDA) is a primordial initial step in which we get to know the overall and specific aspects of the data. This is a process where we create tons of plots to uncover and visualize the data structures, patterns, relationships, and outliers, through which we identify data issues, drive hypoteses, and guide subsequent statistical testing.

EDA is not a pre-established path with ordered steps to follow. What you do, what you plot depends on the particularities of the data you have and the questions you aim to answer; it is completely context dependent. Moreover, many exploratory analyses you perform depend on what previous explorations informed, adding to the global understanding of the data. However, there are basic exploratory steps that establish a foundation to move forward in your analyses.

In transcriptomic data analysis, EDA includes exploring the read count distribution, filtering of lowly-expressed genes and poor-quality samples/cells, determine biological and technical sources of gene expression variation, and detect confounding factors.

In this blog post, you will learn how to perform EDA by exploring an example gene expression data set, aiming to provide a methodological but flexible framework for you to guide your own exploratory analysis.

What you’ll learn here

  • Know the type of exploratory analyses that can be done for RNA-seq data and the rationale behind them.

  • Understand the insights that can be derived from explorations and how they inform posterior analysis decisions.

  • Learn of poweful visualization tools to explore your data.

An example data set

Download data

We’ll use recount3 (Leonardo Collado-Torres 2020; Wilks et al. 2021) package to download real gene expression data from a transcriptomic study, including sample metadata and quality-control (QC) metrics.

The chosen study can be found in the Sequence Read Archive (SRA) under ID SRP107565, and it’s titled “Multiomics Profiling Establishes the Polypharmacology of FDA-Approved CDK4/6 Inhibitors and the Potential for Differential Clinical Activity” (Marc Hafner et al., 2019).

library(recount3)

## Download all available projects in human in recount3
human_projects <- available_projects()

## Download gene expression data and metadata from SRP107565 study
proj_info <- subset(
    human_projects,
    project == "SRP107565" & project_type == "data_sources",
    recount3_url = "https://sciserver.org/public-data/recount3/data"
)

proj_info
    project organism file_source     project_home project_type n_samples
1 SRP107565    human         sra data_sources/sra data_sources       216
## Create RangedSummarizedExperiment object to handle RNA-seq and sample data 
rse <- create_rse(proj_info)

## Gene expression data in assay(rse): for first 5 genes and 5 samples
assay(rse)[1:5, 1:5]
                  SRR5579425 SRR5579426 SRR5579433 SRR5579434 SRR5579435
ENSG00000278704.1          0          0          0          0          0
ENSG00000277400.1          0          0          0          0          0
ENSG00000274847.1          0          0          0          0          0
ENSG00000277428.1          0          0          0          0          0
ENSG00000276256.1          0          0          0          0          0
## Sample metadata in colData(rse): for first 6 samples and 3 variables
head(colData(rse)[, 1:3])
DataFrame with 6 rows and 3 columns
             rail_id external_id       study
           <integer> <character> <character>
SRR5579425   1000031  SRR5579425   SRP107565
SRR5579426   1000045  SRR5579426   SRP107565
SRR5579433   1000255  SRR5579433   SRP107565
SRR5579434   1000270  SRR5579434   SRP107565
SRR5579435   1000286  SRR5579435   SRP107565
SRR5579436   1000301  SRR5579436   SRP107565
## Gene data in rowData(rse): for first 6 genes and 3 variables
head(rowData(rse)[, 1:3])
DataFrame with 6 rows and 3 columns
                    source     type bp_length
                  <factor> <factor> <numeric>
ENSG00000278704.1  ENSEMBL     gene      2237
ENSG00000277400.1  ENSEMBL     gene      2179
ENSG00000274847.1  ENSEMBL     gene      1599
ENSG00000277428.1  ENSEMBL     gene       101
ENSG00000276256.1  ENSEMBL     gene      2195
ENSG00000278198.1  ENSEMBL     gene      1468

Study design and aims

In this study, the authors performed transcriptomic, proteomic, biochemical, and phenotypic assays to compare the activities of 3 inhibitors of cyclin-dependent kinases 4/6 (CDK4/6): abemaciclib, palbociclib, and ribociclib, which are used to treat hormone receptor-positive breast cancer. For our purposes, only transcriptomic data have been downloaded to be explored.

Specifically, in the transcriptomic experiment, 7 breast cancer human cell lines were drug-treated with 0.3, 1 OR 3 µM of abemaciclib, palbociclib, OR ribociclib (OR untreated = control = drug concentration of 0 µM), and mRNA sequencing was performed after 6 OR 24 hrs of exposure. Including replicates, a total of 216 bulk samples were sequenced.

Ilustration taken from Figure 1A in Marc Hafner et al., 2019.

Exploring the RangedSummarizedExperiment (Martin Morgan 2017; Huber et al. 2015) object (rse) we just created, we can see we have raw expression data for 63,856 genes across 216 samples.

dim(rse)
[1] 63856   216

Processing the sample attributes contained in sra.sample_attributes, we can check the number of samples from each cell line, treated with each drug, at each concentration, and time. Let’s add individual columns in the sample metadata for these variables.

library(tidyr)

## Sample attributes
head(rse$sra.sample_attributes, 3)
[1] "agent;;Abemaciclib|cell line;;BT20|dose;;0.3 uM|source_name;;BT20 breast cancer cell line|time;;6 hr"
[2] "agent;;Abemaciclib|cell line;;BT20|dose;;0.3 uM|source_name;;BT20 breast cancer cell line|time;;6 hr"
[3] "agent;;Palbociclib|cell line;;BT20|dose;;3 uM|source_name;;BT20 breast cancer cell line|time;;6 hr"  
## Divide each string by "|" and extract drug
rse$Drug = sapply(rse$sra.sample_attributes, function(x){strsplit(strsplit(x, "\\|")[[1]][1], ";;")[[1]][2]}) %>% unname

## Num of samples treated with each drug
table(rse$Drug)

Abemaciclib     Control Palbociclib  Ribociclib 
         84          28          52          52 
## Num of samples of each cell line
rse$Cell_line = sapply(rse$sra.sample_attributes, function(x){strsplit(strsplit(x, "\\|")[[1]][2], ";;")[[1]][2]}) %>% unname

table(rse$Cell_line)

   BT20   BT549 HCC1419 HCC1806  HS578T    MCF7    T47D 
     32      32      32      32      32      24      32 
## Num of samples treated with each dose
rse$Concentration =  sapply(rse$sra.sample_attributes, function(x){strsplit(strsplit(x, "\\|")[[1]][3], ";;")[[1]][2]}) %>% unname

table(rse$Concentration)

  0 uM 0.3 uM   1 uM   3 uM 
    28     80      4    104 
## Num of samples sequenced after each time
rse$Time =  sapply(rse$sra.sample_attributes, function(x){strsplit(strsplit(x, "\\|")[[1]][5], ";;")[[1]][2]}) %>% unname

table(rse$Time)

24 hr  6 hr 
  108   108 

Sample variables

In any EDA, we first need to know what the sample variables in the metadata stand for in order to interpret accurately any insights derived from them. A second important consideration is that, even when there are well-established QC metrics used to filter samples, exploring a broader set of them is convenient to find unexpected sample quality differences.

We start by defining the sample-level variables we’ll explore in our EDA.

Main sample attributes

  • Cell_line: name of breast cancer cell line.

  • Drug: the drug with which a given cell line was treated.

  • Concentration: the concentration of the drug administered to the cell line.

  • Time: drug exposure time for each cell line.

Quality-control metrics

QC metrics are measurements of a sample’s RNA composition and the mapping of its reads to a reference genome, reflecting the integrity of the cells, and the quality of the mRNA extraction, library preparation, sequencing, and read alignment steps.

The following metrics were collected by recount3 through a number of tools.

  • Alignment-related metrics obtained through STAR (Dobin et al. 2012):

    • all_mapped_reads: total number of aligned reads aligned.

    • uniquely_mapped_reads_%: number of reads that mapped to a single locus, of the total input reads.

    • uniquely_mapped_reads_number: number of reads that mapped to a single locus.

    • %_of_reads_mapped_to_multiple_loci: number of reads that mapped to multiple loci, of the input reads.

    • number_of_reads_mapped_to_multiple_loci: number of reads that mapped to multiple loci.

    • %_of_reads_unmapped:_other: reads that didn’t map due to no acceptable seed/windows, of all input reads.

    • number_of_reads_unmapped:_other: number of reads left unmapped due to no acceptable seed/windows.

    • %_of_reads_unmapped:_too_many_mismatches: Number of reads where best alignment has more mismatches than max allowed number of mismatches divided by number of input reads

    • number_of_reads_unmapped:_too_many_mismatches: Number of reads where best alignment has more mismatches than max allowed number of mismatches

    • %_of_reads_unmapped:_too_short: Number of reads where best alignment was shorter than min allowed mapped length divided by number of input reads.

    • number_of_reads_unmapped:_too_short: Number of reads where best alignment was shorter than min allowed mapped length.

  • SAMtools (Danecek et al. 2021) statistics for chromosome-specific read mapping. Of special interest are the reads mapping to the mitochondrial and sex chromosomes (see explanation in Step X):

    • aligned_reads%.chrm: Percent of reads aligning to the mitochondrial genome.

    • aligned_reads%.chrx: Percent of reads aligning to chromosome X.

    • aligned_reads%.chry: Precent of reads aligning to chromosome Y.

  • Sequencing coverage summaries in gene annotations by megadepth (bc) (Wilks et al. 2020) and featureCounts (Liao, Smyth, and Shi 2013) (fc):

    • bc_auc.all_reads_all_bases: Area under coverage (total depth of coverage evaluated at all bases) for all alignments

    • bc_auc.all_reads_annotated_bases: Area under coverage for all alignments, but only for bases in annotated exons

    • bc_auc.unique_reads_all_bases: Area under coverage for uniquely aligned reads

    • bc_auc.unique_reads_annotated_bases: Area under coverage for uniquely aligned reads, but only for bases in annotated exons.

    • exon_fc_count_all.total: Total number of fragments, including multi-mappers, input to featureCounts

    • exon_fc_count_all.assigned: Number of fragments, including multi-mappers, assigned by featureCounts to an exon

    • exon_fc_count_unique.total: Total number of uniquely mapping fragments input to featureCounts

    • exon_fc_count_unique.assigned: Number of uniquely mapping fragments assigned by featureCounts to an exon.

    • gene_fc_count_all.total: Total number of fragments, including multi-mappers, input to featureCounts

    • gene_fc_count_all.assigned: Number of fragments, including multi-mappers, assigned by featureCounts to a gene

    • gene_fc_count_unique.total: Total number of uniquely mapping fragments input to featureCounts

    • gene_fc_count_unique.assigned: Number of uniquely mapping fragments assigned by featureCounts to a gene.

For more details on the computation of these metrics please refer to recount3 documentation and the manual of each tool.

1. Exploring count data TODO

1.1 Read count distribution TODO

A first thing to plot is the frequency of the raw counts across all genes and samples, before any transformation. This gives us an idea of how sparse the data are, i.e., if a large percentage of the counts are zero.

library("ggplot2")

## Raw counts
data <- data.frame(raw_counts = as.vector(assays(rse)$raw_counts))
plot <- ggplot(data, aes(x = raw_counts)) +
    geom_histogram(colour = "black", fill = "lightgray") +
    labs(x = "Raw counts", y = "Frecuency") +
    theme_classic()
plot + theme(plot.margin = unit(c(2, 4, 2, 4), "cm"))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Percentage of counts that are zero
length(which(as.vector(assays(rse)$raw_counts) == 0)) / length(as.vector(assays(rse)$raw_counts)) * 100
[1] 61.05911

More than 60% of the total counts across all genes and bulk samples are 0. Even though this is a large fraction, it is not surprising as we don’t expect most genes to be expressed in every condition; scRNA-seq data is even more sparse. In step 1.3 we’ll deal with lowly-expressed genes …

1.2 Count log-normalization TODO

1.3 Filtering lowly-expressed genes TODO

There are multiple approaches to remove zero- and lowly-expressed genes. Here we’ll adopt the approach of removing genes with less than X CPM in at least X % samples …

https://carpentries-incubator.github.io/rna-seq-data-for-ml/episode5.html

3. Filtering low-quality samples

3.1 Explore differences in QC metrics between sample groups

Explore sex differences: sexual chr % can help to confirm the sex of the donors based on alignments to sex chromosomes.

https://frontlinegenomics.com/how-to-ngs-quality-control/

https://www.nature.com/articles/s41592-023-01946-4

3.2 Explore relationships between sample variables

3.3 Identify sample outliers (Boxplots and PCA)

4. Variable associations with gene expression

Identify technical sources of data variability to then reduce the systematic noise from the data.

4.1 Dimensionality reduction

4.2 Correlation between variables

4.2 Partition of gene expression variance

Conclusions

Answering these questions we may be able to formulate hypothesis that could guide future analysis or experiments.

EDA is extremely illuminating when performed comprehensively and you may find an answer to your data issues or lack of signals by looking back to this step.

References

Danecek, Petr, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, et al. 2021. “Twelve Years of SAMtools and BCFtools.” GigaScience 10 (2). https://doi.org/10.1093/gigascience/giab008.
Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. 2012. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21. https://doi.org/10.1093/bioinformatics/bts635.
Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21. https://doi.org/10.1038/nmeth.3252.
Leonardo Collado-Torres. 2020. “Recount3.” https://doi.org/10.18129/B9.BIOC.RECOUNT3.
Liao, Yang, Gordon K. Smyth, and Wei Shi. 2013. “featureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30. https://doi.org/10.1093/bioinformatics/btt656.
Martin Morgan, Valerie Obenchain. 2017. “SummarizedExperiment.” https://doi.org/10.18129/B9.BIOC.SUMMARIZEDEXPERIMENT.
Wilks, Christopher, Omar Ahmed, Daniel N. Baker, David Zhang, Leonardo Collado-Torres, and Ben Langmead. 2020. “Megadepth: Efficient Coverage Quantification for BigWigs and BAMs.” http://dx.doi.org/10.1101/2020.12.17.423317.
Wilks, Christopher, Shijie C. Zheng, Feng Yong Chen, Rone Charles, Brad Solomon, Jonathan P. Ling, Eddie Luidy Imada, et al. 2021. “Recount3: Summaries and Queries for Large-Scale RNA-Seq Expression and Splicing.” Genome Biology 22 (1). https://doi.org/10.1186/s13059-021-02533-6.