20.1 Projecting bulk ATAC-seq data
Projection of bulk ATAC-seq data into a single-cell dimensionality reduction is handled by the projectBulkATAC()
function. This function accepts a SummarizedExperiment
object containing the relevant bulk ATAC-seq data, subsets this object based on overlap of the peak regions in the bulk ATAC-seq SummarizedExperiment
and single-cell ArchRProject
, simulates sub-sampled “pseudocells”, and projects these pseudocells into the specified dimensionality reduction and embedding. A “pseudocell” is a downsampled representation of a bulk ATAC-seq library
To get the projectBulkATAC()
function to work, the SummarizedExperiment
object must be properly formatted. In particular, there needs to be an assay
named “counts” containing the raw counts for each peak (row) for each sample (column). The rowRanges
can be any peak set and the projectBulkATAC()
function will match those rows to the peakSet
of your ArchRProject
based on overlaps. This means that it is not required to have a counts matrix based on the exact same set of peaks as used in your ArchRProject
, enabling projection of bulk ATAC-seq data analyzed previously, for example from publicly available sources.
Below, we demonstrate the core functionality of projectBulkATAC()
using a data set of bulk ATAC-seq data from FACS-enriched hematopoietic cell types (Corces et al. Nature Genetics 2015). First, lets download the counts matrix from GEO and fix the column names to be more human readable.
#Download bulk ATAC-seq data and organize raw counts matrix
download.file(url = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE74nnn/GSE74912/suppl/GSE74912_ATACseq_All_Counts.txt.gz", destfile = "Corces_HemeAML_BulkATAC.txt.gz")
download.file(url = "https://jeffgranja.s3.amazonaws.com/ArchR/TestData/Corces_HemeAML_ReHeader.txt", destfile = "Corces_HemeAML_ReHeader.txt")
<- read.table(file = "Corces_HemeAML_BulkATAC.txt.gz", header = TRUE, sep = '\t', stringsAsFactors = FALSE)
df <- read.table(file = "Corces_HemeAML_ReHeader.txt", header = TRUE, sep = '\t', stringsAsFactors = FALSE)
header colnames(df) <- header$NewHeader
1:6,1:6]
df[## chr start end Donor4983_1A_HSC Donor4983_2A_MPP Donor4983_3A_LMPP
## 1 chr1 10025 10525 1 0 4
## 2 chr1 13252 13752 0 5 6
## 3 chr1 16019 16519 4 2 6
## 4 chr1 96376 96876 0 0 0
## 5 chr1 115440 115940 0 3 0
## 6 chr1 235393 235893 0 2 2
Inspecting the first few rows and columns of this data.frame
shows that the first three columns are actually the peak coordinates and the remaining columns represent the individual samples. Because of this, we need to strip out the peak information and create a separate GenomicRanges
object to represent the peak regions.
<- GRanges(seqnames = df$chr, ranges = IRanges(start = df$start+1, end = df$end))
bulk_gr <- df[,which(colnames(df) %ni% c("chr","start","end"))]
df rownames(df) <- paste(bulk_gr)
We don’t need all of the data in this counts matrix so for simplicity we will subset the counts matrix to only include a few cell types (columns) with data that is well-represented in our tutorial dataset.
<- c("Donor4983_10A_CD8Tcell",
colsToKeep "Donor4983_9A_CD4Tcell",
"Donor5483_13A_Bcell",
"Donor5483_11B_NKcell",
"Donor7256_7B_Mono")
<- df[,colsToKeep] df
Now we are ready to create our SummarizedExperiment
object using the raw counts and peak regions.
<- SummarizedExperiment(assays = SimpleList(counts = as.matrix(df)), rowRanges = bulk_gr)
seBulk
seBulk## class: RangedSummarizedExperiment
## dim: 590650 5
## metadata(0):
## assays(1): counts
## rownames(590650): chr1:10026-10525 chr1:13253-13752 ...
## chrX:154912442-154912941 chrX:155259861-155260360
## rowData names(0):
## colnames(5): Donor4983_10A_CD8Tcell Donor4983_9A_CD4Tcell
## Donor5483_13A_Bcell Donor5483_11B_NKcell Donor7256_7B_Mono
## colData names(0):
To project these bulk samples using projectBulkATAC()
, we specify the reducedDims
and embedding
we want to use as well as the number of “pseudocells” we want to project.
<- projectBulkATAC(ArchRProj = projHeme5,
bulkPro seATAC = seBulk,
reducedDims = "IterativeLSI",
embedding = "UMAP",
n = 250)
## ArchR logging to : ArchRLogs/ArchR-projectBulkATAC-371b0186fadbb-Date-2022-12-23_Time-09-31-50.log
## If there is an issue, please report to github with logFile!
## Overlap Ratio of Reduced Dims Features = 0.99064
## 2022-12-23 09:31:51 :
## 09:31:58 Creating temp directory /tmp/Rtmp5J9pWv/dir371b0227801f0
## untar: using cmd = '/usr/bin/gtar -xf '/corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjHeme1/Embeddings/Save-Uwot-UMAP-Params-IterativeLSI-371b04f188bb0-Date-2022-12-23_Time-06-21-13.tar' -C '/tmp/Rtmp5J9pWv/dir371b0227801f0''
## 09:32:00 Read 6250 rows and found 30 numeric columns
## 09:32:00 Processing block 1 of 1
## 09:32:00 Writing NN index file to temp file /tmp/Rtmp5J9pWv/file371b016680773
## 09:32:00 Searching Annoy index using 8 threads, search_k = 3000
## 09:32:01 Commencing smooth kNN distance calibration using 8 threads
## 09:32:01 Initializing by weighted average of neighbor coordinates using 8 threads
## 09:32:01 Commencing optimization for 67 epochs, with 187500 positive edges
## 09:32:01 Finished
## ArchR logging successful to : ArchRLogs/ArchR-projectBulkATAC-371b0186fadbb-Date-2022-12-23_Time-09-31-50.log
bulkPro## List of length 3
## names(3): simulatedBulkUMAP singleCellUMAP simulatedReducedDims
The output of projectBulkATAC()
is a list object containing 3 entries discussed below.
simulatedBulkUMAP
is a DataFrame
containing the embedding coordinates for the simulated “pseduocells”.
head(bulkPro$simulatedBulkUMAP)
## DataFrame with 6 rows and 3 columns
## UMAP1 UMAP2 Type
## <numeric> <numeric> <Rle>
## Donor4983_10A_CD8Tcell#1 9.55414 0.0481353 Donor4983_10A_CD8Tcell
## Donor4983_10A_CD8Tcell#2 8.85958 0.4011322 Donor4983_10A_CD8Tcell
## Donor4983_10A_CD8Tcell#3 8.89925 0.5022770 Donor4983_10A_CD8Tcell
## Donor4983_10A_CD8Tcell#4 9.17194 0.4374651 Donor4983_10A_CD8Tcell
## Donor4983_10A_CD8Tcell#5 8.89082 0.5976226 Donor4983_10A_CD8Tcell
## Donor4983_10A_CD8Tcell#6 8.89172 0.9010460 Donor4983_10A_CD8Tcell
singleCellUMAP
is a DataFrame
containing the embedding coordinates for all of the original single-cells in your ArchRProject
.
head(bulkPro$singleCellUMAP)
## DataFrame with 6 rows and 3 columns
## UMAP1 UMAP2 Type
## <numeric> <numeric> <Rle>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 -2.555330 0.113447 scATAC
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 7.815902 0.855727 scATAC
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 0.587185 8.790587 scATAC
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 1.454005 6.804272 scATAC
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 0.632415 8.228344 scATAC
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 -3.186445 -1.106662 scATAC
simulatedReducedDims
is a DataFrame
containing the coordinates in the given reduced dimensions (i.e. “IterativeLSI”) for each of the simulated “pseudocells”.
head(bulkPro$simulatedReducedDims[,1:5])
## LSI1 LSI2 LSI3 LSI4 LSI5
## Donor4983_10A_CD8Tcell#1 -4.727875 1.611101 -0.3813222 -0.10120349 -0.1299567
## Donor4983_10A_CD8Tcell#2 -4.691325 1.803942 -0.3394767 -0.05715622 -0.1734936
## Donor4983_10A_CD8Tcell#3 -4.716584 1.636141 -0.4405469 -0.29780896 -0.1893223
## Donor4983_10A_CD8Tcell#4 -4.652775 1.685015 -0.4968275 -0.15796376 -0.1129404
## Donor4983_10A_CD8Tcell#5 -4.609569 1.851648 -0.4296640 -0.01933770 -0.3534289
## Donor4983_10A_CD8Tcell#6 -4.642584 1.714165 -0.4699323 -0.16873922 -0.2273982
Once we have the output of projectBulkATAC()
, we often just want to plot this data to see where in our embedding the simulated “pseudocells” fall. There are many ways to do this but the easiest is to just concatenate the data from projectBulkATAC()
, create a color palette, and use ggPoint()
to plot.
#concatenate single-cell and pseudocell embedding positions
<- rbind(bulkPro$singleCellUMAP, bulkPro$simulatedBulkUMAP)
pro_df
#create a color palette and force the scATAC cells to be grey to enable visualization of the project bulk ATAC data
<- paletteDiscrete(values = unique(as.character(pro_df$Type)), set = "stallion")
pal "scATAC"] <- "#BABABA"
pal[
#plot using ggPoint
ggPoint(x = pro_df$UMAP1,
y = pro_df$UMAP2,
discrete = TRUE,
color = as.character(pro_df$Type),
pal = pal,
xlabel = "UMAP Dimension 1",
ylabel = "UMAP Dimension 2",
title = "Heme Bulk ATAC-seq Projection")
In the resulting projection UMAP plot, you can see that the bulk ATAC-seq pseudocells pile up in the regions occupied by the corresponding single-cells.
At this point in the tutorial, we are done modifying our ArchRProject
so lets save it one last time to make sure all of our changes are saved on disk. We will just save it on top of our previous projHeme5
directory.
projHeme5 <- saveArchRProject(ArchRProj = projHeme5, outputDirectory = “Save-ProjHeme5”, load = TRUE)