20.1 Projecting bulk ATAC-seq data

Projection of bulk ATAC-seq data into a single-cell dimensionality reduction is handled by the projectBulkATAC() function. This function accepts a SummarizedExperiment object containing the relevant bulk ATAC-seq data, subsets this object based on overlap of the peak regions in the bulk ATAC-seq SummarizedExperiment and single-cell ArchRProject, simulates sub-sampled “pseudocells”, and projects these pseudocells into the specified dimensionality reduction and embedding. A “pseudocell” is a downsampled representation of a bulk ATAC-seq library

To get the projectBulkATAC() function to work, the SummarizedExperiment object must be properly formatted. In particular, there needs to be an assay named “counts” containing the raw counts for each peak (row) for each sample (column). The rowRanges can be any peak set and the projectBulkATAC() function will match those rows to the peakSet of your ArchRProject based on overlaps. This means that it is not required to have a counts matrix based on the exact same set of peaks as used in your ArchRProject, enabling projection of bulk ATAC-seq data analyzed previously, for example from publicly available sources.

Below, we demonstrate the core functionality of projectBulkATAC() using a data set of bulk ATAC-seq data from FACS-enriched hematopoietic cell types (Corces et al. Nature Genetics 2015). First, lets download the counts matrix from GEO and fix the column names to be more human readable.

#Download bulk ATAC-seq data and organize raw counts matrix
download.file(url = "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE74nnn/GSE74912/suppl/GSE74912_ATACseq_All_Counts.txt.gz", destfile = "Corces_HemeAML_BulkATAC.txt.gz")
download.file(url = "https://jeffgranja.s3.amazonaws.com/ArchR/TestData/Corces_HemeAML_ReHeader.txt", destfile = "Corces_HemeAML_ReHeader.txt")

df <- read.table(file = "Corces_HemeAML_BulkATAC.txt.gz", header = TRUE, sep = '\t', stringsAsFactors = FALSE)
header <- read.table(file = "Corces_HemeAML_ReHeader.txt", header = TRUE, sep = '\t', stringsAsFactors = FALSE)
colnames(df) <- header$NewHeader
df[1:6,1:6]
##    chr  start    end Donor4983_1A_HSC Donor4983_2A_MPP Donor4983_3A_LMPP
## 1 chr1  10025  10525                1                0                 4
## 2 chr1  13252  13752                0                5                 6
## 3 chr1  16019  16519                4                2                 6
## 4 chr1  96376  96876                0                0                 0
## 5 chr1 115440 115940                0                3                 0
## 6 chr1 235393 235893                0                2                 2

Inspecting the first few rows and columns of this data.frame shows that the first three columns are actually the peak coordinates and the remaining columns represent the individual samples. Because of this, we need to strip out the peak information and create a separate GenomicRanges object to represent the peak regions.

bulk_gr <- GRanges(seqnames = df$chr, ranges = IRanges(start = df$start+1, end = df$end))
df <- df[,which(colnames(df) %ni% c("chr","start","end"))]
rownames(df) <- paste(bulk_gr)

We don’t need all of the data in this counts matrix so for simplicity we will subset the counts matrix to only include a few cell types (columns) with data that is well-represented in our tutorial dataset.

colsToKeep <- c("Donor4983_10A_CD8Tcell",
                "Donor4983_9A_CD4Tcell",
                "Donor5483_13A_Bcell",
                "Donor5483_11B_NKcell",
                "Donor7256_7B_Mono")
df <- df[,colsToKeep]

Now we are ready to create our SummarizedExperiment object using the raw counts and peak regions.

seBulk <- SummarizedExperiment(assays = SimpleList(counts = as.matrix(df)), rowRanges = bulk_gr)
seBulk
## class: RangedSummarizedExperiment 
## dim: 590650 5 
## metadata(0):
## assays(1): counts
## rownames(590650): chr1:10026-10525 chr1:13253-13752 ...
##   chrX:154912442-154912941 chrX:155259861-155260360
## rowData names(0):
## colnames(5): Donor4983_10A_CD8Tcell Donor4983_9A_CD4Tcell
##   Donor5483_13A_Bcell Donor5483_11B_NKcell Donor7256_7B_Mono
## colData names(0):

To project these bulk samples using projectBulkATAC(), we specify the reducedDims and embedding we want to use as well as the number of “pseudocells” we want to project.

bulkPro <- projectBulkATAC(ArchRProj = projHeme5,
                           seATAC = seBulk,
                           reducedDims = "IterativeLSI",
                           embedding = "UMAP",
                           n = 250)
## ArchR logging to : ArchRLogs/ArchR-projectBulkATAC-371b0186fadbb-Date-2022-12-23_Time-09-31-50.log
## If there is an issue, please report to github with logFile!
## Overlap Ratio of Reduced Dims Features = 0.99064
## 2022-12-23 09:31:51 :
## 09:31:58 Creating temp directory /tmp/Rtmp5J9pWv/dir371b0227801f0
## untar: using cmd = '/usr/bin/gtar -xf '/corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/Save-ProjHeme1/Embeddings/Save-Uwot-UMAP-Params-IterativeLSI-371b04f188bb0-Date-2022-12-23_Time-06-21-13.tar' -C '/tmp/Rtmp5J9pWv/dir371b0227801f0''
## 09:32:00 Read 6250 rows and found 30 numeric columns
## 09:32:00 Processing block 1 of 1
## 09:32:00 Writing NN index file to temp file /tmp/Rtmp5J9pWv/file371b016680773
## 09:32:00 Searching Annoy index using 8 threads, search_k = 3000
## 09:32:01 Commencing smooth kNN distance calibration using 8 threads
## 09:32:01 Initializing by weighted average of neighbor coordinates using 8 threads
## 09:32:01 Commencing optimization for 67 epochs, with 187500 positive edges
## 09:32:01 Finished
## ArchR logging successful to : ArchRLogs/ArchR-projectBulkATAC-371b0186fadbb-Date-2022-12-23_Time-09-31-50.log
bulkPro
## List of length 3
## names(3): simulatedBulkUMAP singleCellUMAP simulatedReducedDims

The output of projectBulkATAC() is a list object containing 3 entries discussed below.

simulatedBulkUMAP is a DataFrame containing the embedding coordinates for the simulated “pseduocells”.

head(bulkPro$simulatedBulkUMAP)
## DataFrame with 6 rows and 3 columns
##                              UMAP1     UMAP2                   Type
##                          <numeric> <numeric>                  <Rle>
## Donor4983_10A_CD8Tcell#1   9.55414 0.0481353 Donor4983_10A_CD8Tcell
## Donor4983_10A_CD8Tcell#2   8.85958 0.4011322 Donor4983_10A_CD8Tcell
## Donor4983_10A_CD8Tcell#3   8.89925 0.5022770 Donor4983_10A_CD8Tcell
## Donor4983_10A_CD8Tcell#4   9.17194 0.4374651 Donor4983_10A_CD8Tcell
## Donor4983_10A_CD8Tcell#5   8.89082 0.5976226 Donor4983_10A_CD8Tcell
## Donor4983_10A_CD8Tcell#6   8.89172 0.9010460 Donor4983_10A_CD8Tcell

singleCellUMAP is a DataFrame containing the embedding coordinates for all of the original single-cells in your ArchRProject.

head(bulkPro$singleCellUMAP)
## DataFrame with 6 rows and 3 columns
##                                       UMAP1     UMAP2   Type
##                                   <numeric> <numeric>  <Rle>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 -2.555330  0.113447 scATAC
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1  7.815902  0.855727 scATAC
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1  0.587185  8.790587 scATAC
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1  1.454005  6.804272 scATAC
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1  0.632415  8.228344 scATAC
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 -3.186445 -1.106662 scATAC

simulatedReducedDims is a DataFrame containing the coordinates in the given reduced dimensions (i.e. “IterativeLSI”) for each of the simulated “pseudocells”.

head(bulkPro$simulatedReducedDims[,1:5])
##                               LSI1     LSI2       LSI3        LSI4       LSI5
## Donor4983_10A_CD8Tcell#1 -4.727875 1.611101 -0.3813222 -0.10120349 -0.1299567
## Donor4983_10A_CD8Tcell#2 -4.691325 1.803942 -0.3394767 -0.05715622 -0.1734936
## Donor4983_10A_CD8Tcell#3 -4.716584 1.636141 -0.4405469 -0.29780896 -0.1893223
## Donor4983_10A_CD8Tcell#4 -4.652775 1.685015 -0.4968275 -0.15796376 -0.1129404
## Donor4983_10A_CD8Tcell#5 -4.609569 1.851648 -0.4296640 -0.01933770 -0.3534289
## Donor4983_10A_CD8Tcell#6 -4.642584 1.714165 -0.4699323 -0.16873922 -0.2273982

Once we have the output of projectBulkATAC(), we often just want to plot this data to see where in our embedding the simulated “pseudocells” fall. There are many ways to do this but the easiest is to just concatenate the data from projectBulkATAC(), create a color palette, and use ggPoint() to plot.

#concatenate single-cell and pseudocell embedding positions
pro_df <- rbind(bulkPro$singleCellUMAP, bulkPro$simulatedBulkUMAP)

#create a color palette and force the scATAC cells to be grey to enable visualization of the project bulk ATAC data
pal <- paletteDiscrete(values = unique(as.character(pro_df$Type)), set = "stallion")
pal["scATAC"] <- "#BABABA"

#plot using ggPoint
ggPoint(x = pro_df$UMAP1,
        y = pro_df$UMAP2,
        discrete = TRUE,
        color = as.character(pro_df$Type),
        pal = pal,
        xlabel = "UMAP Dimension 1",
        ylabel = "UMAP Dimension 2",
        title = "Heme Bulk ATAC-seq Projection")

In the resulting projection UMAP plot, you can see that the bulk ATAC-seq pseudocells pile up in the regions occupied by the corresponding single-cells.

At this point in the tutorial, we are done modifying our ArchRProject so lets save it one last time to make sure all of our changes are saved on disk. We will just save it on top of our previous projHeme5 directory.

projHeme5 <- saveArchRProject(ArchRProj = projHeme5, outputDirectory = “Save-ProjHeme5”, load = TRUE)