3.8 Using BAM files for Arrow File creation

Though fragment files are strongly preferred because of their very standardized format, BAM files can also be used as input for Arrow File creation in ArchR. The key to getting this to work is understanding your BAM file structure, in particular which field of the BAM file contains the cell-specific barcode information. In this example, we will download some data from 10x Genomics to illustrate how to use BAM files as input to createArrowFiles(). Using samtools view

#set timeout to prevent interrupting large file download
options(timeout=1000)
dir.create(path = "./PBMC_BAM")
## Warning in dir.create(path = "./PBMC_BAM"): './PBMC_BAM' already exists
down_df <- data.frame(
    fileUrl = c("https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_possorted_bam.bam",
    "https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_possorted_bam.bam.bai"),
    md5sum = c("8140a2218ecdfd276aca5c4bb999c989","d3c5f5a00fec76378f2a947749ff2cf5")
)

#we will use a hidden ArchR function to do the download. This automatically checks the md5sum.
ArchR:::.downloadFiles(filesUrl = down_df, pathDownload = "./PBMC_BAM", threads = 1)
## Downloading files to ./PBMC_BAM...
## File exists! Skipping file atac_pbmc_500_nextgem_possorted_bam.bam...
## File exists! Skipping file atac_pbmc_500_nextgem_possorted_bam.bam.bai...
## [[1]]
## NULL
## 
## [[2]]
## NULL

If we were to use samtools view from the command line to look at this BAM file, the first line would look like this:

A00519:269:H7FM2DRXX:2:1201:1985:18834  83      chr1    9997    0       50M     =       10022   -25     CAGATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC  ,,,:FF,,F,FF,FFFFFFF:FFFFFFFFFFFFFFF:FF:FFF,:FFFFF      NM:i:1  MD:Z:1C48       AS:i:48 XS:i:47 CR:Z:GCGGGTTAGAACGTCG       CY:Z:FFFFFFFF:FFFFFFF   CB:Z:GCGGGTTAGAACGTCG-1 BC:Z:ATCGTACT   QT:Z:FFFFFFFF   RG:Z:atac_pbmc_500_nextgem:MissingLibrary:1:H7FM2DRXX:2

Here, the CB tag is used to store the cell-specific barcode, in this case CB:Z:GCGGGTTAGAACGTCG-1. This means that we will pass bcTag = "CB" to createArrowFiles() to tell it to look for the CB tag. If, for example, your BAM files come from a scATAC-seq technology other than 10x Genomics, the relevant tag will likely be different. On top of that, the format of the cell-specific barcode may also be different. In these cases, it may be necessary to use the gsubExpression parameter to clean up the cell-specific barcode string. The other important input parameter is bamFlag which will determine which reads or fragments are viewed as valid. The value passed to bamFlag should be in the format of a scanBamFlag used by ScanBam in Rsamtools.

In addition to understanding the structure of you BAM file, you may need to pre-process your BAM file by:

  1. Removing any fragments less than 20 bp in length
  2. Marking/removing PCR duplicates
  3. Removing barcode reads that are NA

In the case of our example data that we just downloaded, we can proceed to Arrow file creation.

addArchRGenome("hg38")
## Setting default genome to Hg38.

ArrowBam <- createArrowFiles(
  inputFiles = "./PBMC_BAM/atac_pbmc_500_nextgem_possorted_bam.bam",
  sampleNames = "bam10x",
  minTSS = 4, #Dont set this too high because you can always increase later
  minFrags = 1000, 
  addTileMat = FALSE,
  addGeneScoreMat = FALSE,
  bcTag = "CB",
  bamFlag = list(isMinusStrand = FALSE, isProperPair = TRUE, isDuplicate = FALSE)
)
## Using GeneAnnotation set by addArchRGenome(Hg38)!
## Using GeneAnnotation set by addArchRGenome(Hg38)!
## ArchR logging to : ArchRLogs/ArchR-createArrows-3d01d231080e4b-Date-2022-12-20_Time-13-32-52.log
## If there is an issue, please report to github with logFile!
## Cleaning Temporary Files
## subThreading Disabled since ArchRLocking is TRUE see `addArchRLocking`
## 2022-12-20 13:32:52 : Batch Execution w/ safelapply!, 0 mins elapsed.
## (bam10x : 1 of 1) Determining Arrow Method to use!
## 2022-12-20 13:32:52 : (bam10x : 1 of 1) Reading In Fragments from inputFiles (readMethod = bam), 0.001 mins elapsed.
## 2022-12-20 13:32:52 : (bam10x : 1 of 1) Tabix Bam To Temporary File, 0.001 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:33:09 : (bam10x : 1 of 1) Reading BamFile 8 Percent, 0.291 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:33:24 : (bam10x : 1 of 1) Reading BamFile 17 Percent, 0.529 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:33:37 : (bam10x : 1 of 1) Reading BamFile 25 Percent, 0.759 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:33:51 : (bam10x : 1 of 1) Reading BamFile 33 Percent, 0.978 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:34:03 : (bam10x : 1 of 1) Reading BamFile 42 Percent, 1.194 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:34:17 : (bam10x : 1 of 1) Reading BamFile 50 Percent, 1.422 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:34:28 : (bam10x : 1 of 1) Reading BamFile 58 Percent, 1.608 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:34:41 : (bam10x : 1 of 1) Reading BamFile 67 Percent, 1.812 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:34:53 : (bam10x : 1 of 1) Reading BamFile 75 Percent, 2.021 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:35:06 : (bam10x : 1 of 1) Reading BamFile 83 Percent, 2.238 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:35:15 : (bam10x : 1 of 1) Reading BamFile 92 Percent, 2.394 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2022-12-20 13:35:24 : (bam10x : 1 of 1) Reading BamFile 100 Percent, 2.539 mins elapsed.
## 2022-12-20 13:35:25 : (bam10x : 1 of 1) Successful creation of Temporary File, 2.552 mins elapsed.
## 2022-12-20 13:35:25 : (bam10x : 1 of 1) Creating ArrowFile From Temporary File, 2.552 mins elapsed.
## 2022-12-20 13:36:02 : (bam10x : 1 of 1) Successful creation of Arrow File, 3.168 mins elapsed.
## 2022-12-20 13:36:59 : (bam10x : 1 of 1) CellStats : Number of Cells Pass Filter = 568 , 4.119 mins elapsed.
## 2022-12-20 13:36:59 : (bam10x : 1 of 1) CellStats : Median Frags = 16887 , 4.119 mins elapsed.
## 2022-12-20 13:36:59 : (bam10x : 1 of 1) CellStats : Median TSS Enrichment = 19.811 , 4.119 mins elapsed.
## 2022-12-20 13:37:00 : (bam10x : 1 of 1) Adding Additional Feature Counts!, 4.143 mins elapsed.
## 2022-12-20 13:37:52 : (bam10x : 1 of 1) Removing Fragments from Filtered Cells, 5.002 mins elapsed.
## 2022-12-20 13:37:52 : (bam10x : 1 of 1) Creating Filtered Arrow File, 5.002 mins elapsed.
## 2022-12-20 13:38:11 : (bam10x : 1 of 1) Finished Constructing Filtered Arrow File!, 5.322 mins elapsed.
## 2022-12-20 13:38:11 : (bam10x : 1 of 1) Finished Creating Arrow File, 5.323 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-createArrows-3d01d231080e4b-Date-2022-12-20_Time-13-32-52.log
ArrowBam
## [1] "bam10x.arrow"