3.8 Using BAM files for Arrow File creation

Though fragment files are strongly preferred because of their highly standardized format, BAM files can also be used as input for Arrow File creation in ArchR. The key to getting this to work is understanding your BAM file structure, in particular which field of the BAM file contains the cell-specific barcode information. In this example, we will download some data from 10x Genomics to illustrate how to use BAM files as input to createArrowFiles().

#set timeout to prevent interrupting large file download
options(timeout=1000)
dir.create(path = "./PBMC_BAM")
down_df <- data.frame(
    fileUrl = c("https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_possorted_bam.bam",
    "https://cf.10xgenomics.com/samples/cell-atac/2.0.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_possorted_bam.bam.bai"),
    md5sum = c("8140a2218ecdfd276aca5c4bb999c989","d3c5f5a00fec76378f2a947749ff2cf5")
)

#we will use a hidden ArchR function to do the download. This automatically checks the md5sum.
ArchR:::.downloadFiles(filesUrl = down_df, pathDownload = "./PBMC_BAM", threads = 1)
## Downloading files to ./PBMC_BAM...
## Downloading file atac_pbmc_500_nextgem_possorted_bam.bam...
## Downloading file atac_pbmc_500_nextgem_possorted_bam.bam.bai...
## [[1]]
## [1] 0
## 
## [[2]]
## [1] 0

If we were to use samtools view from the command line to look at this BAM file, the first line would look like this:

A00519:269:H7FM2DRXX:2:1201:1985:18834  83      chr1    9997    0       50M     =       10022   -25     CAGATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC  ,,,:FF,,F,FF,FFFFFFF:FFFFFFFFFFFFFFF:FF:FFF,:FFFFF      NM:i:1  MD:Z:1C48       AS:i:48 XS:i:47 CR:Z:GCGGGTTAGAACGTCG       CY:Z:FFFFFFFF:FFFFFFF   CB:Z:GCGGGTTAGAACGTCG-1 BC:Z:ATCGTACT   QT:Z:FFFFFFFF   RG:Z:atac_pbmc_500_nextgem:MissingLibrary:1:H7FM2DRXX:2

Here, the CB tag stores the cell-specific barcode, in this case CB:Z:GCGGGTTAGAACGTCG-1. This means that we will pass bcTag = "CB" to createArrowFiles() to tell it to look for the CB tag. If your BAM files come from a scATAC-seq technology other than 10x Genomics, the relevant tag will likely be different, and the format of the cell-specific barcode may differ as well. In these cases, it may be necessary to use the gsubExpression parameter to clean up the cell-specific barcode string. The other important input parameter is bamFlag, which determines which reads or fragments are considered valid. The value passed to bamFlag should follow the format of the arguments to scanBamFlag(), as used with scanBam() in Rsamtools.
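
To make the roles of the barcode tag, the barcode clean-up, and the BAM flag more concrete, here is a minimal base R sketch that works on the single record shown above. It is an illustration only and is not how ArchR reads BAM files internally. Two assumptions are made: that a clean-up pattern is applied in the manner of gsub(pattern, "", barcode), and that "-1$" is a pattern one might want (it is chosen purely for illustration; the 10x barcodes in this example do not need cleaning). The bit values 2, 16 and 1024 are the SAM flag bits for "properly paired", "reverse strand" and "duplicate".

```r
# The record shown above, rebuilt field by field (SAM fields are tab-separated).
sam_fields <- c(
  "A00519:269:H7FM2DRXX:2:1201:1985:18834", "83", "chr1", "9997", "0", "50M",
  "=", "10022", "-25",
  "CAGATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC",
  ",,,:FF,,F,FF,FFFFFFF:FFFFFFFFFFFFFFF:FF:FFF,:FFFFF",
  "NM:i:1", "MD:Z:1C48", "AS:i:48", "XS:i:47",
  "CR:Z:GCGGGTTAGAACGTCG", "CY:Z:FFFFFFFF:FFFFFFF",
  "CB:Z:GCGGGTTAGAACGTCG-1", "BC:Z:ATCGTACT", "QT:Z:FFFFFFFF",
  "RG:Z:atac_pbmc_500_nextgem:MissingLibrary:1:H7FM2DRXX:2"
)
sam_line <- paste(sam_fields, collapse = "\t")

# The first 11 fields are mandatory; the rest are optional TAG:TYPE:VALUE entries.
fields   <- strsplit(sam_line, "\t", fixed = TRUE)[[1]]
optional <- fields[-seq_len(11)]

# Tag name = characters 1-2, value = everything from character 6 onwards
# (values such as the RG string may themselves contain colons).
tags <- setNames(substring(optional, 6), substr(optional, 1, 2))
print(tags[c("CR", "CB")])

# Illustrative barcode clean-up (assumed pattern, assumed gsub-style usage).
clean_barcode <- gsub("-1$", "", tags[["CB"]])
print(clean_barcode)

# Decode the SAM flag (second field) into the three properties used in bamFlag.
flag            <- as.integer(fields[2])
is_proper_pair  <- bitwAnd(flag, 2L) != 0
is_minus_strand <- bitwAnd(flag, 16L) != 0
is_duplicate    <- bitwAnd(flag, 1024L) != 0

# Would this read match isMinusStrand = FALSE, isProperPair = TRUE, isDuplicate = FALSE?
passes_bam_flag <- is_proper_pair && !is_minus_strand && !is_duplicate
print(passes_bam_flag)
```

The flag of this particular read (83) has the reverse-strand bit set, so the read itself would not match a filter requiring isMinusStrand = FALSE. Inspecting a handful of records from your own BAM file in this way is a quick check that the tag you plan to pass to bcTag really holds the barcode.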

In addition to understanding the structure of your BAM file, you may need to pre-process your BAM file by:

  1. Removing any fragments less than 20 bp in length
  2. Marking/removing PCR duplicates
  3. Removing barcode reads that are NA
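
The logic of these three filters can be sketched on a small table. The data.frame below is entirely hypothetical (made-up barcodes, coordinates, and duplicate marks) and only illustrates which records each rule would drop. For a real BAM file, this kind of filtering would normally be done with dedicated BAM tools before calling createArrowFiles().

```r
# Hypothetical toy fragments; none of these values come from the example data.
frags <- data.frame(
  barcode      = c("AAAC-1", "AAAC-1", NA, "TTTG-1", "TTTG-1"),
  start        = c(100L, 100L, 250L, 400L, 900L),
  end          = c(180L, 180L, 330L, 410L, 1010L),
  is_duplicate = c(FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Fragment length in bp (simple end - start, adequate for this illustration).
frags$width <- frags$end - frags$start

keep <- frags$width >= 20 &      # 1. drop fragments shorter than 20 bp
  !frags$is_duplicate &          # 2. drop records marked as PCR duplicates
  !is.na(frags$barcode)          # 3. drop records without a barcode

filtered <- frags[keep, ]
print(filtered)
```

In this toy table, the second row is dropped as a duplicate, the third for its missing barcode, and the fourth for being only 10 bp long, leaving two fragments.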

In the case of our example data that we just downloaded, we can proceed to Arrow file creation.

addArchRGenome("hg38")
## Setting default genome to Hg38.

ArrowBam <- createArrowFiles(
  inputFiles = "./PBMC_BAM/atac_pbmc_500_nextgem_possorted_bam.bam",
  sampleNames = "bam10x",
  minTSS = 4, #Don't set this too high because you can always increase later
  minFrags = 1000, 
  addTileMat = FALSE,
  addGeneScoreMat = FALSE,
  bcTag = "CB",
  bamFlag = list(isMinusStrand = FALSE, isProperPair = TRUE, isDuplicate = FALSE)
)
## Using GeneAnnotation set by addArchRGenome(Hg38)!
## Using GeneAnnotation set by addArchRGenome(Hg38)!
## ArchR logging to : ArchRLogs/ArchR-createArrows-3c35f9e82ed-Date-2024-12-03_Time-22-53-56.790142.log
## If there is an issue, please report to github with logFile!
## Cleaning Temporary Files
## subThreading Disabled since ArchRLocking is TRUE see `addArchRLocking`
## 2024-12-03 22:53:57.365419 : Batch Execution w/ safelapply!, 0 mins elapsed.
## (bam10x : 1 of 1) Determining Arrow Method to use!
## 2024-12-03 22:53:57.425019 : (bam10x : 1 of 1) Reading In Fragments from inputFiles (readMethod = bam), 0.001 mins elapsed.
## 2024-12-03 22:53:57.43465 : (bam10x : 1 of 1) Tabix Bam To Temporary File, 0.001 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:54:24.046554 : (bam10x : 1 of 1) Reading BamFile 8 Percent, 0.445 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:54:45.353068 : (bam10x : 1 of 1) Reading BamFile 17 Percent, 0.8 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:55:04.430925 : (bam10x : 1 of 1) Reading BamFile 25 Percent, 1.118 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:55:22.53187 : (bam10x : 1 of 1) Reading BamFile 33 Percent, 1.419 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:55:40.110187 : (bam10x : 1 of 1) Reading BamFile 42 Percent, 1.712 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:55:58.812456 : (bam10x : 1 of 1) Reading BamFile 50 Percent, 2.024 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:56:14.042159 : (bam10x : 1 of 1) Reading BamFile 58 Percent, 2.278 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:56:30.364622 : (bam10x : 1 of 1) Reading BamFile 67 Percent, 2.55 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:56:48.562311 : (bam10x : 1 of 1) Reading BamFile 75 Percent, 2.853 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:57:08.5803 : (bam10x : 1 of 1) Reading BamFile 83 Percent, 3.187 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:57:20.790855 : (bam10x : 1 of 1) Reading BamFile 92 Percent, 3.39 mins elapsed.
## Warning in sprintf("%s Reading BamFile %s Percent", prefix, round(100 * : one
## argument not used by format '%s Reading BamFile %s Percent'
## 2024-12-03 22:57:32.016427 : (bam10x : 1 of 1) Reading BamFile 100 Percent, 3.578 mins elapsed.
## 2024-12-03 22:57:32.942728 : (bam10x : 1 of 1) Successful creation of Temporary File, 3.593 mins elapsed.
## 2024-12-03 22:57:32.949099 : (bam10x : 1 of 1) Creating ArrowFile From Temporary File, 3.593 mins elapsed.
## 2024-12-03 22:58:36.770678 : (bam10x : 1 of 1) Successful creation of Arrow File, 4.657 mins elapsed.
## 2024-12-03 22:59:41.547574 : (bam10x : 1 of 1) CellStats : Number of Cells Pass Filter = 568 , 5.736 mins elapsed.
## 2024-12-03 22:59:41.552092 : (bam10x : 1 of 1) CellStats : Median Frags = 16887 , 5.736 mins elapsed.
## 2024-12-03 22:59:41.557904 : (bam10x : 1 of 1) CellStats : Median TSS Enrichment = 19.811 , 5.737 mins elapsed.
## 2024-12-03 22:59:42.741133 : (bam10x : 1 of 1) Adding Additional Feature Counts!, 5.756 mins elapsed.
## 2024-12-03 23:00:41.631099 : (bam10x : 1 of 1) Removing Fragments from Filtered Cells, 6.738 mins elapsed.
## 2024-12-03 23:00:41.638938 : (bam10x : 1 of 1) Creating Filtered Arrow File, 6.738 mins elapsed.
## 2024-12-03 23:01:05.535157 : (bam10x : 1 of 1) Finished Constructing Filtered Arrow File!, 7.136 mins elapsed.
## 2024-12-03 23:01:05.66431 : (bam10x : 1 of 1) Finished Creating Arrow File, 7.138 mins elapsed.
## ArchR logging successful to : ArchRLogs/ArchR-createArrows-3c35f9e82ed-Date-2024-12-03_Time-22-53-56.790142.log
ArrowBam
## [1] "bam10x.arrow"