5.1 Creating An ArchRProject

First, we must create our ArchRProject by providing a list of Arrow files and a few other parameters. The outputDirectory parameter specifies where all downstream analyses and plots will be saved. ArchR will automatically associate the geneAnnotation and genomeAnnotation with the new ArchRProject; these were stored when we ran addArchRGenome("hg19") in a previous chapter. Importantly, multiple ArchRProject objects cannot be combined later, so any samples that you want to analyze must be included as Arrow files at this project creation step. We also set copyArrows = TRUE, which is recommended because downstream operations modify the Arrow files and this preserves an original copy of the Arrow files for future use.

projHeme1 <- ArchRProject(
  ArrowFiles = ArrowFiles, 
  outputDirectory = "HemeTutorial",
  copyArrows = TRUE
)
## Using GeneAnnotation set by addArchRGenome(Hg19)!
## Using GeneAnnotation set by addArchRGenome(Hg19)!
## Validating Arrows...
## Getting SampleNames...
## 
## Copying ArrowFiles to Ouptut Directory! If you want to save disk space set copyArrows = FALSE
## 1 2 3 
## Getting Cell Metadata...
## 
## Merging Cell Metadata...
## Initializing ArchRProject...
## 
##                                                    / |
##                                                  /    \
##             .                                  /      |.
##             \\\                              /        |.
##               \\\                          /           `|.
##                 \\\                      /              |.
##                   \                    /                |\
##                   \\#####\           /                  ||
##                 ==###########>      /                   ||
##                  \\##==......\    /                     ||
##             ______ =       =|__ /__                     ||      \\\
##         ,--' ,----`-,__ ___/'  --,-`-===================##========>
##        \               '        ##_______ _____ ,--,__,=##,__   ///
##         ,    __==    ___,-,__,--'#'  ==='      `-'    | ##,-/
##         -,____,---'       \\####\\________________,--\\_##,/
##            ___      .______        ______  __    __  .______      
##           /   \     |   _  \      /      ||  |  |  | |   _  \     
##          /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |    
##         /  /_\  \   |      /     |  |     |   __   | |      /     
##        /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
##       /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|
##

We call this ArchRProject “projHeme1” because it is the first iteration of our hematopoiesis project. Throughout this walkthrough we will modify and update this ArchRProject, and we will keep track of which version of the project we are using by incrementing the project number (e.g. “projHeme2”).

We can examine the contents of our ArchRProject:

projHeme1
## 
##            ___      .______        ______  __    __  .______      
##           /   \     |   _  \      /      ||  |  |  | |   _  \     
##          /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |    
##         /  /_\  \   |      /     |  |     |   __   | |      /     
##        /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
##       /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|
## 
## class: ArchRProject 
## outputDirectory: /workspace/ArchR/ArchR_Website_Testing/bookdown/HemeTutorial 
## samples(3): scATAC_BMMC_R1 scATAC_CD34_BMMC_R1 scATAC_PBMC_R1
## sampleColData names(1): ArrowFiles
## cellColData names(15): Sample TSSEnrichment ... DoubletEnrichment
##   BlacklistRatio
## numberOfCells(1): 10660
## medianTSS(1): 16.815
## medianFrags(1): 3049.5

We can see from the above that our ArchRProject has been initialized with a few important attributes:

The specified outputDirectory.
The sampleNames of each sample which were obtained from the Arrow files.
A DataFrame called sampleColData which contains data associated with each sample.
A DataFrame called cellColData which contains data associated with each cell. Because we already computed doublet enrichment scores using addDoubletScores(), which added those values to each cell in the Arrow files, we can see columns corresponding to the “DoubletEnrichment” and “DoubletScore” in cellColData.
The total number of cells in our project, summed across all samples. Doublets have been scored at this point but not yet removed.
The median TSS enrichment score and the median number of fragments across all cells and all samples.
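
As a small illustration of how the two summary medians behave, the sketch below uses base R and only the six cells printed by head() later in this section, so the resulting values are not the project-wide medians reported above. It also shows why medianFrags can end in .5: with an even number of cells (10660 here), median() averages the two middle values. The vector names are illustrative and are not ArchR objects.

```r
# TSSEnrichment and nFrags for the six cells shown by head(projHeme1@cellColData).
# These are only the first six rows, not the whole project.
tss_enrichment <- c(7.204, 7.949, 4.447, 6.941, 4.771, 9.185)
n_frags <- c(26189, 20648, 18990, 18296, 17458, 17109)

# With an even number of values, median() returns the mean of the two middle values.
median_tss <- median(tss_enrichment)
median_frags <- median(n_frags)

print(median_tss)
print(median_frags)
```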

We can check how much memory the ArchRProject occupies within R:

paste0("Memory Size = ", round(object.size(projHeme1) / 10^6, 3), " MB")
## [1] "Memory Size = 37.477 MB"
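
The same calculation can be wrapped in a small helper for reuse as the project grows. This is a sketch in base R; mem_mb is a hypothetical helper name, not an ArchR function, and it is demonstrated here on a plain numeric vector rather than on projHeme1.

```r
# Hypothetical helper (not part of ArchR): object size in megabytes (10^6 bytes),
# mirroring the object.size() / 10^6 calculation used above.
mem_mb <- function(x, digits = 3) {
  round(as.numeric(object.size(x)) / 10^6, digits)
}

# Demonstration on a plain vector of one million doubles (roughly 8 MB).
toy <- numeric(1e6)
print(paste0("Memory Size = ", mem_mb(toy), " MB"))
```

In a session where projHeme1 exists, mem_mb(projHeme1) should give the same number as the one-liner above.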

We can also ask which data matrices are available within the ArchRProject, which will be useful downstream once we start adding to this project:

getAvailableMatrices(projHeme1)
## [1] "GeneScoreMatrix" "TileMatrix"

Taking a closer look at cellColData, we can see all of the metadata that is stored here. As the name suggests, this is metadata that applies to each individual cell.

head(projHeme1@cellColData)
## DataFrame with 6 rows and 15 columns
##                                           Sample TSSEnrichment ReadsInTSS
##                                            <Rle>       <array>    <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 scATAC_BMMC_R1         7.204       1146
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 scATAC_BMMC_R1         7.949        831
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 scATAC_BMMC_R1         4.447        384
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 scATAC_BMMC_R1         6.941        659
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 scATAC_BMMC_R1         4.771        412
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 scATAC_BMMC_R1         9.185       1104
##                                   ReadsInPromoter ReadsInBlacklist
##                                           <array>          <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1            4306              611
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1            3542              502
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1            1686              311
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1            2811              475
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1            2108              331
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1            4457              341
##                                        PromoterRatio  PassQC  NucleosomeRatio
##                                              <array> <array>          <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 0.0822100882049716       1 3.22675919948354
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 0.0857710189848896       1 1.23149248892251
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1  0.044391785150079       1 3.69933184855234
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 0.0768200699606471       1 4.27870744373918
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 0.0603734677511742       1 4.03838383838384
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1  0.130253083172599       1 2.58453802639849
##                                   nMultiFrags nMonoFrags  nFrags nDiFrags
##                                       <array>    <array> <array>  <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1        3801       6196   26189    16192
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1        3448       9253   20648     7947
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1        3119       4041   18990    11830
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1        3853       3466   18296    10977
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1        3564       3465   17458    10429
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1        3605       4773   17109     8731
##                                       DoubletScore DoubletEnrichment
##                                            <array>           <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1                0              1.55
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 91.1760813271055              5.75
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1                0               1.3
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1  70.630505425003              5.05
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 163.664669675752               7.9
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 120.537764456854              6.65
##                                        BlacklistRatio
##                                               <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1  0.0116652029478025
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1  0.0121561410306083
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 0.00818852027382833
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1  0.0129809794490599
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 0.00947989460419292
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 0.00996551522590449

You’ll notice that many of these columns correspond to per-cell quality control information that was calculated during Arrow file creation. These columns are:

TSSEnrichment - the per-cell TSS enrichment score
ReadsInTSS - the number of reads that fall within TSS regions (default is 100 bp around TSS)
ReadsInPromoter - the number of reads that fall in promoter regions (default is -2000 to +100 from the TSS)
PromoterRatio - the fraction of all reads that fall in promoter regions (the values shown above equal ReadsInPromoter / (2 * nFrags))
ReadsInBlacklist - the number of reads that fall in blacklist regions
BlacklistRatio - the fraction of all reads that fall in blacklist regions (the values shown above equal ReadsInBlacklist / (2 * nFrags))
NucleosomeRatio - the ratio of fragments of at least two nucleosome lengths to shorter fragments, defined as (nDiFrags + nMultiFrags) / nMonoFrags
nFrags - the number of fragments recovered per cell
nMonoFrags - the number of fragments that have length less than 2 * nucLength, where nucLength is the length of DNA that wraps around a nucleosome (147 bp by default)
nDiFrags - the number of fragments that have length of at least 2 * nucLength but less than 3 * nucLength
nMultiFrags - the number of fragments that have length >= 3 * nucLength
PassQC - equal to 1 if the cell passed QC filters or 0 if not
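
The relationships between these columns can be checked against the values printed above. The sketch below uses base R with numbers copied from the first three rows of the head() output; the data frame and variable names are illustrative. The NucleosomeRatio formula comes from the list above, while the 2 * nFrags denominator for PromoterRatio and BlacklistRatio is an assumption inferred from the printed values, not taken from the ArchR documentation.

```r
# Values copied from the first three rows of head(projHeme1@cellColData) above.
qc <- data.frame(
  ReadsInPromoter  = c(4306, 3542, 1686),
  ReadsInBlacklist = c(611, 502, 311),
  nFrags           = c(26189, 20648, 18990),
  nMonoFrags       = c(6196, 9253, 4041),
  nDiFrags         = c(16192, 7947, 11830),
  nMultiFrags      = c(3801, 3448, 3119)
)

# Ratios as printed in the same three rows, for comparison.
reported <- data.frame(
  NucleosomeRatio = c(3.22675919948354, 1.23149248892251, 3.69933184855234),
  PromoterRatio   = c(0.0822100882049716, 0.0857710189848896, 0.044391785150079),
  BlacklistRatio  = c(0.0116652029478025, 0.0121561410306083, 0.00818852027382833)
)

# The three fragment classes should add up to nFrags.
frag_sum_ok <- all(qc$nMonoFrags + qc$nDiFrags + qc$nMultiFrags == qc$nFrags)

# NucleosomeRatio as defined in the list above.
nucleosome_ratio <- (qc$nDiFrags + qc$nMultiFrags) / qc$nMonoFrags

# Assumption inferred from the printed values: the denominator is 2 * nFrags.
promoter_ratio  <- qc$ReadsInPromoter  / (2 * qc$nFrags)
blacklist_ratio <- qc$ReadsInBlacklist / (2 * qc$nFrags)

# Compare the recomputed ratios with the reported ones.
print(frag_sum_ok)
print(all.equal(nucleosome_ratio, reported$NucleosomeRatio))
print(all.equal(promoter_ratio, reported$PromoterRatio))
print(all.equal(blacklist_ratio, reported$BlacklistRatio))
```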

Throughout the tutorial, other data will be added to cellColData, for example ReadsInPeaks and FRiP (the fraction of reads in peaks) when we add a peak set to the project.