5.1 Creating An ArchRProject
First, we must create our ArchRProject by providing a list of Arrow files and a few other parameters. The outputDirectory here describes where all downstream analyses and plots will be saved. ArchR will automatically associate the previously provided geneAnnotation and genomeAnnotation with the new ArchRProject. These were stored when we ran addArchRGenome("hg19") in a previous chapter. Importantly, multiple ArchRProject objects cannot be combined later, so any samples that you want to analyze must be included as Arrow files at this project creation step. You’ll note that we set the parameter copyArrows = TRUE, which is recommended because downstream operations modify the Arrow files and this preserves an original copy of the Arrow files for future use.
projHeme1 <- ArchRProject(
  ArrowFiles = ArrowFiles,
  outputDirectory = "HemeTutorial",
  copyArrows = TRUE
)
## Using GeneAnnotation set by addArchRGenome(Hg19)!
## Using GeneAnnotation set by addArchRGenome(Hg19)!
## Validating Arrows...
## Getting SampleNames...
##
## Copying ArrowFiles to Ouptut Directory! If you want to save disk space set copyArrows = FALSE
## 1 2 3
## Getting Cell Metadata...
##
## Merging Cell Metadata...
## Initializing ArchRProject...
##
## / |
## / \
## . / |.
## \\\ / |.
## \\\ / `|.
## \\\ / |.
## \ / |\
## \\#####\ / ||
## ==###########> / ||
## \\##==......\ / ||
## ______ = =|__ /__ || \\\
## ,--' ,----`-,__ ___/' --,-`-===================##========>
## \ ' ##_______ _____ ,--,__,=##,__ ///
## , __== ___,-,__,--'#' ===' `-' | ##,-/
## -,____,---' \\####\\________________,--\\_##,/
## ___ .______ ______ __ __ .______
## / \ | _ \ / || | | | | _ \
## / ^ \ | |_) | | ,----'| |__| | | |_) |
## / /_\ \ | / | | | __ | | /
## / _____ \ | |\ \\___ | `----.| | | | | |\ \\___.
## /__/ \__\ | _| `._____| \______||__| |__| | _| `._____|
##
We call this ArchRProject “projHeme1” because it is the first iteration of our hematopoiesis project. Throughout this walkthrough we will modify and update this ArchRProject and keep track of which version of the project we are using by iterating the project number (e.g. “projHeme2”).
We can examine the contents of our ArchRProject:
projHeme1
##
## ___ .______ ______ __ __ .______
## / \ | _ \ / || | | | | _ \
## / ^ \ | |_) | | ,----'| |__| | | |_) |
## / /_\ \ | / | | | __ | | /
## / _____ \ | |\ \\___ | `----.| | | | | |\ \\___.
## /__/ \__\ | _| `._____| \______||__| |__| | _| `._____|
##
## class: ArchRProject
## outputDirectory: /corces/home/rcorces/scripts/github/ArchR_Website_Testing/bookdown/HemeTutorial
## samples(3): scATAC_BMMC_R1 scATAC_CD34_BMMC_R1 scATAC_PBMC_R1
## sampleColData names(1): ArrowFiles
## cellColData names(15): Sample TSSEnrichment ... DoubletEnrichment
## BlacklistRatio
## numberOfCells(1): 10660
## medianTSS(1): 16.815
## medianFrags(1): 3049.5
We can see from the above that our ArchRProject has been initialized with a few important attributes:
- The specified outputDirectory.
- The sampleNames of each sample, which were obtained from the Arrow files.
- A matrix called sampleColData, which contains data associated with each sample.
- A matrix called cellColData, which contains data associated with each cell. Because we already computed doublet enrichment scores using addDoubletScores(), which added those values to each cell in the Arrow files, we can see columns corresponding to the “DoubletEnrichment” and “DoubletScore” in the cellColData matrix.
- The total number of cells in our project, which represents all samples after doublet identification and removal.
- The median TSS enrichment score and the median number of fragments across all cells and all samples.
We can check how much memory the ArchRProject occupies within R:
paste0("Memory Size = ", round(object.size(projHeme1) / 10^6, 3), " MB")
## [1] "Memory Size = 37.477 MB"
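As a side note, base R can also format object sizes directly. The sketch below uses a hypothetical stand-in vector (not projHeme1) so that it runs without ArchR, and compares the manual division by 10^6 used above with format() on the result of object.size(). The standard = "SI" argument is assumed to be available in your R version.

```r
# Hypothetical stand-in object; in the tutorial this would be projHeme1.
x <- numeric(1e5)

# Manual conversion, as in the paste0() call above (bytes / 10^6).
sizeManual <- as.numeric(object.size(x)) / 10^6

# Built-in formatting with SI (1000-based) megabytes.
sizeSI <- format(object.size(x), units = "MB", standard = "SI")

# Legacy formatting uses 1024-based units, so the number can differ slightly.
sizeLegacy <- format(object.size(x), units = "Mb")

print(sizeManual)
print(sizeSI)
print(sizeLegacy)
```

The SI variant matches the 10^6 convention used in this chapter, which is why it is the closer equivalent of the paste0() call.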
We can also ask which data matrices are available within the ArchRProject, which will be useful downstream once we start adding to this project:
getAvailableMatrices(projHeme1)
## [1] "GeneScoreMatrix" "TileMatrix"
Taking a closer look at cellColData, we can see all of the metadata that is stored here. As the name suggests, this is metadata that applies to each individual cell.
head(projHeme1@cellColData)
## DataFrame with 6 rows and 15 columns
## Sample TSSEnrichment ReadsInTSS
## <Rle> <array> <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 scATAC_BMMC_R1 7.204 1146
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 scATAC_BMMC_R1 7.949 831
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 scATAC_BMMC_R1 4.447 384
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 scATAC_BMMC_R1 6.941 659
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 scATAC_BMMC_R1 4.771 412
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 scATAC_BMMC_R1 9.185 1104
## ReadsInPromoter ReadsInBlacklist
## <array> <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 4306 611
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 3542 502
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 1686 311
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 2811 475
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 2108 331
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 4457 341
## PromoterRatio PassQC NucleosomeRatio
## <array> <array> <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 0.0822100882049716 1 3.22675919948354
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 0.0857710189848896 1 1.23149248892251
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 0.044391785150079 1 3.69933184855234
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 0.0768200699606471 1 4.27870744373918
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 0.0603734677511742 1 4.03838383838384
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 0.130253083172599 1 2.58453802639849
## nMultiFrags nMonoFrags nFrags nDiFrags
## <array> <array> <array> <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 3801 6196 26189 16192
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 3448 9253 20648 7947
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 3119 4041 18990 11830
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 3853 3466 18296 10977
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 3564 3465 17458 10429
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 3605 4773 17109 8731
## DoubletScore DoubletEnrichment
## <array> <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 3.5363477558501 1.95
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 100.438073596781 6.05
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 0 1.1
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 51.1024880399534 4.35
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 156.52186409145 7.7
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 145.975125007649 7.4
## BlacklistRatio
## <array>
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 0.0116652029478025
## scATAC_BMMC_R1#AAGATAGTCACCGCGA-1 0.0121561410306083
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 0.00818852027382833
## scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1 0.0129809794490599
## scATAC_BMMC_R1#TCCATCGGTCCCGTGA-1 0.00947989460419292
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1 0.00996551522590449
You’ll notice that many of these columns correspond to per-cell quality control information that was calculated during Arrow file creation. These columns are:
- TSSEnrichment - the per-cell TSS enrichment score
- ReadsInTSS - the number of reads that fall within TSS regions (default is 100 bp around the TSS)
- ReadsInPromoter - the number of reads that fall in promoter regions (default is -2000 to +100 from the TSS)
- PromoterRatio - the ratio of reads in promoters to total reads (in the output above, ReadsInPromoter / (2 * nFrags))
- ReadsInBlacklist - the number of reads that fall in blacklist regions
- BlacklistRatio - the ratio of reads in blacklist regions to total reads (in the output above, ReadsInBlacklist / (2 * nFrags))
- NucleosomeRatio - the ratio of fragments in the nDiFrags and nMultiFrags classes to fragments in the nMonoFrags class, defined as (nDiFrags + nMultiFrags) / nMonoFrags
- nFrags - the number of fragments recovered per cell
- nMonoFrags - the number of fragments that have length less than 2 * nucLength, where nucLength is the length of DNA that wraps around a nucleosome (147 bp by default)
- nDiFrags - the number of fragments that have length less than 3 * nucLength but greater than 2 * nucLength
- nMultiFrags - the number of fragments that have length >= 3 * nucLength
- PassQC - equal to 1 if the cell passed QC filters or 0 if not
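To make the ratio columns concrete, the following minimal sketch recomputes them in base R from the first three rows of the head() output above. The data frame qc and the variable names are illustrative only and not part of ArchR. The denominator 2 * nFrags for PromoterRatio and BlacklistRatio is an assumption inferred from the printed values, which it reproduces. It is not taken from the ArchR source.

```r
# Values copied from the first three rows of head(projHeme1@cellColData) above.
qc <- data.frame(
  ReadsInPromoter  = c(4306, 3542, 1686),
  ReadsInBlacklist = c(611, 502, 311),
  nFrags           = c(26189, 20648, 18990),
  nMonoFrags       = c(6196, 9253, 4041),
  nDiFrags         = c(16192, 7947, 11830),
  nMultiFrags      = c(3801, 3448, 3119)
)

# Assumption inferred from the printed numbers: each fragment counts as two reads.
totalReads <- 2 * qc$nFrags

promoterRatio   <- qc$ReadsInPromoter / totalReads
blacklistRatio  <- qc$ReadsInBlacklist / totalReads
nucleosomeRatio <- (qc$nDiFrags + qc$nMultiFrags) / qc$nMonoFrags

# The three fragment-size classes should add up to nFrags.
fragSum <- qc$nMonoFrags + qc$nDiFrags + qc$nMultiFrags

print(data.frame(promoterRatio, blacklistRatio, nucleosomeRatio))
print(fragSum == qc$nFrags)
```

Comparing the printed ratios with the PromoterRatio, BlacklistRatio, and NucleosomeRatio columns in the table is a quick sanity check that the column definitions are being read correctly.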
Throughout the tutorial, other data will get added to cellColData, for example the number of ReadsInPeaks or FRiP when we add a peak set to the project.