TreeCorTreat 0.9.0
In this tutorial, we are going to introduce the input data structure of TreeCorTreat and demonstrate some basic R functions to import gene expression matrix, cell-level and sample-level metadata from csv files. If you are familiar with single-cell RNA-seq data and R programming, you can skip this tutorial and check the step-by-step user manual here.
Here are some general suggestions for data preparation from scratch:
Collect or download high throughput sequencing data
Depending on which protocol you use, preprocess the data accordingly. For example, Cell Ranger is a standard pipeline that process 10x Chromium single-cell data to align reads, generate feature-barcode matrices, etc.
Prepare a sample-level metadata spreadsheet with sample
(e.g. a unique sample ID), phenotype of interest and other sample-level covariates.
Load in processed data into R. For example, Seurat provide a variety of functions to load in data from 10x or remote/local mtx files.
Organize the cell-level metadata and gene expression matrix. Gene expression data can be stored in a (sparse) matrix. Cell-level metadata is a data frame and it must contain barcode
and sample
two columns (or celltype
if you have already run data integration).
Please note that more advanced topics, such as details of preprocessing steps, will not be covered in this tutorial.
TreeCorTreat takes a gene expression matrix (raw count), cell-level metadata and sample-level metadata as input.
Gene expression is a matrix, where each row is a gene and each column is a cell. Each row names refers to a specific gene and each column encodes a cell’s unique barcode.
For demonstration purpose, we generate a small toy example, which can be accessed here. This dataset contains contains 100 genes and 100 cells of one sample.
For gene expression matrices stored in csv or txt format, users can read in the dataset using fread
function from data.table
R package.
library(data.table)
input_data <- fread('Gene_Expression.csv')
input_data[1:3,1:3]
## V1 AAACCTGAGATGTAAC.1.1 AAACCTGCATGGGACA.1.1
## 1: ABHD2 1 0
## 2: AC005921.2 0 0
## 3: AC009163.4 0 0
# class of this object
class(input_data)
## [1] "data.table" "data.frame"
The first column encodes gene names and all the remaining columns (e.g. from 2nd to the last column) annotate cell barcodes. The input_data
is a data.frame
object. Next, we are going to convert input_data
into a matrix
object:
# remove 1st column (gene names) from input_data
input_mat <- as.matrix(input_data[,-1])
# specify gene names as rownames
rownames(input_mat) <- input_data$V1
dim(input_mat)
## [1] 100 100
input_mat[1:3,1:3]
## AAACCTGAGATGTAAC.1.1 AAACCTGCATGGGACA.1.1 AAACCTGGTAGAGTGC.1.1
## ABHD2 1 0 0
## AC005921.2 0 0 0
## AC009163.4 0 0 0
class(input_mat)
## [1] "matrix" "array"
From the above example, we are able to load in gene expression dataset from one sample. What if we have multiple samples? We can repeat the aforementioned process across all samples. Here provides another example of four gene expression datasets (saved as csv format), corresponding to four different samples. Each file is marked by samples’ unique ID: Sample1, Sample2, Sample3 and Sample4.
# multisample directory
current_work_dir <- './multisample/'
filenames <- list.files(current_work_dir)
filenames
## [1] "Sample1.csv" "Sample2.csv" "Sample3.csv" "Sample4.csv"
We can then use a for
loop to read in these four files and save the files as a list:
# initialize a list
expr.ls <- list()
# for loop
for(i in 1:length(filenames)){
input_data <- fread(paste0(current_work_dir,filenames[i])) # read in i th file
input_mat <- as.matrix(input_data[,-1])
# specify gene names as rownames
rownames(input_mat) <- input_data$V1
# revise column names as 'sampleID:barcode' format
sampleID <- gsub('\\.csv','',filenames[i])
colnames(input_mat) <- paste0(sampleID,':',colnames(input_mat))
# store in i th element in the list
expr.ls[[i]] <- input_mat
}
names(expr.ls) <- gsub('\\.csv','',filenames)
names(expr.ls)
## [1] "Sample1" "Sample2" "Sample3" "Sample4"
Let’s do some sanity checks before concatenate different samples: (1) the rownames (or order of the genes to be included) should be the same across all samples
sapply(expr.ls,function(x) identical(rownames(x),rownames(expr.ls[[1]])))
## Sample1 Sample2 Sample3 Sample4
## TRUE TRUE TRUE TRUE
and (2) there is no duplicated cell barcode across all samples
all_column_names <- do.call(c,lapply(expr.ls,function(x) colnames(x)))
length(all_column_names)
## [1] 400
sum(duplicated(all_column_names))
## [1] 0
Then, we can proceed to combine gene expression matrices across different samples into one large matrix, where row are genes and columns are cells:
expr_mat <- do.call(cbind,expr.ls)
dim(expr_mat)
## [1] 100 400
The second component of our input is cell-level metadata, which contains cell barcode and its corresponding sample ID (i.e. which cell is coming from). Using the multi-sample example again, we can extract the cell-level data by implementing the following code:
cell_meta <- data.frame(barcode = colnames(expr_mat),
sample = sub('\\:.*','',colnames(expr_mat)))
head(cell_meta)
## barcode sample
## 1 Sample1:ACGAGCCGTCTTCAAG.1.1 Sample1
## 2 Sample1:ACGAGCCTCCTCGCAT.1.1 Sample1
## 3 Sample1:ACGAGCCTCGTCCAGG.1.1 Sample1
## 4 Sample1:ACGAGGAAGAGCCTAG.1.1 Sample1
## 5 Sample1:ACGAGGACAGGTCCAC.1.1 Sample1
## 6 Sample1:ACGAGGAGTCCTCTTG.1.1 Sample1
Cell barcode is used to couple cell-level metadata and gene expression matrix. Users can also add an additional column celltype
if one knows celltype annotation of a specific cell. Alternatively, one can infer celltype
in Module1: data integration.
The third component is sample-level metadata, which documents a sample’s phenotype(s) of interests (e.g. clinical outcome) and other related covariates (e.g. age and sex). The first column contains unique sample IDs, and the remaining columns contain phenotypes and covariates. The cell-level metadata and sample-level metadata can be linked via unique sample IDs (i.e.sample
).
In the multi-sample example, the sample metadata can be accessed here. Similarly, we can use fread
or read.csv
functions to load in the sample-meta.
sample_meta <- read.csv('SampleMeta.csv')
sample_meta
## sample age sex severity
## 1 Sample1 28 Female HD
## 2 Sample2 45 Male HD
## 3 Sample3 80 Female Severe
## 4 Sample4 61 Male Severe