TreeCorTreat - Data Prepartion

Boyang Zhang1, Zeyu Chen2,3,4, Zhicheng Ji5, E. John Wherry2,3,4 and Hongkai Ji1

1Deparment of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University
2Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania
3 Institute for Immunology, Perelman School of Medicine, University of Pennsylvania
4Parker Institute for Cancer Immunotherapy at University of Pennsylvania
5Department of Biostatistics and Bioinformatics, Duke University School of Medicine

15 November 2021

Package

TreeCorTreat 0.9.0

1 Introduction
2 Input data

1 Introduction

In this tutorial, we are going to introduce the input data structure of TreeCorTreat and demonstrate some basic R functions to import gene expression matrix, cell-level and sample-level metadata from csv files. If you are familiar with single-cell RNA-seq data and R programming, you can skip this tutorial and check the step-by-step user manual here.

Here are some general suggestions for data preparation from scratch:

Collect or download high throughput sequencing data
Depending on which protocol you use, preprocess the data accordingly. For example, Cell Ranger is a standard pipeline that process 10x Chromium single-cell data to align reads, generate feature-barcode matrices, etc.
Prepare a sample-level metadata spreadsheet with sample (e.g. a unique sample ID), phenotype of interest and other sample-level covariates.
Load in processed data into R. For example, Seurat provide a variety of functions to load in data from 10x or remote/local mtx files.
Organize the cell-level metadata and gene expression matrix. Gene expression data can be stored in a (sparse) matrix. Cell-level metadata is a data frame and it must contain barcode and sample two columns (or celltype if you have already run data integration).

Please note that more advanced topics, such as details of preprocessing steps, will not be covered in this tutorial.

2 Input data

TreeCorTreat takes a gene expression matrix (raw count), cell-level metadata and sample-level metadata as input.

2.1 Gene expression data

Gene expression is a matrix, where each row is a gene and each column is a cell. Each row names refers to a specific gene and each column encodes a cell’s unique barcode.

2.1.1 Gene expression matrix of one sample

For demonstration purpose, we generate a small toy example, which can be accessed here. This dataset contains contains 100 genes and 100 cells of one sample.

For gene expression matrices stored in csv or txt format, users can read in the dataset using fread function from data.table R package.

library(data.table)
input_data <- fread('Gene_Expression.csv')
input_data[1:3,1:3]

##            V1 AAACCTGAGATGTAAC.1.1 AAACCTGCATGGGACA.1.1
## 1:      ABHD2                    1                    0
## 2: AC005921.2                    0                    0
## 3: AC009163.4                    0                    0

# class of this object
class(input_data)

## [1] "data.table" "data.frame"

The first column encodes gene names and all the remaining columns (e.g. from 2nd to the last column) annotate cell barcodes. The input_data is a data.frame object. Next, we are going to convert input_data into a matrix object:

# remove 1st column (gene names) from input_data
input_mat <- as.matrix(input_data[,-1])

# specify gene names as rownames
rownames(input_mat) <- input_data$V1

dim(input_mat)

## [1] 100 100

input_mat[1:3,1:3]

##            AAACCTGAGATGTAAC.1.1 AAACCTGCATGGGACA.1.1 AAACCTGGTAGAGTGC.1.1
## ABHD2                         1                    0                    0
## AC005921.2                    0                    0                    0
## AC009163.4                    0                    0                    0

class(input_mat)

## [1] "matrix" "array"

2.1.2 Gene expression matrices of multiple samples

From the above example, we are able to load in gene expression dataset from one sample. What if we have multiple samples? We can repeat the aforementioned process across all samples. Here provides another example of four gene expression datasets (saved as csv format), corresponding to four different samples. Each file is marked by samples’ unique ID: Sample1, Sample2, Sample3 and Sample4.

# multisample directory
current_work_dir <- './multisample/'
filenames <- list.files(current_work_dir)
filenames

## [1] "Sample1.csv" "Sample2.csv" "Sample3.csv" "Sample4.csv"

We can then use a for loop to read in these four files and save the files as a list:

# initialize a list
expr.ls <- list()

# for loop
for(i in 1:length(filenames)){
  input_data <- fread(paste0(current_work_dir,filenames[i])) # read in i th file
  input_mat <- as.matrix(input_data[,-1])
  
  # specify gene names as rownames
  rownames(input_mat) <- input_data$V1
  
  # revise column names as 'sampleID:barcode' format
  sampleID <- gsub('\\.csv','',filenames[i])
  colnames(input_mat) <- paste0(sampleID,':',colnames(input_mat))
  
  # store in i th element in the list
  expr.ls[[i]] <- input_mat
}

names(expr.ls) <- gsub('\\.csv','',filenames)
names(expr.ls)

## [1] "Sample1" "Sample2" "Sample3" "Sample4"

Let’s do some sanity checks before concatenate different samples: (1) the rownames (or order of the genes to be included) should be the same across all samples

sapply(expr.ls,function(x) identical(rownames(x),rownames(expr.ls[[1]])))

## Sample1 Sample2 Sample3 Sample4 
##    TRUE    TRUE    TRUE    TRUE

and (2) there is no duplicated cell barcode across all samples

all_column_names <- do.call(c,lapply(expr.ls,function(x) colnames(x)))
length(all_column_names)

## [1] 400

sum(duplicated(all_column_names))

## [1] 0

Then, we can proceed to combine gene expression matrices across different samples into one large matrix, where row are genes and columns are cells:

expr_mat <- do.call(cbind,expr.ls)
dim(expr_mat)

## [1] 100 400

2.2 Cell-level metadata

The second component of our input is cell-level metadata, which contains cell barcode and its corresponding sample ID (i.e. which cell is coming from). Using the multi-sample example again, we can extract the cell-level data by implementing the following code:

cell_meta <- data.frame(barcode = colnames(expr_mat),
                        sample = sub('\\:.*','',colnames(expr_mat)))
head(cell_meta)

##                        barcode  sample
## 1 Sample1:ACGAGCCGTCTTCAAG.1.1 Sample1
## 2 Sample1:ACGAGCCTCCTCGCAT.1.1 Sample1
## 3 Sample1:ACGAGCCTCGTCCAGG.1.1 Sample1
## 4 Sample1:ACGAGGAAGAGCCTAG.1.1 Sample1
## 5 Sample1:ACGAGGACAGGTCCAC.1.1 Sample1
## 6 Sample1:ACGAGGAGTCCTCTTG.1.1 Sample1

Cell barcode is used to couple cell-level metadata and gene expression matrix. Users can also add an additional column celltype if one knows celltype annotation of a specific cell. Alternatively, one can infer celltype in Module1: data integration.

2.3 Sample-level metadata

The third component is sample-level metadata, which documents a sample’s phenotype(s) of interests (e.g. clinical outcome) and other related covariates (e.g. age and sex). The first column contains unique sample IDs, and the remaining columns contain phenotypes and covariates. The cell-level metadata and sample-level metadata can be linked via unique sample IDs (i.e.sample).

In the multi-sample example, the sample metadata can be accessed here. Similarly, we can use fread or read.csv functions to load in the sample-meta.

sample_meta <- read.csv('SampleMeta.csv')
sample_meta

##    sample age    sex severity
## 1 Sample1  28 Female       HD
## 2 Sample2  45   Male       HD
## 3 Sample3  80 Female   Severe
## 4 Sample4  61   Male   Severe