yascp

nf-core/yascp: Output

Introduction

This guide provides a comprehensive overview of the outputs generated by the pipeline.

The pipeline will create the following files in your working directory:

work            # Directory containing the nextflow working files
results         # Finished results (configurable, see below)
.nextflow.log   # Log file from Nextflow
# Other Nextflow hidden files, e.g. history of pipeline runs and old logs.

Pipeline Overview

The pipeline uses Nextflow to orchestrate a series of data processing steps. The structure of the overall results folder is outlined below, showing the outputs from the different stages of the pipeline:

Results Folder Structure

The pipeline delivers outputs across several key areas:

CellSNP: Variant calling on single cells.
Cell Type Identification: Classification of cells into types.
CITE-seq Data Processing: Handling of CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) data.
Clustering and Integration: Grouping cells based on similarities and integrating datasets.
Sample Deconvolution: Disentangling mixtures of cells from different donors.
Doublet Detection: Identifying artificial doublet cells.
Genotype Matching: Determining sample matches through genotype comparison.
Inferred Genotypes: VCF files generated by Vireo and Freebayes for each deconvoluted donor in the pool.
Handover: Storage of summary statistics, plots, and final QC’d and annotated H5AD files per donor.
Merged H5AD Files: Consolidated H5AD files from various preprocessing steps, enabling restarts from the clustering phase.
NF-Preprocessing: Includes CellBender results for ambient RNA removal.
Pipeline Info: Statistics and logs from the pipeline execution.
Plots: A collection of quality control visualizations.
Resources: Reference genomes utilized in data processing.
UMAPS: Quick-reference UMAP plots for data visualization.

Detailed explanations of each step and the corresponding outputs are provided below:

Alignment step

Cellranger - Currently users have to run Cellranger upstream of the pipeline. We suggest using the nf-core scrnaseq pipeline: https://nf-co.re/scrnaseq/2.5.1

Ambient RNA removal

Ambient RNA Removal using Cellbender

Reads the Cellranger outputs and removes the ambient RNA using Cellbender

Output file structure ( nf-preprocessing/cellbender ):

This folder contains multiple plots and output files. The most important are the matrix and h5 files produced after ambient RNA removal, such as cellbenderFPR_0pt1filtered_10x_mtx/ cellbender_FPR_0.1_filtered.h5

Cellbender output plots:


Genotype processing and Donor deconvolutions

This stage runs when the pool contains more than one donor, and it also removes multiplet and unassigned cells.

Genotype processing

If users provide genotypes, this step subsets and reformats them in preparation for the CellSNP/Vireo deconvolution and the GT matching.

Donor Deconvolution using CellSnp/Vireo

We run Cellsnp and Vireo to deconvolute donors if the input file indicates that there is more than one donor in the pool.

Cellsnp

Cellsnp profiles the variants present in each droplet. Vireo later uses these profiles to assign each cell to a donor cluster:

Cellsnp Output files:

Output:

Vireo

Vireo takes the Cellsnp variant pileups and assigns each cell to a donor cluster:

Vireo Output files:

Output:

Genotype matching

The inferred genotypes (from both Freebayes and Vireo) are used to double-check the identities of the donors in the pool. Vireo and Freebayes produce inferred genotypes from the scRNA data for each deconvoluted donor. bcftools gtcheck then matches each inferred genotype against the provided genotype cohorts and reports statistics on each match. In particular, score distributions are calculated, and z0 (the best donor match statistic) and z1 (the second-best donor match statistic) are computed per cohort. These scores are then compared between cohorts to determine the best match across all cohorts. This allows Yascp to determine which donor is the best match and how well it matches within each cohort.

z0 is the best GT match score as per bcftools gtcheck / SD of all scores
z1 is the second-best GT match score as per bcftools gtcheck / SD of all scores
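
The two definitions above can be sketched in a few lines of Python. This is only an illustration of the stated formula (score divided by the SD of all scores), not the pipeline's own code. The function name `match_z_scores`, the use of the sample standard deviation, and the assumption that a lower gtcheck score means a better match are all assumptions made for this example, as are the input numbers.

```python
from statistics import stdev


def match_z_scores(scores, lower_is_better=True):
    """Return (z0, z1): best and second-best score, each divided by the SD of all scores."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate donors")
    spread = stdev(scores)  # sample SD; the choice of estimator is an assumption
    if spread == 0:
        raise ValueError("all scores are identical, so z0 and z1 are undefined")
    # Assumption: scores are discordance-like, so the smallest one is the best match.
    ranked = sorted(scores, reverse=not lower_is_better)
    return ranked[0] / spread, ranked[1] / spread


# Hypothetical scores for one inferred donor against three donors of one cohort.
z0, z1 = match_z_scores([2.0, 4.0, 6.0])
print(z0, z1)
```

A large gap between z0 and z1 suggests that the best match stands out clearly from the runner-up within that cohort.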


Doublet Detection


Scrublet

Scrublet Output files:

Scrublet always runs by default. If the run contains no pooled donors (i.e. only 1 donor), doublets are removed by Scrublet instead of Vireo:

DoubletDecon

DoubletDecon Output files:

DoubletDecon output files contain each barcode and a label indicating whether it is a singlet or a doublet:

DoubletDetection

DoubletDetection Output files:

DoubletDetection output files contain each barcode and a label indicating whether it is a singlet or a doublet:

DoubletFinder

DoubletFinder Output files:

DoubletFinder output files contain each barcode and a label indicating whether it is a singlet or a doublet:

scDblFinder

scDblFinder Output files:

scDblFinder output files contain each barcode and a label indicating whether it is a singlet or a doublet:

SCDS

SCDS Output files:

SCDS output files contain each barcode and a label indicating whether it is a singlet or a doublet:
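
Since each doublet detection tool reports a barcode and a singlet/doublet label, the per-tool tables can be compared barcode by barcode. The sketch below is one possible way to do that and is not how the pipeline itself combines the calls. The column names `barcode` and `label`, the tab-separated layout, the label spelling `doublet`, and the majority threshold of 2 are all assumptions for this example.

```python
import csv
import io
from collections import Counter


def read_calls(handle):
    """Read a tab-separated table with 'barcode' and 'label' columns into a dict."""
    return {row["barcode"]: row["label"] for row in csv.DictReader(handle, delimiter="\t")}


def doublet_votes(tables):
    """Count, per barcode, how many tools labelled it a doublet."""
    votes = Counter()
    for calls in tables:
        for barcode, label in calls.items():
            # Adding a bool adds 1 for a doublet call and 0 otherwise.
            votes[barcode] += label.strip().lower() == "doublet"
    return votes


# Hypothetical outputs from three tools, held in memory instead of real files.
tool_a = read_calls(io.StringIO("barcode\tlabel\nAAAC-1\tsinglet\nAAAG-1\tdoublet\n"))
tool_b = read_calls(io.StringIO("barcode\tlabel\nAAAC-1\tsinglet\nAAAG-1\tdoublet\n"))
tool_c = read_calls(io.StringIO("barcode\tlabel\nAAAC-1\tdoublet\nAAAG-1\tdoublet\n"))

votes = doublet_votes([tool_a, tool_b, tool_c])
majority = {barcode for barcode, n in votes.items() if n >= 2}
print(dict(votes), majority)
```

With real output files, `read_calls` could be given an open file handle instead of the in-memory strings used here.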

Donor Deconvolution using Souporcell - The Souporcell option both removes the ambient RNA and deconvolutes the donors [this option is currently broken and will be fixed soon]

GT match - This step uses the prepared genotypes and the genotypes inferred by Vireo to identify which provided donor the reads of each deconvoluted cluster belong to.

GT input files:

Users can provide multiple cohort VCFs, either split per chromosome or as one large VCF/BCF file:

GT match results structure:

GT match produces multiple metrics that assess whether each donor is the one we expect and how related the donors within the pool are.
Results indicate which donor from the Vireo deconvolution is which:

Celltype identification

Azimuth

Uses the Azimuth PBMC l2 reference to assign the cell types (the pipeline will later be generalised to other tissue types). Downstream, the l2 labels are mapped to l1 and l3 as per https://github.com/wtsi-hgi/yascp/blob/main/assets/azimuth/Azimuth_Mappings.txt
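
The l2 to l1 step is essentially a table lookup. The sketch below shows the idea with a small hand-written dictionary. The four entries are an illustrative excerpt chosen for this example and do not reproduce the contents or format of Azimuth_Mappings.txt. The function name `to_l1` and the `unmapped` fallback are likewise assumptions.

```python
# Hypothetical excerpt of an l2 -> l1 lookup; the real table lives in Azimuth_Mappings.txt.
L2_TO_L1 = {
    "CD4 Naive": "CD4 T",
    "CD8 Naive": "CD8 T",
    "CD14 Mono": "Mono",
    "B naive": "B",
}


def to_l1(l2_label, mapping=L2_TO_L1, unknown="unmapped"):
    """Return the coarser l1 label for an l2 label, or a fallback if it is not in the table."""
    return mapping.get(l2_label, unknown)


print(to_l1("CD14 Mono"))
print(to_l1("not a real label"))
```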

Azimuth Output files:

By default we run the Azimuth l2 celltype assignment:

Celltypist

Performs celltype assignment using the Celltypist Immune Low and Immune High models (this will be adjusted to use more references).

Celltypist Output files:

By default we run the Immune High, Immune Low and Immune PBMC reference celltype assignments:

Combined celltypes file:

Keras celltype transfer - This uses pretrained reference panels for celltype assignment - currently only works at Sanger.

Combined File - The pipeline produces a combined celltypes file in which all the different references are combined in one spreadsheet:

Output:

Donor and Cell QC

We perform different types of QC: adaptive Isolation Forests, adaptive Isolation Forests per celltype, and hard filter thresholds.

Data QC output folder structure:

QC output Folder structure:

Isolation Forest

We perform Isolation Forests at different resolutions - all data together, and per-celltype adaptive QC:

All together Isolation Forests:
Per Celltype Isolation Forests:

Hard filters

We also apply hard filters if the user has specified that they are required.
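
As a rough illustration of what a hard filter does, the sketch below checks each cell against fixed thresholds and then either keeps the pass/fail flags or drops the failing cells. The metric names (`n_genes`, `pct_mito`), the threshold values (200 genes, 20% mitochondrial counts) and the function names are hypothetical choices for this example and are not the pipeline's defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    barcode: str
    n_genes: int      # genes detected in the cell
    pct_mito: float   # percentage of counts from mitochondrial genes


def flag_cells(cells, min_genes=200, max_pct_mito=20.0):
    """Map each barcode to True if the cell passes both hypothetical thresholds."""
    return {
        c.barcode: c.n_genes >= min_genes and c.pct_mito <= max_pct_mito
        for c in cells
    }


def drop_failed(cells, flags):
    """Keep only the cells whose flag is True."""
    return [c for c in cells if flags[c.barcode]]


cells = [
    Cell("AAAC-1", n_genes=1500, pct_mito=4.2),
    Cell("AAAG-1", n_genes=120, pct_mito=3.0),   # too few genes
    Cell("AACT-1", n_genes=900, pct_mito=35.0),  # too much mitochondrial signal
]
flags = flag_cells(cells)
kept = drop_failed(cells, flags)
print(flags)
print([c.barcode for c in kept])
```

Keeping the flags separate from the dropping step makes it possible to annotate cells without removing them.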

Integration and clustering

By default, multiple clustering resolutions are run for both BBKNN and Harmony, resulting in a subfolder structure. The pipeline automatically estimates the best number of PCs to use for clustering from knee and elbow plots, which can be found in the plots section.
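
To give an idea of what an elbow estimate looks like, the sketch below uses a common generic heuristic: pick the point on the variance curve that lies farthest from the straight line joining its first and last points. This is a stand-in technique chosen for illustration and is not necessarily the method the pipeline uses. The function name `elbow_index` and the variance values are hypothetical.

```python
def elbow_index(values):
    """Return the 0-based index of the point farthest from the first-to-last chord."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points to locate an elbow")
    x0, y0 = 0, values[0]
    x1, y1 = n - 1, values[-1]
    dx, dy = x1 - x0, y1 - y0

    def distance(i):
        # Perpendicular distance to the chord, up to a constant factor.
        return abs(dy * (i - x0) - dx * (values[i] - y0))

    return max(range(n), key=distance)


# Hypothetical variance explained by the first six principal components.
variance = [10.0, 5.0, 2.0, 1.5, 1.2, 1.0]
n_pcs = elbow_index(variance) + 1  # convert the 0-based index to a PC count
print(n_pcs)
```

Note that the distance is computed on the raw axes, so the result depends on the scale of the variance values. Rescaling both axes to the unit interval first is a common refinement.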

Output file structure ( clustering ):

The clustering folder combines the outputs of all the integration methods used, together with their plots, in the following layout:

BBKNN

BBKNN file structure ( clustering ):

BBKNN is performed with different clustering resolutions and each of the clusters is assessed using Sccaf:

BBKNN sample UMAPS Coloured:

Resolution 0.1: BBKNN clustering, assessed using Sccaf:
Resolution 5: BBKNN clustering, assessed using Sccaf:
Mitochondrial transcripts (coloured UMAP): we also colour each of the clusters by different metrics:

Harmony

Harmony file structure ( clustering ):

Harmony is performed with different clustering resolutions and each of the clusters is assessed using Sccaf:

Harmony sample UMAPS Coloured:

Resolution 0.1: Harmony clustering, assessed using Sccaf:
Resolution 5: Harmony clustering, assessed using Sccaf:
Mitochondrial transcripts (coloured UMAP): we also colour each of the clusters by different metrics:

Harmony cluster evaluations and cluster markers:

Histograms: multiple useful plots are produced for inspecting the clusterings:
Dotplots: multiple useful plots are produced for inspecting the clusterings:

PCA

PCA file structure ( clustering ):

PCA is performed on the integrated data:


Gene loadings are evaluated for each principal component:

Cluster assessments

Sccaf - We run Sccaf to assess the clustering accuracy. These metrics are useful for picking the best clustering resolution.

Sccaf file structure ( clustering ):

As described above, clustering is assessed using Sccaf. Directory structure:
Precision-recall curves:
ROC:
Accuracy:
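
One simple way to use the accuracy values when picking a resolution is to take the finest resolution whose accuracy still clears a chosen threshold. The sketch below shows that rule. It is a suggestion for reading the outputs rather than something the pipeline does automatically, and the accuracy values, the 0.9 threshold and the function name `pick_resolution` are all hypothetical.

```python
def pick_resolution(accuracy_by_resolution, min_accuracy=0.9):
    """Return the highest resolution whose accuracy meets the threshold, or None."""
    passing = [res for res, acc in accuracy_by_resolution.items() if acc >= min_accuracy]
    return max(passing) if passing else None


# Hypothetical accuracies for four clustering resolutions.
accuracies = {0.1: 0.98, 0.5: 0.95, 1.0: 0.91, 5.0: 0.72}
best = pick_resolution(accuracies)
print(best)
```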

Lisi - We can also run LISI cluster assessments. However, this option currently does not run by default, as it is memory demanding and requires further optimisation.
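
For orientation, LISI is built on the inverse Simpson's index of the labels found in a cell's neighbourhood. The sketch below computes the plain, unweighted index for one neighbourhood. It is a simplification: the published LISI weights neighbours by distance, which this example does not do, and the function name and labels are hypothetical.

```python
from collections import Counter


def inverse_simpson(labels):
    """Unweighted inverse Simpson's index: the effective number of distinct labels."""
    if not labels:
        raise ValueError("need at least one label")
    total = len(labels)
    counts = Counter(labels)
    return 1.0 / sum((n / total) ** 2 for n in counts.values())


# Hypothetical batch labels of the nearest neighbours of two cells.
well_mixed = inverse_simpson(["pool1", "pool1", "pool2", "pool2"])
single_batch = inverse_simpson(["pool1", "pool1", "pool1", "pool1"])
print(well_mixed, single_batch)
```

A value close to the number of batches indicates a well-mixed neighbourhood, while a value of 1 means every neighbour comes from the same batch.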

Citeseq

The Citeseq folder will be present if your data contains CITE-seq.


In this folder we have several subfolders:

DSB - contains the DSB Citeseq normalisation statistics and RDS files
all_data_integrated - contains Seurat's integration of Citeseq and, if available, VDJ data, as well as some UMAPs produced by these processes
filtered - contains the data modalities split apart, i.e. if the data is hashtagged, this layer is stored separately from the antibody data and also separately from the GEX data
raw - similar to the above, except that these are the raw Cellranger files split according to modality.

Merged h5ad files


The pipeline creates merged h5ad files for the most important preprocessing steps:

Post deconvolution and celltype assignment: merged files that contain any extra metadata provided.
Post hard filters: merged h5ad file in which cells that do not pass certain thresholds are dropped (or flagged, depending on the settings used).
Post adaptive filters: h5ad file in which cells that do not pass these filters are dropped (or flagged, depending on the settings used).

Note that the Handover folder discussed next contains the final per-donor h5ad files, which include all the information from the files mentioned above.

Handover

Summary Statistics, Per Donor h5ad files, Summary Plots


In this folder we can see 3 different folders:

Donor_Quantification - contains the Cellranger filtered, Cellranger raw and Cellbender filtered files that are used to produce the final per-donor h5ad files, and the metadata features in the per-donor tsv files
Donor_Quantification_summary - contains summary statistics per donor and per tranche (the collection of all pools that were run in this run).
Summary_plots - contains the most important plots for each of the steps, for quick investigation of the performance of the scRNA runs and of the analysis.

Plots

Some summary plots for quick inspections


Execution reports

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

