yascp

nf-core/yascp: Output

Introduction

This guide provides a comprehensive overview of the outputs generated by the pipeline.

The pipeline will create the following files in your working directory:

work            # Directory containing the nextflow working files
results         # Finished results (configurable, see below)
.nextflow.log   # Log file from Nextflow
# Other Nextflow hidden files, e.g. history of pipeline runs and old logs.

Pipeline Overview

Utilizing Nextflow, our pipeline orchestrates a series of data processing steps. The structure of the overall results folder is outlined below, offering a snapshot of the diverse outputs from different stages of the pipeline:

Results Folder Structure

The pipeline delivers outputs across several key areas:

Detailed explanations of each step and the corresponding outputs are provided below:

Alignment step

Cellranger - Currently users have to run Cellranger upstream of the pipeline. We suggest using the nf-core scrnaseq pipeline: https://nf-co.re/scrnaseq/2.5.1

Ambient RNA removal

Ambient RNA Removal using Cellbender

Reads the Cellranger outputs and removes the ambient RNA using Cellbender

Output file structure ( nf-preprocessing/cellbender ):
Cellbender output plots:

Genotype processing and Donor deconvolutions

This stage runs if there is more than one donor in the pool, and also removes multiplet and unassigned cells.

Genotype processing

If users provide genotypes, this step slices and dices them to prepare them for the CellSNP/Vireo deconvolution and GT matching.

Donor Deconvolution using CellSnp/Vireo

We run cellsnp and vireo to deconvolute donors if the input file indicates that there is more than one donor in the pool.

Cellsnp

Cellsnp profiles the variants present in each droplet; vireo later uses these profiles to assign each cell to a donor cluster:

Cellsnp Output files:

Vireo

Vireo takes the cellsnp variant pileups and assigns each cell to a donor cluster.
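As a rough illustration of how per-cell donor assignments can be summarised, the sketch below parses a Vireo-style assignment table and counts cells per label. It assumes a tab-separated table with `cell` and `donor_id` columns, in which `donor_id` may also be `doublet` or `unassigned`; the barcodes and rows are made up, and the function name is hypothetical rather than part of the pipeline.

```python
import csv
import io
from collections import Counter

# Made-up example rows; a real table has more columns and many more cells.
EXAMPLE_TSV = (
    "cell\tdonor_id\n"
    "AAAC-1\tdonor0\n"
    "AAAG-1\tdonor1\n"
    "AACT-1\tdoublet\n"
    "AAGG-1\tdonor0\n"
    "ACCA-1\tunassigned\n"
)

def summarise_assignments(handle):
    """Count cells per label and collect barcodes assigned to a single donor."""
    counts = Counter()
    singlets = []
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["donor_id"]] += 1
        # Assumption: these two labels mark cells to drop before downstream steps.
        if row["donor_id"] not in ("doublet", "unassigned"):
            singlets.append(row["cell"])
    return counts, singlets

counts, singlets = summarise_assignments(io.StringIO(EXAMPLE_TSV))
print(dict(counts))
print(singlets)
```

To use it on a real file, pass an open file handle instead of the `io.StringIO` wrapper.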

Vireo Output files:

Genotype matching

The inferred genotypes (from both Freebayes and Vireo) are used to double-check the identities of the donors in the pool. Vireo and Freebayes produce inferred genotypes from the scRNA data for each deconvoluted donor. These are then passed to bcftools gtcheck, which matches each inferred genotype against the provided genotype cohorts. This produces statistics about the GT matching against each cohort: score distributions are calculated, and z0 (the best donor match statistic) and z1 (the second-best donor match statistic) are calculated per cohort. These scores are then compared between cohorts to determine the best match out of all cohorts. This allows Yascp to determine which donor is the best match and how well it matches within each cohort.
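The exact statistic Yascp computes is not spelled out here, so the following is only one plausible reading of the z0/z1 idea. It assumes each cohort yields a discordance score per candidate donor (lower meaning a better match) and that z0 and z1 are plain z-scores of the best and second-best candidates against that cohort's score distribution. All donor names, cohort names and numbers are invented, and both function names are hypothetical.

```python
from statistics import mean, pstdev

def best_two_z(scores):
    """Return (best_donor, z0, z1) for one cohort.

    `scores` maps candidate donor -> discordance (assumed: lower = better match).
    """
    values = list(scores.values())
    if len(values) < 2:
        raise ValueError("need at least two candidate donors")
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("all candidates have the same discordance")
    ranked = sorted(scores, key=scores.get)  # best (lowest) first
    z0 = (scores[ranked[0]] - mu) / sd
    z1 = (scores[ranked[1]] - mu) / sd
    return ranked[0], z0, z1

def best_across_cohorts(cohorts):
    """Pick the cohort whose best candidate stands out most (most negative z0)."""
    results = {name: best_two_z(scores) for name, scores in cohorts.items()}
    best = min(results, key=lambda name: results[name][1])
    return best, results

# Invented discordance scores for one deconvoluted donor against two cohorts.
cohorts = {
    "cohortA": {"A1": 0.05, "A2": 0.48, "A3": 0.50, "A4": 0.52},
    "cohortB": {"B1": 0.45, "B2": 0.50, "B3": 0.55},
}
best, results = best_across_cohorts(cohorts)
print(best, results[best])
```

A large gap between z0 and z1 suggests a single clear match, while similar values suggest that no candidate in that cohort stands out.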


Doublet Detection


Scrublet

Scrublet Output files:

DoubletDecon

DoubletDecon Output files:

DoubletDetection

DoubletDetection Output files:

DoubletFinder

DoubletFinder Output files:

scDblFinder

scDblFinder Output files:

SCDS

SCDS Output files:

Donor Deconvolution using Souporcell - The Souporcell option both removes ambient RNA and deconvolutes the donors [however, this option is currently broken and will be fixed soon].

GT match - This step uses the prepared genotypes and the genotypes inferred by Vireo to pick out the donor that corresponds to the right reads.

GT input files:
GT match results structure:

Celltype identification

Azimuth

Uses the Azimuth PBMC l2 reference (the pipeline will be adjusted later to be more general for other tissue types) to assign the celltypes. Downstream, it maps the l2 labels to l1 and l3 as per https://github.com/wtsi-hgi/yascp/blob/main/assets/azimuth/Azimuth_Mappings.txt
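Conceptually, mapping between annotation levels is a table lookup. The sketch below shows the idea with a small illustrative subset of l2 to l1 labels; the authoritative table is the linked Azimuth_Mappings.txt, whose file format is not assumed here, and the `coarsen` helper is hypothetical.

```python
# Illustrative subset only; not a copy of the pipeline's mapping file.
L2_TO_L1 = {
    "CD4 Naive": "CD4 T",
    "CD4 TCM": "CD4 T",
    "CD14 Mono": "Mono",
    "B naive": "B",
}

def coarsen(labels, mapping, unknown="unmapped"):
    """Translate fine-grained labels to a coarser level, flagging unknown labels."""
    return [mapping.get(label, unknown) for label in labels]

l2_calls = ["CD4 Naive", "CD14 Mono", "mystery"]
print(coarsen(l2_calls, L2_TO_L1))
```

Flagging unknown labels explicitly, rather than dropping them, makes gaps in the mapping table easy to spot.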

Azimuth Output files:

Celltypist

Performs celltype assignment using the celltypist Immune Low and Immune High profiles (this will be adjusted to use more references).

Celltypist Output files:
Combined celltypes file:

Keras celltype transfer - This uses pretrained reference panels for celltype assignment - currently only works at Sanger.

Combined File - The pipeline produces a combined celltypes file in which the annotations from all the different references are collected in one spreadsheet.
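The idea of a combined file can be sketched as an outer join of per-tool annotations on the cell barcode. The column names (`azimuth_l2`, `celltypist`), barcodes and labels below are hypothetical and do not reflect the pipeline's actual file layout.

```python
import csv
import io

def combine_annotations(sources):
    """Outer-join annotations: {column: {barcode: label}} -> list of row dicts."""
    barcodes = sorted(set().union(*(labels.keys() for labels in sources.values())))
    rows = []
    for barcode in barcodes:
        row = {"barcode": barcode}
        for column, labels in sources.items():
            row[column] = labels.get(barcode, "")  # empty if a tool has no call
        rows.append(row)
    return rows

# Made-up annotations from two tools.
sources = {
    "azimuth_l2": {"AAAC-1": "CD4 Naive", "AAAG-1": "B naive"},
    "celltypist": {"AAAC-1": "T cells"},
}
rows = combine_annotations(sources)

# Write the combined table as tab-separated text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["barcode", *sources], delimiter="\t")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```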

Donor and Cell QC

We perform different types of QC: Adaptive Isolation Forests, Adaptive Isolation Forests per celltype, and hard filter thresholds.

Data QC output folder structure:

Isolation Forest

We perform Isolation Forests at different resolutions - all data together, and per-celltype adaptive QC.
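For readers unfamiliar with the technique, here is a minimal, plain isolation forest in pure Python. It is not the pipeline's adaptive implementation. The idea is that outlying cells are separated from the rest by fewer random splits, so a short average path through the trees gives a high anomaly score. The two QC features used (gene count, mitochondrial percentage) and all values are assumptions for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def avg_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # simplification: stop if the chosen feature cannot split
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }

def path_length(tree, row, depth=0):
    """Depth at which `row` lands, plus a correction for unsplit leaves."""
    if "size" in tree:
        return depth + avg_path_length(tree["size"])
    branch = "left" if row[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], row, depth + 1)

def anomaly_scores(rows, n_trees=100, sample_size=64, seed=0):
    """Higher scores (towards 1) suggest outliers."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    if sample_size < 2:
        raise ValueError("need at least two rows")
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = avg_path_length(sample_size)
    return [2.0 ** (-(sum(path_length(t, r) for t in trees) / n_trees) / norm)
            for r in rows]

# Made-up cells as (n_genes, pct_mito); the last one is an obvious outlier.
cells = [(1000 + 10 * i, 2.0 + 0.1 * (i % 5)) for i in range(40)] + [(150, 35.0)]
scores = anomaly_scores(cells)
print(scores.index(max(scores)))
```

A per-celltype variant would simply call `anomaly_scores` separately on the cells of each celltype, so that each celltype is judged against its own distribution.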

Hard filters

We also apply hard filters if the user has specified that they are required.
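Hard filtering amounts to keeping only cells whose QC metrics fall inside fixed bounds. The sketch below illustrates this; the metric names, threshold names and values are all hypothetical and are not the pipeline's configuration keys or defaults.

```python
# Hypothetical threshold names and values, chosen only for illustration.
THRESHOLDS = {"min_genes": 200, "min_counts": 500, "max_pct_mito": 20.0}

def passes_hard_filters(cell, thresholds=THRESHOLDS):
    """Return True if a cell's QC metrics are within every fixed bound."""
    return (
        cell["n_genes"] >= thresholds["min_genes"]
        and cell["n_counts"] >= thresholds["min_counts"]
        and cell["pct_mito"] <= thresholds["max_pct_mito"]
    )

# Made-up cells: one good, one with too few genes/counts, one with high mito %.
cells = [
    {"barcode": "AAAC-1", "n_genes": 1500, "n_counts": 4000, "pct_mito": 3.2},
    {"barcode": "AAAG-1", "n_genes": 120, "n_counts": 300, "pct_mito": 2.0},
    {"barcode": "AACT-1", "n_genes": 900, "n_counts": 2500, "pct_mito": 41.0},
]
kept = [c["barcode"] for c in cells if passes_hard_filters(c)]
print(kept)
```

Unlike the isolation forest approach, the same cutoffs apply to every cell regardless of celltype or sample.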

Integration and clustering

By default, multiple clustering resolutions are run for both BBKNN and Harmony, resulting in a subfolder structure. The pipeline automatically estimates the best number of PCs to use for clustering using knee and elbow plots, which can be found in the plots section.
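How the pipeline locates the knee or elbow is not described here. One common heuristic, sketched below as an assumption, picks the point on the variance-explained curve that lies furthest from the straight line joining the first and last points. The variance ratios are invented and `elbow_index` is a hypothetical helper.

```python
import math

def elbow_index(values):
    """Index of the point furthest from the line through the first and last points."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points to find an elbow")
    x0, y0 = 0, values[0]
    x1, y1 = n - 1, values[-1]
    norm = math.hypot(x1 - x0, y1 - y0)
    best_i, best_d = 0, -1.0
    for i, y in enumerate(values):
        # Perpendicular distance from (i, y) to the line (x0, y0)-(x1, y1).
        d = abs((y1 - y0) * i - (x1 - x0) * y + x1 * y0 - y1 * x0) / norm
        if d > best_d:
            best_i, best_d = i, d
    return best_i

# Invented per-PC variance ratios, sorted from PC1 onwards.
variance_ratio = [0.30, 0.18, 0.10, 0.05, 0.03, 0.025, 0.02, 0.018, 0.015, 0.012]
n_pcs = elbow_index(variance_ratio) + 1  # convert 0-based index to a PC count
print(n_pcs)
```

Checking the chosen value against the knee and elbow plots is still worthwhile, since heuristics like this can be sensitive to how many PCs are included in the curve.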

Output file structure ( clustering ):

BBKNN

BBKNN file structure ( clustering ):
BBKNN sample UMAPS Coloured:

Harmony

Harmony file structure ( clustering ):
Harmony sample UMAPS Coloured:
Harmony cluster evaluations and cluster markers:

PCA

PCA file structure ( clustering ):

Cluster assessments

Sccaf - We run Sccaf to assess clustering accuracy; these metrics are useful for picking the best clustering resolution.

Sccaf file structure ( clustering ):

Lisi - The pipeline can also run LISI cluster assessments; however, this option currently does not run by default, as it is memory demanding and requires further optimisation.

Citeseq

Citeseq folder will be present if your data contains citeseq


In this folder we have a couple of subfolders:

Merged h5ad files


The pipeline creates merged h5ad files for the most important preprocessing steps.

Handover

Summary Statistics, Per Donor h5ad files, Summary Plots


In this folder we can see 3 different folders:

Plots

Some summary plots for quick inspections


Execution reports

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
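When troubleshooting, a quick way to find failing steps is to scan a Nextflow trace file for tasks that did not complete. The sketch below assumes a tab-separated trace containing `name`, `status` and `exit` columns; the process names and rows are made up, and `failed_tasks` is a hypothetical helper rather than a Nextflow feature.

```python
import csv
import io

# Made-up trace rows; a real trace has more columns (durations, memory, etc.).
EXAMPLE_TRACE = (
    "task_id\tname\tstatus\texit\n"
    "1\tSTEP_A (pool1)\tCOMPLETED\t0\n"
    "2\tSTEP_B (pool1)\tFAILED\t137\n"
    "3\tSTEP_A (pool2)\tCACHED\t0\n"
)

def failed_tasks(handle):
    """Return (task name, exit code) for every task whose status is FAILED."""
    return [
        (row["name"], row["exit"])
        for row in csv.DictReader(handle, delimiter="\t")
        if row["status"] == "FAILED"
    ]

failures = failed_tasks(io.StringIO(EXAMPLE_TRACE))
print(failures)
```

The exit code of a failed task is often the fastest clue to the cause, and the task name tells you which process's work directory to inspect.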
