nf-core/yascp: Output
Introduction
This guide provides a comprehensive overview of the outputs generated by the pipeline.
The pipeline will create the following files in your working directory:
work # Directory containing the Nextflow working files
results # Finished results (configurable, see below)
.nextflow.log # Log file from Nextflow
# Other Nextflow hidden files, e.g. history of pipeline runs and old logs.
Pipeline Overview
Utilizing Nextflow, our pipeline orchestrates a series of data processing steps. The structure of the overall results folder is outlined below, offering a snapshot of the diverse outputs from different stages of the pipeline:

The pipeline delivers outputs across several key areas:
- CellSNP: Variant calling on single cells.
- Cell Type Identification: Classification of cells into types.
- CITE-seq Data Processing: Handling of CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) data.
- Clustering and Integration: Grouping cells based on similarities and integrating datasets.
- Sample Deconvolution: Disentangling mixtures of cells from different donors.
- Doublet Detection: Identifying artificial doublet cells.
- Genotype Matching: Determining sample matches through genotype comparison.
- Inferred Genotypes: VCF files generated by Vireo and Freebayes for each deconvoluted donor in the pool.
- Handover: Storage of summary statistics, plots, and final QC’d and annotated H5AD files per donor.
- Merged H5AD Files: Consolidated H5AD files from various preprocessing steps, enabling restarts from the clustering phase.
- NF-Preprocessing: Includes CellBender results for ambient RNA removal.
- Pipeline Info: Statistics and logs from the pipeline execution.
- Plots: A collection of quality control visualizations.
- Resources: Reference genomes utilized in data processing.
- UMAPS: Quick-reference UMAP plots for data visualization.
Detailed explanations of each step and the corresponding outputs are provided below:
Alignment step
Cellranger - Currently, users have to run Cellranger upstream of the pipeline. We suggest using the nf-core pipeline: https://nf-co.re/scrnaseq/2.5.1
Ambient RNA removal
Reads the Cellranger outputs and removes the ambient RNA using Cellbender.
Output file structure ( nf-preprocessing/cellbender ):
- This folder contains multiple plots and output files; the most important are the matrix and h5ad files produced after ambient RNA removal, such as cellbenderFPR_0pt1filtered_10x_mtx/ cellbender_FPR_0.1_filtered.h5
Cellbender output plots:
Genotype processing and Donor deconvolutions
Runs if more than 1 donor is in the pool; also covers multiplet/unassigned cell removal.
If users provide genotypes, this step subsets and prepares them for the CellSNP/Vireo deconvolutions and GT matches.
We run Cellsnp and Vireo to deconvolute donors if the input file indicates that there is more than 1 donor in the pool.
Cellsnp
Cellsnp profiles the variants present in each droplet; Vireo later uses these profiles to assign each cell to a donor cluster:
Cellsnp Output files:
Vireo
Vireo takes the Cellsnp variant pileups and assigns each cell to a donor cluster:
Vireo Output files:
The inferred genotypes (from both Freebayes and Vireo) are used to double-check the identities of the donors in the pool. Vireo and Freebayes produce inferred genotypes from the scRNA data for each deconvoluted donor. bcftools gtcheck then matches each inferred genotype against the provided genotype cohorts. This produces statistics on the genotype matching: score distributions are calculated per cohort, together with z0 (best donor match statistic) and z1 (second-best donor match statistic), and these scores are then compared between cohorts to determine the best match across all cohorts. This allows Yascp to determine which donor is the best match and how well it matches within each of the cohorts.
- z0 is the best GT match score as per bcftools gtcheck / SD of all scores
- z1 is the second-best GT match score as per bcftools gtcheck / SD of all scores
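As a rough illustration of the two statistics above, the sketch below ranks a set of per-donor match scores and divides the best and second-best score by the standard deviation of all scores. The function name, the input dictionary, the assumption that a higher score means a better match, and the use of the population standard deviation are all assumptions made for this example; Yascp's own implementation may differ.

```python
from statistics import pstdev


def match_z_scores(scores):
    """Return (best_donor, z0, z1) from a dict of donor -> match score.

    Assumes higher scores mean better matches (hypothetical orientation).
    """
    if len(scores) < 2:
        raise ValueError("need at least two candidate donors")
    sd = pstdev(list(scores.values()))
    if sd == 0:
        raise ValueError("all scores are identical; SD is zero")
    # Rank candidate donors from best to worst score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_donor, best), (_, second) = ranked[0], ranked[1]
    return best_donor, best / sd, second / sd


# Made-up scores for one deconvoluted donor against one cohort.
example = {"donorA": 9.0, "donorB": 3.0, "donorC": 3.0}
best_donor, z0, z1 = match_z_scores(example)
```

A large gap between z0 and z1 suggests that the best match stands out clearly from the runner-up within that cohort.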

Doublet Detection

Scrublet
Scrublet Output files:
- Scrublet always runs by default. If no donors are pooled in the run (i.e. there is only 1 donor), doublets are removed by Scrublet instead of Vireo:
DoubletDecon
DoubletDecon Output files:
- DoubletDecon output files contain each barcode and a label indicating whether it is a singlet or a doublet:
DoubletDetection
DoubletDetection Output files:
- DoubletDetection output files contain each barcode and a label indicating whether it is a singlet or a doublet:
DoubletFinder
DoubletFinder Output files:
- DoubletFinder output files contain each barcode and a label indicating whether it is a singlet or a doublet:
scDblFinder
scDblFinder Output files:
- scDblFinder output files contain each barcode and a label indicating whether it is a singlet or a doublet:
SCDS
SCDS Output files:
- SCDS output files contain each barcode and a label indicating whether it is a singlet or a doublet:
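Since each of these tools reports one label per barcode, the outputs can be tallied with a few lines of Python. This is a minimal sketch: the column names `barcode` and `label`, the label values `singlet` and `doublet`, and the tab-separated layout are assumptions, so check the header of the actual files before reusing it.

```python
import csv
import io
from collections import Counter


def tally_labels(handle, barcode_col="barcode", label_col="label"):
    """Count singlet/doublet labels and collect the singlet barcodes.

    Column names are assumptions; adjust them to match the real file header.
    """
    counts = Counter()
    singlets = []
    for row in csv.DictReader(handle, delimiter="\t"):
        label = row[label_col].strip().lower()
        counts[label] += 1
        if label == "singlet":
            singlets.append(row[barcode_col])
    return counts, singlets


# Made-up table standing in for one tool's output file.
demo = io.StringIO(
    "barcode\tlabel\n"
    "AAAC-1\tsinglet\n"
    "AAAG-1\tdoublet\n"
    "AACT-1\tsinglet\n"
)
counts, singlets = tally_labels(demo)
```

For a real file, pass an open file handle instead of the in-memory `demo` table.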
Donor Deconvolution using Souporcell - The Souporcell option both removes the ambient RNA and deconvolutes the donors [however, this option is currently broken and will be fixed soon]
GT match - This step uses the prepared genotypes and the genotypes inferred by Vireo to pick out the donor that corresponds to the right reads.
GT input files:
- Users can provide multiple cohort VCFs, either split per chromosome or as one big vcf/bcf file:
GT match results structure:
- GT match produces multiple metrics that assess whether each donor is the one we expect and what the relatedness within the pool is.
- Results indicate which donor from Vireo deconvolutions is which:
Celltype identification
Uses the Azimuth PBMC l2 reference (the pipeline will be adjusted later to be more general for other tissue types) to assign cell types. Downstream, it maps l2 to l1 and l3 as per https://github.com/wtsi-hgi/yascp/blob/main/assets/azimuth/Azimuth_Mappings.txt
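The level mapping can be pictured as a simple lookup from fine to coarse labels. The sketch below assumes a two-column, tab-separated table (l2 label, then l1 label); this layout and the example rows are assumptions for illustration and are not taken from the Azimuth_Mappings.txt file itself.

```python
def load_mapping(lines, sep="\t"):
    """Build an l2 -> l1 lookup from 'l2<sep>l1' lines (assumed layout)."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        l2, l1 = line.split(sep)[:2]
        mapping[l2] = l1
    return mapping


def map_labels(l2_labels, mapping, unknown="unmapped"):
    """Translate a list of l2 labels; labels without an entry get `unknown`."""
    return [mapping.get(label, unknown) for label in l2_labels]


# Illustrative rows only; the real mapping file may be laid out differently.
demo_table = ["CD4 TCM\tCD4 T", "CD4 Naive\tCD4 T", "CD14 Mono\tMono"]
mapping = load_mapping(demo_table)
l1_labels = map_labels(["CD4 Naive", "CD14 Mono", "Platelet"], mapping)
```

Returning an explicit `unmapped` value, rather than raising an error, makes it easy to spot labels that the mapping table does not cover.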
Azimuth Output files:
Performs cell type assignment using the Celltypist Immune Low and Immune High profiles (this will be adjusted to use more references)
Celltypist Output files:
Combined celltypes file:
Keras celltype transfer - This uses pretrained reference panels for cell type assignment. It currently only works at Sanger.
Combined File - The pipeline produces a combined cell types file in which the assignments from all references are combined in one spreadsheet:
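Conceptually, the combined file is an outer join of the per-reference annotations on the cell barcode. The following sketch shows that idea with plain dictionaries; the source names, barcodes, labels, and the `NA` placeholder are all hypothetical, and the pipeline's actual column names may differ.

```python
def combine_annotations(sources, missing="NA"):
    """Merge {source_name: {barcode: label}} into one row per barcode."""
    barcodes = sorted(set().union(*[s.keys() for s in sources.values()]))
    rows = []
    for barcode in barcodes:
        row = {"barcode": barcode}
        for name, labels in sources.items():
            # Barcodes missing from a source get a placeholder value.
            row[name] = labels.get(barcode, missing)
        rows.append(row)
    return rows


# Hypothetical per-reference assignments keyed by barcode.
sources = {
    "azimuth_l2": {"AAAC-1": "CD4 TCM", "AAAG-1": "NK"},
    "celltypist": {"AAAC-1": "T cells"},
}
combined = combine_annotations(sources)
```

Each row of `combined` then carries one column per reference, which mirrors the one-spreadsheet layout described above.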
Donor and Cell QC
We perform different types of QC: Adaptive Isolation Forests, Adaptive Isolation Forests per cell type, and hard filter thresholds.
Data QC output folder structure:
- QC output Folder structure:
We perform Isolation Forests at different resolutions: all data together, and per-celltype adaptive QC:
- All together Isolation Forests:
- Per Celltype Isolation Forests:
We also apply hard filters if the user has specified that they are required.
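A hard filter is simply a fixed threshold applied to per-cell metrics, with failing cells either dropped or kept and flagged. The sketch below illustrates both modes; the metric names (`pct_counts_mt`, `n_genes`), the threshold values, and the `pass_hard_filters` flag are hypothetical and do not reflect the pipeline's defaults.

```python
def apply_hard_filters(cells, max_pct_mt=20.0, min_genes=200, drop=True):
    """Apply fixed thresholds to per-cell metrics.

    With drop=True failing cells are removed; with drop=False they are kept
    and marked via the 'pass_hard_filters' key. Thresholds are illustrative.
    """
    result = []
    for cell in cells:
        passed = (
            cell["pct_counts_mt"] <= max_pct_mt and cell["n_genes"] >= min_genes
        )
        if passed or not drop:
            result.append({**cell, "pass_hard_filters": passed})
    return result


# Made-up per-cell metrics.
cells = [
    {"barcode": "AAAC-1", "pct_counts_mt": 5.0, "n_genes": 1500},
    {"barcode": "AAAG-1", "pct_counts_mt": 35.0, "n_genes": 900},
]
dropped = apply_hard_filters(cells)              # failing cell removed
flagged = apply_hard_filters(cells, drop=False)  # failing cell kept, flagged
```

Flagging instead of dropping keeps the full cell set available for later inspection while still recording which cells failed.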
Integration and clustering
By default, multiple clustering resolutions are run for both BBKNN and Harmony, resulting in a subfolder structure. The pipeline automatically estimates the best number of PCs to use for clustering using knee and elbow plots, which can be found in the plots section.
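One common way to locate an elbow is to pick the point on the variance curve that lies furthest from the straight line joining its first and last points. The sketch below implements that heuristic on made-up values; it is offered as an illustration of the idea and is not necessarily the method the pipeline uses.

```python
def elbow_index(values):
    """Return the index of the point furthest from the first-to-last chord."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points to find an elbow")
    x0, y0 = 0, values[0]
    x1, y1 = n - 1, values[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5
    best_i, best_d = 0, -1.0
    for i, y in enumerate(values):
        # Perpendicular distance from (i, y) to the chord.
        d = abs(dy * (i - x0) - dx * (y - y0)) / norm
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Made-up variance explained (%) for PC1..PC8.
variance = [50.0, 25.0, 12.0, 3.0, 2.8, 2.6, 2.5, 2.4]
n_pcs = elbow_index(variance) + 1  # convert 0-based index to a PC count
```

Note that this heuristic depends on how the two axes are scaled, so results on real data should be checked against the knee and elbow plots.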
Output file structure ( clustering ):
- The clustering folder combines the outputs of all integration methods used, together with additional plots, in the following layout:
BBKNN file structure ( clustering ):
- BBKNN is performed with different clustering resolutions, and each of the clusters is assessed using SCCAF:

BBKNN sample UMAPS Coloured:
- Resolution 0.1: BBKNN is performed with different clustering resolutions, and each of the clusters is assessed using SCCAF:
- Resolution 5: BBKNN is performed with different clustering resolutions, and each of the clusters is assessed using SCCAF:
- Mitochondrial transcripts: Coloured UMAP: We also colour each of the bespoke clusters by different metrics:
Harmony file structure ( clustering ):
- Harmony is performed with different clustering resolutions, and each of the clusters is assessed using SCCAF:
Harmony sample UMAPS Coloured:
- Resolution 0.1: Harmony is performed with different clustering resolutions, and each of the clusters is assessed using SCCAF:
- Resolution 5: Harmony is performed with different clustering resolutions, and each of the clusters is assessed using SCCAF:

- Mitochondrial transcripts: Coloured UMAP: We also colour each of the bespoke clusters by different metrics:
Harmony cluster evaluations and cluster markers:
- Histograms: Multiple useful plots are produced to look at the clusterings:
- Dotplots: Multiple useful plots are produced to look at the clusterings:
PCA file structure ( clustering ):
- PCA is performed on the integrated data:
PCA file structure ( clustering ):
- Gene loadings for each of the PCs are evaluated:
Cluster assessments
Sccaf file structure ( clustering ):
- As described above, clustering is assessed using SCCAF. Directory structure:
- Precision-recall curves:
- ROC:
- Accuracy:
LISI - We also have the capability to run LISI cluster assessments; however, this option currently does not run by default, as it is memory-demanding and requires further optimisation.
The Citeseq folder will be present if your data contains CITE-seq.

In this folder we have several subfolders:
- DSB - contains DSB CITE-seq normalisation statistics and RDS files
- all_data_integrated - contains Seurat's integration of CITE-seq and, if available, VDJ data, as well as some UMAPs produced by these processes
- filtered - contains the data modalities split apart, i.e. if the data is hashtagged, this layer is stored separately from the antibody data and also separately from the GEX data
- raw - similar to the above, but these are the raw Cellranger files split according to modality.


The pipeline will create merged h5ad files for the most important preprocessing steps:
- Post deconvolution and cell type assignment: merged files that contain any extra metadata provided.
- Post hard filters: a merged h5ad file where cells that do not pass certain thresholds are dropped (or flagged, depending on the settings used).
- Post adaptive filters: an h5ad file where cells that do not pass these filters are dropped (or flagged, depending on the settings used).
Note that the Handover folder discussed next contains the final per-donor h5ad files, which include all the information from the files mentioned above.
Summary Statistics, Per Donor h5ad files, Summary Plots

In this folder we can see 3 different folders:
- Donor_Quantification - contains the Cellranger filtered, Cellranger raw, and Cellbender filtered files that are used to produce the final per-donor h5ad files, and the metadata features in the per-donor tsv files
- Donor_Quantification_summary - contains summary statistics per donor and summary statistics per tranche (the collection of all pools that were run in this run).
- Summary_plots - contains the most important plots for each of the steps, for quick investigation of the performance of the scRNA runs and of the analysis.

Some summary plots for quick inspections

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
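For troubleshooting, the execution trace is usually the quickest report to scan programmatically. The sketch below assumes a tab-separated Nextflow trace file with `name`, `status` and `exit` columns and treats `COMPLETED` and `CACHED` as successful; the file name and its location under the pipeline info folder depend on your configuration, and the sample rows and process names are made up.

```python
import csv
import io

OK_STATUSES = {"COMPLETED", "CACHED"}


def unfinished_tasks(handle):
    """Return (name, status, exit) for every task that did not finish cleanly."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["name"], row["status"], row["exit"])
        for row in reader
        if row["status"] not in OK_STATUSES
    ]


# Made-up trace excerpt; the process names are hypothetical.
demo_trace = io.StringIO(
    "task_id\tname\tstatus\texit\n"
    "1\tSTEP_A (pool1)\tCOMPLETED\t0\n"
    "2\tSTEP_B (pool1)\tFAILED\t137\n"
    "3\tSTEP_A (pool2)\tCACHED\t0\n"
)
problems = unfinished_tasks(demo_trace)
```

To use it on a real run, open the trace file from your results and pass the file handle in place of `demo_trace`.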
