1. Install Nextflow (`>=21.04.0`).
2. Install Docker or Singularity for full pipeline reproducibility.
3. Download/clone the pipeline: `git clone https://github.com/wtsi-hgi/yascp.git`
Note that on Farm you don't need to install anything: YASCP is already installed and can be loaded as a module.
To run the whole pipeline, use the following commands:
For a test dataset, run:

    nextflow run /path/to/cloned/yascp -profile test,<docker/singularity,institute>

For your dataset, run:

    nextflow run /path/to/cloned/yascp -profile <docker/singularity,institute> -c inputs.nf -resume
To run YASCP you need to specify several core Nextflow arguments, as in the example commands above.
NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).
`-profile`
Use this parameter to choose a configuration profile. Profiles provide configuration presets for different computing environments.
Several generic profiles are included with the pipeline, instructing it to use software packaged via various methods such as Docker and Singularity (see below). Most of these software packaging methods pull Docker containers from quay.io (e.g. Biocontainers).
You will need to use Docker or Singularity containers for full pipeline reproducibility, as we do not currently support Conda.
Note that multiple profiles can be loaded, for example: `-profile test,docker`. The order of arguments is important! Profiles are loaded in sequence, so later profiles can overwrite earlier profiles.
If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is not recommended.
- `docker`: a generic configuration profile to be used with Docker
- `singularity`: a generic configuration profile to be used with Singularity
- `test`: a profile with a complete configuration for automated testing; it includes links to test data, so no other input is needed
- `institute`: replace `institute` with your institution profile name. Many institutions provide profiles (look for yours at https://github.com/nf-core/configs/tree/master/conf).
`-resume`
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. You can also supply a run name to resume a specific run: `-resume [run-name]`. Use the `nextflow log` command to show previous run names.
`-c`
Specify the path to a config file (including the input declaration config file). See the nf-core website documentation for more information.
This file specifies all inputs to the pipeline and general pipeline parameters. You can find an example input declaration here.
Core required/optional inputs are described below.

    params {
        //REQUIRED
        input_data_table = '/path/to/input.tsv' //A samplesheet file containing paths to all the cellranger and pool definition files

        //OPTIONAL
        input = 'cellbender' //Whether ambient RNA removal is skipped ('cellranger') or not ('cellbender'). Default: 'cellbender'.
        //cellbender_location='/path/to/existing/folder/nf-preprocessing/cellbender' //Uncomment and edit the path if cellbender results are already available (even partial); the cellbender step is then skipped for samples that already have results.
        existing_cellsnp="" //Path to existing cellsnp results (even partial); the cellsnp step is skipped for files that already have results.

        genotype_input {
            run_with_genotype_input=true //Whether genotype input is used (true) or not (false). If true, tsv_donor_panel_vcfs has to be specified.
            tsv_donor_panel_vcfs = "/path/to/reference/panel/vcf_inputs.tsv" //File listing VCFs with a priori known genotypes to compare the sample genotypes against
        }
    }

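If no reference genotypes are available, the genotype branch can simply be switched off. A minimal sketch of an input declaration in that case (paths are placeholders, not pipeline defaults):

```groovy
params {
    input_data_table = '/path/to/input.tsv' // samplesheet, as described above
    genotype_input {
        // No a priori genotypes: deconvolution still runs, but GT matching is skipped
        run_with_genotype_input = false
    }
}
```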
- `input_data_table`: a samplesheet file containing paths to all the cellranger and pool definition files.
- `input`: defines whether ambient RNA removal is skipped ('cellranger') or not ('cellbender'). The default value is 'cellbender'. Skipping can be useful if you can't use GPUs. For more details see Tips to avoid rerunning time-consuming parts of the pipeline below.
- `cellbender_location`: uncomment this and edit the path if cellbender results are already available (even partial results). The pipeline will skip the cellbender step for samples that already have results. For more details see Tips to avoid rerunning time-consuming parts of the pipeline below.
- `existing_cellsnp`: provide a path to cellsnp results (if they are already available, even partial results) to skip the cellsnp step for the files with results. For more details see Tips to avoid rerunning time-consuming parts of the pipeline below.
- `run_with_genotype_input`: defines whether genotype input is used (true) or not (false). If set to true, `tsv_donor_panel_vcfs` has to be specified.
- `tsv_donor_panel_vcfs`: a file containing paths to VCF files with a priori known genotypes against which the sample genotypes are compared.
This file specifies sample IDs, the number of pooled donors, IDs of individuals with a priori known genotypes, and paths to 10x files. It has to be a tab-separated file with 4 columns and a header, as shown in the example below. You can find an example samplesheet here.
| experiment_id | n_pooled | donor_vcf_ids | data_path_10x_format |
|---|---|---|---|
| Pool1 | 1 | "id3" | path/to/10x_folder |
| Pool2 | 2 | "id1,id2" | path/to/10x_folder |
Column descriptions:
`path/to/10x_folder` can contain output files from both cellranger 6 and cellranger 7. Overall, the pipeline needs the following files to run smoothly:

    10x_folder/
        possorted_genome_bam.bai
        possorted_genome_bam.bam
        raw_feature_bc_matrix/
            matrix.mtx.gz
            features.tsv.gz
            barcodes.tsv.gz
        filtered_feature_bc_matrix/
            matrix.mtx.gz
            features.tsv.gz
            barcodes.tsv.gz
        metrics_summary.csv
        web_summary.html
        molecule_info.h5

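Before launching, it can save time to confirm that each 10x folder contains the files listed above. The helper below is not part of the pipeline; it is a small pre-flight sketch assuming exactly the layout shown:

```shell
# Pre-flight check (hypothetical helper, not part of YASCP): verify that a
# cellranger output folder contains the files the pipeline expects.
check_10x() {
    dir=$1
    for f in possorted_genome_bam.bam possorted_genome_bam.bai \
             metrics_summary.csv web_summary.html molecule_info.h5 \
             raw_feature_bc_matrix/matrix.mtx.gz \
             raw_feature_bc_matrix/features.tsv.gz \
             raw_feature_bc_matrix/barcodes.tsv.gz \
             filtered_feature_bc_matrix/matrix.mtx.gz \
             filtered_feature_bc_matrix/features.tsv.gz \
             filtered_feature_bc_matrix/barcodes.tsv.gz; do
        [ -e "$dir/$f" ] || { echo "missing: $f"; return 1; }
    done
    echo "all required files present in $dir"
}
```

Run it as `check_10x /path/to/10x_folder` for every folder referenced in the samplesheet.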
You can also provide the path to this file using a flag: `--input_data_table '[path to samplesheet file]'`
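As a quick illustration (pool names, donor IDs and 10x paths below are invented placeholders), the samplesheet is just a 4-column tab-separated file and can be assembled and sanity-checked from the shell:

```shell
# Build a minimal samplesheet; all sample details are placeholders.
printf 'experiment_id\tn_pooled\tdonor_vcf_ids\tdata_path_10x_format\n' >  input.tsv
printf 'Pool1\t1\t"id3"\t/path/to/pool1/10x_folder\n'                   >> input.tsv
printf 'Pool2\t2\t"id1,id2"\t/path/to/pool2/10x_folder\n'               >> input.tsv

# Sanity check: every row must have exactly 4 tab-separated fields.
awk -F'\t' 'NF != 4 { exit 1 }' input.tsv && echo "samplesheet OK"
```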
This file contains paths to VCFs and cohort labels associated with them. A genotypesheet can be provided to the pipeline to improve sample deconvolution and detect whether the sample you have is the sample you are expecting (through GT matching). The pipeline will determine which cohort the deconvoluted sample comes from (if any).
In the following example, we have 3 cohorts: Cohort1 has genotypes for each of the chromosomes - this is acceptable, as the pipeline will use all chromosome files to identify whether the sample is part of this cohort. The other 2 cohorts have a merged VCF file for all the chromosomes. This is also acceptable, as it will determine whether the sample belongs to this cohort in one step. After evaluating all cohorts the pipeline will assign the sample to the single donor that is the most likely real match.
You can find an example genotypesheet here.
| label | vcf_file_path |
|---|---|
| Cohort1 | /full/path/to/vcf_bcf/file/in/hg38/format/without/chr/prefix/chr1.vcf.gz |
| Cohort1 | /full/path/to/vcf_bcf/file/in/hg38/format/without/chr/prefix/chr2.vcf.gz |
| … | … |
| Cohort2 | /full/path/to/vcf_bcf/file/in/hg38/format/without/chr/prefix/full_cohort2_for_all_chr.vcf.gz |
| Cohort3 | /full/path/to/vcf_bcf/file/in/hg38/format/without/chr/prefix/full_cohort2_for_all_chr.vcf.gz |
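Before a long run it can be worth checking that every VCF listed in the genotypesheet actually exists on disk. A hypothetical helper (assuming the two-column tab-separated layout shown above):

```shell
# Check a genotypesheet (label<TAB>vcf_file_path, with a header row):
# fail on the first listed VCF that does not exist on disk.
check_genotypesheet() {
    awk -F'\t' 'NR > 1 { print $2 }' "$1" | while read -r vcf; do
        [ -e "$vcf" ] || { echo "missing VCF: $vcf"; exit 1; }
    done && echo "all VCFs in $1 found"
}
```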
To avoid rerunning time-consuming steps of the pipeline, you can specify the following parameters in the input declaration config file:
You can skip the cellbender step by adding `input = 'cellranger'` to the input declaration config file. You might consider this option because the cellbender step is time-consuming and requires GPUs. The pipeline will then skip ambient RNA removal and proceed with deconvolution based on the cellranger output. For more details see the optional parameters described above.

    params {
        input_data_table = '/path/to/input.tsv' //A samplesheet file containing paths to all the cellranger and pool definition files
        input = 'cellranger'
    }

You can avoid running cellbender multiple times if you have complete or partial cellbender results. If you specify a path to the folder with cellbender results in the input declaration config file, cellbender will be run on all the samples without results.

    params {
        input_data_table = '/path/to/input.tsv' //A samplesheet file containing paths to all the cellranger and pool definition files
        cellbender_location='/full/path/to/results/nf-preprocessing/cellbender'
    }

The cellbender results folder structure should look like this:

    Sample1
    Sample2
    Sample3
    qc_cluster_input_files/
        file_paths_10x-*FPR_0pt1
        file_paths_10x-*FPR_0pt05
        file_paths_10x-*FPR_0pt01

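To see in advance which samples cellbender would still run on, you can compare the samplesheet against the results folder. A rough sketch (hypothetical helper), assuming one results subfolder per experiment_id as in the layout above:

```shell
# List experiment_ids from the samplesheet that have no subfolder in an
# existing cellbender results directory (those would be (re)processed).
missing_cellbender() {
    sheet=$1; results=$2
    awk -F'\t' 'NR > 1 { print $1 }' "$sheet" | while read -r id; do
        [ -d "$results/$id" ] || echo "$id"
    done
}
```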
You can avoid running cellsnp multiple times if you have complete or partial cellsnp results. If you specify a path to cellsnp files in the input declaration config file, cellsnp will be run only for the files that don't yet have results:

    params {
        input_data_table = '/path/to/input.tsv' //A samplesheet file containing paths to all the cellranger and pool definition files
        existing_cellsnp='/full/path/to/results/cellsnp'
    }

If you need to customise the pipeline please read custom configuration for more details.
It is a good idea to specify a pipeline version (or a checkout tag shown by `git log`) when running the pipeline on your data. This ensures that a specific version of the pipeline code and software is used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
<!-- TODO - add a description about reproducibility, something like this: currently we don't have a release.
It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the nf-core/yascp releases page and find the latest version number - numeric only (e.g. 1.3.1). Then specify this when running the pipeline with -r (one hyphen) - e.g. -r 1.3.1.
This version number will be logged in reports when you run the pipeline so that you'll know what you used when you look back in the future. -->
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow `-bg` flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
Alternatively, you can use `screen` / `tmux` or a similar tool to create a detached session which you can log back into at a later time.
Some HPC setups also allow you to run nextflow within a cluster job submitted by your job scheduler (from where it submits more jobs).
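The detach-and-log pattern itself is generic; here it is with a stand-in command (in real use, `nextflow run ... -bg` does the detaching for you and writes its log to `.nextflow.log`):

```shell
# Launch a long-running command detached from the terminal, with output
# captured to a log file (the echo/sleep command is a placeholder).
nohup sh -c 'echo "pipeline started"; sleep 1; echo "pipeline finished"' \
    > run.log 2>&1 &
wait $!     # in a real session you would log out rather than wait
cat run.log
```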
In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in `~/.bashrc` or `~/.bash_profile`):

    NXF_OPTS='-Xms1g -Xmx4g'