A pipeline for analyzing raw historical/ancient DNA obtained from a sequencing facility. Using Snakemake, it ensures efficient resource management and automated handling of software dependencies. It processes and generates reports on sequence quality and contamination, with checks specifically suited for aDNA short reads, to assess whether an extraction was successful and the sample is free of major contamination. Additionally, reads are mapped and rescaled according to their damage profiles, ready for downstream analyses. It also optionally enables comparisons of key genomic features across time points, such as transposon insertions, gene copy number changes, or endosymbiont strain replacements.
Below is an overview of the steps of the pipeline:
For detailed information about the processing steps, see the Process Overview page.
The pastForward pipeline is implemented using Snakemake. Information about the setup as well as configuration options can be found in the Setup Instructions.
For more information on Snakemake, see the Snakemake website.
The pastForward pipeline is implemented using Snakemake, a workflow management system. Snakemake ensures reproducibility and efficient execution of the pipeline.
To run the pipeline, navigate to the root directory containing the workflow/Snakefile and execute:
# minimum command to run the pipeline
#snakemake --cores <number_of_threads> --use-conda
# suggested command to run the pipeline
snakemake --cores <number_of_threads> --use-conda --keep-going --rerun-trigger mtimeReplace <number_of_threads> with the number of CPU threads you want to allocate for the pipeline.
Note:
- The
--use-condaflag enables the use of conda environments specified in theSnakefile. - The
--keep-goingflag allows the pipeline to continue even if a rule fails. Somtimes the analysis of ECMSD fails due to issues with the input data (e.g. low quality reads or low coverage). In this case, the rest of the pipeline can still be executed. - The number of threads can be adjusted using the
--coresoption when running Snakemake. - The
–rerun-trigger mtimeflag ensures that the pipeline only re-runs rules if the input files have been modified since the last run.
Other useful flags:
--dryrunor-nto simulate the execution of the pipeline without actually running it--configfile <path_to_config.yaml>to specify a custom config file--rerun-incompleteto re-run rules that failed or were cancelled in the previous run--rerun-triggerto specify which triggers to use for rerunning rules- Possible choices: code, input, mtime, params, software-env
- Define what triggers the rerunning of a job. By default, all triggers are used, which guarantees that results are consistent with the workflow code and configuration. If you rather prefer the traditional way of just considering file modification dates, use
–rerun-trigger mtime.
--touchto touch output files (mark them up to date without really changing them) instead of running their commands. This is used to pretend that the rules were executed, in order to fool future invocations of snakemake. Note that this will only touch files that would otherwise be recreated by Snakemake (e.g. because their input files are newer). For enforcing a touch, combine this with –force, –forceall, or –forcerun. Note however that you lose the provenance information when the files have been created in reality. Hence, this should be used only as a last resort.
For more information on Snakemake command-line options, see the Snakemake documentation.
Depending on the size of the data, it may take some time to complete the pipeline. Thus it is recommended to run the pipeline in the background. You can do this by running the following command:
nohup snakemake --cores 40 --use-conda --keep-going --rerun-trigger mtime > pipeline.log 2>&1 &Snakemake automatically tracks the state of the pipeline and will only re-run steps that are incomplete or outdated. If you want to restart the pipeline from the beginning, you can delete the relevant output files and re-run the pipeline.
If you want to restart the pipeline, because it has crashed or was terminated, you might need to use the --rerun-incomplete flag. This will re-run all incomplete steps, even if they have not been modified since the last run.
pastForward generates a MultiQC report for:
-
each species (including all samples from this species, to compare the results across all samples)
- Location:
{species}/results/summary/individual_level/{individual}_multiqc.html
- Location:
-
each individual sample
- Location:
{species}/results/summary/species_level/{species}_multiqc.overall.html
- Location:
The reports include a comprehensive summary of reads before and after trimming, contamination analysis, coverage analysis, deduplication and damage rescaling. The reports are essential for assessing the quality of the sequenced reads and for making decisions about the need of additional library preparation.
By leveraging the AI functionality in the MultiQC reports, you can also use AI to interpret the results of the pipeline.