Comprehensive command-line interface reference
The kevlar command-line interface is designed around a single command, kevlar.
From this one command, a variety of tasks and procedures can be invoked using several subcommands.
Once kevlar is installed, available subcommands can be listed by executing kevlar -h.
To see instructions for running a specific subcommand, execute kevlar <subcommand> -h (replacing <subcommand> with the actual name of the subcommand).
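As a small illustration of the help convention, the sketch below prints the help command for each subcommand documented on this page. It is a dry run that only echoes the commands; the list of names is taken from the section titles here and may not match the subcommands available in a given kevlar installation.

```shell
# Dry run: print the help command for each subcommand documented on this page.
# Remove the "echo" to actually run the commands (requires kevlar on the PATH).
subcommands="count novel filter partition assemble localize call simlike alac unband augment mutate"

print_help_commands() {
    for sub in $subcommands; do
        echo "kevlar $sub -h"
    done
}

print_help_commands
```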
kevlar count
Compute k-mer abundances for the provided sample. Supports k-mer banding:
see http://kevlar.readthedocs.io/en/latest/banding.html for more details.
usage: kevlar count [-h] [-k K] [-c C] [-M MEM] [--max-fpr FPR] [--mask MSK]
[--count-masked] [--num-bands N] [--band I] [-t T]
counttable seqfile [seqfile ...]
Positional Arguments
counttable |
name of the file to which the output (a k-mer count table) will be written; the suffix “.counttable” will be applied if the provided file name does not end in “.ct” or “.counttable” |
seqfile |
input files in Fastq/Fasta format |
Named Arguments
-k, --ksize |
k-mer size; default is 31 |
-c, --counter-size |
number of bits to allocate for counting each k-mer; possible choices are 1 (max count: 1), 4 (max count: 15), and 8 (max count: 255); default is 8 |
-M, --memory |
memory to allocate for the count table |
--max-fpr |
terminate if the estimated false positive rate for any sample is higher than “FPR”; default is 0.2 |
--mask |
counttable or nodetable of k-mers to ignore when counting k-mers |
--count-masked |
by default, when a mask is provided k-mers in the mask are ignored; this setting inverts the behavior so that only k-mers in the mask are counted |
--num-bands |
number of bands into which to divide the hashed k-mer space |
--band |
a number between 1 and N (inclusive) indicating the band to be processed |
-t, --threads |
number of threads to use for file processing; default is 1 |
Example:
kevlar count --memory 500M case1.ct case1-reads.fastq
Example:
kevlar count --ksize 25 --memory 12G --max-fpr 0.01 --threads 8 \
proband.counttable \
proband-R1.fq.gz proband-R2.fq.gz proband-unpaired.fq.gz
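The maximum counts listed for --counter-size are consistent with an unsigned counter of the given width, which can hold values up to 2^bits - 1. The sketch below only reproduces that arithmetic; the function name max_count is hypothetical, and nothing here calls kevlar.

```shell
# Largest value an unsigned counter of the given bit width can hold: 2^bits - 1.
max_count() {
    echo $(( (1 << $1) - 1 ))
}

# Print the ceiling for each counter size accepted by "kevlar count".
for bits in 1 4 8; do
    echo "--counter-size $bits -> max count $(max_count "$bits")"
done
```

A k-mer seen more often than the ceiling is presumably recorded at the ceiling, so a smaller counter trades resolution at high abundance for memory.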
kevlar novel
Identify “interesting” (potentially novel) k-mers and output the
corresponding reads. Here we define “interesting” k-mers as those which are
high abundance in each case sample and effectively absent (below some
specified abundance threshold) in each control sample.
usage: kevlar novel --case F [F ...] [--case-counts F [F ...]]
[--control F [F ...]] [--control-counts F [F ...]] [-x X]
[-y Y] [-M MEM] [--max-fpr FPR] [--num-bands N] [--band I]
[-o FILE] [--save-case-counts CT [CT ...]]
[--save-ctrl-counts CT [CT ...]] [-h] [-k K]
[--abund-screen INT] [-t T] [--skip-until ID]
Case/control config
Specify input files, as well as thresholds for selecting “interesting”
k-mers. A single pass is made over input files for control samples (to
compute k-mer abundances), while two passes are made over input files for
case samples (to compute k-mer abundances, and then to identify
“interesting” k-mers). The k-mer abundance computing steps can be skipped
if pre-computed k-mer abundances are provided using the “--case-counts”
and/or “--control-counts” settings. If “--control-counts” is declared, then
all “--control” flags are ignored. If “--case-counts” is declared,
FASTA/FASTQ files must still be provided with “--case” for selecting
“interesting” k-mers and reads.
--case |
one or more FASTA/FASTQ files containing reads from a case sample; can be declared multiple times corresponding to multiple case samples, see examples below |
--case-counts |
counttable file(s) corresponding to each case sample; if not provided, k-mer abundances will be computed from FASTA/FASTQ input; only one counttable per sample, see examples below |
--control |
one or more FASTA/FASTQ files containing reads from a control sample; can be declared multiple times corresponding to multiple control samples, see examples below |
--control-counts |
counttable file(s) corresponding to each control sample; if not provided, k-mer abundances will be computed from FASTA/FASTQ input; only one counttable per sample, see examples below |
-x, --ctrl-max |
k-mers with abund > X in any control sample are uninteresting; default is X=1 |
-y, --case-min |
k-mers with abund < Y in any case sample are uninteresting; default is Y=6 |
-M, --memory |
total memory allocated to k-mer abundance for each sample; default is 1M; ignored when pre-computed k-mer abundances are supplied via counttable |
--max-fpr |
terminate if the expected false positive rate for any sample is higher than the specified FPR; default is 0.2 |
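The two thresholds combine as follows: a k-mer stays "interesting" only if its abundance is at least Y in the case sample and at most X in every control sample. The sketch below restates that rule for a single case sample; the function name is_interesting and the abundance values are hypothetical, and this is not kevlar's implementation.

```shell
# is_interesting X Y CASE_ABUND CTRL_ABUND...
# Succeeds (exit status 0) only if the case abundance is at least Y
# and every control abundance is at most X.
is_interesting() {
    ctrl_max=$1
    case_min=$2
    case_abund=$3
    shift 3
    if [ "$case_abund" -lt "$case_min" ]; then
        return 1
    fi
    for ctrl_abund in "$@"; do
        if [ "$ctrl_abund" -gt "$ctrl_max" ]; then
            return 1
        fi
    done
    return 0
}

# Defaults from this page: X=1, Y=6; one case sample and two control samples.
if is_interesting 1 6 12 0 1; then echo "interesting"; else echo "uninteresting"; fi
if is_interesting 1 6 12 0 3; then echo "interesting"; else echo "uninteresting"; fi
```

With several case samples, the same case-side check would have to pass for each of them.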
K-mer banding
If memory is a limiting factor, it is possible to get a linear decrease in
memory consumption by running kevlar novel in “banded” mode. Splitting
the hashed k-mer space into N bands and only considering k-mers from one
band at a time reduces the memory consumption to approximately 1/N of the
total memory required. This implements a scatter/gather approach in which
kevlar novel is run N times, after which the results are combined using
kevlar filter.
--num-bands |
number of bands into which to divide the hashed k-mer space |
--band |
a number between 1 and N (inclusive) indicating the band to be processed |
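A banded run might be scripted roughly as below. This is a dry run that prints the commands instead of executing them, and all file names are hypothetical. Because this page describes kevlar unband as the command that consolidates per-band files, and kevlar filter as taking a single augmented Fastq file, the sketch assumes unband is run before filter; treat that ordering as an assumption rather than a documented recipe.

```shell
# Dry run: print one "kevlar novel" command per band, then the steps that
# combine the per-band outputs. File names are hypothetical.
# Remove the "echo" to execute the commands (requires kevlar on the PATH).
num_bands=4

banded_commands() {
    band=1
    outfiles=""
    while [ "$band" -le "$num_bands" ]; do
        out="novel.band${band}.augfastq"
        echo "kevlar novel --case proband-reads.fq.gz --control father-reads.fq.gz --control mother-reads.fq.gz --num-bands $num_bands --band $band --out $out"
        outfiles="$outfiles $out"
        band=$((band + 1))
    done
    # Merge duplicated reads across bands, then filter the merged set.
    echo "kevlar unband --out novel.all.augfastq$outfiles"
    echo "kevlar filter --out novel.filtered.augfastq novel.all.augfastq"
}

banded_commands
```

Each band is independent of the others, so the per-band commands can run one after another on a single machine or side by side on several.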
Output settings
-o, --out |
file to which interesting reads will be written; default is terminal (stdout) |
--save-case-counts |
save the computed k-mer counts for each case sample to the specified count table file(s) |
--save-ctrl-counts |
save the computed k-mer counts for each control sample to the specified count table file(s) |
Miscellaneous settings
-k, --ksize |
k-mer size; default is 31 |
--abund-screen |
discard reads with any k-mers whose abundance is < INT |
-t, --threads |
number of threads to use for file processing; default is 1 |
--skip-until |
when re-running kevlar novel, skip all reads in the case input until read with name ID is observed |
Example:
kevlar novel --out novel-reads.augfastq --case proband-reads.fq.gz \
--control father-reads-r1.fq.gz father-reads-r2.fq.gz \
--control mother-reads.fq.gz
Example:
kevlar novel --out novel-reads.augfastq.gz \
--control-counts father.counttable mother.counttable \
--case-counts proband.counttable --case proband-reads.fastq \
--ctrl-max 0 --case-min 10 --ksize 27
Example:
kevlar novel --out output.augfastq \
--case proband1.fq --case proband2.fq \
--control control1a.fq control1b.fq \
--control control2a.fq control2b.fq \
--save-case-counts p1.ct p2.ct --save-ctrl-counts c1.ct c2.ct
kevlar filter
Discard k-mers and reads that are contaminant in origin or whose abundances
were inflated during the preliminary k-mer counting stage.
usage: kevlar filter [-h] [-M MEM] [--max-fpr FPR] [--mask MSK] [-x X] [-y Y]
[-o FILE]
augfastq
Positional Arguments
augfastq |
putatively novel reads in augmented Fastq format |
Named Arguments
-M, --memory |
memory to allocate for the k-mer re-counting |
--max-fpr |
terminate early if the estimated false positive rate for re-computed k-mer abundances is higher than “FPR”; default is 0.01 |
--mask |
counttable or nodetable of k-mers to ignore when re-counting k-mers |
-x, --ctrl-max |
k-mers with abund > X in any control sample are uninteresting; default is X=1 |
-y, --case-min |
k-mers with abund < Y in any case sample are uninteresting; default is Y=6 |
-o, --out |
output file; default is terminal (stdout) |
kevlar partition
Construct a graph to group reads by shared interesting k-mers. Nodes in the
graph represent reads, and edges between a pair of nodes indicate that the
two corresponding reads have one or more interesting k-mers in common.
Connected components in the undirected graph correspond to distinct
variants (or variant-related breakpoints).
usage: kevlar partition [-h] [-s] [--min-abund X] [--max-abund Y] [--no-dedup]
[--gml FILE] [--split OUTPREFIX] [-o FILE]
infile
Positional Arguments
infile |
input reads in augmented Fast[q|a] format |
Named Arguments
-s, --strict |
require perfect identity between overlapping reads for inclusion in the same partition; by default, only a shared interesting k-mer is required |
--min-abund |
ignore k-mers with abundance lower than X; default is 2 |
--max-abund |
ignore k-mers with abundance higher than Y; default is 200 |
--no-dedup |
skip step to remove duplicates |
--gml |
write read graph to .gml file |
--split |
write each partition to a separate output file, each with a filename like “OUTPREFIX.cc#.augfastq.gz” |
-o, --out |
output file; default is terminal (stdout) |
kevlar assemble
Assemble reads into contigs representing putative variants.
usage: kevlar assemble [-h] [-p ID] [--max-reads N] [-o FILE] augfastq
Positional Arguments
augfastq |
annotated reads in augmented Fastq/Fasta format |
Named Arguments
-p, --part-id |
only assemble partition “ID” in the input |
--max-reads |
do not attempt to assemble any partitions with more than N reads (default: 10000) |
-o, --out |
output file; default is terminal (stdout) |
kevlar localize
For each partition, compute the reference target sequence to use for
variant calling. NOTE: this command relies on the bwa program being in
the PATH environment variable.
usage: kevlar localize [-h] [-d Δ] [-p ID] [-o FILE] [-z Z] [-x X]
[--include REGEX] [--exclude REGEX]
refr contigs [contigs ...]
Positional Arguments
refr |
BWA indexed reference genome |
contigs |
assembled reads in augmented Fasta format |
Named Arguments
-d, --delta |
retrieve the genomic interval from the reference by extending beyond the span of all k-mer starting positions by Δ bp |
-p, --part-id |
only localize partition “ID” in the input |
-o, --out |
output file; default is terminal (stdout) |
-z, --seed-size |
seed size; default is 51 |
-x, --max-diff |
split and report multiple reference targets if the distance between two seed matches is > X; by default, X is set dynamically for each partition and is equal to 3 times the length of the longest contig in the partition; each resulting bin specifies a reference target sequence to which assembled contigs will subsequently be aligned |
--include |
discard alignments to any chromosomes whose sequence IDs do not match the given pattern |
--exclude |
discard alignments to any chromosomes whose sequence IDs match the given pattern |
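The dynamic default for --max-diff can be reproduced by hand: take the length of the longest contig in the partition and multiply by 3. The sketch below does only that arithmetic; the function name default_max_diff and the contig lengths are hypothetical.

```shell
# default_max_diff LEN...
# Prints 3 times the largest of the given contig lengths (in bp).
default_max_diff() {
    longest=0
    for len in "$@"; do
        if [ "$len" -gt "$longest" ]; then
            longest=$len
        fi
    done
    echo $((3 * longest))
}

# A partition with three contigs of 180, 240, and 210 bp.
default_max_diff 180 240 210   # prints 720
```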
kevlar call
Align variant-related reads to the reference genome and call the variant
from the alignment.
usage: kevlar call [-A A] [-B B] [-O O] [-E E] [--gen-mask FILE]
[--mask-mem MEM] [--mask-max-fpr FPR] [-h] [-d]
[--no-homopoly-filter] [--max-target-length L]
[--refr FILE] [-o FILE] [-k K]
queryseq targetseq
Alignment scoring
-A, --match |
match score; default is 1 |
-B, --mismatch |
mismatch penalty; default is 2 |
-O, --open |
gap open penalty; default is 5 |
-E, --extend |
gap extension penalty; default is 0 |
Mask generation settings
--gen-mask |
generate a nodetable containing all k-mers that span any variant call |
--mask-mem |
memory to allocate for the node table |
--mask-max-fpr |
terminate if the estimated false positive rate is higher than “FPR”; default is 0.01 |
Miscellaneous settings
-d, --debug |
show debugging output |
--no-homopoly-filter |
by default, short indels adjacent to homopolymers are filtered out; use this flag to disable that filter |
--max-target-length |
do not attempt to call variants if the target genomic sequence is > L bp; by default, L=10000 |
--refr |
reference genome indexed for BWA search; if provided, mates of interesting reads will be used to disambiguate multi-mapping contigs |
-o, --out |
output file; default is terminal (stdout) |
-k, --ksize |
k-mer size; default is 31 |
kevlar simlike
Sort variants by likelihood score
usage: kevlar simlike --case CT --controls CT [CT ...] --refr REFR
[--ctrl-max X] [--case-min Y] [--mu μ] [--sigma σ]
[--epsilon ε] [--ctrl-abund-high H] [--case-abund-low L]
[--min-like-score S] [--drop-outliers]
[--ambig-thresh A] [-h] [--sample-labels LBL [LBL ...]]
[-f] [-o OUT]
vcf [vcf ...]
K-mer count files
Likelihood scores are based on the abundance of alternate allele k-mers in each sample and on the abundance of reference allele k-mers in the reference genome.
--case |
k-mer counttable for case/proband |
--controls |
k-mer counttables for controls/parents/siblings, 1 counttable per sample |
--refr |
k-mer smallcounttable for reference genome |
K-mer count thresholds
The thresholds originally used to detect novel k-mers are used at this stage to distinguish true variants from spurious predictions.
--ctrl-max |
maximum abundance threshold for controls; default is 1 |
--case-min |
minimum abundance threshold for proband; default is 6 |
K-mer coverage
Likelihood scores also depend on the estimated or observed distribution of k-mer abundances in each sample.
--mu |
mean k-mer abundance; default is 30.0 |
--sigma |
standard deviation of k-mer abundance; default is 8.0 |
--epsilon |
error rate; default is 0.001 |
Heuristic filters
The following heuristic filters can improve accuracy when calling de novo variants but may require tuning for your particular data set.
--ctrl-abund-high |
a variant call will be filtered out if either of the control samples has >H high abundance k-mers spanning the variant (where high abundance means > --ctrl-max); by default, H=4; set H<=0 to disable the filter |
--case-abund-low |
a variant call will be filtered out if the case sample has L or more consecutive low abundance k-mers spanning the variant (where low abundance means < --case-min); by default, L=5; set L<=0 to disable the filter |
--min-like-score |
filter out variant predictions with likelihood scores < S; by default, S = 0.0, but it’s often possible to improve specificity without sacrificing sensitivity by raising S to, for example, 50.0 |
--drop-outliers |
discard terminal variant-spanning k-mers with abundance much higher than average (representing k-mers that should be in the reference genome but are not); this will increase sensitivity, but will potentially introduce many false calls as well |
--ambig-thresh |
discard contigs that result in > A distinct, equally optimal variant calls; by default, A = 10; set A=0 to disable this filter |
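To make the --case-abund-low rule concrete, the sketch below finds the longest run of consecutive low-abundance k-mers in a list of case abundances and compares it against L. The function name longest_low_run and the abundance values are hypothetical; this illustrates the rule as worded above and is not kevlar's code.

```shell
# longest_low_run Y ABUND...
# Prints the length of the longest run of consecutive abundances below Y.
longest_low_run() {
    case_min=$1
    shift
    run=0
    longest=0
    for abund in "$@"; do
        if [ "$abund" -lt "$case_min" ]; then
            run=$((run + 1))
            if [ "$run" -gt "$longest" ]; then
                longest=$run
            fi
        else
            run=0
        fi
    done
    echo "$longest"
}

# Hypothetical case abundances of the k-mers spanning one variant; Y=6, L=5.
low_run=$(longest_low_run 6 31 28 4 3 5 2 1 30 29)
if [ "$low_run" -ge 5 ]; then
    echo "filtered (run of $low_run)"
else
    echo "kept (run of $low_run)"
fi
```

The --ctrl-abund-high filter is a plain count rather than a run length: it looks at how many spanning k-mers exceed --ctrl-max in a control sample, whether or not they are adjacent.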
Miscellaneous settings
--sample-labels |
list of sample labels (with case/proband first) |
-f, --fast-mode |
whenever possible, stop computations prematurely for any putative variants that have already been filtered out |
-o, --out |
output file; default is terminal (standard output) |
kevlar alac
Assemble partitioned reads, localize to the reference genome, align the
assembled contig to the reference, and call variants.
usage: kevlar alac [-p ID] [--max-reads N] [-z Z] [-d D] [-x X]
[--include REGEX] [--exclude REGEX] [--max-target-length L]
[-A A] [-B B] [-O O] [-E E] [--gen-mask FILE]
[--mask-mem MEM] [--mask-max-fpr FPR] [-h] [-o FILE] [-i I]
[-k K] [-t T]
infile refr
Read assembly
-p, --part-id |
only process partition “ID” in the input |
--max-reads |
do not attempt to assemble any partitions with more than N reads (default: 10000) |
Alignment scoring
-A, --match |
match score; default is 1 |
-B, --mismatch |
mismatch penalty; default is 2 |
-O, --open |
gap open penalty; default is 5 |
-E, --extend |
gap extension penalty; default is 0 |
Mask generation settings
--gen-mask |
generate a nodetable containing all k-mers that span any variant call |
--mask-mem |
memory to allocate for the node table |
--mask-max-fpr |
terminate if the estimated false positive rate is higher than “FPR”; default is 0.01 |
Miscellaneous settings
-o, --out |
output file; default is terminal (stdout) |
-i, --min-ikmers |
do not report calls that are supported by fewer than I interesting k-mers |
-k, --ksize |
k-mer size; default is 31 |
-t, --threads |
process T partitions at a time using T threads |
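Taken together, the input and output descriptions on this page suggest one possible ordering of the subcommands for a single trio. The sketch below prints that sequence as a dry run. All file names are hypothetical, the ordering is inferred from this page rather than stated by it, and the sketch assumes that kevlar alac writes the VCF that kevlar simlike reads and that the alac reference is BWA-indexed as described for kevlar localize.

```shell
# Dry run: print one plausible end-to-end sequence of kevlar commands.
# File names are hypothetical; copy the printed lines to run them.
pipeline_commands() {
    cat <<'EOF'
kevlar novel --case proband.fq.gz --control father.fq.gz --control mother.fq.gz --out novel.augfastq
kevlar filter --out filtered.augfastq novel.augfastq
kevlar partition --out partitioned.augfastq filtered.augfastq
kevlar alac --out calls.vcf partitioned.augfastq refr.fa
kevlar simlike --case proband.counttable --controls father.counttable mother.counttable --refr refr.smallcounttable --out calls.scored.vcf calls.vcf
EOF
}

pipeline_commands
```

The count tables passed to kevlar simlike could come from kevlar count, or from the --save-case-counts and --save-ctrl-counts options of kevlar novel.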
kevlar unband
When kevlar is run in k-mer banding mode, the same read will typically
appear in multiple output files, annotated with a different set of
potentially novel k-mers in each case. This command will consolidate these
duplicated records across files into a single non-redundant set of reads
with the complete set of novel k-mer annotations.
usage: kevlar unband [-h] [-n N] [-o FILE] infile [infile ...]
Positional Arguments
infile |
input files in augmented Fasta/Fastq format |
Named Arguments
-n, --n-batches |
number of batches into which records will be split; default is 16; N temporary files are created and each record from the input is written to a temporary file based on its read name; then each batch is loaded into memory and duplicate records are resolved |
-o, --out |
output file; default is terminal (stdout) |
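The batching idea can be sketched as follows: derive a number from the read name and reduce it modulo N, so that every copy of a read lands in the same batch regardless of which input file it came from. The function name batch_for_read is hypothetical, and cksum stands in for whatever hash kevlar actually uses, which this page does not specify.

```shell
# batch_for_read NAME N
# Prints a batch index in the range 0..N-1 that depends only on the read name.
batch_for_read() {
    # cksum prints "<crc> <byte count>"; keep only the CRC.
    crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    echo $((crc % $2))
}

# The same read name always maps to the same batch.
batch_for_read "read42/1" 16
batch_for_read "read42/1" 16
```

Because duplicates of a read share a batch, each batch can be resolved in memory on its own, which keeps peak memory to roughly 1/N of the full input.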
kevlar augment
Internally, kevlar annotates sequences with “interesting k-mers” and uses
“augmented” Fastq and Fasta formats. Processing sequences with third-party
tools usually requires discarding these annotations. This command is used
to augment/reaugment a set of sequences using annotations from an already
augmented sequence file.
usage: kevlar augment [-h] [-o FILE] augseqs seqs
Positional Arguments
augseqs |
augmented sequence file |
seqs |
sequences to annotate |
Named Arguments
-o, --out |
output file; default is terminal (stdout) |
kevlar mutate
Apply the specified mutations to the genome provided.
usage: kevlar mutate [-h] [-o FILE] mutations genome
Positional Arguments
mutations |
mutations file |
genome |
genome to mutate |
Named Arguments
-o, --out |
output file; default is terminal (stdout) |