CRISPRCasFinder Help [online]

FASTA Format

Definition

The first line starts with a greater-than sign ">" and contains a name or other identifier for the sequence. This is the sequence header and must be on a single line. The remaining lines contain the sequence data, which can be in upper- or lower-case letters. Anything other than letters (numbers, for example) is ignored. Multiple sequences can be present in the same file as long as each sequence has its own header.
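
The rules above can be illustrated with a minimal Python sketch of a FASTA reader. It is not the parser used by CRISPRCasFinder; the function name parse_fasta is hypothetical, and the sketch simply assumes that every non-letter character in a sequence line is dropped, as the definition states.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Keep ASCII letters only; digits, spaces and punctuation are ignored.
            chunks.append("".join(ch for ch in line if ch.isascii() and ch.isalpha()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1 demo\nACGT 123\nacgt\n>seq2\nNNNN\n"
records = parse_fasta(example)
```

Here records holds two entries, one per header, and the digits and the space on the first sequence line are discarded.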

Supported Nucleic acid code

A	→	adenosine
C	→	cytidine
G	→	guanine
T	→	thymidine
U	→	uridine
R	→	G A (purine)
Y	→	T C (pyrimidine)
K	→	G T (keto)
M	→	A C (amino)
S	→	G C (strong)
W	→	A T (weak)
B	→	G T C
D	→	G A T
H	→	A C T
V	→	G C A
N	→	A G C T (any)
-	→	gap of indeterminate length

Ns are accepted. IUB/GCG ambiguity letters (MRWSYKVHDBX) are converted to Ns. Any other characters are deleted.
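
As a rough illustration of this conversion rule, the following Python sketch maps the ambiguity letters to N and deletes everything else. It makes two assumptions that the text does not state: the sequence is upper-cased first, and U is kept unchanged alongside A, C, G, T and N. The function name normalize_sequence is hypothetical.

```python
# Letters kept as-is (keeping U is an assumption; the text does not say).
KEPT = set("ACGTUN")
# IUB/GCG ambiguity letters that are converted to N, as listed above.
AMBIGUOUS = set("MRWSYKVHDBX")


def normalize_sequence(seq):
    """Convert ambiguity letters to N and delete any other character."""
    out = []
    for ch in seq.upper():  # upper-casing is an assumption of this sketch
        if ch in KEPT:
            out.append(ch)
        elif ch in AMBIGUOUS:
            out.append("N")
        # Anything else (digits, gaps, punctuation, other letters) is deleted.
    return "".join(out)


cleaned = normalize_sequence("acgtRYK-n12x")
```

In this example the three ambiguity letters R, Y and K and the final x become N, while the gap character and the digits are removed.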

Example

The FASTA format is a plain text format which looks something like this:

>Escherichia coli UTI89|886538|887045
GTTCACTGCCGTACAGGCAGCTTAGAAA TGACGCCATATGCAGATCATTGAGGCGAAACC
GTTCACTGCCGTACAGGCAGCTTAGAAA ACGTTCGCACCGGTCAGGGTACTGCGCAGCGT
GTTCACTGCCGTACAGGCAGCTTAGAAA GAAACCAGAGCGCCCGCATAAAACAGGCACAA
GTTCACTGCCGTACAGGCAGCTTAGAAA GCCAGCATAAAACCGCCTTTGATATTTTATTG
GTTCACTGCCGTACAGGCAGCTTAGAAA TCAGCCGGAGGCTCTCAATTTCAGCCGCGCGG
GTTCACTGCCGTACAGGCAGCTTAGAAA AGCACGGCTGCGGGGAATGGCTCAATCTCTGC
GTTCACTGCCGTACAGGCAGCTTAGAAA TGATGGCGCAGCAGTCCTCCCTCCTGCCGCCA
GTTCACTGCCGTACAGGCAGCTTAGAAA CTGAACGTTGAAGAGTGCGACCGTCTCTCCTT
GTTCACTGCCGTACAGGCAGTATTCACA

CRISPR Advanced Settings

The default parameters are set to detect repeats with a high level of homology.
Some parameters defining the maximal repeat and the CRISPR properties can be modified:

- Minimal Repeat length (default value = 23; allowed numerical values between 1 and 70)
- Maximal Repeat length (default value = 55; allowed numerical values between 2 and 80)
- Allow mismatch between Repeats (default value = 1; allowed numerical values are 0 or 1)
- Minimal Spacer size as a function of Repeat size (default value = 0.6; allowed numerical values between 0.1 and 60)
- Maximal Spacer size as a function of Repeat size (default value = 2.5; allowed numerical values between 1.5 and 60)
- Maximal allowed percentage of similarity between Spacers (default value = 60; allowed numerical values between 1 and 100)
- Percentage of mismatches allowed between Repeats (default value = 20; allowed numerical values between 1 and 100)
- Percentage of mismatches allowed for the truncated Repeat (default value = 33.3; allowed numerical values between 1 and 100)
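
The defaults and allowed ranges listed above can be summarised in a small Python sketch. The parameter names and both functions are hypothetical and are not the tool's own option names. The sketch also assumes that the two spacer settings are multipliers of the repeat length, so that with the defaults a spacer must be between 0.6 and 2.5 times as long as its repeat.

```python
# name: (default, minimum, maximum), copied from the list above.
# The names themselves are illustrative only.
PARAMETER_RANGES = {
    "min_repeat_length": (23, 1, 70),
    "max_repeat_length": (55, 2, 80),
    "allow_mismatch": (1, 0, 1),
    "min_spacer_ratio": (0.6, 0.1, 60),
    "max_spacer_ratio": (2.5, 1.5, 60),
    "max_spacer_similarity_pct": (60, 1, 100),
    "repeat_mismatch_pct": (20, 1, 100),
    "truncated_repeat_mismatch_pct": (33.3, 1, 100),
}


def build_settings(**overrides):
    """Start from the defaults and apply overrides after a range check."""
    settings = {name: spec[0] for name, spec in PARAMETER_RANGES.items()}
    for name, value in overrides.items():
        if name not in PARAMETER_RANGES:
            raise KeyError(f"unknown parameter: {name}")
        _, low, high = PARAMETER_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        settings[name] = value
    return settings


def spacer_length_ok(repeat_len, spacer_len, settings):
    """Check a spacer length against the repeat-relative bounds (assumed meaning)."""
    low = settings["min_spacer_ratio"] * repeat_len
    high = settings["max_spacer_ratio"] * repeat_len
    return low <= spacer_len <= high


defaults = build_settings()
# A 28 bp repeat with a 32 bp spacer, the lengths seen in the first
# sequence line of the FASTA example above.
accepted = spacer_length_ok(28, 32, defaults)
rejected = spacer_length_ok(28, 80, defaults)
```

With the defaults, a 28 bp repeat allows spacers of roughly 17 to 70 bp, so the 32 bp spacer passes and an 80 bp spacer does not.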

CRISPR Other Settings

- The size of Flanking regions in base pairs (bp) for each analyzed CRISPR array can be modified (default value = 100; allowed numerical values between 10 and 1000).
- Alternative way to detect the truncated repeat: mismatches are searched for in the first half of the repeat flanking the array.
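
To make the flanking-region setting concrete, here is a minimal Python sketch that extracts the sequence on each side of an array. The function name and the 0-based, end-exclusive coordinates are assumptions of this sketch, as is the choice to shorten a flank when the array lies close to the end of the sequence.

```python
def flanking_regions(sequence, array_start, array_end, flank_size=100):
    """Return (left_flank, right_flank) around sequence[array_start:array_end].

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    A flank is shorter than flank_size when the array is near a sequence end.
    """
    if not 10 <= flank_size <= 1000:
        raise ValueError("flank_size must be between 10 and 1000")
    left = sequence[max(0, array_start - flank_size):array_start]
    right = sequence[array_end:array_end + flank_size]
    return left, right


# Toy sequence: 50 bp upstream, a 30 bp "array", 200 bp downstream.
toy = "A" * 50 + "C" * 30 + "G" * 200
left_flank, right_flank = flanking_regions(toy, 50, 80)
```

In this toy case the left flank is only 50 bp long because the array starts 50 bp from the beginning of the sequence, while the right flank has the full default length of 100 bp.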

CAS Settings

The "Perform CAS detection" button allows users to choose between three stringency levels to identify cas genes. The first level (General) allows a permissive search (i.e. CAS will be detected regardless of the system type or subtype). The two other levels (Typing and SubTyping) produce more stringent analyses. See the MacSyFinder documentation (http://macsyfinder.readthedocs.io/en/latest/) for further information.

The "Unordered" button allows users to perform a search for non-clustered cas genes in unordered or smaller sequences (such as contigs). This functionality uses the "-p meta" option of Prodigal and the "--db-type unordered" option of MacSyFinder.
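
For readers who want to reproduce the "Unordered" behaviour outside the web interface, the two options named above might be combined as in the following Python sketch. It only builds argument lists and runs nothing. The file names and function names are hypothetical, and the MacSyFinder command is deliberately incomplete: the arguments that select the cas models differ between MacSyFinder versions and are left out here, so this is not the exact command line that CRISPRCasFinder issues.

```python
def prodigal_command(contigs_fasta, proteins_fasta):
    """Argument list for gene prediction with Prodigal in metagenomic mode."""
    return [
        "prodigal",
        "-i", contigs_fasta,   # input nucleotide FASTA
        "-a", proteins_fasta,  # output protein translations
        "-p", "meta",          # option named in the text above
    ]


def macsyfinder_base_command(proteins_fasta):
    """Partial MacSyFinder argument list; model selection is omitted on purpose."""
    return [
        "macsyfinder",
        "--db-type", "unordered",       # option named in the text above
        "--sequence-db", proteins_fasta,
    ]


# Hypothetical file names, for illustration only.
gene_prediction = prodigal_command("contigs.fna", "proteins.faa")
cas_search = macsyfinder_base_command("proteins.faa")
```

Each list could be passed to subprocess.run once the remaining version-specific MacSyFinder arguments have been added.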

The search for CAS will return no result if the cluster of genes is not complete and therefore not functional. In such cases some cas genes may be present and will be detected by using the General clustering model.

Viewing Result

The summary displays information on CRISPR arrays and cas gene clusters in the order in which they lie along the chromosome. Direction is the proposed orientation of the CRISPR array according to the CRISPRDirection program (ND stands for Not Determined). In addition, Details shows the potential orientation of the CRISPR array based on the AT percentage in the 100 bp flanking sequences.
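
The AT percentage of a flanking sequence is straightforward to compute, as the following Python sketch shows. How the tool turns the two percentages into an orientation is not described here, so the comparison function below rests on a stated assumption: the AT-richer flank is simply reported as the candidate side. Both function names are hypothetical.

```python
def at_percentage(seq):
    """Percentage of A and T among the unambiguous bases (A, C, G, T) of seq."""
    counted = [ch for ch in seq.upper() if ch in "ACGT"]
    if not counted:
        return 0.0
    at_count = sum(1 for ch in counted if ch in "AT")
    return 100.0 * at_count / len(counted)


def at_richer_side(left_flank, right_flank):
    """Return "left" or "right" for the AT-richer flank, or None on a tie.

    Treating the AT-richer flank as the informative side is an assumption
    of this sketch, not a rule stated in the help text.
    """
    left_at = at_percentage(left_flank)
    right_at = at_percentage(right_flank)
    if left_at > right_at:
        return "left"
    if right_at > left_at:
        return "right"
    return None


half = at_percentage("ATGC")
side = at_richer_side("AATTAT", "GGCCAT")
```

Ambiguous bases such as N are excluded from the denominator so that they do not bias the percentage in either direction.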

"Conservation DR" corresponds to the EBcons (Entropy-Based conservation) of repeats as described in the related manuscript (Couvin et al., NAR 2018).
"Conservation Spacer" indicates the conservation of spacers based on BioPerl's overall percentage identity (see the publication for more details).