The first line starts with a greater-than sign (">") and contains a name or other identifier for the sequence. This is the sequence header, and it must be on a single line. The remaining lines contain the sequence data, in upper or lower case letters. Anything other than letters (numbers, for example) is ignored. Multiple sequences can be present in the same file as long as each sequence has its own header.
| Code | Meaning |
|---|---|
| R | G A (purine) |
| Y | T C (pyrimidine) |
| K | G T (keto) |
| M | A C (amino) |
| S | G C (strong) |
| W | A T (weak) |
| B | G T C |
| D | G A T |
| H | A C T |
| V | G C A |
| N | A G C T (any) |
| - | gap of indeterminate length |
Ns are accepted. IUB/GCG ambiguity letters (MRWSYKVHDBX) are converted to Ns. Any other characters are deleted.
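The conversion rule above can be sketched in a few lines of Python. This is only an illustration of the stated behaviour, not the tool's own code: the function name `clean_sequence` is hypothetical, and converting the input to upper case before filtering is an assumption.

```python
# Ambiguity letters listed in the text; each one is replaced by N.
AMBIGUOUS = set("MRWSYKVHDBX")
# Letters that are kept unchanged.
KEPT = set("ACGTN")


def clean_sequence(raw: str) -> str:
    """Keep A/C/G/T/N, turn ambiguity letters into N, drop everything else.

    Assumption: the input is upper-cased first, so lower-case input is accepted.
    """
    cleaned = []
    for ch in raw.upper():
        if ch in KEPT:
            cleaned.append(ch)
        elif ch in AMBIGUOUS:
            cleaned.append("N")
        # Any other character (digits, spaces, gap symbols, other letters)
        # is silently dropped.
    return "".join(cleaned)
```

For instance, `clean_sequence("acgtRYn-12 z")` keeps the four bases, turns `R`, `Y` and `n` into `N`, and drops the gap, the digits, the space and the `z`.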
The FASTA format is a plain text format which looks something like this:
>Escherichia coli UTI89|886538|887045
GTTCACTGCCGTACAGGCAGCTTAGAAA
TGACGCCATATGCAGATCATTGAGGCGAAACC
GTTCACTGCCGTACAGGCAGCTTAGAAA
ACGTTCGCACCGGTCAGGGTACTGCGCAGCGT
GTTCACTGCCGTACAGGCAGCTTAGAAA
GAAACCAGAGCGCCCGCATAAAACAGGCACAA
GTTCACTGCCGTACAGGCAGCTTAGAAA
GCCAGCATAAAACCGCCTTTGATATTTTATTG
GTTCACTGCCGTACAGGCAGCTTAGAAA
TCAGCCGGAGGCTCTCAATTTCAGCCGCGCGG
GTTCACTGCCGTACAGGCAGCTTAGAAA
AGCACGGCTGCGGGGAATGGCTCAATCTCTGC
GTTCACTGCCGTACAGGCAGCTTAGAAA
TGATGGCGCAGCAGTCCTCCCTCCTGCCGCCA
GTTCACTGCCGTACAGGCAGCTTAGAAA
CTGAACGTTGAAGAGTGCGACCGTCTCTCCTT
GTTCACTGCCGTACAGGCAGTATTCACA
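To check a file before submitting it, the layout described above (one header line starting with ">", then sequence lines, possibly several records per file) can be read with a short parser. The following is a minimal sketch, not the parser used by the tool; the function name `parse_fasta` is hypothetical, and a repeated header simply overwrites the earlier record.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Return a mapping from header (without '>') to its joined sequence."""
    records: Dict[str, List[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: the header is the rest of this single line.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence data: keep letters only, ignore digits and other symbols.
            records[header].append("".join(c for c in line if c.isalpha()))
    return {h: "".join(parts) for h, parts in records.items()}
```

Lines that appear before the first header are ignored in this sketch, since they cannot be attributed to any sequence.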
The default parameters have been set to detect repeats with a high level of homology.
It is possible to modify some parameters defining the maximal repeat and the CRISPR properties.
- Minimal Repeat length (default value = 23; allowed numerical values between 1 and 70),
- Maximal Repeat length (default value = 55; allowed numerical values between 2 and 80),
- Allow mismatch between repeats (default value = 1; allowed numerical values are 1 or 0),
- Minimal Spacers size as a function of Repeat size (default value = 0.6; allowed numerical values between 0.1 and 60),
- Maximal Spacers size as a function of Repeat size (default value = 2.5; allowed numerical values between 1.5 and 60),
- Maximal allowed percentage of similarity between Spacers (default value = 60; allowed numerical values between 1 and 100),
- Percentage mismatches allowed between Repeats (default value = 20; allowed numerical values between 1 and 100),
- Percentage mismatches allowed for truncated Repeat (default value = 33.3; allowed numerical values between 1 and 100),
- Size of Flanking regions in base pairs (bp) for each analyzed CRISPR array (default value = 100; allowed numerical values between 10 and 1000),
- Alternative way to detect the truncated repeat. Mismatches are searched in the first half of the repeat flanking the array.
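The spacer-size and mismatch parameters above are expressed relative to the repeat, which is easier to see with numbers. The sketch below assumes that the spacer ratios multiply the repeat length and that the mismatch percentage is rounded down to a whole number of positions; neither detail is stated on this page. The names `ArrayParams`, `spacer_length_bounds` and `max_repeat_mismatches` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ArrayParams:
    # Default values taken from the parameter list above.
    min_repeat_len: int = 23
    max_repeat_len: int = 55
    min_spacer_ratio: float = 0.6
    max_spacer_ratio: float = 2.5
    repeat_mismatch_pct: float = 20.0


def spacer_length_bounds(repeat_len: int,
                         params: Optional[ArrayParams] = None) -> Tuple[float, float]:
    """Spacer length range for a repeat (assumption: ratio times repeat length)."""
    p = params or ArrayParams()
    if not p.min_repeat_len <= repeat_len <= p.max_repeat_len:
        raise ValueError("repeat length outside the configured range")
    return (p.min_spacer_ratio * repeat_len, p.max_spacer_ratio * repeat_len)


def max_repeat_mismatches(repeat_len: int,
                          params: Optional[ArrayParams] = None) -> int:
    """Mismatching positions tolerated between repeats (assumption: rounded down)."""
    p = params or ArrayParams()
    return int(repeat_len * p.repeat_mismatch_pct / 100)
```

Under these assumptions and the default values, a 28 bp repeat such as the one in the example record would accept spacers between roughly 16.8 and 70 bp, and up to 5 mismatching positions between repeats.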
The "Perform CAS detection" button allows users to choose between three stringency levels to identify cas genes. The first level (General) allows a permissive search (i.e. CAS will be detected whatever the system type or subtype). The two other levels (Typing and SubTyping) produce more stringent analyses. See MacSyFinder documentation (http://macsyfinder.readthedocs.io/en/latest/) for further information.
The "Unordered" button allows users to search for non-clustered cas genes in unordered or smaller sequences (such as contigs). This functionality uses the "-p meta" option of Prodigal and the "--db-type unordered" option of MacSyFinder.
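For readers who want to reproduce the "Unordered" mode locally, the two options named above could be passed roughly as follows. This sketch only builds argument lists and does not run anything. The file names are placeholders, and the MacSyFinder arguments that select model definitions and profiles are deliberately left out because they depend on the installed version; the exact invocation used by the web server is not documented here.

```python
from typing import List


def prodigal_meta_command(contigs_fasta: str, proteins_faa: str) -> List[str]:
    """Prodigal gene prediction in metagenomic mode ("-p meta").

    -i is the input nucleotide FASTA, -a the output protein translations.
    """
    return ["prodigal", "-i", contigs_fasta, "-a", proteins_faa, "-p", "meta"]


def macsyfinder_unordered_args(proteins_faa: str) -> List[str]:
    """Leading MacSyFinder arguments for an unordered protein set.

    Incomplete on purpose: model and profile arguments are version-dependent.
    """
    return ["macsyfinder", "--db-type", "unordered", "--sequence-db", proteins_faa]
```

The lists can be handed to `subprocess.run` once the missing MacSyFinder arguments have been filled in from its documentation.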
The summary displays information on CRISPR arrays and cas gene clusters in the order in which they lie along the chromosome. Direction is the proposed orientation of the CRISPR array according to the CRISPRDirection program (ND stands for Not Determined). Details additionally shows the potential orientation of the CRISPR array based on the AT percentage in the 100 bp flanking sequences.
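The AT percentage of the two flanking regions can be computed as in the sketch below. It only reports the two values; how they are turned into an orientation call is not described on this page, so no decision rule is included. The function names are hypothetical, coordinates are assumed to be 0-based with an exclusive end, and only A, C, G and T are counted.

```python
from typing import Tuple


def at_percentage(seq: str) -> float:
    """Percentage of A and T among the unambiguous bases of seq."""
    counted = [c for c in seq.upper() if c in "ACGT"]
    if not counted:
        return 0.0
    at = sum(1 for c in counted if c in "AT")
    return 100.0 * at / len(counted)


def flank_at_content(genome: str, start: int, end: int,
                     flank: int = 100) -> Tuple[float, float]:
    """AT percentage of the left and right flanks of the array genome[start:end]."""
    left = genome[max(0, start - flank):start]   # clipped at the sequence start
    right = genome[end:end + flank]              # clipped at the sequence end
    return at_percentage(left), at_percentage(right)
```

The `flank` argument corresponds to the flanking-region size parameter, whose default is 100 bp.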
"Conservation DR" corresponds to the EBcons (Entropy-Based conservation) of repeats as described in the related manuscript (Couvin et al., NAR 2018).
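To give an intuition for what an entropy-based conservation score measures, the sketch below computes a generic Shannon-entropy score over the columns of equal-length repeats. It is not the EBcons formula from the manuscript, which is not reproduced on this page: the normalisation by the maximum entropy of four bases, the averaging over columns and the name `entropy_conservation` are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def entropy_conservation(repeats: List[str]) -> float:
    """Mean per-column conservation in percent (100 = all repeats identical)."""
    if not repeats or not repeats[0]:
        raise ValueError("at least one non-empty repeat is required")
    length = len(repeats[0])
    if any(len(r) != length for r in repeats):
        raise ValueError("repeats must have equal length in this sketch")
    max_entropy = math.log2(4)  # four possible bases per column
    scores = []
    for column in zip(*repeats):
        counts = Counter(c.upper() for c in column)
        total = len(column)
        # Shannon entropy of the base distribution in this column.
        h = -sum((n / total) * math.log2(n / total) for n in counts.values())
        # Cap at max_entropy in case ambiguity letters add extra symbols.
        scores.append(1.0 - min(h, max_entropy) / max_entropy)
    return 100.0 * sum(scores) / len(scores)
```

With this definition, identical repeats score 100, and a column split evenly between two bases contributes half of a fully conserved column.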
"Conservation Spacer" indicates the conservation of spacers based on BioPerl's overall percentage identity (see the publication for more details).