A maximal repeat is a repeat with no possible extension to the right or the left without incurring a mismatch.
Maximal repeats have interesting computational properties since they can be computed in linear time using
a suffix-tree-based algorithm and their number is linear (at most equal to the sequence length).
A CRISPR structure is a succession of maximal repeats (the direct repeats) separated by the spacers.
CRISPRFinder uses this property to find possible locations of CRISPRs.
Maximal repeats are found with VMatch,
the successor of REPuter (Kurtz 1999),
which is based on an efficient implementation of enhanced suffix arrays (Abouelhoda 2004).
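The definition of a maximal repeat given above can be illustrated with a small sketch. It is a naive quadratic scan written only for illustration, not the suffix-tree or enhanced-suffix-array method used by VMatch. It reports exact repeated pairs only and allows the two copies to overlap. The function name maximal_repeat_pairs and the min_len parameter are hypothetical.

```python
def maximal_repeat_pairs(seq, min_len=2):
    """Return (i, j, length) for every exact maximal repeated pair with i < j.

    Naive O(n^2) illustration of the definition; not the VMatch algorithm.
    """
    n = len(seq)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Left-maximal: the characters preceding both copies must differ
            # (or the first copy must start at the beginning of the sequence).
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of the sequence,
            # which makes the pair right-maximal by construction.
            k = 0
            while j + k < n and seq[i + k] == seq[j + k]:
                k += 1
            if k >= min_len:
                pairs.append((i, j, k))
    return pairs


if __name__ == "__main__":
    # "ACG" occurs at positions 0 and 6 and cannot be extended on either side.
    print(maximal_repeat_pairs("ACGTTTACGA", min_len=3))
```

For real genomes the quadratic scan is far too slow, which is why an index-based tool is used in practice.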
The main idea of the CRISPRFinder program is to find possible CRISPR locations and
then to check whether these regions contain a cluster that meets CRISPR structure standards.
Possible CRISPR locations are found by detecting maximal repeats (described above).
This step is performed by the VMatch package.
Default parameters used are the following: a repeat length of 23 to 55 bp, a gap size between repeats of 25 to 60 bp,
and 20% nucleotide mismatch between repeats.
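As a rough illustration, the default thresholds listed above could be applied to a pair of neighbouring candidate repeats as follows. This is a sketch under assumptions: the 20% figure is read as the maximum fraction of mismatching positions, the two repeats are assumed to have equal length, and all names (REPEAT_LEN, GAP_LEN, MAX_MISMATCH, mismatch_fraction, passes_defaults) are hypothetical and not part of CRISPRFinder or VMatch.

```python
# Default thresholds quoted in the text (lengths in bp).
REPEAT_LEN = (23, 55)
GAP_LEN = (25, 60)
MAX_MISMATCH = 0.20  # assumption: maximum fraction of mismatching positions


def mismatch_fraction(a, b):
    """Fraction of positions that differ between two equal-length repeats."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares repeats of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def passes_defaults(repeat_a, repeat_b, gap_len):
    """Check repeat length, gap size and mismatch level against the defaults."""
    if not REPEAT_LEN[0] <= len(repeat_a) <= REPEAT_LEN[1]:
        return False
    if not GAP_LEN[0] <= gap_len <= GAP_LEN[1]:
        return False
    return mismatch_fraction(repeat_a, repeat_b) <= MAX_MISMATCH


if __name__ == "__main__":
    first = "ACGT" * 6              # 24 bp repeat
    second = "TTTT" + "ACGT" * 5    # 3 mismatches out of 24 positions
    print(passes_defaults(first, second, gap_len=30))
```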
Filters are added to help validate a CRISPR:
Spacer size compared to the DR size.
By default, the spacer size should be from 0.6 to 2.5 times the DR size.
Spacers should not be identical.
This filter is set to eliminate tandem repeats.
Spacers are compared by aligning them with the ClustalW program (default parameters).
The percentage of similarity between spacers is calculated with the percentage_identity() function
of the BioPerl interface (AlignIO methods, ClustalW interface).
By default, this threshold is set to 60%.
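The two filters above can be sketched as follows. This is only an illustration under assumptions: a simple pairwise Needleman-Wunsch alignment stands in for the ClustalW multiple alignment, its identity value is not guaranteed to match BioPerl's percentage_identity(), and a candidate is assumed to fail when any pair of spacers exceeds the 60% threshold. The names spacer_size_ok, global_identity and spacers_distinct are hypothetical.

```python
from itertools import combinations


def spacer_size_ok(spacer_len, dr_len, low=0.6, high=2.5):
    """Spacer length must lie between 0.6 and 2.5 times the DR length."""
    return low * dr_len <= spacer_len <= high * dr_len


def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Percent identity over the columns of a simple global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal alignment, counting columns and identical pairs.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return 100.0 * identical / columns if columns else 0.0


def spacers_distinct(spacers, max_identity=60.0):
    """Assumed rule: no pair of spacers may exceed the identity threshold."""
    return all(global_identity(a, b) <= max_identity
               for a, b in combinations(spacers, 2))


if __name__ == "__main__":
    print(spacer_size_ok(32, 30))
    print(spacers_distinct(["ACGTACGTAC", "ACGTACGTAC", "TTTTTTTTTT"]))
```

A tandem repeat would typically fail the second check, because its "spacers" are near copies of one another.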
Small CRISPR-like structures, i.e. those having only two or three DRs, are
often by-products of CRISPRFinder identification rather than true CRISPRs.
They are therefore now classified using an evidence level, rated 1 to 4,
where level 1 includes small CRISPRs (with 3 or fewer spacers) and levels 2
to 4 are assigned on the basis of DR and spacer similarity.
Compared with the first version of CRISPRFinder,
tests have been introduced to check the internal conservation of the candidate DRs
and the divergence of the candidate spacers.
This allows a more accurate identification of true CRISPR arrays.
Candidates that obtain a high score are retained.
Orientation of the CRISPR
To identify the potential orientation of CRISPRs, two tests have been implemented.
The CRISPRDirection program (Biswas 2014) predicts orientation by comparison to a curated dataset (updated in May 2017) of consensus repeats.
The orientation is shown in the summary table as a + sign (if the region is the left flanking region) or a - sign (if the region is the right flanking region).
ND (not determined) is shown when no prediction can be made. In addition, the AT% is calculated in the 100 bp flanking the array on each side.
The region with the higher AT% is considered the leader, and the result is shown in the detailed results.
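The AT% comparison described above can be sketched in a few lines. This is an illustrative sketch, not the program's own code: the coordinates are assumed to be 0-based with an exclusive end, the handling of ties ("undetermined") is an assumption since the text does not cover it, and the names at_percent and predict_leader are hypothetical.

```python
def at_percent(seq):
    """Percentage of A and T bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "AT" for base in seq) / len(seq)


def predict_leader(genome, array_start, array_end, flank=100):
    """Compare the AT% of the two flanks of an array.

    array_start is inclusive and array_end exclusive (0-based); both are
    assumptions of this sketch. Flanks are truncated at the sequence ends.
    """
    left = genome[max(0, array_start - flank):array_start]
    right = genome[array_end:array_end + flank]
    left_at, right_at = at_percent(left), at_percent(right)
    if left_at > right_at:
        side = "left"
    elif right_at > left_at:
        side = "right"
    else:
        side = "undetermined"  # assumption: ties are not resolved
    return side, left_at, right_at


if __name__ == "__main__":
    # Toy sequence: AT-rich left flank, 40 bp "array", GC-rich right flank.
    genome = "AT" * 50 + "GGGGCCCC" * 5 + "GC" * 50
    print(predict_leader(genome, 100, 140))
```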
Searching for cas genes
The first step consists in the identification of open reading frames (ORFs) with
Prodigal (Hyatt 2010).
These ORFs are then analysed by the MacSyFinder program (Abby 2014),
which performs an HMM search against a library of known Cas proteins.
The Cas type and subtype are determined by analysing the cluster of Cas genes.
1- Kurtz, S. and Schleiermacher, C. REPuter: Fast Computation of Maximal Repeats in Complete Genomes,
Bioinformatics 15(5): 426-427, 1999.
2- Abouelhoda, M.I., Kurtz, S. and Ohlebusch, E. Replacing Suffix Trees with Enhanced Suffix Arrays,
Journal of Discrete Algorithms 2: 53-86, 2004.
3- Biswas, A., Fineran, P.C. and Brown, C.M. Accurate computational prediction of the transcribed strand of CRISPR non-coding RNAs,
Bioinformatics 30(13): 1805-1813, 2014.
4- Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification,
BMC Bioinformatics 11: 119, 2010.
5- Abby, S.S., Neron, B., Menager, H., Touchon, M. and Rocha, E.P. MacSyFinder: a program to mine genomes
for molecular systems with an application to CRISPR-Cas systems,
PLoS ONE 9: e110726, 2014.