Preparing the data


What do we need ?

To design a new animal DNA metabarcode we download from the NCBI the following data

  • The complete set of whole mitochondrial genomes
  • The NCBI taxonomy

We need also:


Downloading the mitochondrial genomes

We can use an internet browser and clic 6,000 times on each genome

or run the following command lines

mkdir mitochondria
cd mitochondria
wget 'ftp://ftp.ncbi.nlm.nih.gov/genomes/MITOCHONDRIA/Metazoa/*.gbk'
cd ..

The mitochondria directory contains now about 5000 files


Downloading the complete taxonomy

wget 'ftp://ftp.ncbi.nlm.nih.gov://pub/taxonomy/taxdump.tar.gz'

The NCBI taxonomy contains all the relationship between taxa. Each taxon is identified by a unique numerical id: taxid


Extracting the taxonomy

Taxonomy is downloaded as a compacted archive containing several text files. We have to reformat it according to the OBITools format.

mkdir ncbi20150518
cd ncbi20150518/
tar xf ../taxdump.tar
cd ..

file: nodes.dmp

1  |  1  |  no rank |       |   8   |   0   | ...
2   |   131567  |   superkingdom    |       |   0   |   0   |
6   |   335928  |   genus   |       |   0   |   1   |
7   |   6   |   species |   AC  |   0   |   1   |
9   |   32199   |   species |   BA  |   0   |
10  |   135621  |   genus   |       |   0   |
11  |   1707    |   species |   CG  |   0   |   1   |
13  |   203488  |   genus   |       |   0   |   1   |
14  |   13  |   species |   DT  |   0   |   1   |

file: names.dmp

1   |   root    |       |   scientific name |
2   |   Bacteria    |   Bacteria <prokaryote>   |   scientific name |
2   |   Monera  |   Monera <Bacteria>   |   in-part |
2   |   Procaryotae |   Procaryotae <Bacteria>  |   in-part |
2   |   Prokaryota  |   Prokaryota <Bacteria>   |   in-part |
2   |   Prokaryotae |   Prokaryotae <Bacteria>  |   in-part |
2   |   bacteria    |   bacteria <blast2>   |   blast name  |
2   |   eubacteria  |       |   genbank common name |
2   |   prokaryote  |   prokaryote <Bacteria>   |   in-part |
...
10  |   Cellvibrio  |       |   scientific name |
11  |   [Cellvibrio] gilvus |       |   scientific name |
13  |   Dictyoglomus    |       |   scientific name |
14  |   Dictyoglomus thermophilum   |       |   scientific name |

Formating the taxonomy for OBITools

obitaxonomy -t ncbi20150518 -d ncbi20150518
> ls
mitochondria      ncbi20150518.rdx
ncbi20150518      ncbi20150518.tdx
ncbi20150518.adx  taxdump.tar.gz
ncbi20150518.ndx

The four files ncbi20150518.adx, ncbi20150518.ndx, ncbi20150518.rdx, ncbi20150518.tdx constitue the reformated taxonomy.


Preparing the set of complete genomes

Merge and convert all genomes in a single fasta file

obiconvert mitochondria/* > mito.all.fasta
head -5 mito.all.fasta

five first lines of the new mito.all.fasta file

>AC_000022 organism=Rattus norvegicus; taxid=10116; Rattus norvegicus strain Wistar mitochondrion, complete genome
gttaatgtagcttataataaagcaaagcactgaaaatgcttagatggattcaaaaatccc
ataaacacaaaggtttggtcctggccttataattaattggaggtaagattacacatgcaa
acatccataaaccggtgtaaaatcccttaaagatttgcctaaaacttaaggagagggcat
caagcacataatatagctcaagacgccttgcctagccacacccccacgggactcagcagt

We want:

  • annotate sequences by their species taxid
  • keep a single genome per species
  • extract only vertebrate genome

Looking for the Vertebrata’s taxid

> ecofind -d ncbi20150518 '^vertebrata$'
#
#  opening ncbi20150518 database
Reading 1283820 taxa...
No local taxon
#  1283820 taxons
#
#  searching for '^vertebrata$' pattern
#  taxonomy id   |   taxonomy rank  |    name     
#
      7742  |         no rank   |    Vertebrata  
   1261581  |           genus   |    Vertebrata  
#  2 records found

A genus called Vertebrata

> ecofind -d ncbi20150518 -p 1261581
#
#  opening ncbi20150518 database
Reading 1283820 taxa...
No local taxon
#  taxonomy id   |   taxonomy rank  |    name                                                   |    class name         |    scientific name
#
   1261581  |           genus   |    Vertebrata    
      2803  |          family   |    Rhodomelaceae 
      2802  |           order   |    Ceramiales  
      2806  |           class   |    Florideophyceae 
      2763  |         no rank   |    Rhodophyta  
      2759  |    superkingdom   |    Eukaryota  
    131567  |         no rank   |    cellular organisms
         1  |         no rank   |    root 
#  7 parent(s) found
#

Reannotation and selection of the genomes

> obiannotate -d ncbi20150518 \
            --with-taxon-at-rank=species \
            mito.all.fasta | \
  obiannotate -S 'ori_taxid=taxid' | \
  obiannotate -S 'taxid=species' | \
  obiuniq -c taxid | \
  obiselect -c taxid -n 1 -f count -M > mito.one.fasta
obicount mito.all.fasta
4829 4858

Selection of the vertebrata genomes

obigrep -d ncbi20150518 -r 7742 mito.one.fasta > mito.vert.fasta
obicount mito.vert.fasta
3250 3273

Reformat the Fasta file into an ecoPCR database

obiconvert -d ncbi20150518 --ecopcrdb-output=mito.vert mito.vert.fasta

Looking for the Teleostei taxid

ecofind -d ncbi20150518 '^Teleostei$'
#
#  opening ncbi20150518 database
Reading 1283820 taxa...
No local taxon
#  1283820 taxons
#
#  searching for '^Teleostei$' pattern
#  taxonomy id   |   taxonomy rank  |    name                                                   |    class name         |    scientific name
#
     32443  |      infraclass   |    Teleostei  
#  1 records found

Selecting the best primer pairs

ecoPrimers -d mito.vert \ 
           -e 3 -3 2 \ 
           -l 30 -L 150 \ 
           -r 32443 \ 
           -c > Teleostei.ecoprimers

267 potential markers identified


  • Primer ID : 4

 

Primer | sequence | tm max | tm min | GC count ———|——————–|——–|——–|———- Forward | ACACCGCCCGTCACTCTC | 62.5 | 38.7 | 12
Reverse | CCAAGTGCACCTTCCGGT | 60.7 | 32.5 | 11

 

  • amplifying 1531/1550 sequences
  • identify 1207/1550 Species
  • Size ranging from 60pb to 102pb (mean: 74.75 pb)

The new primer pair is tested using ecoPCR

ecoPCR -d mito.vert \ 
       -e 5 \ 
       -l 30 -L 150 \ 
       -c \
        ACACCGCCCGTCACTCTC CCAAGTGCACCTTCCGGT > Teleostei.ecoprimers

We are now switching to R


Preparing our R session

First we have to download the two follong libraries

library(ROBITools)
## Warning: replacing previous import by 'igraph::parent' when loading
## 'ROBITools'
## Warning: replacing previous import by 'igraph::path' when loading
## 'ROBITools'
## ROBITools package
library(ROBITaxonomy)
## 
## Attaching package: 'ROBITaxonomy'
## 
## The following object is masked from 'package:stats':
## 
##     family
library(ROBIBarcodes)

Loading the data

fish = read.ecopcr.result('Teleostei.04.vert.ecopcr')
taxo = read.taxonomy('ncbi20150518')

A brief look into the ecoPCR result

head(fish,n = 2)
##          AC seq_length  taxid    rank species             species_name
## 1 NC_013146      16960 100858 species  100858 Threskiornis aethiopicus
## 2 NC_016427      16379  82464 species   82464           Myodes regulus
##    genus   genus_name family       family_name superkingdom
## 1 100857 Threskiornis  33574 Threskiornithidae         2759
## 2 447134       Myodes 337677        Cricetidae         2759
##   superkingdom_name strand      forward_match forward_mismatch forward_tm
## 1         Eukaryota      D ATACCGCCCGTCACCCTC                2      45.64
## 2         Eukaryota      D ACACCGCCCGTCACCCTC                1      53.84
##        reverse_match reverse_mismatch reverse_tm amplicon_length
## 1 CTAAGTGCACATTCCGGT                2      45.55              74
## 2 CCAAGCACACTTTCCAGT                4      16.28              79
##                                                                          sequence
## 1      CTCATAAGCTACTGACTCCCATAACTAATACCCTAATTAGCCGAAGATGAGGTAAGTCGTAACAAGGTAAGTGT
## 2 CTCAAACTAAATAAATGAGATCTATACATAATTACATCAAACTTTTACGAGAGGAGATAAGTCGTAACAAGGTAAGCAT
##                                                definition
## 1 Threskiornis aethiopicus mitochondrion, complete genome
## 2           Myodes regulus mitochondrion, complete genome

Identifying which sequences belongs fish

teleo.taxid = ecofind(taxo,'^Teleostei$')
teleo.taxid
## [1] 32443
is_a_fish=is.subcladeof(taxo,fish$taxid,teleo.taxid)
table(is_a_fish)
## is_a_fish
## FALSE  TRUE 
##  1673  1577

Testing the conservation of the priming sites

Fish.forward = ecopcr.forward.shanon(ecopcr = fish,
                                     group = is_a_fish)
Fish.reverse = ecopcr.reverse.shanon(ecopcr = fish,
                                     group = is_a_fish)

\[latex H = - \sum_{i \in \{A,C,G,T\}} p_i \times \frac{\log(p_i)}{\log(2)} \]


Ploting the results

par(mfcol=c(2,2))
dnalogoplot(Fish.forward$'TRUE',
            primer = "ACACCGCCCGTCACTCTC",
            main='Forward Fish')
dnalogoplot(Fish.forward$'FALSE',
            primer = "ACACCGCCCGTCACTCTC",
            main='Forward not Fish')

dnalogoplot(Fish.reverse$'TRUE',
            primer = "CCAAGTGCACCTTCCGGT",
            main='Reverse Fish')
dnalogoplot(Fish.reverse$'FALSE',
            primer = "CCAAGTGCACCTTCCGGT",
            main='Reverse not Fish')


How mismatches influence taxonomical selection

par(mfcol=c(1,1))
mismatchplot(fish,group = is_a_fish,
             legend=c('Not a fish','Fish'))


What is the taxonomic resolution of our marker

only.fish=fish[is_a_fish,]
res = resolution(taxo,only.fish)
resolution = with(only.fish,
                  unique(data.frame(species_name,taxid,rank=res))
                 )
t(t(sort(table(resolution$rank)/length(resolution$rank),decreasing = TRUE)))
##             
##                     [,1]
##   species    0.780177891
##   genus      0.116264295
##   family     0.071791614
##   no rank    0.010165184
##   subfamily  0.008894536
##   tribe      0.006988564
##   subspecies 0.005717916

Is it the same for Cyprinideae ?

is_a_cyprinidae = is.subcladeof(taxo,fish$taxid,7953)
only.cyprinidae=fish[is_a_cyprinidae,]
res = resolution(taxo,only.cyprinidae)
resolution = with(only.cyprinidae,
                  unique(data.frame(species_name,taxid,rank=res))
                 )
t(t(sort(table(resolution$rank)/length(resolution$rank),decreasing = TRUE)))
##             
##                    [,1]
##   species    0.66563467
##   family     0.26006192
##   genus      0.06811146
##   subspecies 0.00619195

And consequently for other fish

no.cyprinidae=fish[is_a_fish & !is_a_cyprinidae,]
res = resolution(taxo,no.cyprinidae)
resolution = with(no.cyprinidae,
                  unique(data.frame(species_name,taxid,rank=res))
                 )
t(t(sort(table(resolution$rank)/length(resolution$rank),decreasing = TRUE)))
##             
##                     [,1]
##   species    0.810551559
##   genus      0.128697042
##   family     0.029576339
##   subfamily  0.011191047
##   tribe      0.008792966
##   no rank    0.005595524
##   subspecies 0.005595524

Go back to Unix


We try to select marker specific of Cyprinidae

ecofind -d ncbi20150518 '^cyprinidae$'
#
#  opening ncbi20150518 database
Reading 1283820 taxa...
No local taxon
#  1283820 taxons
#
#  searching for '^cyprinidae$' pattern
#  taxonomy id   |   taxonomy rank  |    name                                                   |    class name         |    scientific name
#
      7953  |          family   |    Cyprinidae                                             |    scientific name    |    Cyprinidae
#  1 records found

We run ecoPrimers and ecoPCR on the select primer pair

ecoPrimers -d mito.vert \
           -e 3 -3 2 \
           -l 30 -L 150 \
           -r 7953 -c > Cyprinidae.ecoprimers
ecoPCR -d mito.vert \
       -e 5 \
       -l 30 -L 500 \
       -c \
       ACGGCGTAAAGGGTGGTT TATCTAATCCCAGTTTGT \
       > Cyprinidae.14.vert.ecopcr

Go back to R


Loading the new ecoPCR results

cyprinidae = read.ecopcr.result('Cyprinidae.14.vert.ecopcr')
is_a_fish=is.subcladeof(taxo,cyprinidae$taxid,32443)
is_a_cyprinidae = is.subcladeof(taxo,cyprinidae$taxid,7953)
group = rep('vertebrate',length(is_a_fish))
group[is_a_fish]='fish'
group[is_a_cyprinidae]='cyprinidae'
group=as.factor(group)
table(group)
## group
## cyprinidae       fish vertebrate 
##        333       1318       1626

Studying the priming site conservation

Cyprinidae.forward = ecopcr.forward.shanon(ecopcr = cyprinidae,group = group)
Cyprinidae.reverse = ecopcr.reverse.shanon(ecopcr = cyprinidae,group = group)
par(mfcol=c(3,2))
dnalogoplot(Cyprinidae.forward$vertebrate,primer = "ACGGCGTAAAGGGTGGTT",main='Forward Vertebrata')
dnalogoplot(Cyprinidae.forward$fish,primer = "ACGGCGTAAAGGGTGGTT",main='Forward Fish')
dnalogoplot(Cyprinidae.forward$cyprinidae,primer = "ACGGCGTAAAGGGTGGTT",main='Forward Cyprinidae')


dnalogoplot(Cyprinidae.reverse$vertebrate,primer = "TATCTAATCCCAGTTTGT",main='Reverse Vertebrata')
dnalogoplot(Cyprinidae.reverse$fish,primer = "TATCTAATCCCAGTTTGT",main='Reverse Fish')
dnalogoplot(Cyprinidae.reverse$cyprinidae,primer = "TATCTAATCCCAGTTTGT",main='Reverse Cyprinidae')