Guide: How are names internally curated?


Register List Endorsement and Validation Guide

Contributors: Maria Chuvochina, Marike Palmer, Miguel Rodriguez-R

Workflow for Registration

The workflow for registration is presented below only for reference and to standardize the terminology used throughout:

  1. Initial name(s) registration: The names are entered into the Registry by the submitter or via automatic search by the Excubia Bot. At this stage, names can be in the status of “Automatically discovered” (publicly visible) or “draft” (private).
  2. Submitter claim: The names are claimed by the submitter (unless registered by them). Names are in the status of “draft” (private).
  3. Name(s) in register list: The submitter adds the name(s) to the register list using the “propose name” button.
  4. Name submission (through a Register List submission): The submitter completes a Register List submission, which automatically changes the status for the list and all the names to “submitted”. For Path 2, this leads directly to step 6 (without preliminary curation).
  5. Name endorsement (through a Register List endorsement): The curators endorse the Register List, which automatically changes the status for the list and all the names to “endorsed”. This step is bypassed in Path 2.
  6. SeqCode notification: The submitter notifies the SeqCode Registry of the publication of the manuscript (the typeset version of the effective publication). The list changes to the status of “notified” (names remain either “submitted” for Path 2 or “endorsed” for Path 1).
  7. Name validation (through a Register List validation): The curators validate the Register List, thereby rendering all contained names “validly published”.
  8. Register List publication: The DataCite format is reviewed for consistency and a DOI is obtained for the Register List.
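
As a rough illustration of the terminology above, the status changes of a name can be sketched as a small lookup table. This is only a sketch under stated assumptions, not the Registry's implementation: the `Path` enum, the `NAME_STATUSES` table, and the `next_name_status` helper are hypothetical names, the “Automatically discovered” status is left out for brevity, and the “notified” status is omitted because it applies to the list rather than to the names.

```python
from enum import Enum


class Path(Enum):
    PATH_1 = "Path 1"  # includes the endorsement step (step 5)
    PATH_2 = "Path 2"  # bypasses endorsement (no preliminary curation)


# Hypothetical table of name statuses in workflow order, per path.
# "draft" stands for steps 1-3; "notified" is a list status and is not shown.
NAME_STATUSES = {
    Path.PATH_1: ["draft", "submitted", "endorsed", "validly published"],
    Path.PATH_2: ["draft", "submitted", "validly published"],
}


def next_name_status(current: str, path: Path) -> str:
    """Return the status that follows `current` for a name on `path`."""
    statuses = NAME_STATUSES[path]
    if current not in statuses:
        raise ValueError(f"{current!r} is not a status on {path.value}")
    idx = statuses.index(current)
    if idx == len(statuses) - 1:
        raise ValueError(f"{current!r} is the final status")
    return statuses[idx + 1]


if __name__ == "__main__":
    print(next_name_status("submitted", Path.PATH_1))
    print(next_name_status("submitted", Path.PATH_2))
```

The only difference between the two paths in this sketch is whether “endorsed” sits between “submitted” and “validly published”, which mirrors steps 4 to 7 above.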

General Procedures on Register List Curation for Endorsement and Validation

  1. All detected issues are to be addressed before validation, but some can be ignored for endorsement (and are explicitly indicated in the SeqCode Registry as Quality Check warnings). Changes can be introduced after endorsement, but ideally should be addressed by then. No changes are to be introduced after validation.
  2. When a list is submitted, curators might identify individual issues and record them in the SeqCode Registry using one of the following spaces:
    • Name Correspondence with Curators: If the issues refer to a specific name (and not the entire list), the preferred space to document issues is in the name’s page itself.
    • Register List Correspondence with Curators: If the issues refer to the entire list (and not just a specific entry), the preferred space to document issues is in the register list’s page. For single-entry lists, both spaces can be treated as equivalent.
    • Notes on “Return to submitter”: If only minor issues remain that can be quickly addressed and don’t need permanent documentation, they may be communicated to authors in the notes when returning the register list to submitters. However, note that this is NOT to remain in the permanent record, and will be purged upon validation. In any case, this space should also be used to highlight to the authors the actions that must be taken before endorsement, even if this is elaborated on in the Correspondence with Curators space.
  3. When a curator is satisfied with the quality of all names in a register list, they should document their endorsement first in the register list, indicating the type of review they have performed (nomenclature or genomics). The system allows for the endorsement of individual names, but this procedure should be avoided, preferring instead actions at the level of the register lists. Note that progress in individual names can be documented through the Expert Curation checks in each name. Similarly, individual names can be returned to submitters, but this too should be avoided as it does not generate an email alert.
  4. When an entry has been revised for nomenclature and genomic data quality, it can be endorsed and/or validated by a curator.

Most potential issues covered by the rules and recommendations of the SeqCode are now implemented as Expert Checks. Nomenclature Review and Genomics Review can be accomplished by addressing the expert checks one by one. Failed checks associated with recommendations are identified as warnings, whereas errors corresponding to rules can prevent the endorsement or validation of the names in a register list. Although the Expert Checks include a brief description and the corresponding excerpt of the SeqCode, see the recommendations below for a standardized curation process.

Nomenclature Review

To evaluate the nomenclature of a genus and species, follow these steps:

  1. Confirm that the correct rank for the taxon is indicated. For species, information on the specific epithet should be provided on the name page, with the binomial indicated in the name section, while the information on the genus name is required on a separate genus page.
  2. Confirm that the species is linked to a genus as its parent. If the genus is a validly published genus under the SeqCode, ICNP, or ICN, only the page(s) for the species name(s), containing the binomial, are necessary; in that case, move to point 7. If a new genus is proposed, ensure that both the species name and the genus name pages are completely filled in and linked to the same register list if proposed in the same effective publication. If the earliest species in the genus is the designated type, go to the Expert Curation section of the genus name, and check the green mark for “Later species as genus type”.
  3. Evaluate the genus name first to ensure that the first part of the species name is formed correctly and the specific epithet is of a gender consistent with the genus name when appropriate.
  4. If the genus name is formed from a personal name, check that the name is well formed (check green mark for “Inapt personal name” entry), is feminine in gender (check green mark for “Personal genus must be feminine”), and is not contentious (check green mark for “Contentious name”).
  5. If the genus name is a compound name (i.e., consists of two or more word stems), ensure that i) the name is formed by combining the stems of the first or preceding words [to identify the stem, one needs to identify the genitive case of the word and drop its ending]; ii) -i- is used as the connecting vowel, except where established usage for words of Greek origin calls for -o- as the connecting vowel, or where the following word starts with a vowel; iii) the genus gender is determined by the gender of the last word in the compound name. Confirm the appropriate gender is assigned to the genus and check the green mark for “Grammatical gender varies from source”. Confirm that roots from existing languages are used, where words consisting of suffixes alone do not count as roots, as these are modifiers of a root word. If roots are from existing or extinct languages, check the green mark for the entry “Missing roots from existing languages”. If mnemonic cues are present for the name, check the green mark for “Lacking mnemonic cues”.
  6. From all of these, ensure the genus name is spelled correctly, and check the green mark for the “Incorrect spelling” entry. Confirm that no significant emendations of the taxon name have been missed for names that are already in use in the literature (but are not yet validly published).
  7. Proceed to specific epithet evaluation. Determine whether the epithet is an adjective, a nominative noun in apposition, or a noun in genitive case. If the epithet is an adjective, confirm that it is of the same gender as the genus, and check the green mark for “Inconsistent epithet relationship to genus”. If it is a noun in genitive case or a noun in apposition, no gender agreement is necessary.
  8. Confirm that names are formed: i) preferably from Latin (check mark “Latin should be preferred”); ii) in a non-contentious way (check green mark for “Contentious name”); and iii) with mnemonic cues and roots from existing languages (check green marks for “Lacking mnemonic cues” and “Missing roots from existing languages”). Morphemes consisting of Latin suffixes do not count as roots, as suffixes are modifiers of a root word, thus roots from existing or extinct languages are required.
  9. Confirm spelling of name and that no emendations for the names have been missed (check green marks for “Missing publication of emendation” and “Incorrect spelling”).
  10. If no revisions to the register list are required and all names on the register list can be endorsed or validated, tick the Nomenclature review checkmark in the “Curator team tracking” section of the register page. If the Genomic review is also checked, endorse or validate all names in the list. If revisions to the register list or to names on the register list are required, clearly indicate the required revisions in the “correspondence with curators” space on each name page, write a short description of the necessary revisions in the register list “correspondence with curators” space, and send the list back to the contributor.

For register lists exclusively composed of names of genera and higher taxa, identify the type genus and species, and confirm whether these have been validated, have been endorsed, or have been submitted for evaluation. Higher taxon names cannot be validated if the subordinate types have not been validated. Higher taxon names may be submitted for evaluation on a separate register list from that of the subordinate types when they were proposed in a different effective publication than the nomenclatural types for these taxa (e.g., a genus and species are proposed as incertae sedis in a publication, or no intermediate taxonomic ranks between genus and domain were proposed, and a subsequent publication proposes the intermediate ranks of family, order, class, and/or phylum).

To evaluate the nomenclature of a family, order, class, and/or phylum, follow these steps:

  1. Confirm that the correct rank for the taxon is indicated. Current recommendations are that families should end in the suffix -aceae, orders should end in the suffix -ales, classes should end in the suffix -ia, and phyla should end in the suffix -ota.
  2. Check that the name is formed from the inferred stem of the type genus name. The “Nomenclature” section includes the inferred stem to aid with this process. If the name is properly formed according to recommendations, not contentious, and correctly spelled, check the entries as passed (green check marks) for “Inapt personal name”, “Contentious name”, “Lacking mnemonic cues”, “Missing roots from existing languages”, and “Incorrect spelling”.
  3. Confirm under the etymology section that the information (language, grammar, particle, and description or translation) of the first morpheme is that of the full word of the type genus.
  4. Confirm that the designated type genus is the first legitimate genus in the taxon. Under the SeqCode, higher taxonomic ranks should be formed from the first legitimate genus name belonging to the taxon. Confirm that no earlier correct, preferred, and legitimate genera under the SeqCode, ICNP, or ICN exist in this taxon. If there are no earlier legitimate genera to serve as type, select the green check mark for “Later taxon as type” entry. However, note that there could be contention on what the earliest legitimate genus is, depending on taxonomic opinion. In any case, defer to the taxonomy being used or proposed by the name’s author to determine the earliest legitimate genus in the taxon.
  5. For the second morpheme, no language or grammar information is necessary, with the particle indicated as the appropriate suffix (e.g., -aceae, -ales, -ia, or -ota) and the description being “ending to denote a [replace with taxonomic rank]”.
  6. The etymology of the full word should be as follows:
    Language: “N.L.” (for Neo-Latin)
    Grammar: “fem. pl. n.” (for family and order) or “neut. pl. n.” (for class and phylum)
  7. Confirm that the taxon is assigned a parent taxon and at least one child taxon (that which serves as nomenclatural type), and ideally classification to domain-level.
  8. Confirm that any taxonomic emendations (for taxa published prior to registration) are linked under the publications section.
  9. If no revisions to the register list are required and all names on the register list can be endorsed or validated, tick the Nomenclature review checkmark in the “Curator team tracking” section of the register page. If the Genomic review is also checked, endorse or validate all names in the list. If revisions to the register list or to names on the register list are required, clearly indicate the required revisions in the “correspondence with curators” space on each name page, write a short description of the necessary revisions in the register list “correspondence with curators” space, and send the list back to the contributor.
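
The suffix and grammar conventions in steps 1, 2, and 6 lend themselves to a quick mechanical pre-check before the manual review. The following is a minimal sketch, assuming the inferred stem of the type genus is supplied by the curator (as shown in the “Nomenclature” section); `RANK_CONVENTIONS` and `check_higher_taxon_name` are hypothetical names and not part of the SeqCode Registry.

```python
from typing import List

# Recommended suffix and expected grammar of the full word, per rank
# (taken from steps 1 and 6 above).
RANK_CONVENTIONS = {
    "family": ("aceae", "fem. pl. n."),
    "order": ("ales", "fem. pl. n."),
    "class": ("ia", "neut. pl. n."),
    "phylum": ("ota", "neut. pl. n."),
}


def check_higher_taxon_name(name: str, rank: str, type_genus_stem: str) -> List[str]:
    """Return a list of human-readable issues (empty if none were found)."""
    suffix, _grammar = RANK_CONVENTIONS[rank]
    issues = []
    if not name.endswith(suffix):
        issues.append(f"{rank} names should end in -{suffix}")
    elif name != type_genus_stem + suffix:
        issues.append(
            f"expected {type_genus_stem + suffix!r} from stem {type_genus_stem!r}"
        )
    return issues


if __name__ == "__main__":
    # Example input: the stem of the genus name Pseudomonas is Pseudomonad-.
    print(check_higher_taxon_name("Pseudomonadaceae", "family", "Pseudomonad"))
    print(check_higher_taxon_name("Pseudomonasales", "order", "Pseudomonad"))
```

A check like this can only catch formal mismatches; priority of the type genus, contentiousness, and spelling still require the curator's judgment.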

Genomics Review

For register lists exclusively composed of names of genera and above ranks, genome quality is not to be revised. Simply mark the list as Genomics Review complete in the “Curator team tracking” section of the list.

To evaluate genome quality of species and subspecies types, follow these steps for each of the species and subspecies names:

  1. Visit the name page, and confirm that the automated checks have not identified any suspicious results. If the system flags inconsistent annotations, request the methods be filled in the Submitter Comments (unless already provided).
    a. In the case of inconsistent number and fraction of the 16S rRNA genes, note that oftentimes 16S genes may be fragmented, and the authors could report 1 copy at 100% even though it’s split in two contigs. MiGA will find 2 partial copies instead in such cases, but it’s not truly inconsistent information.
  2. Open the “Expert Curation” box and scroll down to the Genomics subsection.
  3. Confirm that the genome entry is properly linked. The SeqCode Registry will automatically flag missing entries, but you can also confirm the availability of the entry in INSDC by clicking the first link/s in the Genomics subsection. In general, assembly entries are preferred over Nucleotide accessions, but both can be processed.
  4. Open the genome page, the link with format “Genome sc|XXXXXXX”. Scroll down to the section “Sample Metadata”, and confirm that the source sample has been properly linked. If the Sample Metadata section doesn’t exist in the genome page, make sure that the link/s in the “Source” entry of the “Genomics” section is/are correctly formed and point to existing pages. Once confirmed, you could also click on “Update external metadata” in the “Curator Actions” section of the genome page, wait a couple of seconds, and reload the page. Consider the following:
    a. BioSamples are generally preferred over SRA Experiments or Runs, except in cases of ambiguity; for example, if a single BioSample has multiple SRA Experiments associated and only one was used to obtain the genome.
    b. In some cases, the BioSample explicitly references another BioSample; e.g., when separate entries were created for the binning process and the metagenome itself. In such cases, capture all BioSamples (separated by commas) in the genome information and notify the submitter via Correspondence with Curators so they can check the changes for correctness. The system can often identify these automatically, and you can simply click on the “Link” button.
    c. Note that for names effectively published before January 1, 2023, the data source is recommended but not required.
  5. From the BioSample, revise the metadata retrieved by the Registry. At minimum, the BioSample should report: date, location (coordinates), toponym (location name), environment (or biome), and at least one other field (e.g., temperature, pH, host, etc). The system aids with this check by automatically detecting frequent fields as part of each of the categories. However, consider revising the values, as sometimes they’re non-informative (e.g., a value of “NA”, “missing”, or “unreported”). The system also detects the metadata package, as this can sometimes help with this step. Finally, if any data field is empty (marked by the system as “Not detected”), you can also examine the full metadata retrieved by the Registry by clicking on the BioSample link under “All retrieved samples”. Once you check everything from steps 4 and 5, go back to the name page and mark the “Missing metadata in databases” as appropriate (green if passed, red if failed).
  6. Back in the Genomics subsection, click on the MiGA link, and revise the 16S rRNA classification (if any) for consistency with the genome classification (“Ribosomal and transfer RNA” section in MiGA). If the taxonomic classification is not consistent, flag the entry as fail “Inconsistent 16S assignment” (red mark), or mark it as pass otherwise (green mark).
  7. Check the ANI/AAI against both TypeMat genomes (“Taxonomy” section in MiGA) and other genomes of the SeqCode Check project (“Distances” section in MiGA). In the cases covered by issues on classification 1-3 (see below), request additional justification of novelty. Make sure that this justification is also included in the taxon description, since the correspondence with curators will remain confidential and won’t be part of the public version of the name.
  8. If minimum requirements of quality are not met, or in extreme cases of point 7 above (e.g., ANI >> 95%), flag the entry as fail “Ambiguous type material” (red mark), or mark it as pass otherwise (green mark).

Once the process is finished for all relevant names in a given register list, mark the Genomics Review complete.
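
The minimum metadata requirement in step 5 can be approximated with a short script when reviewing many names at once. This is a sketch only: the field keys (`date`, `location`, `toponym`, `environment`) are simplified, hypothetical labels rather than actual BioSample attribute names, and the set of non-informative values contains just the examples mentioned above plus the empty string.

```python
from typing import Dict, List

# Values treated as non-informative (examples from step 5, lowercased).
NON_INFORMATIVE = {"", "na", "missing", "unreported"}

# Hypothetical, simplified labels for the four required categories.
REQUIRED_FIELDS = ("date", "location", "toponym", "environment")


def missing_metadata(sample: Dict[str, str]) -> List[str]:
    """List the required categories lacking an informative value.

    "other" is appended when no additional informative field
    (e.g., temperature, pH, host) is present.
    """
    informative = {
        key
        for key, value in sample.items()
        if value is not None and value.strip().lower() not in NON_INFORMATIVE
    }
    missing = [field for field in REQUIRED_FIELDS if field not in informative]
    if not any(key not in REQUIRED_FIELDS for key in informative):
        missing.append("other")
    return missing


if __name__ == "__main__":
    example = {
        "date": "2021-05-01",
        "location": "NA",  # present but non-informative
        "toponym": "Example Lake",
        "environment": "freshwater",
        "pH": "7.2",
    }
    print(missing_metadata(example))
```

An empty result would correspond to a green mark for “Missing metadata in databases”; any listed category suggests a closer look at the full metadata before marking the check.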

Issues on Classification

Freedom of taxonomic opinion is a guiding principle of the SeqCode. However, recommendations can be made in the Correspondence with Curators when any of the following are observed in the MiGA check page of the genome (linked in the Genomics section).

These checks are only meant to identify potentially erroneous data being submitted (e.g., a mistaken genome accession or a missing taxonomic evaluation), and do not constitute an endorsement of techniques such as ANI or AAI. Therefore, any issues should be brought up to the authors as a courtesy, but ultimately the taxonomic classification defers to that registered by the submitters:

  1. For novel species: The ANI against any genome from the TypeMat database in MiGA is greater than 95% (“Taxonomy” section in MiGA) and no justification has been explicitly given in the description of the species.
  2. For novel species: The ANI against any genome from the SeqCode Check project (“Distances” section in MiGA) is greater than 95% and the associated name is in a protected status (i.e., proposed less than a year prior, or already validly published). To view the SeqCode genome page of a MiGA match, visit seqco.de/g:XXXX, where XXXX should be replaced by the genome ID. In particular, note that high ANI between genomes of different species from the same submission is often detected, because authors might check novelty against databases (e.g., GTDB or MiGA) but not among their own collection.
  3. For novel genera or above: The AAI against any genome from TypeMat (“Taxonomy” section in MiGA) or the SeqCode Check project (“Distances” section in MiGA) is greater than 70% and no justification has been explicitly given in the description of the genus.
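
The three conditions above reduce to simple threshold comparisons, sketched below for illustration. The function name `classification_issues` and its parameters are hypothetical; the inputs are assumed to be the highest ANI (for species) or AAI (for genera and above) values, in percent, read by the curator from the “Taxonomy” and “Distances” sections in MiGA.

```python
from typing import List

ANI_SPECIES_THRESHOLD = 95.0  # percent, issues 1 and 2
AAI_GENUS_THRESHOLD = 70.0    # percent, issue 3


def classification_issues(
    rank: str,
    best_typemat: float,
    best_seqcode: float,
    seqcode_match_protected: bool,
    justified: bool,
) -> List[int]:
    """Return the numbers of the issues (1-3) that apply.

    For species the values are ANI; for genera or above they are AAI.
    `justified` means the description explicitly justifies novelty.
    """
    issues = []
    if rank == "species":
        if best_typemat > ANI_SPECIES_THRESHOLD and not justified:
            issues.append(1)
        if best_seqcode > ANI_SPECIES_THRESHOLD and seqcode_match_protected:
            issues.append(2)
    else:  # genus or above
        best = max(best_typemat, best_seqcode)
        if best > AAI_GENUS_THRESHOLD and not justified:
            issues.append(3)
    return issues


if __name__ == "__main__":
    print(classification_issues("species", 96.2, 80.0, False, False))
    print(classification_issues("genus", 72.0, 60.0, False, False))
```

As stated above, a non-empty result is only a prompt to raise the point with the authors as a courtesy; it does not override the classification registered by the submitters.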

Additionally, the following classification issues should block the process and be addressed before endorsement or validation:

  1. The classification lineage does not extend up to domain. Note that this extension can be achieved either by means of a complete lineage or through the use of incertae sedis.

Other Issues

Any of the issues below should be followed up via Correspondence with Curators to maintain a traceable record:

  1. If a species name is proposed within a genus that is not validly published, the genus name should be proposed first (or together) with the species name in the same register list. However, note that this is not a SeqCode rule (or even recommendation), so it’s only a preferred practice. Similarly for the proposal of subspecies names in a species without a validly published name.
  2. Whenever possible, capture previously used names other than Latin or Latinized names (e.g., alphanumeric strings or so-called placeholder names) in the name’s notes.
  3. The description must not consist of a single reference to the effective publication, but should instead include the actual description of the taxon. We do not enforce a specific format or minimum information, but it is preferred that the description includes at least the properties of the taxon (observed or inferred) and the proposed circumscription (e.g., RED, AAI, ANI, phylogenetic reconstruction, or any other methods).
  4. If a description is emended, the description should indicate the original publication first (in bold), and all emendations in separate paragraphs with the emending publication(s) clearly indicated (also in bold). All referenced citations should be linked to the name page.
  5. If a name is proposed in a publication on the basis of 16S rRNA, any other marker gene, a low-quality genome, or any other basis, the manuscript should be linked as a citing publication, but that is not sufficient basis to be considered the effective publication. If the type material (a genome sequence) is determined in a subsequent (different) publication, the latter should be considered the effective publication.
  6. If the spelling of a name is corrected in a subsequent publication, the publication should be identified as a corrigendum (not as effective publication). Similarly, if the description of a taxon is modified by a publication, the publication should be identified as emendavit (not as effective publication). The exception to this rule is in cases in which the emendation consists of establishing a new type and the name was not previously considered as validly published (see issue 5 above).