gff-structure
Bioinformatics reference for Atlantic salmon GFF3 file structure, covering Ensembl and NCBI annotations on the Ssal_v3.1 assembly for parsing and processing tasks.
Introduction
This skill is a technical reference for navigating and parsing GFF3 annotation files for the Atlantic salmon genome assembly (Ssal_v3.1). It is intended for bioinformatics researchers, computational biologists, and software engineers working on gene mapping, annotation comparison, and workflow development. It details the differences between the Ensembl and NCBI annotation formats, including the handling of gene-level features, naming conventions, and attribute structures.
- Detailed breakdown of Ensembl GFF3 attributes, including ID prefixes (gene:), biotype categorization, and URL-encoded description parsing for sources like ZFIN, RFAM, and HGNC.
- Analysis of NCBI GFF3 structure, focusing on the gene feature type, LOC ID naming patterns, and the distinction between internal gene IDs and Dbxref references.
- Comparison tables highlighting schema differences, such as nomenclature for biotype vs. gene_biotype and numeric ID storage locations.
- Guidance on identifying orthologs using NCBI gene ortholog datasets for cross-species analysis, specifically for tax_id 8030 (Atlantic salmon) versus human (9606).
- Practical constraints for GFF3 processing, emphasizing the importance of preserving gene block order during sorting and the usage of scripts like gff_block_sort.py.
- Use this for building bioinformatics pipelines involving tools such as Liftoff, LiftoffTools, GffCompare, and ParsEval for cross-assembly mapping.
- Intended for developers writing scripts to extract gene attributes, validate annotation consistency, or generate mapping tables for platforms like Salmobase.
- Expected inputs include raw or processed .gff3 files; outputs include validated metadata, cleaned gene feature lists, or comparison metrics such as CDS and exon overlap.
- Users should adhere to the established project conventions regarding sequential processing and the use of environment.yml for reproducible dependencies. Ensure all GFF files maintain structural integrity before proceeding to downstream statistical analysis or database ingestion.
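The attribute differences listed above (the `gene:` ID prefix, `biotype` vs. `gene_biotype`, URL-encoded descriptions, and `Dbxref` references) can be illustrated with a minimal Python sketch. The function names `parse_attributes` and `summarize_gene` are hypothetical, the sample attribute strings use placeholder identifiers rather than real salmon genes, and the `[Source:...;Acc:...]` description layout is assumed from the sources named above. Treat it as an illustration, not as the project's own parser.

```python
import re
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attributes column (column 9) into a dict.

    Fields are split on ';' first and percent-decoded afterwards, so an
    encoded separator such as %3B inside a value does not break the split.
    """
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key] = unquote(value)
    return attrs


# Assumed description layout: "free text [Source:<db>;Acc:<accession>]"
SOURCE_RE = re.compile(r"\[Source:(?P<db>[^;\]]+);Acc:(?P<acc>[^\]]+)\]")


def summarize_gene(attrs):
    """Collect the fields that differ between the two annotation styles."""
    gene_id = attrs.get("ID", "")
    if gene_id.startswith("gene:"):  # Ensembl-style ID prefix
        gene_id = gene_id[len("gene:"):]

    # Ensembl uses 'biotype'; NCBI uses 'gene_biotype'.
    biotype = attrs.get("biotype") or attrs.get("gene_biotype")

    # Dbxref is a comma-separated list of db:accession pairs.
    dbxref = {}
    for item in attrs.get("Dbxref", "").split(","):
        db, sep, acc = item.partition(":")
        if sep:
            dbxref[db] = acc

    match = SOURCE_RE.search(attrs.get("description", ""))
    source = (match.group("db"), match.group("acc")) if match else None
    return {"id": gene_id, "biotype": biotype, "dbxref": dbxref, "source": source}


# Placeholder attribute strings, made up for illustration only.
ensembl_like = (
    "ID=gene:ENSSSAG00000000001;Name=abc1;biotype=protein_coding;"
    "description=example gene [Source:ZFIN%3BAcc:ZDB-GENE-000000-1]"
)
ncbi_like = (
    "ID=gene-LOC100000001;Dbxref=GeneID:100000001;"
    "Name=LOC100000001;gene_biotype=protein_coding"
)

ensembl_summary = summarize_gene(parse_attributes(ensembl_like))
ncbi_summary = summarize_gene(parse_attributes(ncbi_like))
```

In this sketch the numeric NCBI identifier is read from `Dbxref` rather than from `ID`, which mirrors the distinction between internal gene IDs and Dbxref references noted above.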
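The constraint on preserving gene block order during sorting can be sketched as follows. This is not the implementation of gff_block_sort.py, whose internals are not described here; `sort_gene_blocks` is a hypothetical helper that assumes the input already lists each gene's child features directly after the gene line, and that a top-level feature is any line without a `Parent` attribute.

```python
def sort_gene_blocks(lines):
    """Sort GFF3 records by (seqid, start) of each top-level feature,
    keeping every gene block (gene plus its children) contiguous.

    Assumptions of this sketch: children directly follow their gene in
    the input, leading '#' lines are kept as a header, and later comment
    or '###' lines are dropped.
    """
    header = []
    blocks = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            if not blocks:
                header.append(line)
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 tab-separated columns: {line!r}")
        has_parent = any(f.startswith("Parent=") for f in cols[8].split(";"))
        if not has_parent or not blocks:
            blocks.append([line])  # a top-level feature starts a new block
        else:
            blocks[-1].append(line)  # a child stays with the current block

    def block_key(block):
        cols = block[0].split("\t")
        return (cols[0], int(cols[3]))  # seqid (lexicographic), then start

    blocks.sort(key=block_key)  # order inside each block is left untouched
    return header + [line for block in blocks for line in block]


# Made-up records for illustration only.
example = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t500\t900\t.\t+\t.\tID=g2",
    "chr1\tsrc\tmRNA\t500\t900\t.\t+\t.\tID=t2;Parent=g2",
    "chr1\tsrc\texon\t500\t900\t.\t+\t.\tID=e2;Parent=t2",
    "chr1\tsrc\tgene\t100\t400\t.\t-\t.\tID=g1",
    "chr1\tsrc\tmRNA\t100\t400\t.\t-\t.\tID=t1;Parent=g1",
]
sorted_lines = sort_gene_blocks(example)
```

Sorting whole blocks rather than individual lines avoids interleaving the exons of overlapping genes, which a plain sort on seqid and start would do.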