pysam
Genomic file toolkit for NGS data processing. Read and write SAM/BAM/CRAM alignments and VCF/BCF variants, and read FASTA/FASTQ sequences, through pysam's Pythonic interface to htslib.
Introduction
Pysam is a Python wrapper for htslib, built for bioinformatics and next-generation sequencing (NGS) data processing pipelines. It provides a Pythonic interface to high-throughput genomic data files, so developers and researchers can manipulate large sequencing datasets programmatically. The skill is intended for bioinformaticians, data scientists, and computational biologists who need to prototype quickly or build production-grade genomic analysis workflows.
Key features and capabilities include:
- Reading and writing of standard alignment and variant formats: SAM, BAM, and CRAM for read alignments, and VCF and BCF for variant data. FASTA and FASTQ sequence files are supported for reading.
- Regional queries and random access through index files (BAI/CSI for BAM, CRAI for CRAM, and tabix for bgzip-compressed VCF and other tab-delimited files), enabling efficient data extraction from massive alignment or variant files without loading entire datasets into memory.
- Built-in capabilities for pileup analysis, allowing users to calculate per-base coverage, read depth, and quality statistics across specific genomic intervals.
- Access to samtools and bcftools commands as Python functions (for example pysam.sort and pysam.index), bridging the gap between existing command-line tools and Python-based automation.
- Support for complex genomic operations, including filtering reads by mapping quality, flags, or genomic coordinates, as well as accessing raw quality scores and CIGAR strings.
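The indexed fetch, read filtering, and pileup features listed above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names (passes_filters, fetch_filtered, region_depth), the default mapping-quality threshold of 20, and any file path passed in are assumptions made for the example. It also assumes a coordinate-sorted BAM file with its index alongside it. pysam is imported inside the functions that open files so that the filter predicate can be tried on its own.

```python
from types import SimpleNamespace


def passes_filters(read, min_mapq=20):
    """Return True for mapped, primary, non-duplicate reads at or above min_mapq.

    `read` is expected to expose the same attributes as pysam.AlignedSegment.
    The threshold of 20 is an illustrative default, not a pysam convention.
    """
    return not (
        read.is_unmapped
        or read.is_secondary
        or read.is_supplementary
        or read.is_duplicate
        or read.mapping_quality < min_mapq
    )


def fetch_filtered(bam_path, contig, start, stop, min_mapq=20):
    """Yield (name, 0-based start, CIGAR string) for filtered reads in a region.

    start/stop are 0-based, half-open. Requires an indexed BAM file.
    """
    import pysam  # lazy import: only needed when a file is actually opened

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig, start, stop):
            if passes_filters(read, min_mapq):
                yield read.query_name, read.reference_start, read.cigarstring


def region_depth(bam_path, contig, start, stop):
    """Return {0-based position: number of reads in the pileup column}.

    truncate=True restricts columns to the requested interval.
    """
    import pysam

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return {
            column.reference_pos: column.nsegments
            for column in bam.pileup(contig, start, stop, truncate=True)
        }


# Stand-in object with AlignedSegment-like attributes, used only to
# demonstrate the predicate without a real BAM file.
demo_read = SimpleNamespace(
    is_unmapped=False,
    is_secondary=False,
    is_supplementary=False,
    is_duplicate=False,
    mapping_quality=60,
)
print(passes_filters(demo_read))  # True
```

With a real file, something like list(fetch_filtered("sample.bam", "chr1", 99, 200)) would return the filtered reads overlapping that interval, where "sample.bam" is a placeholder path. Keeping the predicate separate from the file handling makes the filter criteria easy to adjust and test.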
Practical usage notes and constraints:
- Coordinate Systems: Pysam uses 0-based, half-open intervals in its Python API (for example, the start and stop arguments to fetch), whereas the VCF and SAM text formats and samtools-style region strings such as chr1:100-200 are 1-based and inclusive. For variants, VariantRecord.pos is 1-based while VariantRecord.start is 0-based. Always verify coordinate usage when interacting with different file types.
- Performance: While Pysam is high-performance due to its C-level htslib backend, intensive operations on massive BAM/CRAM files should utilize indexed access rather than sequential scanning whenever possible.
- Dependency: The skill assumes the presence of a Python environment with the pysam library installed via package managers like uv, pip, or conda.
- Inputs/Outputs: Primarily takes path-based file references as input; outputs include programmatically accessible data objects like AlignedSegment, VariantRecord, or sequence strings, suitable for downstream processing in NumPy, Pandas, or other analysis frameworks.
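The coordinate note above is a common source of off-by-one errors, so a small conversion sketch may help. The helper names below (region_to_interval, interval_to_region, vcf_pos_to_start) are hypothetical and not part of pysam. They only show the arithmetic between 1-based inclusive coordinates (region strings, VCF POS) and the 0-based, half-open intervals used by the Python API.

```python
def region_to_interval(region):
    """Convert a 1-based inclusive region string to a 0-based half-open interval.

    rpartition is used so that contig names containing ':' still parse.
    """
    contig, _, span = region.rpartition(":")
    first, _, last = span.replace(",", "").partition("-")
    return contig, int(first) - 1, int(last)


def interval_to_region(contig, start, stop):
    """Convert a 0-based half-open interval back to a 1-based inclusive region string."""
    return f"{contig}:{start + 1}-{stop}"


def vcf_pos_to_start(pos):
    """Convert a 1-based VCF POS value to a 0-based start coordinate."""
    return pos - 1


print(region_to_interval("chr1:100-200"))    # ('chr1', 99, 200)
print(interval_to_region("chr1", 99, 200))   # chr1:100-200
print(vcf_pos_to_start(100))                 # 99
```

Both forms describe the same 101 bases: positions 100 through 200 inclusive in 1-based terms, or the interval from 99 up to but not including 200 in 0-based terms.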
Repository Stats
- Stars: 181
- Forks: 24
- Open Issues: 4
- Language: Python
- Default Branch: main
- Sync Status: Idle
- Last Synced: Apr 29, 2026, 01:49 PM