
pysam

Genomic file toolkit for NGS data processing. Read and write SAM/BAM/CRAM alignments and VCF/BCF variants, and read FASTA/FASTQ sequences, through pysam's Pythonic interface to htslib.

Introduction

Pysam is a Python wrapper for htslib, built for bioinformatics and next-generation sequencing (NGS) data processing pipelines. It provides a Pythonic interface to high-throughput genomic data files, so developers and researchers can manipulate large sequencing datasets programmatically. The skill is intended for bioinformaticians, data scientists, and computational biologists who need to prototype quickly or build production-grade genomic analysis workflows.

Key features and capabilities include:

  • Reading and writing of the standard alignment and variant formats (SAM, BAM, and CRAM for read alignments; VCF and BCF for variant data), plus reading of FASTA and FASTQ sequence files.
  • Regional queries and random access through index files (BAI/CSI for BAM, CRAI for CRAM, tabix for bgzip-compressed VCF and other tab-delimited files, FAI for FASTA), enabling efficient data extraction from large alignment or variant files without loading entire datasets into memory.
  • Built-in capabilities for pileup analysis, allowing users to calculate per-base coverage, read depth, and quality statistics across specific genomic intervals.
  • Direct integration with samtools and bcftools command-line functionality, bridging the gap between existing command-line tools and Python-based automation.
  • Support for complex genomic operations, including filtering reads by mapping quality, flags, or genomic coordinates, as well as accessing raw quality scores and CIGAR strings.
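
The regional-query and read-filtering features listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper names passes_filter and count_filtered_reads and the default threshold of 30 are hypothetical, and the BAM file is assumed to be coordinate-sorted with its index stored next to it.

```python
def passes_filter(read, min_mapq=30):
    """Return True for mapped, primary, non-duplicate reads with MAPQ >= min_mapq.

    `read` only needs the pysam.AlignedSegment attributes used below.
    """
    return (
        not read.is_unmapped
        and not read.is_secondary
        and not read.is_supplementary
        and not read.is_duplicate
        and read.mapping_quality >= min_mapq
    )


def count_filtered_reads(bam_path, contig, start, stop, min_mapq=30):
    """Count reads overlapping [start, stop) (0-based, half-open) that pass the filter."""
    import pysam  # imported here so passes_filter stays usable without htslib

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # fetch() with a contig and coordinates requires an index (.bai or .csi)
        return sum(
            1
            for read in bam.fetch(contig, start, stop)
            if passes_filter(read, min_mapq)
        )
```

A call such as count_filtered_reads("sample.bam", "chr1", 100, 200) (with a hypothetical file name) would then count the qualifying reads that overlap the first-listed interval. Keeping the predicate separate from the file handling makes the filtering rule easy to test and reuse.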

Practical usage notes and constraints:

  • Coordinate Systems: Pysam uses 0-based, half-open intervals for numeric start/stop arguments and for attributes such as reference_start. Region strings (for example "chr1:101-200") instead follow the samtools convention of 1-based, inclusive coordinates, and the SAM and VCF text formats are themselves 1-based. For variants, VariantRecord.pos is 1-based while VariantRecord.start is 0-based. Always verify which convention applies when moving between file types.
  • Performance: Pysam is fast because htslib does the heavy lifting in C, but on large BAM/CRAM files prefer indexed access to the regions of interest over scanning the whole file sequentially.
  • Dependency: The skill assumes the presence of a Python environment with the pysam library installed via package managers like uv, pip, or conda.
  • Inputs/Outputs: Primarily takes path-based file references as input; outputs include programmatically accessible data objects like AlignedSegment, VariantRecord, or sequence strings, suitable for downstream processing in NumPy, Pandas, or other analysis frameworks.
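
The coordinate-system note above is a common source of off-by-one errors, so a small conversion sketch may help. The helper names region_to_interval and interval_to_region are hypothetical and not part of pysam; the sketch assumes the region string always carries both a start and an end.

```python
def region_to_interval(region):
    """Convert a samtools-style region 'contig:start-end' (1-based, inclusive)
    into (contig, start, stop) using 0-based, half-open coordinates."""
    # rpartition keeps contig names that themselves contain ':' intact
    contig, _, span = region.rpartition(":")
    # samtools-style regions may use thousands separators, e.g. 1,000-2,000
    start_1, _, end_1 = span.replace(",", "").partition("-")
    return contig, int(start_1) - 1, int(end_1)


def interval_to_region(contig, start, stop):
    """Convert 0-based, half-open (start, stop) into a 1-based, inclusive region string."""
    return f"{contig}:{start + 1}-{stop}"


print(region_to_interval("chr1:101-200"))
print(interval_to_region("chr1", 100, 200))
```

Both forms describe the same 100 bases: in pysam, fetching with the numeric arguments ("chr1", 100, 200) and fetching with region="chr1:101-200" select the same interval.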

Repository Stats

Stars: 181
Forks: 24
Open Issues: 4
Language: Python
Default Branch: main
Last Synced: Apr 29, 2026, 01:49 PM