
pytdc

Access AI-ready datasets, benchmarks, and molecular oracles for drug discovery, including ADME, toxicity, DTI, and molecular generation tasks.

Introduction

PyTDC is the Python package for Therapeutics Data Commons (TDC), an open-science platform for machine learning in drug discovery and therapeutic development. It provides standardized datasets, evaluation metrics, and model benchmarks for researchers and AI agents working on pharmaceutical problems. Its curated data spans the therapeutics pipeline and supports training, testing, and validation of models ranging from small-molecule property prediction to large-scale interaction networks.

  • Access a vast repository of datasets for single-instance prediction, including ADME (Absorption, Distribution, Metabolism, Excretion), toxicity profiles (hERG, AMES, DILI), and High-Throughput Screening (HTS) bioactivity data.

  • Handle multi-instance prediction tasks such as Drug-Target Interaction (DTI) using BindingDB or DAVIS, Drug-Drug Interaction (DDI) using DrugBank, and Protein-Protein Interaction (PPI) networks.

  • Use generation-task datasets for molecule discovery, and scaffold-based splits to check how well models generalize to chemical scaffolds not seen during training.

  • Leverage molecular oracles for property-directed optimization, allowing users to score and refine novel molecules during the generation process.

  • Apply standardized data splitting methods, including random, scaffold, and cold-start splits, to simulate realistic clinical or experimental conditions.

  • Install the library with pip (pip install PyTDC) or uv (uv pip install PyTDC), then use a consistent programmatic pattern (from tdc.<problem> import <Task>) to fetch datasets as data frames.

  • Typical inputs include dataset identifiers within categories like single_pred, multi_pred, or generation; outputs are provided as standardized pandas DataFrames.

  • Use it in scientific or pharmaceutical ML contexts where pharmacokinetic or biochemical binding constraints apply.

  • The tool is best suited for agents that need to benchmark model performance against state-of-the-art pharmaceutical datasets or iterate on molecule design workflows.
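
The loading pattern described in the list above can be sketched roughly as follows. This is a minimal sketch, not a definitive recipe: it assumes PyTDC is installed, that datasets can be downloaded on first use, and that the installed version accepts the split arguments shown. The helper function names and the dataset names chosen as defaults (Caco2_Wang, DAVIS, QED) are illustrative choices, not something this page prescribes.

```python
# Sketch of the `from tdc.<problem> import <Task>` pattern.
# Assumption: PyTDC is installed and can download data on first use.
# Imports live inside the functions so this file loads without PyTDC.


def load_adme_split(name="Caco2_Wang", method="scaffold", seed=42):
    """Single-instance prediction: return train/valid/test DataFrames."""
    from tdc.single_pred import ADME

    data = ADME(name=name)
    # get_split returns a dict keyed by 'train', 'valid', and 'test'.
    return data.get_split(method=method, seed=seed, frac=[0.7, 0.1, 0.2])


def load_dti_cold_split(name="DAVIS", column="Drug", seed=42):
    """Multi-instance prediction with a cold-start split on one entity."""
    from tdc.multi_pred import DTI

    data = DTI(name=name)
    # Assumption: the installed version supports method='cold_split'
    # together with column_name; older releases may differ.
    return data.get_split(method="cold_split", column_name=column, seed=seed)


def score_with_oracle(smiles, name="QED"):
    """Score one SMILES string (or a list of them) with a molecular oracle."""
    from tdc import Oracle

    oracle = Oracle(name=name)
    return oracle(smiles)
```

Calling load_adme_split()["train"].head() would then show the first rows of the training frame, and score_with_oracle("CCO") would score a single molecule, assuming the oracle's dependencies are installed.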

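To make the cold-start idea mentioned above concrete, here is a small standard-library sketch of the principle: every distinct entity (for example, a drug) is assigned to exactly one fold, so the test fold contains only entities the model never saw in training. This illustrates the concept only and is not PyTDC's implementation; the function name cold_split and the toy records are hypothetical.

```python
import random


def cold_split(records, key, frac=(0.7, 0.1, 0.2), seed=42):
    """Split records so that each distinct value of `key` lands in one fold."""
    # Shuffle the unique entities, not the rows, so folds never share one.
    entities = sorted({r[key] for r in records})
    random.Random(seed).shuffle(entities)

    n_train = int(frac[0] * len(entities))
    n_valid = int(frac[1] * len(entities))

    fold_of = {}
    for i, entity in enumerate(entities):
        if i < n_train:
            fold_of[entity] = "train"
        elif i < n_train + n_valid:
            fold_of[entity] = "valid"
        else:
            fold_of[entity] = "test"

    split = {"train": [], "valid": [], "test": []}
    for r in records:
        split[fold_of[r[key]]].append(r)
    return split


# Toy drug-target pairs: 10 hypothetical drugs, 3 hypothetical targets.
pairs = [{"Drug": f"D{i % 10}", "Target": f"T{i % 3}", "Y": i} for i in range(30)]
folds = cold_split(pairs, key="Drug")
```

Splitting on the entity rather than on individual rows is what distinguishes this from a random split, where the same drug could appear in both training and test data.
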
Repository Stats

Stars: 19,782
Forks: 2,207
Open Issues: 41
Language: Python
Default Branch: main
Sync Status: Idle
Last Synced: Apr 30, 2026, 09:58 AM