pytdc
Access AI-ready datasets, benchmarks, and molecular oracles for drug discovery, including ADME, toxicity, DTI, and molecular generation tasks.
Introduction
PyTDC is the Python package for Therapeutics Data Commons (TDC), an open-science platform for machine learning in drug discovery and therapeutic development. It acts as a central hub for standardized datasets, evaluation metrics, and model benchmarks used by researchers and AI agents working on pharmaceutical problems. Its curated data spans the therapeutics pipeline, supporting training, testing, and validation of models that range from small-molecule property prediction to large-scale interaction networks.
- Access datasets for single-instance prediction, including ADME (Absorption, Distribution, Metabolism, Excretion), toxicity profiles (hERG, AMES, DILI), and high-throughput screening (HTS) bioactivity data.
- Handle multi-instance prediction tasks such as drug-target interaction (DTI) with BindingDB or DAVIS, drug-drug interaction (DDI) with DrugBank, and protein-protein interaction (PPI) networks.
- Support molecule generation tasks for molecule discovery, with scaffold-based splitting available to test how well models generalize to unseen scaffolds.
- Use molecular oracles for property-directed optimization, scoring and refining novel molecules during the generation process.
- Support standardized data splitting methods including random, scaffold, and cold-start splits to simulate realistic clinical or experimental conditions.
- Install the library with pip (pip install PyTDC, or uv pip install PyTDC when using uv) and use a consistent import pattern (from tdc.<problem> import <Task>) to fetch datasets as data frames.
- Typical inputs are dataset identifiers within the single_pred, multi_pred, or generation categories; outputs are standardized pandas DataFrames.
- Intended for scientific or pharmaceutical ML contexts where pharmacokinetic or biochemical binding constraints apply.
- Best suited for agents that need to benchmark model performance against established pharmaceutical datasets or iterate on molecule design workflows.
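The loading pattern, split methods, and oracle scoring described in the list above can be sketched roughly as follows. This is a minimal sketch, not a definitive recipe: the dataset names (Caco2_Wang, DAVIS) and the QED oracle are illustrative choices assumed to be available in the installed PyTDC version, and the helper function names are hypothetical wrappers introduced here. Imports are done lazily inside each function because the first call downloads data and needs network access.

```python
"""Sketch of the PyTDC loading pattern. Helper names are hypothetical."""


def load_adme_scaffold_split(name="Caco2_Wang", seed=42):
    """Load a single-instance ADME dataset and return a scaffold split."""
    # Import pattern from the text: from tdc.<problem> import <Task>
    from tdc.single_pred import ADME

    data = ADME(name=name)  # downloads and caches the dataset on first use
    # Returns a dict with 'train', 'valid', and 'test' pandas DataFrames.
    return data.get_split(method="scaffold", seed=seed, frac=[0.7, 0.1, 0.2])


def load_dti_cold_split(name="DAVIS", column_name="Drug"):
    """Load a drug-target interaction dataset with a cold-start split.

    With column_name="Drug", drugs in the test set do not appear in training.
    """
    from tdc.multi_pred import DTI

    data = DTI(name=name)
    return data.get_split(method="cold_split", column_name=column_name)


def score_with_oracle(smiles_list, oracle_name="QED"):
    """Score SMILES strings with a molecular oracle (assumed name: QED)."""
    from tdc import Oracle

    oracle = Oracle(name=oracle_name)
    return oracle(smiles_list)  # one score per input molecule


def split_sizes(split):
    """Map each partition name to its number of rows."""
    return {part: len(rows) for part, rows in split.items()}
```

For example, split_sizes(load_adme_scaffold_split()) would report how many rows land in each partition, and score_with_oracle(["CCO"]) would return a one-element list of scores. Wrapping the calls in functions keeps the download step explicit, so an agent can decide when to trigger it.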
Repository Stats
- Stars: 19,782
- Forks: 2,207
- Open Issues: 41
- Language: Python
- Default Branch: main
- Sync Status: Idle
- Last Synced: Apr 30, 2026, 09:58 AM