MedAgentGym Documentation
Info
This repository provides documentation for MedAgentGym: Interactive Medical Coding Environment. The codebase is designed to clearly expose implementation details and allow easy extensibility. If you encounter any issues or have suggestions, please feel free to reach out—we welcome your feedback!
MedAgentGym is the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. We are actively expanding the benchmark to include more medical coding backbone models, agent scaffolds, and datasets. This is an arduous task, and we welcome contributions or collaboration in any form.
Info
The remainder of this page introduces the basic information and structure of the MedAgentGym project. For usage or customization, please visit the Experiments or Customization pages.
Backbones
The following backbone models are implemented in MedAgentGym, with their performance benchmarked and discussed in the article.
Backbone Models | Paper | Model Link |
---|---|---|
API-based Proprietary LLMs | ||
gpt-4o-mini | link | link |
gpt-4o | link | link |
gpt-4.1(-mini) | link | link |
o4-mini | link | link |
OSS (Base Size): Less than 10B parameters | ||
Gemma-3-4b-it | link | HuggingFace |
Qwen3-1.7B | link | HuggingFace |
Qwen3-4B | link | HuggingFace |
Qwen3-8B | link | HuggingFace |
Qwen2.5-7B-Instruct | link | HuggingFace |
Llama-3.1-8B-Instruct | link | HuggingFace |
Ministral-8B | link | HuggingFace |
OSS (Large Size): 10 - 30B parameters | ||
Qwen3-14B | link | HuggingFace |
Qwen2.5-14B-Instruct | link | HuggingFace |
DeepSeek-R1-Distill-Qwen-14B | link | HuggingFace |
OSS (XL Size): More than 30B parameters | ||
Qwen3-32B | link | HuggingFace |
Qwen2.5-32B-Instruct | link | HuggingFace |
DeepSeek-R1-Distill-Qwen-32B | link | HuggingFace |
QwQ-32B | link | HuggingFace |
Llama-3.1-70B-Instruct | link | HuggingFace |
DeepSeek-R1-Distill-Llama-70B | link | HuggingFace |
Coding LLMs and Medical Reasoning LLMs | ||
Qwen2.5-Coder-7B-Instruct | link | HuggingFace |
Qwen2.5-Coder-14B-Instruct | link | HuggingFace |
HuatuoGPT-o1-7B | link | HuggingFace |
m1-7B-23K | link | HuggingFace |
MedReason-8B | link | HuggingFace |
Baichuan-M1-14B-Instruct | link | HuggingFace |
MedAgentGym also supports easy integration of your own backbone models. To use your own backbones, please check the customization guide.
Supported Medical Reasoning Tasks
Currently, MedAgentGym supports training and evaluation over 12 authentic real-world biomedical datasets.
Dataset | Data Type | # Task Types | Paper | Data Link |
---|---|---|---|---|
Training and Internal Validation (In-Distribution) | ||||
MIMIC-III | Tabular | 9 | MIMIC-III, EHRSQL, EHRAgent | Raw, Preprocessed1, Preprocessed2 |
eICU | Tabular | 10 | eICU, EHRSQL, EHRAgent | Raw, Preprocessed1, Preprocessed2 |
TREQS | Tabular | 4 | TREQS, EHRAgent | Preprocessed1, Preprocessed2 |
MedCalcBench | Text | 55 | MedCalcBench | Data |
MedAgentBench | Tabular | 10 | MedAgentBench | Data |
BioCoder | Text | 8 | BioCoder | link |
EHRShot | Tabular | 15 | EHRShot | link |
BioDSBench | Text | 12 | BioDSBench | link |
External Validation (Out-Distribution) | ||||
EHR-SeqSQL | Tabular | 4 | EHR-SeqSQL | link |
EHRCon | Tabular | 3 | EHRCon | link |
MIMIC-Extract | Tabular | 3 | MIMIC-Extract | link |
N-PowerAI | Text | 6 | NPowerAI | - |
Data
Info
All our prepared data can be downloaded via the download script on GitHub, which pulls the data from a private HuggingFace repository owned by an anonymous account.
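At its core, such a download is just a snapshot pull from the HuggingFace Hub. The sketch below is only an illustration, not the actual script from the repository; the repository ID, local directory, and token are placeholders you would replace with the values used by the official download script.

```python
# Illustrative sketch only -- the official download script on GitHub is the
# supported way to fetch the data. Repo ID, paths, and token are placeholders.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="anonymous/medagentgym-data",  # placeholder, not the real repo ID
    repo_type="dataset",
    local_dir="./data",                    # where the task files will be stored
    token="hf_xxx",                        # access token for the private repo
)
```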
MedAgentGym focuses on verifiable medical reasoning tasks that benefit from code-based solutions. Clinically, we prioritize tasks originating from real-world healthcare scenarios and validated by a multi-disciplinary panel of healthcare experts. For example, MedAgentGym includes the MIMIC-III and eICU tasks from EHRSQL, which were collected from 222 hospital staff members and annotated by human programmers. Computationally, we integrate diverse biomedical coding tasks, ranging from structured medical information retrieval to open-ended biomedical research, ensuring comprehensive coverage and task diversity.
To standardize tasks across various sources, each instance in MedAgentGym is structured with: (1) a problem description, (2) verifiable ground-truth outputs, and (3) optional data resources (e.g., EHRs). Additionally, standardized system and user prompts are designed to initiate the problem-solving process. MedAgentGym is highly flexible, easily accommodating new tasks that include clear descriptions and verifiable ground-truth outputs. For coding-centric tasks that provide only code solutions (e.g., BioCoder), we perform verification based on the execution output of the provided code solution, which is more reliable than the code alone. For tasks involving additional data resources (e.g., EHRSQL), we include metadata on data access and sources. Additional task-specific preparation details are documented.
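As a concrete illustration of output-based verification, the minimal sketch below executes a candidate code solution in a subprocess and compares its standard output against the stored ground-truth answer. This is a simplification, not the verification logic shipped with MedAgentGym; the function name, normalization, and timeout are our own assumptions.

```python
import subprocess

def verify_by_execution(candidate_code: str, ground_truth: str, timeout: int = 30) -> bool:
    """Run a candidate solution and compare its stdout to the ground-truth answer.

    Simplified illustration: the actual MedAgentGym verifier runs inside an
    isolated Docker environment and handles task-specific answer formats.
    """
    try:
        result = subprocess.run(
            ["python", "-c", candidate_code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    # Normalize surrounding whitespace before comparison (assumption for this sketch).
    return result.stdout.strip() == ground_truth.strip()
```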
Typically, each dataset contains two primary files: `train_tasks.jsonl` and `test_tasks.jsonl`. Additionally, depending on the task type, you may include supplementary data files that the agent needs to access during coding. For example, database-related tasks should include the database files (`*.csv`), or the SQL-integrated version (`*.db`), to enable data access by the agent. For machine learning tasks requiring label prediction based on training features, ensure that both serialized feature files (`*.pkl`) and corresponding label files (`*.csv`) are present; in EHRShot and MIMIC-Extract, we provide the features as `*.pkl` files and the labels as `*.csv` files.
Besides individual dataset files, we also provide a `metadata.json` file containing metadata for all datasets, including the total number of tasks designated for training and testing.
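To make the expected layout concrete, the sketch below walks a local data directory and checks that each dataset folder provides the two task files described above. The `./data` root is an assumption (it should point at wherever the download script placed the data), and we only print `metadata.json` rather than assume its exact schema.

```python
import json
from pathlib import Path

DATA_ROOT = Path("./data")  # assumed location after running the download script

# metadata.json aggregates per-dataset task counts; print a preview of it
# instead of assuming specific keys.
metadata = json.loads((DATA_ROOT / "metadata.json").read_text())
print(json.dumps(metadata, indent=2)[:500])

# Every dataset directory is expected to ship train/test task files.
for dataset_dir in sorted(p for p in DATA_ROOT.iterdir() if p.is_dir()):
    for required in ("train_tasks.jsonl", "test_tasks.jsonl"):
        if not (dataset_dir / required).exists():
            print(f"[warning] {dataset_dir.name} is missing {required}")
```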
Each `*_tasks.jsonl` file consists of the following fields:

- `idx`: A unique identifier for each query sample, typically used for debugging and ensuring reproducibility.
- `question`: The question presented to the model. It is designed to assess the model's capability for code-based reasoning, particularly in solving bio-statistics or biomedical computational tasks.
- `answer`: A verifiable, correct solution to the posed question. We do not require intermediate solutions or ground-truth code snippets. For computational tasks, the provided ground-truth answers are directly utilized. For coding tasks with known ground-truth solutions, we execute the provided code beforehand and use its output as the final answer. We deliberately exclude the original code from the answer, encouraging the model to generate diverse and innovative solutions through coding.
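For example, these fields can be inspected directly with a few lines of Python; the file path below is an assumption and should point at one of the downloaded datasets.

```python
import json
from pathlib import Path

# Assumed path; substitute any downloaded dataset directory.
task_file = Path("./data/medcalcbench/test_tasks.jsonl")

with task_file.open() as f:
    tasks = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(tasks)} task instances")
first = tasks[0]
print(first["idx"])       # unique identifier
print(first["question"])  # natural-language problem description
print(first["answer"])    # verifiable ground-truth output
```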
Agent Scaffolds
Following CodeAct, we introduce a default agent scaffold designed for systematic evaluation of coding-based medical reasoning. Interactions within MedAgentGym are formulated as a Partially Observable Markov Decision Process (POMDP), where tasks are represented as medical reasoning problems sampled from a set \(P\). At each timestep \(t\), the agent receives an observation \(o_t\in\mathcal{O}\) and determines the next action \(a_{t+1}\in\mathcal{A}\) based on the interaction history. The action space \(\mathcal{A}\) includes the following four action types:
- `request_info`: Retrieves relevant data from external sources such as Electronic Health Records (EHRs).
- `terminal`: Handles dependencies or manages local files within isolated Docker environments.
- `code execution`: Executes code generated by LLMs through an integrated interpreter.
- `debugging`: Converts code execution errors into comprehensible natural language explanations, enriched with detailed error information to enhance LLM understanding.
Additional details can be found in the `./env/action/` directory.
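Purely as an orientation aid, the sketch below shows one way the four action types could be represented and dispatched. All class, enum, and handler names here are our own illustrative assumptions and do not mirror the actual code in `./env/action/`.

```python
from enum import Enum
from typing import Callable, Dict

class ActionType(Enum):
    REQUEST_INFO = "request_info"      # retrieve data from external sources (e.g., EHRs)
    TERMINAL = "terminal"              # manage dependencies/files in the Docker sandbox
    CODE_EXECUTION = "code_execution"  # run LLM-generated code in the interpreter
    DEBUGGING = "debugging"            # translate execution errors into natural language

def build_dispatcher() -> Dict[ActionType, Callable[[str], str]]:
    """Map each action type to a handler; bodies are stubs for illustration."""
    return {
        ActionType.REQUEST_INFO: lambda arg: f"<records for {arg}>",
        ActionType.TERMINAL: lambda cmd: f"<terminal output of {cmd!r}>",
        ActionType.CODE_EXECUTION: lambda code: f"<stdout of executing {len(code)} chars>",
        ActionType.DEBUGGING: lambda err: f"<explanation of error: {err}>",
    }

# A single agent step: choose an action and observe the environment's response.
dispatcher = build_dispatcher()
observation = dispatcher[ActionType.CODE_EXECUTION]("print(1 + 1)")
```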
Citation
If you find our work helpful, please consider citing it as
@article{xu2025medagentgym,
title={MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale},
author={Ran Xu and Yuchen Zhuang and Yishan Zhong and Yue Yu and Xiangru Tang and Hang Wu and May D. Wang and Peifeng Ruan and Donghan Yang and Tao Wang and Guanghua Xiao and Carl Yang and Yang Xie and Wenqi Shi},
journal={arXiv preprint arXiv:2506.04405},
year={2025}
}