EHRAgent
Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records

1Georgia Institute of Technology, 2Emory University, 3University of Washington
*Indicates Equal Contribution

EHRAgent is an autonomous LLM agent with external tools and code interface for improved multi-tabular reasoning across EHRs.

Abstract

Large language models (LLMs) have demonstrated exceptional capabilities in planning and tool utilization as autonomous agents, but few have been developed for medical problem-solving. We propose EHRAgent, an LLM agent empowered with a code interface, to autonomously generate and execute code for multi-tabular reasoning within electronic health records (EHRs). First, we formulate an EHR question-answering task into a tool-use planning process, efficiently decomposing a complicated task into a sequence of manageable actions. By integrating interactive coding and execution feedback, EHRAgent learns from error messages and improves the originally generated code through iterations. Furthermore, we enhance the LLM agent by incorporating long-term memory, which allows EHRAgent to effectively select and build upon the most relevant successful cases from past experiences. Experiments on three real-world multi-tabular EHR datasets show that EHRAgent outperforms the strongest baseline by up to 29.6% in success rate. EHRAgent leverages the emerging few-shot learning capabilities of LLMs, enabling autonomous code generation and execution to tackle complex clinical tasks with minimal demonstrations.

Examples


Main Results

EHRAgent significantly outperforms all the baselines on all three datasets, with performance gains of 19.92%, 12.41%, and 29.60%, respectively. This indicates the efficacy of our key designs, namely interactive coding with environment feedback and domain knowledge injection, as they gradually refine the generated code and provide sufficient background knowledge during the planning process.


Effect of Question Complexity

[Figure 3: model performance under varying levels of question complexity]

We take a closer look at model performance by considering multi-dimensional measurements of question complexity, exhibited in Figure 3. Although the performance of both EHRAgent and the baselines generally decreases as task complexity increases (quantified either as more elements in queries or more columns in solutions), EHRAgent consistently outperforms all the baselines at every level of difficulty.

Sample Efficiency

[Figure: success and completion rates vs. number of demonstrations]

The above figure illustrates model performance with respect to the number of demonstrations for EHRAgent and the two strongest baselines, AutoGen and Self-Debugging. Compared to supervised learning (e.g., text-to-SQL), which requires extensive training on over 10K samples with detailed annotations (e.g., SQL code), LLM agents enable complex tabular reasoning using only a few demonstrations. One interesting finding is that as the number of examples increases, both the success rate and completion rate of AutoGen tend to decrease, mainly due to the context limitation of LLMs. Notably, the performance of EHRAgent remains stable with more demonstrations, which may benefit from its integration of a rubber duck debugging module and the adaptive mechanism for selecting the most relevant demonstrations.
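The "adaptive mechanism for selecting the most relevant demonstrations" amounts to retrieving, from a memory of past successful cases, the few that best match the new query. A minimal sketch of that retrieval step is below, using simple token-overlap (Jaccard) similarity; this is an illustrative assumption, not the authors' implementation, which could instead rely on learned embeddings.

```python
# Illustrative sketch (NOT the authors' implementation) of selecting the
# most relevant successful past cases from long-term memory. Relevance
# here is token-overlap (Jaccard) similarity between questions; a real
# system might use embedding-based retrieval instead.

def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two questions, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_demonstrations(query: str, memory: list[dict], k: int = 2) -> list[dict]:
    """Return the k stored (question, code) pairs most similar to the query."""
    return sorted(memory, key=lambda m: jaccard(query, m["question"]),
                  reverse=True)[:k]

# Toy long-term memory of previously solved cases (code bodies elided).
memory = [
    {"question": "how many patients were prescribed aspirin", "code": "..."},
    {"question": "what is the average length of stay", "code": "..."},
    {"question": "how many patients received warfarin", "code": "..."},
]
top = select_demonstrations("how many patients were prescribed warfarin", memory)
print([m["question"] for m in top])
```

Selecting demonstrations per query, rather than packing a fixed set into the prompt, is one plausible reason the success rate stays flat as the demonstration pool grows: the prompt length stays bounded regardless of memory size.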

Case Studies


BibTeX

@misc{shi2024ehragent,
  title={EHRAgent: Code Empowers Large Language Models for Complex Tabular Reasoning on Electronic Health Records},
  author={Wenqi Shi and Ran Xu and Yuchen Zhuang and Yue Yu and Jieyu Zhang and Hang Wu and Yuanda Zhu and Joyce Ho and Carl Yang and May D. Wang},
  year={2024},
  eprint={2401.07128},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}