Published in Nature Portfolio’s npj Digital Medicine, China’s Future Doctor Team’s New Benchmark Aims to Define What “Safe and Effective” Means for Medical AI

News > Technology News

Audio By Carbonatix

11:00 AM on Wednesday, February 4

The Associated Press

NEW YORK--(BUSINESS WIRE)--Feb 4, 2026--

Future Doctor, a China-based AI healthcare technology company, together with 32 clinical experts, has published new research in Nature Portfolio’s npj Digital Medicine proposing a “ Clinical Safety-Effectiveness Dual-Track Benchmark” (CSEDB). The framework is designed to measure whether medical AI systems are safe and effective in real-world clinical decision-making. A comparative evaluation of leading large language models, including OpenAI’s o3, Google’s Gemini 2.5 Pro, and Anthropic’s Claude 3.7 Sonnet, found the company’s own MedGPT scoring highest across all key categories.

This press release features multimedia. View the full release here: https://www.businesswire.com/news/home/20260204052708/en/

Published in Nature Portfolio’s npj Digital Medicine, China’s Future Doctor team’s new benchmark aims to define what “safe and effective” means for medical AI

The research arrives amid a surge of new medical AI model releases, intensifying attention on the gap between strong general-purpose performance into the safety and reliability required in clinical practice, where hallucinations and confidently stated errors remain a key bottleneck. Against this backdrop, the work highlights evaluation as a persistent bottleneck for real clinical use. Most medical AI evaluations still resemble standardized tests, while safety-critical failure modes — such as missed urgent symptoms, contraindicated recommendations, or failures in multi-condition prioritization — often do not show up in accuracy-only scoring.

CSEDB is designed to make such risks measurable in a repeatable scoring system. According to the paper, CSEDB includes 30 indicators split into 17 safety metrics and 13 effectiveness metrics. The benchmark was developed with 32 clinical experts spanning 23 core specialties at leading medical institutions in China, including Peking Union Medical College Hospital, the Cancer Hospital of the Chinese Academy of Medical Sciences, Chinese PLA General Hospital, Shanghai Tongji Hospital, Huashan Hospital of Fudan University, Peking University Hospital of Stomatology, and Fuwai Hospital of the Chinese Academy of Medical Sciences. It also uses 2,069 open-ended Q&A items across 26 specialties to better simulate real-world clinical reasoning.

In the study’s evaluation, Future Doctor’s proprietary MedGPT ranked highest in overall, safety, and effectiveness scores under CSEDB. Notably, MedGPT demonstrated a stronger safety profile relative to many general-purpose systems, where safety performance often lags behind effectiveness. That divergence raises a broader question about whether clinical adoption will rely on increasingly capable general-purpose models or systems optimized for safety from the start.

If CSEDB gains adoption, it could shift evaluation from “can it answer medical questions?” to “can it operate safely under clinical constraints?” For hospitals and AI developers, benchmarks like this may become part of the infrastructure required for real-world rollout, procurement, and oversight. Future Doctor emphasizes that MedGPT is designed to support clinicians, with clinical use governed by local regulatory and institutional requirements.

View source version on businesswire.com:https://www.businesswire.com/news/home/20260204052708/en/

CONTACT: Yoyo Yi

Tel.：0086-13702430331

E-mail：[email protected]

KEYWORD: CHINA UNITED STATES NORTH AMERICA ASIA PACIFIC NEW YORK

INDUSTRY KEYWORD: VENTURE CAPITAL HEALTH ARTIFICIAL INTELLIGENCE HEALTH TECHNOLOGY PROFESSIONAL SERVICES TECHNOLOGY OTHER TECHNOLOGY OTHER HEALTH

SOURCE: Future Doctor

PUB: 02/04/2026 11:00 AM/DISC: 02/04/2026 11:00 AM

http://www.businesswire.com/news/home/20260204052708/en