【中翻】医疗器械的良好机器学习实践（GMLP）：指导原则_马丁学习摘要

前几日某位大佬看似不经意地在群里提问，实则展示自己在做项目的前沿性。（见下图）

为了追随大佬的步伐，近日学习了一下由美国FDA, Health Canada和英国MHRA（笔者注：为什么没有EU？难道英国即是EU？）共同起草的一个良好机器学习实践（GMLP）的指导原则（

指导原则）文件，特与大家分享。

医疗器械的良好机器学习实践：指导原则

Good Machine Learning Practice for Medical Device Development: Guiding Principle

美国食品药品监督管理局（FDA）、加拿大健康部（Health Canada）和英国药品和医疗产品监管局（MHRA）共同确定了10项指导原则，可指导良好的机器学习实践（Good Machine Learning Practice，GMLP）的发展。这些指导原则将有助于促进使用人工智能和机器学习（AI/ML）的安全、有效和高质量医疗器械的发展。

The U.S. Food and Drug Administration (FDA), Health Canada, and the United Kingdom’s Medicines and Healthcare products Regulatory Agency (MHRA) have jointly identified 10 guiding principles that can inform the development of Good Machine Learning Practice (GMLP). These guiding principles will help promote safe,

effective, and high-quality medical devices that use artificial intelligence and machine learning (AI/ML).

人工智能和机器学习技术有潜力通过从每天交付医疗过程中产生的大量数据中得出新的重要见解，从而改变医疗保健领域。它们使用软件算法从实际使用中学习，并在某些情况下可能使用这些信息来改善产品的性能。但由于其复杂性和迭代和数据驱动的开发特性，它们也呈现出独特的考虑因素。

Artificial intelligence and machine learning technologies have the potential to transform health care by deriving new and important insights from the vast amount of data generated during the delivery of health care every day. They use software algorithms to learn from real-world use and in some situations may use this information to improve the product’s performance. But they also present unique considerations due to their complexity and the iterative and data-driven nature of their development.

这10项指导原则旨在奠定开发良好机器学习实践的基础，以应对这些产品的独特性质。它们还将有助于培养未来这个快速发展领域的增长。

These 10 guiding principles are intended to lay the foundation for developing Good Machine Learning Practice that addresses the unique nature of these products. They will also help cultivate future growth in this rapidly progressing field.

这10项指导原则确定了国际医疗器械监管论坛（IMDRF）、国际标准组织和其他合作机构可以合作推进GMLP的领域。合作的领域包括研究、创建教育工具和资源、国际协调和共识标准，这可能有助于制定监管政策和监管指南。

The 10 guiding principles identify areas where the International Medical Device Regulators Forum (IMDRF), international standards organizations, and other collaborative bodies could work to advance GMLP. Areas of collaboration include research, creating educational tools and resources, international harmonization, and consensus standards, which may help inform regulatory policies and regulatory guidelines.

我们设想这些指导原则可用于：

We envision these guiding principles may be used to:

• 采用在其他行业已被证明的良好实践

• Adopt good practices that have been proven in other sectors

• 调整其他行业的实践，使其适用于医疗技术和医疗保健行业

• Tailor practices from other sectors so they are applicable to medical technology and the health care sector

• 创建特定于医疗技术和医疗保健行业的新实践

• Create new practices specific for medical technology and the health care sector

随着AI/ML医疗器械领域的发展，GMLP最佳实践和共识标准也必须不断进化。如果我们要使利益相关者推动这个领域的负责任创新，与我们的国际公共卫生伙伴建立牢固的合作伙伴关系将至关重要。因此，我们期望这项初步的合作工作可以为我们更广泛的国际交往提供参考，包括与IMDRF的交往

As the AI/ML medical device field evolves, so too must GMLP best practice and consensus standards. Strong partnerships with our international public health partners will be crucial if we are to empower stakeholders to advance responsible innovations in this area. Thus, we expect this initial collaborative work can inform our broader international engagements, including with the IMDRF.

指导原则Guiding Principles

全产品生命周期中充分利用多学科专业知识：对模型预期的集成到临床工作流程、期望的益处及相关患者风险的深入理解，可以帮助确保基于机器学习的医疗器械在设备的整个生命周期内是安全、有效的，并解决临床上有意义的需求。

Multi-Disciplinary Expertise Is Leveraged Throughout the Total Product Life Cycle: In-depth understanding of a model’s intended integration into clinical workflow, and the desired benefits and associated patient risks, can help ensure that MLenabled medical devices are safe and effective and address clinically meaningful needs over the lifecycle of the device.

实施良好的软件工程和安全实践：模型设计需要注意“基本原则”：良好的软件工程实践、数据质量保证、数据管理和强大的网络安全实践。这些实践包括方法论风险管理和设计过程，可以适当地捕捉和传达设计、实施和风险管理的决策和原理，同时确保数据的真实性和完整性。

Good Software Engineering and Security Practices Are Implemented: Model design is implemented with attention to the “fundamentals”: good software engineering practices, data quality assurance, data management, and robust cybersecurity practices. These practices include methodical risk management and design process that can appropriately capture and communicate design, implementation, and risk management decisions and rationale, as well as ensure data authenticity and integrity

临床研究参与者和数据集代表预期的患者人群：数据收集协议应确保预期患者人群的相关特征（例如年龄、性别、种族和族裔）以及使用和测量输入在临床研究、培训和测试数据集中有足够规模的样本，以便结果可以合理地推广到感兴趣的人群。这对于管理任何偏差、促进适当和可推广的性能，评估可用性以及确定模型可能表现不佳的情况非常重要。

Clinical Study Participants and Data Sets Are Representative of the Intended Patient Population: Data collection protocols should ensure that the relevant characteristics of the intended patient population (for example, in terms of age, gender, sex, race, and ethnicity), use, and measurement inputs are sufficiently represented in a sample of adequate size in the clinical study and training and test datasets, so that results can be reasonably generalized to the population of interest. This is important to manage any bias, promote appropriate and generalizable performance across the intended patient population, assess usability, and identify circumstances where the model may underperform.

训练数据集和测试集相互独立：为了确保独立性，训练和测试数据集被选择和维护为相互适当独立的。考虑和解决所有潜在的相关性来源，包括患者、数据采集和场所因素，以确保独立性。

Training Data Sets Are Independent of Test Sets: Training and test datasets are selected and maintained to be appropriately independent of one another. All potential sources of dependence, including patient, data acquisition, and site factors, are considered and addressed to assure independence.

选择的参考数据集基于最佳可用方法：采用被接受的、最佳可用的方法来开发参考数据集（即参考标准），以确保收集到临床相关且有良好特征描述的数据，并理解参考的局限性。如果有，应在模型开发和测试中使用被接受的参考数据集，以促进并证明模型在预期患者人群中的鲁棒性和推广性。

Selected Reference Datasets Are Based Upon Best Available Methods: Accepted, best available methods for developing a reference dataset (that is, a reference standard) ensure that clinically relevant and well characterized data are collected and the limitations of the reference are understood. If available, accepted reference datasets in model development and testing that promote and demonstrate model robustness and generalizability across the intended patient population are used.

模型设计应根据可用数据并反映设备的预期使用情况：模型设计应适合可用数据，并支持主动减轻已知风险，例如过度拟合、性能下降和安全风险。产品相关的临床效益和风险应被充分理解，用于制定临床意义的性能目标进行测试，并支持产品可以安全有效地实现其预期使用。考虑因素包括全局和局部性能的影响，以及设备输入、输出、预期患者人群和临床使用条件的不确定性/变异性。

Model Design Is Tailored to the Available Data and Reflects the Intended Use of the Device: Model design is suited to the available data and supports the active mitigation of known risks, like overfitting, performance degradation, and security risks. The clinical benefits and risks related to the product are well understood, used to derive clinically meaningful performance goals for testing, and support that the product can safely and effectively achieve its intended use. Considerations include the impact of both global and local performance and uncertainty/variability in the device inputs, outputs, intended patient populations, and clinical use conditions.

重点放在人-人工智能交互团队的表现上：当模型需要“人工”人工智能时，应考虑人因素因素以及模型输出的人类可解释性，重点是人工智能团队的表现，而不仅仅是模型本身的表现。

Focus Is Placed on the Performance of the Human-AI Team: Where the model has a “human in the loop,” human factors considerations and the human interpretability of the model outputs are addressed with emphasis on the performance of the Human-AI team, rather than just the performance of the model in isolation.

测试展示了在临床相关条件下设备的性能：制定并执行具有统计学意义的测试计划，以独立于训练数据集生成临床相关设备性能信息。考虑因素包括：目标患者人群、重要的亚组、临床环境和由人工智能团队使用、测量输入和潜在混淆因素。

Testing Demonstrates Device Performance During Clinically Relevant Conditions: Statistically sound test plans are developed and executed to generate clinically relevant device performance information independently of the training data set. Considerations include the intended patient population, important subgroups, clinical environment and use by the Human-AI team, measurement inputs, and potential confounding factors.

用户能够方便地获得清晰的、上下文相关的信息：这些信息适合目标受众（如医疗保健提供者或患者），包括：产品的预期用途和使用指示、模型适用于适当亚组的性能、用于训练和测试模型的数据特征、可接受的输入、已知限制、用户界面解释以及模型在临床工作流程中的整合。用户还要知道来自真实世界性能监测的设备修改和更新、可用时的决策基础以及向开发者沟通产品关注点的方式。

Users Are Provided Clear, Essential Information: Users are provided ready access to clear, contextually relevant information that is appropriate for the intended audience (such as health care providers or patients) including: the product’s intended use and indications for use, performance of the model for appropriate subgroups, characteristics of the data used to train and test the model, acceptable inputs, known limitations, user interface interpretation, and clinical workflow integration of the model. Users are also made aware of device modifications and updates from real-world performance monitoring, the basis for decision-making when available, and a means to communicate product concerns to the developer.

10.

部署的模型将能够在“实际使用中”进行监控，重点关注维持或改进安全性和性能：此外，当模型在部署后定期或持续训练时，必须有适当的控制措施来管理过度拟合、意外偏差或模型退化（例如，数据集漂移），这些可能会影响到人工智能团队使用时的安全性和性能。

Deployed Models Are Monitored for Performance and Re-training Risks Are Managed: Deployed models have the capability to be monitored in “real world” use with a focus on maintained or improved safety and performance. Additionally, when models are periodically or continually trained after deployment, there are appropriate controls in place to manage risks of overfitting, unintended bias, or degradation of the model (for example, dataset drift) that may impact the safety and performance of the model as it is used by the Human-AI team