Pinterest 利用 Ray 实现机器学习基础设施现代化,加速模型开发和部署
2024年8月23日 – 视觉发现平台 Pinterest 近日宣布,他们已成功使用开源分布式计算框架 Ray 实现机器学习基础设施的现代化,并分享了其在将Ray 集成到大规模生产环境中所遇到的挑战和解决方案。
Pinterest 致力于利用机器学习来解决其核心业务问题,但其原有的基础设施无法满足快速增长的需求。为了提升机器学习能力,Pinterest 选择了 Ray 作为其新的基础设施。然而,在将 Ray 集成到其通用联合 Kubernetes 集群 PinCompute 时,Pinterest 面临着几个挑战,例如 KubeRay 的安装限制以及与现有系统整合的复杂性。
为了克服这些挑战,Pinterest 开发了一套自定义解决方案,包括 API 网关、Ray 集群控制器、Ray 作业控制器和用于外部状态管理的 MySQL 数据库。这套解决方案为用户和 Kubernetes 之间提供了抽象层,简化了 Ray 集群的配置和管理。此外,Pinterest 还创建了一个专用的用户界面,用于持久化日志记录和指标,并将其与内部时间序列数据库 Goku 整合,以提高可观察性。
Pinterest 的努力取得了显著成效。通过采用 Ray,Pinterest 将机器学习模型从开发到生产的周期缩短至几天,而之前需要数周时间。这表明 Ray 的灵活性和易用性使其成为希望实现机器学习基础设施现代化的组织的理想选择。
Pinterest 的成功案例也为其他公司提供了宝贵的经验。例如,DoorDash 也使用 Ray 对其机器学习基础设施进行了现代化改造,并构建了自定义控制器来管理 Ray 集群和作业。DoorDash 还为其数据科学家和机器学习工程师创建了一个自助服务平台,并通过与 Nvidia 的合作解决了 GPU 可访问性问题。
Pinterest 和 DoorDash 的案例表明,Ray 能够帮助公司克服机器学习基础设施现代化过程中的挑战,并加速模型开发和部署。随着越来越多的公司采用 Ray,我们可以期待看到更多关于 Ray 在机器学习领域应用的创新案例。
英语如下:
Pinterest Modernizes Machine Learning Infrastructure with Ray
Keywords: Pinterest, Ray,Machine Learning
August 23, 2024 -Pinterest, the visual discovery platform, recently announced the successful modernization of its machine learning infrastructure using the open-source distributed computing framework Ray. The company shared the challengesand solutions it encountered while integrating Ray into its large-scale production environment.
Pinterest is committed to leveraging machine learning to solve its core business problems, but itsexisting infrastructure was unable to meet the rapidly growing demands. To enhance its machine learning capabilities, Pinterest chose Ray as its new infrastructure. However, integrating Ray into its general-purpose federated Kubernetes cluster, PinCompute, presented several challenges, suchas installation limitations of KubeRay and the complexity of integrating with existing systems.
To overcome these challenges, Pinterest developed a custom solution, including an API gateway, Ray cluster controller, Ray job controller, and a MySQL database for external statemanagement. This solution provided an abstraction layer between users and Kubernetes, simplifying the configuration and management of Ray clusters. Additionally, Pinterest created a dedicated user interface for persistent logging and metrics, integrating it with its internal time-series database Goku for enhanced observability.
Pinterest’s efforts have yielded significant results. By adopting Ray, Pinterest has reduced the time it takes to deploy machine learning models from development to production from weeks to days. This demonstrates that Ray’s flexibility and ease of use make it an ideal choice for organizations looking to modernize their machine learning infrastructure.
Pinterest’s success story also provides valuable lessons for other companies. For instance, DoorDash has also modernized its machine learning infrastructure using Ray, building custom controllers to manage Ray clusters and jobs. DoorDash has also created a self-service platform for its data scientists and machine learning engineers and addressed GPU accessibility issues through collaboration with Nvidia.
The cases of Pinterest and DoorDash demonstrate that Ray can helpcompanies overcome challenges in modernizing their machine learning infrastructure and accelerate model development and deployment. As more companies adopt Ray, we can expect to see more innovative use cases of Ray in the machine learning domain.
【来源】https://mp.weixin.qq.com/s/oJJeYl3X8pTXrLrQO3zBtg
Views: 1