System “Down”: Who Pays the Price for Users?
Keywords: system failure, service interruption, internet companies
“Oops, the system is down!”: How Does Online Reliability Engineering Become a Core Competitive Advantage for Businesses?
In recent years, several well-known internet companies have suffered software system failures that caused service interruptions and data loss. These incidents not only hurt the user experience but also cost the companies money, directly or indirectly. They have prompted the industry to rethink online reliability engineering, with service providers, users, and other stakeholders all looking for ways to improve existing technologies and processes.
InfoQ will soon host a roundtable discussion titled “Oops, the system is down!”, bringing together technical experts from internet giants such as Tencent, Ctrip, and bilibili to discuss the stability and reliability challenges that companies of different sizes face and the strategies they use to address them.
The discussion will revolve around the following topics:
* Do companies of different sizes focus on different aspects of stability and reliability?
* “Low-level mistakes” cause a good share of failures. Can that be tolerated?
* When a system fails, how can a company communicate effectively with users and stay transparent?
* When handling a system failure, how can technical teams collaborate effectively with one another?
* Looking ahead, what new opportunities and challenges will stability and reliability engineering face?
The event will feature the following guests:
- Moderator: Dang Shouhui, Assistant General Manager of Tencent IEG Technical Operations Department, Expert Engineer
- Guests:
  - Zhou Xinyi, Director of Cloud Native Development at Ctrip
  - Liu Hao, Head of Platform Engineering at bilibili Infrastructure Department
  - Yang Jun, SRE Director of Tencent IEG Technical Operations Department
Event Time: August 26, 20:00-21:30
How to Watch: Scan the QR code on the event poster or click the livestream reservation button to sign up for the livestream on InfoQ’s WeChat Channels (视频号) account.
Additionally, InfoQ will host QCon Shanghai on October 18-19, where several guests will provide deeper insights.
Zhou Xinyi will speak on “AI-Driven Observability Platform Architecture Upgrade Practices,” sharing Ctrip’s engineering practice in upgrading the architecture of its internal observability platform. The talk covers unified governance of metric and logging data, data and tooling support for putting AIOps into practice, and real-world cases in which the cloud platform team used AI tools to improve platform operations efficiency.
Liu Kaining will speak on “Building and Applying Ant Group’s End-to-End Fault Emergency Response System,” introducing Ant Group’s fault emergency response system. Using real fault cases, the talk briefly covers fault definition, organizational structure, platform capabilities, emergency procedures, and emergency evaluation, and shares how AIOps and large language models (LLMs) are being applied to fault localization during emergency response.
The roundtable will offer the industry practical experience and forward-looking perspectives, helping businesses better meet the challenges of online reliability engineering and strengthen their core competitiveness.
Source: https://mp.weixin.qq.com/s/uXfMHqu2XPUQfeHKU8GQWQ