Reliability Engineering Cuts Downtime
Learn how Intel IT uses reliability engineering to cut unscheduled downtime in manufacturing systems.
Driven by the rising importance of keeping manufacturing sites operating at full capacity 24/7, Intel Manufacturing IT (MIT) has set a goal of achieving “four nines” (99.99%) availability (or 0.01% downtime) by 2025.
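To make the target concrete, the short sketch below converts an availability percentage into a yearly downtime budget. It is illustrative arithmetic only: the function name `downtime_budget_minutes` and the 365-day year are assumptions, not part of Intel's tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years (assumption)


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime, in minutes, allowed by an availability target."""
    return (100 - availability_pct) / 100 * period_minutes


if __name__ == "__main__":
    # Compare "two nines", "three nines", and "four nines".
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows about {budget:.1f} minutes of downtime per year")
```

Under these assumptions, four nines leaves roughly 52.6 minutes of total downtime per year, which is why the goal calls for systems designed to tolerate failures instead of relying on fast repair alone.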
To help achieve this ambitious goal, we added a Reliability Engineer role to enhance the resilience of Intel’s manufacturing facilities. Reliability engineering (RE) is an emerging practice, first developed by cloud-based digital service providers. It focuses on designing systems to be failure-tolerant, so that service is maintained even when individual components fail.
At its core, RE entails identifying design patterns that promote continuity of service, both within individual applications and in their interactions. The RE team collaborates with the development team so that feedback on opportunities to enhance resilience is incorporated into system design. By closing the loop with developers, Reliability Engineers help align resilience goals with the development team's feature delivery objectives, enabling us to create robust and reliable solutions that meet the needs of our stakeholders.
Our Reliability Engineers proactively tackle potential vulnerabilities and develop strategies to mitigate the impact of failures on manufacturing operations. They play a critical role in identifying common failure modes, developing standards and designing solutions to lower the risk of failure.
- Reliability Engineers’ use of the Failure Mode and Effects Analysis (FMEA) methodology enabled us to develop a Resiliency Maturity Model (RMM), which is applicable across all our systems.
- This approach has helped us to identify over 200 resilience improvement projects and add them to our development roadmap for the next two years.
- Through these RE initiatives and implementation of numerous operational improvement activities, unscheduled factory downtime has decreased by 50% from 2019 levels.
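As an illustration of how FMEA can turn a list of failure modes into a prioritized set of improvement projects, the sketch below uses the Risk Priority Number (RPN), a common FMEA convention that multiplies severity, occurrence, and detection scores. This is an assumption for illustration: the source does not say how Intel scores failure modes or how the RMM is computed, and the failure modes and scores shown are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of a hypothetical FMEA worksheet, scored on 1-10 scales."""
    name: str
    severity: int    # 1 = negligible impact, 10 = severe impact
    occurrence: int  # 1 = rare, 10 = frequent
    detection: int   # 1 = almost always detected early, 10 = hard to detect

    @property
    def rpn(self) -> int:
        # Risk Priority Number: a common (not universal) FMEA ranking metric.
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical failure modes and scores, for illustration only.
example_modes = [
    FailureMode("database primary unavailable", severity=9, occurrence=3, detection=2),
    FailureMode("message queue backlog", severity=6, occurrence=5, detection=4),
    FailureMode("expired TLS certificate", severity=8, occurrence=2, detection=7),
]

if __name__ == "__main__":
    for mode in rank_by_risk(example_modes):
        print(f"{mode.rpn:4d}  {mode.name}")
```

Ranking by a single score like this is only a starting point; teams typically also review high-severity items individually, since a low RPN can hide a rare but very costly failure.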
Our results show how an RE approach can extend the benefits of resilience to the manufacturing environment, preparing us for future adoption of cloud-based microservice environments. We have demonstrated that best-in-class reliability and availability of IT systems can be achieved by adopting a standard set of RE tools and proactively applying them to improve resilience.