An Insight Into the Tech: Revolutionizing Cyberint’s On-Call Operations
Jun 20, 2024
The Power of an AI-Driven Bot
Introduction
When it comes to SaaS operations, the ability to respond swiftly to technical glitches and potential failures can mean the difference between a minor hiccup and a full-blown crisis. At Cyberint, we’re always on the lookout for out-of-the-box solutions to enhance our operational efficiency and ensure the highest level of service reliability. In that spirit, we’ve adopted OpenAI’s Assistant to change the way our on-call engineers interact with and respond to critical alerts from our production environment.
The Challenge
Our production environment is a complex distributed ecosystem where numerous monitoring systems, among them Grafana and Prometheus, continuously generate alerts. These alerts range from Kubernetes cluster issues, such as a crashed pod, to more systemic problems affecting our SaaS system or applicative flows. Additionally, we monitor infrastructure, including Kafka lags, SQS bottlenecks, and many more.
The volume and variety of these alerts present a significant challenge. Keeping track of which alerts are still open and which have been resolved is difficult for engineers who must quickly diagnose and respond to incidents.
The Solution: An AI-Powered On-Call Assistant
To address these challenges, we’ve developed an AI-powered bot assistant, built on the OpenAI Assistant and designed specifically for our on-call engineers. It changes how we monitor and respond to critical production alerts by acting as a bridge between our engineers and the alerts in our Slack channel, which gathers notifications from various monitoring tools.
Key Features and Functionalities
- Centralized Alert Monitoring: The bot connects to our critical alerts Slack channel, serving as a central point for alerts from Grafana, Prometheus, and other tools.
- Intelligent Query Handling: It can understand and respond to queries about each alert, simulating human-like analysis to suggest the appropriate action to take.
- Dynamic Function Triggering: Depending on the alert, the bot decides which function to trigger and with what parameters, streamlining the response process.
- Advanced Troubleshooting Capabilities: For instance, it can fetch the relevant Grafana dashboard to analyze trends in Kafka lag, obtain Kubernetes descriptions and pod logs to diagnose crashes, and identify out-of-memory (OOM) issues by examining pod termination states.
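As an illustration of the last point, here is a minimal sketch of how an OOM check against pod termination states might look. It assumes the pod description is available as a dictionary in the shape returned by `kubectl get pod -o json`; the function name `find_oom_killed_containers` and the sample pod are hypothetical and are not taken from our bot’s code.

```python
from typing import Any


def find_oom_killed_containers(pod: dict[str, Any]) -> list[str]:
    """Return the names of containers whose current or last state is an OOM kill."""
    oom_containers = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        # A container that was OOM-killed and then restarted reports the kill
        # under lastState; one that has not restarted yet reports it under state.
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                oom_containers.append(status["name"])
                break
    return oom_containers


# Hypothetical pod description, trimmed to the fields the check reads.
example_pod = {
    "metadata": {"name": "worker-0"},
    "status": {
        "containerStatuses": [
            {
                "name": "worker",
                "restartCount": 3,
                "state": {"running": {}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {"name": "sidecar", "restartCount": 0, "state": {"running": {}}, "lastState": {}},
        ]
    },
}

print(find_oom_killed_containers(example_pod))
```

A check like this only reads the pod status, so it can run alongside the log and description lookups without touching the workload itself.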
Impact and Benefits
Our adoption of this AI-driven solution has markedly improved our operational workflow by automating the initial steps of the diagnostic process, steps that previously required time-consuming manual effort. The on-call bot gives our engineers a robust tool that:
- Accelerates Response Times: Swiftly addresses critical incidents, minimizing potential disruptions.
- Grants Immediate Data Access: Ensures that relevant diagnostics and data are at our engineers’ fingertips when they need them the most.
- Supports Decisive Action: Empowers our team to make well-informed decisions quickly, even in high-pressure situations.
- Enhances Issue Resolution: Speeds up the troubleshooting process, leading to improved system reliability and increased uptime.
Streamlining Alert Management
To further enhance our on-call operations, we’ve given our AI-powered bot the ability to query our critical alerts channel on Slack. This feature compiles and presents a summary of the most recent alerts that are still unresolved.
By doing so, it enables our engineers to immediately identify and prioritize outstanding issues without the need to manually comb through numerous notifications. This efficient summarization not only elevates operational productivity but also ensures our team can direct their attention towards swiftly mitigating the most critical situations.
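The summarization idea can be sketched as follows. This is a simplified illustration, not our production logic: it assumes each Slack message has already been parsed into a record with a timestamp, an alert name, and a `firing` or `resolved` status. The `AlertEvent` type, the `unresolved_alerts` function, and the alert names are all hypothetical.

```python
from typing import TypedDict


class AlertEvent(TypedDict):
    ts: float    # message timestamp in seconds
    alert: str   # alert name, e.g. "KafkaLagHigh"
    status: str  # "firing" or "resolved"


def unresolved_alerts(events: list[AlertEvent]) -> list[AlertEvent]:
    """Return alerts whose latest event is still firing, most recent first."""
    latest: dict[str, AlertEvent] = {}
    # Walk events oldest to newest so the last write per alert is its latest state.
    for event in sorted(events, key=lambda e: e["ts"]):
        latest[event["alert"]] = event
    open_events = [e for e in latest.values() if e["status"] == "firing"]
    return sorted(open_events, key=lambda e: e["ts"], reverse=True)


# Hypothetical channel history, already parsed from Slack messages.
events: list[AlertEvent] = [
    {"ts": 100.0, "alert": "KafkaLagHigh", "status": "firing"},
    {"ts": 160.0, "alert": "PodCrashLooping", "status": "firing"},
    {"ts": 220.0, "alert": "KafkaLagHigh", "status": "resolved"},
    {"ts": 300.0, "alert": "SQSQueueBacklog", "status": "firing"},
]

for event in unresolved_alerts(events):
    print(event["alert"], event["ts"])
```

Keying on the alert name means a later `resolved` message cancels an earlier `firing` one, which is what lets the summary skip issues that have already cleared.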
Ensuring Data Security with Local Execution
One of the significant challenges we encountered in deploying our AI-powered on-call bot was the concern around sending data to OpenAI’s cloud. This issue was crucial for us at Cyberint, given the sensitive nature of our work and the importance of data security. OpenAI Assistant provided an elegant solution to this problem: it lets us execute functions locally within our own environment, which significantly reduces the risk of exposing sensitive information.
The assistant merely guides us by suggesting the function names and the parameters required to run them. This approach allows us to carefully filter out and retain control over what information is shared, ensuring that only non-sensitive, generic data is passed on. This capability not only enhances our operational security but also empowers our on-call engineers to leverage AI assistance without compromising on data privacy and security.
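A rough sketch of this pattern is shown below. It assumes the assistant’s suggestion arrives as a function name plus a JSON-encoded argument string, and it deliberately leaves out the OpenAI client calls. The `get_pod_logs` stand-in, the allowlist, and the redaction patterns are hypothetical examples rather than the rules we actually apply.

```python
import json
import re
from typing import Callable


def get_pod_logs(namespace: str, pod: str) -> str:
    """Hypothetical stand-in for a real log fetch; returns canned text."""
    return f"[{namespace}/{pod}] connection to 10.0.12.7 failed, token=abc123secret"


# Only functions listed here can be triggered by the assistant's suggestions.
LOCAL_FUNCTIONS: dict[str, Callable[..., str]] = {"get_pod_logs": get_pod_logs}

# Example redaction rules (assumptions): IPv4 addresses and key=value secrets.
REDACTIONS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),
    (re.compile(r"(?i)\b(token|password|secret)=\S+"), r"\1=<redacted>"),
]


def redact(text: str) -> str:
    """Apply every redaction rule before any text leaves our environment."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


def run_tool_call(name: str, arguments_json: str) -> str:
    """Run a suggested function locally and return a filtered result."""
    handler = LOCAL_FUNCTIONS.get(name)
    if handler is None:
        return f"error: unknown function {name!r}"
    arguments = json.loads(arguments_json)
    return redact(handler(**arguments))


print(run_tool_call("get_pod_logs", '{"namespace": "prod", "pod": "worker-0"}'))
```

The string returned by `run_tool_call` is the only thing that would be sent back to the assistant, so the allowlist and the redaction step together define what information can leave the environment.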
The Benefits of OpenAI Assistant
By integrating OpenAI Assistant into our incident response workflow, Cyberint has taken a significant step forward in our ongoing mission to maintain a robust, secure, and highly available cybersecurity platform. This AI-powered assistant not only augments the capabilities of our on-call engineers but also represents our commitment to leveraging cutting-edge technology to solve complex challenges.
As we continue to refine and expand its functionalities, we’re excited about the future of AI in cybersecurity operations and the potential to further revolutionize our response capabilities.
To provide a glimpse into the technical foundation of our AI-powered On-Call Assistant, we’ve made a sample of the bot’s code available for those interested in exploring its inner workings. This sample offers insight into how we’ve structured the bot to interact with our systems and how it leverages AI to automate and enhance our on-call operations. You can view the sample code in the Cyberint GitHub repository. It is a conceptual prototype that showcases the potential application of such technology in real-world scenarios. We hope this repository serves as a valuable resource for anyone looking to understand more about the practical application of AI in operational monitoring and response.