
My Python App Now Has an AI Watchdog. Here Is How I Built It.
After setting up Sentry, Prometheus, and Grafana on my production FastAPI application, I had visibility I never had before. Alerts were firing when response times degraded. Error rates were being tracked in real time. Memory usage was graphed across days and weeks. Compared to where I started, it felt like a significant upgrade.
But there was still a problem I had not solved. Every alert that fired required me to interpret it, investigate the cause, cross-reference it with recent deployments, and decide what to do. The monitoring stack told me something was wrong. It did not tell me why or what to do about it. At two in the morning, that distinction matters a great deal.
So I took the next step. I built an AI agent that sits on top of my monitoring stack, reads the alerts, investigates the application state, and gives me a plain-English explanation of what is happening and what I should do about it. I called it a watchdog because that is exactly what it does. It watches constantly and only wakes me up when it has something useful to say.
Here is exactly how I built it.
What the Watchdog Actually Does
Before getting into the code, it is worth being specific about what this agent does and what it does not do, because the term AI agent gets applied to a wide range of things in 2026 and the distinction matters.
This watchdog does four things. It polls my Prometheus metrics at regular intervals to check application health. When a metric crosses a threshold, it queries recent application logs to gather context. It sends that context to a language model with a structured prompt asking for an analysis and a recommended action. And it delivers the analysis to me via a Slack message with enough information to act immediately without needing to open five different dashboards.
It does not automatically fix problems or make changes to the application without my approval. That is an important boundary. The agent is an analyst, not an operator. It reduces the time between a problem occurring and me understanding what to do about it from thirty to sixty minutes down to two to three minutes.
The Stack
The agent is built with four components:
LangChain as the agent framework that orchestrates the workflow and manages the language model interaction.
Prometheus Python client for querying metrics from the running Prometheus instance.
OpenAI GPT-4o as the reasoning backbone that interprets the metrics and logs and generates the analysis.
Slack SDK for delivering the alert with context to wherever I am.
Building the Watchdog Step by Step
Step 1: Connect to Prometheus and Define Health Checks
The first thing the agent needs is access to your metrics. The prometheus-api-client library makes this straightforward:
import os
from datetime import datetime, timedelta

from prometheus_api_client import PrometheusConnect

PROMETHEUS_URL = os.environ.get("PROMETHEUS_URL", "http://localhost:9090")
prom = PrometheusConnect(url=PROMETHEUS_URL, disable_ssl=False)


def get_current_metrics() -> dict:
    metrics = {}

    # P95 response time across all endpoints
    response_time = prom.custom_query(
        'histogram_quantile(0.95, rate(app_response_time_seconds_bucket[5m]))'
    )
    metrics["p95_response_time"] = (
        float(response_time[0]["value"][1]) if response_time else 0.0
    )

    # Error rate as a percentage of total requests
    error_rate = prom.custom_query(
        'rate(app_requests_total{status_code=~"5.."}[5m]) '
        '/ rate(app_requests_total[5m]) * 100'
    )
    metrics["error_rate_percent"] = (
        float(error_rate[0]["value"][1]) if error_rate else 0.0
    )

    # Current memory usage in MB
    memory = prom.custom_query('app_memory_usage_bytes / 1024 / 1024')
    metrics["memory_mb"] = float(memory[0]["value"][1]) if memory else 0.0

    # Active connections
    connections = prom.custom_query('app_active_connections')
    metrics["active_connections"] = (
        float(connections[0]["value"][1]) if connections else 0.0
    )

    return metrics


def check_thresholds(metrics: dict) -> list:
    alerts = []
    if metrics["p95_response_time"] > 0.5:
        alerts.append(
            f"High response time: P95 is {metrics['p95_response_time']:.3f}s "
            f"(threshold: 0.5s)"
        )
    if metrics["error_rate_percent"] > 5.0:
        alerts.append(
            f"High error rate: {metrics['error_rate_percent']:.1f}% of requests "
            f"failing (threshold: 5%)"
        )
    if metrics["memory_mb"] > 512:
        alerts.append(
            f"High memory usage: {metrics['memory_mb']:.0f}MB in use "
            f"(threshold: 512MB)"
        )
    return alerts
This gives the agent a structured snapshot of application health every time it runs, along with a plain-English description of anything that has crossed a threshold.
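The repeated `float(result[0]["value"][1]) if result else 0.0` pattern exists because a Prometheus instant query returns a list of samples, each carrying a `value` pair of `[timestamp, string_value]`, and the list is empty when no series matched. If you prefer, that defensive extraction can live in one small helper (my own addition, not part of prometheus-api-client):

```python
def first_value(result: list, default: float = 0.0) -> float:
    """Extract the float value from a Prometheus instant-query result.

    An instant query returns a list of samples shaped like
    {"metric": {...}, "value": [timestamp, "1.23"]}. The sample value
    is a string, and the list is empty when no series matched.
    """
    if not result:
        return default
    return float(result[0]["value"][1])


# Example with the raw shape Prometheus returns:
sample = [{"metric": {"job": "app"}, "value": [1716200000.0, "0.42"]}]
first_value(sample)  # 0.42
first_value([])      # 0.0 (no series matched the query)
```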
Step 2: Gather Log Context When an Alert Fires
Raw metric numbers tell you that something is wrong. Logs tell you what specifically is happening. When the threshold check finds an alert, the agent pulls recent error logs to give the language model context for its analysis:
import re
from collections import Counter
from pathlib import Path


def get_recent_errors(log_file: str, minutes: int = 10) -> dict:
    log_path = Path(log_file)
    if not log_path.exists():
        return {"error_count": 0, "top_errors": [], "sample_lines": []}

    # Comparing each line against this cutoff requires parsing your log's
    # timestamp format; with frequent log rotation, scanning the whole file
    # is a close approximation of "the last `minutes` of activity".
    cutoff_time = datetime.now() - timedelta(minutes=minutes)

    error_lines = []
    with open(log_path, "r") as f:
        for line in f:
            if "ERROR" in line or "CRITICAL" in line:
                error_lines.append(line.strip())

    # Tally exception class names so the model sees which errors dominate
    error_types = []
    for line in error_lines:
        match = re.search(r'(\w+Error|\w+Exception)', line)
        if match:
            error_types.append(match.group(1))
    top_errors = Counter(error_types).most_common(5)

    return {
        "error_count": len(error_lines),
        "top_errors": top_errors,
        "sample_lines": error_lines[-5:] if error_lines else [],
    }
This extracts the count, most common error types, and a sample of recent error lines. It is enough context for the language model to identify patterns without overwhelming it with thousands of lines of raw log data.
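The regex-and-Counter approach is easy to sanity-check on a few synthetic log lines (the lines below are made up for illustration):

```python
import re
from collections import Counter

# Hypothetical log lines in a typical "timestamp LEVEL message" layout
lines = [
    "2026-01-10 02:14:03 ERROR ConnectionError: pool exhausted",
    "2026-01-10 02:14:05 ERROR ConnectionError: pool exhausted",
    "2026-01-10 02:14:09 CRITICAL TimeoutException in /checkout",
]

# Same extraction the watchdog uses: grab the exception class name
types = []
for line in lines:
    match = re.search(r"(\w+Error|\w+Exception)", line)
    if match:
        types.append(match.group(1))

Counter(types).most_common(5)
# [('ConnectionError', 2), ('TimeoutException', 1)]
```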
Step 3: Build the AI Analysis Layer
This is where the agent becomes genuinely useful. Instead of just forwarding alert messages, it uses a language model to interpret the metrics and logs together and generate an actionable analysis:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    api_key=os.environ.get("OPENAI_API_KEY"),
)

SYSTEM_PROMPT = """You are a Python application monitoring expert.
You analyze metrics and logs from production FastAPI applications and provide
clear, actionable diagnoses. Always structure your response with:
1. What is happening (1-2 sentences)
2. Most likely cause (1-2 sentences)
3. Recommended immediate action (numbered steps)
4. Severity level: LOW, MEDIUM, HIGH, or CRITICAL
Be specific and technical. Avoid vague advice."""


def analyze_with_ai(metrics: dict, alerts: list, log_data: dict) -> str | None:
    if not alerts:
        return None

    alert_lines = "\n".join(f"- {alert}" for alert in alerts)
    sample_lines = "\n".join(log_data["sample_lines"][:3])

    context = f"""
Current Application Metrics:
- P95 Response Time: {metrics['p95_response_time']:.3f}s
- Error Rate: {metrics['error_rate_percent']:.1f}%
- Memory Usage: {metrics['memory_mb']:.0f}MB
- Active Connections: {metrics['active_connections']:.0f}

Active Alerts:
{alert_lines}

Recent Error Log Summary:
- Total errors in last 10 minutes: {log_data['error_count']}
- Most common error types: {log_data['top_errors']}
- Sample error lines:
{sample_lines}

Please analyze this situation and provide your diagnosis.
"""
    messages = [
        SystemMessage(content=SYSTEM_PROMPT),
        HumanMessage(content=context),
    ]
    response = llm.invoke(messages)
    return response.content
Setting temperature to zero is important here. You want deterministic, consistent analysis, not creative interpretation. The structured system prompt ensures the response always follows a format you can parse and act on.
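Because the severity level comes back embedded in prose, it has to be parsed out. Step 4 does this with a plain substring scan; a word-boundary variant (my own tweak, not required) is slightly safer against the level names appearing inside longer words:

```python
import re


def extract_severity(analysis: str, default: str = "HIGH") -> str:
    """Pull the severity level out of the model's structured response.

    Checks CRITICAL first so a response mentioning several levels
    resolves to the most severe one; matches whole words only, so
    e.g. "CRITICALITY" in prose does not trigger a false CRITICAL.
    """
    for level in ("CRITICAL", "HIGH", "MEDIUM", "LOW"):
        if re.search(rf"\b{level}\b", analysis):
            return level
    return default


extract_severity("4. Severity level: MEDIUM")  # 'MEDIUM'
extract_severity("No level present at all")    # 'HIGH' (fail toward attention)
```

Defaulting to HIGH when no level is found is deliberate: a malformed response should err on the side of waking someone up, not staying silent.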
Step 4: Deliver the Alert via Slack
The final piece is delivering the analysis somewhere you will actually see it:
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

slack_client = WebClient(token=os.environ.get("SLACK_BOT_TOKEN"))
SLACK_CHANNEL = os.environ.get("SLACK_ALERT_CHANNEL", "#alerts")

SEVERITY_COLORS = {
    "CRITICAL": "#FF0000",
    "HIGH": "#FF6600",
    "MEDIUM": "#FFAA00",
    "LOW": "#36A64F",
}


def send_slack_alert(metrics: dict, alerts: list, analysis: str):
    # Pull the severity out of the structured analysis; default to HIGH
    # if the model omitted it.
    severity = "HIGH"
    for level in ["CRITICAL", "HIGH", "MEDIUM", "LOW"]:
        if level in analysis:
            severity = level
            break

    blocks = [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"Production Alert [{severity}]",
            },
        },
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": "*Active Alerts:*\n" + "\n".join(f"- {a}" for a in alerts),
            },
        },
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*AI Analysis:*\n{analysis}",
            },
        },
        {
            "type": "context",
            "elements": [
                {
                    "type": "mrkdwn",
                    "text": f"P95: {metrics['p95_response_time']:.3f}s | "
                            f"Errors: {metrics['error_rate_percent']:.1f}% | "
                            f"Memory: {metrics['memory_mb']:.0f}MB",
                }
            ],
        },
    ]

    try:
        slack_client.chat_postMessage(
            channel=SLACK_CHANNEL,
            # Wrapping the blocks in an attachment gives the message a
            # sidebar colored to match the severity level.
            attachments=[{"color": SEVERITY_COLORS[severity], "blocks": blocks}],
            text=f"Production Alert [{severity}]",
        )
    except SlackApiError as e:
        print(f"Slack notification failed: {e.response['error']}")
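One practical guard worth adding: Slack's Block Kit rejects a section whose text exceeds 3,000 characters, and a verbose model response can hit that limit. A small truncation helper (my own addition, not in the code above) keeps the post from failing with `invalid_blocks`:

```python
SLACK_SECTION_LIMIT = 3000  # Block Kit caps a section text object at 3,000 chars


def truncate_for_slack(text: str, limit: int = SLACK_SECTION_LIMIT) -> str:
    """Trim long analyses so chat_postMessage does not reject the blocks."""
    marker = "\n… (truncated)"
    if len(text) <= limit:
        return text
    # Cut early enough that the truncation marker still fits under the limit
    return text[: limit - len(marker)] + marker
```

Run the AI analysis through it before building the `*AI Analysis:*` section, and the worst case becomes a trimmed message instead of a dropped alert.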
Step 5: The Main Watchdog Loop
Tying everything together into a continuous monitoring loop:
import time

import schedule

LOG_FILE = os.environ.get("APP_LOG_FILE", "logs/errors.log")
CHECK_INTERVAL_SECONDS = 60


def run_watchdog():
    print(f"Running watchdog check at {datetime.now().strftime('%H:%M:%S')}")
    metrics = get_current_metrics()
    alerts = check_thresholds(metrics)

    if not alerts:
        print("All metrics within normal range.")
        return

    print(f"Found {len(alerts)} alert(s). Gathering context...")
    log_data = get_recent_errors(LOG_FILE)

    print("Running AI analysis...")
    analysis = analyze_with_ai(metrics, alerts, log_data)

    if analysis:
        send_slack_alert(metrics, alerts, analysis)
        print("Alert sent to Slack.")


def main():
    print("Python AI Watchdog started.")
    print(f"Checking every {CHECK_INTERVAL_SECONDS} seconds.")
    schedule.every(CHECK_INTERVAL_SECONDS).seconds.do(run_watchdog)
    run_watchdog()
    while True:
        schedule.run_pending()
        time.sleep(1)


if __name__ == "__main__":
    main()
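One refinement worth considering: with a 60-second interval, a sustained incident would re-post the same alert every cycle. A simple per-alert cooldown (a sketch of my own, not part of the loop above) suppresses repeats within a window:

```python
import time


class AlertCooldown:
    """Suppress repeat notifications for the same alert within a window."""

    def __init__(self, cooldown_seconds: int = 1800):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: dict[str, float] = {}

    def should_send(self, alert_key: str) -> bool:
        now = time.monotonic()
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # Same alert fired recently; stay quiet
        self._last_sent[alert_key] = now
        return True


cooldown = AlertCooldown(cooldown_seconds=1800)
cooldown.should_send("high_error_rate")  # True: first occurrence, notify
cooldown.should_send("high_error_rate")  # False: within the 30-minute window
```

Keying the cooldown on a stable string per threshold (rather than the full alert text, which embeds changing metric values) is what makes the deduplication work.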
What the Watchdog Caught in the First 30 Days
Running this agent across two production applications for thirty days produced results that justified the build time within the first week.
It caught a database connection pool exhaustion pattern on day four, about ninety minutes before it would have caused requests to start failing. The AI analysis correctly identified it as a connection leak in a background task and recommended the specific configuration parameter to adjust. The fix took eight minutes. The alternative would have been a production outage discovered by a user.
It flagged a gradual memory increase on day eleven that matched the pattern of a caching layer that was not expiring entries correctly. Without the watchdog, this would have run undetected until the next scheduled restart.
Most importantly, it stopped waking me up for things that did not need my attention. Before the watchdog, every Prometheus alert required me to investigate manually. With the watchdog, I only receive a Slack message when the AI analysis determines the situation warrants human action. Noise dropped by roughly 70%.
The Boundary That Makes This Safe
One thing worth saying directly: this agent analyzes and recommends. It does not act. Every recommendation it produces requires a human decision before anything changes in production.
That boundary is intentional and important. AI agents that can modify production systems autonomously introduce a category of risk that the monitoring benefits do not justify at this stage. The value of this watchdog is in reducing the time from problem to understanding, not in removing the human from the loop entirely.
Build the analyst first. Build the operator later, carefully, with explicit approval gates at every step.
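When that day comes, the gate can be as simple as a type that cannot execute without an explicit approval flag. This is a sketch of a possible future design, not code from the current watchdog:

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    """A remediation the agent may suggest but never execute on its own."""
    description: str
    command: str
    approved: bool = False  # Flipped only by a human, never by the agent


def execute_if_approved(action: ProposedAction) -> str:
    # The gate: no approval, no execution. The default is always "do nothing."
    if not action.approved:
        return f"PENDING APPROVAL: {action.description}"
    return f"EXECUTING: {action.command}"


# Hypothetical remediation the watchdog might propose
restart = ProposedAction(
    description="Restart worker pool to clear leaked connections",
    command="systemctl restart app-workers",
)
execute_if_approved(restart)  # 'PENDING APPROVAL: Restart worker pool to clear leaked connections'
```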
