Prometheus Chaos Edition

| Without PCE | With PCE |
| --- | --- |
| You assume Prometheus is always healthy. | You prove it can survive partial failures. |
| Alertmanager might be misconfigured for months. | You test silences, inhibitions, and receivers. |
| A slow scrape delays critical alerts. | You detect latency thresholds before they matter. |
| Grafana dashboards freeze, but no one notices. | You build fallback visualizations. |

How to Run Prometheus Chaos Edition (Step-by-Step)

Before we dive into code, let’s address the obvious question: Why would I voluntarily break my monitoring?

What happens when your Prometheus server runs out of memory? What if a metric scrape takes 30 seconds because a target is thrashing? What if your alerting rules become corrupt?

    # malicious_exporter.py
    from flask import Flask, Response
    import random

    # Flask needs the import name of the module that creates the app.
    app = Flask(__name__)

Create a small proxy that intercepts /metrics endpoints:
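One possible shape for such a proxy is sketched below. It uses only the Python standard library rather than Flask, so it runs without extra dependencies. The upstream address, the listening port 9999, the delay bound, and the failure rate are all illustrative assumptions, not values defined by PCE.

```python
import random
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical settings chosen for illustration only.
UPSTREAM = "http://localhost:9100"  # assumed address of the real exporter
MAX_DELAY_SECONDS = 5.0             # assumed upper bound on injected latency
FAILURE_RATE = 0.1                  # assumed fraction of scrapes answered with HTTP 500


def choose_fault(rng=random):
    """Pick the fault for one scrape: ("error", 0.0) or ("delay", seconds)."""
    if rng.random() < FAILURE_RATE:
        return ("error", 0.0)
    return ("delay", rng.uniform(0.0, MAX_DELAY_SECONDS))


class ChaosProxy(BaseHTTPRequestHandler):
    """Forwards GET /metrics to UPSTREAM after injecting a delay or an error."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        kind, delay = choose_fault()
        if kind == "error":
            self.send_error(500, "injected failure")
            return
        time.sleep(delay)  # simulate a slow or thrashing target
        try:
            with urllib.request.urlopen(UPSTREAM + "/metrics", timeout=10) as upstream:
                body = upstream.read()
                content_type = upstream.headers.get("Content-Type", "text/plain")
        except OSError:
            # URLError and socket timeouts are both subclasses of OSError.
            self.send_error(502, "upstream unreachable")
            return
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Run the proxy until interrupted; point a scrape job at this port."""
    ThreadingHTTPServer(("0.0.0.0", port), ChaosProxy).serve_forever()
```

Calling `serve()` starts the proxy. Pointing a Prometheus scrape job at port 9999 instead of the real exporter then exposes that job to random latency and occasional 500 responses. Because the fault decision lives in `choose_fault`, the injection policy can be tuned or tested without starting a server.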