Security data is messy. Engineers are busy. And yet, every new application or microservice adds more logs that need to be parsed, structured, and made useful. This used to be a blocker. Not anymore.
For years, one of the hidden pain points in detection engineering has been log ingestion and normalization. Most SOC teams rely on detection rules that assume data shows up in a clean, consistent format.
But in reality:
- Logs change structure between builds.
- Engineers forget to document fields.
- New cloud services emit non-standard telemetry.
As a result, teams waste time writing brittle parsing logic—or worse, delay detection altogether while they “wait for logging to stabilize.”
It’s time to end that cycle.
🧠 The Mental Model: “Learn the Schema. Don’t Define It.”
Instead of waiting for someone to define a schema up front, we let the system infer it on the fly.
That’s what this architecture does using a Google Colab-based ETL pipeline:
- Accepts raw logs from any engineer or service
- Parses and tokenizes content without assuming a format
- Learns field structure dynamically (e.g., timestamp, IP, user agent, action)
- Turns log lines into usable feature vectors for downstream ML models
Even if the engineer doesn’t know the schema… the system figures it out.
🔁 Real-World Use Case
An engineering team pushed logs from a new serverless app using a bespoke JSON structure. No schema. No documentation. Just logs.
The AI-driven ETL pipeline:
- Identified repeating field patterns and key-value pairs
- Clustered log types based on structure and frequency
- Converted logs into vectors compatible with the EBM model
- Fed the data directly into the anomaly detection pipeline within minutes
Result? New telemetry source onboarded in under 30 minutes. With no human-written parsing logic.
🛠️ How It Works: Step-by-Step
✅ Step 1: Log Submission
An engineer points the onboarding API or UI to a new log source.
✅ Step 2: Schema Inference
The system scans the logs, determines field structure, and stores a dynamic schema.
✅ Step 3: Feature Vector Generation
Logs are transformed into numeric vectors (dimensional embeddings) for ML consumption.
✅ Step 4: EBM-Based Scoring
Energy-Based Models (EBMs) evaluate incoming log events for behavioral anomalies.
✅ Step 5: Feedback Loop Activation
High-energy (anomalous) events are fed into the simulation engine for validation and potential model tuning.
✅ Step 6: Playbook Creation & Optimization
If no matching playbook exists:
- The system drafts a new SOAR playbook based on observed event type, severity, and mapped compliance frameworks.
- If a playbook exists, it is automatically optimized using simulation results, analyst feedback, and outcome tracking.
This ensures every log source is tied to a response—not just detection.
🔍 Example: Self-Created Response Logic for a New Risk
A previously unknown set of logs began emitting failed authentication attempts followed by file modification commands.
The system:
- Flagged the behavior as anomalous
- Mapped it to known MITRE ATT&CK patterns
- Noted that no playbook covered this risk
- Generated a new playbook to isolate the source and notify the SOC
After three simulations, it auto-optimized escalation logic and suppression thresholds—ready for real-time use.
🚀 Why This Changes the Game
Old Model | New Model |
---|---|
Schema defined manually | Schema inferred dynamically |
Parsing code written by engineers | Features extracted automatically |
Response logic mapped by humans | Playbooks generated + tuned automatically |
Compliance gaps between logs & action | Logs auto-mapped to control frameworks |
This isn’t just “log onboarding.” It’s automated threat coverage expansion.
🎯 Your Move
If onboarding new telemetry still feels like waiting for documentation, you’re behind.
Ask yourself: How many of your logs are ignored because they’re unlabeled or unmapped?
👉 Let AI take the first pass. If the system sees risk, it should be able to act on it. That’s not just detection—that’s defense. Read the full white paper or dive into the latest podcast episode to learn more.