No Schema? No Problem. Let AI Handle Your Security Data Onboarding

John W8MEJ Menerick · April 5, 2025

Security data is messy. Engineers are busy. And yet, every new application or microservice adds more logs that need to be parsed, structured, and made useful. This used to be a blocker. Not anymore.

For years, one of the hidden pain points in detection engineering has been log ingestion and normalization. Most SOC teams rely on detection rules that assume data shows up in a clean, consistent format.

But in reality:

  • Logs change structure between builds.
  • Engineers forget to document fields.
  • New cloud services emit non-standard telemetry.

As a result, teams waste time writing brittle parsing logic—or worse, delay detection altogether while they “wait for logging to stabilize.”

It’s time to end that cycle.


🧠 The Mental Model: “Learn the Schema. Don’t Define It.”

Instead of waiting for someone to define a schema up front, we let the system infer it on the fly.

That’s what this architecture does using a Google Colab-based ETL pipeline:

  • Accepts raw logs from any engineer or service
  • Parses and tokenizes content without assuming a format
  • Learns field structure dynamically (e.g., timestamp, IP, user agent, action)
  • Turns log lines into usable feature vectors for downstream ML models

Even if the engineer doesn’t know the schema… the system figures it out.


🔁 Real-World Use Case

An engineering team pushed logs from a new serverless app using a bespoke JSON structure. No schema. No documentation. Just logs.

The AI-driven ETL pipeline:

  • Identified repeating field patterns and key-value pairs
  • Clustered log types based on structure and frequency
  • Converted logs into vectors compatible with the EBM model
  • Fed the data directly into the anomaly detection pipeline within minutes

Result? New telemetry source onboarded in under 30 minutes. With no human-written parsing logic.


🛠️ How It Works: Step-by-Step

✅ Step 1: Log Submission

An engineer points the onboarding API or UI to a new log source.

✅ Step 2: Schema Inference

The system scans the logs, determines field structure, and stores a dynamic schema.

✅ Step 3: Feature Vector Generation

Logs are transformed into numeric vectors (dimensional embeddings) for ML consumption.

✅ Step 4: EBM-Based Scoring

Energy-Based Models (EBMs) evaluate incoming log events for behavioral anomalies.

✅ Step 5: Feedback Loop Activation

High-energy (anomalous) events are fed into the simulation engine for validation and potential model tuning.

✅ Step 6: Playbook Creation & Optimization

If no matching playbook exists:

  • The system drafts a new SOAR playbook based on observed event type, severity, and mapped compliance frameworks.
  • If a playbook exists, it is automatically optimized using simulation results, analyst feedback, and outcome tracking.

This ensures every log source is tied to a response—not just detection.


🔍 Example: Self-Created Response Logic for a New Risk

A previously unknown set of logs began emitting failed authentication attempts followed by file modification commands.

The system:

  • Flagged the behavior as anomalous
  • Mapped it to known MITRE ATT&CK patterns
  • Noted that no playbook covered this risk
  • Generated a new playbook to isolate the source and notify the SOC

After three simulations, it auto-optimized escalation logic and suppression thresholds—ready for real-time use.


🚀 Why This Changes the Game

Old Model New Model
Schema defined manually Schema inferred dynamically
Parsing code written by engineers Features extracted automatically
Response logic mapped by humans Playbooks generated + tuned automatically
Compliance gaps between logs & action Logs auto-mapped to control frameworks

This isn’t just “log onboarding.” It’s automated threat coverage expansion.


🎯 Your Move

If onboarding new telemetry still feels like waiting for documentation, you’re behind.

Ask yourself: How many of your logs are ignored because they’re unlabeled or unmapped?

👉 Let AI take the first pass. If the system sees risk, it should be able to act on it. That’s not just detection—that’s defense. Read the full white paper or dive into the latest podcast episode to learn more.

Share on: