1 of 13
AI Systems Engineering  ·  CMP-L044  ·  Part 1 Assessment Group 1  ·  University of Roehampton  ·  2026
Security Automation Research

Autonomous Multi-Agent
Penetration Testing
Platform

We built a platform that makes security testing more continuous and less manual. Rather than replacing human testers, it handles the repetitive parts — scanning, correlation, and validation — so analysts can focus on what genuinely needs their attention.

Claude API ReAct Loop Sandboxed Validation Kubernetes CVSS v3.1 UK GDPR  ·  ISO 27001
2 of 13
The Problem
 Context

Security teams can't keep up with deployment pace

Modern software ships changes several times a day, but security testing still tends to happen at fixed intervals — quarterly at best. New vulnerabilities can appear and sit unnoticed for weeks between assessments.

 Gap — LLM-Only Tools

Lots of noise, little signal

Tools like PentestGPT are good at reasoning through attack scenarios, but they never actually check whether the exploits work. The result is a flood of plausible-sounding findings that analysts have to verify manually — which defeats the purpose.

 Gap — Rigid Pipelines

Can't see the bigger picture

Traditional modular pipelines are precise within each stage, but they work in isolation. They miss connections between assets, can't follow a trail across the network, and can't adapt when something unexpected turns up mid-scan.

<4h
Target time-to-report for critical findings
85%+
Validated true-positive rate
100%
Auditability of agent decisions
0
Unverified exploits in final reports
 NIST SP 800-145

NIST's definition of cloud computing describes on-demand, rapidly elastic environments: exactly the conditions under which fixed-interval review falls behind. Our platform is designed with that in mind — automating the parts that don't need a human in the loop, and surfacing only what does.

3 of 13
Who It's For and What It Covers
Target Users

Security Analysts

Get pre-validated, prioritised findings rather than raw scanner output

DevOps / SRE

Hook security checks directly into CI/CD without slowing delivery

Risk Managers

Receive scored, contextualised risk reports rather than technical dumps

Compliance Officers

Every decision is logged in a tamper-proof audit trail

System Boundaries
 In Scope
  • Controlled, cloud-based test environments
  • Automated reconnaissance, analysis, and exploit verification
  • Known vulnerabilities from CVE and NVD feeds
  • Integration with SIEM and CI/CD tooling
 Out of Scope
  • Live production systems — testing environments only
  • Social engineering or physical security
  • Zero-day discovery — this requires human research
4 of 13
What Already Exists — and Where It Falls Short
Dimension Agentic LLM tools (PentestGPT, AutoAttacker) Modular pipelines Our hybrid approach
Adaptability High — flexible, context-aware Low — fixed stage ordering High — LLM guides prioritisation
Precision Low — lots of false positives High — narrow, well-defined scope High — sandbox-confirmed only
Auditability Limited — reasoning is opaque Strong — logs at each stage Strong — full decision trail
Cross-asset reasoning Yes No — stages are siloed Yes
Execution validation Not present Not present Yes — sandboxed testing

The critical gap is that neither approach actually tests whether its findings are real. Sculley et al. note that unvalidated ML outputs accumulate as technical debt over time. Our solution keeps the flexibility of agentic reasoning but wraps it inside a validation layer — so everything reported has been confirmed by actual execution, not just inferred.

5 of 13
What the System Needs to Do
Functional Requirements

FR1 Asset Discovery

Scan the environment using Nmap and Shodan, then build a structured map of how everything connects

FR2 Intelligent Orchestration

Use the LLM to decide where to look next based on what it's already found, not a fixed sequence

FR3 Sandboxed Validation

Run every proposed exploit in an isolated environment — nothing gets reported without proof it works

FR4 Scored Reporting

Score findings using CVSS v3.1 and tailor the output for each audience — analyst, executive, compliance

FR5 Decision Audit Trail

Log every reasoning step, tool call, and validation result so decisions can be explained and reviewed
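
To make FR5 concrete, here is a minimal sketch of what one audit-trail record could look like, written as append-only JSONL. The field names are illustrative, not a final schema.

```python
import json, time, uuid
from dataclasses import dataclass, asdict

@dataclass
class AuditRecord:
    """One immutable entry in the decision audit trail (fields illustrative)."""
    event_id: str      # unique ID for this reasoning step
    step_type: str     # "reason", "tool_call", or "validation"
    model_input: str   # full prompt sent to the coordinator
    model_output: str  # full response received back
    timestamp: float

def append_audit_record(record: AuditRecord, path: str = "audit.jsonl") -> None:
    # Append-only JSONL keeps the trail replayable and easy to review
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

append_audit_record(AuditRecord(
    event_id=str(uuid.uuid4()),
    step_type="tool_call",
    model_input="prompt text...",
    model_output="response text...",
    timestamp=time.time(),
))
```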

Non-Functional Requirements

NFR1 Speed

Critical findings should reach analysts within four hours of a scan completing

NFR2 Accuracy

At least 85 per cent of reported findings should be genuine — not noise

NFR3 Scalability

Each layer should scale independently on Kubernetes as demand grows

NFR4 Privacy

Personal data is pseudonymised at ingestion and handled in line with UK GDPR

NFR5 Human Oversight

Any high-severity exploit needs analyst sign-off before it runs
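
NFR5, combined with the CVSS ≥ 7.0 human gate shown in the architecture on the next page, implies a simple routing check before any exploit runs. A minimal sketch, with queue names of our own invention:

```python
HUMAN_GATE_THRESHOLD = 7.0  # CVSS v3.1 rates 7.0 and above as High

def route(hypothesis: dict) -> str:
    """Decide where a hypothesis goes next (NFR5: human oversight)."""
    if hypothesis["cvss"] >= HUMAN_GATE_THRESHOLD:
        return "analyst-approval"    # waits for explicit sign-off
    return "sandbox-validation"      # low severity runs automatically

assert route({"cvss": 9.8}) == "analyst-approval"
assert route({"cvss": 4.3}) == "sandbox-validation"
```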

6 of 13
How It's Built — Four Layers
Layer 1 — Data Ingestion
Nmap · Shodan · Nuclei → asset graph in PostgreSQL, versioned with DVC
Kafka  ·  ingestion.complete
Layer 2 — Multi-Agent Reasoning
Claude API ReAct coordinator · generates hypotheses · logs every decision
Human gate (CVSS ≥ 7.0)  ·  Kafka
Layer 3 — Sandbox Validation
Ephemeral Docker containers · isolated network · 120-second timeout
Kafka  ·  validation.complete
Layer 4 — Reporting & Monitoring
CVSS-scored reports · Prometheus/Grafana · ELK audit trail
Cross-Cutting Infrastructure

Apache Kafka

Async message bus connecting all four layers — lets each one scale independently and handles back-pressure when things get busy

Kubernetes + Helm

Every service runs in a container with its own scaling policy. Deployments go through a canary pipeline — new versions handle 10% of traffic before being promoted

PostgreSQL + DVC

The asset graph lives here, along with a full version history of every scan run — any past state can be replayed exactly

Security controls: mutual TLS and role-based access control at the API gateway. Zero-trust network posture throughout.
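
To make the layer handoff concrete, a sketch of how Layer 1 might publish its completion event using the kafka-python client. The topic name comes from the diagram above; the payload fields are illustrative.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Layer 1 announces a finished scan; Layer 2 consumes this topic
producer.send("ingestion.complete", {
    "scan_id": "scan-042",          # illustrative payload
    "asset_count": 137,
    "graph_version": "dvc:a1b2c3",
})
producer.flush()
```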

7 of 13
Claude API — Managed Service vs Building Our Own Model

We're using Anthropic's Claude API as the reasoning coordinator rather than training a security-specific model from scratch. It's a deliberate choice with real trade-offs on both sides, and we've set them out honestly below.

Dimension
Managed API (Claude) — what we chose
Fine-tuned / self-trained model
How quickly we can ship
Working in hours — just bring an API key and connect the SDK
Weeks of data gathering, training runs, and evaluation before anything is usable
Model quality over time
Automatically improves as Anthropic ships new versions — one config change to upgrade
Frozen at training time — needs periodic re-training to stay useful
Built-in safety
Constitutional AI and RLHF baked in — helps prevent the tool being misused to generate live attacks
Safety guardrails need to be built from scratch, which is non-trivial for a system that handles exploit generation
Data leaving our network
Scan context is sent to Anthropic — needs a signed DPA and careful data minimisation in prompts
Everything stays on-premises, which makes UK GDPR compliance simpler
Cost as we scale
Per-token billing adds up fast at high hypothesis volume — large scans can get expensive
Fixed infrastructure cost once it's trained — cheaper per-call at scale
What happens if the API goes down
The reasoning layer stalls — we depend on Anthropic's uptime and rate limits
Self-hosted, so availability is entirely within our control
Domain specialisation
General-purpose — prompt engineering helps, but it's not the same as deep security fine-tuning
Can be trained specifically on CVE databases and exploit datasets for sharper results
8 of 13
Why Claude Makes Sense Here — and How We Handle the Risks
Why We Went With the Managed API
 No Training Pipeline Needed

Focus on what's novel

Building a training pipeline would eat the entire project timeline. Claude lets us focus on the parts that are genuinely new — the orchestration architecture, the validation loop, and the audit trail — rather than solving a problem that's already been solved.

 Safety Is Already There

A useful first line of defence

A tool that generates working exploits is a dual-use risk. Claude's built-in refusal policies aren't perfect, but they reduce the governance burden significantly compared to starting from zero.

 We Improve Automatically

Upgrading costs nothing

A fine-tuned model is locked to its training data. Switching from Sonnet 4.6 to Opus 4.7 is a single line change — no retraining, no validation cycle, and users immediately benefit from the better reasoning.

How We Handle the Risks

Data leaving the network

Asset metadata is pseudonymised before anything is sent to the API. Hostnames and IP addresses never appear in prompts — only sanitised descriptors do. We also have a signed DPA with Anthropic in place.
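
A minimal sketch of that sanitisation step, assuming a keyed HMAC so tokens are stable across scans but can't be reversed without the secret. The descriptor format is our own.

```python
import hashlib, hmac

SECRET_KEY = b"rotate-me"  # held in a vault; never sent to the API

def pseudonymise(identifier: str) -> str:
    """Replace a hostname or IP with a stable, non-reversible token."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return f"asset-{digest.hexdigest()[:12]}"

# "db01.internal" always maps to the same token in prompts,
# so the coordinator can still reason about the asset across turns.
print(pseudonymise("db01.internal"))
```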

What happens when the API is down

Kafka buffers the pending work so nothing is lost while the reasoning layer waits. A circuit-breaker retries with exponential backoff and pages the on-call engineer after three consecutive failures, as sketched below.
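
A minimal sketch of that retry logic, assuming the SDK raises ConnectionError on network failure; the delays and alerting hook are illustrative:

```python
import time

MAX_FAILURES = 3  # page on-call after three consecutive failures

def page_on_call() -> None:
    print("paging on-call engineer")  # stand-in for a real alerting hook

def call_with_backoff(call, base_delay: float = 1.0):
    """Retry a flaky API call with exponential backoff (1s, 2s, 4s ...)."""
    for attempt in range(MAX_FAILURES):
        try:
            return call()
        except ConnectionError:
            time.sleep(base_delay * 2 ** attempt)
    page_on_call()
    raise RuntimeError("API unreachable; work stays buffered in Kafka")
```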

Keeping costs under control

We use Anthropic's five-minute prompt cache so the asset-graph context isn't re-sent with every hypothesis. Low-priority work is batched through the Batch API, which cuts costs by roughly 50 per cent.
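
A sketch of the caching pattern with Anthropic's Python SDK: the cache_control marker covers the large, stable asset-graph block, so only the short, changing request is re-billed in full. The model ID and context variable are illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
asset_graph_context = "...serialised, pseudonymised asset graph..."

response = client.messages.create(
    model="claude-sonnet-4-6",  # illustrative model ID
    max_tokens=1024,
    system=[
        {"type": "text",
         "text": "You coordinate a sandboxed pentest. Propose hypotheses as JSON."},
        {"type": "text",
         "text": asset_graph_context,             # large, stable context
         "cache_control": {"type": "ephemeral"}}, # cached server-side (~5 min)
    ],
    messages=[{"role": "user", "content": "Propose the next exploit hypothesis."}],
)
```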

Exit route: the coordinator sits behind a thin interface. If we ever outgrow the API on cost or privacy grounds, a self-hosted fine-tuned model slots in without touching any other layer.
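
In code terms, that thin interface might be nothing more than a small protocol; a sketch, with names of our own choosing:

```python
from typing import Protocol

class ReasoningBackend(Protocol):
    """The only surface the other three layers ever see."""
    def propose_hypothesis(self, asset_graph: dict) -> dict: ...

class ClaudeBackend:
    def propose_hypothesis(self, asset_graph: dict) -> dict:
        raise NotImplementedError  # wraps the Anthropic API call

class SelfHostedBackend:
    def propose_hypothesis(self, asset_graph: dict) -> dict:
        raise NotImplementedError  # a future fine-tuned model slots in here
```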

9 of 13
Tools and Technologies
 Scanning

Nmap · Shodan · Nuclei

These three cover different angles — internal network topology, external exposure, and known CVE signatures. They run behind a shared abstraction layer, so adding a new scanner later is straightforward.
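
A sketch of what that shared abstraction could look like; the class and field names are ours, not any real library's:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ScanResult:
    target: str
    findings: list[dict]  # one normalised shape across all scanners

class Scanner(Protocol):
    name: str
    def scan(self, target: str) -> ScanResult: ...

class NmapScanner:
    name = "nmap"
    def scan(self, target: str) -> ScanResult:
        # would shell out to nmap and parse its XML output
        return ScanResult(target=target, findings=[])

# Adding Shodan, Nuclei, or a future scanner means one new class,
# not changes to the ingestion layer.
```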

 LLM Reasoning

Claude API (Anthropic)

Powers the ReAct coordinator. It reads the asset graph, generates hypotheses as structured JSON, and updates its view after each sandbox result. Every call is logged in full for the audit trail.

 Validation

Docker + Kubernetes NetworkPolicies

Each hypothesis gets its own ephemeral container, spun up from a versioned IaC template. The network is completely isolated. If the exploit doesn't confirm in 120 seconds, the container is torn down and the result is logged as a timeout.
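
A sketch of that lifecycle with the docker Python SDK, assuming the exploit image and an isolated sandbox network already exist; the 120-second budget maps onto the wait timeout:

```python
import docker  # pip install docker

def validate(image: str, command: list[str]) -> str:
    """Run one exploit hypothesis in a throwaway, isolated container."""
    client = docker.from_env()
    container = client.containers.run(
        image, command,
        detach=True,
        network="sandbox-net",  # isolated network holding only the target replica
    )
    try:
        result = container.wait(timeout=120)  # the 120-second budget
        return "pass" if result["StatusCode"] == 0 else "fail"
    except Exception:
        return "timeout"  # budget exceeded; logged as such
    finally:
        container.remove(force=True)  # always torn down, pass or fail
```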

 Asset Data

PostgreSQL + DVC

The asset-relationship graph lives here — service dependencies, network adjacency, shared credential domains. DVC keeps a version history so any past scan state can be replayed exactly.

 Messaging

Apache Kafka

All inter-layer communication goes through Kafka. This means each layer scales independently, nothing is lost if a consumer is temporarily slow, and the circuit-breaker can buffer Claude API calls during outages.

 Orchestration

Kubernetes + Helm

Helm charts version every deployment configuration. New releases go out as canary deployments — they handle 10% of traffic and roll back automatically if error rates exceed the threshold.

 Observability

Prometheus + Grafana

Tracks service latency, sandbox spin-up times, and hypothesis throughput in real time. Alerting rules flag a rising false-positive rate, high timeout counts, or a coordinator stuck in a loop.

 Audit

ELK Stack

Every Claude prompt and response, every tool call, every sandbox outcome — all indexed and searchable. This is the compliance store, aligned with ISO 27001 requirements.

 ML Ops

MLflow

Tracks model versions, hyperparameters, and data snapshots so any past state of the system can be reproduced. Feeds directly into the canary rollback pipeline when drift is detected.
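
A sketch of the tracking call with the standard mlflow API; the parameter names and values are illustrative:

```python
import mlflow

mlflow.set_experiment("pentest-coordinator")

with mlflow.start_run():
    mlflow.log_params({
        "model_id": "claude-sonnet-4-6",       # illustrative
        "prompt_version": "v12",
        "asset_graph_snapshot": "dvc:a1b2c3",
    })
    mlflow.log_metric("true_positive_rate", 0.87)  # example value
```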

10 of 13
Inside the Reasoning Layer
The ReAct Loop — Step by Step
Read the Asset Graph
Loads service dependencies, network adjacency, and credential domains from PostgreSQL
Reason
Pick the Most Interesting Targets
Claude identifies the highest-risk assets given what's been found so far
Act
Generate a Hypothesis
Outputs a structured JSON exploit proposal — not a finding yet
Observe
Get the Sandbox Result
Pass · Fail · Timeout · Error — the coordinator updates its internal state
Log
Write to the Audit Store
Full prompt and response appended to the JSONL log — immutable
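The five steps above, reduced to a Python skeleton. Each collaborator is passed in, so this is a shape sketch rather than a working coordinator:

```python
def react_loop(coordinator, sandbox, audit, max_steps: int = 50) -> None:
    """Read -> Reason -> Act -> Observe -> Log, repeated until done."""
    state = coordinator.read_asset_graph()           # Step 1: PostgreSQL
    for _ in range(max_steps):
        hypothesis = coordinator.propose(state)      # Steps 2-3: reason, then act
        if hypothesis is None:                       # nothing promising left
            break
        outcome = sandbox.validate(hypothesis)       # Step 4: pass/fail/timeout/error
        audit.log(hypothesis, outcome)               # Step 5: immutable JSONL
        state = coordinator.observe(state, outcome)  # fold the result back in
```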
Design Decisions Worth Explaining
 Tackling Hallucination

Claude can only propose — the sandbox decides

This is the core idea. Claude generates hypotheses; the sandbox confirms or rejects them. Nothing reaches a report unless it's been proven to work. This breaks the self-referential loop that causes high false-positive rates in purely agentic tools.

 Seeing the Whole Network

Graph-aware reasoning catches what pipelines miss

Because the coordinator reads the full asset-relationship graph, it can follow lateral movement chains — something isolated pipeline stages simply can't do.

 Staying Flexible

The scan adapts as it runs

If Claude spots something unexpected halfway through, it can change course — rather than completing a predetermined sequence and only then looking at what it found.

11 of 13
Keeping an Eye on Things
What We Watch For
  • A creeping false-positive rate — often the first sign of model drift
  • High sandbox timeouts — usually means the environment isn't set up right
  • The coordinator going around in loops — burns cost and produces nothing
  • Unexpected network traffic from a sandbox — a potential breach signal
How We Watch
 Prometheus + Grafana

Real-time metrics

Latency percentiles, sandbox provisioning time, hypothesis throughput, and per-service error rates — all visible on a live dashboard.
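
A sketch of how the validation layer might expose these numbers with the standard prometheus_client library; metric names are illustrative:

```python
from prometheus_client import Counter, Histogram, start_http_server

HYPOTHESES = Counter(
    "hypotheses_total", "Hypotheses validated", ["outcome"]  # pass/fail/timeout/error
)
SANDBOX_SPINUP = Histogram(
    "sandbox_spinup_seconds", "Time to provision a validation container"
)

start_http_server(9100)  # Prometheus scrapes this port

# In the validation path:
with SANDBOX_SPINUP.time():
    pass  # container provisioning happens here
HYPOTHESES.labels(outcome="pass").inc()
```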

 ELK Stack

Searchable audit trail

Every Claude decision, tool call, and validation outcome is indexed and queryable, aligned with ISO 27001 requirements.

When Something Goes Wrong
1

Sandbox breach suspected

The platform shuts down immediately. All containers are torn down and a human investigates before anything restarts.

2

Model drift detected

The system rolls back automatically to the last validated MLflow checkpoint through the CI/CD pipeline — no manual steps needed.

3

False-positive rate creeping up

Scans are paused and a human reviews the recent Claude decisions before things resume. Usually this indicates prompt drift or a data issue.

All incidents go into the ISO 27001-aligned compliance log. The governance board — security, legal, and engineering — reviews these quarterly alongside any model updates.

12 of 13
Doing This Responsibly
 Fairness

We've thought about what the model misses

The vulnerability classifier's training data is balanced across different technology stacks so it doesn't systematically under-detect issues on certain platforms. We run quarterly bias audits to check this is holding.

 Transparency

Every finding can be traced back

The audit log means you can follow any reported vulnerability all the way from the initial scan, through the Claude reasoning step, to the sandbox run that confirmed it. Nothing is a black box.

 Accountability

Humans stay in the loop where it matters

Low-severity work runs automatically for speed. Anything rated high or critical needs explicit analyst sign-off before it's executed or published. The quarterly governance board owns the broader oversight.

 Privacy

Data protection by design

Personal data in scan logs is pseudonymised as it comes in, in line with the Data Protection Act 2018. Access is role-restricted, retention periods are limited to what's legally required, and deletion is automated.

NIST AI RMF 1.0 UK AI Safety Institute UK GDPR · DPA 2018 ISO/IEC 27001:2022 NCSC ML Security Principles OWASP ML Security Top 10 CVSS v3.1
13 of 13
Summary

The core idea is simple: LLM reasoning is powerful but unreliable on its own. By pairing Claude with a sandboxed validation step, we get the best of both worlds — flexible, context-aware analysis that only reports what it can actually prove. Nothing reaches a report without running in an isolated environment first.

Discover
Nmap · Shodan · Nuclei
asset graph
Reason
Claude ReAct
coordinator
Validate
Isolated sandbox
confirmed findings only
Report
CVSS v3.1 scored
stakeholder-tailored
85%+
Validated true-positive rate
<4h
Time to report critical findings
4
Architecture layers
0
Unverified exploits in reports
100%
Audit traceability