1 of 13
AI Systems Engineering  ·  CMP-L044  ·  Part 1 Assessment Group 1  ·  University of Roehampton  ·  2026
Security Automation Research

Autonomous Multi-Agent
Penetration Testing
Platform

We built a platform that makes security testing more continuous and less manual. Rather than replacing human testers, it handles the repetitive parts — scanning, correlation, and validation — so analysts can focus on what genuinely needs their attention.

Claude API ReAct Loop Sandboxed Validation Kubernetes CVSS v3.1 UK GDPR  ·  ISO 27001
2 of 13
The Problem
 Context

Security teams can't keep up with deployment pace

Modern software ships changes several times a day, but security testing still tends to happen at fixed intervals — quarterly at best. New vulnerabilities can appear and sit unnoticed for weeks between assessments.

 Gap — LLM-Only Tools

Lots of noise, little signal

Tools like PentestGPT are good at reasoning through attack scenarios, but they never actually check whether the exploits work. The result is a flood of plausible-sounding findings that analysts have to verify manually — which defeats the purpose.

 Gap — Rigid Pipelines

Can't see the bigger picture

Traditional modular pipelines are precise within each stage, but they work in isolation. They miss connections between assets, can't follow a trail across the network, and can't adapt when something unexpected turns up mid-scan.

<4h
Target time-to-report for critical findings
85%+
Validated true-positive rate
100%
Auditability of agent decisions
0
Unverified exploits in final reports
 NIST SP 800-145

NIST's definition of cloud computing describes on-demand, rapidly elastic environments: exactly the conditions under which fixed-interval review falls behind. Our platform is designed with that in mind — automating the parts that don't need a human in the loop, and surfacing only what does.

3 of 13
Who It's For and What It Covers
Target Users

Security Analysts

Get pre-validated, prioritised findings rather than raw scanner output

DevOps / SRE

Hook security checks directly into CI/CD without slowing delivery

Risk Managers

Receive scored, contextualised risk reports rather than technical dumps

Compliance Officers

Every decision is logged in a tamper-proof audit trail

System Boundaries
 In Scope
  • Controlled, cloud-based test environments
  • Automated reconnaissance, analysis, and exploit verification
  • Known vulnerabilities from CVE and NVD feeds
  • Integration with SIEM and CI/CD tooling
 Out of Scope
  • Live production systems — testing environments only
  • Social engineering or physical security
  • Zero-day discovery — this requires human research
4 of 13
What Already Exists — and Where It Falls Short
Dimension Agentic LLM tools (PentestGPT, AutoAttacker) Modular pipelines Our hybrid approach
Adaptability High — flexible, context-aware Low — fixed stage ordering High — LLM guides prioritisation
Precision Low — lots of false positives High — narrow, well-defined scope High — sandbox-confirmed only
Auditability Limited — reasoning is opaque Strong — logs at each stage Strong — full decision trail
Cross-asset reasoning Yes No — stages are siloed Yes
Execution validation Not present Not present Yes — sandboxed testing

The critical gap is that neither approach actually tests whether its findings are real. Sculley et al. note that unvalidated ML outputs accumulate as technical debt over time. Our solution keeps the flexibility of agentic reasoning but wraps it inside a validation layer — so everything reported has been confirmed by actual execution, not just inferred.

5 of 13
What the System Needs to Do
Functional Requirements

FR1 Asset Discovery

Scan the environment using Nmap and Shodan, then build a structured map of how everything connects

FR2 Intelligent Orchestration

Use the LLM to decide where to look next based on what it's already found, not a fixed sequence

FR3 Sandboxed Validation

Run every proposed exploit in an isolated environment — nothing gets reported without proof it works

FR4 Scored Reporting

Score findings using CVSS v3.1 and tailor the output for each audience — analyst, executive, compliance

FR5 Decision Audit Trail

Log every reasoning step, tool call, and validation result so decisions can be explained and reviewed
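
To make FR5 concrete, here is a minimal sketch of what one audit-trail record could look like, written as append-only JSONL. The field names are illustrative, not a final schema.

```python
import json, time, uuid
from dataclasses import dataclass, asdict

@dataclass
class AuditRecord:
    """One immutable entry in the decision audit trail (fields illustrative)."""
    event_id: str      # unique ID for this reasoning step
    step_type: str     # "reason", "tool_call", or "validation"
    model_input: str   # full prompt sent to the coordinator
    model_output: str  # full response received back
    timestamp: float

def append_audit_record(record: AuditRecord, path: str = "audit.jsonl") -> None:
    # Append-only JSONL keeps the trail replayable and easy to review
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

append_audit_record(AuditRecord(
    event_id=str(uuid.uuid4()),
    step_type="tool_call",
    model_input="prompt text...",
    model_output="response text...",
    timestamp=time.time(),
))
```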

Non-Functional Requirements

NFR1 Speed

Critical findings should reach analysts within four hours of a scan completing

NFR2 Accuracy

At least 85 per cent of reported findings should be genuine — not noise

NFR3 Scalability

Each layer should scale independently on Kubernetes as demand grows

NFR4 Privacy

Personal data is pseudonymised at ingestion and handled in line with UK GDPR

NFR5 Human Oversight

Any high-severity exploit needs analyst sign-off before it runs
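
NFR5, combined with the CVSS ≥ 7.0 human gate shown in the architecture on the next page, implies a simple routing check before any exploit runs. A minimal sketch, with queue names of our own invention:

```python
HUMAN_GATE_THRESHOLD = 7.0  # CVSS v3.1 rates 7.0 and above as High

def route(hypothesis: dict) -> str:
    """Decide where a hypothesis goes next (NFR5: human oversight)."""
    if hypothesis["cvss"] >= HUMAN_GATE_THRESHOLD:
        return "analyst-approval"    # waits for explicit sign-off
    return "sandbox-validation"      # low severity runs automatically

assert route({"cvss": 9.8}) == "analyst-approval"
assert route({"cvss": 4.3}) == "sandbox-validation"
```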

6 of 13
How It's Built — Four Layers
Layer 1 — Data Ingestion
Nmap · Shodan · Nuclei → asset graph in PostgreSQL, versioned with DVC
Kafka  ·  ingestion.complete
Layer 2 — Multi-Agent Reasoning
Claude API ReAct coordinator · generates hypotheses · logs every decision
Human gate (CVSS ≥ 7.0)  ·  Kafka
Layer 3 — Sandbox Validation
Ephemeral Docker containers · isolated network · 120-second timeout
Kafka  ·  validation.complete
Layer 4 — Reporting & Monitoring
CVSS-scored reports · Prometheus/Grafana · ELK audit trail
Cross-Cutting Infrastructure

Apache Kafka

Async message bus connecting all four layers — lets each one scale independently and handles back-pressure when things get busy

Kubernetes + Helm

Every service runs in a container with its own scaling policy. Deployments go through a canary pipeline — new versions handle 10% of traffic before being promoted

PostgreSQL + DVC

The asset graph lives here, along with a full version history of every scan run — any past state can be replayed exactly

Security controls: mutual TLS and role-based access control at the API gateway. Zero-trust network posture throughout.
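
To make the layer handoff concrete, a sketch of how Layer 1 might publish its completion event using the kafka-python client. The topic name comes from the diagram above; the payload fields are illustrative.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Layer 1 announces a finished scan; Layer 2 consumes this topic
producer.send("ingestion.complete", {
    "scan_id": "scan-042",          # illustrative payload
    "asset_count": 137,
    "graph_version": "dvc:a1b2c3",
})
producer.flush()
```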

7 of 13
Claude API — Managed Service vs Building Our Own Model

We're using Anthropic's Claude API as the reasoning coordinator rather than training a security-specific model from scratch. It's a deliberate choice with real trade-offs on both sides, and we've set them out honestly below.

Dimension
Managed API (Claude) — what we chose
Fine-tuned / self-trained model
How quickly we can ship
Working in hours — just bring an API key and connect the SDK
Weeks of data gathering, training runs, and evaluation before anything is usable
Model quality over time
Automatically improves as Anthropic ships new versions — one config change to upgrade
Frozen at training time — needs periodic re-training to stay useful
Built-in safety
Constitutional AI and RLHF baked in — helps prevent the tool being misused to generate live attacks
Safety guardrails need to be built from scratch, which is non-trivial for a system that handles exploit generation
Data leaving our network
Scan context is sent to Anthropic — needs a signed DPA and careful data minimisation in prompts
Everything stays on-premises, which makes UK GDPR compliance simpler
Cost as we scale
Per-token billing adds up fast at high hypothesis volume — large scans can get expensive
Fixed infrastructure cost once it's trained — cheaper per-call at scale
What happens if the API goes down
The reasoning layer stalls — we depend on Anthropic's uptime and rate limits
Self-hosted, so availability is entirely within our control
Domain specialisation
General-purpose — prompt engineering helps, but it's not the same as deep security fine-tuning
Can be trained specifically on CVE databases and exploit datasets for sharper results
8 of 13
Why Claude Makes Sense Here — and How We Handle the Risks
Why We Went With the Managed API
 No Training Pipeline Needed

Focus on what's novel

Building a training pipeline would eat the entire project timeline. Claude lets us focus on the parts that are genuinely new — the orchestration architecture, the validation loop, and the audit trail — rather than solving a problem that's already been solved.

 Safety Is Already There

A useful first line of defence

A tool that generates working exploits is a dual-use risk. Claude's built-in refusal policies aren't perfect, but they reduce the governance burden significantly compared to starting from zero.

 We Improve Automatically

Upgrading costs nothing

A fine-tuned model is locked to its training data. Switching from Sonnet 4.6 to Opus 4.7 is a single line change — no retraining, no validation cycle, and users immediately benefit from the better reasoning.

How We Handle the Risks

Data leaving the network

Asset metadata is pseudonymised before anything is sent to the API. Hostnames and IP addresses never appear in prompts — only sanitised descriptors do. We also have a signed DPA with Anthropic in place.
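
A minimal sketch of that sanitisation step, assuming a keyed HMAC so tokens are stable across scans but can't be reversed without the secret. The descriptor format is our own.

```python
import hashlib, hmac

SECRET_KEY = b"rotate-me"  # held in a vault; never sent to the API

def pseudonymise(identifier: str) -> str:
    """Replace a hostname or IP with a stable, non-reversible token."""
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256)
    return f"asset-{digest.hexdigest()[:12]}"

# "db01.internal" always maps to the same token in prompts,
# so the coordinator can still reason about the asset across turns.
print(pseudonymise("db01.internal"))
```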

What happens when the API is down

Kafka buffers the pending work so nothing is lost while the reasoning layer waits. A circuit-breaker retries with exponential backoff and pages the on-call engineer after three consecutive failures, as sketched below.
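
A minimal sketch of that retry logic, assuming the SDK raises ConnectionError on network failure; the delays and alerting hook are illustrative:

```python
import time

MAX_FAILURES = 3  # page on-call after three consecutive failures

def page_on_call() -> None:
    print("paging on-call engineer")  # stand-in for a real alerting hook

def call_with_backoff(call, base_delay: float = 1.0):
    """Retry a flaky API call with exponential backoff (1s, 2s, 4s ...)."""
    for attempt in range(MAX_FAILURES):
        try:
            return call()
        except ConnectionError:
            time.sleep(base_delay * 2 ** attempt)
    page_on_call()
    raise RuntimeError("API unreachable; work stays buffered in Kafka")
```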

Keeping costs under control

We use Anthropic's five-minute prompt cache so the asset-graph context isn't re-sent with every hypothesis. Low-priority work is batched through the Batch API, which cuts costs by roughly 50 per cent.
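
A sketch of the caching pattern with Anthropic's Python SDK: the cache_control marker covers the large, stable asset-graph block, so only the short, changing request is re-billed in full. The model ID and context variable are illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
asset_graph_context = "...serialised, pseudonymised asset graph..."

response = client.messages.create(
    model="claude-sonnet-4-6",  # illustrative model ID
    max_tokens=1024,
    system=[
        {"type": "text",
         "text": "You coordinate a sandboxed pentest. Propose hypotheses as JSON."},
        {"type": "text",
         "text": asset_graph_context,             # large, stable context
         "cache_control": {"type": "ephemeral"}}, # cached server-side (~5 min)
    ],
    messages=[{"role": "user", "content": "Propose the next exploit hypothesis."}],
)
```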

Exit route: the coordinator sits behind a thin interface. If we ever outgrow the API on cost or privacy grounds, a self-hosted fine-tuned model slots in without touching any other layer.
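
In code terms, that thin interface might be nothing more than a small protocol; a sketch, with names of our own choosing:

```python
from typing import Protocol

class ReasoningBackend(Protocol):
    """The only surface the other three layers ever see."""
    def propose_hypothesis(self, asset_graph: dict) -> dict: ...

class ClaudeBackend:
    def propose_hypothesis(self, asset_graph: dict) -> dict:
        raise NotImplementedError  # wraps the Anthropic API call

class SelfHostedBackend:
    def propose_hypothesis(self, asset_graph: dict) -> dict:
        raise NotImplementedError  # a future fine-tuned model slots in here
```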

9 of 13
Tools and Technologies
 Scanning

Nmap · Shodan · Nuclei

These three cover different angles — internal network topology, external exposure, and known CVE signatures. They run behind a shared abstraction layer, so adding a new scanner later is straightforward.
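
A sketch of what that shared abstraction could look like; the class and field names are ours, not any real library's:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ScanResult:
    target: str
    findings: list[dict]  # one normalised shape across all scanners

class Scanner(Protocol):
    name: str
    def scan(self, target: str) -> ScanResult: ...

class NmapScanner:
    name = "nmap"
    def scan(self, target: str) -> ScanResult:
        # would shell out to nmap and parse its XML output
        return ScanResult(target=target, findings=[])

# Adding Shodan, Nuclei, or a future scanner means one new class,
# not changes to the ingestion layer.
```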

 LLM Reasoning

Claude API (Anthropic)

Powers the ReAct coordinator. It reads the asset graph, generates hypotheses as structured JSON, and updates its view after each sandbox result. Every call is logged in full for the audit trail.

 Validation

Docker + Kubernetes NetworkPolicies

Each hypothesis gets its own ephemeral container, spun up from a versioned IaC template. The network is completely isolated. If the exploit doesn't confirm in 120 seconds, the container is torn down and the result is logged as a timeout.
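
A sketch of that lifecycle with the docker Python SDK, assuming the exploit image and an isolated sandbox network already exist; the 120-second budget maps onto the wait timeout:

```python
import docker  # pip install docker

def validate(image: str, command: list[str]) -> str:
    """Run one exploit hypothesis in a throwaway, isolated container."""
    client = docker.from_env()
    container = client.containers.run(
        image, command,
        detach=True,
        network="sandbox-net",  # isolated network holding only the target replica
    )
    try:
        result = container.wait(timeout=120)  # the 120-second budget
        return "pass" if result["StatusCode"] == 0 else "fail"
    except Exception:
        return "timeout"  # budget exceeded; logged as such
    finally:
        container.remove(force=True)  # always torn down, pass or fail
```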

 Asset Data

PostgreSQL + DVC

The asset-relationship graph lives here — service dependencies, network adjacency, shared credential domains. DVC keeps a version history so any past scan state can be replayed exactly.

 Messaging

Apache Kafka

All inter-layer communication goes through Kafka. This means each layer scales independently, nothing is lost if a consumer is temporarily slow, and the circuit-breaker can buffer Claude API calls during outages.

 Orchestration

Kubernetes + Helm

Helm charts version every deployment configuration. New releases go out as canary deployments — they handle 10% of traffic and roll back automatically if error rates exceed the threshold.

 Observability

Prometheus + Grafana

Tracks service latency, sandbox spin-up times, and hypothesis throughput in real time. Alerting rules flag a rising false-positive rate, high timeout counts, or a coordinator stuck in a loop.

 Audit

ELK Stack

Every Claude prompt and response, every tool call, every sandbox outcome — all indexed and searchable. This is the compliance store, aligned with ISO 27001 requirements.

 ML Ops

MLflow

Tracks model versions, hyperparameters, and data snapshots so any past state of the system can be reproduced. Feeds directly into the canary rollback pipeline when drift is detected.
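
A sketch of the tracking call with the standard mlflow API; the parameter names and values are illustrative:

```python
import mlflow

mlflow.set_experiment("pentest-coordinator")

with mlflow.start_run():
    mlflow.log_params({
        "model_id": "claude-sonnet-4-6",       # illustrative
        "prompt_version": "v12",
        "asset_graph_snapshot": "dvc:a1b2c3",
    })
    mlflow.log_metric("true_positive_rate", 0.87)  # example value
```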

10 of 13
Inside the Reasoning Layer
The ReAct Loop — Step by Step
Read the Asset Graph
Loads service dependencies, network adjacency, and credential domains from PostgreSQL
Reason
Pick the Most Interesting Targets
Claude identifies the highest-risk assets given what's been found so far
Act
Generate a Hypothesis
Outputs a structured JSON exploit proposal — not a finding yet
Observe
Get the Sandbox Result
Pass · Fail · Timeout · Error — the coordinator updates its internal state
Log
Write to the Audit Store
Full prompt and response appended to the JSONL log — immutable
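The five steps above, reduced to a Python skeleton. Each collaborator is passed in, so this is a shape sketch rather than a working coordinator:

```python
def react_loop(coordinator, sandbox, audit, max_steps: int = 50) -> None:
    """Read -> Reason -> Act -> Observe -> Log, repeated until done."""
    state = coordinator.read_asset_graph()           # Step 1: PostgreSQL
    for _ in range(max_steps):
        hypothesis = coordinator.propose(state)      # Steps 2-3: reason, then act
        if hypothesis is None:                       # nothing promising left
            break
        outcome = sandbox.validate(hypothesis)       # Step 4: pass/fail/timeout/error
        audit.log(hypothesis, outcome)               # Step 5: immutable JSONL
        state = coordinator.observe(state, outcome)  # fold the result back in
```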
Design Decisions Worth Explaining
 Tackling Hallucination

Claude can only propose — the sandbox decides

This is the core idea. Claude generates hypotheses; the sandbox confirms or rejects them. Nothing reaches a report unless it's been proven to work. This breaks the self-referential loop that causes high false-positive rates in purely agentic tools.

 Seeing the Whole Network

Graph-aware reasoning catches what pipelines miss

Because the coordinator reads the full asset-relationship graph, it can follow lateral movement chains — something isolated pipeline stages simply can't do.

 Staying Flexible

The scan adapts as it runs

If Claude spots something unexpected halfway through, it can change course — rather than completing a predetermined sequence and only then looking at what it found.

11 of 13
Keeping an Eye on Things
What We Watch For
  • A creeping false-positive rate — often the first sign of model drift
  • High sandbox timeouts — usually means the environment isn't set up right
  • The coordinator going around in loops — burns cost and produces nothing
  • Unexpected network traffic from a sandbox — a potential breach signal
How We Watch
 Prometheus + Grafana

Real-time metrics

Latency percentiles, sandbox provisioning time, hypothesis throughput, and per-service error rates — all visible on a live dashboard.
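
A sketch of how the validation layer might expose these numbers with the standard prometheus_client library; metric names are illustrative:

```python
from prometheus_client import Counter, Histogram, start_http_server

HYPOTHESES = Counter(
    "hypotheses_total", "Hypotheses validated", ["outcome"]  # pass/fail/timeout/error
)
SANDBOX_SPINUP = Histogram(
    "sandbox_spinup_seconds", "Time to provision a validation container"
)

start_http_server(9100)  # Prometheus scrapes this port

# In the validation path:
with SANDBOX_SPINUP.time():
    pass  # container provisioning happens here
HYPOTHESES.labels(outcome="pass").inc()
```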

 ELK Stack

Searchable audit trail

Every Claude decision, tool call, and validation outcome is indexed and queryable, aligned with ISO 27001 requirements.

When Something Goes Wrong
1

Sandbox breach suspected

The platform shuts down immediately. All containers are torn down and a human investigates before anything restarts.

2

Model drift detected

The system rolls back automatically to the last validated MLflow checkpoint through the CI/CD pipeline — no manual steps needed.

3

False-positive rate creeping up

Scans are paused and a human reviews the recent Claude decisions before things resume. Usually this indicates prompt drift or a data issue.

All incidents go into the ISO 27001-aligned compliance log. The governance board — security, legal, and engineering — reviews these quarterly alongside any model updates.

12 of 13
Doing This Responsibly
 Fairness

We've thought about what the model misses

The vulnerability classifier's training data is balanced across different technology stacks so it doesn't systematically under-detect issues on certain platforms. We run quarterly bias audits to check this is holding.

 Transparency

Every finding can be traced back

The audit log means you can follow any reported vulnerability all the way from the initial scan, through the Claude reasoning step, to the sandbox run that confirmed it. Nothing is a black box.

 Accountability

Humans stay in the loop where it matters

Low-severity work runs automatically for speed. Anything rated high or critical needs explicit analyst sign-off before it's executed or published. The quarterly governance board owns the broader oversight.

 Privacy

Data protection by design

Personal data in scan logs is pseudonymised as it comes in, in line with the Data Protection Act 2018. Access is role-restricted, retention periods are limited to what's legally required, and deletion is automated.

NIST AI RMF 1.0 UK AI Safety Institute UK GDPR · DPA 2018 ISO/IEC 27001:2022 NCSC ML Security Principles OWASP ML Security Top 10 CVSS v3.1
13 of 13
Summary

The core idea is simple: LLM reasoning is powerful but unreliable on its own. By pairing Claude with a sandboxed validation step, we get the best of both worlds — flexible, context-aware analysis that only reports what it can actually prove. Nothing reaches a report without running in an isolated environment first.

Discover
Nmap · Shodan · Nuclei
asset graph
Reason
Claude ReAct
coordinator
Validate
Isolated sandbox
confirmed findings only
Report
CVSS v3.1 scored
stakeholder-tailored
85%+
Validated true-positive rate
<4h
Time to report critical findings
4
Architecture layers
0
Unverified exploits in reports
100%
Audit traceability