We built a platform that makes security testing more continuous and less manual. Rather than replacing human testers, it handles the repetitive parts — scanning, correlation, and validation — so analysts can focus on what genuinely needs their attention.
Modern software ships changes several times a day, but security testing still tends to happen at fixed intervals — quarterly at best. New vulnerabilities can appear and sit unnoticed for weeks between assessments.
Tools like PentestGPT are good at reasoning through attack scenarios, but they never actually check whether the exploits work. The result is a flood of plausible-sounding findings that analysts have to verify manually — which defeats the purpose.
Traditional modular pipelines are precise within each stage, but they work in isolation. They miss connections between assets, can't follow a trail across the network, and can't adapt when something unexpected turns up mid-scan.
NIST's guidance on cloud computing explicitly calls for continuous monitoring rather than periodic review. Our platform is designed with that in mind — automating the parts that don't need a human in the loop, and surfacing only what does.
- Get pre-validated, prioritised findings rather than raw scanner output
- Hook security checks directly into CI/CD without slowing delivery
- Receive scored, contextualised risk reports rather than technical dumps (a scoring sketch follows this list)
- Every decision is logged in a tamper-proof audit trail
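To make the scoring point concrete, here is a minimal sketch of how a CVSS v3.1 vector could be turned into an audience-tailored entry, using the open-source `cvss` Python package. The vector string, asset name, and report fields are illustrative, not the platform's actual schema.

```python
# Minimal sketch of audience-tailored scoring using the `cvss` package
# (pip install cvss). Vector, asset, and fields are illustrative.
from cvss import CVSS3

def score_finding(vector: str, asset: str, audience: str) -> dict:
    """Turn a raw CVSS v3.1 vector into an audience-tailored report entry."""
    c = CVSS3(vector)
    base, temporal, environmental = c.scores()
    severity = c.severities()[0]  # base severity: None/Low/Medium/High/Critical
    entry = {"asset": asset, "score": base, "severity": severity}
    if audience == "executive":
        # Executives get severity and context, not the raw vector string.
        entry["summary"] = f"{severity} risk on {asset}"
    else:
        entry["vector"] = c.clean_vector()
    return entry

print(score_finding("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H",
                    asset="payments-api", audience="analyst"))
```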
| Dimension | Agentic LLM tools (PentestGPT, AutoAttacker) | Modular pipelines | Our hybrid approach |
|---|---|---|---|
| Adaptability | High — flexible, context-aware | Low — fixed stage ordering | High — LLM guides prioritisation |
| Precision | Low — lots of false positives | High — narrow, well-defined scope | High — sandbox-confirmed only |
| Auditability | Limited — reasoning is opaque | Strong — logs at each stage | Strong — full decision trail |
| Cross-asset reasoning | Yes | No — stages are siloed | Yes |
| Execution validation | Not present | Not present | Yes — sandboxed testing |
The critical gap is that neither approach actually tests whether its findings are real. Sculley et al. note that unvalidated ML outputs accumulate as technical debt over time. Our solution keeps the flexibility of agentic reasoning but wraps it inside a validation layer — so everything reported has been confirmed by actual execution, not just inferred.
| ID | Requirement | Description |
|---|---|---|
| FR1 | Asset Discovery | Scan the environment using Nmap and Shodan, then build a structured map of how everything connects |
| FR2 | Intelligent Orchestration | Use the LLM to decide where to look next based on what it has already found, not a fixed sequence |
| FR3 | Sandboxed Validation | Run every proposed exploit in an isolated environment; nothing gets reported without proof that it works (see the sketch after this table) |
| FR4 | Scored Reporting | Score findings using CVSS v3.1 and tailor the output for each audience: analyst, executive, compliance |
| FR5 | Decision Audit Trail | Log every reasoning step, tool call, and validation result so decisions can be explained and reviewed |
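To make FR2 and FR3 concrete, here is a minimal sketch of the propose-validate loop. It assumes a hypothetical `llm.complete` wrapper and a `run_in_sandbox` callable (a fuller sandbox sketch appears in the architecture notes below); none of these names are the platform's real interfaces.

```python
# Minimal sketch of the FR2/FR3 loop: the LLM proposes the next hypothesis,
# the sandbox confirms or rejects it, and only confirmed findings are kept.
# All names (propose_next_hypothesis, run_in_sandbox, llm.complete) are
# illustrative stand-ins, not the platform's real interfaces.
import json

def propose_next_hypothesis(llm, asset_graph: dict, history: list) -> dict | None:
    """Ask the coordinator for the next test as structured JSON."""
    prompt = (
        "Given this asset graph and prior results, propose the single most "
        "promising exploit hypothesis as JSON with keys 'target' and "
        "'technique', or the literal string DONE if nothing remains.\n"
        f"Graph: {json.dumps(asset_graph)}\nHistory: {json.dumps(history)}"
    )
    reply = llm.complete(prompt)  # hypothetical thin wrapper around the model API
    return None if reply.strip() == "DONE" else json.loads(reply)

def assessment_loop(llm, asset_graph, run_in_sandbox, max_steps=50):
    history, confirmed = [], []
    for _ in range(max_steps):
        hypothesis = propose_next_hypothesis(llm, asset_graph, history)
        if hypothesis is None:
            break
        outcome = run_in_sandbox(hypothesis)  # "confirmed" / "rejected" / "timeout"
        history.append({"hypothesis": hypothesis, "outcome": outcome})
        if outcome == "confirmed":            # nothing unproven is reported
            confirmed.append(hypothesis)
    return confirmed
```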
| ID | Requirement | Description |
|---|---|---|
| NFR1 | Speed | Critical findings should reach analysts within four hours of a scan completing |
| NFR2 | Accuracy | At least 85 per cent of reported findings should be genuine, not noise |
| NFR3 | Scalability | Each layer should scale independently on Kubernetes as demand grows |
| NFR4 | Privacy | Personal data is pseudonymised at ingestion and handled in line with UK GDPR |
| NFR5 | Human Oversight | Any high-severity exploit needs analyst sign-off before it runs |
- An async message bus connects all four layers, letting each one scale independently and absorbing back-pressure when things get busy.
- Every service runs in a container with its own scaling policy; deployments go through a canary pipeline in which new versions handle 10% of traffic before being promoted.
- A versioned data store holds the asset graph alongside a full history of every scan run, so any past state can be replayed exactly.
- Security controls: mutual TLS and role-based access control at the API gateway, with a zero-trust network posture throughout.
We're using Anthropic's Claude API as the reasoning coordinator rather than training a security-specific model from scratch. It's a deliberate choice, and we've tried to be honest about the trade-offs on both sides below.
Building a training pipeline would eat the entire project timeline. Claude lets us focus on the parts that are genuinely new — the orchestration architecture, the validation loop, and the audit trail — rather than solving a problem that's already been solved.
A tool that generates working exploits is a dual-use risk. Claude's built-in refusal policies aren't perfect, but they reduce the governance burden significantly compared to starting from zero.
A fine-tuned model is locked to its training data. Switching from Sonnet 4.6 to Opus 4.7 is a single line change — no retraining, no validation cycle, and users immediately benefit from the better reasoning.
Asset metadata is pseudonymised before anything is sent to the API. Hostnames and IP addresses never appear in prompts — only sanitised descriptors do. We also have a signed DPA with Anthropic in place.
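As an illustration of what that pseudonymisation step might look like, here is a sketch using keyed HMAC tokens; the regexes and key handling are deliberately simplified.

```python
# Sketch of ingestion-time pseudonymisation: hostnames and IPs are replaced
# with keyed, deterministic tokens before any text can reach the model API.
# Regexes and key handling are simplified for illustration.
import hmac, hashlib, os, re

SECRET_KEY = os.environ.get("PSEUDONYM_KEY", "dev-only-key").encode()  # illustrative

def pseudonym(value: str, kind: str) -> str:
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:10]
    return f"{kind}-{digest}"  # stable for correlation, but not reversible

IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
HOST_RE = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)+\b", re.IGNORECASE)

def sanitise(text: str) -> str:
    text = IP_RE.sub(lambda m: pseudonym(m.group(), "ip"), text)
    return HOST_RE.sub(lambda m: pseudonym(m.group(), "host"), text)

print(sanitise("weak ssh ciphers on bastion.corp.example at 10.1.2.3"))
# -> "weak ssh ciphers on host-... at ip-..."
```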
Kafka buffers queued hypotheses so the downstream layers keep working. A circuit-breaker retries with exponential backoff and pages the on-call engineer after three consecutive failures.
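A minimal sketch of that retry-and-page behaviour, with `call_model` and `page_on_call` as hypothetical stand-ins:

```python
# Sketch of the breaker: exponential backoff between attempts, and an
# escalation hook after three consecutive failures. Names are illustrative.
import time

def call_with_breaker(call_model, payload, page_on_call, max_attempts=3):
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model(payload)
        except Exception as exc:  # in practice, catch the API's specific errors
            if attempt == max_attempts:
                page_on_call(f"model API down after {attempt} attempts: {exc}")
                raise
            time.sleep(delay)  # back off: 1s, 2s, 4s, ...
            delay *= 2
```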
We use Anthropic's five-minute prompt cache so the asset-graph context isn't re-sent with every hypothesis. Low-priority work is batched through the Batch API, which cuts costs by roughly 50 per cent.
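For illustration, this is roughly what the caching pattern looks like with the Anthropic Python SDK: the large, slow-changing asset-graph context is marked cacheable so repeat calls within the cache TTL reuse it rather than re-sending it. The model id is a placeholder.

```python
# Sketch of prompt caching with the Anthropic Python SDK. The asset-graph
# context is marked cacheable; the model id is a placeholder.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def propose(asset_graph_text: str, question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",        # placeholder model id
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": asset_graph_text,      # big context, re-sent only when the cache expires
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```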
Exit route: the coordinator sits behind a thin interface. If we ever outgrow the API on cost or privacy grounds, a self-hosted fine-tuned model slots in without touching any other layer.
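A sketch of that thin interface, with illustrative names; the point is that everything else depends only on a one-method protocol, so the hosted API and a future self-hosted model are interchangeable.

```python
# Sketch of the exit route: the rest of the system depends only on this
# Protocol, so swapping the hosted API for a self-hosted model is a
# one-class change. All names are illustrative.
from typing import Protocol

class ReasoningCoordinator(Protocol):
    def complete(self, prompt: str) -> str: ...

class AnthropicCoordinator:
    def __init__(self, client, model: str):
        self.client, self.model = client, model  # model swap = config change

    def complete(self, prompt: str) -> str:
        msg = self.client.messages.create(
            model=self.model, max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text

# A future SelfHostedCoordinator would implement the same one-method Protocol.
```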
These three scanners cover different angles: internal network topology, external exposure, and known CVE signatures. They run behind a shared abstraction layer, so adding a new scanner later is straightforward.
The Claude API powers the ReAct coordinator: it reads the asset graph, generates hypotheses as structured JSON, and updates its view after each sandbox result. Every call is logged in full for the audit trail.
Each hypothesis gets its own ephemeral container, spun up from a versioned IaC template. The network is completely isolated. If the exploit doesn't confirm in 120 seconds, the container is torn down and the result is logged as a timeout.
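A minimal sketch of that lifecycle, driving the docker CLI via subprocess; the image name, exploit entrypoint, and outcome strings are assumptions.

```python
# Sketch of the per-hypothesis sandbox: an ephemeral container with no
# network, torn down after 120 seconds. Image and entrypoint are illustrative.
import subprocess, uuid

SANDBOX_TIMEOUT_S = 120

def run_in_sandbox(image: str, exploit_args: list[str]) -> str:
    name = f"sbx-{uuid.uuid4().hex[:12]}"
    cmd = ["docker", "run", "--rm", "--name", name,
           "--network", "none", image, *exploit_args]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=SANDBOX_TIMEOUT_S)
    except subprocess.TimeoutExpired:
        # Force-remove the container as well, since killing the CLI client
        # alone would leave it running.
        subprocess.run(["docker", "rm", "-f", name], capture_output=True)
        return "timeout"  # logged as a timeout, never as a finding
    return "confirmed" if proc.returncode == 0 else "rejected"
```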
A dedicated graph store holds the asset-relationship graph: service dependencies, network adjacency, shared credential domains. DVC keeps a version history so any past scan state can be replayed exactly.
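For illustration, a sketch of how a scan snapshot could be built with networkx and versioned with DVC; the paths, node names, and commit flow are illustrative and assume an initialised git/DVC repository.

```python
# Sketch of graph snapshotting: serialise the asset graph to a file that
# DVC tracks, so each scan state is replayable. Names are illustrative.
import json, subprocess
import networkx as nx

g = nx.DiGraph()
g.add_edge("web-frontend", "payments-api", relation="depends_on")
g.add_edge("payments-api", "postgres-primary", relation="network_adjacent")

# node_link_data gives a JSON-serialisable form of the graph
with open("asset_graph.json", "w") as f:
    json.dump(nx.node_link_data(g), f)

# Version the snapshot: `dvc add` records the data, git commits the pointer.
subprocess.run(["dvc", "add", "asset_graph.json"], check=True)
subprocess.run(["git", "commit", "-am", "scan snapshot"], check=True)
```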
All inter-layer communication goes through Kafka. This means each layer scales independently, nothing is lost if a consumer is temporarily slow, and the circuit-breaker can buffer Claude API calls during outages.
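A minimal sketch of the inter-layer contract using confluent-kafka; topic names, broker address, and message fields are illustrative.

```python
# Sketch of the inter-layer contract: layers exchange JSON messages on
# Kafka topics, never call each other directly. Names are illustrative.
import json
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "kafka:9092"})
producer.produce("hypotheses", json.dumps(
    {"target": "host-3f9c2a1b0d", "technique": "weak-ssh-ciphers"}).encode())
producer.flush()

consumer = Consumer({"bootstrap.servers": "kafka:9092",
                     "group.id": "sandbox-layer",
                     "auto.offset.reset": "earliest"})
consumer.subscribe(["hypotheses"])
msg = consumer.poll(timeout=5.0)  # slow consumers pick up where they left off
if msg is not None and msg.error() is None:
    hypothesis = json.loads(msg.value())
```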
Helm charts version every deployment configuration. New releases go out as canary deployments — they handle 10% of traffic and roll back automatically if error rates exceed the threshold.
The monitoring layer tracks service latency, sandbox spin-up times, and hypothesis throughput in real time. Alerting rules flag a rising false-positive rate, high timeout counts, or a coordinator stuck in a loop.
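For illustration, a sketch of two of those metrics with prometheus_client; the metric names and provisioning stub are assumptions, and the actual thresholds would live in the alerting rules.

```python
# Sketch of instrumenting sandbox spin-up time and hypothesis outcomes.
# Metric names and the provisioning stub are illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server

SANDBOX_SPINUP = Histogram("sandbox_spinup_seconds",
                           "Time to provision an ephemeral sandbox")
HYPOTHESES = Counter("hypotheses_total",
                     "Hypotheses processed, by outcome", ["outcome"])

def provision_sandbox():
    time.sleep(0.1)  # stand-in for real container provisioning

start_http_server(9102)  # expose /metrics for the scraper

with SANDBOX_SPINUP.time():
    provision_sandbox()
HYPOTHESES.labels(outcome="confirmed").inc()
```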
Every Claude prompt and response, every tool call, every sandbox outcome — all indexed and searchable. This is the compliance store, aligned with ISO 27001 requirements.
MLflow tracks model versions, hyperparameters, and data snapshots so any past state of the system can be reproduced, and feeds directly into the canary rollback pipeline when drift is detected.
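A sketch of what one reproducibility record might look like with the MLflow API; the parameter names and values are illustrative.

```python
# Sketch of a per-scan reproducibility record: model version, key
# parameters, and the data-snapshot reference. Values are illustrative.
import mlflow

with mlflow.start_run(run_name="scan-run"):
    mlflow.log_param("coordinator_model", "claude-sonnet-4-5")  # placeholder id
    mlflow.log_param("sandbox_timeout_s", 120)
    mlflow.log_param("asset_graph_rev", "dvc:asset_graph.json@<git-sha>")
    mlflow.log_metric("confirmed_findings", 7)  # illustrative count
```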
This is the core idea. Claude generates hypotheses; the sandbox confirms or rejects them. Nothing reaches a report unless it's been proven to work. This breaks the self-referential loop that causes high false-positive rates in purely agentic tools.
Because the coordinator reads the full asset-relationship graph, it can follow lateral movement chains — something isolated pipeline stages simply can't do.
If Claude spots something unexpected halfway through, it can change course — rather than completing a predetermined sequence and only then looking at what it found.
Latency percentiles, sandbox provisioning time, hypothesis throughput, and per-service error rates — all visible on a live dashboard.
Every Claude decision, tool call, and validation outcome is indexed and queryable, aligned with ISO 27001 requirements.
The platform shuts down immediately. All containers are torn down and a human investigates before anything restarts.
The system rolls back automatically to the last validated MLflow checkpoint through the CI/CD pipeline — no manual steps needed.
Scans are paused and a human reviews the recent Claude decisions before things resume; this usually indicates prompt drift or a data issue.
All incidents go into the ISO 27001-aligned compliance log. The governance board — security, legal, and engineering — reviews these quarterly alongside any model updates.
The vulnerability classifier's training data is balanced across different technology stacks so it doesn't systematically under-detect issues on certain platforms. We run quarterly bias audits to check this is holding.
The audit log means you can follow any reported vulnerability all the way from the initial scan, through the Claude reasoning step, to the sandbox run that confirmed it. Nothing is a black box.
Low-severity work runs automatically for speed. Anything rated high or critical needs explicit analyst sign-off before it's executed or published. The quarterly governance board owns the broader oversight.
Personal data in scan logs is pseudonymised as it comes in, under the Data Protection Act 2018. Access is role-restricted, retention periods are limited to what's legally required, and deletion is automated.
The core idea is simple: LLM reasoning is powerful but unreliable on its own. By pairing Claude with a sandboxed validation step, we get the best of both worlds — flexible, context-aware analysis that only reports what it can actually prove. Nothing reaches a report without running in an isolated environment first.