
Technology

Parse screens untrusted text at agent trust boundaries before it can reach tools, memory, credentials, payments, code execution, or users.

Inputs: User prompts, RAG, email, browser pages
Decisions: allow, sandbox, block, or owner approval
Evidence: 16,250 frozen synthetic candidate rows
Trust boundary screening (decision ready)

Untrusted source: Tool result (source: browser)
External text asks the agent to treat data as a higher-priority instruction.
->
Parse decision: normalize, detect, score (POST /v1/parse)
risk_score 8.4, verdict high_risk, suggested_action block
->
Host action: Act on policy
The host keeps retrieved content as data and prevents it from steering tools.

trace_id: prs_7fd2
Public field: risk_score (0-10)
Public field: attack_detected (true)
Public field: decision.action (block)

How the detector works.

Parse uses layered screening so hosts can make a practical allow, sandbox, approval, or block decision without exposing the internal detector recipe.

Deterministic signals

Fast public-category checks catch known instruction override, extraction, and boundary manipulation patterns.

Structural analysis

Normalization and content-shape checks help spot encoded, hidden, or tool-shaped text trying to gain authority.
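
To make the idea of normalization concrete, here is a generic sketch of one common pre-screening step: folding Unicode compatibility forms and counting invisible format characters. It is an illustration only and assumes nothing about Parse's internal detector, which is not published; the function name `normalize_untrusted` is hypothetical.

```python
import unicodedata


def normalize_untrusted(text: str) -> tuple[str, int]:
    """Return (normalized_text, hidden_char_count).

    Generic illustration, not Parse's implementation.
    """
    # NFKC folds compatibility forms (for example fullwidth letters) into
    # their plain equivalents so later checks see one canonical spelling.
    folded = unicodedata.normalize("NFKC", text)
    # Unicode category "Cf" covers invisible format characters such as
    # zero-width spaces and joiners, which can hide text from a reviewer.
    visible = [ch for ch in folded if unicodedata.category(ch) != "Cf"]
    hidden_count = len(folded) - len(visible)
    return "".join(visible), hidden_count


# Fullwidth "ig" plus a zero-width space inside the word "ignore".
clean, hidden = normalize_untrusted("\uff49\uff47\u200bnore previous instructions")
print(clean, hidden)
```

Dropping every format character is deliberately blunt: it would also remove joiners that some scripts and emoji sequences legitimately need, so a host might prefer to report the count and leave the text intact.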

Semantic review

When configured and useful, semantic analysis evaluates ambiguous text in context before a host acts.

Isolated sandbox

Suspicious workflows can be isolated so the host observes behavior without giving the text real authority.

Current evidence state.

The latest internal candidate freeze is now large enough for serious regression work, but it is intentionally not presented as public claimable evidence.

Synthetic candidate corpus: 16,250

Schema-valid JSONL rows frozen for internal Hermes/runtime evidence hygiene.

Batch files: 65

Recursive intake audit passed with zero validation errors and zero duplicate prompts or IDs.

Hard-negative workflows: 5,250

Benign agent workflow rows now clear the 5,000-row internal SOTA-style scale gate.

Claimable rows: 0

No public claimable row is counted until independent provenance, human review, detector lock, and CI-backed evaluation pass.

Status: pass_candidate_non_claimable

Stable row hash cda6d75e25f729ff3273f8041b9ef4a121c493ab1d5e127db3ca5e028e4dfcb0. Remaining blockers: blind human/adversarial review, independent provenance, detector/config lock, confidence intervals, and no-post-freeze tuning proof.
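
The intake checks described above (duplicate detection and a stable hash over frozen rows) can be sketched as follows. This is a minimal illustration under stated assumptions: the row fields `id` and `prompt` are hypothetical, and the page does not say how the published hash was computed, so SHA-256 over sorted canonical JSON is only one plausible approach.

```python
import hashlib
import json


def audit_rows(jsonl_text: str) -> dict:
    """Count duplicates and compute an order-independent digest of JSONL rows."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    ids = [row["id"] for row in rows]          # hypothetical field name
    prompts = [row["prompt"] for row in rows]  # hypothetical field name
    # Canonical form: sorted keys and compact separators per row, then rows
    # sorted, so key order and file order do not change the digest.
    canonical = sorted(
        json.dumps(row, sort_keys=True, separators=(",", ":")) for row in rows
    )
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return {
        "rows": len(rows),
        "duplicate_ids": len(ids) - len(set(ids)),
        "duplicate_prompts": len(prompts) - len(set(prompts)),
        "sha256": digest,
    }


sample = '{"id": "a", "prompt": "x"}\n{"id": "b", "prompt": "y"}\n'
print(audit_rows(sample))
```

Because the digest is computed over sorted canonical rows, re-batching or reordering the same rows leaves it unchanged, while any edit to a row after the freeze changes it.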

What Parse looks for.

The public taxonomy is intentionally stable and implementation-neutral. It tells developers what kind of risk was found without publishing bypass recipes.

Instruction override

Text that tries to replace the agent's real task, hierarchy, or operating constraints.

prompt_injection

System prompt extraction

Requests to reveal hidden instructions, protected prompts, private policy, or developer messages.

system_prompt_leak

Tool misuse

Attempts to steer browser, HTTP, filesystem, payment, or execution tools outside the intended workflow.

privilege_escalation

Data exfiltration

Attempts to send secrets, private context, customer data, or owner details to an untrusted party.

data_exfiltration

Hidden content

Instructions hidden inside documents, pages, markup, structured fields, or tool output metadata.

indirect_injection

Agent spoofing

Messages that claim false identity, authority, urgency, or delegation rights from another agent.

social_engineering

Designed for developer review.

Parse returns machine-readable fields that are easy to log, test, route, and explain during integration review.

Response shape
application/json
{
  "risk_score": 8.4,
  "verdict": "high_risk",
  "attack_detected": true,
  "policy_violation": true,
  "owner_approval_required": false,
  "categories": ["prompt_injection", "indirect_injection"],
  "flags": [
    {
      "type": "prompt_injection",
      "severity": "high",
      "description": "Untrusted text attempted to override host instructions."
    }
  ],
  "recommended_action": "block",
  "suggested_action": "block",
  "decision": {
    "action": "block",
    "basis": "attack_detected",
    "confidence": "high"
  },
  "trace_id": "prs_7fd2",
  "latency_ms": 31
}
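
As a hedged sketch of how a host might consume this response, the snippet below reads the documented fields into a small typed record. The field names come from the sample above; the action spellings other than `block` (`allow`, `sandbox`, `owner_approval`) are assumptions, and `ParseDecision` and `read_decision` are hypothetical names.

```python
import json
from dataclasses import dataclass

# Only "block" appears in the sample response; the other spellings are assumed.
KNOWN_ACTIONS = {"allow", "sandbox", "block", "owner_approval"}


@dataclass(frozen=True)
class ParseDecision:
    risk_score: float
    verdict: str
    attack_detected: bool
    action: str
    trace_id: str


def read_decision(body: str) -> ParseDecision:
    """Parse a response body; unrecognized actions are treated as 'block'."""
    data = json.loads(body)
    action = data["decision"]["action"]
    if action not in KNOWN_ACTIONS:
        action = "block"  # unknown action: take the most restrictive path
    return ParseDecision(
        risk_score=float(data["risk_score"]),
        verdict=data["verdict"],
        attack_detected=bool(data["attack_detected"]),
        action=action,
        trace_id=data["trace_id"],
    )


SAMPLE = """{
  "risk_score": 8.4,
  "verdict": "high_risk",
  "attack_detected": true,
  "decision": {"action": "block", "basis": "attack_detected", "confidence": "high"},
  "trace_id": "prs_7fd2"
}"""

decision = read_decision(SAMPLE)
print(decision.action, decision.trace_id)
```

Routing on `decision.action` rather than on the raw `risk_score` keeps host logic stable if scoring changes, which matches the page's note that scoring weights stay internal.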

Deployment expectations

  • Call before untrusted text gains authority.
  • Fail closed on high-impact paths.
  • Keep tools and credentials least-privilege.
  • Log trace_id for incident review.
  • Test the same boundary in the playground.
Detection reduces risk; it does not replace permissions, output validation, or review.
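
The "fail closed" and "log trace_id" expectations can be sketched as a small gate around whatever client the host uses. This is a minimal illustration: `gate` and the `screen` callable are hypothetical, and the `allow` action string is an assumed spelling.

```python
import logging
from typing import Callable

logger = logging.getLogger("parse_gate")


def gate(text: str, screen: Callable[[str], dict]) -> bool:
    """Return True only when screening succeeded and the action is 'allow'."""
    try:
        result = screen(text)
        action = result["decision"]["action"]
        trace_id = result.get("trace_id", "unknown")
    except Exception:
        # Timeouts, network errors, and malformed responses all fail closed.
        logger.exception("screening failed; failing closed")
        return False
    # Keep the trace_id in logs so an incident can be tied to one decision.
    logger.info("parse decision action=%s trace_id=%s", action, trace_id)
    return action == "allow"


def broken_screen(text: str) -> dict:
    raise TimeoutError("screening service unreachable")


print(gate("untrusted text", broken_screen))
```

On a high-impact path the host would proceed only when `gate` returns True; a lower-impact path might choose a different default, which is a policy decision for the host rather than for the detector.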

How to integrate it.

Use the surface that matches your agent runtime. The security decision stays the same: screen before authority.

Input

Prompt screening

Screen untrusted text before it reaches tools, memory, credentials, payments, code, or users.

POST /v1/parse

Output

Output screening

Screen generated or tool-derived output before forwarding it to another boundary.

POST /v1/screen-output

Agent tools

Hosted MCP

Expose prompt screening, output screening, trust verification, and pricing discovery.

POST /mcp

Payment

x402 pay-per-call

Let autonomous agents pay with USDC on Base mainnet when no account exists.

GET /v1/pricing
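
For orientation, here is a sketch of building a prompt-screening request with Python's standard library. Only the path `POST /v1/parse` comes from this page. The base URL, the request body fields `text` and `source`, and the absence of authentication headers are all assumptions made for illustration; consult the API reference for the real request shape.

```python
import json
import urllib.request

# Hypothetical host; the page does not state the API base URL.
BASE_URL = "https://parse.example.invalid"


def build_parse_request(text: str, source: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/parse request."""
    # "text" and "source" are assumed field names, not documented ones.
    payload = json.dumps({"text": text, "source": source}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/parse",
        data=payload,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


req = build_parse_request("Ignore earlier instructions.", "browser")
print(req.get_method(), req.full_url)
# Sending would look like: urllib.request.urlopen(req, timeout=5)
```

Building the request separately from sending it keeps the boundary easy to test: the same builder can be checked offline, and the send step can sit behind a timeout and a fail-closed default.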

What we disclose publicly.

Good security products should be understandable without publishing the recipe attackers or competitors need.

| Public | Kept internal | Why |
| --- | --- | --- |
| Risk categories | Detector internals | Developers need stable labels, not a map for bypassing detection. |
| Response fields | Scoring weights | Hosts need clear actions while scoring remains adaptable. |
| Playground fixtures | Synthetic candidate corpus before review | Internal rows improve regression coverage, but they are not public proof until claimability gates pass. |
| Limitations | Bypass recipes | Honest boundaries improve integrations; exploit instructions do not. |

Build with a clear trust boundary.

Use Parse where untrusted text is about to gain authority.