>

Blog

Idan Cohen

June 15, 2026

AI Agent Supply-Chain Malware in Instruction Files

Copied to clipboard!

Updated On

July 16, 2026

Illustration of a purple gargoyle crouched at the base of a stone path. An ornate path streams above, led by a star, signifying the gap between discovery and the fix.

TL;DR

Over the last six months, Mitiga Labs investigated AI instruction files — Cursor rules, Anthropic Skills, Claude Hooks, AGENTS.md and CLAUDE.md context files, MCP server configs, and the loose tail of supply-chain droppers and bytecode artifacts that ride alongside them. We found prompt-exfiltration tradecraft caught in the wild (full disclosure coming in a dedicated report), attacker-controlled ANTHROPIC_BASE_URL overrides routing Claude traffic through MITM proxies, permission-bypass overrides shipped as “convenience” defaults, and over 1,230 hardcoded API keys and JWT tokens across tens of services. Welcome to modern-day malware, where the target is no longer the human and their laptop but the AI agent operating it.

Today we are releasing the scanner we built during this research, free, to the public: Skillgate.

Instruction Files: What Are They?

As part of our License to Skill series, and on top of our research into ~/.claude.json MCP-server hijacking, we mapped the instruction-file surface that matters most to modern AI agents and coding assistants:

Skills — MD files that provide context and specialized instructions for a task. Used across multiple AI agents.
Claude Hooks — JSON-defined bash commands fired on specific Claude Code triggers.
Agent brain files — AGENTS.md and CLAUDE.md, instructions giving the agent a project and user overview.
MCP server configs — settings.json or mcp.json blocks that list, route, and configure MCP servers.
Agentic rules and tool configs — .cursorrules, aider config YAML, opencode, Continue, and equivalents.
Supply-chain droppers — package.json files placed in the agent’s working directory.
Poisoned bytecode — .pyc files hidden inside agent-tool directories.

The list keeps growing.

Research Objective

We have already shown that AI agents follow these files with near-zero validation and full trust. Across recent blog posts we demonstrated that the easiest delivery vector is a legitimate-looking GitHub repository with the instructions pre-staged. The strongest chain we observed combines the following:

NPM install, injection of an approved project path, the developer launching the AI inside that project, a Claude Hook firing the dropper, and a Skill invoked on a heartbeat cadence. Each Skill chains to the next, either to pull a fresh dropper or to execute the operator’s objective.

Diagram of an AI agent attack chain: NPM install, an injected project path, a Claude hook firing a dropper, and a skill running on a heartbeat cadence. — *Chain of attack — From NPM install to persistent AI agent compromise*

From there we needed a measurement loop, which led us to build two paired tools.

Skillforge and Skillgate

• Skillforge — an internal web application that forges malicious skills, built from public research, our own red-team work, and adversarial behavior mapped on MITRE ATT&CK and adjacent frameworks.

• Skillgate — the community-facing scanner that detects those techniques inside skills and other instruction files. Its rule set is fed by internal research, Skillforge crafts, public research, and incidents observed in the wild.

The two run as a closed learning loop:

Diagram of the Skillforge-to-Skillgate loop: adversarial skills are generated, scanned, and fed back into Skillgate's detection rules daily. — *The continuous learning cycle for the Skillgate scanner*

This cycle has been running on a near-daily basis for the past four to five months.

Skillgate Today

After several months of iteration, Skillgate currently provides:

Simultaneous scanning across every instruction-file type listed above: skills, hooks, agent brain files, MCP configs, agentic rules, supply-chain droppers, and bytecode artifacts.
Over 50,000 files scanned across 7,000+ distinct public repositories, drawn from GitHub, arXiv-referenced samples, and Skillforge-generated adversarial inputs.
Over 80 distinct detection rules, organized into the technique families described below — among them prompt manipulation and instruction subversion, tool and MCP server poisoning, and credential exposure and data exfiltration.
The Gator Agent, an LLM-based reviewer that runs a second pass on flagged artifacts, with a system prompt iterated against thousands of real and synthetic samples.

Technique families we cover

Direct execution and configuration abuse — anything that turns a passive instruction file into code that runs on the developer’s machine: shell commands embedded in hooks, dangerous toggles inside settings JSON, permission-escalation patterns, and supply-chain droppers staged as innocent setup scripts or package files.
Prompt manipulation and instruction subversion — the classics, evolved for the AI-agent era: prompt injection, jailbreaks, trigger phrases, instruction overrides hidden inside skills or rule files, prompt leakage, and the newer prompt-exfiltration class.
Tool and MCP server poisoning — attacks that target the agent’s tool layer rather than the agent itself, as demonstrated in this research.
Supply-chain and distribution attacks — poisoned bytecode shipped alongside benign source, external-binary indirection (a skill calling out to a binary fetched at runtime), npm post-install droppers, and configuration files that quietly add untrusted sources.
Obfuscation and evasion — classic detection-evasion: Unicode tricks, homoglyph substitutions, encoding chains, steganography.
Credential exposure and data exfiltration — hardcoded provider keys in MCP and agent configs, secret leakage through logging or telemetry-shaped instructions, file-system exfiltration patterns, and the broader class of “send this data somewhere it should not go” behaviors.

Each rule is documented in the platform and backed by at least one of: a public CVE, named research disclosure, a Skillforge-generated synthetic, or an artifact captured in the wild.‍

Read the Skillgate whitepaper, and see how it works. Whitepaper: The Hidden Attack Surface in AI Agent Instructions

What We Found in the Wild: AI Agent Supply-Chain Attacks

Prompt exfiltration: the keylogger, reborn for AI agents

The most consequential class we track turns the agent itself into the implant. A malicious instruction file tells the agent to capture the developer’s prompts — the richest signal they produce — and quietly ship them to an operator-controlled endpoint. No process injection, no binary on disk: the instruction file is the malware, and the agent is the courier.

We caught a live instance of this in the wild. We are holding the full breakdown — tradecraft, infrastructure, and attribution — for a dedicated report landing shortly. Skillgate already detects the technique class today.

Supply-chain trust is cheap to give away

Skillgate flags an instruction file that grants the agent permission to cURL an external address — a fetch-and-execute supply-chain risk. — *Skillgate detects allowed permission to cURL to an external address; URLfrom the GitHub repo*

__wf_reserved_inherit — URL from the GitHub repository

The dependency itself may be benign today, but the configuration is one compromise away from an arbitrary code-execution primitive: the agent has standing permission to fetch and execute whatever lives at that URL. The trust boundary is the URL operator, not the developer.

MCP proxy via hardcoded IP

Skillgate flags an MCP server endpoint set to a hardcoded IP address, a sign of man-in-the-middle proxying. — Skillgate detects IP address set as MCP endpoint

‍

YOLO mode and base-URL overrides

Skillgate flags a config that enables YOLO auto-approve mode and overrides ANTHROPIC_BASE_URL, routing Claude traffic through a third-party host. — *Skillgate detects pre-configured YOLO mode activation and model base-URL override*

The screenshot above pairs YOLO mode with an ANTHROPIC_BASE_URL override. Anthropic’s own endpoint is fine; the issue is that the override redirects all Claude API traffic from the developer’s machine through a third-party host, which is now positioned to log, replay, or modify every conversation. No CISO signs off on that intentionally.

Everyone Is a Developer Now

AI coding assistants made every department in every company a part-time developer, shipping code straight to main. That broadens the threat surface in ways traditional AppSec was not built for. Findings in the wild that we manually verified came from roles including:

DevOps engineers at security companies
Researchers at data-platform companies
AI engineers at web3 startups
Full-stack developers at fintech companies
Founders and CEOs of marketing platforms
PhD students in data-related programs
Project and product managers

We split the verified findings into two patterns.

Pattern 1: AI configuring infrastructure

Skillgate flags a service bound to 0.0.0.0 alongside hardcoded Postgres credentials — a publicly exposed service plus a credential pair. — Skillgate detects service binding to 0.0.0.0

Pair the two, and the compromise primitive is fully built: a publicly bound service plus a credential pair, gated only by an inbound firewall rule.

Pattern 2: AI configuring environment

Skillgate flags hardcoded provider API keys exposed in an AI agent's configuration files. — Skillgate finding of hardcoded exposed secrets

Provider categories include:

Anthropic Claude (Sonnet 4.6, Opus 4.7, Haiku 4.5)
OpenAI (GPT-4o, o3-mini, and current GPT-5 family)
Google Gemini (Pro and Flash tiers)
xAI Grok
Chinese-vendor models including MiniMax and DeepSeek
Context7, Databricks, Supabase, Vercel, YouTube, Langsmith
Google Cloud Storage signed URLs and bucket API keys
Many others

Note: Flipping a repo to private after the fact is not enough. Public mirrors and code-search indexes have already cached the commits and the embedded token. Best practice is to rotate every applicable key, audit for unauthorized use, and only then deal with repo visibility.

For full transparency: Mitiga did not access, use, or attempt to authenticate with any discovered credential. All keys were treated as in-scope only for responsible disclosure.

Impact and Blast Radius

We did not weaponize, propagate, or test any of the discovered tradecraft. Based on what we observed, the realistic impact ranking is:

Supply-chain compromise — the highest-value outcome for an adversary. A poisoned instruction file riding inside a popular agent template reaches every developer that clones it.
Unauthorized access via MITM proxying — the volume of base-URL overrides and IP-pinned MCP endpoints we found suggests this is already happening; from a successful proxy, every downstream technique is reachable.
Stolen credentials — direct financial impact (provider wallet abuse) and direct data impact (ingested data, tenant access). For many readers this will be the top concern.

These are the headline outcomes. They are also the early ones. This is the start of an era, not the steady state.

Methodology and Limits

So the numbers are interpretable:

Corpus: 50,000+ instruction files across 7,000+ distinct public repositories, scanned April–June 2026. Sourced from public GitHub repositories that contain at least one recognized instruction-file pattern, arXiv-referenced skill samples, and Skillforge-generated adversarial inputs. (Language and region distribution not measured.)
Verification: every “in the wild” finding cited above was manually triaged by the Gator Agent and a human reviewer. We do not publish raw scanner output as a finding.
False positives: Skillgate flags configurations that are risky in context. Some YOLO-mode and curl-allowlist findings are intentional choices in trusted environments. Read each flag as “this needs review”, not “this is malware”.
What we did not measure: actual exploitation. We did not honeypot, contact, or test any operator infrastructure beyond identifying the repository owner for disclosure.

How to Scan AI Skills and Instruction Files for Malware with Skillgate

If you ship instruction files in your repos, or you let your AI agents ingest them from external repos, the recommended workflow is:

Run Skillgate against every instruction-file path your team adds to source control: .cursorrules, .claude/, CLAUDE.md, AGENTS.md, .continue/, mcp.json, and any post-install scripts in the agent’s working directory.
Block the merge if Skillgate raises a finding in the “Direct Execution," “Tool and MCP Poisoning," or “Credential Exposure” families until a human reviewer signs off.
Re-scan on every model or agent upgrade. New runtime permissions ship faster than your policy can catch up.
For external repos you intend to clone, run Skillgate first and open the file in the editor second.

Skillgate is free and lives at https://skillgate.mitiga.ai. Submit a sample, browse the rule families, and tell us what we are missing. The modern-malware era is shaped by research, and we built this platform to be community-driven.

‍

FAQ

What is an AI instruction file?

An AI instruction file is any file an AI agent or coding assistant reads and follows automatically — a skill (SKILL.md), a Claude Hook, a CLAUDE.md or AGENTS.md context file, an MCP server config, or rules like .cursorrules. The agent treats it as trusted instructions, so a malicious one runs without the user opening it.

What is prompt exfiltration in an AI agent?

Prompt exfiltration is when a malicious instruction file tells an AI agent to capture the user’s prompts and send them to an attacker-controlled endpoint — a keylogger reborn for the AI-agent era. The agent follows the instruction silently, with no binary on disk and an empty audit log.

Can a Cursor or Claude skill steal data?

Yes. A poisoned skill, hook, or rule file can instruct the agent to exfiltrate prompts, read local files, or send credentials to an external server — with no prompt to the user and an empty audit log. That is why instruction files should be scanned before an agent loads them.

How do I scan AI skills and instruction files for malware?

Run them through Skillgate, Mitiga Labs’ free scanner. Paste a public GitHub URL. Skillgate detects, flags, and scores known attack techniques across skills, hooks, rules, MCP configs, and instruction files. Nothing is executed.

Is Skillgate free?

Yes. Skillgate is free at skillgate.mitiga.ai. Browsing public scans is anonymous. Submitting a skill or repo to scan takes a free account.

Launch

Skillgate is live as of today. We will keep iterating the rule set in public, keep publishing the campaigns we catch, and keep treating every instruction file as code that runs on your machine, because it does.

‍

Let them come

No one can prevent attacks – but we can prevent their impact.Our Zero‑Impact platform unifies security across cloud, SaaS, AI, and identity.

Mitiga

Don't miss these stories

What the Agentic SOC Still Gets Wrong: 15 Years In

Five lessons from 15 years of handing security calls to machines — circuit breakers, unified signal, outcome metrics — and what the agentic SOC still misses.

July 31, 2026

Agentic SOC Autonomy: Earn It One Decision at a Time

Agents that recommend and agents that act carry different blast radii. Tammy Hawkins (ex-Intuit) on earning SOC autonomy one decision class at a time.

July 20, 2026

J-Space: The Logs AI Security Has Never Had

Anthropic’s J-space shows what an AI agent is thinking before it acts. Why that’s a third telemetry layer for detection — and why labs should expose the feed.

July 13, 2026