How to Evaluate an AI Visibility Agency Without Getting Sold GEO Snake Oil

Every week another thread asks "which AI visibility agency should I hire?" — and fills up with agencies recommending themselves. The pitches sound identical: we'll get you into ChatGPT, we'll optimize you for AI search, we have a proprietary method. Some of those vendors do serious work. Others will charge you for a screenshot and a prayer. From the outside, the websites look the same.

There is a reliable filter, and it fits in one sentence: ask to see the audit output before you buy the optimization package. A vendor doing real work starts with measurement — which prompts, which AI providers, who got mentioned, who got recommended, which sources were cited — and can show you what that output looks like. A vendor selling snake oil starts with tactics, because tactics don't have to be measured. Everything in this guide unpacks that one test.

Why this market invites snake oil

AI visibility is a young category with a genuinely hard verification problem. AI answers vary between runs, between providers, and over time. That variance is exactly what makes a single ChatGPT screenshot worthless as proof — and exactly what makes it so useful to a dishonest seller. Run a prompt ten times, keep the one answer that mentions the client, and you have a "result" no one can easily dispute.
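The arithmetic behind that trick is worth seeing once. The sketch below is a back-of-the-envelope illustration, not a measurement: it assumes every run mentions the brand independently and with the same probability, and the 20% mention rate is a made-up figure. The function name is hypothetical.

```python
def chance_of_at_least_one_mention(mention_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs mentions the brand.

    Assumption (illustrative only): each run mentions the brand with the same
    probability, independently of the other runs.
    """
    if not 0.0 <= mention_rate <= 1.0:
        raise ValueError("mention_rate must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must not be negative")
    # Complement rule: 1 minus the chance that every single run misses.
    return 1.0 - (1.0 - mention_rate) ** runs


# Hypothetical brand that appears in only 1 of every 5 answers.
print(f"{chance_of_at_least_one_mention(0.2, 10):.0%}")  # prints 89%
```

Under those assumptions, a brand that is absent from four answers in five still yields a flattering screenshot in roughly nine out of ten ten-run attempts, which is why the number of runs matters more than any single answer.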

The buyer's defense is not becoming an expert in language models. It is insisting on the same things you'd insist on from any measurement vendor: a defined test, named instruments, raw data you can inspect, and repetition over time. Vendors who welcome that conversation are usually safe. Vendors who deflect it — "our method is proprietary," "AI moves too fast for benchmarks" — are telling you how they'll behave after the invoice.

Eight questions to ask before you sign

Use these in the first call. None of them require technical knowledge, and together they take about twenty minutes. What you're listening for isn't a perfect answer — it's whether the vendor has concrete answers at all.

1. What prompt set will you test?

The foundation of any honest engagement. A real vendor builds a prompt set from questions your buyers actually ask — category prompts ("best accounting software for restaurants"), problem prompts, comparison prompts, local prompts — and shows it to you before testing begins. If the vendor can't name the prompts, they can't measure anything, because every metric downstream is "per prompt."

2. Which providers do you cover?

ChatGPT, Perplexity, Gemini, Google AI Overviews and AI Mode, Claude where relevant. The answer matters because these systems retrieve and cite differently, and a brand can be visible in one and absent from another. "We optimize for AI" without naming providers is like an SEO agency that won't say which search engine it means.

3. Do you track mentions, recommendations, citations, sentiment, and competitors?

These are distinct outcomes. Being mentioned is not being recommended; being recommended without being cited tells you nothing about which source earned the recommendation; and a mention with negative sentiment is a problem, not a win. Competitor tracking matters because your visibility is relative — the practical question is rarely "do we appear?" but "who appears instead of us, and why?" If that framing interests you, we've written a full diagnostic on why ChatGPT recommends your competitors.

4. Do you show source URLs?

This is the single most revealing question. When an AI system cites its answer, the cited URLs show exactly which sources the recommendation is built on — a review site, a directory, a Reddit thread, a competitor's comparison page, your own site. Source URLs are what turn an audit into a work plan. A vendor who won't show them either isn't collecting them or doesn't want you to see how thin the analysis is.
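One way to picture how cited URLs become a work plan is to group them by domain and see which sources keep coming up. This is a minimal sketch using only the Python standard library; the function name and the example URLs are hypothetical, and a real audit would keep the prompt and provider alongside each URL.

```python
from collections import Counter
from urllib.parse import urlparse


def cited_domains(cited_urls: list[str]) -> Counter:
    """Count how often each domain is cited across a batch of AI answers."""
    counts: Counter = Counter()
    for url in cited_urls:
        host = urlparse(url).hostname or ""  # hostname is None for non-URLs
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one source
        if host:
            counts[host] += 1
    return counts


# Hypothetical citations collected from several runs of one prompt.
urls = [
    "https://www.example-reviews.com/best-accounting-software",
    "https://example-reviews.com/restaurant-tools",
    "https://forum.example.org/t/which-tool",
    "https://competitor.example.net/compare",
]
for domain, count in cited_domains(urls).most_common():
    print(domain, count)
```

The domains at the top of that list are the sources the recommendation leans on, so they are the first places to check whether your brand is listed, reviewed, or described accurately.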

5. Do you repeat tests over time?

One run is an anecdote. Because answers vary, credible measurement means repeated runs on a schedule, reported as rates and trends — mention rate this month versus last, citation sources gained and lost. This is also the only honest way to attribute results: if the vendor's work matters, the trend should move after the fixes ship.
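Reporting "rates and trends" can be as simple as the sketch below. It assumes each run is stored as a record with a month, a provider, and a mentioned flag; those field names, the dates, and the counts are hypothetical placeholders rather than any vendor's actual format.

```python
from collections import defaultdict


def mention_rates(runs: list[dict]) -> dict[tuple[str, str], float]:
    """Mention rate per (month, provider): mentioned runs divided by total runs."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    hits: dict[tuple[str, str], int] = defaultdict(int)
    for run in runs:
        key = (run["month"], run["provider"])
        totals[key] += 1
        if run["mentioned"]:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical data: 10 runs per month, 2 mentions before the fixes, 5 after.
runs = (
    [{"month": "2025-01", "provider": "ChatGPT", "mentioned": i < 2} for i in range(10)]
    + [{"month": "2025-02", "provider": "ChatGPT", "mentioned": i < 5} for i in range(10)]
)
rates = mention_rates(runs)
change = rates[("2025-02", "ChatGPT")] - rates[("2025-01", "ChatGPT")]
print(f"ChatGPT mention rate change: {change:+.0%}")
```

Keeping the rate separate per provider matters here: a gain on one provider can hide a loss on another if everything is averaged together.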

6. How do you separate Google AI from ChatGPT, Perplexity, and Gemini?

A competence check. Google's AI features are built on top of its ordinary Search systems. Google's own documentation, "AI features and your website," says crawlability, helpful content, and Business Profiles are what feed them, so the fixes there look like foundational SEO. ChatGPT and Perplexity lean on different retrieval and different source preferences. A vendor with one undifferentiated "AI optimization" plan for all of them hasn't looked closely at any of them.

7. What will you change on owned versus third-party sources?

Your own site is only half the battlefield. AI recommendations lean heavily on third-party validation — review platforms, directories, editorial mentions, community threads. A serious proposal distinguishes the two: on owned pages, clearer answer-shaped content, structured data, entity consistency; on third-party sources, legitimately earned listings, reviews, and mentions. A vendor whose whole plan lives on your website is ignoring the sources AI actually cites; a vendor whose whole plan is "placements" should trigger the next section.

8. What do you not promise?

The best vendors answer this quickly, because they've thought about it: no guaranteed rankings, no guaranteed mentions, no control over what any model says. What they promise instead is a process — baseline, diagnosis, fixes, re-measurement. A vendor who hesitates here, or promises everything, has just failed the interview.

The sample-report test

Every question above collapses into one request: "Send me a sample report (anonymized is fine)." Real measurement produces artifacts, and a vendor who has done this work has one within arm's reach. If the sample shows prompts, providers, rates, competitors, and source URLs, keep talking. If it's a slide deck of definitions and a screenshot, you have your answer.

Red flags that should end the meeting

Any one of these is disqualifying on its own. They aren't style differences — each is either a promise the vendor cannot keep or a tactic that can actively hurt you.

Guaranteed ChatGPT rankings. There is no ranking to guarantee. AI answers are generated per query, vary between runs, and are not controlled by any vendor. Anyone who guarantees placement is either lying about their capability or planning to cherry-pick a run that shows what they promised.
Fake Reddit or forum mentions. Some agencies openly sell "seeding" — fake accounts posting fake recommendations in communities AI systems cite. Beyond the platform bans and the reputational damage when it's spotted (and communities are good at spotting it), Google's guidance on optimizing for generative AI features explicitly warns against manufacturing inauthentic mentions of your business. You would be paying for a liability.
No source URLs. If the vendor reports "you weren't mentioned" without showing which sources the AI answers actually cited, they can't tell you what to fix — and you can't tell whether they measured anything at all.
One screenshot as proof. A single answer, one provider, one moment in time, presented as evidence of anything — before or after the engagement — is theater. Ask how many runs, which providers, and over what period.
"llms.txt will solve it." Adding an llms.txt file is cheap and harmless, but no major provider has committed to treating it as a ranking input, and Google explicitly cautions that its AI features need no special files or magic markup — they run on ordinary search signals. A vendor leading with llms.txt is selling the easiest possible deliverable, not the one you need.
Hundreds of commodity AI pages. Mass-generating thin "answer" pages is the same content spam that search engines have spent a decade demoting, rebranded for a new buyer. It dilutes your site's credibility with the exact systems you're trying to impress, and it leaves you with a cleanup bill.

The snake-oil pitch: walk away

"We guarantee top ChatGPT placement in 30 days"
Proof: one screenshot of one lucky answer
Plan: llms.txt, "AI-optimized" mass content, seeded mentions
No prompt set, no provider list, no source URLs
Method is "proprietary" whenever you ask for detail

The measurement-first pitch: worth a call

"Here's the baseline audit; the plan comes from it"
Proof: repeated runs, rates, and trends per provider
Plan: fix owned content gaps, earn real third-party sources
Shows prompts, providers, competitors, and cited URLs
Explicit about what it cannot promise

Both pitches cost about the same. Only one produces something you can verify — or fire them over.

What good deliverables look like

Flip the evaluation around: instead of judging promises, judge artifacts. An engagement that's working produces a specific paper trail, and you can ask for every piece of it in the contract.

Baseline audit — your starting mention, recommendation, and citation rates across a named prompt set and named providers.

Competitor and source map — who appears instead of you, per prompt and provider, and which domains support them.

Citation gap analysis — the sources AI relies on in your category where you're absent, weak, or misrepresented.

Fix backlog — prioritized actions, each tied to a specific gap the audit found, split into owned-site and third-party work.

Monthly measurement — the same prompt set re-run on a schedule, reported as trends, so attribution is possible.

Sample report up front — shown before you sign, so you know exactly what you're buying.

The order matters as much as the list. The baseline comes first, the plan comes from the baseline, and the measurement continues after the fixes — otherwise no one can say whether the work did anything. If you want to see what the audit itself should contain in detail, our AI visibility audit checklist walks through the full prompt-provider-source workflow — it doubles as a spec you can hand to any vendor.

When the sample reports arrive, the comparison stops being abstract. Go through them artifact by artifact — the same eight items, every time — and note what each vendor actually shows you versus what they wave at. The pattern below is what that teardown typically looks like.

Artifact to demand	Measurement-first report	Snake-oil pitch
Prompt set	✓ 20–60 prompts, listed in buyer wording	"Proprietary" — never shown
Providers tested	✓ Named, each reported separately	"AI" — no provider named
Mention / recommendation / citation rates	✓ Per prompt, per provider	One screenshot of one lucky answer
Competitor visibility	✓ Who appears instead of you, and where	— your brand only
Source URLs for every citation	✓ Every cited URL, inspectable	— none shown
Repeat-run methodology	✓ Scheduled re-runs, reported as trends	One run, one day, one model
Fix backlog	✓ Each item tied to a source gap found	Generic tactics: llms.txt, mass content
What is not promised	✓ Explicit: no guaranteed rankings	"Guaranteed #1 in ChatGPT"

Use this as a live checklist while the vendor walks you through their sample report: every checked cell in the measurement-first column is something you should be looking at on screen, and every entry in the snake-oil column is a follow-up question they should be able to answer on the spot.

Put it in the contract

Two clauses keep everyone honest: raw data access (you can see every run, including prompt, provider, date, full answer, and cited URLs, not just the summary) and portability (the prompt set and history are yours if you leave). A vendor confident in their work agrees to both without flinching.
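To make "raw data access" concrete, here is a rough sketch of what one exported run could look like. The record shape and field names are assumptions chosen to mirror the items listed above (prompt, provider, date, full answer, cited URLs), and the values are placeholders. It is not a standard format or any particular vendor's schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """One test run, kept in full rather than summarized (hypothetical schema)."""
    prompt: str
    provider: str
    run_date: str  # ISO 8601 date, e.g. "2025-01-15"
    full_answer: str
    cited_urls: list[str] = field(default_factory=list)


def export_jsonl(records: list[RunRecord]) -> str:
    """Serialize every run as one JSON object per line (JSON Lines)."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


# Placeholder values for illustration only.
record = RunRecord(
    prompt="best accounting software for restaurants",
    provider="ChatGPT",
    run_date="2025-01-15",
    full_answer="(full answer text as returned by the provider)",
    cited_urls=["https://example.com/review"],
)
print(export_jsonl([record]))
```

A plain, line-per-run export like this also covers the portability clause: if you change vendors, the prompt set and the full history travel with you in a format any tool can read.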

Running the evaluation — and where Plastorium fits

A realistic process for choosing a vendor: shortlist two or three, send each the same eight questions, and request a sample report from each. Disqualify on red flags first — one guaranteed ranking or seeded-mention offer removes a vendor regardless of everything else. Then compare the sample reports side by side: prompts, providers, metrics, source URLs, repetition. The differences will be much starker than the websites suggested.

To be transparent about our own position: Plastorium is a measurement platform, so this guide describes the standard we chose to build to — the diagnostic layer comes first. A scan tests a prompt set across providers and records mentions, recommendations, citation URLs, competitor visibility, and sentiment, with repeated runs over time. You can use that output as your independent baseline before hiring anyone, as the yardstick for whichever agency you pick, or as the audit itself. And you should apply every test in this article to us too — start by looking at a sample report before you run anything.

The vendors worth hiring will not be annoyed by any of this. Measurement-first shops love informed buyers, because the audit is where their work shows. The only sellers this guide inconveniences are the ones hoping you wouldn't ask.

FAQ

What should an AI visibility audit include?

A prompt set built from real buyer questions, the list of providers tested (ChatGPT, Perplexity, Gemini, Google AI), and — for every prompt and provider — whether your brand was mentioned, recommended, or cited, which competitors appeared, which source URLs the answer relied on, and how results changed across repeated runs. If any of those columns is missing, you have a screenshot collection, not an audit.

Can an agency guarantee that ChatGPT will recommend my business?

No. AI answers vary between runs, between providers, and over time, and no vendor controls the models' retrieval or training. A credible agency commits to a process — measure, fix the source gaps, re-measure — and reports movement in mention, recommendation, and citation rates. Anyone guaranteeing a specific ranking or placement in ChatGPT or Google AI is promising something they cannot control.

Is llms.txt enough to improve AI visibility?

No. An llms.txt file is cheap and harmless, but there is no evidence it drives recommendations on its own, and Google's guidance is explicit that its AI features rely on ordinary search systems rather than special files or markup tricks. Visibility gaps usually come from weak entity clarity, thin third-party validation, or content AI cannot cite — none of which a text file can patch.

How do I compare two AI visibility agencies?

Ask both for a sample report and compare the raw material: the prompts tested, the providers covered, the metrics tracked (mentions, recommendations, citations, sentiment, competitor visibility), the source URLs behind each answer, and whether tests are repeated over time. Then compare their promises. The vendor promising less — no guaranteed rankings, just measured movement — is usually doing more real work.

Get the diagnostic layer before you buy the optimization.