Expertly.so · LLM Evaluation Report

Web3 Community Strategist
Blind Scoring Report

Prompt #02 — Bot vs. real user diagnosis · Models: A · B · C · D · Rubric: 6-dimension weighted scoring
"Our Telegram group grew from 2k to 11k members in 10 days after we ran a Twitter airdrop campaign. Engagement is still flat. How do I diagnose whether these are real users or bots, and what do I do either way?"
Model A · 100/100 · Rank #1 · Expert tier
Model B · 78/100 · Rank #2 · Solid
Model D · 73/100 · Rank #3 · Good framing
Model C · 53/100 · Rank #4 · Below standard
Score = Σ(dimension score × weight) / 4 × 100  ·  Dimensions scored 1–4  ·  Weights: Domain 25%, Tactical 25%, Ecosystem 20%, Nuance 15%, Clarity 10%, Hallucination 5%
Dimension | Weight | A | B | C | D
Web3 domain specificity (correct ecosystem tool/terminology usage) | 25% | 4 | 3 | 2 | 2
Tactical accuracy (actionable, sequenced, realistic, failure modes named) | 25% | 4 | 3 | 2 | 3
Ecosystem awareness (current norms, project knowledge, post-2021 accuracy) | 20% | 4 | 3 | 2 | 3
Nuance & intellectual honesty (challenges premises, names hard truths, balanced) | 15% | 4 | 3 | 2 | 4
Structural clarity (scannable, prioritized, no padding) | 10% | 4 | 4 | 3 | 3
Avoiding hallucination (real tools, sourced claims, no invented stats) | 5% | 4 | 3 | 2 | 3
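The published totals follow directly from the formula and the table above. A minimal Python sketch to reproduce them (the dictionary and function names are mine; the per-dimension scores are copied from the rubric table):

```python
# Reproduce the report's scores: Score = sum(dim score x weight) / 4 x 100.
# Weights are kept in percent so the arithmetic stays exact in integers.
WEIGHTS = {
    "domain": 25, "tactical": 25, "ecosystem": 20,
    "nuance": 15, "clarity": 10, "hallucination": 5,
}

# 1-4 per dimension, copied from the rubric table above.
DIMENSION_SCORES = {
    "A": dict(domain=4, tactical=4, ecosystem=4, nuance=4, clarity=4, hallucination=4),
    "B": dict(domain=3, tactical=3, ecosystem=3, nuance=3, clarity=4, hallucination=3),
    "C": dict(domain=2, tactical=2, ecosystem=2, nuance=2, clarity=3, hallucination=2),
    "D": dict(domain=2, tactical=3, ecosystem=3, nuance=4, clarity=3, hallucination=3),
}

def weighted_score(dims):
    # Weighted sum is out of 400 (max score 4 x total weight 100);
    # dividing by 4 maps it to 0-100. (raw + 2) // 4 rounds half up.
    raw = sum(dims[d] * w for d, w in WEIGHTS.items())
    return (raw + 2) // 4

for model, dims in DIMENSION_SCORES.items():
    print(model, weighted_score(dims))
# A 100, B 78, C 53, D 73
```

Half-up rounding is assumed because Python's built-in `round()` uses banker's rounding (`round(52.5) == 52`), which would not reproduce the published 53 and 73.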
Model A · 100/100 · Rank #1 · Expert tier
Strengths
  • Only response to cite actual sources with real URLs (an arXiv paper and crypto.com/research) — rare and impressive for this context
  • Correctly names @getidsbot, Telethon, Botometer, SparkToro, Combot, Rose Bot — all accurate Telegram/web3 tooling
  • Join-timing clustering check (60% joining in a 2–4hr window) is a precise, practitioner-level diagnostic most generalists would miss
  • Correctly flags same-source wallet funding as a Sybil indicator — on-chain forensics awareness
  • Friend.tech case study used appropriately and accurately to illustrate airdrop retention failure
  • Sharp reframe: "flat engagement is a distribution problem, not an activation problem" — this is the correct mental model and most models don't get there
  • "Don't announce the purge" is a specific, non-obvious tactical detail that separates practitioner knowledge from generic advice
  • Soft wallet-ask filter ("drop your wallet for the whitelist") as a passive bot filter is clever and accurate
Weaknesses / gaps
  • The "0.5–2+ messages per member per week" benchmark could not be verified against a canonical source — likely a reasonable heuristic but presented as fact
  • Response is lengthy; could be tighter in the mixed-scenario section
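Model A's join-timing diagnostic (a large share of members joining inside a narrow window) can be sketched as a sliding-window scan over join timestamps. A minimal illustration; the function names, the 3-hour default, and the 60% threshold are assumptions drawn from Model A's described heuristic, not from any standard library:

```python
from datetime import datetime, timedelta

def max_window_share(join_times, window=timedelta(hours=3)):
    """Largest fraction of joins that fall inside any single sliding window.

    join_times: iterable of datetimes, one per new member.
    """
    times = sorted(join_times)
    best, start = 0, 0
    for end in range(len(times)):
        # Advance the window start until the span fits inside `window`.
        while times[end] - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best / len(times) if times else 0.0

def looks_like_bot_wave(join_times, threshold=0.6):
    # Model A's signal: ~60%+ of joins clustered in a 2-4 hour window.
    return max_window_share(join_times) >= threshold
```

Fed join timestamps exported from the group (e.g. via Telethon), organic growth spreads joins out over days, while an airdrop-driven bot wave concentrates them into one spike.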
Model B · 78/100 · Rank #2 · Solid
Strengths
  • Clean structure with well-labeled diagnosis vs. action sections
  • Correctly identifies @ComBot, Rose Bot, @Captcha_bot, @GroupHelpBot — accurate Telegram tooling
  • "Real but disengaged (the worse diagnosis, honestly)" is a sharp, honest observation most responses gloss over
  • 5–10% engagement as a realistic post-airdrop benchmark is defensible and appropriately modest
  • Acknowledges that "accept the churn" is fine — doesn't try to save everyone, which is honest
  • Mentions Dune Analytics for cross-referencing wallet activity — correct tool choice
Weaknesses / gaps
  • Questionable advice: "Don't panic and don't delete them yet. A large group number isn't worthless — it signals legitimacy to new organic visitors." In web3 this is often wrong: sophisticated builders and investors know to check for bot-inflated metrics, and a fake-looking community can actively damage credibility
  • No specific project case studies or ecosystem examples to ground the advice
  • Less depth on on-chain Sybil detection compared to Model A
  • No sources cited
Model D · 73/100 · Rank #3 · Good framing, thin on tooling
Strengths
  • Best conceptual framing of any response: "You didn't build a community — you built a distribution funnel" is the sharpest articulation of the underlying problem in the set
  • Correctly identifies Sybil behavior, wallet reuse, and Twitter-to-TG funnel mismatch as airdrop-specific signals
  • 7-day reset plan is concrete and actionable — the best structured "what to do now" section
  • Scores highest on nuance: correctly frames the three distinct scenarios (passive real / mixed / mostly bots) and gives differentiated advice for each
  • "Airdrops optimize for speed, volume, extraction — community optimizes for identity, participation, ongoing value" is a genuinely useful frame
Weaknesses / gaps
  • Major gap: Does not name a single specific tool in the entire response — no Combot, no Rose Bot, no Dune, nothing. For a web3 community strategist, tooling knowledge is table stakes
  • Over-nested bullet structure in the middle sections reduces scannability
  • No project examples or ecosystem case studies to support the advice
  • Strong on strategy, noticeably weak on execution-level specifics
Model C · 53/100 · Rank #4 · Below standard
Strengths
  • Well formatted — clear phases, readable visual hierarchy
  • Correctly identifies "Alphabet Soup" username patterns and join clustering as bot signals
  • "Hard Truth" closing is directionally correct — focuses on the original 2k members as the real asset
  • Correctly recommends Rose Bot and Combot for Telegram
Weaknesses / gaps
  • Factual error: recommends MEE6 for Telegram. MEE6 is a Discord bot and does not work on Telegram. This cross-platform hallucination would immediately mark the response as unreliable to a practitioner
  • "10–20% of members viewing posts within 24 hours is healthy" — this is suspiciously high for groups at scale and appears to be an invented benchmark
  • Meme contest as "Proof of Humanity" is a reasonable idea but generic — not web3-native thinking
  • Ends with a clarifying question ("How did you structure the airdrop?") which deflects rather than demonstrates expertise
  • No on-chain diagnostics mentioned at all — a significant blind spot for a web3 strategist
  • Advice reads more like a general Telegram community playbook than a web3-specific response
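The "Alphabet Soup" username signal that Model C gets right can be approximated with simple pattern checks. A rough sketch; the regexes and thresholds here are illustrative guesses, not taken from any graded response, and real screening would combine this with join-timing and profile checks:

```python
import re

# Heuristic patterns for machine-generated handles. Thresholds are illustrative.
RANDOM_TAIL = re.compile(r"^[a-z]{2,6}\d{5,}$", re.IGNORECASE)  # short stem + long digit run
NO_VOWELS = re.compile(r"^[^aeiou]{6,}$", re.IGNORECASE)        # long vowel-free string

def suspicious_username(handle: str) -> bool:
    h = handle.lstrip("@")
    return bool(RANDOM_TAIL.match(h) or NO_VOWELS.match(h))

print(suspicious_username("@xkq83jwm2"))    # True  (no vowels)
print(suspicious_username("user84920134"))  # True  (stem + digit run)
print(suspicious_username("alice_eth"))     # False
```

Checks like this produce false positives on legitimate consonant-heavy handles, so they are a triage filter, not a verdict.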

Overall assessment

Model A is in a class of its own on this prompt. The combination of accurate Telegram/web3 tooling, on-chain Sybil detection knowledge, a real Friend.tech case study, footnoted sources, and the sharp "distribution problem vs. activation problem" reframe demonstrates genuine web3 community practitioner expertise — not surface-level pattern matching. The gap between A and B is larger than the rubric scores alone suggest because the footnotes are a qualitative signal of epistemic honesty that goes beyond what the scoring captures.

Model B is solid and would give usable, mostly accurate advice. Its one notable flaw — implying that a bot-inflated member count "signals legitimacy" to visitors — is an outdated take. Web3 builders and investors are increasingly aware of fake metrics, and this advice could cause real reputational damage if followed.

Model D has the best strategic framing of the set (distribution funnel vs. community) and the clearest "what to do next week" plan, but its complete absence of tool recommendations is a meaningful gap. A web3 community strategist who can't name Combot or Dune Analytics is offering consulting without the craft layer. Strong on insight, noticeably weak on execution specifics.

Model C contains a factual error that disqualifies it for production use: recommending MEE6, a Discord-only bot, for Telegram. A practitioner reading this would immediately lose trust. The rest of the response is generic community-management advice dressed up in web3 terminology. The invented 10–20% view-rate benchmark and the deflecting closing question compound the problem.

Graded blind. Scores reflect only the quality of the response against the rubric dimensions, independent of which platform or system produced them.