Expertly.so · LLM Evaluation Report

Web3 Community Strategist
Blind Scoring Report

Prompt #02 — Bot vs. real user diagnosis · Models: A · B · C · D · Rubric: 6-dimension weighted scoring
"Our Telegram group grew from 2k to 11k members in 10 days after we ran a Twitter airdrop campaign. Engagement is still flat. How do I diagnose whether these are real users or bots, and what do I do either way?"
Model A · 100/100 · Rank #1 · Expert tier
Model B · 78/100 · Rank #2 · Solid
Model D · 73/100 · Rank #3 · Good framing
Model C · 53/100 · Rank #4 · Below standard
Score = Σ(dimension score × weight) / 4 × 100  ·  Dimensions scored 1–4  ·  Weights: Domain 25%, Tactical 25%, Ecosystem 20%, Nuance 15%, Clarity 10%, Hallucination 5%
Dimension | Weight | A | B | C | D
Web3 domain specificity (correct ecosystem tool/terminology usage) | 25% | 4 | 3 | 2 | 2
Tactical accuracy (actionable, sequenced, realistic, failure modes named) | 25% | 4 | 3 | 2 | 3
Ecosystem awareness (current norms, project knowledge, post-2021 accuracy) | 20% | 4 | 3 | 2 | 3
Nuance & intellectual honesty (challenges premises, names hard truths, balanced) | 15% | 4 | 3 | 2 | 4
Structural clarity (scannable, prioritized, no padding) | 10% | 4 | 4 | 3 | 3
Avoiding hallucination (real tools, sourced claims, no invented stats) | 5% | 4 | 3 | 2 | 3
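The published totals follow directly from the formula and the table above. A minimal Python sketch to reproduce them (the dictionary and function names are mine; the per-dimension scores are copied from the rubric table):

```python
# Reproduce the report's scores: Score = sum(dim score x weight) / 4 x 100.
# Weights are kept in percent so the arithmetic stays exact in integers.
WEIGHTS = {
    "domain": 25, "tactical": 25, "ecosystem": 20,
    "nuance": 15, "clarity": 10, "hallucination": 5,
}

# 1-4 per dimension, copied from the rubric table above.
DIMENSION_SCORES = {
    "A": dict(domain=4, tactical=4, ecosystem=4, nuance=4, clarity=4, hallucination=4),
    "B": dict(domain=3, tactical=3, ecosystem=3, nuance=3, clarity=4, hallucination=3),
    "C": dict(domain=2, tactical=2, ecosystem=2, nuance=2, clarity=3, hallucination=2),
    "D": dict(domain=2, tactical=3, ecosystem=3, nuance=4, clarity=3, hallucination=3),
}

def weighted_score(dims):
    # Weighted sum is out of 400 (max score 4 x total weight 100);
    # dividing by 4 maps it to 0-100. (raw + 2) // 4 rounds half up.
    raw = sum(dims[d] * w for d, w in WEIGHTS.items())
    return (raw + 2) // 4

for model, dims in DIMENSION_SCORES.items():
    print(model, weighted_score(dims))
# A 100, B 78, C 53, D 73
```

Half-up rounding is assumed because Python's built-in `round()` uses banker's rounding (`round(52.5) == 52`), which would not reproduce the published 53 and 73.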
Model A · 100/100 · Rank #1 · Expert tier
Strengths
  • Only response to cite actual sources with real URLs (an arXiv paper and crypto.com/research) — rare and impressive for this context
  • Correctly names @getidsbot, Telethon, Botometer, SparkToro, Combot, Rose Bot — all accurate Telegram/web3 tooling
  • Join-timing clustering check (60% joining in a 2–4hr window) is a precise, practitioner-level diagnostic most generalists would miss
  • Correctly flags same-source wallet funding as a Sybil indicator — on-chain forensics awareness
  • Friend.tech case study used appropriately and accurately to illustrate airdrop retention failure
  • Sharp reframe: "flat engagement is a distribution problem, not an activation problem" — this is the correct mental model and most models don't get there
  • "Don't announce the purge" is a specific, non-obvious tactical detail that separates practitioner knowledge from generic advice
  • Soft wallet-ask filter ("drop your wallet for the whitelist") as a passive bot filter is clever and accurate
Weaknesses / gaps
  • The "0.5–2+ messages per member per week" benchmark could not be verified against a canonical source — likely a reasonable heuristic but presented as fact
  • Response is lengthy; could be tighter in the mixed-scenario section
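Model A's join-timing diagnostic (a large share of members joining inside a narrow window) can be sketched as a sliding-window scan over join timestamps. A minimal illustration; the function names, the 3-hour default, and the 60% threshold are assumptions drawn from Model A's described heuristic, not from any standard library:

```python
from datetime import datetime, timedelta

def max_window_share(join_times, window=timedelta(hours=3)):
    """Largest fraction of joins that fall inside any single sliding window.

    join_times: iterable of datetimes, one per new member.
    """
    times = sorted(join_times)
    best, start = 0, 0
    for end in range(len(times)):
        # Advance the window start until the span fits inside `window`.
        while times[end] - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best / len(times) if times else 0.0

def looks_like_bot_wave(join_times, threshold=0.6):
    # Model A's signal: ~60%+ of joins clustered in a 2-4 hour window.
    return max_window_share(join_times) >= threshold
```

Fed join timestamps exported from the group (e.g. via Telethon), organic growth spreads joins out over days, while an airdrop-driven bot wave concentrates them into one spike.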
Model B · 78/100 · Rank #2 · Solid
Strengths
  • Clean structure with well-labeled diagnosis vs. action sections
  • Correctly identifies @ComBot, Rose Bot, @Captcha_bot, @GroupHelpBot — accurate Telegram tooling
  • "Real but disengaged (the worse diagnosis, honestly)" is a sharp, honest observation most responses gloss over
  • 5–10% engagement as a realistic post-airdrop benchmark is defensible and appropriately modest
  • Acknowledges that "accept the churn" is fine — doesn't try to save everyone, which is honest
  • Mentions Dune Analytics for cross-referencing wallet activity — correct tool choice
Weaknesses / gaps
  • Questionable advice: "Don't panic and don't delete them yet. A large group number isn't worthless — it signals legitimacy to new organic visitors." In web3 this is often wrong: sophisticated builders and investors know to check for bot-inflated metrics, and a fake-looking community can actively damage credibility
  • No specific project case studies or ecosystem examples to ground the advice
  • Less depth on on-chain Sybil detection compared to Model A
  • No sources cited
Model D · 73/100 · Rank #3 · Good framing, thin on tooling
Strengths
  • Best conceptual framing of any response: "You didn't build a community — you built a distribution funnel" is the sharpest articulation of the underlying problem in the set
  • Correctly identifies Sybil behavior, wallet reuse, and Twitter-to-TG funnel mismatch as airdrop-specific signals
  • 7-day reset plan is concrete and actionable — the best structured "what to do now" section
  • Scores highest on nuance: correctly frames the three distinct scenarios (passive real / mixed / mostly bots) and gives differentiated advice for each
  • "Airdrops optimize for speed, volume, extraction — community optimizes for identity, participation, ongoing value" is a genuinely useful frame
Weaknesses / gaps
  • Major gap: Does not name a single specific tool in the entire response — no Combot, no Rose Bot, no Dune, nothing. For a web3 community strategist, tooling knowledge is table stakes
  • Over-nested bullet structure in the middle sections reduces scannability
  • No project examples or ecosystem case studies to support the advice
  • Strong on strategy, noticeably weak on execution-level specifics
Model C · 53/100 · Rank #4 · Below standard
Strengths
  • Well formatted — clear phases, readable visual hierarchy
  • Correctly identifies "Alphabet Soup" username patterns and join clustering as bot signals
  • "Hard Truth" closing is directionally correct — focuses on the original 2k members as the real asset
  • Correctly recommends Rose Bot and Combot for Telegram
Weaknesses / gaps
  • Factual error: recommends MEE6 for Telegram. MEE6 is a Discord bot and does not work on Telegram. This cross-platform hallucination would immediately mark the response as unreliable to a practitioner
  • "10–20% of members viewing posts within 24 hours is healthy" — this is suspiciously high for groups at scale and appears to be an invented benchmark
  • Meme contest as "Proof of Humanity" is a reasonable idea but generic — not web3-native thinking
  • Ends with a clarifying question ("How did you structure the airdrop?") which deflects rather than demonstrates expertise
  • No on-chain diagnostics mentioned at all — a significant blind spot for a web3 strategist
  • Advice reads more like a general Telegram community playbook than a web3-specific response
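The "Alphabet Soup" username signal that Model C gets right can be approximated with simple pattern checks. A rough sketch; the regexes and thresholds here are illustrative guesses, not taken from any graded response, and real screening would combine this with join-timing and profile checks:

```python
import re

# Heuristic patterns for machine-generated handles. Thresholds are illustrative.
RANDOM_TAIL = re.compile(r"^[a-z]{2,6}\d{5,}$", re.IGNORECASE)  # short stem + long digit run
NO_VOWELS = re.compile(r"^[^aeiou]{6,}$", re.IGNORECASE)        # long vowel-free string

def suspicious_username(handle: str) -> bool:
    h = handle.lstrip("@")
    return bool(RANDOM_TAIL.match(h) or NO_VOWELS.match(h))

print(suspicious_username("@xkq83jwm2"))    # True  (no vowels)
print(suspicious_username("user84920134"))  # True  (stem + digit run)
print(suspicious_username("alice_eth"))     # False
```

Checks like this produce false positives on legitimate consonant-heavy handles, so they are a triage filter, not a verdict.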

Overall assessment

Model A is in a class of its own on this prompt. The combination of accurate Telegram/web3 tooling, on-chain Sybil detection knowledge, a real Friend.tech case study, footnoted sources, and the sharp "distribution problem vs. activation problem" reframe demonstrates genuine web3 community practitioner expertise — not surface-level pattern matching. The gap between A and B is larger than the rubric scores alone suggest because the footnotes are a qualitative signal of epistemic honesty that goes beyond what the scoring captures.

Model B is solid and would give usable, mostly accurate advice. Its one notable flaw — implying that a bot-inflated member count "signals legitimacy" to visitors — is an outdated take. Web3 builders and investors are increasingly aware of fake metrics, and this advice could cause real reputational damage if followed.

Model D has the best strategic framing of the set (distribution funnel vs. community) and the clearest "what to do next week" plan, but its complete absence of tool recommendations is a meaningful gap. A web3 community strategist who can't name Combot or Dune Analytics is offering consulting without the craft layer. Strong on insight, noticeably weak on execution specifics.

Model C contains a factual error that disqualifies it for production use: recommending MEE6, a Discord-only bot, for Telegram. A practitioner reading this would immediately lose trust. The rest of the response is generic community-management advice dressed up in web3 terminology. The invented 10–20% view-rate benchmark and the deflecting closing question compound the problem.

Graded blind. Scores reflect only the quality of the response against the rubric dimensions, independent of which platform or system produced them.