June 17, 2026 · 6 min read

I Asked the Leading AI Chatbots to Grade an Auction Catalog. Here's What Went Wrong.

The idea was obvious, which is exactly why I tried it. A catalog had something like 400 lots, more than I was going to grade by hand, so I reached for the tools I already use every day.

I should be upfront about where I stand on this, because it shapes the whole story. I'm not an AI skeptic. I pay for the top tiers of the leading AI assistants, I use them daily, and BullionBidder itself uses AI for one specific job inside it. So when I had 400 lots to grade, my first instinct wasn't to distrust the chatbots. It was to reach for them, because I respect what they're good at. I dropped the catalog in and asked the natural question: which of these bullion lots are actually below retail and worth bidding on?

They answered in seconds. Clean, organized, confident. A tidy list of "deals," each with a weight, a purity, a verdict.

The answers were wrong. Not all of them, which is the trap, and not obviously, which is the bigger trap. They were confidently wrong, and confidently wrong is the only kind that costs you money, because it's the only kind you act on.

What actually went wrong

The failures came in three flavors, and once I started checking by hand I couldn't unsee them.

The AI invented weights. A lot description that didn't actually state a clean troy-ounce figure would come back with one anyway: a specific number nothing in the text supported, sitting there looking like a fact.

It misclassified metals. A multi-item commemorative set, a plated piece, or a "silver-colored" novelty would get read as if it were solid bullion at full silver weight. If the math thinks there's more pure metal in a lot than there is, every number downstream of it is wrong.

It got purities wrong. The kind of slip where a .925 sterling piece gets treated as .999 fine, which quietly inflates the metal value of the lot by the difference, and the difference is the whole margin you're bidding on.
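To put a size on that slip, here's a minimal sketch of the arithmetic. The gross weight and the spot price are made-up illustration values, not a real lot or a real quote, and `pure_ounces` is just a name I picked for the example.

```python
from decimal import Decimal


def pure_ounces(gross_troy_oz: Decimal, purity: Decimal) -> Decimal:
    """Pure metal content in troy ounces: gross weight times fineness."""
    return gross_troy_oz * purity


# Hypothetical inputs for illustration only.
gross = Decimal("10")      # assumed gross weight of the lot, troy oz
spot = Decimal("30.00")    # assumed silver spot price, USD per troy oz

actual = pure_ounces(gross, Decimal("0.925"))    # what sterling really holds
misread = pure_ounces(gross, Decimal("0.999"))   # sterling treated as fine silver

overstated_oz = misread - actual                 # silver that isn't there
overstated_usd = overstated_oz * spot            # value that isn't there
overstated_pct = overstated_oz / actual * 100    # relative overstatement

print(overstated_oz, overstated_usd, overstated_pct)
```

On those assumed numbers the misread adds 0.74 troy oz of silver that doesn't exist, which is an 8% overstatement, or $22.20 at the assumed spot. The percentage doesn't depend on the weight or the price: treating .925 as .999 always overstates the metal by 8%.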

The common thread matters more than any single error: the output gave me no signal about which parts it had computed and which it had guessed. A wrong weight looked exactly as authoritative as a right one. No "I'm not sure about this lot," no flag, no seam. Just fluent, even prose that was sometimes fact and sometimes fabrication, with no way to tell them apart short of redoing the work myself, which was the work I'd been trying to skip.

Why a general chatbot is the wrong tool, not a bad one

Here's the part I want to be careful about, because it's easy to get wrong in the other direction, and because I like these tools. The problem isn't that the models are dumb, or that they'll always be dumb, or that a newer one wouldn't grade my exact catalog better than the ones I used did. One might. That's not the point, and pinning the argument on "the model is weak" just means the argument expires the next time the model improves, which it will.

The point is structural. Grading a bullion catalog is a precision task: exact weight times exact purity against the live spot price, repeated across hundreds of messy, inconsistent, human-written descriptions, where being off on one number is one bad bid. Handing that to a general model freehand asks it to generate plausible text and hopes plausible text happens to be arithmetically exact on all 400 lots. Nothing about "produce fluent language" guarantees "did the math right every single time," no matter how capable the model gets. If anything, the better it sounds, the more dangerous the wrong answers are, because they read even more like facts.

It's a screwdriver-as-hammer problem. The screwdriver isn't bad. It's a fine screwdriver, and I'll keep using it for what it's good at. You just don't want it for this, and a sharper screwdriver doesn't change that.

The fix isn't "no AI," it's structure

This is the part that matters, and it's why this isn't an anti-AI post. What changed my results wasn't a cleverer prompt or a smarter model. It was giving each job to the tool built for its structure, and using AI for the part it's genuinely great at instead of the part it isn't.

The math gets done as math: deterministic arithmetic, weight times purity times spot, computed rather than "reasoned about." There's no room to guess, because there's no generation step inside the number; it's a calculation. And AI gets used where a general model is genuinely strong: reading a thousand inconsistent human descriptions and flagging the ones where the claim and the contents don't line up, the "this says 12 oz of silver but it's a commemorative set" catch. Reading messy text is a language job. Pricing it is a math job. Splitting them so each runs on the right tool is the whole fix.
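Here's roughly what "math as math" looks like in code. This is a sketch under my own assumptions, not BullionBidder's actual implementation: the `Lot` type, the `melt_value` function, and the sample values are all hypothetical, and it covers only the metal value, not fees or premiums.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP


@dataclass(frozen=True)
class Lot:
    """Hypothetical record of the fields the pricing step needs."""
    lot_id: str
    gross_troy_oz: Decimal   # weight as stated, troy oz
    purity: Decimal          # fineness as a fraction, e.g. 0.999


def melt_value(lot: Lot, spot_per_oz: Decimal) -> Decimal:
    """Weight times purity times spot, rounded to cents.

    Pure arithmetic: the same inputs always give the same output.
    """
    raw = lot.gross_troy_oz * lot.purity * spot_per_oz
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Illustration values only: not a real lot, not a real spot quote.
lot = Lot("hypothetical-001", Decimal("12"), Decimal("0.999"))
value = melt_value(lot, Decimal("30.00"))
print(value)
```

I used `Decimal` rather than floats so the cents don't drift from binary rounding. The part that matters is that the function never sees the description text. It only multiplies fields that something upstream has already extracted, so a misread can't hide inside the number.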

And here's the single most important difference, the one that separates a built tool from a chatbot wrapper: it tells you what it isn't sure about. When a lot's description and its contents don't line up, or something can't be confirmed, it gets flagged for review instead of handed to you as a confident number. A freehand chatbot has no seam between reading the description and doing the arithmetic, so a misread silently becomes a wrong number with no warning. A built system keeps those steps separate and shows you exactly where to look. That flag, the "verify this before you bid," is the opposite of a wrapper projecting false confidence, and it's the thing I most wanted after watching fluent, wrong answers scroll past.
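A sketch of what that seam might look like, with heavy caveats: the names here (`Graded`, `grade`, `SUSPECT_TERMS`) are hypothetical, and the keyword list is a crude stand-in for the description-screening step, which in the real thing is the AI's job. The point is only the shape: a lot either gets a computed number or it gets a flag, and it never gets a guessed number.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional

# Placeholder for the screening step; a keyword match is an assumption made
# so the sketch runs on its own, not how real screening would work.
SUSPECT_TERMS = ("plated", "commemorative", "silver-colored")


@dataclass
class Graded:
    lot_id: str
    melt_value: Optional[Decimal]              # None when inputs are missing
    flags: List[str] = field(default_factory=list)


def grade(lot_id: str, description: str,
          gross_troy_oz: Optional[Decimal], purity: Optional[Decimal],
          spot_per_oz: Decimal) -> Graded:
    """Compute a value only from stated inputs; flag everything else."""
    flags: List[str] = []
    if gross_troy_oz is None:
        flags.append("weight not stated in description")
    if purity is None:
        flags.append("purity not stated in description")
    lowered = description.lower()
    for term in SUSPECT_TERMS:
        if term in lowered:
            flags.append(f"description mentions '{term}'")
    if gross_troy_oz is None or purity is None:
        # Refuse to invent a number: no value, only reasons to review.
        return Graded(lot_id, None, flags)
    return Graded(lot_id, gross_troy_oz * purity * spot_per_oz, flags)


# Illustration values only.
spot = Decimal("30.00")
unclear = grade("hypothetical-A", "12 oz silver commemorative set", None, None, spot)
clean = grade("hypothetical-B", "1 oz .999 fine silver round",
              Decimal("1"), Decimal("0.999"), spot)
print(unclear.melt_value, unclear.flags)
print(clean.melt_value, clean.flags)
```

The first lot comes back with no value and three reasons to look at it by hand. The second comes back with a number and an empty flag list. A lot can also carry both a number and a flag, which is the "verify this before you bid" case.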

That's what BullionBidder is: the numbers are calculated, the descriptions are screened, and the output tells you what it's confident about and what you should check yourself.

If you're about to try the shortcut

You probably will, because the idea is too obvious not to, and that's fine. I did too. Just know what you're actually risking. The danger was never that AI gives you no answer. It's that it gives you a confident, well-formatted, wrong answer, and a clean list of numbers is exactly persuasive enough to bid on.

So run the numbers somewhere that does math as math and tells you where it's unsure. Quick Check does the all-in on a single lot you're weighing, and BullionBidder does it across a whole catalog and flags what to verify. Either way, the rule is the same one the rest of this blog keeps circling back to: trust the calculation, check the claim, and never bid on a number you can't see the work behind.
