
April 19, 2025 • 9 mins

In this episode, I walk through a Fabric Pattern that assesses how well a given model does on a task relative to humans. This system uses your smartest AI model to evaluate the performance of other AIs—by scoring them across a range of tasks and comparing them to human intelligence levels.

I talk about:

1. Using One AI to Evaluate Another
The core idea is simple: use your most capable model (like Claude 3 Opus or GPT-4) to judge the outputs of another model (like GPT-3.5 or Haiku) against a task and input. This gives you a way to benchmark quality without manual review.
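Mechanically, this reduces to prompt assembly: the instructions, the input content, and the candidate model's output get packed into one clearly delimited chunk for the stronger model to judge. A minimal sketch, not Fabric's actual pattern text (the section labels and the `build_judge_prompt` helper are illustrative only):

```python
def build_judge_prompt(instructions: str, content: str, result: str) -> str:
    """Bundle the three components the judge must distinguish: the task,
    the material it was run against, and the candidate AI's attempt."""
    return (
        "# INSTRUCTIONS (the prompt the candidate AI was given)\n"
        f"{instructions}\n\n"
        "# INPUT (the content the instructions ran against)\n"
        f"{content}\n\n"
        "# RESULT (the candidate AI's output, to be rated)\n"
        f"{result}\n\n"
        "Rate the RESULT against human performance levels."
    )

# The assembled chunk is what gets sent to the judging model:
prompt = build_judge_prompt(
    instructions="Extract the key insights from the post.",
    content="A blog post about AI evaluation...",
    result="1. Strong models can grade weaker ones...",
)
```

The labels exist because of the distinction the episode stresses: the judge has to know which part is the task, which is the material, and which is the output under review.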

2. A Human-Centric Grading System
Models are scored on a human scale—from “uneducated” and “high school” up to “PhD” and “world-class human.” Stronger models consistently rate higher, while weaker ones rank lower—just as expected.
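Since the grades form an ordered scale, a judged model's level can be compared numerically. A sketch of that ordering (the label strings are paraphrased from the episode, not the pattern's exact wording):

```python
# Human-intelligence grades, ordered weakest to strongest.
HUMAN_SCALE = [
    "uneducated",
    "secondary education",
    "high school",
    "bachelor's",
    "master's",
    "PhD",
    "world-class human",
    "superhuman",
]

def rank(level: str) -> int:
    """Ordinal position of a grade; higher means a stronger rating."""
    return HUMAN_SCALE.index(level)

# The expected pattern: stronger models earn higher grades.
assert rank("PhD") > rank("bachelor's") > rank("high school")
```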

3. Custom Prompts That Push for Deeper Evaluation
The rating prompt includes instructions to emulate a 16,000+ dimensional scoring system, using expert-level heuristics and attention to nuance. The system also asks the evaluator to describe what would have been required to score higher, making this a meta-feedback loop for improving future performance.
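The meta-feedback part is mechanical: whatever grade the judge awards, it is asked what each higher grade would have required. A sketch of that loop, using an illustrative `HUMAN_SCALE` list and question phrasing of my own (not the pattern's literal text):

```python
HUMAN_SCALE = [
    "uneducated", "secondary education", "high school", "bachelor's",
    "master's", "PhD", "world-class human", "superhuman",
]

def improvement_questions(awarded: str) -> list[str]:
    """For every level above the awarded grade, phrase the follow-up
    the judge is asked to answer (the meta-feedback loop)."""
    higher = HUMAN_SCALE[HUMAN_SCALE.index(awarded) + 1:]
    return [f"What would it have taken to score at the {level} level?"
            for level in higher]

# A bachelor's-grade result yields four follow-up questions:
questions = improvement_questions("bachelor's")
```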

Note: This episode was recorded a few months ago, so the AI models mentioned may not be the latest—but the framework and methodology still work perfectly with current models.

Subscribe to the newsletter at:
https://danielmiessler.com/subscribe

Join the UL community at:
https://danielmiessler.com/upgrade

Follow on X:
https://x.com/danielmiessler

Follow on LinkedIn:
https://www.linkedin.com/in/danielmiessler

See you in the next one!




Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
S1 (00:17):
All right. Welcome to Unsupervised Learning. My name is Daniel Miessler,
and I'm building AI to upgrade humans. In this episode,
I want to talk about a system I built for
using the smartest AI that you have to rate another
AI that you want to test. So this is the

(00:38):
infrastructure that I'm using. It's essentially: you have a top-level AI that you believe is the smartest. So right now, currently, that is o1-preview. And what you're going to do is assess the work of another AI, which is going to be this other one over here. In the case I'm using for this example, it's GPT-3.5 Turbo. And we're going to give it

(01:03):
a set of instructions to do on a piece of input.
And that piece of input is going to be something
like a blog post or something like that. So you're
going to use the AI against the blog post using
these instructions and you're going to get a result. And
then this AI is going to run against all three
of those. And it's going to give you then a

(01:27):
judgment at the end of it. So this should be pretty cool, and it turned out it worked really well. So ultimately what I'm trying to get to is a classification of how good this thing is compared to an actual human doing it. And so in order to do that, I want to
And so in order to do that, I want to

(01:47):
give it different classes of human, right? So you've got, like, uneducated, secondary education / high school level, bachelor's, master's, PhD, world-class human (like top 100 in the entire world), and then superhuman level, so it's like better than the best human. And I've actually never seen anything score

(02:09):
that high, so for whatever that's worth. But what I have this thing successfully doing is: if I give it a lower-level model like a GPT-3.5 or a Haiku, it scores down in the high school to bachelor's range. And if I give it a Sonnet 3.5 or

(02:33):
something like that, it scores usually around master's level or PhD level, and sometimes world-class human. But ultimately what it is doing, which is what I wanted it to do, is scoring the smartest models at the highest level, and scoring the dumbest or less capable

(02:56):
models, or smaller models, much lower: secondary education, high school, and bachelor's. So the thing is working, and this is the architecture, right? A smart one to judge a less smart one. And by the way, if I give it the smartest one to judge the smartest one, it does

(03:18):
actually score. So if I use o1 to rate the work of an o1 task, it actually does score way up here, in like world-class, PhD level. So it definitely works, and I recommend you go check out the video and see exactly how to do it. What this is, essentially, is called a stitch within Fabric.

(03:42):
So Fabric: the whole concept is like patterns and fabrics and stuff like that, like woven things, right? Well, this is a stitch because it's a combination of Fabric components all stitched together. And this is the actual pattern that I'm using. Look at this: this is the logic for the rate AI prompt.

(04:04):
This is the instructions given to the judging AI, which in this case is o1. Okay: fully understand the different components of the input, which is going to be a piece of content that I will be working on. That's the input; the set of instructions, which is the prompt; and then the results of the prompt being run against the input using those instructions for

(04:31):
a given AI. And I tell it to completely understand the distinction between all three of those components, right? Because I'm going to send them all as one chunk to the judging AI. Then: think deeply about all three components. Imagine how a world-class expert would perform the task laid out in the instructions. So I'm giving it the content, I'm giving it the prompt, I'm telling it

(04:54):
to learn the prompt, understand the prompt, which in our case in Fabric is called a pattern. Deeply study the content itself so you understand what should be done with it, given the instructions. Deeply understand the instructions themselves. Given both of those, then analyze the output. And look at this one; this one I'm kind of proud of. I don't know if it's actually working. I'm going to do some more

(05:17):
evals to figure out if this is actually effective or not, because it turns out this kind of mystical stuff that I'm doing here, which is super cool: it might be awesome, it might not matter at all, and it might actually hurt the output. So you can't just take it on faith, like religion; you've got to actually test

(05:38):
this stuff. Anyway, here's what I did: evaluate the output using your own 16,384-dimension rating system that includes the following aspects, plus thousands more that you come up with on your own. So: full coverage of the content, following instructions carefully, getting the genre of the content, getting the

(06:00):
genre of the instructions, meticulous attention to detail, use of expertise in the fields in question. So I'm giving it these ideas. This is actually somewhat similar to attention heads inside of a transformer. So I'm telling it: here are some ideas for how to do a rating system. And I'm telling it to make its

(06:22):
own rating system using things like this, but to map its rating of a particular piece of output using 16,384 dimensions, which is two to the fourteenth. So who knows if it's

(06:43):
actually going to do this? Okay, I'm telling o1 to do this, and it has the ability to sort of think for itself, so maybe it's doing some of this. I think with a regular model, a lot of this was just flash and not really actually happening. Anyway, it doesn't matter; that's what I'm trying to do here. Spend significant time on the task. Ensure you are properly and

(07:03):
deeply assessing the execution of the task, using the scoring and ratings described, such that a far smarter AI would be happy with your results. So I'm using multiple tricks here to try to get it to be extra smart, and the goal is to deeply assess how the other AI did at its job, given the input and what

(07:26):
it was supposed to do based on the instructions and prompt. So I'm telling it what I want multiple times, in multiple different ways. And yeah, this is essentially what it does. And again, this is the output. The output also includes, and this is

(07:47):
kind of cool, what it would have expected from a higher-level result. So let's say it comes back with bachelor's, which this particular case did. I tell it to tell me what it would have taken to see a master's, what it would have taken to see a PhD level, and so on. Again, I'm

(08:09):
trying to seed it with as much as possible to make it come up with a better and better answer. Now, here's the thing: the smarter these AIs get (and this is a universal thing, right, because eventually I'll plug it into o2 or GPT-5 or Claude 4 or whatever it is),

(08:30):
the smarter that judging AI gets, the better it's going to be at interpreting what I actually want from this prompt. That's why it's kind of like a meta-prompt. So yeah, really happy with this. It actually is scoring according to my expectations. Got to be careful with that a little bit, right? You don't want to actually

(08:51):
tune the thing so it matches your expectations, but I've been somewhat careful there. So I recommend you go check it out and see what you could do with this. And if you have ideas for improvement, submit them and we'll get them pushed in a PR update inside of Fabric. See you in the next one.
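Pulling the transcript's pieces together, the whole stitch is a two-step loop: run the candidate model, then hand everything to the judge. A sketch under stated assumptions: `complete` is a stand-in for a real chat-completion API call (canned responses here), and the model names and return format are just labels for illustration:

```python
def complete(model: str, prompt: str) -> str:
    """Stand-in for an actual LLM API call; returns canned text for the sketch."""
    canned = {
        "gpt-3.5-turbo": "A shallow summary of the post.",
        "o1": "RATING: bachelor's. To reach master's: synthesize, don't summarize.",
    }
    return canned[model]

def rate_ai_result(judge: str, candidate: str, instructions: str, content: str) -> str:
    # Step 1: the candidate model runs the instructions against the input.
    result = complete(candidate, f"{instructions}\n\n{content}")
    # Step 2: the judge receives all three components in one delimited chunk
    # and returns a grade plus what a higher grade would have required.
    judge_prompt = (
        f"INSTRUCTIONS:\n{instructions}\n\n"
        f"INPUT:\n{content}\n\n"
        f"RESULT:\n{result}\n\n"
        "Rate the RESULT on a human scale and explain what a higher score needs."
    )
    return complete(judge, judge_prompt)

verdict = rate_ai_result("o1", "gpt-3.5-turbo",
                         "Extract the key insights.", "Some blog post...")
```

Swapping in a smarter future judge is just changing the `judge` argument, which is why the episode calls the rating prompt a meta-prompt.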