All Episodes

June 5, 2025 37 mins

Kevin Werbach interviews Shameek Kundu, Executive Director of AI Verify Foundation, to explore how organizations can ensure AI systems work reliably in real-world contexts. AI Verify, a government-backed nonprofit in Singapore, aims to build scalable, practical testing frameworks to support trustworthy AI adoption. Kundu emphasizes that testing should go beyond models to include entire applications, accounting for their specific environments, risks, and data quality. He draws on lessons from AI Verify’s Global AI Assurance pilot, which matched real-world AI deployers—such as hospitals and banks—with specialized testing firms to develop context-aware testing practices. Kundu explains that the rise of generative AI and widespread model use has expanded risk and complexity, making traditional testing insufficient. Instead, companies must assess whether an AI system performs well in context, using tools like simulation, red teaming, and synthetic data generation, while still relying heavily on human oversight. As AI governance evolves from principles to implementation, Kundu makes a compelling case for technical testing as a backbone of trustworthy AI.

Shameek Kundu is Executive Director of the AI Verify Foundation. He previously held senior roles at Standard Chartered Bank, including Group Chief Data Officer and Chief Innovation Officer, and co-founded a startup focused on testing AI systems. Kundu has served on the Bank of England’s AI Forum, Singapore’s FEAT Committee, the Advisory Council on Data and AI Ethics, and the Global Partnership on AI. 

 Transcript

AI Verify Foundation

Findings from the Global AI Assurance Pilot

Starter Kit for Safety Testing of LLM-Based Applications

 

Mark as Played
Transcript

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:01):
should make welcome to the Road to Accountable AI.
Thank you very much for having me, Kevin.
First, just tell us what AI Verify is and what's the problem that you're trying toaddress.
So AIVerify is a subsidiary of uh IMDA, which is a government-owned authority inSingapore.

(00:24):
It's a not-for-profit subsidiary, and it's focused on a singular problem.
The ultimate objective is to enable the adoption of AI at scale, but the hypothesis isthat there are many barriers to adoption of AI at scale.
One of them is the lack of adequate, reliable testing capabilities.
when it comes to AI systems.

(00:46):
And we try and focus on that specific problem, which is how to make reliable testingaccessible at scale.
What got you personally interested in this area of AI testing?
You came from the financial sector, from standard chartered.
I know you worked for a startup before AI Verify, but what made you think this was animportant area to focus on?

(01:07):
Well, actually, funnily enough, the background in financial services where I was groupchief data officer at one of the global systemically important banks.
And the course of that well before machine learning uh became as prevalent as it is inmany parts of finance, uh we've always had concerns about the quality of data and the

(01:29):
reliability of systems that are used for things such as capital calculations or anti-moneylaundering or anti-fraud initiatives, et cetera.
So the thinking around we need to make data and algorithms reliable is now more than 11years old, at least in my personal career history.
And so it's always been a thing to say, how do we ensure that if more and more of theworld is going to be dependent on these things, uh we are able to ensure their reliability

(01:59):
among other things.
And so that was the original uh inspiration.
I then moved, as you mentioned, to a startup that was focused on testing traditional MLsystems and later, the Ruby AI systems.
And when that got acquired, I joined this uh government not-for-profit subsidiary focusedon the same space.

(02:23):
As you mentioned, um this area of testing and data quality and model quality and so forthis not something new, at least in industries like financial services.
What's different or more challenging today than in the world of model risk management, sayfor banks?
Yeah, I think it's a great question.
think there's a few different dimensions, but the two that come to mind the most are thedemocratization of models and the lack of boundedness of many of the models, particularly

(02:58):
with generative AI systems.
So by democratization, what I mean is that if you go into banking, for example, orinsurance for that matter, the number of
individuals and departments that were actually involved in building models and that wereallowed to put those models to work was quite limited.
And there would typically be departments that had people with a background in quantitativesciences and statistics, etc.

(03:23):
So those who were worried about model risk could basically look forward to engagingspecialists that already knew a lot of these dimensions.
So it was a relatively narrow
Now suddenly, well before generative AI, particularly with AutoML in, I would say, late2010, the second half of 2010, suddenly everybody and their dog was a data scientist,

(03:47):
right?
And that meant that you could have a very large number of models, which was frankly greatfrom a very, very, you know, from a democratization of data science perspective, from the
kind of impact these models could have.
But it did mean that managing these
suddenly became a much bigger challenge.
But that was nothing compared, if you think, to the challenge of managing generative AIsystems, because most of them, as you know, they are foundation models, general purpose AI

(04:13):
systems.
And so when you implement one of these systems, it is very difficult to put a boundaryaround what each of these systems might be doing.
So for example, even if you're picking something um as narrowly bounded as anti-moneylaundering,
Is the application you're using solely going to generate a proposed narrative for closingthe investigation?

(04:38):
Or is it also a source of information to the investigator?
Well, depending on the answer to the question, you have to manage different risks.
So I think these two factors, among many other things, of course, with generative AI, anadditional factor is the fact that the data involved is no longer just structured data.
It's a lot of unstructured data as well.
And all of these three factors technically have made a significant difference to the

(05:00):
surface area of risk.
For someone who's not familiar, can you explain just broadly speaking, what does mean totest the generative AI model?
So I can but actually to be fair we focus a little bit more on the generative AIapplication itself and there is an important distinction Yeah, there's an important

(05:22):
distinction, which is the model can work or not work within a set of conditions But inmost real-life implementations whether it's in a bank or a hospital or an airport, etc
That model is embedded into a broader end-to-end system.
And so if you look at the potential points of failure
It's not just the model itself, it's the data that feeds into the model, it's the linkagesbetween the various non-AI components in the system, etc.

(05:48):
Having said that, think if you think about the focus area of uh what...
uh Sorry, if you think about testing itself, in effect, you are bombarding the system, theapplication, with a series of inputs, which in many cases, if it's a chatbot or any kind
of...
interface like that, it's a bunch of prompts, and you're evaluating the answers that comeout of the application for a variety of risks that are relevant to your use case.

(06:18):
So for example, if it's a hospital that's using it to, I don't know, summarize medicalreports, then you're probably looking at accuracy and completeness.
If it's a public facing chatbot, you might be interested in accuracy, but you're alsointerested in the extent to which the system
The application is say, spouting inappropriate language as an example, it can be made tosay inappropriate stuff.

(06:43):
uh If you've built something which gives reason to malicious attackers to try and breakinto your system, then of course you're worried about those aspects.
So at its core though, you have an application, you're bombarding it with a bunch ofinputs, typically prompts of some kind, user prompts of some kind, and you're assessing

(07:04):
how the system reacts to that.
And that's all at its simplest.
That's all that testing a generative AI system involves.
And what about assurance?
Does that mean something deeper than testing?
It's a great question and m it's not something that I'm an expert in, but as we weredesigning our assurance pilot, we looked into it and one of the big four uh accounting

(07:29):
firms, audit firms, me off to a definition.
And my take from a rather long, several page long definition was that assurance, it can begeneric, for a context like this, assurance is, hey, the...
The system is performing as you have said it will perform along a set of predefinedconditions.

(07:54):
So you're not giving a generic level of assurance.
You're giving a bounded assurance to say, it does this, this, this, and this.
Now, you can argue the testing also has to be that.
But I guess testing is a more informal thing.
You could test it 20 different ways.
But when you say, I've actually provided assurance on this system, you are actually doingit against a very specific set of risks.

(08:15):
measured using a specific set of metrics and possibly with a specific set of thresholds.
Okay, you mentioned the global AI assurance pilot that you've just uh completed.
uh First, just tell us a little about what that project is.
Sure.
So the Global AI Assurance pilot wasn't actually for pilot projects.

(08:39):
It was for live use cases of generative AI.
The word pilot is used because the testing aspect of it was more pilot-like.
And in each of these cases, these were in banks and asset managers and hospitals andinsurers and an airport and a few other very real-life organizations, manufacturers.
Each of these instances, the idea was to get that application technically tested.

(09:03):
for the risks that matter most to those use cases.
Except we didn't do the testing.
We paired these 17 organizations with specialist testing firms from around the world.
So you had basically a deployer of an AI system paired with a tester of an AI system.
And the objective was to help define, help make that connection between, this is the usecase.

(09:27):
What are the set of risks we should be worrying about in this use case?
How do those risks translate into measurable concrete tests?
And then how do you go about conducting those tests?
And what do you learn from the exercise?
mean, ultimately, the objective is to create a set of norms and best practices which thenfeed into standards for technical testing.

(09:48):
But this space is so immature that we thought it was appropriate to start with something abit more experimental in that sense.
m But yeah, real life testing.
of generative AI applications, not the models.
We were not testing an open AI model or an anthropic model.
We were testing the final application, which may be using indeed those models.

(10:10):
Right.
So can you say a little bit more about what's missing from the testing of the systems,given that the model developers are engaged in a great deal of work on testing and
alignment and so forth?
Absolutely.
So let's take a couple of examples from the pilot.
Maybe before I get into examples, I think the way to think about it is the focus for modeldevelopers, and indeed most research labs, most policy work in this space, is on the

(10:39):
model's capability and safety.
And that doesn't necessarily work for two reasons in real life applications.
One is context and the other is complexity.
So let me take each of these one by one.
Context, let's take
that example of a hospital that is using a large language model to summarize uh twomedical reports and extracting key information and then making a recommendation on how

(11:08):
soon should this patient come back for a repeat test.
uh Now, the model underneath that could have been tested even on medical data to a greatdeal of accuracy.
But it wasn't tested.
in the context of this particular hospital with these specific reports.
And so even though it did very well on a generic benchmark, it may or may not do that wellin this specific context.

(11:34):
Second is what does very well mean?
The context is also, is it one in 1,000 errors?
Is it no errors at all?
Is it five out of 100 errors?
You don't know that.
And particularly with the lack of repeatability of many of these systems,
It is very important in most real life use cases to say, no, I've done my own testing.
So that's context.

(11:55):
The second part is complexity.
As I mentioned in this example, the reason why the system might not do well could indeedbe the model not doing well.
But it could also be that the source reports that came in had missing information or wereformatted in a way where the model couldn't read it.
It could be because the output was generated in a way in which the human in the loop, amedical doctor,

(12:16):
could not understand what was put in front of them, and therefore did not make the rightdecision.
So that's the complexity part.
So when you test an application, you must include those aspects as well.
Just testing whether the underlying model was good in theory is no longer useful.
How did you source and match the companies, both the ones being tested and the ones doingthe testing?

(12:41):
To be honest, we were very surprised.
We initially thought we'll get maybe five pairs.
uh But it turns out, while there's a huge amount of activity in the testing of models,both for safety and capability, there haven't been too many experiments at this scale, 17
real-life applications, um when it comes to, indeed, the use of generative AI insidemainstream non-digital industries.

(13:11):
And so when we reached out to the market with this, we got an overwhelming number ofapplications.
think we got 25.
Primarily the deployers were from this part of the world, Singapore or the neighboringarea.
But the testers came from all over the world, from the US, from Canada, from France,Switzerland, Germany, um other parts of Asia, et cetera.

(13:35):
the main reason was that they all saw the value of this particular pilot.
as a template for how we might work out the what and the how of technical testing ofgenerative AI applications.
What to test and how to test.
What did you find?
Well, we I think the first thing we found to be honest is this confirmation of the factthat There is a big difference between testing models and testing applications So it was a

(14:05):
hypothesis and confirming that hypothesis was probably the biggest single lesson learnedthat yes You do need to test for applications differently In some ways your model might be
very capable your model might be performing at the frontier etc.
What you need is
boring predictability in the context of specific applications.
It may or may not be doing 20 other things very well, but does it do this one thing that Ineeded to do in my specific context well?

(14:33):
So that's the main lesson learned.
beyond that, I guess there were four specific kind of practical tips, if you will.
The first was the need to spend time on figuring out what to test.
I know it sounds very obvious, but the reality is even as humans, we are not very goodalways at saying.
Take that doctor's example.
Is this a good summary?

(14:53):
Is this recommendation correct?
Maybe the recommendation correct or not easier to guess.
But is this a good summary?
Well, you ask two different doctors.
They'll probably say, this one says this is good.
The other one says maybe it could be slightly better.
So finding ways to specify that in an automated fashion, quite difficult.
So specifying what to test takes effort.
That was our first lesson.

(15:14):
The second major lesson, no surprise for at least somebody who comes from a databackground, is that you can't assume
Or rather, you should assume you will never have good test data to hand.
Never have enough test data to hand.
You can never have good test data to hand at the level that you need.
So you must prepare for it.
How do you do that?
You use aspects like simulation testing, adversarial red teaming, automated generation, etcetera.

(15:38):
Because that's the only way, using a mix of human creativity and AI, that you can generateenough um test cases to make the application testing really worthwhile.
The third thing we learned was the need to look under the hood.
What we mean by this, again, is if you go back to that example of the hospital or indeedof any of the various agentic AI systems that are being built out, there are multiple

(16:02):
steps inside the application.
And testing just the output might not be enough.
Testing, for example, did the data extraction happen well?
Did the online tool call as part of the agentic flow, which might be a web lookup, didthat work well?
Did a SQL query that was embedded work well?
Testing each of these interim steps provides you two advantages.

(16:23):
First, it provides you traceability in case things go wrong, you know where it went wrong.
Second, it provides you bit of redundancy in the testing because even if your final outputtesting was not accurate, at least you figured out along the way that something went
wrong.
And the last and very important uh lesson was that actually, while it is inevitable thatone has to use LLMs as judges to test LLM enabled systems,

(16:47):
it has to be done with a lot of caution and a lot of skill because they can be unreliableand you're kind of building unreliability on top of unreliability.
uh And there is a lot of care to be taken in terms of fine tuning or prompting these LLMas judges.
And also a lot of human validation of the test results, which leads me to my last point.

(17:11):
I know I'm cheating, said four, but there is a fifth one, which is across every stage ofthe
testing journey.
Today, the technology is certainly not at a stage where you can do away with the human.
In fact, the role of the human expert is paramount across the testing lifecycle.
I'm not talking about the main application.
I'm just talking about deciding what to test, deciding where to test, figuring out how togenerate the right test data to test, and actually validating or calibrating the automated

(17:41):
testing results.
In each of these aspects, the human role is absolutely paramount today.
It might be that there'll come a day when the human becomes redundant for testing of geneAI systems.
Today is not that day.
So it's not going to be a matter of just push a button, get a test result and green light.
can, but you wouldn't really be able to rely upon that.

(18:03):
Yeah.
Not at this stage.
just so that people understand, you say it's inevitable to use LLMs in the testingprocess, is that just because of the scale of how many tests are being done?
I think it's flexibility as well as scale.
So actually, it's worth thinking about what we mean by using LLM's as part.

(18:23):
So one part as we discussed earlier and I think you alluded to is that test data itselfoften has to be generated.
You'll almost never have enough historical data.
And so how do you do that?
Well, if you want to generate realistic test data, one way you can do that is by creatingscenarios saying, hey, I have these nine personas of customers for this application.

(18:45):
For each of these, I give you some seed data.
Can you go in and kind of create some more?
So that test data generation, whether it's for adversarial purposes, which is red teaming,or more kind of normal purposes, whether it's stress testing or...
So synthetic test data generation is indeed one area.
But the other aspect why you need LLMS is suppose you are assessing something as nuancedas adherence to company values or...

(19:14):
alignment with um brand image, or the language is not tight enough, or these are allconcise enough.
These are all quite difficult to mathematically define.
You can't create rule-based engines always.
So the first instinct is indeed to use an LLM to say, hey, you are an LLM as a judge, andyou prompt it with appropriate codes and give some examples.

(19:41):
What we have seen is that
As people become more confident about what they are testing, they're able to replace manyof these LLMs with smaller models, which have the advantage of being more efficient and
also of being more reliable.
So what I suspect you'll see in many cases is LLMs as a judge are indeed used as the firstchoice to understand how to write the test properly.

(20:03):
And then once you've got a good handle of it, you find something far more efficient andfar more uh reliable in some ways to do just that bit of
testing.
eh That's, think, the path that you'll see in many cases.
I want to go back to your first recommendation on testing what matters.
And it actually connects in with what you just said, that uh there's so many differentunique aspects in every one of these applications.

(20:29):
And it's not just a matter of the hospital has one set of test requirements, the hospitalmight have 100 different use cases, which are distinctive.
So given all of that diversity of requirements, how can we ever get to a point of having
testing standards and benchmarks that are consistent and universal.

(20:51):
That's a very good question.
And one, to be honest, that I don't think we personally, uh I personally, and ourfoundation has fully learned it.
But I think we've got some useful insights from this pilot.
First of all, there's some meta standards or meta best practices, if you will, such as howdo you go about the process of zeroing in on your risk that matter and therefore

(21:15):
converting that into tests?
So there's that aspect itself, which is the learning
how you do x.
That's what actually we did see quite a bit of value from across the, when you have 17different use cases, one of the advantages you get to see.
And you can see there are some patterns on how you go about zeroing in on the use cases.
So that's one part.
The other thing I would say is while the specific prioritization of risks might bedifferent by use case, if you have a summarization use case, well, you can only test so

(21:48):
many things about a summary.
In most cases, you can test its completeness, you can test its conciseness, you can testits uh accuracy or the lack of hallucinations, you can test its robustness over time or
robustness to small changes.
And these things can be standardized.
Standardized with the smallest for now, meaning there is a reasonable level of agreementon what are the possible ways of testing these things.

(22:13):
And so that can form the basis of norms.
These norms...
can over time become standards on how to test, for example, a summarization logic in an Lor a summarization application.
Where I think we will always struggle and maybe not even attempt is what is the rightlevel of threshold.
So let's make it practical.

(22:33):
Maybe for summarization you say, the score I'm assessing is based on a weighted average ofthe key claims made in the summary and assessing whether those claims are grounded in the
source context data.
OK, this can be reasonably standardized.
But do I need a scoring threshold of 9 out of 10?
Do I need a scoring threshold of 8 out of 10 key facts?
That will depend on the use case.

(22:55):
Or as we found in the hospital case, it might be that 9 out of the 10 is pointless if the10th happened to be the number of polyps in the colonoscopy report.
Well, the whole thing became 0 because the most important fact got lost.
So I think if you just step back, the way in which you assess risks, the way in which youconvert that into

(23:15):
metrics for a given LLM archetype, like summarization uh or translation, et cetera.
Those things can be standardized.
The approaches to doing the assessment can be standardized.
But the threshold of what is good enough, or even how many data points I need to, how manytest data cases I need to make myself comfortable, that will depend use case by use case.

(23:43):
Presumably that's also something that governments might also get involved in thatregulators may want to say for this medical use case, this is an appropriate standard.
So I understand this is a, you know, it's a private effort doing these pilots, but how doyou think, you know, globally we get to having good conversations with regulators and

(24:04):
policymakers about their relationship with testing?
Yeah, I think it's a really good question and we're still figuring out the answer to that.
I would say if I step one step before governments, I think there is a role for standardsbodies to play.
And indeed, here in Singapore as well as internationally, we are speaking to standardsbodies to see how far can technical standards for AI, generative AI testing be?

(24:27):
How far can they go?
And then we can get to, well, do these pilots provide you some inputs for that?
As to government or rather more formally regulators,
kind of setting standards for specific class of applications.
To some extent, they do that.
If you look at software as a medical device, uh there are uh very tight standards, etcetera.

(24:48):
I would see those to be much more use case specific.
And will they do that?
I assume so.
When will they do that?
I don't know.
I think the technology and the usage has to be mature enough.
We're certainly working at a Singapore level to assess when is that right time togetherwith our
various sector regulators.
But I think it's too premature to say when.

(25:09):
But back to the standards of what to test and how to test, not what is good enough, buthow you can go about doing the test, that one is more with the standards bodies.
And I think we can make more progress there relatively in shorter time frames.
Yeah, what do you find in engaging with the standards bodies in terms of how sophisticatedthey are and how willing they are to go down that

(25:34):
I they're very sophisticated, the ones I have spoken to both in Singapore and ISO, IEC, inEurope, CEN, CEN, ELEC.
The issue is not of sophistication because remember, many of these have been doingstandards for hundreds of, or for tens of years, if not more, right?
I know at least one or two of these bodies have more than a century worth of history.
The challenge is not are they sophisticated enough or are they willing to do it?

(25:58):
The challenge is they know exactly how difficult it is to...
get something to a level of maturity that can be a standard.
So I think the onus is on people like us who are trying to get this more crowd, notcrowd-sourced, private sector efforts together to suggest what they might be.
I think the onus is, to some extent, to us to say, can we match that level of expectationthat is there from standards bodies to put something as a technical standard?

(26:25):
And I think I found them very collaborative, very knowledgeable, very keen to see if itcan be made, it can happen.
ah But equally, quite clear about the level of rigor that is needed in order for somethingto qualify as a technical strength.
What's next for the AI assurance pilot?
Is this something that you might do regularly or are there follow-ons to the pilot thatyou did?

(26:51):
Yeah, so there's a few different things that we are considering.
Some of them more concrete than others.
So certainly, we are going to continue it in some form, probably not call it a pilotagain, given that it's already been a pilot.
But more on a rolling basis, we're toying with the word clinic, which is actually justthis morning.
Since this morning, I've had two entities in Singapore saying, hey, I've got a generativeAI application.

(27:13):
Can you help guide us how we can think about the risk prioritization?
and then maybe suggest some metrics and then connect us with one of the technical testersand then we take it from there.
So that's a lighter version of the pilot in some ways.
The other thing you already referred to is providing inputs into standard setting efforts.
So not start trying new efforts on our own, but there are existing efforts at setting AItesting standards.

(27:38):
How do we feed into that?
So that's a second area.
um The idea you talked about sector regulators concretely say,
for a certain application in a certain industry, here is the acceptable threshold.
That could happen, but that's dependent, as you would have highlighted, on thesector-specific regulator.
So we'll certainly explore that idea.

(28:00):
One thing that we've heard a lot of demand for, but we are not yet sure how we progress,is accreditation of testing providers.
Because the idea is saying, look, we understand maybe you're not in a stage where you candeclare my application to be good fit for purpose, but could you at least
point us to a set of companies that in your jurisdiction or as per your standards, is ableto do a decent job of assessing or of testing these applications.

(28:29):
I think that's something we have carefully listened to and we are considering how best totake that forward.
Again, very early, we just finished it last week, so we haven't fully fleshed it out.
Absolutely.
And just let me ask you one or two final questions.
The first one is just generally, can you talk a little bit more about Singapore's uh roleand position in global AI governance?

(28:53):
um Because it's, you know, it's most of the focus when I talk to people in the US is aboutlegislation in the US, the European AI Act and things like that.
this seems like a very different approach.
Yeah, I think there are regulations which are not specific to AI, but there was uhsomething around the use of deepfakes, the most recent general election, et cetera.

(29:16):
But that's a good example.
I regulation in Singapore, at least in this space, will tend to be very, very narrowlyfocused on a very specific risk.
Generally speaking, the idea is to provide good forward-looking guidelines, which you'llsee across.
I mean, I was the co-author of one of the first.
In finance, for example, the feet principles of the Monetary Authority of Singapore backin 2018.

(29:40):
So forward-looking guidelines, extensive engagement with industry, as you can see in thisexample.
We're not coming up with here's the norms for testing.
saying let's reach out to banks, hospitals, insurers, and work with technical specialiststo figure out what it can be.
um Lots of investment in kind of facilitating beyond these pilots, these kind of...

(30:03):
industry level exchanges, et cetera, to take that forward.
And then open source tooling, to the extent it makes sense to try and embed some of thesenot hard guidelines, but soft standards into something very concrete.
I think that is what we've been following.
are uh obviously a small country.

(30:25):
Nevertheless, uh we do have a reasonably relevant footprint in the AI space.
And we think for our most...
This is not for the world.
This is for our own economy.
It is extremely important that we're able to use AI at scale with the right level oftrust.
And we think the way we can do that is through this mix of guidelines, not legislation,not regulation, plus um extensive industry engagement on very practical aspects like this

(30:53):
pilot.
And then finally, open source tooling where it makes sense to translate these guidelines,these soft standards, into something very usable and concrete.
And just broadly beyond what you're doing in AI Verify, what do you see as the next stagein terms of AI governance, responsibility for AI getting two more robust standards and

(31:20):
testing?
When you have a hammer, everything looks like a nail.
So I will just give you my perspective.
uh My perspective is, think, the conversation on AI governance.
And you know this much better than me, given your position.
The conversation on AI governance from a what should the guidelines say, what are theright ethical principles, what is the right framework, I think that's done.

(31:45):
And I think there will be a set of conversations about how much of that is written down aslegislation or regulation.
fine, that's in certain geographies.
But other than that, the focus really needs to be, how do we convert this into somethingthat is possible to operationalize at scale without depending on massive amounts of human

(32:07):
makers and checkers?
And to do that, that is where we think um technical testing standards, uh ideally thingsthat can be automated, uh is super important.
angle that I already alluded to, that's why my comment about hammer and nail is I do thinkthere will be greater realization of the difference between model testing, which is often

(32:32):
about safety, alignment, human machine alignment, or just generally more broadly a littlebit about the frontier risks, which are very important, versus, OK, I just need to use
this in this one application.
How do I make sure it does what it says on the tin?
And remember, it's an application, not just the model.
Both are needed.
I think a lot of the conversation has been on the former so far.

(32:54):
We will see more of the latter if you look at what Arvind and Sayesh Kapoor, Arvind andSayesh Kapoor of the snake oil, AI snake oil fame.
They talk about AI as normal technology.
I really like that.
right.
Then I think the vice chair or one of the board members of Alphabet wrote a guest columnin The Economist, which was about how to really make AI reliable.

(33:20):
you should focus on the points at which AI is touching the real world.
And then focus on controlling and regulating if you need that, which means don't regulategeneral purpose models.
At least his argument was, I'm honestly saying I'm agreeing.
But when you're using a general purpose model to do something, software is a medicaldevice, yes, then focus on that.

(33:41):
So if you take that and then all of this, there is, think, uh a healthy
movement now in addition to the folks who are looking at model safety of also saying let'sthink about where these models are hitting the real world and let's think about how to
operationalize the reliability testing.
So away from safety to reliability and from models to applications.

(34:05):
uh That's my version of I have a hammer and so I'm looking for an edge.
Wonderful.
Shumig, thank you so much for your time.
Thank you.
Advertise With Us

Popular Podcasts

Stuff You Should Know
Dateline NBC

Dateline NBC

Current and classic episodes, featuring compelling true-crime mysteries, powerful documentaries and in-depth investigations. Follow now to get the latest episodes of Dateline NBC completely free, or subscribe to Dateline Premium for ad-free listening and exclusive bonus content: DatelinePremium.com

New Heights with Jason & Travis Kelce

New Heights with Jason & Travis Kelce

Football’s funniest family duo — Jason Kelce of the Philadelphia Eagles and Travis Kelce of the Kansas City Chiefs — team up to provide next-level access to life in the league as it unfolds. The two brothers and Super Bowl champions drop weekly insights about the weekly slate of games and share their INSIDE perspectives on trending NFL news and sports headlines. They also endlessly rag on each other as brothers do, chat the latest in pop culture and welcome some very popular and well-known friends to chat with them. Check out new episodes every Wednesday. Follow New Heights on the Wondery App, YouTube or wherever you get your podcasts. You can listen to new episodes early and ad-free, and get exclusive content on Wondery+. Join Wondery+ in the Wondery App, Apple Podcasts or Spotify. And join our new membership for a unique fan experience by going to the New Heights YouTube channel now!

Music, radio and podcasts, all free. Listen online or download the iHeart App.

Connect

© 2025 iHeartMedia, Inc.