Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:07):
What's going on, Warren? How are you?
Speaker 2 (00:09):
You know? I was so totally unprepared for that question.
I don't know what I thought you were going to
say to start off the episode, but of all the
things you could have picked, it was not that, right?
Speaker 1 (00:19):
Just what a jerk, asking someone how they are? That's...
oh man, I can't remember who it was. We had
a... I can't remember who the guest was, but I said,
you know, there's no, like, stump-the-chump or surprise questions.
And my first question was how she's doing.
And she's like, I thought you said there would be
no trick questions.
Speaker 2 (00:39):
That was Adriana. She was just on here, you know.
Speaker 1 (00:45):
Cool. So today we're gonna be talking about AI-ready
data and AI governance, and to help us through that conversation,
we have Inna Tokarev Sela from Illumex. Inna,
welcome to the show.
Speaker 3 (00:59):
Thank you so much, Will. Happy to be here.
Speaker 1 (01:02):
I'm excited to have you here. So give us a
little bit about your background and what led you to
creating, because you're the CEO and founder of
Illumex, so give us a little bit about how
you got there.
Speaker 3 (01:17):
It was a long pathway, but I'm also happy about the
way, you know, that my career took me. I started
in enterprise, at SAP, a huge German software company, and I
think this is, you know, the best-hidden secret
about enterprise: you can actually have quite an adventure, and
you can build stuff, big stuff, right, and switch
(01:39):
careers within. So I spent twelve years at SAP, starting
as an architect and then gradually evolving into a customer-facing
role as a partner manager, and then head of
a video analytics unit. So quite the
journey, and quite a privilege to work with the world's
biggest companies, think about Walmart and so
(02:01):
on and so forth. So really watching companies onboarding to
the cloud journey, and then machine learning, and then, you know,
neural networks and agentic as well. And then I continued my
career at Sisense, a business intelligence vendor, and then understood
what an underserved segment business users actually are. You know,
(02:23):
after building all these analytics for all those years, you
know, you're speaking to the actual users, and some
of them are CEOs of the companies, and they still
cannot get their hands on actual self-service analytics.
So this has really moved me to build a company
which creates a space where data can be recognized and
(02:45):
meaningful, semantically and business-wise, to the business users, to
enable them to self-service analytics and data access, right?
Speaker 1 (02:52):
And so whenever you're thinking about self-service data, one
of the challenges I've had in the past is self-service
sometimes means guiding yourself to the wrong answer. And, like,
one specific example I have is I was working for
a company and it was like, how many users
do we have using our app each month? It's like, oh,
(03:14):
that seems like a pretty straightforward question, right? But turns
out it wasn't, because then, oh, when I said monthly
active users, I actually meant people who weren't on the
trial and had converted to paid, but, you know, they
were, like, using it at least three times a week,
not just once a month. And so a
relatively simple question turned out to come with a
(03:37):
lot of constraints. So how do you deal with that
in a self-service world?
Speaker 3 (03:40):
It's a perfect question. For starters, data has to be, you know,
in a state where you can actually create analytics at scale
for many users. So let's speak about highly curated
dashboards in your business intelligence tool. It's a no-brainer, because you
can select specific data sets which go into a specific
report and make sure that those actually have decent quality.
(04:01):
For companies to have self-service at scale, you need
to make sure that any potential question could be answered
with high-quality data. You first need to get
your data in order to be able to provide this
kind of service. And the second one is, as you were
right to mention, single source of truth, right? So the definition
(04:23):
of what an active user is, of how many users do we
have, could be defined in dozens of different ways in an organization.
If you speak to someone from the product department, they will
go for the active users who actually, you know, use
the product. And maybe if you speak to your finance department,
they will go for someone who actually signed a contract
with you. So there are different ways of calculating things, and
(04:45):
especially for self-service, again, you're asking about things which
might have different meanings. You have to have governance actually
understanding the semantic business definitions of business metrics and business terms.
Speaker 2 (04:59):
I feel like that's a bit of a moving target,
from what I've seen in my experience. Like, I
remember being in an organization where I'm like, we figured it out,
you know, we managed to get a single source
of truth for even what a user is. Like, you know,
there's a lot of identity aspects for a user, but
then there's also a lot of business aspects to it,
like how great of a customer are they, repeat customer,
(05:20):
how much they've spent, and then, you know, individual events
related to it, and trying to get a single identity
for that was always potentially a problem. And then,
you know, every team in your organization has their own
idea of what a user is, and probably
each organization has their own user management service with its
own special data. And I just wonder how possible it is,
like, if there is a company out
(05:41):
there that actually has good data, you know, high
marks on a health scorecard for their data management, what
does that look like? You know, do
you actually see that in practice?
Speaker 3 (05:53):
So I would say, again, because we're operating in this space,
naturally it's possible. Otherwise we wouldn't have any customers.
Speaker 1 (06:01):
Right.
Speaker 3 (06:03):
On the other hand, this is a true challenge for many organizations.
And I would also say that agentic practice, generative
BI, is another silo, right? Because usually data
management departments have their own single source of truth,
maybe implemented in the data pipelines, ELT, so calculated
(06:24):
fields and so on and so forth, and then the analytics
department has its own single source of truth, which is
probably in the BI tool, feature store, metric store, what
have you.
Speaker 2 (06:33):
That's an interesting point, because we already see that with
the public agents that are out there. They're trained up
to, you know, if we're lucky, even six months ago,
and especially things in a business context are rolling very
quickly, what the principles of the organization are, or, you know,
what features need to be implemented for customers. Those things
are iterating very quickly, and so, you know,
(06:55):
this is a new problem that I actually hadn't
considered before, and it means models are fundamentally always out
of date. Just like your documentation and our
source code, right? It's all legacy as soon as it's made.
Does RAG, retrieval augmented generation, help here to reduce
the changing nature of those things, because you can point
(07:15):
to, potentially, the production data store? Or is there something
else going on where, realistically, you're copying that data?
Is RAG being used against stale data sources anyway, like,
you're not using the production database?
Speaker 3 (07:30):
It's a fair point. If RAG is a separate silo
in the system and you just, you know, keep feeding
it with examples which are all outdated the moment they're created,
then no, RAG is not actually solving the problem. If
you have it started, you know, as a sidecar in data science,
it will never live up to that. But if you
(07:51):
combine metadata management with your business ontology and use it for
your agentic pipelines, that is exactly what can be up to
date at that point of time.
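To make the contrast concrete, here is a minimal Python sketch of the pattern described here: reading the current business definitions from a metadata catalog at question time, instead of embedding them once into a static RAG index. The `metric_glossary` table and both helper functions are hypothetical, not Illumex's actual API.

```python
import sqlite3

def fetch_metric_definitions(conn: sqlite3.Connection) -> dict[str, str]:
    """Read the *current* business definitions from a metadata catalog.

    A static RAG index would have embedded these definitions once;
    reading them live means a definition changed today is what the
    agent sees today. The 'metric_glossary' table is hypothetical.
    """
    rows = conn.execute("SELECT name, definition FROM metric_glossary").fetchall()
    return dict(rows)

def build_agent_context(question: str, conn: sqlite3.Connection) -> str:
    """Assemble the prompt context from live metadata, not a stale corpus."""
    glossary = fetch_metric_definitions(conn)
    terms = "\n".join(f"- {name}: {text}" for name, text in glossary.items())
    return (
        "Answer strictly using these organizational definitions:\n"
        f"{terms}\n\nQuestion: {question}"
    )

# Demo with an in-memory catalog standing in for the real metadata store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metric_glossary (name TEXT, definition TEXT)")
conn.execute("INSERT INTO metric_glossary VALUES ('active user', "
             "'paid, non-trial user with 3+ sessions per week')")
print(build_agent_context("How many active users last month?", conn))
```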
Speaker 1 (08:01):
So the level of overhead to pull that off seems
pretty significant. Everyone feels these pains, but at what
point do you hit the scale or the quantity of
data where it actually seems like a worthwhile exercise to
implement this?
Speaker 3 (08:17):
I guess it's a healthy practice for any size of
organization, unless, you know, you just want to rely
on the one developer who maintains your Snowflake, what have you,
which is fine, okay. But I think, like, organizational knowledge
and its documentation was always a
(08:39):
challenge for any size of organization. To me, it's
a healthy habit to actually have it from the start, to
have, you know, the knowledge graphs about your different data
structures and the different data sources created from day one,
when you actually have data stores. But as I said, for organizations
which just have one database or one warehouse, a small one,
(09:02):
I would not necessarily invest in that, despite the fact
that I think it's not the best practice. Because right
now we will see a flourishing of agentic workflows where different
agents are going to communicate with each other, and they have
to have shared context and reasoning. And the shared context
and reasoning is exactly this knowledge which you should document
about your organization, right? So we become more and
(09:25):
more automated, and we should. And this is also a
differentiation, a business differentiation, between companies: whether you're actually advanced
in keeping your knowledge and building agents around that, or
you're not.
Speaker 1 (09:38):
Are there out-of-the-box agents to help
with this, or is it completely custom-built for everyone?
Speaker 3 (09:47):
I don't think it's feasible to really, you know, do
this implementation manually for any size of organization,
because, you know, data is exploding even for smaller companies.
We took the approach of actually predefining ontologies for different
verticals and different lines of business, and then an automated way
to sample metadata from different data sources, to understand
(10:12):
what the specific logic changes are for different systems, and basically
automatically create the business ontology. But I must say, to your
previous point, what we discover in automated onboarding, which
takes like a few hours to a few days, is that
there are many conflicts of definitions. You might not be aware
of that now, but you have ten different views in
(10:34):
your Power BI tool where you have the same metric
defined in different ways, based on different data sources, and
this is what we discover.
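A minimal sketch of that kind of conflict detection, assuming hypothetical BI metadata as (view, metric, SQL expression) tuples; the sample rows are invented for illustration:

```python
from collections import defaultdict

# Hypothetical metadata sampled from a BI tool: (view, metric, SQL expression).
views = [
    ("revenue_daily",   "active_users", "COUNT(DISTINCT user_id)"),
    ("growth_overview", "active_users",
     "COUNT(DISTINCT user_id) FILTER (WHERE plan <> 'trial')"),
    ("exec_summary",    "active_users", "COUNT(DISTINCT user_id)"),
]

def find_conflicts(views):
    """Group expressions by metric name; flag names defined in more than one way."""
    definitions = defaultdict(set)
    for _view, metric, expression in views:
        definitions[metric].add(expression)
    return {m: exprs for m, exprs in definitions.items() if len(exprs) > 1}

for metric, expressions in find_conflicts(views).items():
    print(f"'{metric}' has {len(expressions)} competing definitions:")
    for expression in sorted(expressions):
        print(f"  {expression}")
```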
Speaker 1 (10:41):
Yeah, that makes sense. I can see it going horribly
wrong asking your technical people about the quality of the data.
Speaker 3 (10:50):
We're all technical people, but I guess if you want
the business logic, let's say, all those agents
which automate customer support and so forth, to be automated on
top of your data, they should be aligned to the organizational knowledge,
which is not necessarily on the technical side. It's on the operational
side, so it's coming from those departments, not from
(11:12):
us technical people.
Speaker 2 (11:14):
Maybe to make this a little bit more concrete, do
you have, like, some canonical examples of what businesses are
often trying to answer with the storage of their data?
Speaker 3 (11:24):
So I would say, again, there is no business case
as solid for this as automation, agentic, also data copilots
and agentic automation workflows. Because if you are not
inspired toward automation, why would you sort out your data,
or why would you sort out your governance, right,
if you don't have automation? Like, this is the killer use case.
(11:45):
But as for the first killer use case, usually
companies tend to start with more of a knowledge center
and discovery, so basically search, right, all the customer support
functions, on one side. On the other side, we also have,
like, companies in pharma, for example, that are building digital
(12:07):
health platforms with agents. We have customers in the financial services industry
which implemented self-service quotation for third-party brokers.
So basically the use case would always be shortening
the time and increasing conversion rates. So there are always
business metrics for implementing those use cases in the
(12:28):
first place, but internal use cases first and then
customer-facing ones.
Speaker 2 (12:34):
And just thinking back to all the times that I
was working at companies and they were saying how valuable
their data would be, and collected everything, just waste in
storage, building up internal tables of just garbage from
years and years back. And realistically, I'm still waiting for
the point where we could be utilizing tools to even
(12:54):
evaluate that effectively. Because is there something to this? Is
there an area where it's like, oh no, actually, this
probably highly useless data does turn around and solve a
critical need for the company today, with the advent of
agents that can potentially utilize it in a much more
effective way than humans can, or shortly down the road, five...
Speaker 3 (13:17):
Ten years? Yeah, yeah, I believe so, because most
organizational data is unutilized, as I mentioned. So we see
that even the most advanced companies use maybe twenty percent of
their data on, you know, a relatively frequent cadence, and the
rest is unutilized, basically. And for that data, AI readiness
(13:41):
also means that you actually understand a few of the health
score components. So, for starters: is it duplicated, is it used,
and if it's used, by which applications, and is it sensitive, right?
So those are also the risk factors as well. And
then, last but not least, I think even
the most important one is, what is the semantic meaning
(14:04):
of that? What's actually hidden in this data, right? And
then that's why knowledge graphs become handy, because knowledge graphs
create or suggest connections. Say, oh, did you know
that an additional feature of your conversion score might come from
this customer demographic parameter, like such and such, right? So
(14:24):
it's kind of giving you related data which you
might not have encountered so far, like, by your experience. But
it's in there. The thing is in there right now,
just not covered. So companies usually do not index or
catalog data which is not used actively by applications, again
because cataloging was used for compliance and assurance, so to say,
(14:48):
for governance. And now we should use cataloging and
indexing actually for discovery. Discovery is semantic
mapping, risk mapping, and risk management. And of course, also by
definition, agents are better than humans at understanding those relations
at scale, and then they see
(15:11):
us humans as moderators.
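A small sketch of the health-score components listed here (duplication, usage by applications, sensitivity, semantic meaning); the dataclass fields and the sample asset are invented, and a real catalog would populate them from scanned metadata:

```python
from dataclasses import dataclass, field

@dataclass
class AssetHealth:
    """Per-asset health components: duplication, usage, risk, semantics."""
    name: str
    duplicate_of: str | None = None                          # duplicated elsewhere?
    reading_apps: list[str] = field(default_factory=list)    # which apps use it?
    sensitive_columns: list[str] = field(default_factory=list)  # risk factor
    semantic_description: str = ""                           # the semantic meaning

    def flags(self) -> list[str]:
        out = []
        if self.duplicate_of:
            out.append(f"duplicate of {self.duplicate_of}")
        if not self.reading_apps:
            out.append("unused by any application")
        if self.sensitive_columns:
            out.append("sensitive: " + ", ".join(self.sensitive_columns))
        if not self.semantic_description:
            out.append("no semantic meaning captured")
        return out

asset = AssetHealth("crm.contacts_2019_backup",
                    duplicate_of="crm.contacts",
                    sensitive_columns=["email", "phone"])
print(asset.name, "->", "; ".join(asset.flags()))
```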
Speaker 2 (15:14):
So you mentioned, like, twenty to thirty percent of the
data is being utilized, so seventy percent is, you know,
not utilized. Is that because of the lack of value
in it, lack of it being categorized effectively? Is there
another bucket that I'm missing here? There could be
something interesting in it, but no one's taking the time to
actually understand it. Because I get the sense that the
(15:37):
agents aren't going to be able to just see this
uncategorized data and magically pop up an answer of how
it could be valuable. It still requires a human to
evaluate it and know that there is something valuable in there.
How it could be utilized still has to be done
by a human.
Speaker 3 (15:53):
Yes, it's a good point. So in the databases which
already have some analytics on them and so on,
first, the role of the agent could be suggesting additional
features to look at, to take into consideration. In databases
where you do not have analytics at all, so, for example,
we have this discussion with a company, which never introduced,
(16:15):
and this is a huge company, which never introduced analytics
for the people department. There are no dashboards for the people
department right now. They want to skip the stage of BI
tools and go to self-service data copilots to actually
create these analytics, to kind of skip that stage, right?
So in this case the data is super valuable. Of course,
it does have to, you know, go through automated
(16:36):
labeling and reconciliation and semantic definitions and all of that,
and, you know, essentially we have tools for that now.
But here the case is, you had unutilized data, not
for the right reasons.
Speaker 1 (16:48):
So do you find that, after going through this process,
companies actually have less data storage concerns? Because, like,
you know, we talk about a single source of truth,
and a lot of times I've seen where everyone claims
theirs is the single source of truth. So they want
their own copy, their own database servers, their own storage
(17:08):
system, because they don't want anyone else polluting it. So
after you go through this exercise, do you find that
a lot of those can be decommissioned and you actually
end up with less data storage overall?
Speaker 3 (17:20):
It's a good question, because a single source of truth has to
be virtual, so it's kind of a virtual layer which
connects to your operational data sources, analytical data sources, and
even applications, because there's lots of business logic on the application side.
So it's always virtualized. And then the question is, do
you even need aggregating layers like warehouses? You know, we
(17:44):
see this question popping up more and more, and we say, okay,
so there are probably going to be stages, right? So
some companies are going to reduce. Companies in the IoT
or manufacturing space, they might want to reduce the size
of the data into some warehouse to basically have
more focused use cases, right, more focused and scoped use
(18:06):
cases, cheaper for processing, right? And some companies who might
have less data will just, you know, go without any
aggregation at all. And why I'm saying less data is because
storage is not expensive anymore.
Speaker 2 (18:21):
I mean, I feel like there's a whole systems-thinking
problem here, which is, just because it would be better
to have a single source of truth, it doesn't automatically make
the organizations, you know, migrate to that. I do see
the XKCD comic on the number of standards,
right? You know, we have
three databases with user identity, user tracking, metrics data
(18:41):
in it, and, oh, we should have one, you know,
unified answer, one perfect database that is sanitized, that is
categorized correctly. And the result is, now we have four
user databases, you know, with all the data in it,
and, you know, someone's still utilizing those old ones.
And to your point of storage still getting cheaper for us,
(19:01):
there is no justifier, and it takes effort, you know,
human time and resources, to actually decommission a database. I
can see that just not being encouraged to even happen.
What if there's something we missed in there that's
still valuable, that we could be utilizing to increase our
business even by, you know, a couple of percentage points?
Speaker 3 (19:21):
So I guess a virtualized ontology, which relies on a knowledge
graph which you can connect to many data sources, and which
indicates which data source and which table column you need
to use for a specific question, to me is something that
can, with time, help you to decommission specific data sources,
or in migrating, you know, systems to new storages. And
(19:43):
I heard this talk at Gartner last year where someone
was comparing Hadoop to data lakes or data lakehouses
and all of that. Because if you do not have,
like, the semantic layer, the business understanding of what's in it,
this big store of data doesn't really solve your problem.
Speaker 1 (20:03):
So what's the biggest driver for this? Does
it typically come to you from the business side or
from the technical side of the customer?
Speaker 3 (20:15):
Yeah, I think it's good news for the whole industry
that everything about agentic is coming from the business side.
And yeah, it's a good position to be in, because
this is where the money is, right, it's where the decision power
is, and so on. And you don't need
to explain the technology anymore, right? You don't need to
(20:38):
explain yourself anymore. Because, you know, I think it's the
same as what happened with the Internet, you know, in the
early two thousands with the dot-com boom: the
business side were, like, super inspired to create e-commerce
use cases and what have you. And that's what's happening
with agentic. It's the business side, already, you know,
(20:59):
inspired with all the capabilities of these new technologies,
that is actually inventing the use cases and building,
you know, the business drivers and calculations behind that. This was
usually the prerogative of technical teams, right, to come up
with a new technology and then find a compelling business case,
and now it's the other way around. On the other hand,
(21:22):
technical teams are struggling on their side to
provide this type of service that the business asks for,
because of low data quality, because of low
data readiness, and because of these multiple definitions, where, if
you connect agents to them, you know, it's a mess.
Speaker 2 (21:42):
So, I mean, it really seems like the innovation
here is a fundamental paradigm shift, from having business intelligence
and even data-centered engineers working within organizations, to completely
outsourcing the handling of any sort of data from your
production systems. Because, at the end of the day, they
(22:03):
were always sort of a bottleneck for delivering things. It
used to be, someone's like, I need a dashboard for this,
or being able to answer the question of how we
talk about monthly active users. Well, where is that data?
What does it look like? And then figure out utilizing
the tools to actually build the dashboards. Having the data
in a single place, all that had to be solved,
whereas now those teams don't necessarily need to be working
on that anymore. The data starts in the original application.
(22:25):
You don't want a middle layer. You want it given
to companies that understand how to sanitize it, what's relevant,
where the insights are, and that provide an interface for those
asking the questions to directly interact with the data in
an understandable way, rather than looking at dashboards that are
out of date or configured somehow, ultimately utilizing tools that
just don't really work that well, because there are too
(22:46):
many degrees of freedom, too many variables, too many columns
or pieces of data that all need to be displayed
depending on what you're actually looking for.
Speaker 3 (22:53):
Yeah, it's a good point, because a majority of all
decisions are ad hoc decisions on ad hoc questions. I
do believe that we'll still have space for KPIs and dashboards.
Like, I am starting my day with, you know, Google
Analytics and Salesforce and all of that, but I also
have, like, a bunch of questions which are not in
(23:14):
those reports, and they actually change as the data
changes every day, right? And I would like to have
a tool which can help me with that. And for that,
I think that we could be, as practitioners, like, as
analytics practitioners, we could be smart about it, because if
you actually get monitoring and agentic metadata analytics, you actually
(23:36):
understand what the interactions of users with the systems are,
and you will come up with dashboards which are actually useful,
right? Because the biggest criticism, well, from the end user, is
that you build, like, a bunch of dashboards that we
didn't ask for. And now, when they have this luxury
of asking their questions freely, you can monitor what they're asking about, right,
(23:57):
what they're asking for, and actually create analytics that is actually
needed and, you know, convert it into the dashboards.
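As a rough illustration of turning monitored ad hoc questions into dashboard candidates, here is a toy sketch; a real system would group questions semantically rather than by exact string, and the log entries are invented:

```python
from collections import Counter

# Hypothetical log of ad hoc questions users asked a data copilot.
question_log = [
    "active users last month",
    "active users last month by region",
    "churn rate Q2",
    "active users last month",
    "top customers by revenue",
    "churn rate Q2",
]

def dashboard_candidates(log: list[str], min_count: int = 2) -> list[tuple[str, int]]:
    """Questions asked repeatedly are the tiles a dashboard should actually have."""
    frequency = Counter(log)
    return [(q, n) for q, n in frequency.most_common() if n >= min_count]

for question, count in dashboard_candidates(question_log):
    print(f"asked {count}x -> candidate dashboard tile: {question}")
```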
Speaker 2 (24:04):
Now, I'm definitely more on the anti-dashboard proponent, or,
I guess, dashboard antagonist side. Like, I find there's something,
like, you're utilizing it as a crutch, and maybe a little
bit lazy, as far as not articulating what the challenge is
or the question you want answered. It's like, oh, I'll
just look at a dashboard of this information, maybe the
answer will pop out, where what you really want to
(24:24):
do is say, why am I looking at the dashboard?
What am I really looking for? And have the answer
to your question. You don't care that the number of
monthly active users is increasing over time, but maybe
you're utilizing that to figure out, well, where are the
biggest jumps? Why was there a jump here and there?
And so the question you want to ask is, where
are the biggest jumps, and what happened to my organization,
or to our customers, or competitors, or in the global
market, that caused the biggest change in the last six months?
(24:47):
And rather than looking at a graph that shows
basically the underlying data, you're getting the answer straight away, and
then you can actually take the next step. And I
guess one of the reasons I'm such a dashboard antagonist,
I just coined that term right now, is,
I mean, we focus a lot on high-availability systems
and high reliability. You can't
(25:08):
rely on a dashboard for telling you if your
system is up or down. Like, you need to know
deterministically what the answer is at any single moment. And
I think the only difference from a business is
it's more long-term. Although I think a lot of
companies delude themselves into thinking that it can be a
short-term answer, that I can just look at this
right now and automatically know what my next step is
that I should take. And it's a lot more, I
(25:30):
feel like, a deep dive into really understanding the correlations between
the underlying data stores.
Speaker 3 (25:36):
Yeah, I think dashboards are the way that you can
tell the story in a coherent way, usually
from experience, right. And when you're just asking
questions, you know, using your Slack or Teams, it could be random
questions without the context and without continuation. It would be,
like, just one-off questions, right? So in dashboards, usually,
(25:59):
when they're built right, the best ones, right, not
all of them, you actually have a phenomenon, right, a
measure, and then a bunch of widgets that explain,
like, where it's coming from, like segmentation and audiences and
so on. So basically it's a good way to
visualize changes. But again, this is also the way that
(26:20):
doesn't allow you to recognize new things. There might be
a new factor that's affecting what's in your dashboard, and
you will never know that, because it's not automatically recognized
and it's not popping up. Whereas when you use agentic,
you can know, okay, what are all the factors
influencing this spike, and then you will actually see, like,
(26:41):
what the related features are which can affect it, right? So yeah,
yeah, to your point, to me, the dashboard is a
good starting point. It's not an endpoint, but a starting
point where you can actually start your exploration, going further.
Everything is going to be data-driven, and, you know,
you don't have to have, like myself, fifty different tabs
open in your browser, and this might be affecting this upload speed.
Speaker 2 (27:05):
Yeah, just fifteen.
Speaker 1 (27:07):
Those are rookie numbers.
Speaker 3 (27:09):
Yeah, so I guess we'll not have to use
so many applications for everything, and, you know, learn
all those applications. It's about experience: I have a question,
I have a task, I want to complete that, and
I don't really care which applications and data are involved.
Speaker 2 (27:25):
Right, it's so, so optimistic.
Speaker 3 (27:27):
You don't think five years is realistic?
Speaker 2 (27:30):
I think that, because of competition, there's always the
segmenting of data, the availability of data in the market.
Like, there's no public Internet anymore. We've already seen the
closing-off of available data sources. Every single one of
these applications is going to have access to just smaller
pieces of data that are more focused, and you're still
going to have to go from app to app to
(27:50):
get these questions answered. And I think some companies are
trying to push forward some way of still having, like,
a single pane of glass to interact through, utilizing MCP,
the Model Context Protocol, or A2A, agent
to agent. Thank you, Google, for, you know, coming up
with something different. And, you know, even with that, you know,
we can't even standardize, you know, a single, you know,
(28:13):
paradigm for a protocol to communicate between agents. So, you know,
I think we've failed up to this point, you know,
in the year twenty twenty-five, for humans
to, you know, have one agreed-upon answer. I just
don't see it happening unless, you know, fundamentally your daily
driver changes. And I know on the software engineering side
we try to make it be the IDE of choice,
(28:35):
but even that, still, like, I don't think everyone is
spending all of their time just in that one tool.
You're still switching back and forth to different communication tools
and whatnot.
Speaker 1 (28:44):
No, no, I'm just thinking, like, everyone's in favor of
a single pane of glass, as long as I'm the
provider of the single pane of glass.
Speaker 2 (28:51):
I think that sort of maybe brings into question the
raison d'être, like, of the existence of the company, like,
what are they doing that is fundamental? Like, what
is it that they're really trying to sell? And I
feel like a lot of companies out there, they just
copy each other, like, they're not creating something unique there.
So I still see there always being an opportunity to
own the data and sell it. And I think maybe
(29:12):
this goes back to the question of, if you're not
providing something unique, then other companies can spin up and
still own the data, and why not pay some
company to provide you with the answers to questions and
manage all the data? And I think this has been
a model that has existed, certainly with, like,
user research groups, for instance, think tanks, consulting companies that
(29:35):
come and tell you how to just do your business,
exactly what the data should be and everything. So I
don't know. Like, even if, in your own company, you
have a lot of data and you're like, we know
how to utilize the data most effectively, we can go
and hire an engineering team to go create that single
pane of glass, eventually you're like, well, other companies can
use that pane of glass too, we'll start selling it.
(29:56):
And then that company becomes just the seller of a
pane of glass. So, you know, that's... I was
just going to keep on going, but it really
does bring up the point where, if another company
can answer all of your business questions for you,
what is there left to still be able to do uniquely?
Speaker 3 (30:13):
So every organization has their own proprietary data, based on
the nature of the business and the customer base. And even if
someone comes and says, okay, right now we are going
to bring you into standards about how to do business,
you'll still customize it with what only you have. Like, this
is your advantage, on one side. On the other side, yeah,
(30:35):
if you have very special data and you want to
sell it, you might want to sell it in a machine-readable
format which cannot be reverse-engineered, right? So you
do not sell data by tables, right,
but you sell your data as its semantic meaning,
so basically as a machine-readable format, so other algorithms and
(30:57):
agents can use it, but they cannot decipher it, right?
I think it's actually the most secure way, right, to
share data for specific use.
Speaker 1 (31:08):
Speaking of which, what are the security concerns that you
deal with whenever you have an agent that has access
to all of these different data sources?
Speaker 3 (31:18):
So in our organization, we actually chose to separate data
values from data concepts for agents, so agents only have
access to data concepts, and then, when the query is generated,
it runs in a separate environment on the data values.
And Illumex does not ever touch the data values of our customers;
(31:39):
they're just displayed in the customers' applications. So we have total
separation between agents and the data values themselves. This
approach actually allows you not to be concerned about
data leakage or anything like that. I think every
company will decide for themselves if, you know, on-premises
deployment is right. I would never actually vote for that,
(32:03):
because models are advancing so fast, and you have limited capability
to upgrade them if you go for the on-premises
deployment rather than just using APIs, which always go forward
and so on and so forth. But, you know, everyone will
make their choices, again, based on sensitivity and the private
nature of the data. It's probably going to be leveled some
way, like we have, you know, cloud
(32:25):
storage along with on-premise and, you know, different governance practices.
It's going to be the same for agentic. Right now,
it's more and more transparency about what's moving where, and
more prudence from the company side, like, what's critical
and what's sensitive for them to basically send to third
parties, or what could be kept inside.
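A schematic sketch of that separation, with the vendor side seeing only concepts (schema plus ontology) and the customer environment alone touching values; the function names and the ontology entry are hypothetical, not Illumex's actual interface:

```python
# Vendor side: the agent sees only concepts (schema + ontology), never rows.
def generate_query(question: str, ontology: dict) -> str:
    """Produce SQL text from metadata alone.

    The direct dictionary lookup stands in for an LLM grounded in the
    business ontology; either way, no data values are available here.
    """
    metric = ontology[question]
    return f"SELECT {metric['expression']} FROM {metric['table']}"

# Customer side: the only place the generated SQL meets data values.
def execute_in_customer_env(sql: str, connection):
    """Runs inside the customer's boundary; results render in their own app."""
    return connection.execute(sql).fetchall()

ontology = {
    "how many active users?": {
        "expression": "COUNT(DISTINCT user_id)",
        "table": "events_monthly",
    }
}
sql = generate_query("how many active users?", ontology)
print(sql)  # the vendor side ends here; no data values ever cross back
```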
Speaker 2 (32:46):
I really like that perspective. If it was
ever true in the past where you could be profitable
with an on-prem data center storing all your data
and running all your compute, that must be less and
less true every day, and you'd have to be doing
something very special for you to find value in that,
because technology is iterating even faster now, right? Like, any
(33:08):
argument you would have had in the past is now
no longer valid. And so, like, I'm totally with you.
I don't understand, even ten years ago, how
people were justifying on-prem solutions, and now it's, like,
even less of the case.
Speaker 3 (33:22):
It's cost-prohibitive to move data to AI. You need
to bring AI to data, and if your data is
on-premise, it means you need to bring AI
on-premise. And this is, like, you know, a complicated and
not very efficient way to deal with things. But, you know,
you make your decision by some risk management, I guess.
Speaker 2 (33:40):
AWS has the Snowmobile, uh, where you, you know, just transfer
the, you know, get the USB sticks on a giant
truck and, you know, drive to the data center. And,
you know, I think that can work out.
I mean, I think, if anything, now it must be less
about the amount of data
you have and more about the rate of data creation. And I
(34:00):
can see that in a manufacturing plant
or in healthcare, like, the number of sensors is increasing
all the time. You know, I'm wearing one here and
I'm thinking about getting another one, and they're just going
to increase more and more, and so with that increase,
you need to be able to handle it much more effectively.
I think storage costs coming down at the cloud providers
(34:21):
is probably the next innovation that will happen there. We
just saw AWS's S3 One Zone drastically reduce costs,
by, like, eighty-five percent, there, and I think we'll
continue to see that as storage costs decrease over time,
so it will just become more and more feasible to
put data in the cloud, closer to the agents.
Speaker 3 (34:38):
Yeah, processing has always been a bigger concern than
storage, for many, many years now. And I think this is also
something that many of us saw in the news, like,
last month, two months ago, about DeepSeek,
so how much it actually costs to train a model. We
(34:58):
actually have AI inference running, so the cost of
processing is shifting from training, to use of the models,
to the inference itself, and that's where I see that
the majority of funds is going to be spent, actually
using agentic on the data. And again, whether it's more
efficient in the cloud or in data centers, I would say
(35:22):
it's going to be more efficient in the cloud, because,
especially if you're not locked into a specific provider, there's going
to be more and more competition there. And especially
when you can recognize what data is garbage and what's
not, and kind of limit the footprint, all
the costs are going to come down, I think. Right now,
we spend lots of money already on data pipelines
(35:45):
which are duplicates of each other and not always feeding
information that we actually use in an application. But because companies
do not monitor the metadata, they don't know what's in
use and what's not, right? So we already have, like,
spend which is predefined, and you pay anyhow, whether you use your dashboards, whether you use applications.
(36:06):
And if you're not, you're still paying for the data pipelines
that you have. So agentic might replace these habits,
by actually invoking information and processing that you use,
and not what is predefined for you by someone's assumption.
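A toy sketch of the metadata monitoring described here, flagging pipeline outputs that no application has read recently; the table names, timestamps, and 90-day threshold are invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical access metadata: pipeline output table -> last read by any app.
last_read = {
    "warehouse.daily_revenue":   datetime(2025, 6, 1),
    "warehouse.legacy_sessions": datetime(2023, 2, 14),
    "warehouse.user_segments":   datetime(2025, 5, 28),
}

def stale_pipelines(last_read: dict, now: datetime, max_idle_days: int = 90) -> list[str]:
    """Flag pipelines you keep paying for although nothing reads their output."""
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(table for table, ts in last_read.items() if ts < cutoff)

for table in stale_pipelines(last_read, now=datetime(2025, 6, 15)):
    print(f"{table}: unread for 90+ days, candidate to decommission")
```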
Speaker 1 (36:25):
This is probably an unpopular opinion, but I think we're
going to look back decades from now and say that
making storage so inexpensive was the worst mistake we
ever made.
Speaker 2 (36:37):
I mean, that's the Jevons paradox, right? I mean, anything
that we don't want to have, we should not make
more efficient, because we will eventually overutilize that thing. Yeah,
I mean, that happened in, I think, really the
industrial age, especially in England with coal mining. Yeah, I mean,
for sure, people are utilizing our systems in
(37:02):
abusive ways. Cloud providers have to have a strategy for
dynamically swapping out hard drives as they fail, because we
haven't improved the reliability of them, just...
Speaker 1 (37:13):
Just the size.
Speaker 2 (37:14):
Yeah. Right, and, you know, that's sort of a problem.
I mean, I think it's a science fiction ideal that
we figure out how to inscribe and write and utilize
data in, sort of, like, a pure-energy electromagnetic,
you know, constrained field inside, like, diamonds or something. I mean,
it'd be nice, honestly. Well, you're going to start working
on that, yeah?
Speaker 1 (37:34):
Probably not.
Speaker 2 (37:40):
Where are the customers?
Speaker 3 (37:42):
Where are the customers? Well, my next gig is going
to be in longevity, for sure. I think it's a fascinating field,
and I think we need more data to actually
have a breakthrough in those fields. But yeah, yeah, I think
data volumes are not necessarily a bad thing. But it's
not about data volumes, it's about data variety. I wouldn't
say, like, actually having big data is the advantage, but actually
(38:05):
having rich data, right?
Speaker 2 (38:07):
So that's a really good point, actually, that I don't
think anyone's brought up on the show before. I actually
have a colleague that looked into the connections between networks,
human networks, but I think it applies here. As
you said, it's not about the volume, the amount that
you have, but there's some arbitrary aspect of the
data that's, like, super critical here, which is the,
(38:27):
say, connectivity, but also the sparseness of it.
I don't think there's a metric for that, for
what that is. Maybe you're calling it something special.
Speaker 3 (38:37):
We call it interoperability, maybe not the best word for
that intent, but it means, actually, like, for data
assets, like a table, you can have different types of analysis,
or for an analysis you can use, like, different assets to
feed into it. So interoperability is the ability to
match different features between different sources, and that is complementary.
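A minimal sketch of cross-source feature matching in that spirit; a production system would presumably use semantic embeddings, so the plain string similarity and the two toy schemas here are stand-ins:

```python
from difflib import SequenceMatcher

# Hypothetical column descriptions from two unconnected sources.
crm = {"acct_region": "customer geographic region",
       "acct_tier":   "customer spend tier"}
web = {"visitor_geo": "geographic region of the visitor",
       "session_len": "seconds spent in a session"}

def match_features(a: dict, b: dict, threshold: float = 0.6):
    """Propose cross-source feature matches by description similarity."""
    pairs = []
    for col_a, desc_a in a.items():
        for col_b, desc_b in b.items():
            score = SequenceMatcher(None, desc_a, desc_b).ratio()
            if score >= threshold:
                pairs.append((col_a, col_b, round(score, 2)))
    return sorted(pairs, key=lambda p: -p[2])

print(match_features(crm, web))  # e.g. [('acct_region', 'visitor_geo', ...)]
```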
Speaker 2 (39:03):
Yeah. So, okay, so I have to ask about this.
Some of the marketing for your company says that you
don't have any hallucinations, and we know that hallucinations
are coupled to utilizing a straight transformer architecture. You know,
if you're using a transformer architecture, you must have hallucinations. So
you must be doing something special that other companies aren't utilizing,
(39:23):
you know, different from what the LLMs are building.
Is that something you can talk about?
Speaker 3 (39:29):
Yeah, sure. So our approach is to ground a single
source of truth in your knowledge graph, right, in your
business ontology, which is transparent, right, as the business ontology is represented
as this knowledge graph of semantic embeddings. So, for starters, we
already have kind of an organizational agreement on what the business logic is. Yeah,
and in addition to that, we actually ground your experience
(39:53):
only to the business ontology, right? So we
reduce the degrees of freedom of the model, not
to think, you know, widely about the universe, but to think
about your company's universe. So when you ask
a question about active users, it will not think about the
Wikipedia definition of active users; it will think about your business
metric definition of active user, maybe coming from your
(40:15):
BI tool, right? So it's really grounding the experience of
the users in the single source of truth of your organization.
In addition, because we always build the ontologies based on
the metadata, we understand the context much better. We do
not only understand the context of user interaction within a specific
(40:35):
memory frame in the copilot. We also understand the
user's interactions with any system which is connected to
Illumex, right? So it's basically previous interactions with operational systems,
with analytics systems. So our context is much wider, and
we can have a much more personalized experience for the user
(40:55):
based on this metadata access. And the third reason is,
because, okay, so the first reason was the business ontology,
the single source of truth grounding for the experience. The second one
is personalization. And the third one: because we do have
business ontologies which are complementary, like, in the language
and so on and so forth, before the customization for a
(41:17):
specific company, when users use language which is different
from the business metrics definitions in the organization, we can
pick it up from our, you know, generic ontologies for
this vertical. Because people switch companies, they might use different lingo,
different abbreviations, sure, which are not necessarily implemented in this company.
(41:40):
And we have not only user context, but we also
have industry context, so we can pick up this language.
So those three reasons allow us to have a much reduced hallucination
experience, on one side. On the other side, it's a
very, very narrow space, so you cannot ask it
(42:00):
about the weather; you can only ask it about your
connected data.
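A toy sketch of that grounding step, resolving user phrasing (including industry lingo) to a canonical ontology entry before any generation happens, so the model only ever sees the organization's own definition; the glossary, synonym table, and cutoff are invented:

```python
from difflib import get_close_matches

# Hypothetical business ontology: canonical terms plus industry synonyms.
ontology = {
    "active user": "paid, non-trial user with 3+ sessions/week (source: BI tool)",
    "churn rate":  "cancelled subscriptions / total subscriptions, monthly",
}
synonyms = {"mau": "active user", "logo churn": "churn rate"}

def ground(term: str):
    """Resolve user language to the org's single-source-of-truth definition.

    The model is then prompted only with the matched definition, so it
    cannot reach for a generic (e.g. Wikipedia) notion of the term.
    """
    canonical = synonyms.get(term.lower(), term.lower())
    hit = get_close_matches(canonical, ontology.keys(), n=1, cutoff=0.6)
    return (hit[0], ontology[hit[0]]) if hit else None

print(ground("MAU"))           # industry lingo -> ('active user', ...)
print(ground("active users"))  # company term   -> ('active user', ...)
```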
Speaker 2 (42:06):
Not utilizing as much of a probabilistic model as other
companies that have built their own foundational models.
Speaker 3 (42:15):
We haven't built foundational models, but we use dozens of
semantic models and two dozen graph models for different
tasks, from onboarding to the user experience and explainability, to
provide this type of experience. And we always keep an
eye on the latest and greatest, so when
the new models come out, we test them and see
how we can embed them in our ensemble, and it
(42:38):
helps us to, you know, increase accuracy over time. But
I think the biggest thing is, we give the
ownership of the context and reasoning to the organizations that
we serve. We automatically build it for them, and from
then on, they are the owners of the context and reasoning.
And if they want to plug in tomorrow whatever NVIDIA
or anyone else ships, they can use this context.
So it's kind of theirs, it's transparent, and it's usable.
So for us, this is the biggest benefit.
Speaker 2 (43:04):
Actually, so there's still a chance that it
will hallucinate. It's just very, very low, and it will
stay within the context of the business domain.
Speaker 1 (43:13):
No, no, it's not a hallucination. It's a guided spiritual journey.
Speaker 3 (43:20):
You mean, if you do have many versions of truth?
So, for example, Janna has just introduced a new definition of
active user in a new dashboard. It gets picked up, and
if someone asks about active users, we might offer, like, okay,
there is a new definition in your BI dashboard. Would you like
to get an answer based on that?
Speaker 2 (43:36):
Well, as long as there's a probability in how you
generate the answer, there's always a chance for
it to just make something up, even if
you have tried to constrain it by actual definitions.
That's just a fundamental aspect of probabilities. So, I mean,
while you can definitely reduce it and eliminate duplicate definitions,
(43:58):
there's a whole other part of the transformer architecture which
fundamentally requires the creation of hallucination. It's like, I don't
know that you can have a transformer architecture without that.
Speaker 3 (44:10):
Again, it's a good point, and we provide explainability about answers.
So it's not like you're asking a question
and just get numbers as answers; we actually provide full
explainability, like, this is how we understood the question,
this is the semantic entity that we mapped this question
to, and this is the logic, and all of that. And
if a user would like to base the answer on different
logic, they can actually choose, like, the not-autopilot mode, and see, okay,
(44:33):
these are the related semantic entities for the question. You know,
you can pick from them if you'd like to. Really,
like my husband: he drives an Alfa Romeo MiTo, manual stick, right,
so he will always prefer to have better control.
He's just back from Italy, so it's like, those roads are
also created for manual driving. So it's like, some data
(44:54):
is created for manual selection, probably, right? If it's, like,
super AI, you might want to select it manually,
I would say. Luckily, we will, of course, as an industry,
be more and more automated. You know,
some people just like more control.
Speaker 2 (45:11):
I think it's, like, the idea of control, more
so than actually being in control. Like, you know, you
don't want the manual stick shift.
You want to be told it's a manual stick shift,
but if you mess up and do the wrong thing,
the right thing still happens.
Speaker 3 (45:24):
That's the choice, always, you know. Yeah, we
have systems like ABS and all of that to keep us safe.
That's true.
Speaker 1 (45:32):
Do you want to shift the gears but you don't
want to dump the clutch?
Speaker 3 (45:36):
Probably awesome, but not over, like, Como, you know, two
hundred meters above the water.
Speaker 1 (45:41):
And no, right, not really awesome. So it feels like
this might be a good place to roll into picks.
What do you think? Okay, let's do it, Warren. Yeah,
you're never gonna guess what's happening next.
Speaker 2 (45:54):
Okay, I'm going first. Yeah, so I got a really
controversial good one here. Yeah, like, I like,
I like it. So there's this great article that I
read through. It's short, it's short-form, so it should
be easy for anyone to get through. It's basically the
idea of how intuition is being used in software engineering,
(46:15):
and whether or not LLMs are capable of intuition. And
it is actually a proof that shows we can't have
AGI with the transformer architecture. Our LLMs will never be able
to reason. And it utilizes Gödel's incompleteness theorem, the
non-computability of intuition, and the computability of Turing machines,
(46:36):
and just with that, we can actually prove fundamentally that
we can't have AGI with our current systems. We haven't
gotten any closer to that. So don't listen to
the lies that people have been sharing from massive, quote
unquote, AI companies, because the real argument here is that,
in order for us to have AGI, you need
(46:57):
to introduce intuition, and that's the exact thing that's lacking
in Turing machines.
Speaker 3 (47:02):
Question to you: are you born with it?
Speaker 2 (47:04):
Yeah, well, there is the...
Speaker 3 (47:07):
Just answer that: are you born with it or not?
Speaker 1 (47:10):
Answer the question, Warren. Yeah.
Speaker 4 (47:12):
It means experience, really, so, yeah. Yeah, I mean, it's
really hard to identify even what happens in an individual,
let alone whether we can believe it of an external system.
Speaker 2 (47:24):
Luckily, the Turing machines, we know, you know, are closed systems there.
Speaker 2 (47:30):
I likened it to this great quote, which will
be a future pick in a future episode:
a parrot reciting Shakespeare. That's LLMs today. And
you would never claim that a parrot, you know,
would fully understand, you know, what it's reciting there.
And that's, unfortunately, the extent of our technology. That's
my pick.
Speaker 1 (47:50):
Inna, you're up. What did you bring for a pick?
Speaker 3 (47:54):
For a pick? Well, I wasn't ready for that, but I must
say, yeah, so let's speak about AGI as well. I
do not believe in AGI in the next five to
ten years, at least, just given the fact that only humans
are capable of applying context from one experience to a different experience.
(48:17):
So I do not call it intuition, because intuition, to
me, is just experience. But our ability to merge contexts
which are vividly not connected is where the human spark
is, and to me, AGI is not going to be
near that in the foreseeable future, let's say ten years.
(48:37):
So think about it: you can apply your knowledge from cooking
to your knowledge of, right now, you know, coding or
something like that, like, what the ingredients are and so on
and so forth. So our associations work differently than machine associations,
and this context merge from unrelated experiences is something that machines
are not good with.
Speaker 2 (48:56):
I like how you went to the philosophy side of this,
you know. There's this idea that the
universe is deterministic and that everything is connected through the
collapse of the, you know, Schrödinger's equation, the wave function,
and, uh...
Speaker 3 (49:08):
It's called religion. Yeah.
Speaker 2 (49:10):
Oh I was ready to go there.
Speaker 3 (49:12):
Yeah, yeah, I do.
Speaker 2 (49:17):
I do agree. You know, fundamentally, there's something missing from
the computing systems that we build today in order to
actually achieve AGI.
Speaker 1 (49:25):
My pick's going to take this down a whole big
notch, because, you know, we were talking about the
number of tabs that you have open
in your browser. For the last couple of months, I've
been using the Arc browser, and, specific to that conversation,
one of the things Arc does is any tab that you
haven't touched in the last thirty days, it just closes
it for you.
Speaker 1 (49:48):
And I used to have a bunch of tabs open,
and I was like, okay, I'm going to try this.
I'm gonna hate it. I'm going to figure out how
to turn that feature off, or I'm going to quit
using it. After several months, it's closed probably hundreds of
tabs for me, and I've not noticed. So go
try out the Arc browser. You don't need all those tabs.
Speaker 2 (50:07):
I think I'm having a little bit of a neurological
meltdown just hearing about that feature.
Speaker 1 (50:13):
Well, right, it's panic-inducing, yeah, for sure, definitely.
Speaker 2 (50:20):
I mean there are tabs that I actually leave there,
that I know are there and I don't want them
to go away.
Speaker 1 (50:28):
Cool. I have a follow-up question for you, Warren,
though. You mentioned that you're currently wearing one
IoT device and you're thinking about getting another one. What
are you wearing, and what are you thinking about getting?
Speaker 2 (50:40):
Yeah, so this isn't my pick, but I'm wearing the
Google Pixel Watch 2. I would definitely not recommend anyone
to get a smartwatch, ever. So that's one. So I want a
replacement. This thing disturbs me while I'm
sleeping, and I would really like to get my sleep metrics,
and so I've been looking at alternatives there. The
(51:01):
number one is the Oura Ring. I think they're
on edition four, but it's a subscription-based thing, and
that really rubs me the wrong way. So I'm
not interested in that, realistically. But that's why I'm
hoping for a good ring without a subscription to show
up, one that I think I'd be okay wearing to bed.
Speaker 1 (51:17):
No, I was just curious. For my watch,
I have a Garmin Fenix, which is technically a smartwatch,
but I have everything turned off on it. The only
thing I use it for is heart rate and, uh,
metrics whenever I'm out for a run. And I had
an Oura Ring for a while, and same
(51:39):
as you, it's like, I don't want to pay a
subscription for it. And I know a lot of people
who use the Whoop band, but it's another subscription-based service.
And they're like, just let me buy it
and go on with my life. Is that cool with you?
But trying to push updates out onto devices that
people, that you don't have access to? Like, that's
(52:02):
not a fun place to be. Even from my days
of supporting mobile apps, just trying to get people to
update was frustrating. All right, you know, thank you
so much for joining us today. This has been a
lot of fun.
Speaker 3 (52:17):
Likewise, I really enjoyed the conversation.
Speaker 1 (52:19):
Warren, as always, thank you. Appreciate you being on the show. Yeah,
of course. And to all the listeners, thank you very,
very much, because you're kind of the reason that we
do this. So hopefully you enjoyed this. If not, you
know how to find us and let us know, and
we'll see you next time.