Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
Are AI-powered smartphones truly ready to replace professional
cameras, or is that just Silicon Valley hype?
Today on CXO Talk episode 872, we're going behind the scenes
with two former Apple engineers who created the iconic Portrait
(00:21):
mode on your iPhone. Now, as founders of Glass
Imaging, Ziv Attar and Tom Bishop are using neural networks to
extract stunning detail from smartphone cameras.
Ziv, tell us about Glass Imaging.
We want to revolutionize digital photography, specifically using AI.
(00:43):
And just to clarify what I mean by that, it doesn't just mean taking
images from cameras and making them better.
That's something that, you know, a lot of people are doing in
various industries, including smartphones and drones and
photography in general. Our goal in the longer term
is actually to drive changes in hardware that are only enabled
(01:05):
by the AI that we're developing. So for example, you can make
smaller cameras, cheaper cameras, but reach higher
quality. You can make
thinner cameras, if we're talking about smartphones, that fit into
this form factor which is very limited in its size, and
(01:25):
also enable new types of optics and new types of camera
module architectures that give you some type of significant
benefit.
Tom, we hear terms like computational photography. Take us behind the scenes.
What are we talking about there?
In traditional photography, you have a camera and you get an
(01:46):
image out at the end, and maybe you can adjust it in an editing
program like Photoshop. With computational photography,
there's a lot of power that is put into creating that image,
typically on the camera or in algorithms that are embedded into
the digital processing on the camera.
(02:07):
What that enables is collecting a lot more information and
extracting more information from the scene that's there to give
you a much better quality image. What it's actually doing is
typically combining various captures together, pushing the
limits on the hardware capabilities, or even designing
(02:30):
new kinds of hardware, lenses and sensors and so on, that can
benefit from using algorithms to get a better finished image.
So it's rethinking the whole camera from a computational
perspective. So when you talk about something
like portrait mode, which you were both involved
(02:51):
with, where does computational photography fit in?
Or more generally, maybe you can just give us a brief primer on
how something like portrait mode works, just as an example.
Portrait mode is an example of computational photography
because it's not something that existed before we had enough
(03:12):
compute power. What portrait mode does is
basically it takes an image, a normal color image like
existed before portrait mode, and it uses multiple
cameras, it's called stereo,
to create a depth map. So now, when you have a color
image, you have another dimension to that image, which
(03:33):
is the depth, the distance of each pixel.
Obviously it's not perfectly accurate, but it's
good enough to mimic the out-of-focus blur which you have on
a big camera. So that was the intention with
portrait mode, basically to mimic what happens on a big
camera. On big cameras it happens optically:
when something is not in focus, it gets very blurry,
(03:56):
but it's also a nice thing. It kind of helps separate the
background and foreground. And it's what sets
smartphones apart from professional big cameras.
So portrait mode was an attempt to bridge that gap and basically
give smartphone users the ability to blur the background.
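To make the idea concrete, here is a minimal sketch of depth-driven background blur in Python. It assumes a color image plus a per-pixel depth map (the kind produced by the stereo step described above) and simply blurs pixels more the farther they sit from the focus plane; it is an illustration of the concept, not Apple's actual Portrait mode pipeline, and the function and parameter names are made up for the example.

```python
# Minimal sketch of depth-driven background blur (illustrative only, not
# Apple's Portrait mode pipeline). Assumes `image` is an HxWx3 float array
# and `depth` is an HxW array of per-pixel distances.
import numpy as np
from scipy.ndimage import gaussian_filter

def synthetic_bokeh(image, depth, focus_dist, max_sigma=8.0, levels=6):
    """Blend between progressively blurred copies based on |depth - focus|."""
    # Normalized defocus amount per pixel, 0 at the focus plane.
    defocus = np.clip(np.abs(depth - focus_dist) / depth.max(), 0.0, 1.0)
    # Pre-blur the image at a few discrete strengths.
    sigmas = np.linspace(0.0, max_sigma, levels)
    stack = [image if s == 0 else
             np.stack([gaussian_filter(image[..., c], s) for c in range(3)], axis=-1)
             for s in sigmas]
    # Pick the nearest blur level per pixel and gather from that copy.
    idx = np.round(defocus * (levels - 1)).astype(int)
    out = np.zeros_like(image)
    for i in range(levels):
        mask = idx == i
        out[mask] = stack[i][mask]
    return out

# Example with synthetic data:
rng = np.random.default_rng(0)
img = rng.random((120, 160, 3))
depth = rng.uniform(0.5, 5.0, (120, 160))
result = synthetic_bokeh(img, depth, focus_dist=1.0)
```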
Now back to your question, Michael. You're asking
about portrait mode, but also more generally about
(04:19):
computational photography. Computational photography has
been around, I would say, pretty much since, or a
little bit after, smartphone photography started.
And you could say that digital cameras are all
computational, because they have what's called a Bayer
pattern on the sensor and you have to interpolate some data
that's not there, and that's compute, right?
(04:42):
But I would actually call it computational photography
when you start to do tricks like HDR and you start to do
complicated denoising and multi-frame image fusion, which
is happening on any smartphone today and in some
other products as well. It just means you're
using heavy compute to do things.
(05:04):
Now this is not new. This has been going on for, I
would say, let's say 10 years or so.
And of course every year the chips on phones or computers,
laptops, are getting much stronger.
So you can use more advanced computational photography.
But what we're seeing, and this also comes in kind of two
phases, is the use of AI, more specifically neural networks,
(05:26):
in this space. We started seeing use of AI
even on smartphones quite a few years ago.
For example, things like face detection, segmentation:
this is sky, this is skin, we want to have a smoother sky,
we can be more aggressive with denoising, for example.
So all these semantic things, but they happen at low
(05:47):
resolution. So if you have an image from
a phone and you want to say, hey, there's a face there,
and then we want to do something with the face,
you can run a face detector on a thumbnail image.
You don't need to have the full res.
So that low-res AI started a good few years ago.
Now, in the last year or two, we're at a point where AI,
(06:08):
or the AI engines on phones, are able to actually handle full-res
AI compute. And that opens up
possibilities for doing a lot of very innovative and very
impactful things like what we're doing at Glass.
We'll talk about it in a little bit more detail in a minute.
The term computational photography, I think, came out
(06:30):
with regard to smartphones maybe 5-8 years ago, when they first
started fusing together multiple images.
So they capture a burst of images, and then, to reduce noise
and increase detail and maintain sharpness, they try
to fuse together, incorporate, information from that sequence
(06:51):
of images to get one still photo.
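A toy version of that burst fusion step, under the assumption of a simple global shift between frames: align each frame to the first via phase correlation, then average. Real smartphone pipelines use tile-based alignment and robust merging, so treat this only as a sketch of why fusing a burst reduces noise.

```python
# Toy burst fusion: globally align each frame to the first via phase
# correlation, then average. Real pipelines use tile-based alignment and
# robust merging; this only illustrates the noise-reduction idea.
import numpy as np

def phase_correlate(ref, frame):
    """Return the integer (dy, dx) shift that re-aligns `frame` onto `ref`."""
    F = np.fft.fft2(ref) * np.conj(np.fft.fft2(frame))
    corr = np.fft.ifft2(F / (np.abs(F) + 1e-12)).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Interpret peaks past the halfway point as negative shifts (wrap-around).
    if dy > ref.shape[0] // 2:
        dy -= ref.shape[0]
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return dy, dx

def fuse_burst(frames):
    """Average a burst of grayscale frames after aligning each to frames[0]."""
    ref = frames[0].astype(np.float64)
    acc = ref.copy()
    for f in frames[1:]:
        dy, dx = phase_correlate(ref, f)
        acc += np.roll(f, shift=(dy, dx), axis=(0, 1))
    return acc / len(frames)

# Simulate a noisy, slightly shifted burst and fuse it: noise drops ~sqrt(N).
rng = np.random.default_rng(1)
scene = rng.random((64, 64))
burst = [np.roll(scene, (i, i), axis=(0, 1)) + rng.normal(0, 0.1, scene.shape)
         for i in range(5)]
merged = fuse_burst(burst)
```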
And for a while, to many people, that was computational
imaging. But Ziv mentioned quite
a lot of other use cases, and the things we're now doing with
computational photography are really a new paradigm:
how do you design the whole system?
How do you process with AI? How do you manipulate the
(07:13):
hardware, even, to give you a better image, knowing that you
have this computational power available to you?
You've both mentioned hardware and software.
Can you drill into that a little bit, and then, Tom, maybe
you can show us some example images?
(07:35):
But first, I just want to tell everybody that you can ask
questions. If you're watching on Twitter,
pop your questions onto Twitter, onto X, using the
hashtag #cxotalk. If you're watching on LinkedIn,
just pop your questions into the chat.
So, Tom, just drill into this a little bit more.
(07:59):
So when we're talking about use of AI to improve image quality,
that's actually something, in the case of a phone, that's running
typically on the phone itself; you know, most of the processing
happens on the phone to produce an image.
There are some cases coming to the forefront today with, you
(08:21):
know, generative AI, where there are services available to
process images after the fact. But when you capture an image,
the sensor captures raw data, which is really just the
light that hits the sensor and the electronic signal that
comes from that. There are a lot of processing
steps that typically occur to give you the finished image
(08:42):
you can view on a screen. Those are typically done by software
engineers who create algorithms, many different blocks that go
together to try and increase sharpness, reduce noise, get the
right colour and so on. And then there are many knobs
they have to adjust in order to give a good quality image.
(09:03):
What we're doing at Glass Imaging, actually, is trying to
create a neural network that's an end-to-end AI solution that
replaces all of those algorithms with one process that can
extract the most quality from a given image, and it's adapted to
(09:25):
a particular camera as well. So in the traditional image
processing sense, all that demosaicing, denoising,
sharpening and detail extraction, correcting for lens
issues, those are all things that we can now do with AI, and
(09:46):
we run it on the camera on the smartphone to extract the best
possible image.
Check out cxotalk.com. Subscribe
to our newsletter so that you can join our community.
We have really awesome shows coming up.
What about the quality of traditional photography,
using lenses, of course, versus the quality that you're able to
(10:09):
get through AI? And would it be
correct to say that this is an artificial process, as opposed to
traditional photography using lenses?
I wouldn't say it's artificial.
There is interpolation going on, and there has been,
basically, for any color camera that ever existed. Like I
(10:32):
mentioned earlier, there is a color pattern on the sensor, even
on big, you know, traditional Canon or Nikon cameras, which
means that in every pixel location you have either red or
green or blue information. Which also means that if you're now
at a green pixel, you don't know what the red value
there is or what the blue value is, and you have to guess it.
(10:54):
And that process is traditionally known as demosaicing,
which means guessing, interpolating, information.
So the question is, do we call this
hallucination or guessing? On a small
scale, on a pixel level, it is making up information.
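For readers who want to see what that per-pixel guessing looks like, here is a sketch of the simplest possible (bilinear) demosaicing of an RGGB Bayer mosaic. Real ISPs and Glass's neural approach are far more sophisticated; this only illustrates that two of the three color values at every photosite are interpolated.

```python
# Toy bilinear demosaicing of an RGGB Bayer mosaic: at each pixel, two of
# the three color values are missing and are interpolated from neighbors.
import numpy as np
from scipy.ndimage import convolve

def demosaic_bilinear(mosaic):
    """mosaic: HxW raw image with an RGGB pattern; returns an HxWx3 estimate."""
    H, W = mosaic.shape
    # Masks describing which color each photosite measured.
    r_mask = np.zeros((H, W)); r_mask[0::2, 0::2] = 1
    b_mask = np.zeros((H, W)); b_mask[1::2, 1::2] = 1
    g_mask = 1 - r_mask - b_mask
    # Standard bilinear kernels: pass known samples through, average neighbors.
    k_g = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 4.0
    k_rb = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 4.0
    rgb = np.empty((H, W, 3))
    rgb[..., 0] = convolve(mosaic * r_mask, k_rb, mode="mirror")
    rgb[..., 1] = convolve(mosaic * g_mask, k_g, mode="mirror")
    rgb[..., 2] = convolve(mosaic * b_mask, k_rb, mode="mirror")
    return rgb

# Example: build an RGGB mosaic from a synthetic scene, then reconstruct it.
rng = np.random.default_rng(2)
truth = rng.random((64, 64, 3))
rows, cols = np.indices((64, 64))
bayer = np.where((rows % 2 == 0) & (cols % 2 == 0), truth[..., 0],
        np.where((rows % 2 == 1) & (cols % 2 == 1), truth[..., 2], truth[..., 1]))
estimate = demosaic_bilinear(bayer)
```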
I think this question you bring up, Michael,
(11:14):
becomes more interesting nowadays with generative AI,
because now you're not only able to guess what the value of a pixel
is, you can guess what the mouth of a person looks like, or
the eye, or a house or a bird or a flower.
And you can make up the whole flower.
You don't just guess one pixel. And I think if
(11:35):
you're talking about replacing multiple
pixels, like a chunk of 100 by 100 pixels, let's say there's a
flower in there, that's purely generative.
If you're replacing a few pixels here and there and removing some
noise, then I wouldn't go out and call it generative.
Although, again, it's totally a gray area.
It's not black and white, like I tried to explain.
(11:56):
There really is this question of reality versus what's made up.
As you said, at one extreme you have generative AI
that is concocting something that never existed.
At the other extreme, you have traditional photography.
(12:17):
So how do you make that distinction?
And how far do you, at Glass Imaging, push the envelope
in terms of adding or changing or adapting the original
photograph in order to make it quote-unquote better, whatever
better happens to mean?
Regarding the better, there
(12:39):
is this: if you're taking a picture of something and you're trying
to test some algorithms, AI, anything, let's say
it's, I don't know, some business card you put on the
wall. You can actually walk up to that
business card, or just take a picture with a very high
resolution camera. So you have a ground
truth. You can judge what you're doing
in terms of how well or badly it interpolated or
(13:02):
hallucinated the missing information.
We specifically at Glass are actually focused on
trying to be as little generative as possible, in the sense
of not making up information that's completely not there.
There are other companies and other pieces of
(13:22):
software offering that; you know, you have DALL-E and
Sora and all these things. We see it as post
processing. So it kind of complements
what we're doing, because even if you have an image and you want
to do some generative things on it, like remove skin
imperfections or something like that, you're always better
off having the input image with the highest possible
(13:46):
resolution and as true as possible to reality.
And then you can go ahead and apply some, let's say, fake
effects on it, which is fine. I'm not criticizing it.
I think a lot of people love it. But I think it's also important
to start with something that's as true as possible to reality.
And we're very proud that we're able to achieve, let's
(14:08):
say, a high confidence level, a high comparability with the
actual, with reality, with the ground truth.
Maybe I'll say a little bit about how we're doing it, because
we talked about what we're doing, but we haven't
explained how. So when you look at a
camera, and it doesn't matter if it's a phone or a drone or a
professional camera or a medical device, a camera consists of a
(14:29):
lens and a sensor and then a bunch of algorithms, right?
The more constraints you have on the camera in terms of size and
cost and weight and the materials that you
can use, usually that means you'll have more aberration.
Aberration means imperfections in the lens, and it
means the lens is going to be a little bit blurry.
If you're cramped for space, like phones or electronics in
(14:52):
general, it also means you're likely using a very small
sensor, which typically means a lot of noise in the image.
So if you look at the image that just comes out of the
sensor on all these small devices, the image is bad.
It's really bad. We look at it, you know,
internally, but most people never see a raw image
from a phone. It's super noisy.
(15:13):
It has color noise, color speckles all over the place.
It's very not sharp. It has lots of optical
aberrations, and the most visible ones are usually chromatic
aberrations. So if you have black and white
text, suddenly on the raw image, when you look at it, it
has purple and blue and green, but you look at the
real paper and see it's black and white.
So these are chromatic aberrations, and noise and color
(15:35):
noise, and it all adds up to creating a very
bad image. The process of correcting all these things
is, let's say, algorithms, computational algorithms,
computational photography. Now back to what Glass is doing,
which is also different from what others have been doing
in the last few years. We built special labs
(15:57):
into which we can put any camera. So if, for example, we
take an iPhone (we're showing some images from an iPhone later),
we take the physical iPhone and we put it in the lab.
We have labs that we built dedicated for this. What the lab
does, and I can't describe it
in full detail, but the lab learns how to characterize
(16:18):
the optics and the sensor in an extremely
detailed manner. So for every pixel, for every
angle, for every type of light, for every amount of light, it
characterizes the optics and the sensor and the interactions
between the two. And then all this data is
collected, terabytes of data. It takes us a few days to
collect, and all of that is fed into software for training
(16:40):
a neural network, a neural network that, specifically if
we're talking about smartphones, is something that can
actually run on a smartphone. Currently we're offering our
solutions on Qualcomm chips, but we also have demos on iPhone
and other devices. Once we do it, we have a network
that basically learns to take these bad images and create good
images, basically removing all the optical imperfections and,
(17:03):
as much as possible, the sensor noise and sensor
imperfections. Now, everything I explained, I
was giving an example of an iPhone.
With the camera on the iPhone, they're trying to make it as
good as possible in terms of sharpness, lens aberrations,
sensor. But because we can correct these
things, that kind of opens up the question: OK, if
you can correct all these things, why does Apple or
(17:25):
Samsung or any other phone maker, why do they need to make
it perfect, right? And the answer is they don't.
We can make the lenses much worse than they are today,
or the camera in general, and still get a good image out of it.
And then the next question, you know, that I'm asking
myself is, OK, why make it bad if you can make it good?
And the answers to this are what make things
(17:47):
interesting. So you make it bad, if you
can correct it, only to gain something, right?
Because we're not a university, we're not trying to
prove something. We're trying to create
something tangible that is actually useful for products.
So in making something bad to gain something, you can gain
cost, you can make lenses cheaper, you can make them smaller, you can
make them support a bigger sensor without going crazy with
(18:10):
the number of elements that you have inside the lens.
And that just opens up a lot: the design space you
have now is way bigger than what you had when you needed to
make a good lens and a good sensor.
Very interesting. So the source of data, correct
me if I'm wrong, if I understood this correctly, the source of
data is the profiling that you do of the camera, of
(18:35):
the hardware, and that data then gets fed into your neural
network, which can then essentially construct a
simulation, a digital twin. Would that be a correct way of
saying it?
Yeah, essentially there are
kind of three steps. One, you characterize the camera.
(18:59):
So we take any camera, smartphone or other type of
camera, we put it in our lab and we do many measurements
that completely profile how the lens and the sensor behave
in different lighting conditions, different kinds of
brightnesses and scenes and so on.
The second step is we take that data and we use it to train a
(19:20):
neural network. So in a way it's a self-supervised
process. We don't have any data
annotation or labelling going on from an AI point of view.
So we train a neural network that's dedicated to that device.
And the third step is we have to port it to run efficiently on
the edge. Most AI today is running in
(19:42):
the cloud, you know, big data centres, lots of GPUs.
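Glass Imaging doesn't disclose how the lab measurements feed the training, so the following is only a hypothetical sketch consistent with the description above: use a measured blur kernel (PSF) and a noise model to synthesize degraded/clean pairs from which a restoration network could learn, with no manual labeling. The PSF, noise parameters, and function names here are placeholders, not Glass's actual pipeline.

```python
# Hypothetical sketch: use a measured per-camera degradation model (PSF +
# noise) to synthesize (degraded, clean) training pairs for a restoration
# network. Glass Imaging's real lab pipeline is proprietary; the PSF, gains,
# and noise parameters below are made-up placeholders.
import numpy as np
from scipy.signal import fftconvolve

def degrade(clean, psf, read_noise=0.01, shot_scale=0.02, rng=None):
    """Apply the measured blur, then shot + read noise, to a clean image."""
    rng = rng or np.random.default_rng()
    blurred = fftconvolve(clean, psf, mode="same")
    shot = rng.normal(0.0, np.sqrt(np.clip(blurred, 0, None)) * shot_scale)
    read = rng.normal(0.0, read_noise, blurred.shape)
    return np.clip(blurred + shot + read, 0.0, 1.0)

# Placeholder "measured" PSF: a small, slightly off-center blur kernel.
yy, xx = np.mgrid[-3:4, -3:4]
psf = np.exp(-((xx - 0.7) ** 2 + yy ** 2) / 2.0)
psf /= psf.sum()

rng = np.random.default_rng(3)
clean_patches = rng.random((8, 128, 128))            # stand-in training targets
pairs = [(degrade(p, psf, rng=rng), p) for p in clean_patches]
# `pairs` could now feed a standard supervised training loop for the network.
```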
What's the power on these mobile devices?
You actually have some pretty capable chips today.
Qualcomm is in the majority of Android phones today.
Google Pixel and Apple iPhone have their own processors
(20:04):
that also have some kind of neural processing unit.
So you have the CPU, GPU and an NPU, three different systems
which all have different merits.
But for any of these kinds of neural networks, the NPU, the neural
processing unit, is super capable, very power efficient.
So we can put that neural network running on that NPU.
(20:26):
And that requires, you know, some careful software
engineering to make it run efficiently.
We want to process the 12 or 50 or even 200 megapixels
coming out of some of these sensors today in close to real
time. Meaning you take the picture and
you don't notice that there's some background processing going
on; you've rendered that image on the screen and you save
(20:48):
it to disk using that neural network, which is taking
those bursts of raw images that come in from the sensor and
creating one final image file that has all the sharpness
and detail and low noise and clarity that you expect.
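One practical concern Tom hints at is running a network over tens of megapixels with bounded memory. A common generic pattern (an assumption here, not Glass's actual runtime) is overlapped tiled inference; `model` below is a stand-in for whatever network runs on the NPU.

```python
# Generic tiled-inference sketch for running a restoration network over a
# very large frame with bounded memory. `model` is a placeholder callable;
# the overlap region is averaged to hide seam artifacts.
import numpy as np

def run_tiled(model, image, tile=512, overlap=32):
    """Apply `model` (HxW -> HxW) to `image` tile by tile with overlap."""
    H, W = image.shape
    out = np.zeros_like(image, dtype=np.float64)
    weight = np.zeros_like(out)
    step = tile - 2 * overlap
    for y in range(0, H, step):
        for x in range(0, W, step):
            y0, x0 = max(y - overlap, 0), max(x - overlap, 0)
            y1, x1 = min(y0 + tile, H), min(x0 + tile, W)
            out[y0:y1, x0:x1] += model(image[y0:y1, x0:x1])
            weight[y0:y1, x0:x1] += 1.0
    return out / np.maximum(weight, 1.0)

# Example with a dummy "model" (identity) on a fake 12 MP frame.
fake_model = lambda patch: patch
frame = np.random.default_rng(4).random((3000, 4000))
restored = run_tiled(fake_model, frame)
```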
So your neural network, then, is closely bound to the specific
(21:09):
hardware of the phone, to that particular model, with the
characteristics of that particular lens and processor
and sensor and so forth.
I'd say that's one of the
advantages, because there are solutions available today
for enhancing images. You'll see some of them in
Photoshop, for example, or Lightroom; you'll see, you know,
(21:29):
upsampling websites where you can upload an image and produce an
AI-generated upsampling. Those are not specific to the
type of camera that's used to take the picture.
So they can improve the image quality somewhat.
And as Ziv was explaining before, there's this continuum
of how much is generated versus how much is recovered from the
(21:52):
image. What we're focused on is really
trying to recover detail that's encoded in the signal, from a
sort of information theory or signal processing point of view:
there is image content.
The image has information that is scrambled up in a way.
It's mixed up, it's blurred, it's noisy, it has some
(22:14):
uncertainty about it. There is a true underlying
signal, which is the image that comes from the scene.
And we're trying to reconstruct that as faithfully as possible
using the information that's there.
It's just scrambled up. If you take one of those
other approaches, they don't actually do that unscrambling,
(22:35):
in a way. They just try to generate or
create or invent plausible detail that might be there.
So we're actually compensating for issues of the lens, like
softness, aberrations, noise, in the sense that we're using a
profile that's measured in our lab.
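A classical, non-AI way to see this "unscrambling" idea is Wiener deconvolution: if the lens blur (PSF) has been measured, you can invert it in the frequency domain, trading sharpness against noise amplification. This textbook baseline is not Glass AI's method, but it shows why a measured profile lets you recover real detail rather than invent it.

```python
# Classic (non-AI) illustration of "unscrambling": Wiener deconvolution
# recovers detail when the lens blur (PSF) has been measured, balanced
# against noise. A textbook baseline, not Glass AI's actual method.
import numpy as np
from scipy.signal import fftconvolve

def wiener_deconvolve(blurred, psf, noise_to_signal=1e-2):
    """Deconvolve a grayscale image given the measured blur kernel `psf`."""
    H, W = blurred.shape
    # Zero-pad the PSF to the image size and center it at the origin.
    psf_pad = np.zeros((H, W))
    kh, kw = psf.shape
    psf_pad[:kh, :kw] = psf
    psf_pad = np.roll(psf_pad, shift=(-(kh // 2), -(kw // 2)), axis=(0, 1))
    Hf = np.fft.fft2(psf_pad)
    G = np.fft.fft2(blurred)
    # Wiener filter: conj(H) / (|H|^2 + NSR).
    F_est = np.conj(Hf) * G / (np.abs(Hf) ** 2 + noise_to_signal)
    return np.real(np.fft.ifft2(F_est))

# Example: blur a synthetic image with a known 7x7 kernel, then invert it.
rng = np.random.default_rng(5)
sharp = rng.random((128, 128))
kernel = np.ones((7, 7)) / 49.0
observed = fftconvolve(sharp, kernel, mode="same") + rng.normal(0, 0.01, sharp.shape)
recovered = wiener_deconvolve(observed, kernel, noise_to_signal=1e-2)
```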
You're tightly bound to the hardware, and you're compensating
(22:57):
for deficiencies that may exist in the lens or in the
software platform, whatever it might be,
that's part of that overall mobile system, that phone.
Any lens, any sensor has deficiencies.
It's a measurement of the physical world.
When you have a sensor, you're collecting photons.
(23:19):
When you have a lens, you're trying to focus
rays onto small points on the sensor.
And we can get into the physics of it, but it
has limits physically, and we try to overcome those as best as
possible using software.
You asked, or you mentioned, that,
OK, this neural network that we just described would be tailor
(23:40):
made to that specific camera. It has pros and cons,
and we solve the con. I'll explain how.
The big pro is that it's tailor made to a specific
camera. So it's able to correct
the specific aberrations of that camera and the specific
noise of that sensor. That is what allows
us to achieve the maximum potential quality of that
(24:02):
camera, basically to extract the maximum that you can from a given
system. The downside of this approach
is that you have to train for each specific camera.
And basically most of the time we spent in the last few years was
actually building a system, which is a physical lab and a bunch of
software for training and software for controlling the
lab. There's some robotics inside.
(24:24):
What we concluded a while ago is that, OK, the way to
get the maximum quality is to train for every different
camera, and we need to make it fast.
So we spent a lot of effort and resources on making these
labs that can take any device with a camera on it, or multiple
cameras like phones, and within, you know, it can be as fast as a
(24:44):
few hours and go up to one or two days,
we can capture all the data needed to train the network for
a specific device. And then the training takes anything between
a few hours and maybe a day or two.
So the whole process can start on Monday and finish by
Thursday, right?
And just to kind of put things
into perspective: without Glass, how would
(25:08):
you do it, right? All the phone makers
design new hardware for their new generation phone.
So, you know, Apple is now working on iPhone 17, I assume.
At some point, several months before the launch,
these companies get the first hardware prototypes,
like the phone with the new lens and the new sensor, new
everything, and then they need to tune it.
(25:30):
It's called tuning, tuning image quality.
Typically, depending on the company size, there's
anything between a few hundred and a few thousand people
working for a good few months, maybe even six months, on tuning
these things, right? So, you know, when we
say four days, it's very fast. And it's also worth mentioning that in those
four days that I mentioned, Monday to Thursday, there are
(25:53):
almost zero human hours involved. There's maybe 20-30 minutes
setting up the device in the lab.
Once it's set up, we close the door and we let it go.
And then the training, obviously, you know, it's NVIDIA
GPUs crunching numbers. It's not people doing actual
work. So what I want
to say is, we put tons of effort into automating the process.
(26:16):
This way we can support, you know, thousands of different
projects at the same time and provide the output,
which is a neural network that's designed to run on a specific
platform, whether it's a Qualcomm chip or some PC or
NVIDIA chip, within a few days. So we can support a very high
number of customers every week.
So we have a question.
(26:38):
Well, let's see. From LinkedIn, Kamaru Dean Lawal says:
Is Glass Imaging hiring? Good question.
Yeah, thanks for bringing that up. Yeah, we have a few open
roles on our website and we are going to add several more open
roles. Some software roles, some
(27:00):
business roles, and some machine learning and AI roles.
Maybe you want to add something?
Yeah. Optics.
Computational optics also. Yeah.
Kamaru Dean, look at their website.
Gloria Putnam says: Where do I get one of these Glass AI
tracksuits? They look sharp.
(27:20):
And here we have a more technical question from Ayush
Jamdar. He says: Do the trained
neural networks learn the specific camera's response
function (CRF), the mapping from sensor irradiance to pixel
color, therefore entirely replacing the camera pipeline
(27:43):
processing in the phone's ISP? So basically what he's wondering
is where you guys fit in relation to the camera, if I
understand the question.
We can essentially replace the
whole ISP, the image signal processing pipeline, that exists
today. We can also replace part of it,
(28:05):
but we sit at, as I said, the front end, where we take
in those raw images directly from the sensor, all the way up
to producing a finished RGB image.
Typically we say it's a linear RGB image, meaning it hasn't
had any tone response or manipulation of colour on top of
(28:26):
it, which is more of an aesthetic treatment that every
camera or phone or software does today.
So we leave that up to vendors to apply their own, or we can
provide a solution that gives, you know, optimal tone and
colour. It can all be done end to end.
So it's flexible.
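For context, here is a generic sketch of the aesthetic layer a vendor might apply on top of a linear RGB output: white balance gains, a simple tone curve, then sRGB encoding. The gains and curve are illustrative placeholders, not any particular vendor's tuning and not part of Glass's deliverable.

```python
# Generic sketch of the aesthetic layer that sits on top of a linear RGB
# output: white-balance gains, a tone curve, then sRGB encoding. The gains
# and curve here are illustrative, not any particular vendor's tuning.
import numpy as np

def linear_to_srgb(x):
    """Standard sRGB opto-electronic transfer function."""
    return np.where(x <= 0.0031308, 12.92 * x, 1.055 * np.power(x, 1 / 2.4) - 0.055)

def vendor_look(linear_rgb, wb_gains=(1.8, 1.0, 1.6), contrast=1.15):
    """Apply illustrative white balance and an S-shaped tone curve."""
    img = np.clip(linear_rgb * np.asarray(wb_gains), 0.0, 1.0)
    # Simple contrast curve around mid-gray for a punchier "look".
    img = np.clip(0.5 + (img - 0.5) * contrast, 0.0, 1.0)
    return linear_to_srgb(img)

linear = np.random.default_rng(6).random((8, 8, 3))   # stand-in for a linear RGB output
display_ready = vendor_look(linear)
```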
So would it be correct to say, then, that you are performing your magic, improving the
(28:52):
sharpness, reducing the noise, and you're leaving the colors as
they are, so that the phone platform can then apply their
own processing, like Samsung does certain things or other
manufacturers do other things within their phone software?
(29:12):
And in their end-user software as well?
That's very correct, Michael. And the reason behind it is not
necessarily what we can do or what we want to do.
It's, you know, if you look at smartphones, some of them have
their own, I'd say, color tendency: warmer looks, this
type of white balance, that type of white balance.
If we were selling ten smartphone companies the same
(29:36):
post-processing pipeline, they would all end up with the exact same
colors. And it's one of their ways
to differentiate, you know, between phones.
And it's also worth mentioning that these color preferences are
very geographically varied. So if we're looking at phones
that are designed in China and sold in China, versus phones that
are designed for the Korean market or Japanese market, versus
(30:00):
the US or European market, you'll see very different trends
in terms of white balance, exposure, color vibrancy, things
like that. It's also worth mentioning
that these trends are not fixed. In the past, let's
say five years ago, we've seen that European and US devices,
mostly like Apple and Samsung, were trying to stick
(30:22):
to more realistic looking colors and tones, while the
leading phone makers in China were kind of promoting
very saturated colors. And now, or over
the last two or three years, we see some inversion of
that. So I'm just saying that this
trend is not constant with time, it changes.
It has a very slow cycle, maybe every few years.
(30:46):
I think, and this is just my personal opinion,
that the reason to go more true to life is kind of
coupled with the quality, the sharpness, of the images.
I think that the closer you're getting to, let's say, SLR
quality or professional quality in images, that will
actually drive demand for more realistic looking colors and not
(31:08):
very, you know, overly vibrant, saturated colors.
A good time to look at some images, is it not, Michael?
OK, so we're looking at an
image. We're looking at two images side
by side. These are both
images from the recent iPhone. On the left, it's actually
straight out of the Apple Camera app, and on the
(31:28):
right is using our Glass AI neural network processing
running on the device. What you can see immediately,
and we can pan and zoom around here, is that you
have better detail, sharpness, definition, color.
If you zoom in on this roof, you can see the red color
(31:51):
is much better represented. You can see the trees are
clearer. The definition of the lines in
these buildings is much better.
The saturation is better. Many, many things to point out.
Let's look at another image here.
This is one that actually Ziv captured.
(32:13):
Let's zoom in. Yeah, it's the one we were
talking about earlier, Michael, when you were talking about
hallucination and we talked about how it's a gray area and not
binary. So I think this is a good
example of how we differ from other, let's say, AI processing.
The image on the left is captured by the default iPhone
(32:34):
camera app. There is a lot of zooming in
going on here. This object is pretty
far from the camera. So when you zoom in a
lot, what you see in the processing done by the iPhone
camera is that the text starts to break at some point and it
starts to look very weird. This weirdness is actually
(32:55):
hallucination. Obviously the photons that came
from the scene through the lens onto the sensor did not look like
that. The photons look correct, but
there's so much noise in it and the
optics is so blurry that it's very hard to separate
information from noise and blur. Because we're training a
(33:15):
neural network specifically for this phone,
in this case the iPhone 16 Pro Max telephoto camera,
the 5x telephoto camera, we're able to
achieve better reliability in terms of how this matches
the reality, the real text on this box or whatever
this thing is. And, you know, you
(33:36):
could say both left and right are hallucination to some extent
at the pixel level. But I think this is a good
example to show that the hallucinations on the left
extend to maybe 4, 5, 6, 10 pixels, whereas on the right we're
hallucinating, or interpolating, only maybe one or
two pixels in size. So even though the text on the
(34:00):
right, I wouldn't say it looks good,
it's a little bit blurry and smudgy, but you don't see
things that should not be there. I mean, we as humans look at
this text and we know it's English letters and we know what
it should look like, right? So we're not achieving
perfect sharpness, but we're not inventing lines that were not
there. This was just to explain the hallucination,
(34:23):
or lack of hallucination, objective.
You might ask how we do this. These actually come from, you
know, the same raw files, processed differently
in each case. So we're actually, like I said,
replacing all those blocks in the traditional image signal
processing pipeline that are kind of lossy.
(34:44):
Every time you apply a sharpen, a denoise, a tone curve, you
know, HDR, various different steps, they all lose some
information in the signal, whereas if you do it end to end
with a neural network, you can actually preserve those things.
So there's demosaicing, where you have this Bayer pattern on
(35:07):
the sensor, where you only see green at every other pixel and
blue and red at one in every four pixels.
So you're having to make those up.
But the sensor is also noisy, so you're trying to guess the
missing pixels in the face of actually extreme noise in these
tiny cell phone sensors. And that's what leads to these
kinds of artefacts when it's done with a handcrafted algorithm, as
(35:30):
has been done for the past, well, since digital cameras have
been around. Those noise artefacts tend to get amplified
through those different steps. Whereas on the right, the neural
network is actually able to make its best interpretation
using the characterization that we did in our lab to
(35:51):
untangle the effects of the noise, the demosaicing, and
the sensor and so on. So when you talk about the
characterization, now that we've seen some examples and it's
obviously very, very impressive, what is the
characterization? Can you be more specific?
You're talking about aberrations in the lens, limitations on the
(36:12):
sensor, and how do you know? How do you know?
I mean, if you know where those aberrations exist, you know, is
it relative to absolute perfection?
Is it relative to nothing? Can you give us some sense of the
calibration here?
We can give some more
information, but not explain exactly how we do it, because
that's, you know, part of our secret sauce.
(36:34):
So, when the lens goes into the lab... I'll explain methods used
by, let's say, traditional approaches. Let's forget about
the labs for a moment. Let's say that you have a new
camera, or a phone with a camera on it,
and you want to determine what the
optics look like. What does the blur look like?
You could do a few things, but maybe the most
(36:56):
obvious one: you would take some lab with nice lighting
and charts. And you would pick a chart,
there's something called an SFR chart, which is basically just a
bunch of black squares on a white background.
And then on these squares, you can go to the edge where it's
supposed to go from black to white.
You know that it's infinitely sharp.
(37:17):
So you know the ground truth, because you printed that chart
and created that chart, or you bought it, but you know
what it looks like. And then you take a picture of
that chart with your, let's say, in the case of a phone, with the
phone, but you just take a raw picture.
You don't do any processing on the image.
So now you look at this raw image and you look at, let's
say, the black and white edge, and you see that, OK, it's
(37:40):
not actually transitioning straight from black to white,
it's actually blurry. The transition takes
maybe three or four or five pixels.
So you can learn what the blur of the lens is from that.
And then you would also see that this edge is supposed to be
either black or white or gray. You see that it has, for example,
some greenish hue on the right side and some magenta color on
(38:04):
the left side. Then you know that you have a
color aberration, and you can actually even use some
simple calculations to calculate how big the color aberration is
in pixels. You can manually characterize
the things that our AI is doing, but it would
take you years to characterize one camera, because you would have
to do this thing that I just mentioned at every location on the
(38:28):
sensor, going from corner to
corner to the center. You have to do it for all
light levels and light types, because the noise comes into
play here. If you have low light, the
effect of noise is much more dominant, and you have the
temperature of the light, which also has an effect.
Now, to make things clearer, the reality is even
(38:50):
more complicated, because these lenses are not static;
they have OIS on them, so they move around.
So everything I described is also dependent on the position
of the lens during stabilization, during capture.
And you also have an autofocus system on these cameras which
tries to focus on your subject, but that's never
(39:10):
perfectly accurate. So you have some variance there,
some slight autofocus mismatch, and you also have
object distance. So everything I describe, you
need to repeat the whole thing from a few centimeters all the
way to, you know, several tens of feet.
So this can be done manually, but it will just take
you, my guess is, a year of a full-time employee in the lab
(39:32):
to just do all these measurements.
And then there's also a question: what do you do with
all this data, right? You'll end up with just tons of
data, and what do you do with it?
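A small sketch of the manual chart measurement Ziv describes: take a raw capture of a black-to-white edge, estimate blur from the 10-90% rise width of the edge profile, and estimate lateral chromatic aberration from the offset between the red and blue channels' edge positions. The edge profiles below are synthetic stand-ins for real chart data.

```python
# Sketch of the manual measurement described above: image a black-to-white
# edge, estimate blur from the 10-90% rise width of the edge profile, and
# chromatic aberration from the shift between color channels' edges.
import numpy as np

def rise_width(profile, lo=0.1, hi=0.9):
    """10-90% rise width (in pixels) of a 1-D dark-to-bright edge profile."""
    p = (profile - profile.min()) / (profile.max() - profile.min())
    x = np.arange(p.size)
    return np.interp(hi, p, x) - np.interp(lo, p, x)

def edge_center(profile):
    """Sub-pixel position where the normalized edge crosses 0.5."""
    p = (profile - profile.min()) / (profile.max() - profile.min())
    return np.interp(0.5, p, np.arange(p.size))

# Synthetic raw edge profiles for R/G/B across one row of the chart: the red
# and blue channels are blurred more and shifted, mimicking lateral CA.
x = np.arange(64)
def soft_edge(center, sigma):
    return 0.5 * (1 + np.tanh((x - center) / sigma))
profiles = {"R": soft_edge(32.8, 3.0), "G": soft_edge(32.0, 1.5), "B": soft_edge(31.3, 2.5)}

for ch, prof in profiles.items():
    print(ch, "blur ~", round(rise_width(prof), 2), "px, edge at", round(edge_center(prof), 2))
lateral_ca = edge_center(profiles["R"]) - edge_center(profiles["B"])
print("R-B lateral chromatic aberration ~", round(lateral_ca, 2), "px")
```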
Let's now look at another example.
Michael, can you see the screen here?
Yep, KPMG. Oh, look at that.
Yes. This is a building taken, you
know, from reasonably far away. On the left, this is another
(39:53):
phone. It's an Android phone, but
one which has some pretty extreme lens aberrations going
on. So exactly what Ziv was
talking about. Let's zoom in some more.
You can see the edge transition here is kind of blurry.
It's sort of sharp, but sort of not.
So it's not just soft, but you kind of have a
(40:14):
double edge, and you have some colour fringing.
You can see it also on these wires here on the roof and on these
pipes; you kind of get red and green on either side of these
white and black transitions. It kind of looks like an
old camera, actually. And you know, the reason why old
cameras look like that is because they didn't have very well
(40:35):
controlled lenses. Nowadays you have much better
manufacturing tolerances in producing lenses, but still, this
is actually an extreme case: you're trying to
zoom in maybe 10x or 20x on the camera,
and trying to fit that in a cell phone is really difficult.
So when you combine that with very tiny pixels, you start to see
these lens aberrations, where the light is diverging in different
(40:58):
ways on the sensor at different wavelengths.
So the red and the green focus differently.
You get chromatic aberrations and other things like
that. So this is what you would get
with typical processing, but after applying Glass AI, it
learns to calibrate and correct for those things with that
neural network under all those varied conditions, as Ziv
(41:18):
mentioned. So different focus settings,
different distances and so on, different parts of the field
of view, they all behave differently.
So it's a huge space of parameters the network has to
learn, but the result is you can start to see those very
fine details. Just one more example, you
(41:39):
know, let's zoom in on this balcony.
You can see these very small shiny objects here
are actually well resolved. The edge of this towel hanging up here,
it's much better.
But Tom, in this image, with
all due respect, there's also a plasticky kind of feel to
(42:01):
the towel that's hanging.
Sure. This is an extreme zoom case, really
pushing beyond the limits of the actual sensor.
So there is actually no physical texture detail here.
What we're able to do is correct those edges in this case
and give something that, you know, when you zoom out a little
(42:22):
bit, I think you'll agree, is crisper, it's clearer.
It's not just adding contrast, it's that clarity.
For sure, the mid-frequency details become a lot
crisper and more natural. When you
see it like that, that was a 100 percent zoom, at the pixel level.
(42:43):
So when you view it on a normal screen it just looks a lot
better. Yes, it's very clear.
We have another question, and this is from Steph Bishop:
What is the relative importance of correcting for lens defects
compared to coping with noise?
If you're outside on a sunny day,
correcting the lens is more dominant than
(43:05):
correcting the sensor noise. If you're in, I'm saying, low
light, but just to put things in perspective,
anything indoors for phones is low light. It's going to be
noisy unless you have some very fancy studio lights in your
house. But even any normal house
with the lights on is already low light.
And in that case, I would say that both are equally important,
correcting sensor noise and the lens effects. And just to
(43:28):
emphasize again, I know we've said it a few times, we're not
correcting them sequentially. They're being corrected in
one pass with one network, which is very important.
And I think when you go to extreme low light, like let's
say 0.1 lux or something like that, then the noise becomes
more dominant. Because if you have a 12
megapixel camera, like you have on most phones, and you're
(43:49):
now in extreme low light, you would be very thankful to get a
good quality one megapixel image, right?
So what it means is that correcting a one-
pixel-sized aberration is not super important.
It's more about correcting noise.
But again, back to my statement: if you correct them separately,
you're never going to achieve the optimal results you get when you correct
them together. Even in this extreme low light,
(44:11):
it's very important. And maybe to help
explain it: if you have a lot of noise, right, and you have a
fancy traditional denoising algorithm, what do you want to
do? You basically want to average
pixels out, right? But you don't want to average
pixels out when you reach an edge; you have an edge,
you want to preserve the edge, right?
Now, the definition of an edge actually depends on the lens,
(44:33):
not just on finding an edge as a big transition. If
you're not taking into account what your lens can and cannot do,
or the blur of the lens, you're not going to denoise perfectly.
You'll denoise, but not as well as you can.
So taking these things together as one, and using
one piece of algorithm, a neural network in our case, to do these
(44:54):
things together, gives you optimal
results.
It depends on the character of
the lens and the sensor as well: the pixel size versus the amount
of aberration you have on the lens.
And not just the size of the blur, but its
nature. So some lenses have blur that
is easy to correct, some have blur which is harder to
(45:14):
correct. And the noise sort of
operates at different spatial frequencies.
Your low frequencies, your coarse details, are easier to
correct, and the high frequencies, the very fine
details, are very hard to correct.
Typically that's where the noise tends to dominate.
One thing that we haven't talked about that I think is
interesting to bring up is this idea of deep optics, which is
(45:35):
something else we're working on, which is more in the realm of
joint hardware and software design.
So given that we have this powerful neural ISP, this Glass
AI that we've developed, the software that can correct these
lens aberrations, another thought is: how do
you produce better hardware, taking into account that
(45:58):
software capability? And that's where you can
actually engineer and design the blur, the point spread
function of the lens, such that the neural network can
extract the maximum detail and overcome that trade-off
with noise and deblurring or sharpening.
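One way to picture the trade-off Tom describes is through the lens's MTF, the magnitude of the Fourier transform of its point spread function: frequencies where the MTF dips toward the noise floor can only be restored by amplifying noise, which is what deep optics tries to avoid by shaping the PSF. A small illustrative calculation, with made-up PSFs:

```python
# Illustrative look at the blur/noise trade-off: the MTF (magnitude of the
# PSF's Fourier transform) shows which spatial frequencies a lens passes.
# Where the MTF dips toward the noise floor, deblurring must amplify noise.
import numpy as np

def radial_mtf(psf, size=256):
    """Radially averaged MTF of a 2-D PSF, normalized to 1 at zero frequency."""
    pad = np.zeros((size, size))
    kh, kw = psf.shape
    pad[:kh, :kw] = psf / psf.sum()
    mtf2d = np.abs(np.fft.fftshift(np.fft.fft2(pad)))
    cy, cx = size // 2, size // 2
    yy, xx = np.indices((size, size))
    r = np.hypot(yy - cy, xx - cx).astype(int)
    radial = np.bincount(r.ravel(), weights=mtf2d.ravel()) / np.bincount(r.ravel())
    return radial / radial[0]

# Compare a tight PSF with a spread-out one (both made up for illustration).
yy, xx = np.mgrid[-7:8, -7:8]
tight_psf = np.exp(-(xx**2 + yy**2) / 2.0)
wide_psf = np.exp(-(xx**2 + yy**2) / 18.0)
for name, psf in [("tight", tight_psf), ("wide", wide_psf)]:
    mid_response = radial_mtf(psf)[64]  # response at a mid spatial frequency
    print(name, "PSF -> MTF at mid frequency:", round(float(mid_response), 3))
```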
I understand now that the model that you're building is
(46:26):
very thorough. Basically you have this
very deep model of the various characteristics of the
entire system, and that's why you're able to make the changes
that you make when you process the image.
We have another question now, more on the
business side, from Twitter. And the question is this: if,
(46:49):
using your system, a smartphone
manufacturer used lower quality, cheaper camera parts, is the
combined cost, the cost reduction plus your
system, less than the need to buy a more expensive phone or a more
(47:12):
expensive imaging system?
Smartphone cameras started 10-15
years ago expensive, at around $5, and the goal was to bring it
down to $1. What actually happened is that
the price went up like crazy, because there are more cameras now,
and all of them have OIS and all of them have high resolution
sensors, and most phones have at least four or five cameras
(47:35):
today, at least the, let's say, flagship phones.
The bill of materials of these cameras can easily reach
$50 or $100 on some; there's even one device that was
speculated a few years ago to have gone well above $100.
But I would say most of them fall within several
tens of dollars. And what that means, and it's
(47:58):
it's a very good question. You, you could ask it the other
way. Let's say that you don't have
constraints on size. If you're a phone maker and you
have some quality today, some you reach some quality.
Let's say you have a score of 70and you want to reach a score of
90 or or 100, whatever the 100 is, you want to improve a lot.
You have two options, right? One is to use our software.
(48:21):
I'm I'm just using as an example.
The other option is to pay more for hardware, use bigger
sensors, bigger, better lenses, better sensors.
The amount of money that you would have to pay to create the
same gap in quality would be in the order of several 10s of
dollars. And, and I'm putting aside the
fact that it's not going to fit in a phone, which, which is a
big deal too, but it's going to be several 10s of dollars.
(48:42):
Where, where is, you know, not because I'm not, I'm not
discussing, you know, our, our pricing models, but it, it's,
it's, it's much lower than that,like an order of magnitude
lower. So.
So the bottom line is, if you'rebuilding a camera and you're
debating between putting more expensive hardware or using the
software coupled with your camera, you'll save a lot of
money by using the software and not the hardware.
(49:04):
Aside from money, there's an interesting thing to observe:
the trend in smartphone thicknesses over the years.
You know, Apple tried to keep them really thin for a long
time, and then other companies added these bumps that started
growing thicker and thicker. Apple eventually followed that
trend, and now we've maybe reached a limit with these
things; the flagship Android phones, anyway,
(49:25):
are maybe 14, 15, 16 mm thick.
They barely fit in your pocket.
And I think there's a desire to go back the other way.
So they've added better and better hardware over the last
five, six, seven years. Now they're looking like, well,
what do we do? So adding a software solution
like ours is actually a way they can make it go thin again.
And I think these things go in waves.
(49:47):
That might be the next trend in smartphones.
So where is all of this going? What is the future of smartphone
photography? And, oh, by the way, why doesn't
Apple just build this in? You know, they have as much
money as any company, so why haven't they just invented this
(50:10):
and done this?
It's difficult. I've actually been working
in this field for probably 20 years now.
You know, obviously many, many smart people work on these kinds
of topics, but back when I started doing this, it was
taking like a supercomputer to be able to run deconvolution
algorithms; you know, they were taking weeks. But now we have it
(50:33):
running in less than a second on a mobile processor.
Doing that amount of engineering to actually make things run
efficiently is super difficult. There's a lot of getting
everything just right. As we said, we have to capture a
lot of data under very many different conditions.
If you get something wrong, then you can screw up the image,
(50:55):
and obviously you don't want to do that.
But besides that, I think these big companies have, I don't
know, a traditional way of thinking when it comes to how these image
signal processing pipelines are integrated and tuned.
They have a lot invested, like thousands of people and many
resources and momentum going in that direction.
So it's a little bit of the classical innovator's dilemma: it takes
(51:19):
a new mindset to be able to deploy this.
And I think now is the time. We have the processing
power available, and we've been actually working for
several years to get this working well on these edge
devices. But I think it's, you
know, the right moment for this technology to be deployed.
And where is this all going? There's a short term where it's
(51:41):
going and there's a long term where it's going.
Short term, I think we'll see more AI used in all phones,
not just by Glass, but by all the phone makers, by the
chip makers, by the whole industry: drones, security
cameras, everything. And it's happening already.
So we'll see more AI, and we'll also specifically see more
pixel-level AI, which is what we're doing, basically.
(52:02):
And obviously it will get better over the years and there'll be
competition between companies, and one will do a little better
than the other. So I think that's the short
term. I think the more interesting
answer is where this is going in the long term.
In the long term, in three years and above, and it might be sooner,
it's actually the change you can drive in the
hardware industry based on this AI.
(52:23):
And I think that's super interesting.
You know, there are lots of new optics technologies,
there are metalenses, diffractive elements, various
types of autofocus materials and mechanisms, things
that work but are not enabled, that you can't use without
(52:46):
this AI, or without the deep optics stuff that we're developing
that Tom mentioned.
I'll give an example. There was an attempt to use
under-display cameras for smartphones, for the front
facing camera. Some phone makers actually
launched it, and the cameras were just horrible, right?
(53:07):
The reason for that is that putting a display over a camera
serves kind of as a diffractive grating, and it creates a very
messy PSF, which is the point spread function.
The optics just becomes very, very aberrated, but also
in a very non-predictable way for a traditional algorithm to
fix. That's something that for us,
for example, is not an issue at all.
(53:27):
We would just take that phone with the under-display camera,
put it in the lab, and we get a perfect image out of it.
So that's just a small example of something that's
enabled. It's a piece of hardware that's
enabled by this technology that we developed.
And I think you can extend it to cheaper lenses, thinner lenses
also. So we will see in the coming
(53:49):
years phones with the same sensor size, but much smaller
lens thickness. So all these, you know, bumps on
the phones will be able to get much, much smaller, because, you
know, one or two millimeters smaller and then you don't need a
bump at some point. It's also worth mentioning that
foldable phones are trending up now.
We just came back from Mobile World Congress last week
(54:10):
in Barcelona, Tom and I, and we saw tons of foldable
devices, you know, triple folding, dual folding,
screens on all sides. These foldable phones are very
cool and useful because they give you a big screen, but they
have a big problem. Any given section of the phone
is actually very thin. So it means that when you design
a camera to go in there, you now have half the space, or maybe a third
(54:33):
of the space, that you have in a normal phone.
And that just puts more constraints on the camera.
So they're just using worse cameras today.
But if we can give those cameras a big boost in quality, enough
that customers will say, OK, I'm going to buy this foldable
device, even though it has a small camera, I'm still getting
good image quality out of it, that's like a kind of
(54:54):
second-order driver of technology. Long story short,
being able to use this AI to drive changes in hardware will
lead to very interesting things.
If you could show this image
quickly: this is one we took from a drone as well.
So we've applied our technology not just to smartphones, but to other
platforms that have cameras. So zooming in here,
(55:15):
these are some friends from a well-known photography
website. They came to visit our office,
and you can see again, compared with the one on the
left that comes straight out of the drone, that we can resolve a
lot more detail. We're able to apply our
technology anywhere that has a camera and needs processing that
runs on the edge, like wearables, smart glasses and
(55:37):
things like that, which are a big up-and-coming trend, we believe.
That is a very significant difference.
It's pretty amazing. Yeah, it's amazing,
the difference between these two images. Very quickly:
Large camera manufacturers have also been playing with software-
(55:59):
based corrections. For example, I believe that
Fujifilm in some of their cameras has been doing that,
others as well. Any thoughts on this kind of
technology showing up in large cameras?
But very quickly, please.
It's coming.
I mean, I think they're behind in terms of the processors they have;
smartphones typically have better processors today
(56:22):
because of higher shipping volumes.
But as they start to integrate the same kind of Qualcomm chips,
for example, then they will be able to run our software, and we
can develop versions that run on those as well.
Do we lose anything by applying AI to photographs as opposed to
(56:43):
just light going through glass onto a sensor?
You don't lose anything besides processing time and a little bit
of battery power. But actually even that is not
completely true, because computational photography, non-
AI computational photography, actually consumes more power than AI.
Just because, you know, Qualcomm and the chip makers,
(57:04):
they spend a lot of effort making their chips very, very
power efficient. And we can run an insane amount of
calculations on that chip for a whole second, and they will
consume less power than traditional algorithms running
on CPU/GPU for the same amount of time.
So I think the answer is no.
From your perspective, looking inside these smartphones, who's
(57:27):
got the best camera?
The short answer is some of the
Chinese Android phones have a lot better hardware.
There are flagship phones that are actually more expensive than the
iPhone or any Pixel or anything like that.
But the software... wait till we're shipping, then
we'll have the best.
But the image is a result of the
(57:49):
combination of the hardware and the software.
So would it be accurate to say that merely buying better
hardware because you want a better image is an incomplete
solution, because you'd need to look at both?
Yeah, you need to look at the total, and there's, you know,
there's a website called DXO where they test cameras and
(58:11):
rank them. And if you go to the top of the
chart, you will see some phones with more expensive hardware in
them. Not surprising.
But some of the phones up there actually have
much cheaper hardware, just with better software.
So there is kind of, like you said, a combination of
the two, and if you have the best hardware and the best
software, you'll get the best image quality, obviously.
(58:33):
And I know that DXO has rated your results very highly.
So congratulations on that.
Thank you very much.
And with that, we're out of time.
A huge thank you to Ziv Attar and Tom Bishop.
Thank you both so much.
Thanks for having us, Michael, it was great. Nice talking to you, and thanks to the
audience for sending interesting questions also.
(58:54):
And thank you to everybody who watched, especially you folks
who asked such great questions. Now, before you go, check out
cxotalk.com and subscribe to our newsletter so that you can join
our community. We have really awesome shows
coming up. This has been more of a
technical deep dive than we usually do, but I thought it's
(59:16):
such an important and interesting topic that I thought,
well, let's go for it. And let me know in the comments, or
send a message, smoke signals, whatever you want:
what kind of shows from CXO Talk do you want to see?
So let me know, because we are very responsive as far as that
goes. Thanks so much, everybody.
Have a great day, and we'll see you again next time.