Videos

Core Ideas in Artificial Intelligence (AI): From Perceptron to Transformers

This lecture argues that the most important foundational concept in computer science is the abstraction of the natural neuron that underlies modern AI. Dr. Tizhoosh traces the evolution from the perceptron to contemporary deep learning architectures such as Transformers, highlighting how this core idea has shaped the understanding and development of intelligent systems.

H.R. Tizhoosh, Ph.D., Professor of Biomedical Informatics, KIMIA Lab, Mayo Clinic: Thank you very much. I hope you can hear me. So, that makes it very difficult with this kind introduction; the task is difficult enough. 'Core Ideas in AI', and in the short time, and I don't know the background of the audience. I'm pretty sure a lot of the things that I say you already know. But I want to take the history from its inception, as far as we can identify it, and bring it to today and see: what are the core ideas of AI, when did they work, and when did they not work? And that's the disclosure we are supposed to make now, any time you use large language models to get help putting things together.

So, at the beginning, definitely, was Alan Turing. And Alan Turing started by looking at what we call the Turing Machine. A Turing Machine is basically the concept of computation. You cannot do anything in computer science specifically unless you have an abstraction. You need an abstraction so that you can bring the physical world into the form of an abstraction and then process it inside computers. But in his time, we didn't have computers. So the first abstraction was: what is a computer? And the Turing Machine was the attempt to do that.

You have an infinite tape, you have a tape head, a finite set of states, and a transition function to create a Turing Machine. So that's basically a model of a Turing Machine, but if it confuses you, just grab your cell phone: that's a Turing Machine. Your computer: that's a Turing Machine. That's the advanced form. Same abstraction, nothing has changed. We do not have a new abstraction for computational devices, so everything started in 1936 with what Turing gave us with his understanding.

And then, looking into the future before it even started, Turing created the idea of the test. How do we know, if we create those types of machines, which by now we have created, how do we know that those machines are smart? How do we know we can trust them? So he came up with this idea of testing them.

So, you have two rooms separated: in one is a human, in the other a computer, an AI. And then we have a judge, who is a human expert. The judge asks a question and does not know which room is which. The job is to find out: if you ask the same question of the human and of the computer, and you don't know who is who, can you figure out which one is the computer? How long does it take for the computer to say something stupid, so that we realize, oh, no human would say that?

Well, ChatGPT, it took me 8 seconds when ChatGPT came out. So, okay. And my grandma, I'm pretty sure she still thinks ChatGPT is very smart. So the Turing test has to be done by an expert, not by a regular user. That's why we make our life easier, by assuming that things are smart when they are not.

So, what are the core concepts in our AI? Of course, you have to start from the natural neuron, with all its anatomy and structures. You have the nucleus, you have the axon, you have the dendrites, you have the bifurcations. When you learn something, we use the myelin sheath to protect what we have learned, so that the neurotransmitters do not lose the information. And, of course, the most important thing in a natural neuron, with all its diversity, is the neurotransmitters, the synaptic connections.

So, the synapses, which in AI research we call weights or, more recently, parameters, are the ones that in natural neurons are quite complicated. We know some of their functionality; there is a lot we don't know. Neuroscience, biology, and anatomy are working on it, and we learn more and more. But the connections between neurons that build a network, the information, is basically managed by synaptic connections that are either in an excitatory or an inhibitory state.

And, of course, adult humans have around 86 billion neurons. We actually had many more; we have around 200 billion as kids. When we grow up, we lose many of them because we don't need them. We just become rigid and a little less able to learn, which is very strange.

But more than the neurons, we have the synapses, 100 trillion to a quadrillion connections, which is actually where more of the magic is. If you look at a zebrafish, you have 10 million to 100 million. If you look at a bee, you have 850,000 to a million. But even in cases where we know basically every single neuron, under a controlled laboratory environment, we still don't know what is happening. And people cannot even start looking at the concept of, what is consciousness? How does consciousness emerge from those neurotransmitting activities? A question that is probably too early for us.

The artificial neuron is then the abstraction of the natural neuron. It's a tough job to do. That was the beginning, in 1943, when McCulloch and Pitts came up with the idea of a mathematical model of neurons. Every time we came up with an abstraction of a natural phenomenon for the first time, something big happened. That was definitely the beginning of artificial intelligence: we came up with an abstraction and said, okay, the natural neuron looks like this.

A mathematical neuron should look like this: it should have some inputs and outputs. It should fire, which means a signal comes in and sometimes a signal goes out, sometimes it doesn't, and so on. So they called it the threshold logic unit, TLU. That means you have some inputs and some weights, the synaptic connections; they go into a function that just sums them up, and then into a limiter. Why a limiter? Because thousands and thousands of signals come in, and any biological or technical system has a capacity; you cannot just endlessly let things flow in. You need to limit things. That limiter, which nowadays we call the activation function, its purpose is: okay, you cannot just endlessly put electrochemical signals on me, or simply electronic signals. And then you calculate the output.
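
A minimal sketch of such a threshold logic unit in Python (the inputs, weights, and threshold below are illustrative values, not from the lecture):

```python
import numpy as np

def tlu(x, w, threshold=0.5):
    """Threshold logic unit: weighted sum of inputs, then a hard limiter."""
    s = np.dot(w, x)                     # sum of weighted inputs (the "synaptic" part)
    return 1 if s >= threshold else 0    # the limiter / activation

# Illustrative values: two inputs, two weights, an arbitrary threshold.
x = np.array([1.0, 0.0])
w = np.array([0.6, 0.4])
print(tlu(x, w))   # fires (1) because 0.6 >= 0.5
```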

From today's perspective, that seems trivial, but it was nothing short of a revolution that we came up with an abstraction of an artificial neuron. That requires a lot of insight on both sides; it requires talented scientists. So, in the artificial neuron, you have your inputs and outputs, and the weights, or parameters (the synapses), are, based on everything we know from neuroscience, responsible for learning.

So, the changes in those synaptic connections (or weights) in the artificial neuron are what constitutes learning. And, of course, you have a limiter, which we call a transfer function or activation function, to, again, limit. In biology you can easily go over the limit, and in technical terms you get an overflow. You cannot endlessly add things up; if you do, you go beyond the biggest number that we can represent, so you have to limit things. Simple things, simple limitations.

The artificial neuron was interesting as the core idea of artificial intelligence because, fundamentally, what it does is create a line. It draws a line. Any artificial neuron, the only thing it can do is draw a line. It doesn't matter how many inputs it has, two inputs or a million inputs, it cannot draw more than a line. It's just a line. So we can simplify it and say y = ax + b. That's it. We have w1 * x1 + w2 * x2; it's a line. If you go into multiple dimensions, it's not a line, it's a plane. If you go higher, it's a hyperplane. But still, we are talking about a linear concept.

And we do that because one artificial neuron is actually quite smart. It can, let's say, separate tumor types in the lung: lung adenocarcinoma versus squamous cell carcinoma. For a pathologist, for a physician, that's an easy job. This is a tumor type; it's not a big deal to say, "this is this type, this is that type," and so on.

But with a single artificial neuron, theoretically, we can draw a line and separate these cases. Of course, in the reality that we see here, we have two red cases, two lung adenocarcinomas, in the green area, which means you have some error, and depending on the number of features, that number of misclassifications will rise. So you cannot really do the job with one neuron. You need more neurons, and then the research gets started.

So, this side of your line gives a positive value, that side of your line gives a negative value. Very simple binary classification. You have a decision boundary, and it starts with one artificial neuron. Arguably the most valuable core idea in the history of computer science, I would claim, is the abstraction of the natural neuron: bring the abstraction into the computer so that we can work with it. But to separate those things is to understand them; you can separate oranges from apples if you understand, this is an apple, this is an orange. It's not easy; you need a certain level of intelligence to do that.
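
A small illustrative sketch of that decision boundary, checking which side of the line a point falls on (the weights and bias here are made up for illustration):

```python
import numpy as np

def side_of_line(x, w, b):
    """Binary classification with one neuron: the sign of w.x + b decides the class."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([0.8, -0.5])   # illustrative weights w1, w2
b = 0.1                     # illustrative bias
print(side_of_line(np.array([2.0, 1.0]), w, b))    # positive side of the decision boundary
print(side_of_line(np.array([-1.0, 3.0]), w, b))   # negative side
```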

The more complicated those objects get, the more complicated the decision model. How many lines can you draw to separate them? An infinite number, if you do it randomly. But you need one specific line, one that perfectly separates them, as far as perfection is possible. And that line, that blue line that you see here, is given by specific values for the weights, w1 and w2. So you have to find those weights, you have to find their values. How do we do that?

If this is just a light bulb and I put 10 million neurons together, and what is 10 million, right now we have billions: the large language models have gone over several billion parameters, which means several billion of these neurons. And now you want to adjust these values to make some sort of decision. How do you do that? There is no equation for it, there is no linear equation system that you can apply, there is no conventional mathematics for it. That's a tough job. That's what most researchers have been doing for the past 60, 70 years.

So, Hebbian learning started the research, mainly from a biological perspective: again, that the changing of synapses governs the learning. That's what we know. If the synapses change, we can learn something. The basic principle of synaptic plasticity is that a group of neurons can form together and change in strength in response to some sort of activity; you need some stimulus. So something changes, some response comes, and then a group of neurons, circuits, sub-networks, there are different terminologies for that, gathers together to do a specific task. And when you do that, associations can be formed.

If you feel warm, there is a certain association. What does that mean? I'm sleeping, I'm in bed, and I feel warm and cozy; that's an association. If I'm driving on the highway and a car passes me really fast, the association is that I have to be careful because I sense danger. And it goes up to much more complicated associations. So the relationship between a stimulus and the response becomes really strongly connected, and we learn that through changing synapses in specific sub-networks or circuits of the brain. So, the Hebbian Rule. I'm not using many equations, but this is a historic one, along with the one that comes after it.

The Hebbian Rule basically says that delta w_ij is the amount of change you need to make to the synaptic connection to learn something specific: Δw_ij = η · x_i · x_j. If you want to learn that smoking is bad, how much should I change? How many neurons should I change? Where should I change them? Very difficult. And then, of course, you have a parameter eta (η), which is just a learning parameter, a setting. And you have x_i and x_j, which are the activations of the presynaptic and postsynaptic neurons.

It is a very general rule: how much change the synaptic connection has to undergo is a direct function of what is happening before and after that neuron. That's basically what it says. Eta: you put eta to zero, you don't learn; you put eta to one, you completely trust your sensations. Well, we don't want to completely trust our sensations. There have been 300 years of discussion about human perception. Can we trust what we see? I don't know how many books we wrote about how I can trust what I see. Now we say yes, we can. But when you come to the conclusion, can you trust the conclusion as well?
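
A minimal sketch of a Hebbian update in Python, assuming scalar activations and an illustrative learning rate:

```python
def hebbian_update(w, x_pre, x_post, eta=0.1):
    """Hebbian rule: the weight change is proportional to the product of the
    presynaptic and postsynaptic activations, scaled by the learning rate eta."""
    return w + eta * x_pre * x_post

w = 0.2
w = hebbian_update(w, x_pre=1.0, x_post=0.8)   # neurons active together -> weight grows
print(w)  # 0.28
```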

Then we came to the Delta Rule, another core idea, in the 50s and 60s. It took the Hebbian Rule, which is a biological one, and said: okay, now I need an abstraction, I have to bring it to the computer. Presynaptic, postsynaptic doesn't cut it in computer science; I have to make it more concrete. How do I do that?

Well, you make it more concrete and say: the amount of change that I have to make to one weight. One weight, not even one neuron. One weight. So you see, when you do abstraction, you have to simplify things to the painful limit, because otherwise you cannot establish anything. If I want to change one weight, and x is the input of the neuron, how should I make that adjustment? The adjustment is driven by d − y, where y is the output you calculate and d is the desired output. So the difference between the desired output and the actual output, d − y, is the driving force: the error.

That's a gigantic abstraction. That's a fantastic core idea. Trivial from today's perspective; every bachelor student should know this, not even a master's or PhD student or a professor. But from the perspective of the 50s and 60s, to take that step, to come from the Hebbian Rule to the specific, explicit Delta Rule, that was a gigantic step. And we said: okay, if I want to change this weight, I have to know how much it contributes to an error. How do I know the error? I know the desired output.

Do we know the desired output as human beings? Well, the desired output is whatever gives me a cozy feeling. I'm not hungry, I'm safe, I have food, nobody's trying to kill me; that's the desired output. Very vague and general, but everything we do, we do to get that desired output. In computer science, we have an easy life. We have an Excel file, a CSV file. We have the inputs, we have the outputs. Okay, that's the desired outcome. Calculate that.
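
A minimal sketch of the Delta Rule for a single weight, with illustrative numbers (a one-weight linear neuron is assumed for simplicity):

```python
def delta_rule_update(w, x, d, eta=0.1):
    """Delta rule: adjust the weight in proportion to the error (d - y) and the input x."""
    y = w * x                       # actual output of this simple one-weight neuron
    return w + eta * (d - y) * x

w = 0.0
for _ in range(50):                 # repeat small adjustments
    w = delta_rule_update(w, x=1.0, d=0.5)
print(round(w, 3))                  # approaches 0.5, the desired output for x = 1.0
```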

So, we get to 1958. Now Frank Rosenblatt comes in and provides a schematic representation of the connections in a simple perceptron. Okay. What happens now if I put some of those neurons together and I want to teach it something? The first type of neural network; this is the next level of abstraction. If I take it and I want to make it functional, what does it take to make it a perceptron, to perceive something, to have an automaton? There, the word has died; nobody uses the word automaton anymore because it comes from symbolic AI.

And then, make it do something specific. It was such a revolutionary idea that the New York Times, around that time, wrote that it was "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence". Oh my God, talk about exaggeration. Just because of an abstract idea by Frank Rosenblatt, they thought we could now create a machine that can walk, talk, see, and so on. We cannot even do that in 2024; we can do some of it in a very controlled environment, with a lot of assumptions, with a lot of mistakes, yes. But that was 1958. Such great expectations. No wonder we got the first AI winter sometime after that.

But you cannot do that unless you have a way of learning. How do you learn? If every neuron is a light bulb and you put a billion of them together, and each one of them has inputs and outputs, how do you adjust billions of weights? So, in 1959 came the idea of using something that was later called Gradient Descent, another core idea in artificial intelligence. Minimizing the error: okay, we knew that, people told us. McCulloch and Pitts told us, and Widrow told us with the Delta Rule, that we should minimize the error, so you have to make a small adjustment.

From day one we knew that we cannot make big adjustments; that's one of the striking wisdoms of scientific research in AI. From day one we didn't know how to do it exactly, but we knew we have to make small adjustments. Because if you have a billion parameters and you make big adjustments, you will oscillate. You will go from good to bad, bad to good, good to bad. There will be no equilibrium. So if you want to get somewhere, you have to make small adjustments, you have to go slow. You cannot do that if you make big changes.

You have to do it in iterations, of course; it takes time to decrease the error and then learn something. And then you have to automatically learn and improve. All of that, from today's perspective: trivial, big deal, of course you're doing this, that's machine learning 101. Yeah, I know, but this is 1959, not 2024.

Gradient Descent is interesting because its core idea looks at a diagram that we don't see. You do not see this curve; this curve is hyperdimensional, and we usually don't see it. We see a very simple copy of it, but we don't see it in action. But what we imagine learning should look like is this. The abstraction is this: you start somewhere and you train.

What do you do at the beginning? All the weights of your fantastic neural network with a billion parameters get random values. If this is the Milky Way galaxy, you start somewhere in the Milky Way galaxy and you have to get somewhere specific, to a specific address. You take some steps, learning steps, and you start decreasing the error, if you can move in the right direction. How do you know in which direction you should go? Big question. Gradient Descent should provide some answer, and then you converge to a solution when you get to the minimum error.

Can you see the minimum error? Are there other low values that may deceive you into thinking they are the minimum error? Yes, of course, you can get stuck in a local minimum all the time. We didn't know that back then. So, if you can put this together and you can do this, then you can put things together and make a multi-layer perceptron. Neurons, perceptrons; now I can make a multi-layer perceptron.
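
A minimal sketch of gradient descent on a simple one-dimensional error surface; the surface E(w) = (w − 3)^2 and the learning rate are illustrative, not from the lecture:

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Iteratively take small steps against the gradient of the error."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)    # small adjustment in the direction that lowers the error
    return w

# E(w) = (w - 3)^2 has its minimum at w = 3; its gradient is 2 * (w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=-10.0)
print(round(w_star, 3))          # converges near 3 if eta is small enough
```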

Now it's the 1960s and 1970s. And we can say, okay, I need an output, and then I can put several layers of neurons together. Why layers? Because this is the simplest thing we can do. To this day we are keeping this, and that's one of the problems that maybe I will come back to at the end. We are going feedforward. We are still doing feedforward, and we will be doing feedforward for the foreseeable future. That's a major restriction we have in artificial neural networks; there is no solution for it at the moment. Well, no practical solution.

So, you have an output layer, you have a hidden layer, you have an input layer. Now you have layers, so you have a multi-layer perceptron. When we say perceptron, it could be an individual neuron, but it does something. It perceives. It's an automatic machine that learns something. It can do something.

So, data comes, feeds forward, goes through it, and you make a decision: you take some input and you do something, you say something, you make some decision of some sort. Backpropagation was the first approach, the first idea, the next abstraction, to implement Gradient Descent. How can I move toward the minimum error, get to the minimum error, and be at the minimum error for everybody? To satisfy all the neurons that I have, all the millions and billions of neurons that I have.
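
A minimal sketch of one feedforward pass through a small multi-layer perceptron; the layer sizes, activations, and random weights are illustrative:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Feedforward pass of a two-layer perceptron: input -> hidden -> output."""
    h = np.tanh(W1 @ x + b1)                 # hidden layer with a squashing activation
    y = 1 / (1 + np.exp(-(W2 @ h + b2)))     # sigmoid output for a binary decision
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                            # four input features
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)     # hidden layer of three neurons
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)     # one output neuron
print(mlp_forward(x, W1, b1, W2, b2))                 # a value between 0 and 1
```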

Paul Werbos came up with the idea in 1974, and as it goes in the history of science most of the time, it's not necessarily the first person who comes up with the idea who solves the problem. He or she will struggle with the first challenges, say something, establish something that is imperfect, it's not working, and it gets forgotten. And then maybe some years later, some smart PhD student on the other corner of the planet picks it up and says, this is actually a fantastic idea, let's work on it. We will come back to this.

When Rumelhart picked it up with Geoff Hinton, the gentleman that Herna was mentioning, with the Nobel Prize, they looked at it again. The backpropagation of 1974 was not a practical backpropagation. The main idea was: take the error that you have at the output and put it back into the network. How? Well, I don't know how, but you have to do this. So it was not explicit, and when ideas are not explicit, they don't get anywhere.

So, backpropagation is basically the backward propagation of errors based on a loss function. You can call it an error function, an energy function, anything else you want, with respect to all the weights. That's the core idea: all the weights. The error with respect to all the weights. You cannot leave some weights behind; we will see, when we get to the problem of vanishing gradients, that that can be a gigantic problem. You cannot learn with half of your brain. These are crucial ideas. They sound and seem trivial, but they are not. It is absolutely paramount to do things in a specific way. And then you update the weights.

So: minimizing the loss function, the error, to update all the weights. His work did not receive immediate recognition, Paul Werbos, but it was rediscovered, and we will come back to talk about it. Without backpropagation, to this day, nothing would work. From 1974 to 2024, we still don't have any other way of doing it, of learning. That's amazing. We need some of those smart PhD students to come up with new methods. We do not have an alternative to backpropagation.

We are talking about feeding forward, feeding backward. We have some ideas, very theoretical, but no practical results yet. Then we got to the Boltzmann machine. The Boltzmann machine is a very, very ambitious core idea in artificial intelligence, because the Boltzmann machine, a connected graph, actually resembles how the human brain is. In three dimensions, you have billions of light bulbs and they are all connected in space. It's not layer after layer in two dimensions, no, no, no. In space; that's how the human brain is, in three-dimensional space.

But if you have that, if you have a fully connected graph, that goes back to the work of the Enlightenment, starting with Euler, and calculus by Newton and Leibniz, and all those loss functions. AI would not have been possible without those people. It's very interesting that the reverberation of the Enlightenment is still reaching us. It's like the echo of the Big Bang: about 2% of the noise that we can hear is still coming from the Big Bang. It's amazing. There would have been no AI without the work of Enlightenment scientists. No AI.

Connected graphs go back to Euler; without that fundamental work we wouldn't be here. But when we talk about a graph like this, we only talk about visible and hidden neurons, because in a layered network, we know this is the input and this is the output. But if you have a connected graph in three-dimensional space, things get very complicated. How should I learn here? Boltzmann machines are stochastic machines, basically, that follow probability distributions. And everything that we do right now, the underlying things, we have linear algebra, we have probability theory, and every artificial neural network that we have is an automated manifestation of either linear algebra or probability theory.

We have not invented a new discipline of mathematics for all that; we are figuring out how to do it in an automatic, methodical way. So, Boltzmann machines are energy-based, with an energy function, and we look at the likelihood of the observed data. How likely is it that this belongs to this class? And, of course, limited capability, because the MLP is still widely used, but how should I train a three-dimensional graph? I don't know where the beginning is, where the end is. The multi-layer perceptron is easy: come here, go there, and I can go back; I can follow things, the trajectories are clear, nothing is stochastic.

We are scared of being random. The god of randomness is merciless; you cannot really deal with that. So we need some level of deterministic behavior if we want to be able to use this. But the MLP had a big problem, and we needed some core ideas to get over the limits of the multi-layer perceptron in the 1980s and 1990s. The problem was: if you have complicated data, like, say, tissue images, a biopsy case, we would design, manually, some features that describe that complicated data.

Is it molecular data? Is it imaging data? Is it any type of lab data: blood, whatever you have in medicine? You would manually design something that grabs some features you think are important. We would start: okay, what is the average? What is the standard deviation? What is the color? Give me the color histogram, give me the skewness of the histogram, things like that. And then you would feed those features into an MLP and expect the MLP to tell you: okay, this is breast tissue, this is lung tissue, this is adenocarcinoma, this is whatever classes you were interested in. And we were using it mostly for classification.

And, of course, this was not working. You are manually trying to get the essence of the data, and no network can be better than the features that you extract. That was a major problem; we needed some core idea to take the control of feature extraction away from the user. Don't touch the data. Give me the raw data. You mess it up. You bring your bias with your limited knowledge. You don't understand the data anyway. You have 10,000 parameters; how do you want to know which features are important? You don't know. We don't know.

Look at the papers published in the 1980s and 1990s. A lot of papers. You could do a PhD just writing an edge detector for images. A 3x3 matrix: I designed it like this, minus one, minus two, minus one, there you go, you have a PhD thesis. Problems were difficult; it's not as easy as it looks from today's perspective. So, feature extraction was a problem; we needed some core idea to go after that.

The core idea came in 1980: the Neocognitron. And the core idea was this: take feature extraction away from manual design and give it to the network. The network should decide what features are important in the data, based on what you want from the data. Because this feature may be important for classification, that feature may be important for prediction; we don't know. The Neocognitron was a fantastic idea, the predecessor of the convolutional neural network, and you had feature-extracting S-cells and pooling C-cells. The data comes, you extract features, you do pooling, and then you make your decision.

Why do you need pooling? It's like a limiter. When you extract features, it's a lot. We cannot handle it, not as humans, not as machines. We always need compression, we always need reduction, because it's too much. We are limited biological and technological entities; we cannot deal with that level of data. We always need limiting, compression, encoding, pooling, which is another fancy word for making it small. Just make it small; it's too much.

The Neocognitron was based on the ideas of local receptive fields, hierarchical feature extraction, and spatial invariance, and it paved the way for CNNs. However, it failed. It was a core idea, it opened the door. People realized: oh yes, we should look at this; the network should get the features itself. When you look at the picture that forms on the retina, the signals from the rods and cones go to the optic nerve and then through the optic radiation to the visual cortex at the back of our head. Nobody filters that, nobody does anything.

The signal goes there, and something happens along the way so that we encode the visual information in some way, because it's a lot of information. If I just look at you in high resolution, it's a lot of information. How do you do that? We didn't have effective learning algorithms yet in the 1980s. The computational complexity was huge. We didn't have training data. You needed to configure this manually; you cannot do that. And we did not fully understand the theoretical implications to be able to do it.

So, in 1982, the core idea of self-organizing maps was born: maybe I bring in the idea of collaboration and competition. And if I arrange the neurons not as layers but as a map, that's a core idea. If I do that, then for every class, for every category, for every concept, I try to localize a winning neuron, a neuron that represents that problem, that class. But in its neighborhood, the knowledge will also be shared. And that neighborhood will be crucial for understanding the vague boundaries of the data.

In 1986, probably the most decisive core idea of AI was born, which was going back to Paul Werbos and rediscovering backpropagation, but this time going into detail and saying: let's make it work. So, inputs come in, you have a layer, and then you have another layer, everything is fully connected, and you bring it to the final neuron and you make a decision. The question is: how do you do that? Okay, you take the error, which is the calculated output minus the desired output, and you put the error back into the network, one by one. There was some mathematics put forward in 1986, in contrast to what was done in 1974, saying we should do it this way. But it was still not entirely clear exactly how we should do it.

So, if I have an artificial neuron, and I have my inputs and some weights, and I sum them up and put them through some limiter, I get an output. And I'm not happy with the output, because the output is different from what I want to see. Because I have data, and we ignore this question all the time: the data is human knowledge. We are trying to learn human knowledge, which we assume is reliable, is free from bias. And people now talk about AI being biased. Of course AI is biased, because we are learning from human society. We are racist; AI will be racist. Very simple. You want to solve AI bias? Go solve human society's bias, which is a bigger problem, a much bigger problem. Let's deal with that.

So, if I have the difference between the actual output and the desired output, then I have to put it back and calculate the proper amount of adjustment to the weights of one single neuron: delta 1, delta 2, delta 3. But that's a difficult task, because how much should I change? How should I decide? So, backpropagation, or simply backprop, is for training neural networks, minimizing the error between the predicted output and the actual output via the gradient, the change of the loss function. How is the loss changing as I move toward the minimum error? And the problem, which we are still dealing with in computer science for different types of problems, not necessarily artificial neural networks, is the credit assignment problem, one of the toughest problems in computer science.

So, 1986, again, backpropagation. If I have a network, I send the entire dataset through it and I get a total error of 125.65, whatever that means. The error doesn't have a unit. What is the perfect error? Zero. I want you to learn whatever I teach you: zero. So a big number is bad news, it's a lot of error; I want the error to go toward zero. So I take that error, and I want to make changes to the weights, to the synaptic connections. But who is responsible for the error?

Now I'm training ChatGPT. I have more than a billion parameters that I have to adjust. Who is responsible for that? Who gets the credit for the error, or the punishment for the error? The output of a network is the result of many interacting neurons and connections. Things are superimposed; you cannot separate them easily. There's a lot of overlap, and that's actually the reason AI can generalize in many cases: because we have overlap. A single neuron's input contributes to many, many different decisions at the same time, with different strengths. That's the actual strength.

So, the weight adjustment, that's a core idea, and this is probably the second and last time that I show equations here. The weight adjustment is Δw = −η · ∂E/∂w. The minus sign means the opposite direction of the error growing; I have to go in the other direction, because I want to minimize the error. Eta is the learning rate. Zero: I'm not learning. One: I'm learning 100%, which is dangerous, because that means whatever comes, you trust it. No, don't trust it. So we may start with 0.5, in the middle.

It's just a manual setting. And then there is the gradient, the actual gradient, the changing part. That's the core idea. I always tell my students: if you don't understand this, don't write a PhD in computer science. I know this is basic, I know. But do you get this? Do you understand what it means? We can rewrite it with the chain rule, ∂E/∂w = (∂E/∂y) · (∂y/∂w): how much the error changes when the neuron's output changes, the error as a function of the output, and how much the neuron's output changes when the weight changes. The chain rule, differentiation. Thank you very much, Sir Isaac Newton. Thank you very much, Enlightenment. Again, I'm not exaggerating: whatever we have right now is still reverberating from what was established during the scientific revolution and the Enlightenment.
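
A minimal sketch of that chain rule for a single sigmoid neuron with a squared error, written out term by term (the inputs and desired output are illustrative):

```python
import numpy as np

def backprop_single_neuron(x, w, d, eta=0.5):
    """One gradient step for a single sigmoid neuron with squared error,
    written out with the chain rule: dE/dw = dE/dy * dy/ds * ds/dw."""
    s = np.dot(w, x)                 # weighted sum
    y = 1 / (1 + np.exp(-s))         # neuron output
    dE_dy = y - d                    # from E = 0.5 * (y - d)^2
    dy_ds = y * (1 - y)              # derivative of the sigmoid
    ds_dw = x                        # how the sum changes with each weight
    grad = dE_dy * dy_ds * ds_dw     # chain rule
    return w - eta * grad            # step opposite to the gradient

w = np.zeros(2)
for _ in range(200):
    w = backprop_single_neuron(np.array([1.0, 0.5]), w, d=1.0)
print(w)   # the weights move so the output approaches the desired value 1.0
```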

So we have this, and we can have an answer to the credit assignment problem. We can say who is responsible for the error: you are less responsible, you are more responsible. If you are more responsible, I make big changes; you have a completely wrong weight, so I have to make a drastic change, so your delta will be 0.75. And you are not responsible, okay, your delta will be 0.02. Problem solved. Well, maybe; we hope so. Before backpropagation, it was a dead end, and backpropagation brought a lot of enthusiasm in the late 1980s and the beginning of the 1990s, when I got started in artificial neural networks.

But we had a persistent challenge, and again we needed some core ideas to solve it, through the 1980s and 1990s. If you had three layers, you could easily train them, but you could not do anything exciting with them. You could not do face recognition, signature recognition, you could not recognize cancer, you could not do anything with three layers, but you could easily train them. You realize you can do really interesting things if you have more than five layers. But you go to five layers and you cannot train them.

What do you mean you cannot train them? They would not converge. The training would go on and on and on; it doesn't stop. You would not get to the minimum error and stop. And if you stopped prematurely, the output was just messy and chaotic. There was no deterministic reason. That was a challenge. We got a second AI winter because of that, because everybody said: we are spending so much money and AI cannot even recognize signatures. Who cares? Let's get some edge detector and some other algorithm. AI is not working, it's a dead end. You could not get research money in the late 1990s and the beginning of the 2000s if you were an AI researcher.

Okay, you should go do symbolic AI, maybe fuzzy logic, because the Japanese were doing that, but not artificial neural networks. That was a dead end, in spite of backpropagation, in spite of the multi-layer perceptron, because we could not, even with backpropagation, with so many core ideas, build a large network. And you can only become intelligent if you have many neurons. We are smarter than a zebrafish, I would claim. It's a huge philosophical claim that we are smarter than the fish; I don't know if we are, but we have more knowledge, let's say. We have more knowledge than the fish because we have a bigger brain.

So backprop was the basis for deep networks. In the 2000s and 2010s, backprop became central to training artificial neural networks. I don't want to say too much about recurrent neural networks. You sometimes need a memory of what came before, and unlike feedforward networks, you need some directed cycles to learn what came before. In recurrent neural networks, basically, you have a connection from a neuron to itself. I don't want to spend too much time on them, for different reasons, among others because recurrent neural networks have not been as successful as feedforward networks. That doesn't mean we will drop them. No, no, no.

I won't say anything about the LSTM, another core idea and the predecessor of Transformers, just because we don't have time. But if you have connections from a neuron to itself, it creates problems for learning. If you're telling yourself something, you need backpropagation through time, and you get vanishing and exploding gradients. So recurrent networks cause theoretical and practical problems. But you need recurrent networks; you need a connection to yourself, you need to say something to yourself: what was before? Go back to a core idea of computer science, the Markov Decision Process.

How many steps should I keep to know what comes next? Is our Markov Decision Process of order one? If I only know what came immediately before, what can I say? Can you analyze my behavior based just on the last hour, that I had some food and some coffee? Or do you need to know how my childhood was? That's very different. That's the difficulty of sequence processing: how many steps do I need to go back?

Recurrent neural networks laid the foundation for that, but we won't go much into them. I want to get to 1989, the beginning of everything, when the Convolutional Neural Network came as the improvement of the Neocognitron. Now we have something practical that says: you know what? Take an input. Apply convolution, which is filtering. Apply max pooling. Repeat that: convolution, pooling, convolution, pooling, convolution, pooling. And then you can use your fully connected multi-layer perceptron to make the decision.

So again, convolution, which is another word for filtering, which means: extract what is important. That's what it means. How do you know what is important? You tell me what you want, and I will extract whatever is salient for generating your desired output.

And 1989 was not the boom, but it was the theoretical breakthrough. It brought us three secrets, three core concepts.

SLIDE: Success Secrets of CNNs

  • Learning to filter
  • Weight sharing
  • Divide & Conquer

Learning to filter. Before that, again, we had manually designed filters. In signal processing and engineering, a big part of the work was filter design. You would sit down and design filters on paper. That didn't go anywhere in the artificial intelligence domain.

Weight sharing, sorry for the typo on the slide, weight sharing, a core concept, and Divide and Conquer. Three core concepts that helped us. Okay, hopefully you can hear me now; sorry about that. Let's continue. So, we talked about filtering: you have an image and you apply a filter. And those filters, back in the 1980s and 1990s and even into the 2000s, were usually designed manually. Again, you would sit down and design them based on signal-theoretical considerations.

But now, if I take that out, I have one set of filters for the entire image. Your image could be 1,000 by 1,000 pixels and your mask is 3x3 or 5x5. If you learn that filter, you apply it to the entire image; that's weight sharing. You learn one filter of 3x3, nine values, nine synaptic weights, and then you apply it to a million-pixel, 1,000 by 1,000 image. Weight sharing. A very powerful idea. Not only are you learning the features yourself, you are also doing it in a very efficient, fast manner.
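
A minimal sketch of weight sharing: one small kernel slid over a whole image. The loop version is written for clarity, and the image here is shrunk so the sketch runs quickly; the same nine shared weights would cover a 1,000 by 1,000 image just as well:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide one small learned kernel over the whole image: weight sharing."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # the same nine weights are reused at every position
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(100, 100)        # shrunk stand-in for a 1,000 x 1,000 image
kernel = np.random.rand(3, 3)           # only nine shared weights to learn
print(convolve2d(image, kernel).shape)  # (98, 98) feature map
```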

And we finally figured out how to apply the core idea of Divide and Conquer to artificial neural networks. It's not an easy thing. One of the major strengths of any computer scientist is: if I give you any problem, can you tell me how you would apply Divide and Conquer? Not easy. There is no general recipe for that. You may spend years of research on it before you say, oh, okay, we should do it this way. It took us more than 20 years to do it for the Neocognitron.

So, you take the image and you apply filtering: convolution and pooling, convolution, pooling, convolution, pooling. Why do you do it repeatedly? Because you want a hierarchy of filters. You want features at different levels, at different scales. If I look at a forest from a distance, or I look at the bark of a tree close by, I see different things. We want to see all of that, so we need filtering at different scales, and then we need to compress at different scales. And when we get to the bottom of it, then I can just classify. Which means: I do feature extraction with repetitive use of filtering and pooling, and then I use an MLP, the good old-fashioned MLP.

To this day, we are still using an MLP for the final decision making. So this is the Divide, and this is the Conquer. It took us 25 years to figure that out, and it took an additional 10 years to make it practical. You figure things out theoretically, but that does not necessarily make it into practice; it still needs time. And then pooling, of course, the internal divide. So you pool. If you look at the blue block, 3, 4, 2, 3, the maximum is 4. Look at the green one, 6, 5, 6, 7: max pooling gives you 7.
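
A minimal sketch of 2x2 max pooling using the numbers from the lecture's example (the rest of the feature map is filled in arbitrarily):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling: keep only the strongest activation in each block."""
    H, W = feature_map.shape
    return feature_map.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fm = np.array([[3, 4, 6, 5],
               [2, 3, 6, 7],
               [1, 2, 0, 1],
               [4, 8, 3, 2]])
print(max_pool_2x2(fm))   # the 3,4,2,3 block gives 4; the 6,5,6,7 block gives 7
```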

Why the maximum? Well, in artificial neural networks, as with natural neurons, those that fire together, wire together. If you have a signal, the maximum signal is the maximum activity, so it has some biological equivalence. That's why we use max pooling as something meaningful. And, of course, you reduce from 16 values to four values.

Then, 2006: the Restricted Boltzmann Machine. An absolute core idea. That's the beginning of the actual revolution. At this point, CNNs still cannot be applied; we have the CNNs, but we cannot apply them, we cannot use them.

The Restricted Boltzmann Machine is a crazy idea. It's one of those crazy ideas, like if you have spent some time with support vector machines and you understand what the kernel trick means: you cannot solve the problem in this dimension, so you bring it to a higher dimension, which is more complicated, and now you can solve the problem. It's a crazy idea. The Restricted Boltzmann Machine is crazy at the same level.

So, look at a Boltzmann Machine, which is a stochastic, probability-distribution machine that we cannot use. It is a much more realistic abstraction of the human brain, but we cannot make it work; there is no way for us to apply backpropagation to three-dimensionally connected light bulbs. We would love to do that, but how? Now, if we designate hidden nodes and visible nodes, remove the internal connections between the hidden nodes, remove the connections between the visible nodes, pull the visible nodes out and the hidden nodes out, and keep the remaining connections: that's a multi-layer perceptron. No, no, no. That's a Restricted Boltzmann Machine.

Why? Because the Boltzmann Machine was good at learning probability distributions. I just put some conditions on it; I can still do everything with it. Like what? Like learning a probability distribution. Is that new? In artificial neural networks, yes, it is new. And it's interesting that if you are not deep in the subject matter, we have difficulty understanding the innovation. That's the struggle of the person who innovates: he has to make the rest of us understand. This is new, and yet it looks like a multi-layer perceptron, but it is not, because it can learn a probability distribution.

So, when you go from the visible layer to the hidden one, the hidden one should basically be smaller than the visible one, but it should learn the same probability distribution. What is the task? No task. You just learn the distribution. A fundamental shift in artificial neural networks. Geoff Hinton has made a lot of contributions, but I think this one deserved the Nobel Prize, because this is what made it possible for us to train deep networks. Without the Restricted Boltzmann Machine we would not have been able to train any deep network, including ChatGPT. Any deep network.

So, Restricted Boltzmann Machines were used to pre-train deep networks in an unsupervised manner. Keyword: 'unsupervised'. You don't need outputs to train networks if you use Restricted Boltzmann Machines. And he also came up with the Contrastive Divergence algorithm to train them in an unsupervised manner. This connects to the Autoencoder, another innovation from Geoff Hinton's group, developed in the 1980s and 1990s.

The idea of the Autoencoder, again, sounds like one of those things where you say: for heaven's sake, why would you want to do something like this? If you look at the concept of the Autoencoder, it's there to compress representations, to compress vectors. We use fancy words: vectors. You want to compress vectors, any vector that has semantic meaning, not just random numbers. And you want to use backpropagation. You can do dimensionality reduction and feature learning.

So, you have a vector that represents something, whatever it is. You want to compress it, compress it, compress it, and then you want to build it back up and recreate the same vector. X goes in, X comes out. Why are you wasting my time and computational power? What is the benefit of that? The benefit is the bottleneck, the representation, the smallest compression that you have. If you have 10,000 gene expressions and you go 5,000, 2,500, 1,200, 600, 300, now you have brought 10,000 down to a few hundred values. Wow. Now we can compute.
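
A rough sketch of the autoencoder idea: compress step by step to a bottleneck, then mirror the (here untrained, random) weights to build the vector back up. The layer sizes are shortened from the lecture's chain to keep the sketch light, and the actual training via backpropagation is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder sizes: 10,000 -> 1,000 -> 300 (the lecture's longer chain
# 10,000 -> 5,000 -> ... -> 300 works the same way, just with more steps).
sizes = [10_000, 1_000, 300]
enc_weights = [rng.standard_normal((n_out, n_in)) * 0.01
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def encode(x):
    for W in enc_weights:
        x = np.tanh(W @ x)        # compress step by step
    return x                      # the 300-dimensional bottleneck code

def decode(z):
    for W in reversed(enc_weights):
        z = np.tanh(W.T @ z)      # mirrored (transposed) weights build x back up
    return z

x = rng.standard_normal(10_000)   # e.g. one gene-expression profile
z = encode(x)                     # compact representation
x_hat = decode(z)                 # reconstruction; training would make x_hat close to x
print(z.shape, x_hat.shape)       # (300,) (10000,)
```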

Oh, okay. Now you've got me. It's about representation. But, like multi-layer perceptrons, we still could not train this; we needed additional key ideas. And this is where RBMs came in. How do we train them? These two layers are an RBM, a Restricted Boltzmann Machine. These two are an RBM. These two are an RBM. These two are an RBM. What just happened? We just figured out a way to train a deep network, in an unsupervised way. Wow!

To this day, we don't have any other way; it doesn't matter what network you use. Is it a CNN? Is it a Transformer? It doesn't matter. You have to do this, otherwise you cannot start training. That's the core idea that really changed everything. And when you do this, when you go through multiple iterations and you look at each pair of layers as being an RBM, they just have to learn. What does that do? It syncs them in the hyperdimensional space to speak the same language. Okay, but what language is it? We don't care; they just speak the same language. Now I start learning in a supervised manner. They are synced. It runs smoothly, like a Swiss clock.

An amazing idea. And after you learn that, you just mirror the weights to the other side. Another key concept. You learn half of it, then you mirror it, do a transposition, and bring the weights over. Now you have a perfectly ready-to-learn neural network. It doesn't matter how many layers you have; 500 layers, who cares? We can do it because they are synced. A fantastic idea. You can sit on it for days and think about it. It's like the hopeless poets who sit down and look at a tree and say, oh my god, this is so beautiful, and nobody else sees that beauty.
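
A compact, simplified sketch of that layer-wise idea: each pair of layers is treated as an RBM and nudged with one step of contrastive divergence (CD-1). Biases, mini-batches, and many practical details are left out, and the numbers are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1 / (1 + np.exp(-a))

def train_rbm(data, n_hidden, eta=0.05, epochs=10):
    """CD-1 sketch: nudge weights toward the data statistics and away from
    the model's own reconstructions. Biases are omitted to keep the idea visible."""
    W = rng.standard_normal((data.shape[1], n_hidden)) * 0.01
    for _ in range(epochs):
        for v in data:
            h = sigmoid(v @ W)                 # visible -> hidden
            v_recon = sigmoid(h @ W.T)         # hidden -> reconstructed visible
            h_recon = sigmoid(v_recon @ W)
            W += eta * (np.outer(v, h) - np.outer(v_recon, h_recon))
    return W

# Greedy layer-wise pretraining: each pair of layers is one RBM, and the
# hidden activations become the "data" for the next RBM up the stack.
data = rng.random((100, 64))          # 100 toy samples with 64 features
layer_sizes = [32, 16, 8]
weights, layer_input = [], data
for n_hidden in layer_sizes:
    W = train_rbm(layer_input, n_hidden)
    weights.append(W)
    layer_input = sigmoid(layer_input @ W)   # feed forward to pretrain the next layer

print([W.shape for W in weights])     # [(64, 32), (32, 16), (16, 8)]
```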

So, RBMs are definitely one of the core ideas, in connection with Autoencoders. From there, of course, you get Variational Autoencoders, the beginning of generative AI. Not possible without RBMs: you cannot train a Variational Autoencoder, the beginning of generative AI, without pre-training in an unsupervised manner.

The success of CNNs. So, finally, the success comes. That's 2012. Finally, the AI community has something to show. AlexNet beats everybody else, beats all other models and algorithms on ImageNet: 1,000 classes, 1,000 objects, from cats and dogs to airplanes and pedestrians, and brings the highest accuracy in object recognition. Computer vision, one of the toughest parts of human intelligence to imitate. Finally, in 2012, we can do it. Because of all of that: Neocognitron, CNN, Autoencoder, Restricted Boltzmann Machine, abstraction, backpropagation. All of that flows together and you train a CNN to recognize cats and dogs.

Then, Generative Adversarial Networks, GANs, of course. Now we don't want to just classify data, we want to generate data. Classification is: you just learn where to draw the line that says this is an apple, this is an orange; this is cancer, this is not cancer. But generation works in a different way. And again, we go back to RBMs. Without RBMs we wouldn't be here, because RBMs brought us back from the clouds of artificial neural networks to the solid ground of probability distributions.

Probability theory is the strongest theory we have. It brought us to the moon; nothing else brought us to the moon. The Kalman Filter brought us to the moon, if that wasn't a hoax according to some people. But I think we went to the moon; we had the technology to go to the moon. So, you learn the distribution and you generate more data. What is it good for? When that was introduced, people said: what is this good for? I have my data; you want to give me more? Now we understand. When you ask ChatGPT, write me a poem about somebody who is in prison and is lonely, okay, it's not the best use of ChatGPT, but now you understand what generation means.

I asked ChatGPT: give me another tissue image that is an intersection between popularity and adenosis. No other technology can do that. What is it good for? Research, education, fundamental research. So, GANs are interesting because, again, you have a real tissue image and then you have some random noise, and it's amazing: from random noise, you want to generate tissue, generate new tissue. How can you turn noise into tissue, into an image, into something that makes sense? And then you give it to the discriminator, and the discriminator has to say: this is fake, this is real. A core idea, the Generative Adversarial Network. A core idea: to double check.

Going back, maybe, to the idea of self-organizing maps, where we combined collaboration with competition: this is another manifestation of a sort of opposite learning. You bring in the opposite: this is real, this is fake; can you figure it out? Because at the beginning it's easy: noise is noise, everybody says noise, noise, noise. But you get to a point where you cannot say whether this is real or fake. There's that website, I'm sure most of you have seen it, with fake faces, people who do not exist, but the photos are there.
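
A toy one-dimensional sketch of that adversarial game, with a linear generator and a logistic discriminator and hand-derived gradients; all numbers are illustrative, and real GANs are far more elaborate:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1 / (1 + np.exp(-a))

# "Real" data: samples the generator has never seen directly.
real = rng.normal(4.0, 1.0, size=1000)

a, b = 1.0, 0.0      # generator g(z) = a*z + b turns noise into candidate samples
w, c = 0.1, 0.0      # discriminator D(x) = sigmoid(w*x + c): "real" (≈1) or "fake" (≈0)
eta = 0.01

for step in range(20_000):
    z = rng.normal()
    x_fake = a * z + b
    x_real = real[step % len(real)]

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0 (gradient ascent).
    d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    w += eta * ((1 - d_real) * x_real - d_fake * x_fake)
    c += eta * ((1 - d_real) - d_fake)

    # Generator step: change a, b so the discriminator calls the fake "real".
    d_fake = sigmoid(w * x_fake + c)
    dx = (1 - d_fake) * w          # gradient of log D(x_fake) with respect to x_fake
    a += eta * dx * z
    b += eta * dx

print(round(a, 2), round(b, 2))    # the generator drifts toward producing samples near the real data
```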

The workhorse of artificial intelligence, the final core idea we have had so far. We went beyond the Autoencoder, beyond the Variational Autoencoder, beyond Convolutional Neural Networks. We came to something bigger, at least according to the Google guys: we came to Transformers.

That's the first time, if I'm not mistaken, that a major core idea is not coming directly from academia; it's coming from corporations. That's a new era in scientific research. It has implications, and we see the implications. Now the major language models, the major AI models, the foundation models, most of them, if not all of them, the reliable ones, are coming from industry. That's good, and it's a point of concern, because in scientific research we are driven by other factors. Yes, academic ego as well, but, in the overall picture, the benefit of human society.

Most academic institutions are paid with taxpayer money. We are not supposed to work for profit; we get a salary and we work. Yes, we are driven, again, by our own ambition. And I'm very thankful that we have good companies and they are doing all this, fantastic. But it's a new trend that core ideas are not coming from academia anymore. That's a problem; at the moment I see it as a problem. Maybe it's a blessing, I don't know.

You have seen this picture in the seminal paper of 2017, 'Attention Is All You Need', and that was a revolution in 2017. The problem was, we had two problems. One was long-range dependency. If you read a novel by Agatha Christie, you read 200 pages, and in the final sentences they find out Jim was the killer. So you have to go back: what was the dependency? What happened?

If you want to write novels like Agatha Christie, let's say, if you have that ambition, I don't think ChatGPT can help you; you need talent for that. But what is the dependency when things come at the very last part? How do I find the dependency and trace it back, and say: in order to get to that surprising effect, I have to arrange things in this way? Think about the correlations in the complicated, multi-dimensional data we are dealing with. But the vanishing gradient problem was the bigger problem.

When you look at long-term dependencies, large sequences of data, text documents, molecular data, omics, gigantic images, how do you handle that? If you do, you have a really, really deep network, and you get the problem of the vanishing gradient. What does that mean? You go one epoch: you send your entire data through the network, you calculate the error and say, okay, the error is 156. The first few layers: your contribution is 66, I correct you, 90 remains. Next layer: 55, I correct you guys, what remains is 35. Next layer I have to correct 42, but I only have 35. When I make that correction, at the next layer I have nothing left to correct. Nothing to correct.

What does that mean? It means the layers close to the input stay untouched. They are frozen; you are not learning with them. So, if out of your 200 layers, let's say 198 do not learn and only two layers learn, is that a big deal? Well, it will come back and bite us at some point.
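
A tiny numerical illustration of why the gradient vanishes with depth, assuming sigmoid-like units whose local derivative is at most 0.25:

```python
# Backpropagation multiplies local derivatives layer by layer.
# With sigmoid units, each local derivative is at most 0.25, so the
# error signal reaching the early layers shrinks geometrically.
sigmoid_derivative_max = 0.25
for depth in (3, 5, 10, 50, 200):
    surviving_gradient = sigmoid_derivative_max ** depth
    print(depth, surviving_gradient)
# Around depth 50 the signal is ~1e-30: the layers near the input stay effectively frozen.
```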

Lack of generalization, hallucination, memory distortion, completely crazy responses, and definitely some openings for adversarial attacks, because you know this part is weak: I can attack it here. That's the cybersecurity of the next two or three decades. So, the vanishing gradient is a problem, and we needed a core concept, a core idea, to remove it. That was the reason LSTMs, long short-term memory networks, and recurrent neural networks were not working: you made them deep to be able to handle long sequences. If I give you a novel of two volumes, 700 pages, do you understand it? Wow, that's a lot. Well, give me RNA sequencing; that's even more difficult.

How can we do this? It was difficult because the sequential processing that was needed did not lend itself to parallel processing. Scaling was not possible. We could not grasp the context of the input. The context is important. Are you surprised and impressed when you talk to ChatGPT and you follow up, follow up, and it still understands what you are saying, gets the context? How do we do that? How do they do that? Attention.

So, the last core idea that I want to talk about is attention. If I look at a sentence that says, "I would like to understand the AI applications in pathology", the attention most likely should be on 'understand', 'AI', and 'pathology'. What are the key words here, such that if I get those words, I got it? And attention, here, is where some colleagues disagree with me: it is a misnomer that we are using. This is not attention. The way that it is done in the Transformer, it's not attention, it's correlation.

Well, okay. If you are delivering really good results, you can get away with a misnomer. Naive Bayes is one of them. Fuzzy logic is one of them. If you are successful at solving the problem, okay, you get the privilege of a misnomer, nobody cares. It's working. But it is a misnomer, and from a research perspective that can cause some issues.

So, we have different types of attention: Scaled Dot-Product Attention and Multi-Head Attention. The idea was: we need to figure out what is important, in a scalable way, so that we can go really deep. What is it? And here something happened that is fantastic, but is causing a problem that, again, will come back and bite us. And that is: at some point, we have to drop Transformers. I'm sure many colleagues will disagree with me. At some point we have to drop them, because they deviated from a major principle that we maintained back in the 70s (1970s) and 80s (1980s).

So, what we call Attention started in 2017. You get a query and you get some keys; the keys represent the potential information offered by each word in the sentence. Of course, it started with natural language processing, that was the beginning of Transformers: understand language, not images, to begin with. And then values: values contain the actual content associated with each word. What is the value of each word? What gets the attention? And then you need some normalization factor, and so on and so on.

So, attention calculation involves matching the query with the keys. Now we are talking information retrieval language. But isn't this an artificial neural network? Why are we talking retrieval language? That's the language of the 70s and 80s. Keys, query, value: that's hash tables. That's Data Structures 101 in computer science. Why are we bringing this into artificial neural networks? Because we want to selectively pull information from the values. So, we are creating a lookup table. But it's not a deterministic lookup table, it's not a hard lookup table; it's a soft, quasi-stochastic lookup table. And the stochasticity should give you diversity, not wrong answers.
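As a rough illustration of that soft lookup table, here is a minimal numpy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d)) V; the matrices are random toy values, not from any real model.

```python
# A minimal sketch of scaled dot-product attention, the "soft, quasi-stochastic
# lookup table" described above. Shapes and values are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # match the query against every key
    weights = softmax(scores, axis=-1)   # soft lookup: each row sums to 1
    return weights @ V, weights          # pull information from the values

# Toy example: 4 tokens, embedding size 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(w.round(2))  # each row shows how much one token attends to the others
```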

How do we do that? With just that one equation? No, with some topology. Input comes, we encode it, we get some features, we decode it, we get some output. Okay, this is new. What? The way that we want to define it is new. Let's say I want to translate from German to English: "Ich bin ein Student" to "I am a student". Text translation has been a major task for AI. Object recognition, text translation, summarization: these were a dream. Done. You can summarize any text.

Not perfectly, maybe, but we can. We can summarize text. Of course, you can multiply this: the input goes through many encoders and many decoders when the problem is difficult. So, input comes, goes to an encoder, I get features, I put them into the decoder, n times each: n encoders, n decoders. Why? Divide and Conquer. The problem is difficult. The principle has not changed; we are just finding another manifestation, doing it in a different way. What we have in mind is to understand long sequences. That's the motivation, which could also be the bias.

That's the motivation. We want to understand. Recurrent networks did not work. LSTMs did not work. CNNs cannot do this anyway. So, we need something to process long sequences of data over time. How should we do this so that we do not have the vanishing gradient? We have to go deep. We have to go really deep. How do we go deep without getting the vanishing gradient?

Okay. Input comes in. Encoding. Encoding. Encoding. Encoding. Features. Decoding. Decoding. Decoding. Output. Okay, good, I'm still with you. I don't know where you will lose me. Okay, maybe here you lose me. What's going on here? So, I get the input and I come up with something new, positional encoding, because we are focusing on human language.

Position matters in language. "I am a student from Germany." You cannot say "Germany am I a student." It doesn't make sense. So, positions matter. Language has a structure. We have syntax. We have semantics. Okay, so we need some positional encoding before we encode. For positional encoding we can use sine and cosine. Sine and cosine? The first time I saw this I was surprised. Wait, wait, wait. We went away from conventional mathematics, and now we are coming back to trigonometry?

So, if I look at a document that says "Ich bin ein Student" and I want to translate it, the first thing is that I have to positionally encode this information. I give every token, let's say every word here, an index: 0, 1, 2, 3. I have four tokens, and I can now create positional encodings using some sine and cosine functions, and then I get some values. And of course, with sine and cosine you also get positive and negative values. For example, if I look at 'Student' I have P(3,1); compared with 'bin' ('am'), the value is minus 0.99. What is that supposed to mean?

A negative positional encoding. Well, in this context it could mean that 'bin' comes before 'Student'. That's all it says. In a different context, it may mean something else. Positional encoding is a key idea in Transformers. Sometimes we suspend it, because positions do not matter. In my case, in tissue images, positions do not matter. In natural scenes, position matters: the sky is here, the grass is here. Sometimes it matters, sometimes it doesn't.
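For the sine-and-cosine idea above, here is a minimal sketch of sinusoidal positional encoding, assuming the standard formulation from the 2017 paper; the four tokens are just the example sentence used above, and the exact values on the speaker's slide may differ.

```python
# A minimal sketch of sinusoidal positional encoding: even dimensions use sine,
# odd dimensions use cosine, so values lie in [-1, 1] and can be negative.
import numpy as np

def positional_encoding(num_positions, d_model):
    pos = np.arange(num_positions)[:, None]   # token indices 0, 1, 2, 3, ...
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions: cosine
    return pe

tokens = ["Ich", "bin", "ein", "Student"]
pe = positional_encoding(len(tokens), d_model=8)
for t, row in zip(tokens, pe):
    print(f"{t:8s}", row.round(2))
```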

So, now, the output side. The decoder part: the output goes through the decoder and then a linear layer, basically a multi-layer perceptron, and a softmax, and we get the output probabilities. We put the shifted output back into the system. What? Do you remember the recurrent network? We said the neuron has a connection to itself because you need memory in the decision process: I need to know what happened before. For that, we shift the output and bring it back, put it back as an input, for autoregressive generation or sequence-to-sequence tasks, one token at a time. So, then you have the complete picture.
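A minimal sketch of that shift-and-feed-back loop, i.e. greedy autoregressive generation one token at a time; decode_step below is a hypothetical stand-in for the decoder plus linear layer plus softmax, not a real model.

```python
# A minimal sketch of autoregressive generation: the (shifted) output is fed
# back as input, one token at a time, until an end token is produced.
import numpy as np

VOCAB = ["<start>", "I", "am", "a", "student", "<end>"]

def decode_step(prefix_ids, rng):
    # Placeholder for decoder + linear layer + softmax: returns a probability
    # distribution over the vocabulary given the tokens generated so far.
    logits = rng.normal(size=len(VOCAB))
    return np.exp(logits) / np.exp(logits).sum()

rng = np.random.default_rng(3)
generated = [0]                      # start with the <start> token
for _ in range(10):
    probs = decode_step(generated, rng)
    next_id = int(np.argmax(probs))  # greedy choice of the next token
    generated.append(next_id)        # shift the output back in as input
    if VOCAB[next_id] == "<end>":
        break
print(" ".join(VOCAB[i] for i in generated))
```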

Transformers are very difficult to understand. Very difficult. So, then you have the complete picture. That's one head. I can have multiple heads, and then I can get context. A word can mean something in a poem, something else in a scientific article, something else in a colloquial conversation. For each of them you need a head. But that's fine, because this structure you can parallelize: you can send each head to one GPU and they can learn independently.

Challenges of Transformers.

We are almost done. Computational complexity. They are data hungry. They have huge memory consumption. It's becoming monopolized: go try to see the CEO of Nvidia; you will see prime ministers and presidents of countries waiting to talk to him, because it has become a question of national interest to have GPU power. You cannot train Transformers for a reasonable task if you don't have the infrastructure. You still cannot interpret them; they are still black boxes. And scalability and deployment, energy consumption, is a big problem right now. Nobody is caring about that because we want to generate results.

Where Are We?

Where are we? And then we are done. We have a lot of success with generative AI: translation, summarization. ChatGPT is fundamentally a talking encyclopedia. I use it all the time, but I still double-check and fact-check everything with my physical Encyclopedia Britannica. But it will be perfected, and it will be a talking video. We can use it for many different purposes. Fantastic. But I'm not sure we are out of the Chinese room. I think the room still does not understand Chinese. The Transformers have not opened the door for us to go out. Do we understand language? I think we are doing exhaustive correlation analysis.

With the efficiencies that we have right now with the Transformer, we went away from the original AI ambition. That's a huge problem. We went back to manual design. Look at the structure of a Transformer: okay, the input comes here, and then I do this, and then I do the linear layer, and then I do normalization. That's a manual design. I thought we wanted to go away from that?

CNNs are much, much closer to the way the human brain works. Autoencoders are much closer to what the human brain partly does. Transformers are not. What is the implication of that? You will not be able to imitate aspects of human intelligence with them. And they are still inaccurate in medicine, very inaccurate.

If you look at the left one, the black one, that's the way the human brain should be modeled; again, a Boltzmann machine. In the middle one, if you have a subnetwork, the blue one, that is working, doing something, it should be embedded within the gray one, which is the bigger brain. We don't have structures like that. And then look at the right one: what is the Transformer? Is this an imitation of that? No. But it's doing fantastic things. Okay, I'm using it. But I'm not losing sight of the picture that we want to go toward the graph one, the kind of structure about which Euler said, okay, this is what we can do. And they are very inaccurate. So, recently, four or five months ago, these started coming out.

Foundation models for histopathology, the gold standard in cancer diagnosis, were published. We tested them, we tested all of them. Accuracy was less than 44%. Whatever the network, whatever the technology, it has to be reliable. At the moment we are not there. So, thank you very much for your attention.

Foundation Models and Information Retrieval in Digital Pathology

This talk explains how foundation models — deep neural networks trained on massive datasets — can handle a wide variety of tasks. It also highlights the benefits of retrieval-augmented generation, particularly in domains such as pathology, where combining evidence retrieval with generation improves reliability.

Presenter: Our next speaker is Dr. Hamid Tizhoosh from Mayo Clinic. He is a computational biologist and a distinguished professor of computational biology. He has a long history of significant contributions to Machine Vision in histopathology at Toronto and Waterloo and is continuing that tradition of contributing to this field. And, I'll just let him introduce his topic, which is very, very timely and I think will be of interest to many of you. Welcome.

H.R. Tizhoosh, Ph.D., Professor of Biomedical Informatics, KIMIA Lab, Mayo Clinic: Thank you very much. So, let's see if we can get started. Okay, thank you very much for the opportunity. If I'm successful, I will try to convey what we have learned in the past one and a half to two years, looking at both foundation models and information retrieval in digital pathology. The big question that we always start with is: why do we need AI? Do we really need AI? And what is it?

I know we do identification, segmentation, all those things, but what is the reason that we need AI? To us, it is to solve variability. If AI cannot solve variability, I don't know what we are doing with AI. All those segmentations, identifications, search, conversations, anything: if it cannot help us to reduce or eliminate variability, I don't know why we are dealing with AI. Of course, in histopathology, and in medical imaging in general, we know that we have variability, and this is a scary issue.

Even intraobserver variability is a scary issue: you can ask again after a while, and the agreement of the pathologist with himself, or herself in this case, is 76%. A well-known problem, because we are dealing with the microscopic world, so it is tough. You can measure it, for example with Krippendorff's alpha. However you measure it, for Krippendorff's alpha to indicate consensus you need at least 66%, and all of it is below 66%. That means there is no consensus. So, what does it mean if there is no consensus? Of course, pathology is not alone.

Variability is also pervasive everywhere in radiology. So, can AI remove observer variability? That's the big question. Can it help us to remove variability? We think there are three ways, and two of them we have talked about in the past. Three ways that AI could help us, but it is up to us, up to the pathologist, to accept it or not. Either you classify the image, and the classification is yes or no, stage, grading, likelihood. What does that mean? When would classification solve variability? When many physicians accept what the machine is saying. That's not going to happen.

The next one is search. You search for similar patients: biology, histology, background, everything, and you bring back the metadata, the reports, treatment, plan, outcome, everything. What are you expecting here? How would AI remove variability? The physician has to accept what many other physicians have said. If you retrieve correctly, that's more likely to happen. I would rather trust my senior colleagues, the colleagues that I know; I know their judgment. But the question is, can you really retrieve the right tissue? That's a big question. If you can do that, we may have a chance.

The third possibility is generative AI. You are conversing with a machine, and the machine is answering your questions. What are we expecting here? If all of the pathologists trust conversational AI, then variability is solved. That means the physician must trust generative AI. That's a big problem right now; we have a lot of reasons to mistrust generative AI.

So, when would AI be trustable, so that we can hope we are going in the direction of removing variability? For all of them, classification, search, generative AI: they have to show high accuracy and no bias. That's not the case yet. Classification and search also have to show high generalization; generative AI can generalize pretty well if you have enough data. Classification needs to be explainable, which it is not. I'm sorry, the activation maps, I don't get it, I don't understand. I'm a computer scientist and I don't know what that means.

"Okay, look here, that's what I'm saying." It's okay, but you are not explaining it. This is not an explanation. But search and generative AI can explain: search explains with evidence, generative AI with a smart, knowledgeable explanation. And the last one is source attribution, which only search can do. Only retrieval can give you a source for what it says. Classification cannot do it. Generative AI cannot do it. And if you don't have a source, how should I trust you? Because I'm the one, as a pathologist, who signs and says, okay, this is it, I'm taking responsibility. Generative AI, additionally, should show no distortion, no hallucination, and so on. A lot of problems.

So, what are the fundamental limitations? If you go through them:

Classification: it's supervised, it gives you high accuracy, but it's difficult to explain; lack of generalization, needs large labeled data.

Search is unsupervised. It's agnostic to organ and disease. It operates on small datasets: you can have as few as 50 patients and start using search and retrieval, and it can explain via evidence from the cases of the past. But it has low accuracy, needs expressive embeddings (good networks), and needs extra storage for indexing, which has been widely ignored.

Generative AI is widely self-supervised. Its strength is conversation, which impresses everybody, but it needs a very large amount of data. By the way, none of us is there. It seems in healthcare we are not there; AI is way ahead of us. We don't have the data, and I'm talking about large archives. Where is the large archive? I don't know any large archive at any major hospital that is readily available and accessible to do these things. We don't have them yet. And completely wrong answers are possible. Then you're disillusioned: oh my God, generative AI just told me something stupid.

And expensive maintenance. Maintenance of generative AI is not for small clinics, so it will be monopolized by companies or by large research hospitals. It's something to think about. Learning needs annotations; then there is search, and there is guessing. Generative AI is guessing. Yes, it is guessing; most of the time, it guesses.

So, data: medium to large for learning, small to large for search, colossal for generative AI, for guessing, because you want to guess properly; you need a lot of data to learn how to guess. Task: classification, detection, prediction for learning. Search is just finding the evidence; it just goes after evidence: where is the evidence? I try to retrieve it and bring it back. And guessing is generation and conversation. Issues: underfitting, overfitting, lack of explainability, lack of generalization, and bias for learning; for finding, again, low accuracy, bias, and the need for external storage.

I'm repeating this again and again, from different perspectives. For guessing: distortion, hallucination, no source attribution, not explainable, and biased too. AI cannot solve the problem of bias. Human society is biased. We are racist; AI will be racist. Very simple. There is no moral high ground that we can put in charge of AI. That's not going to happen.

So, foundation models. Basically, with foundation models we have large models, and then we put things in them. It seems we found a trick to not talk about overfitting. You say, okay, my model is overfitting; I make it so big that you cannot even verify that it is overfitting, and now overfitting has become hallucination. So we have a new name for overfitting. It seems that at something above roughly 300 million parameters, we call it a foundation model. That's it, basically: it's the size, and the bigger you get, the more data you need to fill it. And then you can do something, and there is some value in that.

So, foundation models are general-purpose deep models that can handle a broad variety of tasks. They are trained on massive data. They are generally self-supervised, so they have generalized skills, which is very valuable; we have to figure out how to use them, and they can be fine-tuned for specific tasks. Again: if you have too little data, you will underfit; if you have the right amount of data, you will fit; if you have too much data and your network is small, you will overfit. And now we go to extremely large networks and massive datasets, and we hallucinate instead of overfitting. The model just gets lost sometimes in the trajectories of those exhaustive correlation machines that we call Transformers.

It's explainable why we hallucinate, very simple. So we put some effort into that and said, look, if you want to replicate what CLIP does with 400 million image-caption pairs in histopathology, how many images do we need? Roughly, it's 1%: we need at least 1% of that data to hit the industrial norm. That's 5 million whole-slide images. Even Thomas Fuchs's group, which has what is basically, at the moment, the best thing (hopefully it somehow becomes accessible to everybody), directly or indirectly has about 1 million. So, 5 million minimum to reach that level of function, such that we get good embeddings, which means good representations.

If you have a foundation model properly trained, and I don't know exactly, is it 5 million? Is it 4 and a half million? Is it 6 million? It's not 200,000 or 300,000 whole-slide images; it's much more than that. If you have it, you have fantastic tissue representations, and maybe we don't need annotations anymore. So, when is a foundation model a foundation model? It has to do zero-shot learning and it has to have quality embeddings. If your foundation model can be beaten by a regular CNN that was trained on TCGA, it is not a foundation model. You just have yet another fine-tuned model, and you just wasted a lot of storage and GPU hours. It's not a foundation model if it cannot do zero-shot, and if the quality of its embeddings is not expressive enough.

So, this paper, for example, reported 81% accuracy for image search. We said, fantastic, we need this type of thing, let's test it. We tested it, and that model was beaten by a CNN. Again, the methodology is great, the pathology is great, the team is great; where is the problem? Data. Garbage in, garbage out. Twitter data. Of course: 200,000 images from Twitter, and some of them are screenshots; you don't even know at what magnification. Not high-quality data.

A CNN trained on TCGA will beat you, a CNN that is not a foundation model. So it's not only about the size, it's also about the quality of the data. Of course we know that, yes, we know that. And that CNN is an ordinary CNN, not a foundation model: not 200,000 images from Twitter, but 240,000 patches from TCGA, from multiple hospitals, largely high-quality data. Yes, there is a lot of debris and ink marking and things like that in TCGA, but fundamentally it is good data; it's coming from hospitals directly. So we looked at it and said, look, that's fantastic work, but guys, we need more quality data. We cannot work with online data. I'm sorry, I do not trust things that are being done with online data, with internet data. We have to get more serious.

Again, I mentioned Thomas Fuchs's group. This is the type of work that has to be done, and one of the challenges is how we then make it available, because we are using hospital data and hospitals get jittery: okay, that's our data; do we want to commercialize it? What about patient privacy and all that? But that's the way to go, fundamentally. So, it's not about the smart group, the fantastic model, the good technology; it doesn't work when it is using bad, low-quality data.

So, what is happening is in-context learning. That's fundamental. That's one of the things that we learned in the past year or so. Unlike traditional training, in-context learning does not update the weights of the network to learn something new. It uses the existing model and the given context to infer something new. That's very impressive. The learning occurs when you provide a prompt and then several examples of input/output, and then the network does something. What is happening? So, there is:

Few-Shot learning: the prompt includes a few examples.

One-Shot learning: you provide one example, something that the network has never seen; that requires a really good foundation model.

Zero-Shot learning: you don't provide any example, but you provide instructions for how to do it, indirectly, like reinforcement learning. So, if this is your foundation model, and we looked at some of the examples, practically, to see how it works: if this is your foundation model and you train it, so now it's trained, and then a prompt comes... First prompt, something happens. Second prompt. Third prompt. Fourth prompt. For every new prompt, if the network is large enough and has been properly trained, it develops sub-networks inside itself. It's like magic.

There is a certain threshold that you hit, and then, during training, it creates those circuits, and the prompts sort of activate and amplify them. This is how we know, from really smart people; you even have the theory for that, if you take a look at that paper: a fantastic mathematical framework that shows clues that this is what is happening. That's the way things get generalized.
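As a small illustration of the difference between these prompting regimes, here is a sketch that only builds the prompts; the instruction, examples, and query are invented, and no weights are updated anywhere, which is the whole point of in-context learning.

```python
# A minimal sketch of zero-, one-, and few-shot prompts: the model never
# changes, only the context it is given changes.
instruction = "Classify the tissue description as benign or malignant."

examples = [
    ("Uniform ducts, no atypia.", "benign"),
    ("Sheets of pleomorphic cells with frequent mitoses.", "malignant"),
]
query = "Small uniform glands with bland nuclei."

zero_shot = f"{instruction}\nInput: {query}\nAnswer:"

one_shot = (
    f"{instruction}\n"
    f"Input: {examples[0][0]}\nAnswer: {examples[0][1]}\n"
    f"Input: {query}\nAnswer:"
)

few_shot = instruction + "\n" + "".join(
    f"Input: {x}\nAnswer: {y}\n" for x, y in examples
) + f"Input: {query}\nAnswer:"

for name, prompt in [("zero-shot", zero_shot), ("one-shot", one_shot), ("few-shot", few_shot)]:
    print(f"--- {name} ---\n{prompt}\n")
```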

Well, so, foundation models. Yes, we want to deal with the hallucination problem because they have such potential, but they need high-quality clinical data. And there are other issues. The other problem is that critical threshold, and that's what I meant when I mentioned the five million whole-slide images. There is a critical threshold at which the network is gigantic enough that it can create those sub-networks inside itself, and it develops, so to speak, its own Divide and Conquer. But we are at the moment where some techniques are running out of data, among other things. If you continuously scale the network size indefinitely, at some point you will not have enough data. Like English literature: we will not have enough English literature anymore, this year, to train on. Everything has been absorbed.

We don't have that problem in pathology. We have not even started to use anything. So, not going to happen for us. Yes, Shakespeare and William Faulkner are all digested, but histopathology, not yet. So, searching is intelligence. Go back to the history of AI: it started with search. Logic-based search in the 50s, the General Problem Solver (GPS), which was one of the reasons an AI winter started, because people promised too much and could not deliver. The A* search algorithm for robot navigation. The beginning of AI was pure search and retrieval. Playing games, from backgammon, to tic-tac-toe, to chess: everything was alpha-beta pruning back then, in the 50s (1950s) and 60s (1960s).

So, the history of AI is connected to search and retrieval, and now, strangely, it's coming back. We had expert systems, which completely disappointed everybody because they were not working. They were not working because they used simple, very static rules: if this happens and this happens and this happens, it must be this disease. Not flexible enough. Now we have a renaissance of information retrieval, which is fantastic for hospitals, because now you can exploit your data, you will keep your data, nobody will digest it, and they will need you. Companies will need hospitals, because they have to verify the hallucinations of their networks against clinical data.

Retrieval-Augmented Generation (RAG)

Going from guessing to finding. If you have a large language model trained on massive data, factual accuracy is a problem, and staying up to date is very difficult for large models. But we have external knowledge sources: we have PACS, we have the tissue registry, we have the lab information system, we have a lot of data. For general purposes they use Wikipedia and PubMed, but RAG can help large models become more manageable and more trustable.

You prompt a RAG platform: a large model that also retrieves information on the side, as a separate component. You retrieve the knowledge and you augment your prompt with the retrieved information. So, large language models can become factual. It helps prevent hallucination, and obsolete information can be easily removed, unlike from those sub-networks that magically appear inside large models.
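A minimal sketch of that retrieve-then-augment flow; embed() and the document snippets below are hypothetical stand-ins, and a real system would use a proper text encoder and a large model to answer the augmented prompt.

```python
# A minimal sketch of retrieval-augmented generation: retrieve the most
# similar documents, then augment the prompt with them before generation.
import numpy as np

documents = [
    "Report 1021: invasive ductal carcinoma, grade 2.",
    "Report 2044: benign fibroadenoma, no atypia.",
    "Report 3310: lobular carcinoma in situ.",
]

def embed(text):
    # Stand-in for a real text encoder: a deterministic pseudo-embedding.
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    return local.normal(size=64)

doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query, k=2):
    q = embed(query)
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return [documents[i] for i in top]

question = "What diagnoses appear in similar breast cases?"
context = "\n".join(retrieve(question))
prompt = f"Use only the retrieved reports below.\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # this augmented prompt is what the large model would receive
```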

Retrieval-Augmented Generation (RAG) is the foreseeable future. It's what we have to do, especially in medicine, to make things more reliable: enhanced accuracy, increased reliability and trust, more transparency through source attribution. Ask ChatGPT, "Where did you learn this?" It cannot answer you. But now it can go and search Wikipedia and say, "Oh, I saw it on Wikipedia." Oh, you don't say. So, how do we do that in medicine? Because the archives will not be on ChatGPT's servers; they will be in the hospital. So we have to build our own foundation model. How do we do that? We have to be able to search. I will go fast through this.

Searching in Medicine

We need high-quality data. We need diverse populations. We need multimodal data. We need fast search algorithms. We need low demand for indexing storage; storage is very expensive at the moment. It has to be robust and should not fail frequently. And it has to be user friendly; people should be able to use it easily.

Look at the gigantic whole-slide images: we patch them, we divide them, we push them through some network, which ideally should be a foundation model, such that those embeddings, those features, are really good. And then we combine them, we make them small, such that they don't need a lot of storage. It cannot be random, it cannot be sub-setting, and we cannot process all patches, because that's expensive; it's an additional obstacle in deploying AI. It has to be universal, applicable to all types of tissue, and diagnostically inclusive. You cannot miss anything.
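A minimal sketch of that patch, embed, and combine pipeline, under the assumption of a toy image and a stand-in feature extractor; a real system would use a strong (ideally foundation-model) encoder and a smarter combination step.

```python
# A minimal sketch of indexing a whole-slide image: Divide it into patches,
# embed each patch, and Combine the patch embeddings into a compact entry.
import numpy as np

rng = np.random.default_rng(1)
wsi = rng.random((2048, 2048, 3))          # toy "whole-slide image"
patch = 512

def embed_patch(tile):
    # Stand-in for a real feature extractor: a few summary statistics.
    return np.array([tile.mean(), tile.std(), tile[..., 0].mean(), tile[..., 1].mean()])

features = []
for y in range(0, wsi.shape[0], patch):     # Divide: tile the slide
    for x in range(0, wsi.shape[1], patch):
        features.append(embed_patch(wsi[y:y + patch, x:x + patch]))
features = np.stack(features)               # Conquer: one vector per patch

# Combine: keep a small representation so storage stays manageable
# (here, simple percentiles across patches as a placeholder).
index_entry = np.percentile(features, [10, 50, 90], axis=0).ravel()
print(features.shape, "->", index_entry.shape)
```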

You see, I'm just counting how difficult it is to do search, because you need search to build trustable foundation models and large models. So you go there, and you come back here: you get to the nuts and bolts of the problem. It has to be really fast, and if you really want to do large scale, 5 million and more, then storage has to be efficient. I know that for academics this is not attractive research; they don't care. They only talk about speed and accuracy, speed and accuracy, and you are not even accurate.

But storage is very important for deploying AI. So, you get your images, you patch them, you put them through a good network, you index them. Now we have our atlas of evidence. Now we can connect it through Retrieval-Augmented Generation (RAG) to large models. You see where I'm going with this? I want to really deploy large models, but I cannot, because they are not trustable, so I need information retrieval. I go back and ask, do we have it? No, we don't have it. Oh, okay. So we have to solve this first.

If I send an image, we find similar cases, we have metadata, and the pathologist who asks gets information that comes from other pathologists. That means, if I find the top three, top n cases that are similar to my patient, I can put the pathology reports of those into a large model and generate a consensus report. Without retrieval, this is not possible. And summarization: we are using summarization at Mayo, as are many colleagues I know, because summarization is a safe thing. No distortion, no hallucination. Just take this document and summarize it for me; don't do anything fancy with it, just summarize it. It's one of the safe things we can do with large language models.

If you can connect that to retrieval and search, then the evidence of all those cases of past years, sitting under digital dust, becomes accessible. A fantastic source of information. Okay, so what about this or that? Both of them; we want both. Information retrieval works with small data and small networks, but foundation models are very large. The strength is that information retrieval convinces you through evidence, while the foundation model convinces you through knowledgeable conversation. Very different, and they complement each other. Fantastic, as if somebody had set it up.

Information retrieval can do any disease, rare or common, whereas foundation models can only do common cases, because we need millions of cases. Rare, complex cases are not a case for foundation models; you need retrieval for that. Interestingly, information retrieval is explicit. Foundation models do the same thing, but in an implicit way; it's a different way of processing the data. Information retrieval is visible, accessible, explainable. Foundation models are invisible, not accessible, and not easily explainable, unless they can do source attribution.

And information retrieval has low hardware dependency; it is easy to add and delete cases and easy to update the model, whereas updating hardware and maintaining a foundation model is very difficult. So, this is something that we thought about at [unintelligible]. We sat down and talked about it. So what is it? Where are we? They are fundamentally the same thing.

Fundamentally. It's very interesting. If I send a lung tissue image for conventional search, good old-fashioned search, not even AI, there is a function that calculates something, finds the answer for you, and says: this is lung adenocarcinoma. That's conventional information retrieval. That means search and retrieval works via a hard lookup table: you have a lookup table and you look into it, with features. Now, with networks, I can give the same lung tissue and ask, what is this? And there is a path, a function embedded inside the network, that gives me the same answer.

Which means the network is a soft lookup table. Fundamentally the same thing: implicit versus explicit information retrieval. But when you do it implicitly, you lose the source; that's the problem. So we have to combine retrieval with foundation models. You have to create an atlas to do that, and you have to index your data. Multimodal data is very important; we need that. Your archive, your atlas, your retrieval system has to be inclusive, has to be free from variability, and has to have semantic equivalence, conforming to the anatomic and biologic nature of the disease. Then you can have multimodal data, whole-slide images, molecular data, clinical data, and do association learning, such that you can retrieve it and show it to the pathologist to include in the final report.
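A small sketch of the contrast being drawn here, assuming toy embeddings: a hard lookup table returns an exact, attributable entry, while a soft lookup blends stored values by similarity and loses the source.

```python
# A minimal sketch of hard versus soft lookup.
import numpy as np

# Hard lookup: exact key -> value, with an explicit, inspectable source.
hard_table = {"case_001": "lung adenocarcinoma", "case_002": "benign"}
print(hard_table["case_001"])

# Soft lookup: a query embedding is compared against stored embeddings and
# the answer is a weighted blend, with no explicit source attribution.
keys = np.array([[1.0, 0.0], [0.0, 1.0]])          # stored case embeddings
values = np.array(["lung adenocarcinoma", "benign"])
query = np.array([0.9, 0.2])

scores = keys @ query
weights = np.exp(scores) / np.exp(scores).sum()     # softmax over similarities
print(dict(zip(values, weights.round(2))))          # soft, source-less answer
```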

So, in our last experiments we did Harmonizing Self-Distillation with NO labels (H-DINO), where we do the harmonization as sampling of the same tissue that is relevant for us. And we came up with a topology that is a foundation model, but we didn't have enough high-quality data accessible to make it a foundation model, so we didn't call it one. But, basically, you can give an image to it and it will describe it for you, or you can give a description and it retrieves for you. So you can go from image to text and from text to image. This has been done before in a different way; we did the cross-modal version. Training was very difficult because it has to learn the report, the description, and the image at the same time.

So, we looked at 22 distinct primary diagnoses for breast cancer. Not just [unintelligible]; also strange ones, rare ones: adenosis, [unintelligible], things that you don't usually see in the papers. And we look at the recall at five, both at the patch and the WSI level. We included descriptions that had not been seen during training. Very promising, very good. And then we say, okay, if I describe it, can you go and bring me the top three? And the expectation is that the diagnosis should be this and that; very successful. But we did not call it a foundation model. I would rather undersell it, because I don't want it to be tested as a foundation model; I want it to be tested as a regular model. And then, when we have enough data, 5 million, then we do that.

So, we need high-quality clinical data, foundation models need to be meticulously tested, and information retrieval can definitely make foundation models more trustable. Multimodal search is needed; we don't have it, we are working on it. We should pool all our intellectual energy to go in that direction. And many colleagues have helped, such that I can stand here and talk to you.

Thank you so much, I appreciate it.

Presenter: Thank you very much. We have time for some questions.

QA with the Audience:

Audience Member 1: Hello. Great follow-up on the talk that you gave yesterday. When you talk about what you need, are you looking for a repository with five million slides, essentially?

Dr. Tizhoosh, Ph.D.: We have it. We have it. We are working on it. We want to do things properly and at Mayo we work a little bit slower, but we test everything we do, twice, before we go to the next step. We have data. We are, we are putting things in place to do it. It will take some time.

Audience Member 1: So, do you use the general concept of disease? Like, do you need five million slides on cancer, do you need five million slides on prostate cancer, or on something more specific, like DCIS in breast cancer?

Dr. Tizhoosh, Ph.D.: It depends how you design it. You can design it to be independent of the diagnosis; you just need it to be multimodal. And then the way the topology is laid out will determine what type of data you can use, because the main thing is the self-supervision: what is the task that should be learned in a self-supervised manner? We have a design, but we have not done any initial tests yet, because we are waiting for our data to be reconciled in one place, and then we can start.

Audience Member 1: Okay, thank you.

Audience Member 2: That was very interesting. I have a question, which I think you discussed, but I didn't get it, so please excuse me. I'm wondering: can you use the output from the information retrieval as prompts in the foundation model?

Dr. Tizhoosh, Ph.D.: Absolutely, yeah.

Audience Member 2: Is that what you're doing?

Dr. Tizhoosh, Ph.D.: It will force the foundation model to take another trajectory, which, statistically, turns out to be more trustable.

Audience Member 2: I see, okay. So, that's how you integrate the two?

Dr. Tizhoosh, Ph.D.: That's one way of doing it; that's what we call augmented. Or you can use it to prune or edit the response; people do that too. There was a good paper from DeepMind people exploring that for non-medical cases.

Audience Member 2: And then one more question. So, we think that foundation models trained on histologic images are better than those trained on, let's say, ImageNet or other unrelated images. Do you think that that is actually the case, or could we perhaps generate larger image repositories by combining images from different domains?

Dr. Tizhoosh, Ph.D.: I think it is possible. The question is, again, when is a foundation model a foundation model? Nobody is an expert in foundation models. Nobody can be an expert in foundation models; how can you be an expert in something that is two years old? You may, of course, be an expert in artificial neural networks; I would cautiously consider myself an expert in artificial neural networks, but this just got started, we don't know. But if a foundation model is really a foundation model, if those sub-networks have been formed, if you have crossed the critical scaling threshold for the data, then yes, absolutely, we can do that.

Audience Member 3: Just a final, quick question, hopefully a quick question. If you train on multimodal data, in addition to the imaging data, if it's available, does that enrich the cross-linking and utility of the foundation model, so that you don't need as many cases?

Dr. Tizhoosh, Ph.D.: It depends on the topology. The topology has to be laid out for that, and then you will need a multi-loss function, which will make the training more difficult. But yes, it is possible.

Audience Member 3: Okay, great.

Presenter: Thank you again.

[Applause]

Image Search: Past, Present and the Path Forward

The lecture traces three decades of research on image search in histopathology, moving from handcrafted features and barcodes to deep learning and divide-conquer-combine methods. It highlights challenges such as indexing whole-slide images, patch-based retrieval, storage and validation. Despite significant technical progress with systems such as Yottixel, clinical deployment remains limited. The talk emphasizes the need for robust validation and practical integration to truly impact pathology practice.

Image Search in Histopathology - The Past, Present, and the Path Forward

Presenter: Welcome back. Good afternoon, welcome back. We are continuing with Dr. Hamid Tizhoosh. He is going to talk about Image Search in Histopathology - The Past, Present and the Path Forward.

H.R. Tizhoosh, Ph.D., Professor of Biomedical Informatics, KIMIA Lab, Mayo Clinic: Thank you very much. This is not going to be easy, but we will try. We will try to cover the research of the past 30 years, basically. A lot has been done, but, strangely, and maybe you will understand at the end why, after 30 years there is still no image search in any hospital: systematically, correctly, comprehensively, as a part of the systems that we use. Which is very strange, but we will see why.

So, the idea of search is actually quite simple: find a similar patient, similar tissue, similar sequences, and then you can infer the same diagnosis, the same prognosis. But to implement this simple idea you have to take it seriously. Part of the validation that we did made us understand what it means to be serious. It doesn't mean that you have a serious face; it means you have to do things in a certain way.

So, Information Retrieval is obtaining relevant information from an archive, and that includes searching, accessing, and retrieving specific pieces of information that the user, in our case the pathologist, is interested in. And Information Retrieval always involves indexing, which means creating some tags, some identifiers, that make the information understandable to the computer. So, indexing is not for the user, it's not for the pathologist; it's for the computer.

So, of course, Information Retrieval (IR) is very old, and it's important. Digital libraries, e-commerce, search, healthcare, academic research, and many other fields use Information Retrieval. The context is important, user preferences are important, and, of course, the quality of what you retrieve is important, which is one of the challenges of Information Retrieval in general and image search in particular.

So, Content-based Image Retrieval (CBIR) is one type of Information Retrieval that searches for images. I usually just say image search. Technically this is wrong, you have to say Content-based Image Retrieval, but I just say image search because using keywords and metadata is not going to help us much; it is very limited at best. So, the search process uses some features from the image: colors, texture, shape, spatial layout, how many cells you have, what shapes they have, the density of cells, all that.

It's the features, the characteristics, that you use to index and search for images. So, how does CBIR, Content-based Image Retrieval, or image search, work? You extract features, as we said: color, texture, shapes, and recently deep features, to index the images. Indexing is basically storing those features, maybe in the same format, or changed a little bit to make them smaller, clearer, less redundant. And then the query, or search: you query with an image by indexing it. The new image comes in, it gets indexed, and then it is compared against the features of everything else you have in the archive, which means you have to go in and index your archive first.

You have six million whole-slide images; you have to index them first to make them searchable. You have to create some tags, some identifiers, some numbers that the computer can easily find. That could be a massive undertaking. And then you can retrieve: you can find the most similar cases, and if the hypothesis is correct, that similar biology, similar morphology, similar genomics mean similar diagnosis and similar prognosis, then we have a new technology.
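A minimal sketch of that index-then-query loop with a deliberately simple handcrafted feature (a color histogram); real systems would use far better features, but the flow of indexing the archive and indexing the query the same way is the same.

```python
# A minimal sketch of the classical CBIR loop: index the archive by a feature,
# index the query image the same way, and return the nearest cases.
import numpy as np

rng = np.random.default_rng(2)

def color_histogram(img, bins=8):
    # One normalized histogram per channel, concatenated into one vector.
    h = [np.histogram(img[..., c], bins=bins, range=(0, 1))[0] for c in range(3)]
    v = np.concatenate(h).astype(float)
    return v / v.sum()

archive = {f"case_{i:03d}": rng.random((64, 64, 3)) for i in range(20)}   # toy images
index = {name: color_histogram(img) for name, img in archive.items()}    # indexing

def search(query_img, k=3):
    q = color_histogram(query_img)                  # index the query the same way
    dists = {n: np.linalg.norm(q - v) for n, v in index.items()}
    return sorted(dists, key=dists.get)[:k]         # most similar cases first

print(search(rng.random((64, 64, 3))))
```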

Of course, Content-based Image Retrieval is used in image search engines, digital asset management, surveillance systems, multimedia, you name it. It has been implemented in many fields other than medicine. People are using it. People are making money with it. Not in medicine.

So, medical CBIR, Content-based Image Retrieval, aims to assist health professionals: researchers, clinicians. All types of medical images: X-rays, MRI, CT scans, ultrasound images, of course histopathology data, and more. And we are hoping that those images contain valuable information. Not really the images themselves, but the associated metadata that is stored with them: radiology reports, pathology reports. That's the value that somebody, the pathologist, has put into the image. The image itself is actually quite useless, at least for the computer.

It becomes valuable when we read the pathology report. So, when you connect the pathology report to the tissue image, then it becomes valuable. So, a bit of history. Many, many people have worked in this area, and I went back and looked at the papers that people usually cite. And I said, oh my God, I know all these people. So, let's look at what they have done. We cannot possibly go into all the details, and I will not bore you with technicalities, but I just want to give you a cross-section of where we are after thirty years of image search.

Everything starts with gray levels, with black and white. This is almost three decades ago, and people started looking at tissue images. And, interestingly, surprisingly from today's perspective, they used quite sophisticated machine learning approaches to look at tissue images. We didn't have digital pathology, so to speak, yet. But people started looking at, okay, what happens if I have ... and I want to search now? You had to be quite visionary to do that. And two years later, people started talking about, okay, let's look at the Gleason grading of the prostate, and what can we do? How can we search for this? We need appropriate features.

The most important question was, actually, immediately recognized, which is amazing from today's perspective. And geometrical structures, of course, were of interest. The next candidate people looked at was, of course, breast cancer. Let's look at breast cancer cases: fifty-seven captured cases, which is quite large for 2000. And individual nuclei and groups of cell nuclei. This is really the fundamental step of recognizing anything in tissue images, and it was recognized for content-based retrieval of breast cancer biopsy cases.

And people started thinking about, okay, how do we represent this? If I find images, we have to start thinking about user interfaces: how do we present those cases to the pathologist, and so on. So, what features do we use? Color histograms, texture, Fourier coefficients and wavelet coefficients. You cannot publish a paper right now with the Fourier transform, not even vaguely; nobody will publish it. Deep features, deep learning killed everything else. And then people looked at the effectiveness of the features, which to this day is a relevant question. It's just amazing, when I went back to look at these papers, how people thought of this 20 years ago. And you see familiar names again and again.

The vector representation of the tissue image is, to this day, still the backbone of image search. You have to represent the tissue as a vector, a small set of numbers, and then you have a chance to index and to find. And you have to look at the similarity measure: how do I measure similarity? If images become two vectors, which is a huge challenge by itself, then I have to compare two vectors, and I'm comparing squamous carcinoma to adenocarcinoma. So it's very interesting that people started doing that. People in this paper looked at different distance metrics, distances that conform to the biological nature, and they proposed a new one.

In 2007, again, people were looking at new structures: let's look at the database, extract features, and then sophisticated machine learning techniques, like manifold learning, were proposed. We still don't have deep learning yet, per se; 2012 has not come yet for AlexNet to win the ImageNet competition. But with this type of feature, if you want to do machine learning approaches, you need to go back and revive some of the old-fashioned techniques like principal component analysis, because we need features that are not redundant. In statistics this may be quite common and quite trivial; not in histopathology, not in content-based image retrieval.

So, people started looking at other types of features, not just color, not just histograms. We had scale-invariant features, we had SIFT, and we had SURF, and many other things that people used. Amazon was using them to recognize book covers, and all that. Again, a lot of success in non-medical fields. And people started looking at individual structures in the tissue images: can we grab individual structures with a scale-invariant feature? Scale-invariance is a problem, rotation-invariance is a problem.

To this day, deep learning has not solved the problem of rotation-invariance, and the tissue doesn't have any preferred angle or rotation. This is in no way a challenge to any pathologist, but it is a challenge to AI. And we get to 2012, when a Content-based Microscopic Image Retrieval System is something that people talk about. This is actually from Mateen's group around that time. Investigations become more and more serious, with more and more images, and we look at semantically-annotated data to make sure that we are really matching the right things with the right things. Research is becoming more serious.

So far, still, nobody has attempted to look at the whole-slide image as a whole. What do we do with everything at once? It was just too much. And another approach was: let's go and see, because one of the main things about Content-based Image Retrieval is that the output doesn't matter. The output can be diagnosis, it can be grading, it can be subtyping, it can be anything, in contrast to supervised learning where you have to determine it up front: I train to say malignant or benign. The output that is attached to a pathology image and a pathology report can change abruptly, and it doesn't matter, the system still works, because when you compare, you find similar patients. You can then go and say, okay, hence the output is this, and you can change that output. This is one of the amazing flexibilities of Information Retrieval which supervised AI cannot imitate.

Transitioning to Hashing and Barcodes:

Okay, so at some point we transitioned to hashing and barcodes, which is a fancy name for... We realized, look, things are very computationally intensive, things are big in pathology; we have to come up with something that is fast and compact. The fastest and most compact thing we have in computer science is binary: zero and one, black and white. So, let's make things black and white. Not the images, but the features, the representations. And one of the ideas was, okay, can I get projections? The same projections that we are using in CT machines, Radon projections. Can we project features and then make them binary, and then suddenly we have a binary code for a tumor? Can we do that?

We also looked at: can we bring in autoencoders to compress information? Again, things are big in histopathology; the main problem, from a computer science perspective, is that things are big. Autoencoders are a natural candidate. Can I compactify them, make them smaller, and then make them binary, such that I can retrieve information faster? Another one: can I assign a barcode to an image? Not the barcode that you see in the grocery store, but a barcode that comes from the content of the image. Can I do that? Again, using some sort of deep barcodes or deep models.

So, one of the crucial ideas, and this is also part of one of the search engines, was to look at the change of the feature and encode the change, not the feature itself. That was a crucial idea. Hashing is a huge field in computer science, with many sophisticated methods. This method is extremely simple compared to them, painfully simple: just take the first derivative of the feature vector, and you get something that says, okay, if the features are characteristic for your tissue, so are their changes. And if you take the changes, the changes are actually more robust to noise as well, because your feature magnitudes may go up and down, but the change, going from feature to feature, is more robust.
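A minimal sketch of that idea, encoding the change of the feature vector and binarizing it; this follows the general description in the talk, not the exact recipe of any particular published barcode method.

```python
# A minimal sketch of a barcode from feature changes: take the difference of
# consecutive feature values and keep only the sign, giving bits of 0 and 1.
import numpy as np

def barcode(features):
    diffs = np.diff(features)             # first derivative of the feature vector
    return (diffs >= 0).astype(np.uint8)  # 1 where the feature goes up, 0 where it goes down

rng = np.random.default_rng(4)
deep_features = rng.normal(size=1024)     # e.g., an embedding from some network
code = barcode(deep_features)             # 1023 bits per patch
print(code[:16], "...", code.size, "bits")

# Barcodes are compared with the Hamming distance, which is extremely fast.
other = barcode(rng.normal(size=1024))
print("Hamming distance:", int(np.count_nonzero(code != other)))
```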

And then, with deep features, basically we said, okay, can you go back to the tissue? We have a comparative study of, let's say, a CNN compared to Bag of Visual Words, Local Binary Patterns, some classical features that people were working on. And supervised graph hashing was another example. There you are encoding individual cells; that's a very sophisticated method. Let's put individual cells in binary format, their shape, their mixture, and then we have an atlas of individual cells in binary format. Then you do group-to-group matching, basically, a very sophisticated idea.

Another approach was: no, don't do it with individual nuclei. That's unmanageable, you cannot do it. We have how many? Five million nuclei per whole-slide image, no? We cannot do it. Keep the hashing at the feature level, which means you have a feature vector of your tissue, you hash it, you map it to some zeros and ones; that's what hashing means. It's fancy terminology for saying make it black and white. Do it for SIFT features, HOG features, GIST features. These are all pre-deep-learning, handcrafted features, which many people may dismiss, but they may still have some value here and there.

But the deep barcodes were, basically, sort of a breakthrough, because you take the deep features, it doesn't matter from which network they come, and you convert them to a barcode. If you get 1,024 numbers from a deep network, you get 1,024 bits of zero and one. And now you have something you can encode that is very, very small, very, very compact, and the fastest there is. Nobody can do faster than binary; binary is the fastest.

So, okay, before we go on, let's get on one page about the nuts and bolts of image search in histopathology. Liron Pantanowitz and I thought we should basically clarify some things for everybody, not just for the ones who are deeply into information retrieval.

Why we search

So, why do we search? We want to observe cellular morphology, identify cells, note cellular abnormalities, look at tissue integrity, look at the inflammatory response, analyze tumor characteristics, and use special stains and biomarker analysis. All of that can be done with information retrieval. When we do that, because the images are big, as has been mentioned, you put a patch, a tile, through the network, not the entire whole-slide image, because we want to close the Semantic Gap: the way that pathologists understand the image versus the way the computer understands the image. And we get a deep embedding. The deep embedding is, basically, here: this pink layer, one of the layers of the network, is the so-called deep embedding, or the features.

So, we don't care about the output of the network, what the network says. If you put a lobular carcinoma through DenseNet, DenseNet will classify it as matches, because matches are one of the things that DenseNet has learned. But we don't care about that, we use the features. But whole-slide images are big, so we need some sort of Divide and Conquer. This is a question that is ignored in most papers, as if there were somebody who will do it for us. Nobody will do it for us, we have to deal with this. So, what we do, we do patching and tiling, everybody does that. And then the question is, do you want to use all of them? 100,000 by 100,000 pixels, as Muhammad was saying. Do you want to process all patches? Do you have a discount from some cloud provider? How do you want to do this?

So, we do Divide and Conquer in computer science, generally: we Divide, we Conquer, and then we Combine to get a solution. For example, sorting a list of numbers: you recursively divide the list in half until you have the smallest manageable lists, you sort those, and then you combine. In CNNs, you repeatedly apply convolution and pooling, that's the Divide. That's why we have deep learning, because we figured out a way to Divide. And then the Conquer is the fully connected layers, the MLP that was not working back in the 70s and 80s. And then the final classification, so Divide, Conquer, Combine. Any difficult problem we touch, try to recognize these three stages, it's characteristic. And many people just skip the Divide, because that makes the job a lot easier.

And, of course, in search we have this too. You have to divide the whole-slide image into many patches and select some of the patches, and select them how? And then you get the patch features, you encode the features, and then you can just compare. So, again, Divide, Conquer, Combine. If I have a million numbers that I want to sort, I divide them, cut in half, cut in half, cut in half, until I get to a small problem that is manageable. One number, I can sort one number; one number is sorted by itself, I don't need anything. So, you have to Divide to get to the base case, and then you Conquer. Now, if I compare eight and four, that's an easy problem. And then I can Combine, and it becomes a lot easier, so I can sort the list.
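A minimal merge sort in Python, just to make the Divide, Conquer, Combine stages of the sorting example concrete:

def merge_sort(xs):
    # Divide: cut the list in half until the base case (one element is already sorted).
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])

    # Conquer and Combine: merge the two sorted halves back together.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([8, 4, 7, 1, 9, 2]))   # [1, 2, 4, 7, 8, 9]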

Now, if you have 100,000 by 100,000 pixels times three channels, and you are not just sorting, you want to identify cancer, the problem is orders of magnitude more difficult than sorting. So, we cannot skip the Divide. How do we Divide? You can do sub-setting: you just take one big, large part of the whole-slide image, this is the cancer. Maybe you can do that if you have only one primary diagnosis. Or you get a set of patches, somehow, that represent something. So, okay, how do we do this? If I patch and then I Divide, that has to be unsupervised, biopsy-independent, diagnostically inclusive, efficient. Then I have the patches, I can send them through some network, Conquer them by getting some features, and then I have to Combine them.

Somehow, I have to make it smaller, because we have a storage problem. That is one of the main problems. Maybe journals do not ask you that, but the administration of the hospital will ask you that: how much additional storage do you need? So, that's one of the obstacles that we have to overcome to deploy. So, why do we need the Divide? Well, you cannot do random patch selection, because that's not reliable, you will miss something. You cannot do sub-setting, because that's supervised: you have to train something, you need annotations subject to variability. All those problems, we cannot have that. And it would be just for breast, just for prostate, just for my hospital. We cannot do that, it has to be unsupervised.

Or, you say, no, no, no, I process everything. Well, again, I don't know. Are you working at Google or Amazon or Microsoft? Who will pay for this? That's one of the reasons that in our validation we removed SMILY, proposed by colleagues from Google, because they process everything. We cannot afford to process everything, we have to make a selection and then process the selection. So, that selection is the Divide of the tissue. For many purposes, not just for search, people are skipping that. Even for classification, even for prediction, even for grading, even for staging, you have to work with a subset. Just look into the details of the papers that you read: where is the Divide? Are you expecting me to subscribe to Google Cloud?

Google is a fantastic company, but I don't have money to pay them for a million whole-slide images. So, it has to be universal. The Divide must be unsupervised. It has to be independent of the tissue size, and type, and shape. Is it a core biopsy? Is it an excision? Whatever it is, it has to be independent of that. It has to be diagnostically inclusive, you cannot miss something. It has to be fast, and it cannot ask for extra storage. The last one is a very unpopular topic for academic researchers. You cannot publish papers saying, we save storage. What? Who cares? Just buy more SSDs. Well, who pays for it? So, it's an undertaking that, if you're genuinely interested in deploying AI, it matters. We have to solve this, we have to address this.

So, you have your whole-slide image, you do your patching, your Divide. You have a network, any network, maybe pre-trained: a regular CNN, a Vision Transformer, a Foundation model. You get your index, you can create your Atlas. Or you can create your archive, your index archive, and you can connect it to your database of additional information. So, you can start doing information retrieval with it. If I have an image and I send it to an image search, and the image search gives me back three similar patients, those patients come with a lot of additional information. That's what matters for information retrieval. That's where information retrieval cannot be beaten by any other technology, because it's self-explanatory.
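Schematically, the indexing and retrieval loop described here might look like the following sketch. The helpers select_patches, extract_features, to_barcode and distance are placeholders for whatever Divide, Conquer and Combine choices are made (mosaic selection, a CNN or foundation model, barcoding, Hamming distance, and so on); the aggregation over patch distances below is just one simple choice among many.

def build_index(slides, select_patches, extract_features, to_barcode):
    index = []
    for slide in slides:
        patches = select_patches(slide)                    # Divide
        feats = [extract_features(p) for p in patches]     # Conquer
        codes = [to_barcode(f) for f in feats]             # Combine / encode
        index.append({"slide_id": slide.id,
                      "barcodes": codes,
                      "metadata": slide.metadata})         # reports, diagnosis, ...
    return index

def query(index, slide, select_patches, extract_features, to_barcode, distance, top_k=3):
    q_codes = [to_barcode(extract_features(p)) for p in select_patches(slide)]
    # Score each archived slide by its best patch-to-patch match (one simple choice).
    scored = [(min(distance(q, c) for q in q_codes for c in entry["barcodes"]), entry)
              for entry in index]
    return [entry for _, entry in sorted(scored, key=lambda s: s[0])[:top_k]]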

It brings the information, saying, I'm saying this is lobular carcinoma because I found five patients, and they look like this, and we treated them like that. That's evidence-based medicine, it doesn't need additional explanation. It's based on whatever pathologists have done in the past. So, the pathologist is looking at this image, and the other information is coming from others, so that's a sort of virtual, computational consultation, and your colleagues are not even there. So, if you have that, and you get your similar cases, you can combine their metadata and you get a summary that could be a sort of computational consensus. Now LLMs can be useful, because summarization and aggregation is what they can do, not subject to hallucination, not subject to distortion.

So, how do we measure the search accuracy? The query comes, we search, we find cases, let's say we have two classes, red and green, and we sort them based on distance, the distance between the feature vectors. So, okay, then somebody says, okay, I will look at the top five, right? I will look at the top five matches. I cannot look at the top thousand, because I want to show something to the pathologist, and I cannot overwhelm the pathologist by showing the top 100. If I say this is adenocarcinoma and the pathologist says, why?, I want to show the top three. So, top three or top five is the maximum you can do, because if the pathologist asks for evidence, the evidence has to be manageable.

So, if I look at the top five and I use top-n accuracy the way computer vision does, the class is green and you're correct, because the first hit is green. But if I look at the majority of the top five, the majority is red, and you're wrong. The computer vision community makes it simple: if I retrieve 10 and one of them is correct, I'm correct. No, in medicine you are correct if the majority of the things you retrieve is correct. It's a tough measurement of accuracy, very conservative. If you get the top 100, at least 51 of them have to be correct for you to say you are correct. And, of course, you can use precision and recall, but an F1-score, a macro-averaged F1-score, is probably one of the most reliable things we can do.
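A small sketch of the two accuracy criteria being contrasted here, top-1 versus majority-of-top-k, using the red/green example from the slide:

from collections import Counter

def top1_correct(retrieved_labels, true_label):
    return retrieved_labels[0] == true_label

def majority_topk_correct(retrieved_labels, true_label, k=5):
    # Conservative criterion: the majority of the top-k retrieved cases
    # must share the query's class for the search to count as correct.
    winner, _ = Counter(retrieved_labels[:k]).most_common(1)[0]
    return winner == true_label

# The example from the talk: the first hit is green, but 3 of the top 5 are red.
retrieved = ["green", "red", "red", "green", "red"]
print(top1_correct(retrieved, "green"))             # True  (computer-vision style)
print(majority_topk_correct(retrieved, "green"))    # False (majority-of-top-5 style)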

So, we did a large-scale validation of search engines. It took us seven months to do it; the paper is under review, and the preprint is available to the public. You need some requirements: you need high-quality data, you need a diverse population, you need multimodal data, you need a fast search algorithm, low demand for indexing storage, robustness, a user-friendly interface, and so on. We looked at several recent methods; we could not go back and look at the past 30 years. We looked at the recent ones: deep learning, whole-slide images, things like that. And we looked at, okay, which one of them has a Divide. SMILY didn't have a Divide, so we removed SMILY; we cannot look at it. For the Conquer, most of them, all of them, use deep features. Okay, good. Whether we use this network or that network is not a point worth fighting and discussing over.

And for the Combining, some of them do brute force, some of them do barcoding, some of them use trees, and some of them just do nothing and simply compare. So, again, we removed SMILY because it doesn't have a Divide. I personally do not take any method that does not have a Divide seriously. They may get citations, and they will get citations, but I do not take them seriously, because how should I use it? How should I test it? I cannot go and ask for resources to deploy it in the hospital.

Bag of Visual Words

Bag of Visual Words is a very old method, but it really has a unique Divide. It's not explored properly, it needs customization, but it has a lot of potential. Maybe we should look at Bag of Visual Words again with deep features.

Yottixel

Yottixel is one that I have been involved with. It has a unique Divide that is called the Mosaic. You can use any network for it, it uses binary encoding, but it needs customization and it's a commercial product. Not that a commercial product is a bad thing, and I'm not involved with the commercial side of it, but I want my freedom as a researcher to do whatever I want.

SISH

SISH is too complex. It uses Yottixel's indexing, which means no innovation there. It is slow, it uses an obsolete data structure, and it needs excessive storage. And when we say this, we mean Yottixel's encoding and indexing: indexing is the backbone of every search engine. The tissue, the whole-slide image, comes in, gets patched, gets clustered, and you get a mosaic. On average, Yottixel gets 80 patches to represent the whole-slide image. Then they go through DenseNet, which you can replace with any network, then barcoding, and you have a set of barcodes, a bunch of barcodes. And people came and added an autoencoder and a codebook, and they called it SISH.

The Yottixel people have difficulty with that. That's an academic discussion, okay, whatever. But if you want to propose a new search engine, can you please come up with a new Divide and Conquer? Because we need that, we desperately need that. The Yottixel authors have written comments on it, but SISH has bigger problems than not having its own indexing, and among others, that is the storage.

RetCCL

RetCCL is another one. It has a very simple structure, it's basically a network, it's not a search engine. But when papers get published in really good journals and platforms, you have to deal with them. You don't have a choice, you cannot pick and choose. People will ask you to deal with them, to compare with them.

So, now, these are the ones that remain.

Divide

Bag of Visual Words, Yottixel, SISH, and RetCCL are the four remaining ones for us to look at. For the Divide, Bag of Visual Words has a very specific one, and Yottixel has the Mosaic that everybody else seems to be using.

Features

Features: again, they use different types of networks, and the encoding is clustering, barcoding, or long integers.

Space

Space: all of them are linear, with the exception of SISH. SISH is exponential, and exponential means it's not doable, it's not feasible.

Matching

Matching: Euclidean distance, Hamming distance, going through a tree, and cosine similarity.
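For reference, the three distance measures in this list (the tree traversal aside) are straightforward to compute; a minimal sketch:

import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hamming_bits(a_bits, b_bits):
    # Number of positions where two binary codes disagree.
    return int(np.sum(a_bits != b_bits))

a, b = np.random.randn(512), np.random.randn(512)
print(euclidean(a, b), cosine_similarity(a, b))
print(hamming_bits((a > 0).astype(np.uint8), (b > 0).astype(np.uint8)))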

Speed

Speed: all of them are linear. SISH is constant time, which is extremely impressive. When you hear constant time, that means it doesn't matter whether you have 100,000 whole-slide images or a million, the time for search will be constant. Oh my God, I want to have that. So we went after it. I want to have it, if it really is constant time.

Post-processing

And then post-processing: none of them should need reranking, but two of the methods do reranking.

So, we applied a lot of techniques to a lot of datasets, internal and public, to test these search engines. And the reranking is a problem, because if you do ranking after the search, that means you are not happy with the search. You can post-process, but search is itself a ranking, you rank your distances. If you rank additionally after your retrieval is done, that means you are not happy with your search. Go and optimize your search engine, do not introduce post-processing.

And when you do that reranking, it means you cannot really perform patient-level WSI matching, because you rank patches. And when you rank patches, you may lose the overview of whole-slide image to whole-slide image comparison.

Can you do WSI?

So, in Bag of Visual Words it's easy. You compare two histograms, which means you are comparing two patients; the whole-slide images are two different patients.
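A minimal sketch of that comparison, assuming the visual-word assignments of the patches are already available; histogram intersection is used here only as one common choice of similarity.

import numpy as np

def bovw_histogram(patch_word_ids, vocab_size):
    # Whole-slide representation: normalized counts of visual-word assignments.
    hist = np.bincount(patch_word_ids, minlength=vocab_size).astype(float)
    return hist / max(hist.sum(), 1.0)

def histogram_intersection(h1, h2):
    # Similarity of two slides (two patients) as the overlap of their histograms.
    return float(np.minimum(h1, h2).sum())

h_a = bovw_histogram(np.array([0, 3, 3, 7, 1]), vocab_size=8)
h_b = bovw_histogram(np.array([0, 3, 2, 7, 7]), vocab_size=8)
print(histogram_intersection(h_a, h_b))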

In Yottixel it's easy, because you compare the Mosaic of one patient with the Mosaic of another patient. Again, you can do tissue-to-tissue comparison.

In SISH and RetCCL you cannot do that, or at least you cannot guarantee that you can do it, because you do additional ranking of individual patches. You break down that patch selection, and then you may get good numbers for classification, but you will lose your capability of comparing tissue to tissue. That's something we cannot let go.

So, think about digital twins. Do we have something like digital siblings? Do we want to do that? I want to have the capability to compare WSI to WSI. We looked at top-1 accuracy, Yottixel was the best. We looked at majority of top-3 accuracy, Yottixel was the best. Again, disclosing: I'm one of the inventors of Yottixel. And I was just telling [unintelligible] that I left this part to be done by three other colleagues, and somebody came and double-checked our results. Not just, somehow, implicitly saying my method is the best because I'm a handsome guy. You have to make sure that when you do this, things are verifiable and reproducible for everybody.

Majority of Top-5

Again, Yottixel was the best. And we also looked at indexing and search time. When you look at the search time, Bag of Visual Words, with one histogram, was even slower than Yottixel, because Yottixel uses binary information. But RetCCL and SISH were very slow: two and a half minutes per WSI with a powerful GPU, whereas Yottixel was twelve seconds. But SISH has been published with the title "fast and scalable". Well, yes, if you just search and somebody else indexes for you, which is an additional problem. But failure is also a problem.

We saw that Bag of Visual Words and Yottixel fail in a few cases. One of the things that nobody reports is how many times you fail. If I give you a patient, are you always successful, or do you sometimes fail? Because I have to know. The principle of graceful degradation in computer science says, if you fail, fail in a way that you do not cause any issue for people. RetCCL failed 37 times, SISH failed 182 times. Part of it is, again, because of that ranking they add to boost accuracy: sometimes they remove all patches and nothing remains, and the algorithm collapses.

"Indexing" Storage

So, indexing storage, it's one of my favorite topics. How much additional storage do you need to do your search and retrieval? We measured that: how much additional overhead do you need per WSI? For Bag of Visual Words it was 0.03 kilobytes, for SISH it was 97 kilobytes, and if you go to 1 million whole-slide images, that's 10 gigabytes for Bag of Visual Words and 31 terabytes for SISH. So, I want to see who will load 31 terabytes into memory, I want to see that as an engineer. Impossible, you cannot do it. So, what do you do? You read it 100 gigabytes by 100 gigabytes. If you do that, your constant time flies out the window. You are not doing constant time, you are not fast, you will need 10 minutes per whole-slide image.

But this is what we leave out, because speed and accuracy are very attractive topics and storage is not. But then you go and validate, and things fall apart. That's the part of, yes, the idea is simple, but are we serious? The storage is a problem, because the democratization of AI will depend on it. Developing countries already have problems with that. And if you don't care about developing countries, that's fine. But do you think rich hospitals will just pay for it, because we can pay for it? You need a lot of storage, you need a lot of GPU power, forget even about the carbon footprint. Nobody wants to just waste money, and it's not responsible. So, we have to look at the extra storage that every search engine requires.

Which one is the best?

We looked at it, so these are the defaults as proposed, and we ranked them based on everything. And this is Yottixel with KimiaNet; KimiaNet has been trained with TCGA. This is a simple, accurate, fast technique, but it's a commercial product, and that's what I don't like about it. No disrespect to that company, but it takes the freedom away from us to work on it. Our best bet is probably Bag of Visual Words. It's a good, old-fashioned technique, it has a lot of room for improvement, it's compact, it's fast, it just needs better features, and we can work on it. So, there is room to work on it.

So, we looked at some of the papers, and Foundation models are coming out. They had looked at SISH and reported 47%. We agree with that, we know that SISH provides low accuracy, but they compared that with Yottixel using a regular network. A Foundation model against a small CNN, and a CNN is not a Foundation model, so it doesn't add up. So, a lot of things, and that's a real concern, a real concern: I don't know how things are done, I have no idea. We have to be nice, I don't know.

How do things get published, and then, when you validate them, they fall apart? What happens? What are we not checking as reviewers when we read papers? But we cannot use any of these search methods that I talked about, because we need multimodal search. That's the problem: we cannot use Yottixel, we cannot use Bag of Visual Words. Even if we improve them, we cannot use them, because we need a multimodal approach.

Which search engine to choose

So, of all of them, Yottixel with KimiaNet may be the best, but they all have low accuracy. Speed and storage are important in pathology. There is no automated WSI selection. There is no algorithm for setting magnification and patch selection. And there is no multimodal approach. So, I will stop here, because I had a little bit more to show, a demo of the multimodal Atlas that we have, but we can do that another time. Or maybe I take just two minutes and go through the...

Presenter: Yeah.

Dr. Tizhoosh, Ph.D.: Okay, so let me see.

Presenter: Yeah. We love looking at demos, don't we?

Dr. Tizhoosh, Ph.D.: Yes, yes. Let me see, okay, so, let's go to our prototype. Hopefully everything works. So, selecting an atlas. This is an offline demo case. I want to look at this whole-slide image. I have some questions or some notes. I have, let's say, gene expression extracted from RNA sequencing, and I want to look at the top three. And I upload them. The upload is not real time, a lot of it has been buffered just for the sake of the demo, and after the upload is done, this is supposed to be a network.

So, this is the query, this is the image that I just uploaded, with gene expression and some questions or descriptions or doubts. And then the search engine found, and that part was real time, the search was real time, these top three: their gene expression, notes, and diagnosis: Squamous, Squamous, Squamous. That's an easy case, I know, I'm showing an easy case, and this is TCGA, because I don't have the clearance to show anything internal just yet.

So, I want to generate a report, so I look at the testing of the cases. For example, I want to look at the whole-slide image and gene expression together. And when you use them together, you see my patient here, my patient here. It's really separated here with the Squamous, because I used whole-slide image and gene expression at the same time, multimodal, combined. So, it becomes relatively easy. I know, lung cancer is an easy case. So, I do that, and then I go back, I save this. What do I want to do? I add visualization: give me, show me.

So, this was my image. If this was my image, which part did you find in other patients that was similar to my patient, and so on? And I just say, okay, generate a report, and then we generate a report. We can, let's say, give a patient ID here, and this is the gene expression, the top 20 gene expressions that are common between those cases, and the majority is saying this is a Squamous Cell Carcinoma, and the LLM is giving us a summary of those reports. So, if I go back, I had three reports for the three cases, we combine them into one, and say, okay, this is, probably, the description of your patient. And you can save and exit, and hopefully a report is generated. And the report is very rudimentary at this stage, just showing you the principle.

Thank you so much, I appreciate it.

Presenter: Any questions?

Question 1: [name unintelligible], University of Louisville. Thank you so much for this lecture, really amazing. What I'm thinking here: these types of search have many different use cases. You can use them for research, for education, but I'm thinking particularly about the clinical use case. As a pathologist, one of the things that I wanted was, whenever I would come across an unusual tumor, to find instances of this unusual tumor in a clinical database that has thousands and thousands, maybe hundreds of thousands, of different diagnoses. Do you know if anyone has ever tested these systems in this specific setting? A setting where you have a large database with many different types of diagnoses, and you are looking for a rare instance? I would say that this is the use case that people would be looking for.

Dr. Tizhoosh, Ph.D.: Actually, one of the test cases that we have, and that's why the accuracy of all these search engines is low, was our breast cancer data with 16 subtypes. Most of them very rare, we have two or three patients a year at Mayo. And you can look up one of the papers that we did with the WHO, published in Modern Pathology. We looked at 38 subtypes and we showed that even the rare cases can be found, and that's the beauty of information retrieval in general. It doesn't matter that you have imbalance; if you have a good arrangement, you can find the rare cases. Because it is not trained for that, it looks for similarities, it's different. Principally, I would say generally, that is one of the strong points of information retrieval, for any method. And it can be done; the only condition is that you need good features.

So, now, everybody says either you train for your own Atlas, and in that Modern Pathology paper we did that, or you have a gigantic all-purpose Foundation model that is really good. Then it should work, because it is unsupervised. Supervised methods will have a problem with that, but not unsupervised. Clustering and search do not have a problem with that, in general. We're not talking about a desired accuracy yet.

Presenter: Thank you, any other questions?

Question 2: Yeah, I have a question. Thank you, first of all, for a very great talk. I was wondering, it seems like, previously, it used to be Convolutional Neural Networks, and now it's Vision Transformers, increasingly, that are the deep learning framework being used to represent whole-slide images. But I was wondering if you had used or considered, or know anything about, state-space models as possibly being a new paradigm?

And I'm certainly no expert on the topic, but from what I understand, they're a recurrent neural network based system. And because of that, they actually really excel at needle-in-a-haystack type problems, which could be really useful for whole-slide images, where you need to represent only the most relevant parts to minimize storage costs, and so forth. And, additionally, the state lives on the GPU, on the SRAM of the GPU, so it turns out that training them is quite efficient compared to Transformers, which have a quadratic relationship to the input length, whereas state-space models are more linear.

Dr. Tizhoosh, Ph.D.: Are we talking about reinforcement learning with state-space?

Question 2 continued: Yeah, so like the Mamba architecture. Which is, you know, a language-based model, but there are vision applications. I'm just curious if you know anything about it?

Dr. Tizhoosh, Ph.D.: Going back to 2002 to 2007, I did some work on reinforcement learning. Generally, I'm afraid of state-space models. Recently, I have not looked at them. I'm definitely positive that there is improvement there. There are other models.

As for Vision Transformer models, I'm guessing they will disappear in a few years, because they are actually a deviation from the imitation of the human brain, they are a handcrafted architecture. CNNs are closer to the human brain than Vision Transformers. Vision Transformers are exhaustive correlation analysis machines. They are working, we are using them, fantastic. We will keep them as long as there is nothing better.

But we have, meanwhile, the forward-forward models that Geoff Hinton put forward. We have Kolmogorov-Arnold networks that are very promising. There is a lot happening, but for what we do, because we are really interested in deploying for clinical utility, I wouldn't put the main part of my resources on things that are still on the basic side of research. We will keep an eye on them, but I want to do something that is mature enough, such that, hopefully, with a little bit of effort, we can deploy it.

Presenter: I know you have other questions, but we are not done, we are not done. What we are going to do is first thank our speakers.

[Applause]

Learning or Searching: Foundation Models and Information Retrieval in Digital Pathology

This talk highlights the problem of intraobserver and interobserver variability in pathology, showing how inconsistent diagnoses threaten patient care. The speaker contrasts two AI approaches:

  • Large classification models. This approach aims for high accuracy through classification but risks limited generalization and hallucination.
  • Search and retrieval systems. This approach grounds decisions in evidence from past cases.

Dr. Tizhoosh argues that reducing variability should be the primary goal of AI in medicine and that retrieval-augmented methods may offer more trustworthy, evidence-based support for clinical consensus than classification alone.

Learning or Searching: Foundation Models and Information Retrieval in Digital Pathology

Presenter, Adam Shepard: So, hi everyone, and welcome to the 15th TIA Center Seminar of the academic year. My name is Adam Shepard, and I'm a postdoc here at the TIA Center. For the people online that are new to our seminars, we aim to invite researchers from across the globe to present new and exciting work. Before we get started, I'd just like to remind everyone that a few of us members from the TIA Center [unintelligible] 2024 conference in Manchester, titled Recent Advances in Computational Pathology. The full deadline is today, and it's a really great opportunity to submit and, hopefully, present some work. If you have any questions, feel free to ask. [unintelligible] would like to give a quick introduction.

Presenter 2: Hi, so I'm really honored to have Dr. Tizhoosh as our TIA Seminar speaker today. Dr. Tizhoosh has been at the forefront of developments in the area of computational pathology, and we are really fortunate that he accepted our invitation. So, thank you very much, I'm really looking forward to your talk. I think Adam will do the formal introduction, but I just want to say I'm really grateful that you accepted our invitation and took time out of your busy schedule to inform us about all the exciting things.

Just a little bit more. So, thank you very much for joining us. For everyone online, and in person, I'd like to introduce Professor Hamid Tizhoosh. Professor Hamid Tizhoosh is a Professor of Biomedical Informatics in the Department of AI and Informatics at the Mayo Clinic.

From 2001 to 2021 he was a professor in the Faculty of Engineering at the University of Waterloo, where he founded the KIMIA Lab. Before he joined the University of Waterloo, he was a Research Associate at the Knowledge and Intelligence Systems Laboratory at the University of Toronto, where he worked on AI methods such as reinforcement learning. Since 1993, his research activities have encompassed AI, computer vision, and medical imaging. He has developed algorithms for medical image filtering, segmentation, and search. He is the author of two books, 14 book chapters, and more than 140 journal and conference papers.

The title of the talk today is Foundation Models in Histopathology. So, again, thank you very much for joining us, and please start whenever you're ready.

H.R. Tizhoosh, Ph.D., Professor of Biomedical Informatics, Mayo Clinic: We can start?

Presenter: Yeah, please thank you.

Dr. Tizhoosh: Thank you very much for the kind introduction, I appreciate that. I'm grateful for the opportunity. I look at this as just reporting back to the community. And hopefully we can share some high-level findings and, maybe, some abstract philosophies: which direction we have to move in and what we have to do. And hopefully we get some questions and some feedback.

The title is more or less the same thing. The question is about Foundation models and Information Retrieval in Digital Pathology: what does it look like at the moment, and maybe where should we go. The point of departure for us is observer variability in medicine, which seems to be the source of almost all problems. Whatever we touch, triaging, diagnosis, treatment planning, seems to be subject to variability. You ask multiple physicians, multiple experts, and you get different responses. And, of course, for us in digital pathology, which is the diagnostic gold standard for many diseases, looking at whole-slide images and then coming up with a diagnosis or subtyping or grading is one of the most important aspects of variability.

And there are many, many reports coming out. Here is one that shows six breast pathologists looking at different cases, 100 cases of invasive stage two carcinomas. And you get an interobserver variability of 0.5 to 0.8, and, more scary, an intraobserver variability of 0.76, which is always interesting. Because if you have a proper washout between the first and second observation, why is even the intraobserver variability that high?

It's scary from a patient's perspective. And, of course, there are many, many studies, and you can use Kappa statistics, or you can use something relatively new in medicine, Krippendorff's Alpha, to look at variability and measure it. Here, a little bit larger: 149 consecutive DCIS cases, 39 pathologists. And you see the Krippendorff's Alpha. You can talk about consensus if you have at least 66%; that's a measure coming from Communication Theory, and people have started using it in medicine in recent years.
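For reference, a minimal sketch of Krippendorff's Alpha for nominal ratings (dedicated packages and weighted variants exist; the toy data below is invented purely for illustration):

from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    # units: one list of category labels per case, one label per rater who rated it.
    coincidence = Counter()
    for ratings in units:
        ratings = [r for r in ratings if r is not None]
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):       # ordered pairs of ratings in this unit
            coincidence[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), w in coincidence.items():
        totals[a] += w
    n = sum(totals.values())
    observed_disagreement = sum(w for (a, b), w in coincidence.items() if a != b)
    expected = n * n - sum(v * v for v in totals.values())
    return 1.0 - (n - 1) * observed_disagreement / expected

# Toy example: 5 cases rated by 3 pathologists (nominal diagnoses).
cases = [["DCIS", "DCIS", "DCIS"],
         ["benign", "DCIS", "benign"],
         ["benign", "benign", "benign"],
         ["DCIS", "benign", "DCIS"],
         ["DCIS", "DCIS", "benign"]]
print(krippendorff_alpha_nominal(cases))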

And you see all of them are below 66%. And, of course, when we report this, it's not about the solution, people are just talking about the problem, how bad the problem is. And cytopathology is not an exception, cytopathology is suffering from the same thing, in a different way. Because, for example, in cytopathology we may be counting and measuring things more explicitly, more quantitatively, than in diagnostic histopathology.

So, the question is, can AI remove variability? I don't know what question we can ask that is more important than this. It doesn't matter what AI does, whether it is segmentation, identification, searching, providing embeddings, or detection. Whatever it is, if the ultimate goal is not removing variability, I personally would have a problem saying why we are doing this. If AI cannot help us to remove the observer variability, or eliminate it, or at least reduce it drastically.

So, this is something that I show in many of my presentations. Okay, if I give you a piece of a whole-slide image, you may classify it to get rid of variability: you are saying malignant, yes or no. You may come up with a stage or grade, and you may provide a probability, a likelihood, that the tissue sample belongs to a certain class. So, the question is, what is it that we are saying when we use classification?

We are saying that many physicians, or all physicians, have to accept what the machine says. That's what we are saying. So, will that happen? Well, if the community at large trusts AI, then that may happen. With regular deep networks, that's not going to happen. Most likely, if you have one or multiple trustable Foundation models, that could happen eventually, as part of conversational AI. But if I go another route, and instead of classification I search and find similar patients, and bring back also the annotated information, reports, patient data, everything else, what is it that we are saying in that case?

In that case we are saying that one physician has to accept what many other physicians have said. That's evidence-based. Because, beyond imaging and molecular data, the other source of evidence that we have is the cases that we have already diagnosed and treated. Evidently, we know they were free from variability, free from error; that's statistical evidence. So, what we are expecting here is that the physician accepts what many other physicians have done and said.

That's more likely to happen. That's the fundamental difference between Retrieval and Foundation models. A Foundation model has to convince us through knowledge, through knowledgeable, smart conversation, whereas Retrieval convinces us through retrieving evidence. Which, of course, could be much easier, if you can really find the relevant information.

So, if you look at a general comparison: Foundation models, or classification basically, deep down come from that corner, and then there is search and retrieval. Classification, historically, not so much Foundation models, has been based on supervision, whereas search was based on unsupervised learning. Now both of them are shifting to self-supervision, which is great, fantastic. And the strength of classification has been high accuracy: if you train something, naturally, you get high accuracy.

The initial papers that came out from the AI community in histopathology reported, I don't know, 98%, 99% accuracy. I mean, you can look at it and see there are easy cases, the tumor is obvious, even any young pathologist could do that. But it was the beginning, so what we were doing was acceptable. Whereas unsupervised search generally has lower accuracy, but it is agnostic to disease and operates on both small and large datasets. Classification is usually difficult to explain, cannot generalize easily, and needs a lot of labeled data. Again, we are talking about the classical setting, before we move to self-supervision and things like that.

And search needs expressive embeddings, which have to come from somewhere. Search and retrieval historically did not deal with feature extraction; in the early years, we just dealt with raw data, no feature extraction. But there is also a lot of information to interpret. I put that under weaknesses, but it could be a strength, the same way as low accuracy: it could be a weakness because it's cautious, it could be a strength because it's cautious, it's very conservative. So, it's not easy to put classification and search in balance and compare them to each other. But if we make the transition to Foundation models...

So, we know that, basically, these are deep models that are general purpose. They are not designed for a specific task, and they are supposed to be adaptable to a large number of tasks. And they are trained on massive datasets, usually a colossal amount of unlabeled data. And here the trouble starts. Because, massive datasets: who has massive datasets? Well, not many hospitals, not many healthcare systems have massive datasets. And if they do, they are not accessible, for many reasons. Not all of it is digital, we have heterogeneous archives and repositories, they are not anonymized, they are not easily accessible. So, it's a problem that in the public domain is not there, but in medicine we have to deal with it.

They are based on self-supervision: so it's not unsupervised, it's not supervised. We go towards self-supervision and we look for patterns and correlations to be able to generalize. And they can be fine-tuned for specific, downstream tasks. However, if you have something that you say is a foundation model for histopathology, this is what that means: your general domain is histopathology, and the expectations shift. It's not like I bring CLIP, which has been trained with cats and dogs and airplanes and bicycles, and then zero-shot learning is not expected to do much. If you have a foundation model that is supposed to know histology and histopathology from the get-go, the expectations will be different.

One point before that. We moved from regular models to Foundation models, or extremely large models, and it seems we wanted to get rid of the overfitting problem. If you don't have much data, you underfit; if you have the right size, you fit; and if your model is too large, you overfit. And then we said, you know what, give me all the data, I make the model big, and then we don't have that problem. Well, great, fantastic. But if you make the network extremely large, you get another set of new problems, which is hallucination.

So, you make things up, which is uncontrolled, it cannot be controlled: uncontrolled following of the trajectory of input-output relationships in a gigantic model. We just basically postponed the problem, and now we are patching. We are patching left and right to make it work, such that it does not hallucinate. And you do not see everywhere the same enthusiasm that journals and publishers display for publishing results on foundation models.

When you go to granting agencies, I don't know about Europe, but in the United States granting agencies, especially the NIH, are very skeptical with respect to what foundation models can do and what dangers they may have, among others because of hallucination. So, okay. We know that, for example, for a WSI, if you have some sort of unsupervised, clustering-based approach, you may end up with on average 80 patches per whole-slide image as a representation. I will get to it; you may say, okay, I will go through all of them, but that's not feasible.

And technologies like CLIP use 400 million image-caption pairs. That means, actually, if you want to do something comparable, you need, roughly... Because you are not doing cats and giraffes and airplanes, we are confined to one domain. So, 400 million divided by 80, roughly: you're talking about at least 5 million whole-slide images. But not just whole-slide images; we need the reports, we need social determinants, we need the lab data, we need radiology, we need genomics, and so on.
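The back-of-the-envelope arithmetic behind that estimate, using the numbers quoted here:

clip_pairs = 400_000_000        # image-caption pairs reportedly used to train CLIP
patches_per_wsi = 80            # average mosaic size per whole-slide image quoted above
print(f"{clip_pairs / patches_per_wsi:,.0f} whole-slide images")   # 5,000,000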

So, that's a monumental data management project. And now you start to understand why people are experimenting with Twitter and PubMed and so on: because nobody has this. Including us, nobody has it. Well, you may have it, but you cannot operate on it, you cannot really train something on it, which means, practically, you don't have it. So, it's a monumental task. And even if you do it as a single hospital, it will be of limited use. Even if you train it like us, with an anticipated six to seven million patients, still, the population diversity will be low, so you need initiatives of multiple hospitals, multiple countries, probably, to do that.

So, of course, the main thing about Foundation models is not a new topology, we are using the same workhorse, the Transformer; it is just about the sheer size, and about what happens with the developing and forming of linear subnetworks inside the network beyond that critical scale. There are some fantastic theoretical works pointing to that.

From our perspective, when we talk about Foundation models, it is all about: if I give you a patch... because we're not there yet, we cannot use the entire whole-slide image exactly. You may use a gigantic patch, maybe 7,000 by 7,000 pixels; 8,000 by 8,000 pixels is the largest that I have personally tried to put through a GPU. But we cannot, at the moment, put the entire whole-slide image through the GPU. So, patches go through the network, and we get some embedding.

So, when is a foundation model a foundation model? We expect two things from a model that claims to be a foundation model in histopathology, one that, again, has been trained for histopathology. Two things:

Zero-Shot Learning: You have to be able to classify never-seen data. If you cannot do it, then you cannot say, look, you have to fine-tune me. No. That means you are a regular network; don't say that you are a foundation model.

And the Quality of Embeddings: This is the toughest test, in my experience, for a network that claims to be a foundation model: check the quality of its embeddings. Get the features, the embeddings, and use them for retrieval, because retrieval is unsupervised. You do not touch anything, you just use them to see: did it capture the histological clues, the anatomical clues in the image, without any fine-tuning, without anything?

So, if it is a foundation model that has seen histology and histopathology, it is expected to do that. Because fine-tuning means I want to do a specific task, but here I am just interested in the quality of the embeddings.
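One simple way to run that embedding-quality test, sketched under the assumption that you already have frozen patch embeddings and their diagnostic labels: a leave-one-out retrieval check with cosine similarity and a majority vote over the top-k neighbours.

import numpy as np

def retrieval_check(embeddings, labels, k=5):
    # Leave-one-out retrieval with frozen embeddings: no training, no fine-tuning.
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)             # never retrieve the query itself
    hits = 0
    for i, row in enumerate(sims):
        top = np.argsort(row)[::-1][:k]
        votes = [labels[j] for j in top]
        if max(set(votes), key=votes.count) == labels[i]:
            hits += 1
    return hits / len(labels)

# Usage: accuracy = retrieval_check(frozen_model_embeddings, diagnosis_labels)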

Okay, so we want to do that for search, and I have been one of the colleagues who has advocated that search really is intelligence. And this is not something new, because search goes back to the roots of AI, back in the 50s. Logic-based search was the beginning of AI: search for proofs of mathematical theorems. We had GPS, the General Problem Solver, which was one of the biggest claims of AI back in the 50s and 60s, to come up with something that can solve all problems. Doesn't that ring a bell?

Foundation models. They claim, okay, we can do it all. We are going back to the GPS in a little more cautious way, but we are saying, basically, if you give me a gigantic amount of data, I can solve all problems. We are now going back to the reason for the first AI winter and saying, now we have a solution for that.

A* Search Algorithm: The A* search algorithm, in the 60s, finds the shortest path, the optimal solution, in a graph.

Alpha-Beta Pruning: For game trees. And expert systems, which were probably the reason for the second AI winter. And now, without quite talking about it, we are reviving them too: we are putting rules in place to prune and edit the responses of foundation models. It will develop in that direction, combined with retrieval. Isn't that a new kind of expert system? Probably it is.

So, the renaissance of Information Retrieval is happening. The biggest example of that is Retrieval-Augmented Generation (RAG). We have the LLMs, they are impressing everybody by telling jokes and writing poems and all that. Trained on massive amounts of data, they are really good at human-quality text, most of the time. But they have problems with factual accuracy and with staying up to date. And you cannot retrain and fine-tune a foundation model every two weeks, it costs a lot of effort.

And External Knowledge Sources: you need external knowledge sources, let's say Wikipedia and PubMed if you're working in the public domain, to take information and supplement the LLM's knowledge. So, people are doing that, and this has been going on for some time. People realized immediately that the knowledgeable conversation can get out of hand, because of those unpredictable correlation trajectories inside the Transformer. You can bound it and control it with retrieval, which is accessing evidence in your domain. Nobody can question that, because this is the evidence, this is the historic data.

So, you can prompt a RAG platform. That means you retrieve, you search for the knowledge, you find relevant information, and then you combine that retrieved information and prompt it, feed it back to the LLM, to generate more reliable text. That means the LLMs, the foundation models, can now base their responses on more factual information, reduce the level of hallucination, and get rid of obsolete information by augmenting through retrieval. Fantastic news for people like me who want to stick with retrieval.
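A schematic sketch of that loop; embed and generate are placeholders for whatever encoder and LLM endpoint are available, and the corpus format is invented purely for illustration.

import numpy as np

def retrieve(question, corpus, embed, top_k=3):
    # Rank the archive by similarity between the question and each document.
    q = embed(question)
    return sorted(corpus, key=lambda doc: -float(np.dot(q, embed(doc["text"]))))[:top_k]

def rag_answer(question, corpus, embed, generate):
    evidence = retrieve(question, corpus, embed)
    context = "\n\n".join(f"[{doc['source']}] {doc['text']}" for doc in evidence)
    prompt = ("Answer using only the evidence below and cite the sources.\n\n"
              f"Evidence:\n{context}\n\nQuestion: {question}")
    return generate(prompt)    # the response is now grounded in retrieved evidence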

Don't give up retrieval, because retrieval is a foundational technology in computer science. We don't want to get rid of that. We knew that we needed it, and specifically in medicine, information retrieval is accessing the general wisdom, the medical wisdom, the evidence from the past. We cannot get rid of that.

So, the advantages of Retrieval-Augmented Generation (RAG): of course, increased accuracy, reliability, and trust. More transparency through source attribution, which deep networks cannot do on their own. Ask ChatGPT and GPT-4 and LLaMA 2 and any of them: where did you see that? Can you tell me a source? Well, unless you connect them to an information retrieval system, they cannot attribute a source to what they are saying. Which, for us in medicine, is a fundamental requirement, to back things up, because somebody has to take responsibility for the diagnostic reports that the pathologist is writing.

What about Generation-Augmented Retrieval (GAR)? Can we do that? Can we retrieve and then use generation to add value to the retrieval? I will show you a very simple example of that, if I get to it. Not much work has been done in that domain, but it's definitely coming. Many people will realize that.

So, how do we search? Well, the primary thing for us in digital pathology is whole-slide images. So, what are the requirements for creating index datasets in medicine? In general: you need high-quality clinical data. Playing with PubMed and Twitter and even online repositories doesn't cut it, you need high-quality clinical data. It has to be a diverse population, it has to be multimodal; we are still mainly on whole-slide images, with a little bit of text. And you need a fast search algorithm when we get there.

At the moment we don't really need a fast search algorithm, because nobody has the 10-million-patient database to search in a multimodal way. But, of course, we have to plan for it, we have to be prepared, we have to have the algorithm when the repositories open up. Low demand for indexing storage: something that every single paper that I have read, including our own, has ignored. You cannot ask that, for indexing the archive, you additionally need 20% of the volume just to index the data (especially the whole-slide images). Who should pay for it? Hospitals cannot pay for it.

I understand that this is not a sexy topic for researchers who just want to focus on the theoretical side and say, my algorithm is fast and accurate. But is it lean in storage or not? If it is not, that's a huge problem; it will not be adopted in practice. It also has to be robust. Many techniques that we tested failed many, many times, and this cannot happen in the clinical workflow. And they have to have a user-friendly interface. Not much has been done in that regard either.

So, going back to the gigapixel whole-slide image: most of the time we just patch it. And then we get a selection of that, there has to be a Divide. Then you Conquer it by putting the patches through some sort of network, and you get some deep features. And then you have to Combine, by somehow aggregating or encoding those deep features, and you have to do this every time. But the Divide has to satisfy some conditions, because you cannot do random patch selection, it will not be reliable.

We cannot do sub-setting instead of patch selection, because it's supervised and needs annotated data. You can do it for specific cases, but it will not be a general-purpose approach. And what most people, most papers, are doing is processing all patches. That's an excessive memory requirement, and it will make things slow. And the only argument that I hear from researchers is, okay, buy some GPU subscriptions on the cloud. Well, really? Who should pay for it, who?

Maybe my employer can pay for it, maybe your employer can pay for it, but can a small clinic in a remote village in Congo pay for it? We are all talking about democratization and making AI accessible. I cannot take that statement seriously if you do not pay attention to the fact that processing whole-slide images is very computationally intensive. You cannot process all patches, you have to make a selection and process that selection, such that the memory requirements stay lean.

So, the Divide of the whole-slide image has to be universal. It has to be unsupervised. It has to process all tissue sizes and all shapes of specimen. It has to be diagnostically inclusive, it cannot miss any relevant part of the tissue. It has to be fast, yes, and many papers have been published on just that aspect, but it also has to show storage efficiency. Which means you have to extract a minimum number of patches and encode them in a very efficient way, such that we can save them. The overhead cannot be much more than 1% of the whole-slide image, and that makes it really difficult.
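One possible sketch of an unsupervised Divide, loosely in the spirit of a mosaic and not a reimplementation of any specific method: cluster all tissue patches by their features and keep a few representatives per cluster, so every tissue pattern is covered without labels or annotations.

import numpy as np
from sklearn.cluster import KMeans

def select_mosaic(patch_features, patch_ids, n_clusters=9, per_cluster=5, seed=0):
    # Group patches by appearance (their feature vectors), then keep the patches
    # closest to each cluster centre so every pattern in the tissue is represented.
    feats = np.asarray(patch_features, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(feats)
    selected = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        d = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        selected.extend(members[np.argsort(d)[:per_cluster]])
    return [patch_ids[i] for i in selected]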

So, then the rest is relatively easy. The whole-slide image comes in, you patch it, you send it to some network and you get the features, you add the metadata, and you can start matching and searching, with some other details. When you send a query whole-slide image to an image search engine, you go in, you retrieve the similar ones, you retrieve the associated metadata. Here, then, LLMs can come and help, Foundation models can come and help. And the main thing is that the retrieved cases come, actually, from other pathologists who are not there.

So, this is a sort of virtual peer review that can enable us to, basically, do computational second opinions, consensus building. And that's extremely valuable in medicine. If we can build computational consensus and say, that's a second opinion based on historic data, and we can do it in an efficient, fast, reliable way, perhaps we can do something. So, we...

Presenter: Someone asked a question in the chat, in regards to the division process. So, is there a risk that, by dividing the WSI into patches, you lose some important information?

Dr. Tizhoosh: If, if we do patching, we lose information?

Presenter: Yes, so the process of...

Dr. Tizhoosh: You should not, that's the challenge. You should do unsupervised patching: you get the relevant information and you should not lose information. But you should not do it with the 2,000 patches that the WSI has, you should do it with 50. And that's the challenge. That's why people don't do it, and they just do multiple instance learning and put everything in bags, which makes it much easier.

Presenter: Yeah, I think the person was just referring to potentially losing any context by taking specific patches.

Dr. Tizhoosh: If you don't want to do it in a supervised way, we don't have any other choice. And it's very difficult to do it unsupervised, that's the challenge. We have spent some time developing new methods, such that we can go in and make sure, from a pattern perspective, because if it is unsupervised you don't know what is what, and then you may also grab fat, you may grab normal epithelial tissue. That's the challenge: do it unsupervised, and don't miss anything that is diagnostically relevant.

And, I agree, it's not easy. And we have not yet done a large-scale validation to see if it is reliable, if it is doable or not. So, we looked at multiple search engines, some of them new, some of them not. And one of the things we see, for example: SMILY by Google didn't have a Divide. And I asked the colleague who was presenting that work at Pathology Visions, and he said, just subscribe to the cloud, and he was saying that with a smile. Well, I understand the business model, but that's not doable for a hospital.

And another point, when we look at validating the search engines: some of them use post-processing, or reranking, which is problematic. If your search engine is failing, you cannot just patch it by adding ranking after the fact. You may post-process for visualization, you may even post-process for a little bit more accuracy, but if your search has fundamentally failed and you try to compensate with additional ranking, you are not doing search, you are doing classification.

So, we tested that with both internal and public datasets. The paper is under review, hopefully it will come out soon; a copy is on arXiv. We used roughly 2,500 patients, 200 patches, some 38 subtypes, just to test, but we looked at multiple things. We looked at top-1 accuracy, majority-of-top-3 accuracy, majority-of-top-5 accuracy, and we looked at the F1-score, not precision. Some papers just use precision, because they were focused on classification.
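A small sketch of the macro-averaged F1 computation over majority-vote predictions (the labels below are invented for illustration; scikit-learn is assumed to be available):

from sklearn.metrics import f1_score

# y_true: the true class of each query patient;
# y_pred: the class given by the majority of its top-k retrieved cases.
y_true = ["squamous", "adeno", "adeno", "squamous", "small-cell"]
y_pred = ["squamous", "adeno", "squamous", "squamous", "adeno"]
print(f1_score(y_true, y_pred, average="macro"))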

So, we looked at this: the time for indexing, the time for search, how many times each approach failed, and the storage requirement. And then we ranked them based on that. We did not calculate any additional number, we just looked at how well they did on each of these, and these are the three major ones that we looked at.

The Bag of Visual Words is still one of my favorites, but it has been ignored in the recent literature. It is not good for accuracy, but in speed and storage it can actually beat everybody else. That's interesting.

So, Yottixel, which I have been involved in, seems to be a really good one, but even that is not a good choice for moving forward. Although Yottixel and KimiaNet were the best overall performers, and here the KimiaNet, or any other network, DenseNet, EfficientNet, could be replaced with a good foundation model if one were available. And we tested that with other things. We compared it with PLIP, among others, which completely failed compared to KimiaNet, a CNN, not a foundation model, a regular small network. All of the search engines that we have tested, all of them, have a low level of accuracy, so they cannot be used for clinical utility. That was a major finding.

And most of them do not look at speed and storage at the same time. They just focus on speed, because speed seems to be more sexy from an academic perspective. And storage, who cares? Just buy some storage. Well, if that's the way, then who cares about the speed? Just buy some GPUs. We cannot just get rid of the requirement by saying that. We also do not have automated whole-slide image selection. Just think about it. We went back and said, who gave us these 600 cases? Our pathologist. And then I went back and asked him, how did you select these?

Well, we did some search with text. It's basically quasi-random. So, what happens if you have 1 million whole-slide images? Do you want to index all of them, is it necessary? If you have Squamous Cell Carcinoma, maybe. If you have Renal Cell Carcinoma, probably not, because it's so distinct, you don't have much redundancy. Why should I index all of them? So, we do not have any model, any algorithm, for automated whole-slide image selection. And everybody sets the magnification and patch size randomly, empirically. I do 1,000 by 1,000, we do 200 by 200, that one does 500 by 500, 20x, 40x, nobody knows what it should be; probably it depends on primary site and primary diagnosis. But we have not dealt with it because we were busy with other stuff.

And most importantly, no multimodal search. We do not have any multimodal search in histopathology. What happens if I have images, and I have text, and I have RNA sequencing, and I have a radiology image at the same time, and I want to search? That's the type of Information Retrieval that we need. So, if I look at retrieval and Foundation models, what is the difference, and what are the common points? Most of the time, retrieval works with small models and small data sets, and the computational footprint is small. Whereas with Foundation models, everything is gigantic, so that's a major thing that we have to keep in mind.

Information Retrieval convinces through evidence, whereas a Foundation model convinces through knowledgeable conversation. Definitely, these should be combined, there is no question about that; RAG was a major step in that direction. Information Retrieval can deal with rare cases, even if you have two or three cases. The WHO blue books have prototypical cases, one case, one whole-slide image. Foundation models are usually perceived to be more for common diseases where you have many cases. You can tweak them such that they can also process complex cases, rare cases, but they are not meant for that.

But most importantly, search is explicit Information Retrieval, whereas Foundation models are implicit Information Retrieval. We have to realize that we are looking at the same thing from two different perspectives, and if you realize that, then we can go back and design things in a more intelligent way.

The source attribution in Information Retrieval is visible, is accessible, is explainable. Whereas in Foundation models, it's not visible, it's not accessible, it's not easily explainable. Again, RAG is a good step in the right direction.

Maintenance. Of course, Information Retrieval is much better posed here: low dependency on hardware updates, you can add and delete cases relatively easily, depending a bit on the indexing algorithm, and new models can replace old models relatively easily. But with Foundation models you cannot do that: they depend heavily on hardware updates, customizing them through prompting takes high effort, and expensive re-training cycles may be necessary.

Foundation models may be for really big institutions, big companies, big corporations. Information Retrieval could be for the small guys, for the small clinics. But again, there is no question that combining them, going back and forth, will be the way to really exploit the possibilities.

So, again, on the left you see typical search and retrieval. Basically, you have a lookup table. Lung tissue comes in, there is some sort of function, and it gives you the address of the correct diagnosis; it says lung adenocarcinoma. That's how search works. You may have an explicit hash function or not, but you have a function. On the right you have a network, and it does the same thing. Lung tissue comes in, through a complex trajectory, and the network is the function. It gives you, again, lung adenocarcinoma. It does the same thing: implicit Information Retrieval. We have to realize that and make sure that we use these two things interchangeably, or, best, in tandem.
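
A toy sketch of the contrast being drawn here, with entirely hypothetical data: explicit retrieval answers by looking up stored cases, so the evidence stays attributable, while implicit retrieval answers through a trained function whose "memory" is hidden in its weights.

import numpy as np

# Explicit retrieval: a lookup over indexed cases
index_embeddings = np.random.rand(1000, 128)       # embeddings of indexed patients
index_diagnoses = ["lung adenocarcinoma"] * 1000   # stored, updatable outputs

def explicit_retrieve(query_emb):
    d = np.linalg.norm(index_embeddings - query_emb, axis=1)
    best = int(np.argmin(d))
    return index_diagnoses[best], best             # answer plus its visible source case

# Implicit retrieval: a network as the function (a linear layer stands in here)
W = np.random.rand(128, 2)
classes = ["lung adenocarcinoma", "lung squamous cell carcinoma"]

def implicit_retrieve(query_emb):
    return classes[int(np.argmax(query_emb @ W))]  # answer, no visible source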

So, we have been working, I don't know how much time I have, do I have time? I, I don't see my clock here. So, so...

Presenter: We still got about 20 minutes left

Dr. Tizhoosh: So, okay, we are working on what we call the Mayo Atlas. We built an atlas, which for us is a structured, indexed collection of patient data, well curated, that represents the spectrum of disease diversity. And the index, which is the patient data representation, has to be semantically, biologically, anatomically, clinically and genetically correct, reflecting correct pattern similarities. 'Atlas' is an overloaded term, it has been in use for centuries, so you have to clarify what you mean: it's a repository, an indexed repository.

So, you have many patients, and you add the first modality. For us, the primary modality is the whole-slide image. You index it, you add the index. Second modality, pathology reports, you add the index. Third modality, fourth modality, you add gene expressions, X-ray images, and so on. And then you add the output, which is an amazing aspect of Information Retrieval: you can, at any time, replace or update the output. Deep networks cannot do that, because their outputs are baked in through training.

And another point is that any modality may be missing. Depending on your indexing, you can still infer knowledge, even if the X-ray is not there, or the gene expression is not there. You should be able to go in with incomplete information and infer new knowledge for the new patient.
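
As a rough sketch of this incremental, modality-by-modality indexing, with hypothetical field names: each patient record collects whatever modality embeddings exist, and the attached output can be replaced or updated at any time without retraining anything.

from dataclasses import dataclass, field

@dataclass
class AtlasRecord:
    patient_id: str
    modality_embeddings: dict = field(default_factory=dict)  # e.g. "wsi", "report", "rna"
    output: dict = field(default_factory=dict)                # diagnosis, report text, ...

atlas = {}

def add_modality(patient_id, modality, embedding):
    rec = atlas.setdefault(patient_id, AtlasRecord(patient_id))
    rec.modality_embeddings[modality] = embedding   # any modality may simply be absent

def update_output(patient_id, **fields):
    atlas[patient_id].output.update(fields)         # replace or update the output anytime

# add_modality("P001", "wsi", wsi_embedding); update_output("P001", diagnosis="LUAD")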

Atlas Requirements

So, what characteristics should an atlas, which is an index repository for intelligent Information Retrieval, have?

Inclusion - It should be inclusive. An atlas should contain all manifestations of that disease. So, if you're talking about lung cancer, you should have all representative cases, which is not easy. And that's why we said, okay, look, we need an automated way of selecting whole-slide images; manual, visual inspection cannot do it.

Veracity - The atlas must be free from variability. You cannot just complain that AI makes mistakes; physicians make mistakes too. And then we put a case aside and nobody knows that the case we did two years ago was a mistake, a variability mistake. So, you have to double-check the things that you put in the atlas.

Semantic Equivalence - The indexing must conform to the anatomic and biologic nature of the disease, which is about the quality of the embedding. Here deep models and Foundation models can be very helpful. So, if you come up with multimodal indexing, which is, okay, you go through some sort of network, you get some embeddings, and you do what we call associative learning, and then you put them together into one index, then you can start building an atlas of disease.

Whole-slide images come in, molecular data comes in, clinical data comes in, they go through some network. You do your association learning, which asks: which part of those embeddings is the common point? When I see this tissue, then I see this gene expression, then I see this point in the X-ray image. And then you can do search and matching, you can provide the top matches, provide a computational second opinion, and then the pathologist can make the decision and write the final report. It stays assistive in nature. And definitely, this should be combined with the power of large language models.

So, for the sake of public display and public demos, we chose a TCGA case, because we don't have the clearance yet to show internal data: tissue, whole-slide image, and RNA sequencing from TCGA. Look at just the tissue image; that's the accuracy that you get with simple matching. You bring in RNA sequencing and it gets a lot better, of course, because this is a relatively easy case for RNA sequencing. And then you combine tissue and RNA sequencing, and naturally you expect that the accuracy increases. Of course it does.

That may be different for different primary sites, of course. And when I showed the demo to colleagues, I got criticized: you have chosen an easy one. I know, because we don't want to be too tough at the beginning of developing a new system.

So, when the query patient comes, you find the first match, the second match, and the third match, and you retrieve the information. You aggregate the metadata through a large language model. That's why I called it Generation-Augmented Retrieval. You find three reports and you combine them with an LLM into one report to describe the case. So, that's autocaptioning in a very different way, retrieval-based autocaptioning, basically. Then you can do region matching between the query and the matches, so we can visualize your evidence, why you are saying this patient is this and that.
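
A hedged sketch of this Generation-Augmented Retrieval step; search_index.nearest and summarize_with_llm are placeholders standing in for whatever index API and LLM are actually used, not real library calls.

def summarize_with_llm(prompt):
    # Stand-in for a real LLM call; here it just returns the prompt unchanged.
    return prompt

def generation_augmented_retrieval(query_embedding, search_index, reports, top_k=3):
    matches = search_index.nearest(query_embedding, k=top_k)  # hypothetical index API
    evidence = [reports[m] for m in matches]                  # reports of the top matches
    prompt = ("Combine the following pathology reports of similar cases into one "
              "concise description of the query case:\n\n" + "\n---\n".join(evidence))
    caption = summarize_with_llm(prompt)   # retrieval-based autocaptioning
    return caption, matches                # the caption plus the visible evidence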

You can visualize the atlas by using t-SNE, UMAP, or any other technique, to show your patients and where your patient is positioned. You combine all of that into a computational second opinion report. The vision for us is to, basically, make that accessible. I don't know when that will happen. We have to build that Foundation model, and we connect the Atlas, the retrieval, to that Foundation model. Is it two years from now? I don't know. If we get the money, I will push to get a prototype within a year, but I'm guessing it will be more than a year. Nobody has experience crunching numbers with six million whole-slide images, so it will be tough.
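
A minimal sketch of that visualization step, assuming atlas embeddings, their labels, and a query embedding already exist (all names hypothetical): project everything with t-SNE and mark where the query patient falls relative to the atlas population.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_atlas(atlas_embeddings, atlas_labels, query_embedding):
    X = np.vstack([atlas_embeddings, query_embedding[None, :]])
    coords = TSNE(n_components=2, random_state=0).fit_transform(X)
    labels = np.array(atlas_labels)
    for label in sorted(set(atlas_labels)):
        pts = coords[:-1][labels == label]
        plt.scatter(pts[:, 0], pts[:, 1], s=10, label=label)   # atlas population
    plt.scatter(coords[-1, 0], coords[-1, 1], marker="*", s=200,
                c="black", label="query patient")              # the new patient
    plt.legend()
    plt.show()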

So, these are the people who have contributed to that. I don't know how much time we have. Do we have time for a short demo, or are we already over?

Presenter: Yeah, sure, we've got about 15 minutes left, so we still have time.

Dr. Tizhoosh: So, here is a very short demo. Do you see that, Welcome to Mayo Atlas?

Presenter: Yeah, we can see that.

Dr. Tizhoosh: Okay, so if I sign in, I choose my simple, easy lung cancer atlas from TCGA. Let me see, I can find the demo folder, use my easy demo image, and then I have some description related to that. It doesn't need to be the report, because for a new patient you don't have the report, and I choose the gene expression. So, there may be some questions: the whole-slide image, some question or description, the initial visual inspection, and then the gene expression.

It will be difficult to upload RNA sequencing directly, so you have to process it and get something out of it; it will be tough. And then you start uploading. The upload here is not completely realistic, it's buffering a lot of stuff for the sake of the demo, so it's not completely real time. And then the results are there; this is a very simple prototype. So, that was the query data, a pathology slide. We have three matches, and you can select the third one and say, oh, let me look at that. You can go in and compare whatever you want, and really convince yourself that it is correct. And it could be top three, top five, top seven, whatever.

After you have convinced yourself, which you have to do, you see that here we have reports, also notes, attached. That's the evidence. So, if I view the full notes: that's the first match, second match, third match... These are the results. And then, if I show the summary. This is the part that I call Generation-Augmented Retrieval. You get the reports of the top three matches, and you combine them into one report to, basically, autocaption the image that you found through the retrieval. There is a lot of information you can use, so we say, okay, generate a report for me.

For example, I can look at the t-SNE and, if I look at the whole-slide image, the WSI, you see it's really clear-cut. Again, this is an easy case. These are the patients for Lung Squamous Cell Carcinoma and Adenocarcinoma, and here is my patient. So, it gives a really nice overview: this is the population in the atlas, this is my patient. And I can select that and save it. Go back, there are a lot of things that we can change, and I can say generate a report. So, this is my patient, number 123, and this is the summary; the majority vote says this is Lung Squamous Cell Carcinoma.

We have a gene expression heat map that looks at the top 20 high-variance genes. And we also provide a sample query: this is the query, and this is the first match you can look at. Everything can be put in the report. And I say, okay, save and exit. And, if it doesn't crash, it generates a report for you: we looked at the top three, looked at the top five, this is Lung Squamous Cell Carcinoma, this is what the generation over the retrieved cases gives us, these are the visualizations with respect to individual modalities, and these are some sample cases. That's, in a nutshell, what the atlas can do, combined with Foundation models, which at the moment is a very weak connection for us. We have to do a lot more work to make that bridge stronger.

So, thank you so much, I hope, I hope this was helpful.

Presenter: Well, thank you very much, Professor. That was a really interesting talk. So, with that, should we open the floor to questions? I noticed a couple of questions on the, on the chat. So, I think I'll start with those then.

Question 1: So it says, For multimodal retrieval, what do you do in the case of missing modalities?

Dr. Tizhoosh: Oh, yeah. So, you get individual embeddings and you don't aggregate them. Aggregating is good for saving storage and for speed, but if you don't aggregate them, you just encode them, or compress them, and concatenate them. That gives you the freedom, if one of them is missing, to know which one is missing and just compare the others. The downside is that your patient index will be much longer than it should be. So, you have an embedding for the image, an embedding for the RNA sequencing, an embedding for the reports, and so on. At the moment, we see no other way: you have to keep the embeddings of individual modalities separately. You may encode them, compress them, and that gives you the freedom to, basically, search them separately.
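
A small sketch of what keeping the modality embeddings separate buys you, with hypothetical dictionary keys: a query is compared only on the modalities that both it and the indexed patient actually have.

import numpy as np

def multimodal_distance(query, candidate):
    # query/candidate: dicts such as {"wsi": vec, "rna": vec, "report": vec},
    # where any modality may be missing
    shared = [m for m in query if m in candidate]
    if not shared:
        return float("inf")                                   # nothing to compare
    return float(np.mean([np.linalg.norm(query[m] - candidate[m]) for m in shared]))

def search(query, atlas_index, top_k=3):
    # atlas_index: {patient_id: {modality: embedding}}
    ranked = sorted(atlas_index, key=lambda pid: multimodal_distance(query, atlas_index[pid]))
    return ranked[:top_k]                                     # best-matching patient ids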

Question 2: Thank you, and a second question here was, How do you suggest dealing with rare cases in a retrieval-based paradigm?

Dr. Tizhoosh: So, rare cases are actually relatively easy, and hopefully we can publish some results soon. You may have as little as one rare case among the others: you have many common cases and several rare cases. Again, there is a big condition on your embedding. If your indexing has done its job and your embedding is high quality and expressive, then the rare embedding could be, should be, part of the top-n retrieval.

In our experience, if we have the suspicion that it is a rare, complex case, you cannot rely on the majority vote among the top-n retrievals. You cannot, it's too risky. You may come up with what we call the histogram of possibilities. This is one of those things that you have to provide when you deliver the top-n search results: among the top n, although the majority says this, let's say the majority says it's [unintelligible] carcinoma.

But you have one case, let's say, which is not the top match; among the top five, it could be the fifth one, let's say, which is adenosis or papillary carcinoma, some of the rare cases for breast. You have to display that, and you cannot rely on the majority. It's one of the challenges that retrieval has to deal with, but you get the information. That's the minimum, provided that your embeddings are good, a big condition, of course. And that's where everybody is waiting for good foundation models to give us good embeddings.
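
A tiny sketch of this "histogram of possibilities" idea, with made-up labels: instead of reporting only the majority vote over the top-n matches, report the full label distribution so a single rare but relevant match remains visible.

from collections import Counter

def histogram_of_possibilities(top_n_labels):
    counts = Counter(top_n_labels)
    n = len(top_n_labels)
    return {label: count / n for label, count in counts.items()}

# histogram_of_possibilities(["squamous", "squamous", "squamous",
#                             "adenocarcinoma", "papillary carcinoma"])
# -> {"squamous": 0.6, "adenocarcinoma": 0.2, "papillary carcinoma": 0.2}
# The rare 0.2 entries are displayed rather than hidden behind the majority vote.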

Presenter: Thank you.

[unintelligible]

Question 3: You showed an example where you were building the autocaption from retrieval, through generation. How would you rely on that kind of generation from the top-k cases if it was a rare case? That was really the essence of my question.

Dr. Tizhoosh: It seems that at Mayo, to my knowledge, many of our physicians are using LLMs to summarize diagnostic reports, or reports about patients. When patients come, they come with 30, 40 pages of information, and it is a regular, challenging part of the clinical workflow that somebody has to read all of it and say: what has been done for the patient? What type of treatment? What was the history?

And one of the things that is very low risk in using LLMs is to use them for summarization, because they are not generating anything new, they are just summarizing, at least we think so. I have not seen large-scale validation, but the general perception is that summarization is more reliable. So, with the condition that retrieval has done its job, you give it three reports that are describing the same thing. And what we did, the only thing that we did, we looked at the top five. Say the majority was three and it was squamous; we do not provide the other two reports, because the majority vote says it's squamous. So, we provide three reports for Squamous Cell Carcinoma and say, combine and summarize, and in our limited experience, with no large-scale validation, it looks good.

Question 4: Again, if it's a rare case then all top three might not be relevant, that, that's the whole point, right? By definition, all of the top three cases might be totally irrelevant?

Dr. Tizhoosh: Could be, the, all top three could be what, sorry?

Question 4 continued: They might be irrelevant?

Dr. Tizhoosh: Could be. Yes, yeah, then the search has failed, then the search has failed, yes.

[unintelligible]

Dr. Tizhoosh: If you have a really small data set, a really small atlas, that can happen. For large atlases, what we do is combine the top three and top five; by the way, we cannot show the top 50, because who should process that? If we show the top three, we can get some additional supporting, statistical evidence by looking at the top 50 or top 100, just as a confidence measure. We can do that for medium-size and large repositories.

So you can say, the top three say it is this, and when I look at the top 100, the top 100 say this as well. Again, with the exception of rare, complex cases, this may work. But this, and many other aspects of information retrieval, as a community we have not paid enough attention to, so we still have to do a lot of work.

Audience Member: Great, thank you very much for a fascinating talk.

Presenter: A question from an audience member.

Question 5: Yes, thank you very much for the really interesting talk, I really enjoyed it. Sorry, I just want to turn on my camera so you can see who's here watching. Thank you very much again. So, I've been following your work, and the question that I have: I noticed that one of the first parts of the retrieval algorithm, which is the sampling of patches, is really important and largely affects the efficiency of the algorithm, as you mentioned. If I remember correctly, the Yottixel algorithm that you're currently using is based on the similarity of patches.

Basically, it selects the patches that have a higher number of cells in them, which is good, especially in most cancer types. But I was wondering, if [unintelligible] is also a good case, is it also a good algorithm or a good method when what you are looking for doesn't have many cells in it? Say you want to retrieve a [unintelligible] sample, right? Which doesn't have many cells in it. And basically, Yottixel doesn't sample much of [unintelligible], just as it doesn't sample fat, right? So, I was wondering if that is problematic, and if it is, would there be any workaround for this problem?

Dr. Tizhoosh: Well, Yottixel doesn't do that, KimiaNet does. KimiaNet, to be trained on both diagnostic slides and frozen sections of TCGA, made the assumption that we just grab and process the patches with high cellularity, with the latent assumption that this is all about carcinoma and maybe a little bit of inflammation, but nothing else. And this is a perfectly legitimate assumption for TCGA, it's all carcinomas. Yottixel does not make that assumption.

Although, as I said, Yottixel, like any other. Although Yottixel, compared to the others, is really good, it is also not good enough, because, among other things, it used DenseNet, or we used KimiaNet, for it. The embeddings are not good enough, and it also has some other parameter settings that others have taken over, and that makes it less practical. So, if you make any assumption of that sort in search, you will be limited, you cannot do that. You cannot make a high-cellularity assumption. And that's why, probably, if you apply KimiaNet on lymphoma, let's say, it may fail to really give you good embeddings.

And the same holds for any other method that has made that assumption for search and retrieval: the embeddings that you get have to come from somewhere that does not make any anatomic or histologic assumption about the image. So, what we know is this: it seems we will not solve the accuracy problem of image retrieval that way. And again, it was very eye-opening to me, to myself as well: search methods that claimed to have been designed for rare cases provided 17% accuracy for breast, for rare cases, 17%.

Which means, okay, this is not a solution at all. So, it seems the only way we have, or maybe there are only two ways: either you have a super, fantastic foundation model, which I don't see yet. Well, the colleagues from Mount Sinai, Dr. Fuchs' group, they did it with 1 million slides, but it's not publicly available, so I would love to test that. Or you take a regular, small one, a CNN, a small transformer, and fine-tune it for every atlas, even if it has only 200 patients. Which is challenging, even if you use self-supervision. But computer science tells us: the no-free-lunch theorem.

You cannot just be good at everything unless you customize. So, maybe we have to fine-tune for every atlas, for every organ, for every primary site, which is a challenge if the data set is small. And everybody assumes we have a gigantic data set, but we don't, nobody does. To my knowledge, no hospital has that readily accessible. So, let's work on and focus on small data sets. Most likely you will run into the problem that journals will not publish your papers, because they want to see millions and millions. But okay, if I want to do millions, I have to go and use online data, and hospitals don't have that yet.

Question 6: Another question. I guess you're right, though I'm also wondering about [unintelligible] sampling, because it seems that sampling is really playing an important role here. But in the same vein, I was wondering whether you might have come across an idea like this. As you were presenting your work, I was wondering if we can use image retrieval as an approach to label a data set, or I don't know what to call it, a self-supervised labeled data set?

Say you just provide one patch of DCIS, or any specific type of tissue, to this image retrieval and ask it to retrieve similar patterns from a million whole-slide images. That would give you a really good data set to train another model in a supervised manner. So, I was wondering if anyone has looked into this kind of application of image retrieval, or would you think it's a good idea to do that?

Dr. Tizhoosh: No, not to my knowledge, not in this way. The problem is, in image retrieval, like any other field, you have to stick around. You cannot come in, publish one paper, and have the illusion that you have solved the problem and go away. You have to stick around, invest, fail, develop solutions, test them and realize they don't work. There are many other problems.

I'll give you another problem that has not been solved. Every patient comes with multiple whole-slide images. We are, at the moment, latently assuming that they have only one whole-slide image. No, they don't. A patient comes with seven, eight, ten, twelve, fifteen whole-slide images, and then patient representation and patching become even more complicated. Now you have seven or eight whole-slide images from one patient, and you have to select patches in a way that you do not miss anything and you do not overload your selection with normal tissue, which can misguide the search and retrieval. Very difficult to do.

We have just started looking at that. Information retrieval in histopathology, in digital pathology, is very complicated. I don't think we have even really started; we have been scratching the surface.

Presenter: Great. We've come to 3:00, but are we okay to just ask more questions for five minutes?

Question 7: Can I ask a quick question now? [unintelligible] Thank you for the nice talk. I really enjoyed your paper 'What is the foundation model' as well, and some of the other critiques you've been publishing, so I appreciate that role of yours in the community.

Dr. Tizhoosh: I'm not making many friends with that.

Question 7 continued: Well, in science it's not my opinion or yours that matters. We just need to validate whatever we do with results, and that's what you are after. So, I really appreciate that, thank you for that. Another thing I wanted to ask you about, or get your thoughts on: one requirement for foundation models is transferability. In one of the slides that you showed, you mentioned that zero-shot learning, for example, is a way of measuring the transferability of a foundation model. I was just wondering if you had measured the transferability of this search-based Foundation model approach that you are proposing for classification?

Dr. Tizhoosh: Very good point; no, we haven't. And I probably won't touch that until we really have access to the... We have been preparing, since last August, to access our data on the cloud, which is at the moment probably just short of 7 million whole-slide images. It's not 7 million patients, but 7 million whole-slide images. By the end of the year we'll be approaching 9 million. And I want to do tests with that. I want to see, if we go in, then we have rare cases, we have common cases. And the question that Dr. [unintelligible] asked: what happens if the top search results are irrelevant? That means something is fundamentally going wrong. And you have to establish the base: the patching is okay, the embeddings are okay, the speed is okay, the storage is okay.

Then we go to the bigger questions. Okay, can I transfer knowledge? Can I do so? I'm not saying we will wait to solve all those problems perfectly, no, no, no. But we need a reasonable base in order to go after those sophisticated new bridges, so that we can say, okay, can I go from here to there? Can I do zero-shot classification or not? We have not, and the reason I don't do it, even with a small or medium-sized data set, is probably just psychological, because we are excited to get prepared to operate on the cloud data.

Question 8: Okay, makes sense. The other comment I had: one distinguishing feature of Foundation models that I see emerging in the field, not only in computational pathology but in general, is that, let's say we've got ChatGPT; by itself, we never taught it to reason, for example. And it may not be able to reason perfectly, but it does show more promise than just being a next-word predictor. So, it has learned this additional human-like capability of showing, somewhat, indications of being able to reason.

And I was wondering, when we talk about Foundation models in a specific domain, for example computational pathology, whether it is relevant to make domain-specific Foundation models, or would it be more useful to build on top of these more general-purpose Foundation models, like ChatGPT, and then adapt them for a certain domain? Or, for example, if you just do retrieval-augmented generation using ChatGPT embeddings in some way, what are your thoughts on that? Should we invent a new Foundation model for computational pathology dedicated to that domain, or should we just wait for these multimodal models that are being proposed by industry, more so than academia at least?

Dr. Tizhoosh: I was reading a very interesting theoretical work from a young fellow scientist who recently joined MIT. He put forward the theory that, beyond a certain scale of data, large models start developing smaller, linear models inside, and that, among other things, is the theory held responsible for those new capabilities. At the moment, I'm thinking that, since there are a lot of things that we don't know about deep models, large models, it's very risky, and regulatory bodies, like the FDA, will be suspicious of that.

If you take, let's say, ChatGPT and just fine-tune it for histopathology, I won't do that. That's why I have not been trying to experiment with anything else. I want to wait, get my hands on 6 million slides, and then I have the reports, I have X-ray, I have RNA, I have everything, I have multimodal data, so six million should be enough. And histopathology is a small enough domain compared to ChatGPT, which, covering literature and politics and science and everything, spans many domains, so it is general purpose. When you come up with a foundation model for histopathology, it's not a general domain, so I don't understand the term 'general-purpose histopathology'. What do you mean, general-purpose histopathology? It's histopathology; are you giving me a specific network?

I don't know what is happening inside ChatGPT and others, such that I would take it and fine-tune it with my gold-standard clinical data. I would rather train something from scratch, and it doesn't need to be that big, because our field is really special. Again, we have to do that; computer science, no free lunch: you have to specialize, you have to customize, otherwise you won't be accurate enough. But since we don't know what is happening inside, except for specific tasks, again like the summarization of text, okay, or just getting some embeddings for images, okay. But as a conversational partner, and people have started to use the term conversational information retrieval as well, that comes into the clinical workflow, at the moment, I won't do that.

I don't trust enough what is happening inside to use it for that. And the conversation that we have with that model will be attached to the diagnostic report, to the treatment planning; I'm looking at that. That may be too conservative a view, but I want to build something that is reliable, test it for two, three years, and then we can really start using it. And it's worth it to wait, get your hands on really high-quality clinical data, and do it from scratch. That doesn't prevent us from using a fine-tuned one for less sensitive tasks.

Question 8 continued: Okay, well thank you so much, interesting perspective, nice, nice meeting you.

Dr. Tizhoosh: Thank you, nice meeting you.

Presenter: Brilliant. Well, have we got any more questions from anyone? Sure, do we have time for one final question? Professor Hamid, is that okay?

Dr. Tizhoosh: Yeah, I'm available. I'm assuming we are running out of time, but everything's okay, we got some questions. Thank you.

Presenter: So, okay, so what was your question?

Question 9: Very interesting talk. Just one question. You mentioned expressive embeddings in one of your slides. I just want to know, how would you define an expressive embedding? And the second thing is about generating the report based on your retrieval. One thing that I noticed is that you also mentioned inter- and intra-observer variability. Normally the reports are generated by the pathologist, so how will you handle that when you are generating the final report from an LLM?

Dr. Tizhoosh: When creating an atlas, the second condition was veracity, that the atlas is free from variability. The only thing that remains is, basically, the diversity in the language, that pathologists may say the same things with different words, and large language models should be able to deal with that. But we are assuming, and we have to make sure, that the veracity is there, which means every case that we have in the atlas has to be double-verified. Even though it's historic data, done two years ago, five years ago, we have to check it again when we put it in the atlas. And that's another reason an atlas cannot be millions of cases, because we cannot do that for millions of cases.

I forgot what was the first question?

Question 9 continued: Yeah, the first was about expressive embeddings. How would you define an expressive embedding?

Dr. Tizhoosh: Our simple approach is this: if the embeddings are good, if the deep features are good, we do not push the other parts of information retrieval. We just go for a very simple comparison, we just use Euclidean distance. We don't use any sophisticated hashing, compression, variance analysis, nothing; principal component analysis, nothing. Just take the embeddings, compare them, and if they are good enough, that should give us reliable accuracy. And most of the time, almost all search engines failed. All of them failed, including Yottixel. All of them failed. They can't; we don't have good embeddings yet.

And that's why you say, okay, grab a small model and fine-tune it for your small atlas. That's the approach, until we get some good foundation models.
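
A minimal sketch of the "simple comparison" test described above: take the raw embeddings, use plain Euclidean distance with no hashing, compression, or principal component analysis, and check whether top-1 retrieval alone is already accurate. If this fails, the embeddings are not expressive enough; the names and arrays here are hypothetical.

import numpy as np

def nearest_case(query_emb, index_embs):
    d = np.linalg.norm(index_embs - query_emb, axis=1)   # plain Euclidean distance
    return int(np.argmin(d))                             # index of the closest case

def top1_retrieval_accuracy(query_embs, query_labels, index_embs, index_labels):
    hits = sum(index_labels[nearest_case(q, index_embs)] == y
               for q, y in zip(query_embs, query_labels))
    return hits / len(query_labels)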

Question 9 continued: Okay, thank you.

Dr. Tizhoosh: Thank you.

Presenter: Fantastic. Well, if we don't have any more questions, then I think we'll draw the seminar to a close. But thank you very much, that's been a really interesting talk, and thank you for your time and for spending extra; I know we've gone over by about 10-15 minutes.

Dr. Tizhoosh: Thank you very much. Thank you, I appreciate the opportunity.

Presenter: And I just want to thank everyone online for joining the meeting, and everyone in person. Just to remind everyone again, our next seminar is next Monday, and we're joined in person this time by a speaker from Radboud UMC in the Netherlands, so hopefully see everyone then. But again, thank you very much, Professor Hamid, and thanks to everyone online.