Keynote - The AI Scaling Revolution and the Future of Intelligence

Irina Rish explores the revolutionary impact of scaling in artificial intelligence and its implications for achieving artificial general intelligence. She discusses how foundation models—large-scale systems trained on massive, diverse datasets—have dramatically improved AI performance across domains, often surpassing specialized approaches without architectural innovation. Rish examines the mathematical "scaling laws" that predict model performance based on compute, data, and model size, while also addressing emergent behaviors, phase transitions, and the balance between caution and progress in developing increasingly capable AI systems.

Irina Rish

Irina Rish is a prominent researcher in the field of Artificial Intelligence, with a particular focus on achieving Artificial General Intelligence (AGI). She leads the Canada Excellence Research Chair in Autonomous AI, overseeing a large team of students, postdocs, interns, and collaborators. Her work explores the challenges and possibilities of creating AI systems that can generalize to a wide range of tasks and problems, mirroring human-level adaptability and learning capabilities. Rish’s research delves into the complexities of out-of-distribution generalization, aiming to develop AI agents capable of learning and performing tasks significantly different from their training data. She draws upon principles from statistics, machine learning, and classical AI to create systems that are not only capable of mastering specific skills but also demonstrate the capacity for continuous learning and adaptation, akin to human cognitive flexibility. Her work resonates with transhumanist themes by exploring the potential for AI to augment human intelligence and solve complex global challenges. Her presentation at the MTAConf 2024 focused on the technical aspects of AGI, emphasizing the importance of creating autonomous, multi-tasking systems capable of performing economically valuable work. Her engagement with the MTA highlights the intersection of AI research with philosophical and theological considerations regarding the future of humanity and the potential for technology to shape human evolution.

Transcript

Irina Rish

It’s been a pleasure to be here. Thank you so much for inviting me. It was an extremely exciting and inspiring event. And I have to admit, it was more exciting and inspiring than most of the sessions at the NeurIPS conference, honestly. Just don’t tell them.

Irina Rish

So thank you for the great introduction. I’ll try to fit all fifty-three slides into thirty minutes; we’ll see how it goes. But indeed, the range of things we’re working on in this lab, the Canada Excellence Research Chair in Autonomous AI, is quite wide. We have more than forty students, postdocs, interns, and collaborators, and it’s been a really hectic four years. But I think AI is going through a truly revolutionary phase, and we’d better ride this wave.

Irina Rish

Maybe not everyone in the AI field agrees, but the majority would probably say that whenever AGI, artificial general intelligence, is mentioned, that’s what people in the field have for decades considered the holy grail, the ultimate goal of what they would like to build. The problem, of course, is that people never agree on definitions; that’s how people work. So AGI means different things to different people: some necessarily talk about consciousness, others talk about AGI developing certain personality traits, which can actually be measured, depending on definitions, and so on and so forth.

Irina Rish

But for now, just to be more technical, what I mean by artificial general intelligence, to start with some baseline definition, is essentially an artificial intelligence system that has learned to generalize at sufficiently large scale to problems and tasks way out of distribution, quote unquote. So basically it’s a highly generalist agent, similar to a human in that way. It may not necessarily beat the human champion at Go or chess, but, like a human, it is able to learn chess, learn Go, learn to play the violin, and keep going, and just go buy groceries, do whatever. So it’s a generalist in the sense of a multitask system.

Irina Rish

Of course, I’m also following the OpenAI definition, which is not technical but I think gets the gist of it: a system that should be autonomous and multitask, and the tasks had better be useful, like economically valuable work.

Irina Rish

And in a sense, if you look at the history of not just artificial intelligence and machine learning but go back to statistics in the last century: the main goal of statistics and learning from data, which is the basis of modern AI, has always been generalization. It’s just that classical statistics would fit a curve and extrapolate it to the next point that has not yet been observed, assuming the data come from the same distribution. Machine learning goes further: the datasets become very complex and multivariate, and you need methods that can do this type of generalization further and further away.

Irina Rish

More recently, just before the advent of large-scale models, out-of-distribution generalization became an extremely hot area of research, and our lab spent several years exploring how you train neural networks not to latch on to spuriously correlated features but to truly learn invariant properties of the data. The classical picture I was showing, of cows on different backgrounds, was used in many, many papers on out-of-distribution generalization. The story goes that if you train a neural net to discriminate between types of animals, it does an awesome job; but once you move to an unusual background, like cows on the beach instead of cows on green grass, the system completely misclassifies the animal. It didn’t really learn the properties of the animal; it learned the properties of the background, because that was easier. So deep neural networks have always had this tendency, like lazy students studying to the test: they wouldn’t learn the invariant properties but would try to find a shortcut. And the whole field was trying to figure out what to do so they would not learn shortcuts.

Irina Rish

Many algorithms and approaches were proposed, and we worked on things built on top of invariant risk minimization; I will actually mention it later in the talk. So lots of effort, a whole exciting field. And in a few slides I’ll say what happened recently to out-of-distribution generalization with large-scale models: scale pretty much solved it.

Irina Rish

Then there was a whole field of adversarial robustness; it still remains. It’s a similar problem, except the changes in the distribution are not just natural statistical shifts; they are designed to confuse the system on purpose. You show pictures of animals, then you add a little bit of noise that the human eye will not detect, and the system misclassifies that cat as something totally different. The question was how to build systems that are robust to this type of adversarial attack.

Irina Rish

Again, robustness is, in a sense, improved by scaling the systems. And today I will mainly be talking about the scaling revolution, as my talk title suggests, and what it implies.

Irina Rish

So on and so forth. There are so many fields in machine learning, and our lab has been involved in researching most of them, including transfer learning, meta-learning, and particularly continual learning, the process of acquiring knowledge without forgetting what you learned before. It all relies on generalization.

Irina Rish

So yeah, as I said, continual learning specifically is the most realistic, most practical way of approaching the building of AI systems that will be truly multitask generalist agents. I’m speaking at a very high level and not going into the details of any papers; I’m happy to discuss them after the talk, or you can look at my website.

Irina Rish

But the idea of continual learning is ultimately an analogy with the human brain, which developed over millions of years of evolution. It developed different parts and areas that have different functions, and together you recombine these functions and get a generalist agent that can tackle pretty much any task thrown its way.

Irina Rish

So similarly, can we maybe develop neural networks as generalist agents that combine a certain number of principal components of this world? As a simple analogy, start with linear spaces and say: what if I just keep sampling data from this humongous but hopefully finite-dimensional linear space, and I keep accumulating principal components? Hopefully this world can be accurately approximated with some finite number of those. And if I get them all, that’s where AGI happens, because now I can recombine things and do, more or less well, pretty much anything that’s required in this world.

Irina Rish

It’s just an analogy, just the gist of it; of course, you’re not working in linear spaces anymore. You might be working in function spaces, so it’s a functional basis, and you need to recombine those elements. Or, even further, it’s procedures: you learn elementary procedures, and the recombination of those will hopefully give you generalization to never-before-seen types of tasks that you want your agent to do. There is a toy sketch of the linear-space version of this idea below. So all these attempts to build better agents are, again, as I said, all based on the notion of generalization and how to push it further.
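
To make the analogy concrete, here is a toy numpy sketch of the linear-space version (purely illustrative; the dimensions and the "world as a subspace" setup are assumptions for the demo, not anything from the talk):

```python
# Toy sketch of the linear-space analogy (illustration only, not an AGI recipe):
# if the "world" is a k-dimensional subspace of a huge space, enough samples
# recover its principal components, and any new point is a recombination of them.
import numpy as np

rng = np.random.default_rng(0)
ambient_dim, true_rank, n_samples = 1000, 10, 500

# Hidden basis of the "world" and data sampled from its span
basis = rng.standard_normal((ambient_dim, true_rank))
data = rng.standard_normal((n_samples, true_rank)) @ basis.T

# Recover principal components of the samples via SVD
_, _, vt = np.linalg.svd(data, full_matrices=False)
components = vt[:true_rank]  # estimated orthonormal basis of the "world"

# A never-seen-before point from the same world is reconstructed
# almost perfectly by recombining the learned components
new_point = rng.standard_normal(true_rank) @ basis.T
coeffs = components @ new_point
reconstruction = coeffs @ components
print(np.allclose(reconstruction, new_point, atol=1e-8))  # True
```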

Irina Rish

And then, as we all know, starting in 2020 when GPT-3 was released, people started observing that, one after another, all these problems in machine learning were, if not solved, then greatly advanced just by the fact that the models people started building were orders of magnitude larger than anything built before. This plot essentially shows the amount of compute used to build state-of-the-art systems, and there is a clear change in the past few years, just in terms of the trend in the amount of compute required to build such systems. So you see what’s happening there, right? We’re all living through a revolutionary epoch in which a humongous increase in the amount of compute and data thrown at AI, combined with our ability to scale the models to absorb this information, started giving us performance that we were never able to obtain before.

Irina Rish

So many people say that the revolution was essentially caused by the advent of so-called foundation models. A little bit of history on where the term came from and why they’re called foundation models. Basically, in 2020, GPT-3 comes out and starts doing things that GPT-2 or any predecessor just couldn’t possibly do. At the same time, and I will talk about this in a second, the scaling laws papers come out, explaining what’s going on and why increasing scale increases performance in certain ways. Then many systems are released after that, including OpenAI’s CLIP, DALL-E, and so on and so forth. And then a paper from Stanford, led by Percy Liang and his colleagues, comes out in August 2021,

Irina Rish

basically coining the term foundation models: models that are trained on unprecedentedly large amounts of data, far beyond what models were usually trained on before. The data are not just large but, of course, diverse, because it’s not the size of the data, it’s the amount of information that you’re actually trying to get at. And then, without any particular new breakthroughs in architecture, using generic transformer architectures but scaling them by orders of magnitude, you are able to absorb, via the model’s capacity, this huge amount of new information, in an unsupervised way.

Irina Rish

You simply train models to learn the distribution of how the world is. You’re not training them to do any specific task; they are not specialists, they are generalists. They just learn about how things are. And if you want them to do something later on, you just give them a few shots, a few examples, fine-tune them a little bit, and they get it.

Irina Rish

So that’s why they were called foundation models: in a sense, they are the basis, the foundation, for building all kinds of applications and specific models for specific tasks on top of them. And just as foundations, they are not yet buildings. You cannot just use them as is; you need to do things on top of them, whether for performance or for improving other behaviors. That’s where alignment came into the picture: you may need to add some feedback from human preferences to make them... oops.

Irina Rish

So anyway, it was extremely exciting, and again, a little bit of history besides the technical part: there was huge excitement among a relatively small number of students at Mila, where I teach, and we started organizing scaling workshops and trying to learn more about this stuff. That’s essentially how our group, starting around 2020, 2021, got into foundation models and scaling laws, which I’ll talk about in a minute.

Irina Rish

Of course, everybody has seen many, many beautiful slides today, and I don’t have to repeat them, about the kinds of systems people have built since 2020. It’s really a Cambrian explosion, a faster-than-exponential growth in the number of new large-scale models being built. They are becoming more and more powerful, and they cover different modalities. It’s not just language models anymore, of course: there is generative imagery, video like Sora, audio, music generation, you name it.

Irina Rish

And the interesting thing, as I mentioned, is that if you go back to GPT-3 and ask what made the huge difference in performance: GPT-3, just like GPT-2 and previous transformer-based models, is nothing but an autoregressive model that tries to predict the next token given a history window of previous tokens. And essentially, that stayed the same. The architecture was pretty much the same. The only thing that changed was the orders-of-magnitude larger scale of the model. That’s all. So that was quite a revelation, but in a sense it just supported the intuitions people had before.
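
As a minimal sketch of that shared objective, here is next-token prediction in PyTorch (a hypothetical toy model; a single linear layer stands in for the transformer blocks, since the point here is the objective, not the architecture):

```python
# Minimal sketch of the autoregressive objective that GPT-2 and GPT-3 share.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, context = 100, 32, 8

class TinyAutoregressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size)  # stand-in for transformer blocks

    def forward(self, tokens):                      # tokens: (batch, seq)
        return self.proj(self.embed(tokens))        # logits: (batch, seq, vocab)

model = TinyAutoregressor()
tokens = torch.randint(0, vocab_size, (4, context))

# Predict token t+1 from tokens up to t: shift inputs and targets by one
logits = model(tokens[:, :-1])
targets = tokens[:, 1:]
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # the same loss at any scale; GPT-3 just has vastly more parameters
print(loss.item())
```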

Irina Rish

And I always like to quote the bitter lesson. It’s one of my, and many other people’s, favorite short blog posts, by Rich Sutton, who essentially said that the biggest lesson that can be read from seventy years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. So, a bitter lesson. Why is it bitter?

Irina Rish

I’ll talk through that. It’s bitter because AI researchers, and perhaps not just AI researchers, love complexity. They love sophisticated algorithms, and that’s what reviewers at top AI conferences also tend to favor. But the bottom line could be: you can go very sophisticated and get your paper accepted this year, and nobody is going to use your method ten years later. So if you really want to achieve some kind of long-term effect, long-term impact, then maybe you should rethink what you focus on when you are working on your algorithms and models. And there are many examples, again, just quoting that bitter lesson.

Irina Rish

You can start with rule-based systems that people spent years developing, only to be completely washed out later by machine learning methods that would learn those rules automatically. Then in machine learning, people were coming up with all kinds of hand-crafted features for language, for speech, for images, for decades; later on, deep learning comes along and learns all these representations automatically, and so on and so forth. And then you have various ontologies and rules for how to play chess; later on, you have massive, computationally expensive search, and compute keeps getting cheaper, and that really helps, and you have self-play in Go, and so on and so forth.

Irina Rish

So why is this important? Because you need to decide where to invest your time and effort, and the time and effort of your students. You should be careful about what you focus on, because if you want to stand the test of time, then maybe you should check how your methods are going to scale. And if they’re overly complicated and not going to scale, then maybe you shouldn’t be working on them. Again, what I’m saying is usually not taken lightly by the majority of the field, as you can imagine.

Irina Rish

But the bitter lesson, the final point, is not saying that we should throw away all human knowledge. It basically says that maybe we shouldn’t try to incorporate into AI the final results of our knowledge, say, the particular ways the brain is structured, like convolutional networks. Maybe we should instead incorporate higher-level ideas, or inductive biases, about how to reach that state: basically, what are the constraints on evolution that provide us with the best networks possible, rather than trying to tinker and create those networks exactly the way the brain works, for example. So that’s a higher-level notion of inductive biases.

Irina Rish

Yeah, but again, with many people in the field, if you mention that maybe we don’t need inductive biases and scale is all you need, you’re going to have a very heated argument. But that’s what we’ve been observing recently. People were saying: there is no way, no, it won’t happen, it’s stupid. And yeah, the bitter lesson in action.

Irina Rish

Okay, so I don’t know how much time I actually have left. Well, I still have some. So I wanted to switch gears a little bit and talk about the following: that’s all wonderful, we have large-scale systems popping up practically every day, and the question is, what can we say about their behaviors at scale? Are there ways of predicting how well they’re going to scale, whether they will improve, and what do we need to do to make such predictions? It’s a whole field of research, and that’s the field our lab entered, as I said, early on, back in 2021. And yes, I was telling some stories about how controversial that field was back then; luckily, not anymore. The field didn’t change, it’s just that people’s perception of it did, as usual. So why scaling laws?

Irina Rish

Well, even if you don’t get into philosophical arguments about whether scale is all you need, whether we are building AGI, even if you leave all that aside and look at machine learning papers where people compare different algorithms and models, develop something new, and claim "mine is better than yours": how does a typical machine learning paper compare things? Usually in tables. Here is a benchmark dataset; here is a benchmark environment for reinforcement learning; let’s try our methods, see what happens; ours is better; paper published. Instead, maybe everyone should adopt a different methodology and compare how these methods behave when you get more data,

Irina Rish

and when you have more compute, because eventually compute will get cheaper and you can build larger models. What will happen? You can see that the situation may reverse. On relatively smaller datasets you may have convolutional networks, as expected, being the king of the world for images and outperforming vision transformers. And people who love inductive biases say: of course, convolutional networks are based on the brain, they have inductive biases for how vision works, of course they work better. Yes, when you have less data. If you have a lot of data, then it’s like Bayes’ rule in action: if you don’t have much data, you need a good prior, an inductive bias. And when you have close to infinite data, the prior will be washed out by the data. Hopefully your prior was not too wrong, because if the prior is really wrong, it will take more data to wash it out. So maybe you should have a more generic architecture like vision transformers, without too many inductive biases, and let it learn from more data, and then it will outperform your inductive-bias-based convolutional network. Bitter lesson. Okay.
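
A tiny worked example of that washing-out effect, with a Beta-Bernoulli model and made-up numbers:

```python
# Sketch of the Bayes intuition: with little data the prior dominates,
# with lots of data it is washed out (Beta-Bernoulli posterior mean).
# All numbers are illustrative only.
true_rate = 0.9                  # what the data actually say
prior_a, prior_b = 1.0, 9.0      # a confidently "wrong" prior with mean 0.1

for n in [10, 100, 10_000]:
    heads = int(true_rate * n)   # idealized observations
    posterior_mean = (prior_a + heads) / (prior_a + prior_b + n)
    print(f"n={n:>6}: posterior mean = {posterior_mean:.3f}")

# n=    10: 0.500  -> the prior still matters a lot
# n=   100: 0.827
# n= 10000: 0.899  -> the prior is washed out by the data
```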

Irina Rish

It’s, by the way, not the first time this was observed. There is a 1994 NeurIPS paper from people famous in machine learning, though maybe not in deep learning, like Corinna Cortes and Vladimir Vapnik, names that you might not hear that frequently anymore, but that predate the deep learning hype. They were making the same point in that paper, although everything was at much smaller scale: the same argument that we should study and compare models and algorithms at scale, so we can see how the situation may reverse.

Irina Rish

And there is actually a whole history of scaling laws that goes way back before 2020. Corinna Cortes and her co-authors’ paper from 1994 did observe that there is a power law, a scaling law that looks like a straight line on a log-log plot, which predicts how neural nets improve; well, smaller nets, because in that case it was nothing like the compute of today. But they already saw that these power laws may be a good functional form to describe how things are going to improve. So basically, the loss goes straight down on a log-log plot according to a power law.
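
As a reminder of why a power law looks like a straight line on log-log axes (this is just the standard form, not any specific paper’s):

```latex
L(D) = a\,D^{-b}
\quad\Longrightarrow\quad
\log L(D) = \log a - b\,\log D ,
```

so on a log-log plot the loss is a line whose slope is the scaling exponent $b$.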

Irina Rish

Then more work came on top of that. Jumping to 2017: Joel Hestness was at Baidu, and now he is our collaborator at Cerebras. He did a similar type of analysis, but across multiple orders of magnitude of data, among other things. And the work by Jonathan Rosenfeld from MIT came out showing that similar power laws also apply when the x-axis is not data but model size.

Irina Rish

Then finally, Jared Kaplan and collaborators from OpenAI and Johns Hopkins and others had the famous paper that started the whole "scale is all you need", I would say, religion in AI. Essentially, they showed that scaling laws with respect to data, with respect to model size, and also with respect to compute all hold. The famous three scaling plots essentially say: if you fix the quantity on the x-axis, say compute, data, or model size, and you let the other two roam free so you can choose the best, the test loss will go down according to this power law.
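
For reference, the three laws from that paper take the form below; the exponents are approximate values quoted from memory of the paper, so treat them as assumptions:

```latex
L(N) = \left(\tfrac{N_c}{N}\right)^{\alpha_N},\qquad
L(D) = \left(\tfrac{D_c}{D}\right)^{\alpha_D},\qquad
L(C) = \left(\tfrac{C_c}{C}\right)^{\alpha_C},
```

with roughly $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.05$ reported for language models, where $N$ is parameter count, $D$ is dataset size, and $C$ is compute.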

Irina Rish

The god of straight lines is here, now zoomed in. Okay, probably many of you have seen that paper, but if not, I’d be happy to talk about it. Anyway, that kind of started the mental revolution in many people’s minds, although it’s surprising how many years it actually took before the field started understanding that this is extremely important. And the reason it’s important is that

Irina Rish

not only does it allow you to predict how the particular system you’re building is going to improve; say you’re building GPT-3 and there are many engineering solutions to compare. It gives you an investment tool to build it faster: even if you are OpenAI and have lots of compute, there are many ways to waste it, and you don’t want to waste it. So you compare things, see what scales better, and proceed with that. It’s an extremely useful tool, and it also gives you an idea of how much the system will improve if you provide extra compute and extra data.

Irina Rish

And by the way, there is nothing specific to transformers in scaling laws. Many other architectures scale well too. This is the multilayer perceptron; this was another bitter lesson for people who really love inductive biases, and they don’t like to see this picture. The multilayer perceptron does scale, maybe not as efficiently as the transformer, but with sufficient compute and data, if compute becomes cheap, scaling multilayer perceptrons may not be completely out of the question, at least for some tasks.

Irina Rish

Okay, changing gears a bit. So this was mainly the era of straight lines on log-log plots. Then people started noticing that it’s not always straight, and things get interesting if you measure not only the loss but also performance on tasks of interest. So one picture on the right

Irina Rish

was from the original GPT-3 paper from May or June 2020, showing a downstream task. Okay, so here’s GPT-3: can it actually do arithmetic? On the x-axis you have model size. For a while it couldn’t. And when it got large enough, it started grasping, or grokking, simple addition with two digits, then three, then four. The more difficult the addition or subtraction task was, the larger the model needed to be. Then people also identified the effect of grokking, the rapid transition from really bad to really good performance; it may happen as compute increases: you just keep the thing running and suddenly it groks. Then there was also the paper on emergent properties in language models. So, all these transitions and emergent properties at scale.

Irina Rish

That was very interesting. How do you explain it? Can you predict it? Can you find a functional form beyond straight lines that can capture it? This was specifically for the grokking case. So we looked into the internals of the training process and first tried to see how the moment of grokking depends on the amount of data, and to come up with an empirical distribution of that. But even more interesting, and it’s purely empirical, a work in progress, just a workshop paper at this point, was the following. By analogy with physical and biological systems, where before a phase transition you sometimes see high variance in the quantity of interest, or, as in epilepsy, high-frequency oscillations before the event happens, we started looking at spectral properties of the training and test loss. And we realized that all the runs where we would see grokking, where the model would eventually reach perfect accuracy (it was an arithmetic task), were associated with high-frequency oscillations in both training and test loss, while non-grokking runs were not associated with such fluctuations.
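
An illustrative sketch of how such a spectral check might look (this is my reconstruction with synthetic curves and an arbitrary cutoff, not the paper’s actual procedure or thresholds):

```python
# Flag runs whose loss curves show strong high-frequency oscillations,
# the grokking signature described above. Synthetic curves, illustrative only.
import numpy as np

def high_freq_power(loss_curve, cutoff_fraction=0.5):
    """Fraction of spectral power above a cutoff frequency."""
    detrended = np.asarray(loss_curve) - np.mean(loss_curve)
    power = np.abs(np.fft.rfft(detrended)) ** 2
    cutoff = int(len(power) * (1 - cutoff_fraction))
    return power[cutoff:].sum() / power.sum()

steps = np.arange(2000)
smooth_run = np.exp(-steps / 800)                              # no oscillations
oscillating_run = np.exp(-steps / 800) + 0.05 * np.sin(2.0 * steps)

for name, run in [("smooth", smooth_run), ("oscillating", oscillating_run)]:
    print(name, round(high_freq_power(run), 4))
# A run whose high-frequency power stays near zero could be killed early.
```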

Irina Rish

So you could use this as a tool to see whether a run is promising or not: if it’s not going to grok, maybe just kill it and restart from scratch. And this is just scratching the surface. There are many interesting things you can discover when you look into the dynamics of training, measure its properties and various patterns, and try to predict how the system will perform at scale.

Irina Rish

Finally, as I said, scaling laws in the form of simple power laws were no longer capturing everything. You see inflection points, you see those transitions, you see sometimes non-monotonic behavior, double descent, and all kinds of interesting things. What kind of functional form can capture it all? Is there such a thing? A student of mine, Ethan Caballero, who was, by the way, the person who introduced scaling to Mila as early as 2020, was really trying to capture pretty much everything that can happen to neural net behavior, and not just performance but alignment behavior too, with one functional form. I never believed it was possible, but he convinced me empirically; it seems to really work, and I’ll show you a few examples.

Irina Rish

He took some inspiration from physics and astrophysics, where people use a functional form called broken power laws. You don’t have to go into the details of the formula; essentially it just generalizes what Jared Kaplan and others did to multiple segments of these straight lines, smoothly connected. And this one is indeed the functional form to rule them all.
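
For reference, a smoothly broken power law can be written roughly like this (quoting the broken neural scaling laws parameterization from memory, so treat the exact form as an assumption):

```latex
y \;=\; a \;+\; b\,x^{-c_0}\prod_{i=1}^{n}\left(1+\left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}
```

Each $d_i$ marks a break where the local power-law exponent changes, and $f_i$ controls how sharp that transition is; with $n=0$ it reduces to the ordinary single power law.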

Irina Rish

And so far, at least empirically, we haven’t seen anything that would contradict it. Those are various datasets from various papers published in the field, about how things scale for various architectures with various data modalities. The functional form captures them and extrapolates accurately. Anyway, if you’re interested, I can talk about that paper offline.

Irina Rish

So far, whatever we tried, architecture-wise, data-modality-wise, and even in terms of the y-axis metric, whether it’s performance, accuracy, whatever a natural language or image task requires, or even alignment metrics, say from Anthropic papers, it captures and extrapolates all of them. The most interesting and most challenging example, on the right, is capturing a transition. If the stars align, it can do it now, but more work is required: this case on the right is under very ideal conditions, with practically no noise and lots of points to extrapolate from. But the fact that the functional form can capture that, and can be learned from data to do such extrapolation, I think is pretty cool.

Irina Rish

And yeah, that’s it. You’re welcome to listen to Ethan on broken neural scaling laws; he has a YouTube video from 2022. And he was the guy who designed the T-shirt: "Scale is all you need. AGI is coming." Although the new T-shirt, which has not been printed yet, crosses out "AGI is coming" and says "extinction from AI is coming." I think we’re going to have two types of T-shirts and distribute them across the two communities, depending on what people prefer. Anyway, there were lots of internal jokes at Mila about this T-shirt, and half of my class, where I’m teaching scaling and emergent behaviors, were actually wearing it. Anyway, I probably should skip through these technical things.

Irina Rish

There are many, many things; there is a lot of work still to do on scaling laws. It’s far from done. There are other important things that could predict the behavior of AI systems that people do not measure, and I don’t know why: properties of the downstream tasks. If tasks are more complex, you probably need more scaling. So there is a lot of work to do.

Irina Rish

And ideally, you don’t want today’s scaling laws, stated in terms of the amount of data or the number of parameters. You want to relate the information content of the data to the capacity of the model. That’s the ultimate trade-off you want to capture. But again, as I said, lots of work is required.

Irina Rish

There are also lots of open questions about how to draw inspiration from neuroscience. How did natural systems scale from single-cell organisms to what is walking around this conference room and giving talks? How did that happen, and how did the scaling progress? There are probably some patterns of scaling that are better than others. Take two systems: an elephant brain or a whale brain, and a human brain. The whale and the elephant potentially have a larger number of neurons; they have larger networks. But their scaling exponent is probably not as efficient, because they are presumably not at the same level of functionality as humans; or maybe they had just the right level of functionality for the environment they are in. Anyway, you see that scale is not exactly everything you need; you may need some additional ways of scaling properly. And, well, what happened with this picture? I think it’s loading.

Irina Rish

Yes, another point about what work is required. If you look at the analogies with the human brain, you immediately see a huge difference between modern neural networks, the so-called second generation (the first was the perceptron, deep neural nets are the second), and the brain. The main difference, to me, is that brain networks are never static models with a fixed set of weights. The brain is a complex dynamical system with multiple feedback loops that never ceases to send messages across neurons, even if you sleep, even if you have no stimulus, you don’t look at anything, you don’t read anything; the activations never cease. While our modern GPT-like neural networks are still statistical models, although humongous ones, and they sit there waiting for input before they can come alive. So maybe there is something in approaching AI networks from the complex dynamical systems perspective, and that may also hold some capacity for representing knowledge and for much improved agency of those networks. Open questions.

Irina Rish

Yes, ideally, the goal of the program started four years ago in our lab, and ultimately not just ours, of course, but the whole field, would be to come up with some universal laws of intelligence, both natural and artificial, to come up with some invariants of intelligence, no matter what substrate it’s built upon. These are all big questions, and I know many people work on them, but I think we need to keep those questions in mind. I’ll skip that.

Irina Rish

Yet another consideration coming from biology and nature is this interesting direction following the work of Antonio Damasio, the neuroscientist famous for linking the emotional and rational brain, and for explaining that emotions, just like sight and hearing and the other senses, are ways of processing signals, in this case from the body rather than the external environment: our emotions and feelings are basically sensors of the state of the body. So, in a sense, it’s another source of information, and it helps our systems maintain homeostasis. If it gets too hot or whatever, we will do something to restore homeostasis, the state of equilibrium with the environment. Neural networks right now have nothing like that. And one of the hypotheses is that maybe we should build them so that intelligence is homeostasis-driven. Damasio’s claim in that particular paper was that perhaps, if the only principle is that the system must maintain homeostasis in its environment, it will develop intelligence because it has no choice: otherwise it will be hurt or will die. If the system cannot be hurt and cannot die, why should it develop intelligence? So it’s an interesting point, and perhaps a perspective that might further improve today’s large-scale systems.

Irina Rish

I wanted to get, very quickly, and I’ll probably have to dash through this, to some practical aspects of what our lab has been doing recently. Because everything I was talking about is more theoretical, or, well, experimental, analysis of those systems. But you want to build them, right? And the problem is that, at least several years ago, there was a

Irina Rish

rapidly growing gap between the compute that’s necessary to build such systems and what’s available outside companies like OpenAI or Anthropic. In our scaling workshop, with Jared Kaplan calling in and discussing all these exciting papers, Jared had just founded Anthropic. And I mentioned: oh, there may be applications open for compute on the Frontier supercomputer, forty thousand GPUs, do you want to join the application? And Jared said: no, Irina, thank you, I don’t need your forty thousand GPUs, we have some. Great. But we didn’t have anything, and most of academia was really at the level of "a few GPUs is wonderful."

Irina Rish

So what do you do? Many people got very pessimistic, and maybe subconsciously that was one of the reasons they were initially so against, and quite a few people still are against, this whole field of scaling laws and foundation models: "it’s kind of stupid, you’re not proposing anything particularly smart, you’re just scaling things." But who cares? These models work, and maybe, just as with the law of large numbers, you want to understand what happens at scale. Well, we tried to find a way to get more compute.

Irina Rish

And apparently physicists and other scientists did it before us; it’s just that AI people didn’t quite know about it. You can apply for large-scale government-funded compute: the Department of Energy has Summit and Frontier at Oak Ridge, Tennessee, there is Argonne National Laboratory, and so on and so forth. So basically, we just needed to apply. And yeah, if you ask, you might be given what you need. So we did ask, and we got it. That was a collaboration with a bunch of our colleagues from the open-source community, from universities, and from

Irina Rish

open-source nonprofits like EleutherAI and LAION, who started even before us to build datasets and models that would be fully open, both code-wise and model-wise. The collaboration grew over the years; last year we joined a collaboration with the Stanford Center for Research on Foundation Models and many others. There are multiple projects going on Summit and Frontier now. We got the INCITE allocation on Summit, roughly six million GPU hours, back in 2023. Now we have another Summit allocation plus some Frontier allocation. So, anyway, if you keep applying, the chances are high that you will get some compute, and more compute is hopefully coming. There are multiple projects on language models, on vision-language models, on time series models, and so on and so forth.

Irina Rish

So last year, jointly with Stanford, Together, and collaborators, we built the RedPajama-INCITE 7B model. It was state of the art, well, for a week, because in this crazy field you cannot stay state of the art for more than a few days, seriously. But that was nice. Then we decided that it’s enough

Irina Rish

training things from scratch. We should, especially in open source, keep training on top of each other’s models, combine things, and do continual pretraining. We just had a paper out about that. And here is an example of why you may want to do it.

Irina Rish

Take Gato. Gato was a foundation model for multimodal data: text, images, audio, decision sequences for playing games, and so on. But it was a fixed mix of data. Now, if I want another game or another language, should I pretrain everything from scratch again with a new data mix? Not really. That’s the continual pretraining project I’m talking about: let’s piggyback on previously built good models and keep training continually. Our time series foundation model, although relatively small, was the first open-source one; we released it in February, and it immediately got lots of attention and press.

Irina Rish

And we hopefully will push this further to larger and more powerful models. There are also multimodal vision-language models, and more modalities on the way.

Irina Rish

There are interesting things happening in terms of alignment, though, if you go beyond language, even just to vision-language, not even to other modalities. As you can imagine, the system may not always tell you what you’d like to hear. There is this example of a poor old lady crossing the road; we asked an older version of the model, it was actually a MAGMA system, and our current system doesn’t do this anymore. You ask whether you should help her, and the system says: no, she’s a burden to society. So you want to give it some human feedback, and maybe do a little bit of RLHF on this VLM, just like people do with language models, and teach it some values.

Irina Rish

There are also interesting research questions: do larger models become smarter in the sense that they don’t need as many reminders and examples of how to behave properly? What’s the sample complexity of alignment? All good open questions. We have some papers on that and work in progress.

Irina Rish

But going into the future, and especially since I have two minutes left, a big jump to the question of alignment in general. Whenever I teach a class and start talking about AI alignment, people ask: what is it? I say, well, it’s supposed to be the alignment of AI systems with human values. And there is always the question: which human values? Whose values? It is 100% guaranteed that the question will be asked, and I say: I know. It’s not a formally defined field, unlike maybe capability studies, but people are working on it, and there is lots of literature, and so on and so forth. I don’t have time to go into that.

Irina Rish

But the question of whether it is even possible to have something like a unifying set of human values, or anything like objective ethics, is still open. Of course, with all due respect to Derek Parfit, whom I respect immensely and really like, although On What Matters is a very long book, seven hundred plus pages, and not an easy read. It was an attempt to unify different approaches to ethics, right?

Irina Rish

He took the three key approaches and tried, as he said, to climb toward the top of the same mountain from different directions. That reminded me of, and made me wonder about, the approaches people use in invariance principles, causality, and invariant risk minimization for out-of-distribution methods. They were all based on trying to come up with a set of features or properties that would be invariant across multiple different data situations or environment distributions. So essentially, from my nerdy perspective, Derek Parfit was trying to do invariant risk minimization on ethics. It might not be a solvable problem, but at least we could try to some extent. So there are many open research questions there.

Irina Rish

Finally, I have twenty seconds left. I wanted to comment on this whole current second war in heaven, so to speak, between the AI safety advocates, especially the extreme part of the distribution of that population, and the more AI-optimist side, because there has been a lot of discussion, especially last year. There were open letters about putting a pause on AI development, although of course it was understood it might not be possible. There was a lot of back and forth on social media; there were panels, including the panel at our scaling workshop last December at NeurIPS. It’s all a very interesting and controversial question, and sometimes an extremely heated debate. So at some point, when the "stop AI development" type of approaches were really

Irina Rish

gaining some traction, especially where they could potentially affect government decisions, several people in the field decided that we needed to do something about that. So the AI Optimists Discord was created, and the AI Optimists community; I’m involved in it, but the most active members and creators of the movement are Nora Belrose and Quintin Pope. They’re writing blogs and analyzing different arguments, and I really invite all of you to go to the website and read the blogs; they are well thought through.

Irina Rish

Essentially, they put together a few principles that were agreed upon in that community: how we want AI to develop in the future, what we want to happen, and what we don’t want to happen. We essentially want people to have the ability to use AI and contribute to it, and not to be bound by overly strict regulations that would prohibit them from doing so. And we also care about not stopping progress, especially at this stage, because that might be as dangerous as the potential outcomes people are so concerned about. There are dangers both in completely uncontrolled development and in overly controlled development.

Irina Rish

And I think the best summary of that point, and I think I’m out of time, is this quote from "The Gentle Seduction," a sci-fi story that I believe some of you have read. I really like it; it’s one of the rare utopian rather than dystopian pieces of sci-fi. And essentially it makes the same point that our AI-optimist movement and community tries to make. When the main character was traveling, well, her mind was traveling the galaxy, looking at what happened to different civilizations when they reached the singularity: yes, there were civilizations where things went too fast, too carelessly, and bad things happened. But there were also those whose governments got so scared of the potential dangers that they stifled progress, and the civilization also died out, slowly. We don’t want that to happen either. Only those who proceeded with caution, but without fear, survived. I think that’s what we should do.

Speaker 2

Thank you so much, Irina. We’re grateful to have you here. We set the timing of the Q&A a little wrong here, but I still want to give five minutes to anyone who wants to ask one or two questions of Irina, at least one question per person. Just walk up to the mic, really fast, if you would. We will also get a chance to talk with her during the panel discussion at the end. One other thing I just want to say: we have packed a lot into this conference, and I know that sometimes it takes some stamina to make it through. So we appreciate you all. There were just so many people who wanted to talk about this topic, and I didn’t want to turn people away, and it builds the energy. So, thank you for your patience. This is going to be so fun. Go for it, guys; ask your quick questions, and then we will take a break.

Speaker 3

We’re both being nice. I’m curious what your thoughts are. You mentioned democratization of AI and the resources it takes to train models. What’s your take on a distributed model of that? Do we have to rely on really large data centers, or can we, as citizens, contribute our GPUs?

Irina Rish

Yeah, it’s an excellent question. There is lots of interest in people trying to do that. And sometimes you might not even be able to put all the data in one central place; medical data, for example. People from healthcare, from, say, brain imaging and so on, would all like to have their foundation models too, like neuro-foundation models and so forth. But they cannot share healthcare data across hospitals and let you collect everything in one place. So there is a good question: how do you do federated learning of foundation models? It’s a hot research topic, and there are various other approaches and attempts to build those systems in a distributed way. It depends how, because in the naive approach, to train the model you need to backpropagate the error, and if the parts of the model are distributed and the interconnect is slow, that will just not work. But mixture-of-experts, for example, recently the most popular type of model, may be just right for that type of distributed training. So distributed, federated training, where the datasets never leave their locations and where parts of the model might also be distributed, so that the interconnect across all of them is not as critical, might be the way to go. Yes?
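
A minimal sketch of the simplest federated scheme in that spirit, FedAvg-style averaging on a toy linear model (illustrative only; the "hospital" data, model, and learning rate are all made up):

```python
# Data never leaves a site: each site trains locally, only weights travel.
import numpy as np

def local_step(weights, X, y, lr=0.1):
    """One local gradient step of linear regression at a single site."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
sites = []
for _ in range(3):                       # three "hospitals", data stays local
    X = rng.standard_normal((50, 2))
    sites.append((X, X @ true_w + 0.01 * rng.standard_normal(50)))

weights = np.zeros(2)
for _ in range(200):                     # each round: local training, then averaging
    local_updates = [local_step(weights.copy(), X, y) for X, y in sites]
    weights = np.mean(local_updates, axis=0)  # server averages weights only

print(weights)                           # approaches true_w without pooling any data
```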

Speaker 2

Okay, last question. And then we’ll get more during the panel.

Speaker 4

Thanks, Irina. My question is about a revealing comment you made, that scaling is a religion or can be seen as one. I wanted to ask how we should think about the normative content of scaling laws. At some point in the next few years, a group of people is going to decide whether to spend a trillion dollars training a foundation model because it’s going to reach a perplexity of three on some large dataset. But at that scale of investment, we’re making a normative decision, right, to invest that trillion or ten trillion dollars into big data centers and this form of intelligence rather than, let’s say, trains or nuclear fusion or going to Mars. So how should society, and people outside the field, think about allocating so much of humanity’s accumulated resources to this?

Irina Rish

Yes, it’s a very good and difficult question. I think the good news, by the way, is, well, okay. The good news is that it gets a bit more technical, but those straight power laws kept evolving. It turns out, with the Chinchilla scaling laws (see the rough numbers below), that perhaps you don’t need models of that size to reach that performance if you give them more data. Then you go beyond that. I didn’t talk much about this, but the whole current trend, compared to 2020, is that models may get smaller, but the data should get considerably larger, and that trend is shifting things toward more continual and distributed development of really good models. There are lots of smaller models that are very practical, like all these Llamas and Mistrals and so on, extremely competitive models of medium to smallish size, which is great. But of course, the argument from companies who might be able to raise that amount of money and compute is: yes, you can take all that and scale it, and it will be even better. So in a sense, the scaling race continues, and I don’t have a good answer to this question. As open source and as academia, we’re just trying to compete in the sense of developing pretty good medium-sized models and showing what’s possible at that scale, by distributing them or building continually pretrained models and so on. But of course there is always this elephant in the room: if somebody has enough money to do exactly the same thing you’re doing and scale it, their model will be better. Unless there are some interesting things ahead in scaling laws, which I’m curious to discover. I don’t know what’s going to happen. Maybe at some point you will be saturating, and if you show saturating plots, that might cool the whole thing down: at some point it will not be cost-efficient, so maybe you shouldn’t invest those few trillion dollars in that compute. Because, and I don’t want to sound like a Gary Marcus supporter, scaling laws may actually hit some wall; if not model capacity, then the irreducible entropy of the data. So there should be some advance in scaling laws that will show this may not necessarily be cost-efficient. Honestly, I don’t know how to answer this question. I don’t know how to successfully compete with OpenAI yet. I’m thinking about it.
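
To put rough numbers on the Chinchilla point mentioned above (approximate rules of thumb quoted from memory of the Chinchilla paper; treat them as assumptions):

```latex
C \approx 6\,N D ,\qquad
N_{\mathrm{opt}} \propto C^{1/2},\qquad
D_{\mathrm{opt}} \propto C^{1/2},\qquad
D_{\mathrm{opt}} \approx 20\,N_{\mathrm{opt}}
```

where $N$ is parameter count, $D$ is training tokens, and $C$ is training FLOPs. So a compute-optimal 70B-parameter model wants on the order of 1.4 trillion tokens, rather than being an even larger model trained on less data.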

Speaker 2

Thank you again, Irina. Let’s have one more round of applause.