This is an essay with some weekend reflections on the current state of machine learning technology, with a particular focus on LLMs (aka AI) and our current point in history. Before we jump into this exciting singularity thing, I’d like to mention that, as an essay, this is a more personal and less formal piece of writing, sharing and highlighting some ideas that look important in that context. It is not a comprehensive industry report, nor was it meant to be one, but I hope it makes for an interesting read both for machine learning engineers and for a broader audience interested in the current AI uprising.

My perspective on the Natural Language Understanding evolution

There are three parts to the story. The history briefly reminds us how we got to our current AGI state from a multilayer perceptron in just twelve years. Present day LLMs focuses on the latest achievements and current industry trends; if you are deep in context and looking for some fresh ideas, skip to that part. The mystery presents some ideas on what could follow the current AGI stage.

The history

So, first of all, machine learning has been around for a while: about a decade, or a duodecennial, depending on whether you count from Tomas Mikolov’s word2vec publication or from Andrew Ng’s Machine Learning course on Coursera. Kaggle was launched in 2010, and Fei-Fei Li gathered ImageNet in 2009. Not that long ago, you’d probably agree if you’re over 30.

Some people would argue that machine learning has been around much longer, but here I am speaking about industry adoption of deep learning algorithms (the technology momentum), not about pure research. And we are not touching things like the classic ML algorithms covered in scikit-learn: all the regression, clustering, and time series forecasting kinds of things. They are silently doing their important job, but people do not call them AI; no hype around them, you know.

Why did that AI spring happen 12 years ago?
Deep learning (training a multi-layer neural network with error backpropagation) finally became feasible on an average GPU. In 2010 the simplest neural network architecture, a multi-layer perceptron, beat the other algorithms at handwritten digit recognition (the famous MNIST dataset), a result achieved by Juergen Schmidhuber et al.

Since that point around 2010, the technology has become more and more robust. There have been a few game-changing moments: the said word2vec model release, which brought semantic understanding to the world of Natural Language Processing (NLP); the public release of the Tensorflow and Keras deep learning frameworks a little later; and of course, the invention of the Transformer in 2017, which is still a SOTA neural network architecture, having expanded beyond the world of NLP.

Why is that? Because the Transformer has attention and is capable of handling sequences such as texts with O(n²) complexity, enabled by a matrix multiplication approach that lets us look at the whole input sequence at once. The second reason for the Transformer’s success, in my opinion, is the flexible Encoder-Decoder architecture, allowing us to train and use the models jointly and separately (sequence-to-sequence or sequence-to-vector).

The GPT family of models (the Transformer Decoder) made some noise beyond the tech industry, since GPT-2 could already produce fairly humanlike texts, and GPT-3 was capable of few-shot and some zero-shot learning. The last part is the more important one: the OpenAI GPT-3 paper is even named “Language Models are Few-Shot Learners”, and this ability of Large Language Models to quickly learn from examples was first stated by OpenAI in 2020.

But bang! ChatGPT’s release came with hype we’ve never seen before, finally drawing huge public attention. And now GPT-4 is going beyond that.
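To make the O(n²) point above concrete, here is a minimal numpy sketch of single-head scaled dot-product self-attention, without the learned projections, masking, or multiple heads of a full Transformer layer. The quadratic cost comes from the score matrix, which holds one entry per pair of positions in the sequence.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Minimal single-head attention. The q @ k.T product builds an
    (n, n) score matrix (every position attends to every other),
    which is exactly where the O(n^2) cost in sequence length lives."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                    # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v, weights

# Toy "sequence" of 4 tokens with 8-dimensional embeddings,
# used as queries, keys, and values at once (self-attention).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x)
```

Note that one matrix multiplication computes all pairwise interactions in parallel, which is what made the architecture such a good fit for GPUs.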
Why is that? For the last 7 years, since neural networks started showing decent results, what we’ve been calling AI was actually narrow artificial intelligence: our models were trained to solve some specific set of tasks, such as recognizing objects, performing classification, or predicting the following tokens in a sequence. And people have only been dreaming of AGI, an artificial general intelligence capable of completing multiple tasks at a human level.

Present day LLMs: reasoning abilities are game changers

In fact, what happened with instruction-based LLM tuning, or, as they call it at OpenAI, reinforcement learning from human feedback, is that GPT-3.5+ models finally learned the ability to reason over the provided information. And that changes things: before, LLMs were closer to a reasonably good statistical parrot, but still very useful for a lot of applications such as text embeddings, vector search, chatbots, etc. But with instruction-based training, they effectively learn reasoning from humans.

What exactly is reasoning? The ability to use the provided information in order to derive conclusions through some logical operations. Say A is connected to B and B is connected to C, so is A connected to C? GPT-4 features a much more complex reasoning example on their official product page. The model’s ability to reason is so strong and flexible that it can produce a structured sequence of instructions or logical operations to follow in order to achieve a given goal, using “common knowledge” or “common sense” along the way, not just the information provided in the prompt.
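That A-to-C question is, at bottom, ordinary graph reachability; here is a minimal sketch of the explicit logical operation involved. The remarkable part is that an LLM now performs this kind of inference from a plain-text description, with no such code in sight.

```python
def connected(edges, start, goal):
    """Iterative graph reachability: the explicit logical operation
    behind 'A is connected to B, B to C, so is A connected to C?'."""
    frontier, seen = [start], {start}
    while frontier:
        node = frontier.pop()
        if node == goal:
            return True
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return False

edges = [("A", "B"), ("B", "C")]
```

A classic system needs the relation encoded as data and the traversal encoded as an algorithm; an instruction-tuned LLM derives the same conclusion from the sentence itself.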
Before LLMs with such reasoning abilities, the other tool well designed for reasoning was a knowledge graph, with nodes containing entities and edges as predicates or relations of those entities. It is a form of information storage that provides explicit reasoning abilities. At some point, I was involved in building a question-answering system which, among other things, used a knowledge graph to find the information asked for: you just had to detect the intent, see if we had that kind of relation in the graph, check for the particular entities mentioned, and, if they existed, query that subgraph. In fact, this pipeline provided a translation of a query in natural language into a SPARQL query.

Now you can provide this factual information to the model in plain text as the context part of your prompt, and it will “learn” it zero-shot and be able to reason on it. Wow, right? And you are no longer limited to the number of entities and relation types contained in the graph. Plus, you get that “common sense”, the general understanding of the concepts of our world and their relations, which was the trickiest part separating machine learning models from human cognition. We did not even notice how we became able to give instructions in natural language and have them work correctly without overly explicit explanations.

Reasoning plus knowledge are the two crucial components of intelligence. For the last 20 years, we’ve put roughly all human knowledge on the Internet in the form of Wikipedia, scientific publications, service descriptions, blogs, billions of lines of code and Stackoverflow answers, and billions of opinions in social media. Now we can reason with that knowledge.

GPT-4 is the AGI

These reasoning abilities are well demonstrated in the official OpenAI tech report on GPT-4:

GPT-4 exhibits human-level performance on the majority of these professional and academic exams. Notably, it passes a simulated version of the Uniform Bar Examination with a score in the top 10% of test takers.
According to GPT-4’s results on a number of human tests, we are somewhere around AGI. OpenAI even uses these words on their webpage, and a recent Microsoft paper, 150+ pages with an in-depth study of GPT-4’s capabilities across different domains, named “Sparks of Artificial General Intelligence: Early experiments with GPT-4”, carefully but explicitly claims that AGI is here:

Given the breadth and depth of GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.

and later:

The combination of the generality of GPT-4’s capabilities, with numerous abilities spanning a broad swath of domains, and its performance on a wide spectrum of tasks at or beyond human-level, makes us comfortable with saying that GPT-4 is a significant step towards AGI.

The reason for that claim is:

Despite being purely a language model, this early version of GPT-4 demonstrates remarkable capabilities on a variety of domains and tasks, including abstraction, comprehension, vision, coding, mathematics, medicine, law, understanding of human motives and emotions, and more.

And to nail it:

Even as a first step, however, GPT-4 challenges a considerable number of widely held assumptions about machine intelligence, and exhibits emergent behaviors and capabilities whose sources and mechanisms are, at this moment, hard to discern precisely <…>. Our primary goal in composing this paper is to share our exploration of GPT-4’s capabilities and limitations in support of our assessment that a technological leap has been achieved.

We believe that GPT-4’s intelligence signals a true paradigm shift in the field of computer science and beyond.

I highly recommend that you spend some time with this study, as behind these loud claims there is a very interesting analysis of how these models work, and an extensive comparison of GPT-4 to ChatGPT results on a variety of non-trivial tasks from different domains.
LLMs plus search

If we need to apply an LLM’s reasoning abilities to draw conclusions over some specific information not expected to have been learned by the model during training, we can use any kind of search, i.e. a retrieval plus ranking mechanism, no matter whether you store your data as vector embeddings in some ANN index like Faiss or in an old-school full-text index like Elastic, and then feed these search results to an LLM as context, injecting them into the prompt. That’s kind of what Bing 2.0 and Bard (now powered by PaLM 2) searches do now.

I have implemented this search + LLM call system both with a DPR architecture, where ChatGPT replaced the Reader model, and with full-text Elastic search. In both cases, the overall quality of the system depends on the quality of the data you have in your index: if it is specific and complete, you can count on better answers than vanilla ChatGPT provides.

Some even managed to make a Swiss-army-knife library around GPT, call it a vector database, and raise a good round on that; my hat goes off! But due to the textual interface of GPT models, you can build anything around them with any tools you are familiar with; no adapters needed.

Model analysis

One of the questions that could give a clue to further model advancements is how these large models actually learn, and where those impressive reasoning abilities are stored in the model weights.

This week OpenAI released a paper, “Language models can explain neurons in language models”, and an open-source project aiming to answer these questions by peeling away the layers of LLMs. The way it works: they observe the activity of some part of the studied model’s network that is frequently activated on some domain of knowledge; then the more powerful GPT-4 model writes an explanation of what this particular part, or neuron, of the studied LLM is responsible for; and then they use GPT-4 to predict the original LLM’s activations on a number of relevant text sequences, which results in a score being assigned to each explanation.
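As a rough schematic of that explain-then-score loop: the two functions below are toy stand-ins for the real subject model and the GPT-4 explainer (both of which the actual project queries as language models), and the score here is plain agreement, whereas the real method uses a correlation-based score between simulated and real activations.

```python
def real_activation(token):
    # Stand-in for the neuron being studied: pretend it fires
    # on capitalized tokens.
    return 1.0 if token[0].isupper() else 0.0

def simulate_activation(explanation, token):
    # Stand-in for the explainer model predicting the neuron's
    # behavior from a natural-language explanation of it.
    return 1.0 if "capitalized" in explanation and token[0].isupper() else 0.0

def score_explanation(explanation, tokens):
    """Score an explanation by how well the simulated activations
    match the real ones over a set of text tokens."""
    matches = sum(simulate_activation(explanation, t) == real_activation(t)
                  for t in tokens)
    return matches / len(tokens)

tokens = ["Paris", "is", "a", "City", "in", "France"]
```

A correct explanation (“fires on capitalized tokens”) reproduces the neuron’s behavior and scores high; an unrelated one only matches by accident.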
However, this technique has some drawbacks. First, as the authors state, their method gave good explanations to only about 1,000 neurons out of the roughly 300,000 studied. Here is a citation from the paper:

However, we found that both GPT-4-based and human contractor explanations still score poorly in absolute terms. When looking at neurons, we also found the typical neuron appeared quite polysemantic. This suggests we should change what we’re explaining.

The second point is that this technique currently does not provide insights into how the training process could be improved. But it is a good effort in terms of model interpretability study. Maybe if the studied neurons were united into clusters based on their interdependencies, and these clusters demonstrated behavioral patterns that could be changed through different training procedures, that would give us some understanding of how certain model capabilities correlate with training data and training policy. In some way, this clustering and differentiation could look like the brain’s segmentation into different areas responsible for particular skills. That could provide us with insights on how to efficiently fine-tune an LLM so that it gains some particular new skill.

Agents

Another trending idea is making an autonomous agent with a looped LLM: Twitter is full of experiments like AutoGPT, AgentGPT, BabyAGI, et al. The idea is to set a goal for such an agent and provide it with external tools, such as other services’ APIs, so it can deliver the desired result via a loop of iterations or by chaining models.

Last week Huggingface released Agents in their famous Transformers library to:

“easily build GenerativeAI applications and autonomous agents using LLMs like OpenAssistant, StarCoder, OpenAI, and more”.
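The looped-LLM agent pattern described above can be sketched in a few lines. Every name here is illustrative, not the API of any of the frameworks mentioned, and the LLM is stubbed out with a fixed plan; a real agent would prompt a model with the goal, the tool list, and the history on every iteration.

```python
def stub_llm(goal, history):
    # Stand-in for the LLM that decides the next action; a real agent
    # would build a prompt from goal + tools + history and call a model.
    plan = ["search", "summarize", "finish"]
    return plan[len(history)]

# Hypothetical external tools the agent is allowed to call.
TOOLS = {
    "search": lambda goal: f"3 documents about {goal!r}",
    "summarize": lambda goal: f"summary of findings on {goal!r}",
}

def run_agent(goal, llm=stub_llm, max_steps=5):
    """Loop: ask the LLM for the next action, run the chosen tool,
    append the result to the history, stop when the LLM says 'finish'."""
    history = []
    for _ in range(max_steps):
        action = llm(goal, history)
        if action == "finish":
            return history
        history.append((action, TOOLS[action](goal)))
    return history
```

The `max_steps` cap matters in practice: looped agents without a step budget can happily iterate forever.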
The library provides an interface to chain models and APIs capable of responding to complex queries in natural language and supporting multimodal data (text, images, video, audio). The prompt in this case includes the agent’s description, a set of tools (mostly other narrow-case neural networks), some examples, and a task. Agents will facilitate model usage for non-engineers, but they are also a good starting point for building more complex systems on top of LLMs. And, by the way, here comes the Natural Language API, a different kind of Internet from the one you know.

BTW, Twitter is going really crazy these days around AI; everybody is building something on top of LLM models and showing it to the world. I have never seen so much enthusiasm in the industry. If you want to investigate what’s up, I’d recommend starting the rabbit-hole dive with Andrej Karpathy’s recent tweet.

https://twitter.com/karpathy/status/1654892810590650376

Coding co-pilots

Codex, the model powering Github Copilot, has been around for a while, and a few days ago, as a Colab Pro subscriber, I received a letter from Google saying that in June they would (citing the letter):

start gradually adding AI programming features to Colab. Among the first to appear: single and multi-line hints for code completion; natural language code generation, which allows you to send code generation requests to Google models and paste it into a notebook.

By the way, last week Google announced the PaLM 2 family of models, among which there is Codey, Google’s specialized model for coding and debugging, which will probably be powering these announced features.

To conclude this section, I’d like to say that my personal choice of NLP over CV around 2016 was made due to the fact that language is the universal and ultimate way people transfer information. We even think with the concepts from our language, so the system is complex enough to define ourselves and the world around us.
And that brings the possibility of creating a language-driven system with reasoning abilities, and a consciousness that is humanlike or even surpasses that level. We’ve just scratched the surface of that true reasoning around half a year ago. Imagine where we are, and what will follow.

The mystery

If for any reason you are unfamiliar with Tim Urban, the author of the waitbutwhy blog, read his post on AGI, dated 2015. Check out how this looked from the past, just 7 years ago, when there were NO LLMs around and no Transformer models either. I shall quote a few lines of his post here, just to remind you where we were:

Make AI that can beat any human in chess? Done. Make one that can read a paragraph from a six-year-old’s picture book and not just recognize the words but understand the meaning of them? Google is currently spending billions of dollars trying to do it.

But after we achieve AGI, things would start moving at a much faster pace, he promises. This is due to what Ray Kurzweil calls human history’s Law of Accelerating Returns: more advanced societies have the ability to progress at a faster rate than less advanced societies, precisely because they are more advanced.

Applying this law to current LLMs, it is easy to go further and say that the ability to learn and reason over all the data stored on the Internet would add this superhuman memory to human-level reasoning, and soon the smartest people around would be outsmarted by the machine, the same way chess champion Kasparov was beaten by the Deep Blue computer in 1997. This would bring us to Artificial Super Intelligence (ASI), though we do not know what that looks like yet. Maybe we’d need another feedback loop for training it, as GPT-4’s human feedback learning provides just human-level reasoning.
It’s highly possible that the better models would teach the weaker ones, and this would be an iterative process. Just speculating; we’ll see.

The thing Tim really outlines in the second part of his post on AGI is that due to this law of accelerating returns, we might not even notice the point when our systems surpass AGI, and that things would then be a little beyond our understanding.

For now, just a small percentage of people who work in tech understand the real pace of the progress and the astonishing potential instruction-based LLM tuning brings. Geoffrey Hinton is one of them, publicly speaking of such risks as job market pressure, fake content production, and malicious usage. What I find even more important is that he points out that current systems, capable of zero-shot learning of complex skills, might have a better learning algorithm than humans do.

The concern with modern LLMs comes from the fact that while they provide huge leverage in a lot of tasks, the ability to work with these models (pre-train, fine-tune, do meaningful prompting, or incorporate them into digital products) is obviously unequal across society, both in terms of training/usage costs and skills. Some people from the Twitter or Huggingface communities would argue that we now have quite capable open-source LLMs as an alternative to the OpenAI hegemony, but still, they are following the trend and are less powerful, plus they require certain skills to handle. And while OpenAI’s models are such a success, Microsoft and Google would invest even more into that research, to try and stop them. Oh, Meta too, if they finally let the Metaverse go.

One of the most in-demand skills nowadays is writing code: software engineering has dominated the tech scene and salaries for the last 20 years.
With the current state of the coding co-pilots, it looks like a good chunk of the boilerplate code would soon be either generated or efficiently fetched and adapted (which looks the same to the user), saving developers lots of time and maybe taking some job opportunities off the market.

There is another idea in that very good post on AGI, and beyond it, that sounds like this: AGI would be capable of autonomous self-improvement. For now, vanilla LLMs are still not autonomous agents and by no means incorporate any willpower, the two ideas that scare people. Just in case, do not confuse the model’s training process, which involves reinforcement learning from human feedback (where the RL algorithm used is OpenAI’s Proximal Policy Optimization), with the final model, which is just the Decoder part of a Transformer predicting token sequences.

Probably you’ve noticed that a few papers I’ve cited were released last week; I am sure the following weeks would bring new releases and ideas that I wish I had covered in this post, but that’s a sign of the times. Seems like we are rapidly entering the new era of software and have made a few steps towards the singularity point, as the innovations in the machine learning industry are already happening at an unprecedented pace: several a month, while last year we saw just a few big releases. Enjoy the ride!

P.S. The next explosion would be when Musk connects us to LLMs through Neuralink. I bet.

P.P.S. Not a single OpenAI API call was made to write this text.