You have probably heard of, and possibly even used, the AI-driven ChatGPT that people have been using to write all sorts of things, from stories to doctoral theses. But is it really that smart? Can it really write better text than a human author?
What is GPT?
GPT is an acronym for Generative Pre-trained Transformer. It is a deep learning neural network model that uses Internet data and other written material to generate any type of text. There have been five GPT models from OpenAI.
InstructGPT was built based on GPT-3 technology. It was trained with humans in the loop using what they call Reinforcement Learning from Human Feedback, (RLHF). As a result, it is much better at following instructions than GPT-3. ChatGPT is built using a similar technique.
ChatGPT can quickly generate text that is nearly indistinguishable from human-generated text. How in the world can a computer do that? In simple terms, what GPT is doing is just trying to pick the next word in the sentence.
Actually, it is a little more complicated than just choosing the next word. In technical terms it is adding a token which could be a word or a part of a word. This is why it sometimes makes up words. More about tokens later.
So ChatGPT is trying to produce a “reasonable continuation” of whatever text it has so far.
By reasonable I mean what one might expect after seeing what people have written on billions of webpages and in digitized books. At each step it builds a list of words with probabilities. The higher the probability, the more often that word follows the previous text. You might expect that it would choose the word with the highest probability, and some of the earlier attempts to generate human-like text did just that. However, that method generates some rather boring and often very repetitive text.
Perhaps someday we will understand why this works, but it turns out that if we sometimes, at random, pick one of the words with a lower probability, we get much more interesting text that seems quite like human-generated text.
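To make the difference concrete, here is a minimal sketch in Python with an invented word list and invented probabilities; it simply contrasts always taking the most likely next word with sampling by probability.

```python
import random

# Invented next-word probabilities for the prompt "The cat sat on the ..."
next_word_probs = {"mat": 0.45, "floor": 0.25, "sofa": 0.15, "moon": 0.10, "keyboard": 0.05}

# Always taking the most likely word gives safe but repetitive text.
greedy_pick = max(next_word_probs, key=next_word_probs.get)

# Sampling by probability occasionally picks a less likely word,
# which makes the generated text feel more varied.
sampled_pick = random.choices(
    list(next_word_probs), weights=list(next_word_probs.values()), k=1
)[0]

print(greedy_pick, sampled_pick)
```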
Here is where we get into a little bit of computer coding voodoo. There is a so-called “temperature” parameter, (more on parameters later) that determines how often one of the words with a lower probability is used. Through testing and experimentation, a temperature setting of 0.8 seems to work best. That is, it generates the most human-like text. There is no theory being used here; it is just what has been found to work. The concept of “temperature” is there because an exponential distribution is being used, but there is no “physical” connection, at least none that we know of.
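As a rough illustration of what that “temperature” knob does, here is a small sketch with invented scores for three candidate words. The real system works at a vastly larger scale, but the exponential weighting below is where the name comes from.

```python
import math
import random

def sample_with_temperature(scores, temperature=0.8):
    """Turn raw scores into weights and pick one candidate at random.

    A low temperature sharpens the distribution (predictable, repetitive text);
    a higher temperature flattens it (more surprising word choices).
    """
    scaled = [s / temperature for s in scores]
    top = max(scaled)                              # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]  # exponential weighting, hence "temperature"
    return random.choices(range(len(scores)), weights=weights, k=1)[0]

candidates = ["mat", "floor", "moon"]              # invented candidates and scores
scores = [2.0, 1.0, 0.1]
print(candidates[sample_with_temperature(scores, temperature=0.8)])
```

Nothing in the code explains why 0.8 works better than 0.5 or 1.2; as noted above, that number simply comes from experimentation.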
So how does ChatGPT get these probabilities?
Basically, it determines the probability of each new word by reading lots and lots of text. Let’s try something a little smaller to start out with. We will just choose each letter at random, based on how often letters appear in written text. Without any other rules or parameters, most of what we get is gibberish. If we add spaces to simulate real text, we get a few more actual words, but most of it is still gibberish.
Next, let’s try adding two letters at a time, (2-gram). Since the letter Q is almost always followed by a U, the probability of adding a Q without a U following it is nearly zero. By adding two letters at a time, with a few rules to guide the process, we start to get more actual words in the gibberish. As we add progressively more letters, (n-gram) we get progressively more realistic words.
As we add more letters to the n-gram and more rules like “I before E except after C…” we can generate more real words and less gibberish. Of course, these words are just jumbled together in pseudo-sentences and they make no sense at all. But at least we are creating real words.
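A toy version of that letter-by-letter process fits in a few lines. The sketch below builds a character 3-gram from a tiny stand-in text rather than billions of webpages, so its output is crude, but it shows the mechanism.

```python
import random
from collections import defaultdict

text = "the quick brown fox jumps over the lazy dog and the dog naps " * 50  # stand-in training text

# Count which letter tends to follow each pair of letters (a character 3-gram).
follow_counts = defaultdict(lambda: defaultdict(int))
for i in range(len(text) - 2):
    follow_counts[text[i:i + 2]][text[i + 2]] += 1

# Generate new "text" by repeatedly sampling the next letter from those counts.
out = "th"
for _ in range(60):
    counts = follow_counts.get(out[-2:])
    if not counts:
        break
    letters, weights = zip(*counts.items())
    out += random.choices(letters, weights=weights, k=1)[0]
print(out)
```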
Let’s try that with words. There are about forty thousand reasonably common words in the English language. By “reading” through a few million books containing a few billion words, we can get an estimate of how common each word is. Using this, we could start generating “sentences” by picking words at random based on the probabilities built up from the “reading” of all that text.
Of course, this would just generate more gibberish of mixed-up words. What can we do to improve that? Just like with individual letters, we can start estimating the probabilities of pairs of words, or even longer combinations of words. If we could use sufficiently long n-gram combinations, we would eventually get to something like ChatGPT. Unfortunately, there is not anywhere close to enough text ever written for us to deduce those probabilities.
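Here is what that pair-of-words idea looks like as a toy sketch, counting which word follows which in a single made-up sentence. Scaling the same idea up to long combinations is exactly where the text runs out, as the numbers in the next paragraph show.

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug by the door"  # stand-in corpus

# Count which word follows each word (a word-level 2-gram).
followers = defaultdict(list)
words = corpus.split()
for first, second in zip(words, words[1:]):
    followers[first].append(second)

# Build a short "sentence" by repeatedly picking a likely follower at random.
sentence = ["the"]
for _ in range(8):
    options = followers.get(sentence[-1])
    if not options:
        break
    sentence.append(random.choice(options))
print(" ".join(sentence))
```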
In a crawl through the Internet there could be over a hundred billion words. If we include books that have been digitized, we might get another hundred billion words. With the forty thousand so-called common words, just the number of possible 2-gram combinations is 1.6 billion, and the number of possible 3-gram combinations is over 60 trillion. There is just no way we could estimate the probabilities for all of those. By the time we get to essay-length combinations of, say, twenty words, the number of possibilities is greater than the number of particles in the universe. We will have to find another way to do this.
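The arithmetic behind that explosion is easy to check for yourself:

```python
vocabulary = 40_000   # roughly the number of commonly used English words

print(f"{vocabulary ** 2:,} possible 2-grams")    # 1,600,000,000 (1.6 billion)
print(f"{vocabulary ** 3:,} possible 3-grams")    # 64,000,000,000,000 (tens of trillions)

# A twenty-word stretch of text has far more possible combinations
# than there are particles in the observable universe (roughly 10^80).
print(f"{vocabulary ** 20:.1e} possible 20-grams")
```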
Models
We create a model: a collection of ideas that we connect to explain a particular phenomenon, like the Bohr model of the atom or the Mundell-Fleming model of the economy. The model may not be perfect, but over time, through experimentation and modification, we can produce something we can use. At the heart of ChatGPT is something called a Large Language Model. The LLM has been constructed to estimate these probabilities, and it does a pretty good job.
Like the Natural Language Processing models, (NLP) commonly found in speech recognition and translation programs, the LLM “understands” human speech. While NLPs focus on the immediate context of words, LLMs take in large volumes of text to better understand the context.
There are no model-less models. Any model has some sort of underlying structure, a set of “knobs you can turn.” These “knobs,” or parameters, are developed during the training of the model, and there can be a lot of them. Even the early versions of GPT have millions of parameters. This is why the model training is unsupervised; humans just can’t keep up. After the unsupervised training, humans start to interact and refine the process. These parameters are a key part of GPT.
GPT-1 contained 117 million parameters and was pre-trained on a large amount of text data using an unsupervised learning technique. The training was not for a specific task, only to predict the next word in a sentence. It produced sentences, even whole paragraphs, that were almost indistinguishable from human-generated text. It was a groundbreaking development.
GPT-2 contained 1.5 billion parameters and was more sophisticated than the previous version. It was trained in much the same way as GPT-1. Although it was capable of creating longer, much more coherent text, it could still run on a powerful PC.
GPT-3 contains 175 billion parameters and it was trained on an enormous amount of text from books, articles and web pages. It is capable of creating text that is nearly indistinguishable from human text. This was the basis for InstructGPT.
GPT-3.5 Turbo is what powers ChatGPT. It is based on the GPT-3 architecture and was trained on a large amount of conversational data, which allows it to understand and respond to natural language queries.
GPT-4 is a multimodal large language model; OpenAI has not disclosed how many parameters it contains. It was trained to predict the next token using both public data and licensed data from third-party providers. It was then fine-tuned with reinforcement learning from human and AI feedback for human alignment and policy compliance. It has been made publicly available in a limited form via ChatGPT Plus, with access to its commercial API via a waitlist. GPT-4 can take images as well as text as input.
Neural Nets
I have been told, (many times) that the more charts, graphs and equations you add to your writing, the more people tend to lose interest and stop reading. So, I am going to attempt to explain neural networks in a few paragraphs without a bunch of charts, graphs or equations. This is an oversimplification, but I believe it will get the idea across.
Neural networks are collections of algorithms modeled loosely on the human brain. This brings up some questions, because we don’t really know how the human brain works. Human brains are made up of a great many neurons. Each neuron is connected to a number of other neurons through a complex linkage that we are only beginning to understand.
Likewise, the nodes in a neural network are interconnected to each other in a complex linkage of layers and vectors. Humans learn from repeating a process until we get the desired result. Neural networks do much the same thing. In both cases we have only a vague understanding of how those linkages develop.
There have been many advancements in the art of neural network training in recent years. Originally it was thought that things needed to be highly structured so that the neural network did as little as possible. However, that proved to be difficult and not very productive. It works better, for most tasks, to just train the neural network on the end-to-end problem and let it “discover”, (often in ways we don’t understand) the necessary intermediate features like encoding, embedding, layers, grouping, weights, vectors, etc. for itself.
Neural networks attempt to follow that structure, with nodes, or neurons, interconnected in layers that loosely resemble the human brain. The patterns they recognize are numerical and contained in vectors, and real-world data is translated into these vectors. Neural networks help us cluster and classify the data into groups and layers of other data with similarities, acting as a sort of clustering and classification layer on top of the data you store and manage.
This grouping and classification is a key part of machine learning. The process of training the model often produces incorrect groupings and classifications at first. These are corrected by subsequent training. That is why it is so important to train the model using a large amount of data.
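To give a flavor of what that training process looks like without any charts or equations, here is a deliberately tiny sketch: a single artificial neuron with two adjustable “knobs” learning the Celsius-to-Fahrenheit rule by guessing, measuring its error, and nudging its knobs, over and over. Real networks do the same thing with billions of knobs; the example and its numbers are invented purely for illustration.

```python
# One "neuron" with two knobs (a weight and a bias) learning Celsius -> Fahrenheit.
weight, bias = 0.0, 0.0
examples = [(0.0, 32.0), (10.0, 50.0), (20.0, 68.0), (37.0, 98.6)]

for _ in range(50_000):                      # many repetitions, like human practice
    for celsius, fahrenheit in examples:
        guess = weight * celsius + bias      # the neuron's current answer
        error = guess - fahrenheit           # how wrong was it?
        weight -= 0.0005 * error * celsius   # nudge each knob to shrink the error
        bias -= 0.0005 * error

print(round(weight, 2), round(bias, 2))      # settles near 1.8 and 32
```

After enough passes the two knobs settle near 1.8 and 32, the familiar conversion formula, even though nobody wrote that formula into the code.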
This is not as solid and scientific as traditional programming in a language like C or Python. It is simply something that was constructed years ago and seems to work. If it seems to be some sort of “black-box”-like function, well, it is. Just as we don’t completely understand the workings of the human brain that the neural network is modeled on, we don’t always understand how the neural network works either. In some cases, even the engineers who designed and built it are unable to explain the methods it employed to perform a task.
Essentially, ChatGPT is constantly training using what OpenAI refers to as Reinforcement Learning from Human Feedback, (RLHF).
Tokens
To be precise, ChatGPT does not deal with words but rather with tokens. Tokens can be whole words or just parts of words like “ing” or “pre” or “ly”. Tokens make it easier for ChatGPT to deal with compound and non-English words. They are also why it can sometimes invent new words.
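If you want to see tokens for yourself, OpenAI publishes an open-source tokenizer library called tiktoken. Here is a minimal sketch, assuming the package is installed, using the encoding associated with its recent chat models:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # encoding used by recent OpenAI chat models
tokens = enc.encode("Ginormous frenemies retweeting")
print(tokens)                                  # a list of integer token IDs
print([enc.decode([t]) for t in tokens])       # the piece of text each ID stands for
```

Running it shows how an unusual word gets broken into familiar pieces, which is how new or made-up words can still be handled.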
But it is not just ChatGPT that is creating new words. Humans create new words from existing words and parts of words too, an ongoing process that keeps the people at Webster’s Dictionary on their toes. For example, “Frenemy”, “Gassed”, “Ginormous”, “Tweep” and “D’oh” are recent additions to our vernacular. Consider all the new words like “countdown” that have been injected into our daily lives by NASA. The people keeping the unabridged dictionary up to date have a lot of work to do.
In technical terms, ChatGPT is just trying to find the next part of a word to reach the most reasonable continuation of the text it has so far. In most cases this works quite well, some people would say amazingly well, but there are a few downsides.
Limitations
However, it is not just ChatGPT that is guilty of these missteps in language. Humans often make many of these mistakes as well. Just watch the evening news or a political speech for a wealth of examples.
Learning
Language translation is much simpler than text generation, and in experiments with it there have been cases where a neural network was taught to translate language A to language B and also language A to language C. On its own, the neural network “learned” to translate language B to language C. Even the people running the model could not fully explain how that happened. The LLM used in ChatGPT is much more advanced than that.
Why does ChatGPT seem to be so good at language? I suspect that at some fundamental level language is simpler than it seems. While we all use language every day and are very dependent on it, how many of us can honestly say that we are truly masters of language? ChatGPT has somehow, even with its rather straightforward neural network structure, apparently been able to capture the essence of our language and the thinking behind it. In its training, ChatGPT has somehow discovered, (implicitly) whatever regularities in language, (and thinking) make this possible.
The success of ChatGPT gives us evidence that there may be new “laws of language” and therefore new “laws of thought” that are out there and have yet to be discovered. Inside ChatGPT’s neural network these laws are, (at best) implicit. So, if we can find a way to make those laws explicit there is the potential to do the kinds of things that ChatGPT does in much more direct, transparent and efficient ways.
What would these new laws be like? I don’t know but I think it will be fascinating to find out.
For more information, visit OpenAI.com. Perhaps you might want to try ChatGPT yourself; it’s free.
If you know someone that you think would enjoy this newsletter, forward it to them and ask them to join using the link at the bottom of the page.
And remember — always back it up!