A few years ago, scientists learned something remarkable about mallard ducklings. If one of the first things the ducklings see after birth is two objects that are similar, the ducklings will later follow new pairs of objects that are similar, too. Hatchlings shown two red spheres at birth will later show a preference for two spheres of the same color, even if they are blue, over two spheres that are each a different color. Somehow, the ducklings pick up and imprint on the idea of similarity, in this case the color of the objects. They can imprint on the notion of dissimilarity too.

What the ducklings do so effortlessly turns out to be very hard for artificial intelligence. This is especially true of a branch of AI known as deep learning or deep neural networks, the technology powering the AI that defeated the world’s Go champion Lee Sedol in 2016. Such deep nets can struggle to figure out simple abstract relations between objects and reason about them unless they study tens or even hundreds of thousands of examples.

To build AI that can do this, some researchers are hybridizing deep nets with what the research community calls “good old-fashioned artificial intelligence,” otherwise known as symbolic AI. The offspring, which they call neurosymbolic AI, are showing duckling-like abilities and then some. “It’s one of the most exciting areas in today’s machine learning,” says Brenden Lake, a computer and cognitive scientist at New York University.

Ducklings exposed to two similar objects at birth will later prefer other similar pairs. If exposed to two dissimilar objects instead, the ducklings later prefer pairs that differ. Ducklings easily learn the concepts of “same” and “different” — something that artificial intelligence struggles to do.

Though still in research labs, these hybrids are proving adept at recognizing properties of objects (say, the number of objects visible in an image and their color and texture) and reasoning about them (do the sphere and cube both have metallic surfaces?), tasks that have proved challenging for deep nets on their own. Neurosymbolic AI is also demonstrating the ability to ask questions, an important aspect of human learning. Crucially, these hybrids need far less training data than standard deep nets and use logic that’s easier to understand, making it possible for humans to track how the AI makes its decisions.

“Everywhere we try mixing some of these ideas together, we find that we can create hybrids that are … more than the sum of their parts,” says computational neuroscientist David Cox, IBM’s head of the MIT-IBM Watson AI Lab in Cambridge, Massachusetts.

Each of the hybrid’s parents has a long tradition in AI, with its own set of strengths and weaknesses. As its name suggests, the old-fashioned parent, symbolic AI, deals in symbols — that is, names that represent something in the world. For example, a symbolic AI built to emulate the ducklings would have symbols such as “sphere,” “cylinder” and “cube” to represent the physical objects, and symbols such as “red,” “blue” and “green” for colors and “small” and “large” for size. Symbolic AI stores these symbols in what’s called a knowledge base. The knowledge base would also have a general rule that says that two objects are similar if they are of the same size or color or shape. In addition, the knowledge base needs propositions, statements that assert something is true or false, which tell the AI that, in some limited world, there’s a big, red cylinder, a big, blue cube and a small, red sphere. All of this is encoded as a symbolic program in a programming language a computer can understand.

Armed with its knowledge base and propositions, symbolic AI employs an inference engine, which uses rules of logic to answer queries. A programmer can ask the AI if the sphere and cylinder are similar. The AI will answer “Yes” (because they are both red). Asked if the sphere and cube are similar, it will answer “No” (because they are not of the same size or color).
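To make this concrete, here is a minimal sketch in Python of a knowledge base and similarity rule along those lines. The object names, the dictionary layout and the similar function are illustrative inventions, not code from any actual system.

    # A toy knowledge base: propositions describing a small, made-up world.
    knowledge_base = {
        "a": {"shape": "cylinder", "color": "red",  "size": "big"},
        "b": {"shape": "cube",     "color": "blue", "size": "big"},
        "c": {"shape": "sphere",   "color": "red",  "size": "small"},
    }

    # The general rule: two objects are similar if they share size, color or shape.
    def similar(x, y):
        a, b = knowledge_base.get(x), knowledge_base.get(y)
        if a is None or b is None:
            raise KeyError("object not in knowledge base")  # missing knowledge means failure
        return any(a[attr] == b[attr] for attr in ("size", "color", "shape"))

    print(similar("c", "a"))  # True: the sphere and the cylinder are both red
    print(similar("c", "b"))  # False: no shared size, color or shape

Note what happens in this sketch when an object is missing from the knowledge base: the program can only fail, which is exactly the brittleness described next.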

In hindsight, such efforts run into an obvious roadblock. Symbolic AI can’t cope with problems in the data. If you ask it questions for which the knowledge is either missing or erroneous, it fails. In the emulated duckling example, the AI doesn’t know whether a pyramid and cube are similar, because a pyramid doesn’t exist in the knowledge base. To reason effectively, therefore, symbolic AI needs large knowledge bases that have been painstakingly built using human expertise. The system cannot learn on its own.

On the other hand, learning from raw data is what the other parent does particularly well. A deep net, modeled after the networks of neurons in our brains, is made of layers of artificial neurons, or nodes, with each layer receiving inputs from the previous layer and sending outputs to the next one. Information about the world is encoded in the strength of the connections between nodes, not as symbols that humans can understand.

Take, for example, a neural network tasked with telling apart images of cats from those of dogs. The image — or, more precisely, the value of each pixel in the image — is fed to the first layer of nodes, and the final layer of nodes produces as an output the label “cat” or “dog.” The network has to be trained using pre-labeled images of cats and dogs. During training, the network adjusts the strengths of the connections between its nodes such that it makes fewer and fewer mistakes while classifying the images. Once trained, the deep net can be used to classify a new image.
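The loop below is a minimal sketch of that training process, using NumPy with randomly generated stand-in data rather than real photos, and a single layer of adjustable weights rather than a deep stack of them; the variable names and the learning rate are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-ins for labeled images: 200 samples of 64 "pixel" values each,
    # with label 0 ("cat") or 1 ("dog") assigned by an arbitrary hidden rule.
    X = rng.normal(size=(200, 64))
    y = (X[:, :32].sum(axis=1) > 0).astype(float)

    w = np.zeros(64)   # connection strengths, adjusted during training
    b = 0.0

    for step in range(500):
        p = 1 / (1 + np.exp(-(X @ w + b)))   # the network's current guesses
        grad_w = X.T @ (p - y) / len(y)      # how each weight contributed to the errors
        grad_b = (p - y).mean()
        w -= 0.5 * grad_w                    # nudge the weights to make fewer mistakes
        b -= 0.5 * grad_b

    predictions = (1 / (1 + np.exp(-(X @ w + b)))) > 0.5
    print("training accuracy:", (predictions == y).mean())

Even this toy version makes the key point: what the model “knows” after training lives entirely in the numbers stored in w, not in any symbols a person can read.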

Deep nets have proved immensely powerful at tasks such as image and speech recognition and translating between languages. “The progress has been amazing,” says Thomas Serre of Brown University, who explored the strengths and weaknesses of deep nets in visual intelligence in the 2019 Annual Review of Vision Science. “At the same time, because there’s so much interest, the limitations are becoming clearer and clearer.”

Acquiring training data is costly, sometimes even impossible. Deep nets can be fragile: Adding noise to an image that would not faze a human can stump a deep neural net, causing it to classify a panda as a gibbon, for example. Deep nets find it difficult to reason and answer abstract questions (are the cube and cylinder similar?) without large amounts of training data. They are also notoriously inscrutable: Because there are no symbols, only millions or even billions of connection strengths, it’s nearly impossible for humans to work out how the computer reaches an answer. That means the reasons why a deep net classified a panda as a gibbon are not easily apparent, for example.

Deep nets can be vulnerable to noise in the data. Here, a deep net correctly identifies an image of a panda (left) with 57.7 percent confidence. But adding a small amount of white noise to the image (indiscernible to humans) causes the deep net to misidentify it as a gibbon with 99.3 percent confidence.

Since some of the weaknesses of neural nets are the strengths of symbolic AI and vice versa, neurosymbolic AI would seem to offer a powerful new way forward. Roughly speaking, the hybrid uses deep nets to replace humans in building the knowledge base and propositions that symbolic AI relies on. It harnesses the power of deep nets to learn about the world from raw data and then uses the symbolic components to reason about it.

Researchers into neurosymbolic AI were handed a challenge in 2016, when Fei-Fei Li of Stanford University and colleagues published a task that required AI systems to “reason and answer questions about visual data.” To this end, they came up with what they called the Compositional Language and Elementary Visual Reasoning, or CLEVR, dataset. It contained 100,000 computer-generated images of simple 3-D shapes (spheres, cubes, cylinders and so on). The challenge for any AI is to analyze these images and answer questions that require reasoning. Some questions are simple (“Are there fewer cubes than red things?”), but others are much more complicated (“There is a large brown block in front of the tiny rubber cylinder that is behind the cyan block; are there any big cyan metallic cubes that are to the left of it?”).

It’s possible to solve this problem using sophisticated deep neural networks. However, Cox’s colleagues at IBM, along with researchers at Google’s DeepMind and MIT, came up with a distinctly different solution that shows the power of neurosymbolic AI.

A hybrid approach, known as neurosymbolic AI, combines features of the two main AI strategies. In symbolic AI (upper left), humans must supply a “knowledge base” that the AI uses to answer questions. Deep nets (upper right) are trained to arrive at correct answers. During training, they adjust the strength of the connections between layers of nodes. The hybrid uses deep nets, instead of humans, to generate only those portions of the knowledge base that it needs to answer a given question.

The researchers broke the problem into smaller chunks familiar from symbolic AI. In essence, the system first had to look at an image, characterize the 3-D shapes and their properties, and generate a knowledge base. Then it had to turn an English-language question into a symbolic program that could operate on the knowledge base and produce an answer. In symbolic AI, human programmers would perform both these steps. The researchers decided to let neural nets do the job instead.

The team solved the first problem by using a number of convolutional neural networks, a type of deep net that’s optimized for image recognition. In this case, each network is trained to examine an image and identify an object and its properties such as color, shape and type (metallic or rubber).

The second module uses something called a recurrent neural network, another type of deep net designed to uncover patterns in inputs that come sequentially. (Speech is sequential information, for example, and speech recognition programs like Apple’s Siri use a recurrent network.) In this case, the network takes a question and transforms it into a query in the form of a symbolic program. The output of the recurrent network is also used to decide which convolutional networks are tasked with looking over the image, and in what order. This entire process is akin to generating a knowledge base on demand, and having an inference engine run the query on the knowledge base to reason and answer the question.
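Here is a hypothetical, stripped-down sketch of how those pieces fit together. In the real system, neural networks produce both the object list and the program; here both are hard-coded so the symbolic execution step is easy to follow, and the scene, the question and the operation names are all invented for illustration.

    # Objects a perception network might have extracted from one image (made up here).
    scene = [
        {"shape": "cube",     "color": "red",  "material": "metal"},
        {"shape": "sphere",   "color": "red",  "material": "rubber"},
        {"shape": "cylinder", "color": "blue", "material": "metal"},
    ]

    # A symbolic program a question-parsing network might emit for
    # "How many red objects are there?"
    program = [("filter", "color", "red"), ("count",)]

    def execute(program, objects):
        """A miniature inference engine: run each operation on the scene in order."""
        result = objects
        for op, *args in program:
            if op == "filter":
                attr, value = args
                result = [o for o in result if o[attr] == value]
            elif op == "count":
                result = len(result)
        return result

    print(execute(program, scene))  # 2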

The researchers trained this neurosymbolic hybrid on a subset of question-answer pairs from the CLEVR dataset, so that the deep nets learned how to recognize the objects and their properties from the images and how to process the questions properly. Then, they tested it on the remaining part of the dataset, on images and questions it hadn’t seen before. Overall, the hybrid was 98.9 percent accurate — even beating humans, who answered the same questions correctly only about 92.6 percent of the time.

In the CLEVR challenge, artificial intelligences were faced with a world containing geometric objects of various sizes, shapes, colors and materials. The AIs were then given English-language questions (examples shown) about the objects in their world.

Better yet, the hybrid needed only about 10 percent of the training data required by solutions based purely on deep neural networks. When a deep net is being trained to solve a problem, it’s effectively searching through a vast space of potential solutions to find the correct one. This requires enormous quantities of labeled training data. Adding a symbolic component reduces the space of solutions to search, which speeds up learning.

Most important, if a mistake occurs, it’s easier to see what went wrong. “You can check which module didn’t work properly and needs to be corrected,” says team member Pushmeet Kohli of Google DeepMind in London. For example, debuggers can inspect the knowledge base or processed question and see what the AI is doing.

The hybrid AI is now tackling more difficult problems. In 2019, Kohli and colleagues at MIT, Harvard and IBM designed a more sophisticated challenge in which the AI has to answer questions based not on images but on videos. The videos feature the types of objects that appeared in the CLEVR dataset, but these objects are moving and even colliding. Also, the questions are tougher. Some are descriptive (“How many metal objects are moving when the video ends?”), some require prediction (“Which event will happen next? [a] The green cylinder and the sphere collide; [b] The green cylinder collides with the cube”), while others are counterfactual (“Without the green cylinder, what will not happen? [a] The sphere and the cube collide; [b] The sphere and the cyan cylinder collide; [c] The cube and the cyan cylinder collide”).

Such causal and counterfactual reasoning about things that are changing with time is extremely difficult for today's deep neural networks, which mainly excel at discovering static patterns in data, Kohli says.

To address this, the team augmented the earlier solution for CLEVR. First, a neural network learns to break up the video clip into a frame-by-frame representation of the objects. This is fed to another neural network, which learns to analyze how these objects move and interact with each other, and to predict the motion of the objects and any collisions. Together, these two modules generate the knowledge base. The other two modules process the question and apply it to the generated knowledge base. The team’s solution was about 88 percent accurate in answering descriptive questions, about 83 percent for predictive questions and about 74 percent for counterfactual queries, by one measure of accuracy. The challenge is out there for others to improve upon these results.

This video shows a more sophisticated challenge, called CLEVRER, in which artificial intelligences had to answer questions about video sequences showing objects in motion. The video previews the sorts of questions that could be asked, and later parts of the video show how one AI converted the questions into machine-understandable form.


Good question

Asking good questions is another skill that machines struggle with and humans, even children, excel at. “It’s a way to consistently learn about the world without having to wait for tons of examples,” says Lake of NYU. “There’s no machine that comes anywhere close to the human ability to come up with questions.”

Neurosymbolic AI is showing glimmers of such expertise. Lake and his student Ziyun Wang built a hybrid AI to play a version of the game Battleship. The game involves a 6-by-6 grid of tiles, hidden under which are three ships one tile wide and two to four tiles long, oriented either vertically or horizontally. Each move, the player can either choose to flip a tile to see what’s underneath (gray water or part of a ship) or ask any question in English. For example, the player can ask: “How long is the red ship?” or “Do all three ships have the same size?” and so on. The goal is to correctly guess the location of the ships.

Lake and Wang’s neurosymbolic AI has two components: a convolutional neural network to recognize the state of the game by looking at a game board, and another neural network to generate a symbolic representation of a question.

The team used two different techniques to train their AI. For the first method, called supervised learning, the team showed the deep nets numerous examples of board positions and the corresponding “good” questions (collected from human players). The deep nets eventually learned to ask good questions on their own, but were rarely creative. The researchers also used another form of training called reinforcement learning, in which the neural network is rewarded each time it asks a question that actually helps find the ships. Again, the deep nets eventually learned to ask the right questions, which were both informative and creative.

Lake and other colleagues had previously solved the problem using a purely symbolic approach, in which they collected a large set of questions from human players, then designed a grammar to represent these questions. “This grammar can generate all the questions people ask and also infinitely many other questions,” says Lake. “You could think of it as the space of possible questions that people can ask.” For a given state of the game board, the symbolic AI has to search this enormous space of possible questions to find a good question, which makes it extremely slow. The neurosymbolic AI, however, is blazingly fast. Once trained, the deep nets far outperform the purely symbolic AI at generating questions.
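To see why that symbolic search is so slow, consider one common way to score a candidate question: its expected information gain over all the board layouts still consistent with what the player has seen so far. (This is a standard measure from the question-asking literature, not necessarily the exact one Lake’s team used.) The sketch below uses a tiny, made-up hypothesis space; a real board has far too many possible layouts, and the grammar generates far too many questions, to score them all quickly.

    import math
    from collections import Counter

    # A tiny stand-in hypothesis space: board layouts still consistent with the
    # revealed tiles, summarized here only by the hidden red ship's length.
    hypotheses = [{"red_len": 2}, {"red_len": 2}, {"red_len": 3}, {"red_len": 4}]

    def expected_information_gain(answer_fn):
        """With equally likely hypotheses and deterministic answers, the expected
        information gain of a question equals the entropy of its answer distribution."""
        counts = Counter(answer_fn(h) for h in hypotheses)
        total = sum(counts.values())
        return sum((n / total) * math.log2(total / n) for n in counts.values())

    # "How long is the red ship?" separates the hypotheses, so it is informative.
    print(expected_information_gain(lambda h: h["red_len"]))       # 1.5 bits
    # "Is the red ship longer than one tile?" is always true here, so it is useless.
    print(expected_information_gain(lambda h: h["red_len"] > 1))   # 0.0 bits

A purely symbolic player must run this kind of evaluation over an enormous set of candidate questions on every turn; the trained deep nets skip the search and simply propose a question.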

The hybrid artificial intelligence learned to play a variant of the game Battleship, in which the player tries to locate hidden “ships” on a game board. In this version, each turn the AI can either reveal one square on the board (which will be either a colored ship or gray water) or ask any question about the board. The hybrid AI learned to ask useful questions, another task that’s very difficult for deep neural networks.

Not everyone agrees that neurosymbolic AI is the best route to more powerful artificial intelligence. Serre, of Brown, thinks this hybrid approach will be hard-pressed to come close to the sophistication of abstract human reasoning. Our minds create abstract symbolic representations of objects such as spheres and cubes, for example, and do all kinds of visual and nonvisual reasoning using those symbols. We do this using our biological neural networks, apparently with no dedicated symbolic component in sight. “I would challenge anyone to look for a symbolic module in the brain,” says Serre. He thinks other ongoing efforts to add features to deep neural networks that mimic human abilities such as attention offer a better way to boost AI’s capacities.

DeepMind’s Kohli has more practical concerns about neurosymbolic AI. He is worried that the approach may not scale up to handle problems bigger than those being tackled in research projects. “At the moment, the symbolic part is still minimal,” he says. “But as we expand and exercise the symbolic part and address more challenging reasoning tasks, things might become more challenging.” For example, among the biggest successes of symbolic AI are systems used in medicine, such as those that diagnose a patient based on their symptoms. These have massive knowledge bases and sophisticated inference engines. The current neurosymbolic AI isn’t tackling problems anywhere nearly so big.

Cox’s team at IBM is taking a stab at it, however. One of their projects involves technology that could be used for self-driving cars. The AI for such cars typically involves a deep neural network that is trained to recognize objects in its environment and take the appropriate action; the deep net is penalized when it does something wrong during training, such as bumping into a pedestrian (in a simulation, of course). “In order to learn not to do bad stuff, it has to do the bad stuff, experience that the stuff was bad, and then figure out, 30 steps before it did the bad thing, how to prevent putting itself in that position,” says MIT-IBM Watson AI Lab team member Nathan Fulton. Consequently, learning to drive safely requires enormous amounts of training data, and the AI cannot be trained out in the real world.

Fulton and colleagues are working on a neurosymbolic AI approach to overcome such limitations. The symbolic part of the AI has a small knowledge base about some limited aspects of the world and the actions that would be dangerous given some state of the world. They use this to constrain the actions of the deep net — preventing it, say, from crashing into an object.
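A schematic sketch of that idea follows, written in the spirit of “shielded” decision-making rather than as IBM’s actual system: a small symbolic rule vets every action the neural policy proposes before it is executed. The state description, the action names and the distance threshold are all invented for illustration.

    import random

    ACTIONS = ["accelerate", "brake", "steer_left", "steer_right"]

    def neural_policy(state):
        """Stand-in for a deep net (possibly still learning) that proposes an action."""
        return random.choice(ACTIONS)

    def is_safe(state, action):
        """Tiny symbolic knowledge base: accelerating toward a nearby obstacle is unsafe."""
        return not (action == "accelerate" and state["distance_to_obstacle"] < 5.0)

    def choose_action(state):
        proposed = neural_policy(state)
        if is_safe(state, proposed):
            return proposed
        return "brake"  # fall back to a conservative action instead of exploring the bad one

    print(choose_action({"distance_to_obstacle": 3.2}))  # never "accelerate"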

This simple symbolic intervention drastically reduces the amount of data needed to train the AI by excluding certain choices from the get-go. “If the agent doesn’t need to encounter a bunch of bad states, then it needs less data,” says Fulton. While the project still isn’t ready for use outside the lab, Cox envisions a future in which cars with neurosymbolic AI could learn out in the real world, with the symbolic component acting as a bulwark against bad driving.

So, while naysayers may decry the addition of symbolic modules to deep learning as unrepresentative of how our brains work, proponents of neurosymbolic AI see its modularity as a strength when it comes to solving practical problems. “When you have neurosymbolic systems, you have these symbolic choke points,” says Cox. These choke points are places in the flow of information where the AI resorts to symbols that humans can understand, making the AI interpretable and explainable, while providing ways of creating complexity through composition. “That’s tremendously powerful,” says Cox.

Editor’s note: This article was updated October 15, 2020, to clarify the viewpoint of Pushmeet Kohli on the capabilities of deep neural networks.