Google’s Gemini is the world’s most capable multimodal AI yet
Because of its ability to understand nuanced information, Gemini can answer questions that machines previously could not solve without being given more context or added metadata
Earlier this year, Google AI's Brain division merged with DeepMind, a British-American artificial intelligence research lab that Google acquired in 2014. The first 'big' thing to come from this newly formed team, dubbed Google DeepMind, is the 'GPT-4 killer' Gemini.
Google's Gemini is a multimodal large language model (LLM) that succeeds PaLM 2, with improvements in efficiency and multimodal capability, and an architecture designed to accommodate future additions such as memory and planning.
In almost every standardised benchmark, Gemini outperforms its contemporaries, including OpenAI's widely praised GPT-4. But what surprised everyone the most during its 6 December announcement was that Gemini is the first AI model to outperform human experts on Massive Multitask Language Understanding (MMLU).
That means that, on this standardised test of an AI model's knowledge and problem-solving across dozens of subjects, Gemini scores higher than humans who are considered the definitive experts in their respective fields.
But the initial shock of the 'Gemini Era' came from its monumental multimodal capabilities.
With a training set that spans text, code, images, audio, and video, Gemini, unlike other AI language models, can understand video and audio on top of text, pictures, and code.
More importantly, its multimodality is built in from the ground up, so its reasoning across textual, verbal, and nonverbal modes is seamless. That means, instead of typing out a complex question that is hard to put into words, a user can simply show, explain, and illustrate the question in a video, as shown in the demonstration during the presentation.
From that interaction, Gemini can understand what is happening in the video, what the person in it is saying, and even nonverbal cues like hand gestures, and use all of that context to shape its answer.
Since it is trained to recognise, understand, and interpret text, video, and audio simultaneously, it can even identify motion and change its answer accordingly, making it particularly useful in applied mathematics, physics, engineering, statistics, and simulations.
Due to its ability to understand nuanced information, it can answer questions that machines previously could not solve without being given more context or added metadata.
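To make that concrete, here is a minimal sketch of how a developer might send a multimodal prompt, an image plus a question, to Gemini. It assumes the google-generativeai Python package, an API key from Google AI Studio, and the gemini-pro-vision model name; the file name and prompt are purely illustrative.

```python
import PIL.Image
import google.generativeai as genai

# Assumption: the google-generativeai package and an API key from Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")

# Assumption: gemini-pro-vision is the multimodal (image + text) variant exposed via the API.
model = genai.GenerativeModel("gemini-pro-vision")

# Instead of describing the problem in words, hand the model the picture itself.
diagram = PIL.Image.open("physics_problem.png")  # hypothetical file
response = model.generate_content(
    [diagram, "What is this problem asking, and how would you solve it?"]
)

print(response.text)
```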
The first version of Gemini also understands and can generate code in programming languages like Python, C++, and Java. AlphaCode 2, a competitive-programming system built on a specialised version of Gemini, can reason over complex problems and work across different programming languages to generate high-quality code, making Gemini one of the most capable AI models for coding. In competitive programming contests, AlphaCode 2 performs better than an estimated 85% of human participants, and Gemini can write in seconds a block of code that would take a human hours or even days to finish.
To accommodate everyone and every environment, Google Gemini comes in three sizes.
Gemini Nano, the smallest of the three, is Google's most efficient model, built for on-device tasks.
Gemini Pro is the larger, mid-range model, built to scale across a wide range of tasks. A fine-tuned version of it has already been integrated into Bard for more advanced reasoning, understanding, and execution. Gemini Pro is also available to enterprise clients and developers via the Gemini API in Google Cloud Vertex AI or Google AI Studio.
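For developers, a text-only call through that API might look like the minimal sketch below. It assumes the google-generativeai Python package and the gemini-pro model name as exposed in Google AI Studio; Vertex AI uses its own client library and authentication, so treat this as an illustration rather than a definitive integration.

```python
import google.generativeai as genai

# Assumption: an API key issued from Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")

# Assumption: gemini-pro is the model name served through the Gemini API.
model = genai.GenerativeModel("gemini-pro")

response = model.generate_content(
    "Summarise the difference between Gemini Nano, Pro, and Ultra in three sentences."
)
print(response.text)
```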
Gemini Ultra is the largest and most capable of the three and can handle highly complex tasks that require advanced AI capabilities. Google plans to launch Bard Advanced, a new and richer AI experience, early next year with the capabilities of Gemini Ultra.
Gemini is also designed so that newer capabilities like memory and planning can be integrated into the architecture of the model. This future-proofing, along with Google's plan to make parts of Gemini open source for more collaborative innovation, makes it clear that Google wants Gemini to be an integral part of its products for decades to come.