Time»Step - My Thoughts on Google I/O

The Start of a New Era

Yesterday (at the time of writing) Google held a keynote presentation where they unveiled the current trajectory of the company (and, in many ways, the future in general).

I’m a data scientist, engineering director, and AI researcher. This is my take.

The Release

I’ll be discussing Google’s I/O keynote presentation.

In my opinion, these are the main highlights:

  1. Improvements in the Gemini family of multimodal models

  2. Agentic Systems

  3. Integration of AI products

  4. Generative Capabilities

  5. Safety

Google Gemini

There were a lot of exciting announcements; let’s start with the technology powering them all. Gemini is Google's family of multimodal AI models.

A “modality” in machine learning can be thought of as a form of data. Speech, text, images, video: these are all considered different modalities, each with unique challenges that historically meant you could only train on one (or maybe two) modalities at a time. For that reason, many current AI applications are actually a family of AI models that work together to answer your questions.

The idea of multimodal models is to break down those barriers and allow a single AI model to understand numerous types of data. Thanks to recent advancements in AI, a single model can now be trained on numerous different modalities at the same time.
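
To make that concrete, here’s a minimal sketch (in PyTorch, and emphatically not Gemini’s actual architecture) of the basic idea: each modality gets its own adapter into a shared embedding space, and a single transformer attends over all of the tokens together. Every name, dimension, and the random inputs below are purely illustrative.

```python
import torch
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    """One transformer over a shared token space for image and text (toy example)."""
    def __init__(self, vocab_size=1000, image_feat_dim=512, d_model=256):
        super().__init__()
        # Each modality gets its own adapter into the shared embedding space.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(image_feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, text_tokens, image_features):
        # text_tokens: (batch, text_len) integer token ids
        # image_features: (batch, num_patches, image_feat_dim), e.g. from a vision encoder
        text = self.text_embed(text_tokens)
        image = self.image_proj(image_features)
        # Concatenate both modalities into one sequence so attention runs
        # across image and text tokens together.
        hidden = self.encoder(torch.cat([image, text], dim=1))
        return self.head(hidden)

model = ToyMultimodalModel()
text = torch.randint(0, 1000, (1, 8))   # a short (random) text prompt
image = torch.randn(1, 16, 512)         # 16 (random) image patch features
print(model(text, image).shape)         # torch.Size([1, 24, 1000])
```

Because attention runs over the concatenated sequence, text tokens can “see” image tokens directly, which is where the cross-modal transfer described next comes from.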

One of the coolest things about multimodal models is the transfer of learning between modalities. If you train a model on text, it probably sees expressions like this in its training data:

There was not one, not two, but three reasons why…

As a result of training on text, the model would naturally start to learn what “three” is. Then, the AI model might see an image like this in its training set:

For a multimodal model, the concept of “three” can be understood not just textually, and not just visually, but in both modalities simultaneously. This can, in theory, make the model’s understanding of “three” more robust than if it were trained on any single modality. This is why data scientists consider high degrees of multimodality to be critical in achieving next-generation AI models.

If you want to learn a bit more about multimodal models, you can check out my article on Flamingo, which is where multi-modality really started making big strides.

The capability of Gemini on multimodal problems is impressive. In the keynote they showed off a demo of Gemini reading code from a video and interpreting it.

This is not a trivial problem. The comprehension of dense text from an image is a known and challenging problem for multimodal systems, and it seems like Gemini handled it deftly.

During this demo they asked Gemini a variety of challenging questions, and Gemini aced them all. This one was particularly cool:

What’s crazy is that this was all done via video. During the demo the user asked if Gemini happened to see their glasses, and Gemini remembered that they were by the red apple.

This functionality isn’t only useful in personal assistant applications; new design paradigms will allow it to be applied to a wide array of products. This is, to a large degree, made possible with agentic systems.

Agentic Systems

Agents are a design paradigm around AI. Basically, instead of talking to an AI model directly, you wrap it in some code that can entice the model to plan, choose to take actions and use tools, and reflect on its previous actions.

The power of an agent is in tool use. Depending on what tools you give an agent access to, it can do drastically different tasks. Imagine giving an agent the following tools:

  • Ability to read emails

  • Ability to parse receipts

  • Ability to create a Google Sheet

  • Ability to edit a Google Sheet

With a properly designed agent, it could perform a whole variety of tasks (a rough sketch of such an agent loop follows the list below):

  • Keep track of your emails that you haven’t replied to

  • Look through your invoices and keep track of expenses

  • And so many more applications
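
Here’s what that wrapping might look like in practice: a minimal sketch of an agent loop for the expense-tracking example, with a stubbed-out plan_next_step standing in for the LLM. The tool functions and their behavior are purely illustrative, not any real Gmail or Sheets API.

```python
from typing import Callable

# Illustrative stand-ins for real tools (Gmail, Sheets, etc.); not real APIs.
def read_emails() -> list:
    return ["Invoice: $19.99 from CloudCo", "Re: lunch on Friday?"]

def parse_receipt(email: str) -> dict:
    vendor = email.split("from ")[-1]
    amount = float(email.split("$")[1].split()[0])
    return {"vendor": vendor, "amount": amount}

def append_to_sheet(row: dict) -> str:
    return f"added {row} to the expense sheet"

TOOLS: dict = {
    "read_emails": read_emails,
    "parse_receipt": parse_receipt,
    "append_to_sheet": append_to_sheet,
}

def plan_next_step(goal: str, observations: list) -> dict:
    """Stand-in for the LLM call. A real agent would prompt the model with the
    goal and past observations and ask it to pick the next tool (or to finish)."""
    if not observations:
        return {"tool": "read_emails", "args": []}
    last = observations[-1]
    if isinstance(last, list):                       # just read the inbox
        invoices = [e for e in last if "Invoice" in e]
        return {"tool": "parse_receipt", "args": [invoices[0]]}
    if isinstance(last, dict):                       # just parsed a receipt
        return {"tool": "append_to_sheet", "args": [last]}
    return {"tool": None, "answer": "Expense logged."}  # model decides it is done

def run_agent(goal: str, max_steps: int = 5) -> str:
    observations = []
    for _ in range(max_steps):
        step = plan_next_step(goal, observations)    # plan: model proposes an action
        if step["tool"] is None:
            return step["answer"]
        result = TOOLS[step["tool"]](*step["args"])  # act: call the chosen tool
        observations.append(result)                  # reflect: result informs the next plan
    return "Stopped after max_steps."

print(run_agent("Log my invoices in a spreadsheet"))
```

The key design choice is that the loop, not the model, executes the tools: the model only proposes the next action and sees the results, which is what makes swapping in different tools so easy.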

This flexibility is why Google is leaning hard on agents. It seems like Google wants to integrate AI agents into basically all of their product offerings.

In one example they showed how an agent can look at a video, understand not only the question but also plan out a list of searches, and then “do the googling for you” to get you an answer.

In another demo they showed how an agent could plan a weekly meal plan based on your needs, and change those recipes based on your recommendations.

In another demo they showed how agents can help you organize your emails by bridging your Gmail with Google Sheets.

In another demo they showed how agents can be integrated into Google Workspace to act as a virtual teammate that can answer questions and help finish tasks.

They even showed an example of how an agent could return a pair of shoes for you. There were a lot of demos of agents doing cool stuff.

Google will be rolling out agents in every single one of their products over the coming months. In fact, if you have an Android phone, you can use Gemini now.

Between high degrees of multimodality and agentic systems, one thing is clear.

Google Wants To Integrate AI into Everything

And I see why.

The potential value of agentic systems that connect people with high-quality, valuable information and automate difficult tasks to make people more productive is impossible to even estimate. We’re likely looking at a fundamental shift in the way people use technology.

This is a new era. One of the things powering this new era is generative AI.

More Than Just Pictures

Generative systems are becoming faster and more performant, and they don’t just render cool pictures anymore. Audio generation is becoming ridiculously powerful, and it’s opening up new doors in terms of how people interact with AI.

In one demo a parent asked Gemini to teach his son physics. The response from Gemini was unlike anything I’ve ever seen: to make the lesson more engaging, the model created two voices and had them talk with each other, podcast style, about physics. His son could jump in at any time and ask questions.

As Google leans more heavily on AI as a platform, new products will be able to leverage these capabilities, reshaping technology as we know it.

Safety

Everyone likes to skip safety, which is stupid. If you’re at the helm of AI research, you’re a bad person if you don’t consider the harm these massive technologies can cause.

It seems like Google is doing a variety of things to try to encourage safety:

  • Advanced Red Teaming: Trying to get these models to produce harmful content so they can design more robust systems to prevent it

  • AI watermarking: Making AI-generated content identifiable, so people can better discern reality from AI-generated images, audio, and video

They also showed how powerful AI can be in helping people:

  • AlphaFold is enabling more robust protein research to push healthcare forward

  • Advanced AI flood forecasting is helping protect people from flooding around the globe

  • AI can help educate, encouraging an era of more discerning minds

Basically, Google is saying “We’re trying our best to keep this safe, and we think it will be useful.” To a large extent, I agree with that, and I believe Google.

As we transition to a new era of AI, new challenges will arise. The internet gave rise to mass communication, as well as unprecedented rates of mental illness through social media. Big technological breakthroughs cut both ways, and it’s up to all of us to be diligent, discerning, skeptical, and optimistic about the future.