Gemini task automation is slow, clunky, and super impressive

Photo: The Verge AI
Nine minutes: that is exactly how long it took artificial intelligence to order dinner. In the world of mobile technology that is an eternity, yet it represents a breakthrough in the development of personal assistants. Google Gemini, tested on the Pixel 10 Pro and Galaxy S26 Ultra, has just gained a task automation feature that lets it take control of the screen and operate third-party applications on its own. The solution is currently in beta and limited to selected rideshare and food delivery services, but for the first time users can watch AI act autonomously in real-world conditions rather than in controlled demos.

The system runs in the background, so the user can put the phone down while Gemini analyzes a menu, adds items to the cart (understanding, for example, that two half-portions make up one full dish), or schedules a trip to the airport based on calendar data. The process can be slow and occasionally clumsy, with the assistant overlooking an item on a list several times, yet its success rate in finalizing orders is surprisingly high. For security, the AI stops just before the payment button and requires final human confirmation.

For users, this signifies a paradigm shift: the phone ceases to be merely a tool we operate and becomes an autonomous agent performing tedious tasks for us. It marks the end of the era of simple voice commands and the beginning of genuine delegation of digital duties.
The vision of a digital assistant that relieves us of the tedium of clicking through apps has featured in tech giants' promotional materials for years, but reality has rarely lived up to the promises. What Siri or Google Assistant offered for a decade was more a set of simple voice scripts than autonomous action. The debut of task automation in Gemini, tested on the flagship Pixel 10 Pro and Samsung Galaxy S26 Ultra, marks a turning point, however. Although the system is currently in beta and can be frustratingly slow, for the first time we are dealing with technology that actually takes the reins of the smartphone interface.
When AI takes control of the screen
The new Gemini feature is not just about generating text or summarizing emails. It is an attempt to create an AI agent: software that understands the structure of mobile applications designed for humans and can navigate through them. In practice, the user issues one general command and Gemini begins "clicking" on their behalf. For now, the system supports a limited number of services, focused mainly on food delivery and transportation, such as Uber and Uber Eats.
- Autonomous navigation: The AI can independently scroll through menus, add products to the cart, and select delivery options.
- On-the-fly reasoning: The system demonstrates surprising logic — for example, when a menu only offers a "half portion," Gemini can add two items to fulfill an order for a full meal.
- Background work: Automation does not require constant attention; the process can run while the user does something else, a key advantage over tapping through the apps manually.
Despite these advantages, the process is far from instantaneous. Ordering dinner, which takes a human two minutes, can take Gemini up to nine minutes. The system "thinks" about every step, analyzes the screen content, and sometimes gets lost in a maze of buttons, which resembles watching a novice smartphone user struggling to find the right icons.
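Neither Google nor Samsung has documented how the agent is wired internally, but the behavior described above maps onto a classic perceive-reason-act loop. The Python sketch below is purely illustrative: `propose_action` stands in for a vision-capable model call, and screen capture and taps go through `adb`; none of this reflects any published Gemini API.

```python
import subprocess
import time

# Illustrative perceive-reason-act loop for a screen-driven agent.
# propose_action() is a stand-in for a vision-language model call;
# it is NOT a real Gemini API, just an assumption for this sketch.

def capture_screen(path: str = "/tmp/screen.png") -> str:
    """Grab the current screen as a PNG via adb (needs a connected device)."""
    png = subprocess.run(["adb", "exec-out", "screencap", "-p"],
                         capture_output=True, check=True).stdout
    with open(path, "wb") as f:
        f.write(png)
    return path

def tap(x: int, y: int) -> None:
    """Inject a tap at pixel coordinates, the way a finger would."""
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)],
                   check=True)

def propose_action(screenshot: str, goal: str) -> dict:
    """Stand-in for the model: given pixels and a goal, return the next
    UI action, e.g. {"type": "tap", "x": 540, "y": 1200} or {"type": "done"}.
    Each of these calls is where the minutes of "thinking" go."""
    raise NotImplementedError("replace with a vision-language model call")

def run_agent(goal: str, max_steps: int = 50) -> None:
    """Look at the screen, pick one action, act, repeat."""
    for _ in range(max_steps):
        action = propose_action(capture_screen(), goal)
        if action["type"] == "done":
            return
        if action["type"] == "tap":
            tap(action["x"], action["y"])
        time.sleep(1.0)  # let the app re-render before looking again

# run_agent("Order a full portion of pad thai from Uber Eats")
```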
The human interface barrier
The biggest challenge for Gemini is not a lack of computing power but the fact that today's applications are optimized for the human eye and finger, not for AI algorithms. Pop-up ads, complex graphical layouts, and ambiguous dish naming (e.g., "set" instead of "plate") are traps Gemini falls into regularly. Watching the model hunt for an appetizer sitting right in the middle of the screen can be painful.
This is a fundamental paradox: we are forcing the world's most advanced language models to interpret interfaces that are completely unnatural to them. AI doesn't need buttons, high-resolution photos, or promotional banners — it needs clean data.
Google's current approach, based on pure visual reasoning, is treated as a stopgap. The industry is moving toward standards such as the Model Context Protocol (MCP) and Android App Functions, which are intended to let applications share their functions directly with AI models, bypassing the visual layer. Until that happens, Gemini will be condemned to laboriously "clicking" through pixels, which will always generate delays and errors.
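To make the contrast concrete, here is roughly what the same capability looks like once an app exposes it as a structured function instead of pixels. This minimal sketch uses the open-source MCP Python SDK; the add_to_cart tool and its fields are invented for illustration and do not correspond to any real delivery service's schema.

```python
# Minimal MCP server sketch using the official Python SDK (pip install mcp).
# The tool itself is hypothetical: no delivery app exposes this today,
# which is precisely the article's point.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("food-delivery-demo")

@mcp.tool()
def add_to_cart(restaurant: str, item: str, quantity: int = 1) -> str:
    """Add a menu item to the order. Because the model calls this
    function directly, there are no banners, pop-ups, or ambiguous
    labels to misread, and no minutes of scrolling through pixels."""
    # A real app would call its internal order API here.
    return f"Added {quantity} x {item} from {restaurant} to the cart."

if __name__ == "__main__":
    mcp.run()  # serves the tool over the Model Context Protocol (stdio)
```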
Context that changes the rules of the game
However, the true power of Gemini reveals itself when artificial intelligence connects the dots between different Google services. In scenario tests, the AI showed impressive initiative in travel planning. With only general flight information saved in the calendar, Gemini was able to independently check the departure time from an email, calculate the optimal travel time to the airport taking into account the user's location, and propose booking an Uber for a specific time.
This is precisely where the difference between old assistants and the new generation lies. Traditional systems required precise commands ("Book an Uber for 11:30"). Gemini understands the intent ("Get me to my flight tomorrow on time") and performs the analytical work itself. The fact that the system distinguishes between colloquial terms and official names in app menus makes the barrier between natural language and code almost invisible.
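Stated as code, the difference is between filling the slots of a fixed command and decomposing a goal into lookups the system performs itself. The sketch below is hypothetical; the helper functions stand in for the calendar, email, and traffic queries Gemini chains through its Google service integrations.

```python
from datetime import datetime, timedelta

# Old-style assistant: every parameter must come from the user.
def book_ride(pickup_time: datetime, destination: str) -> None:
    print(f"Ride booked to {destination} at {pickup_time:%H:%M}")

# Hypothetical stand-ins for the lookups a new-style agent performs.
def flight_departure_from_email() -> datetime:
    return datetime(2026, 3, 21, 14, 30)   # parsed from a confirmation email

def travel_time_to_airport() -> timedelta:
    return timedelta(minutes=60)            # current location plus traffic

CHECK_IN_BUFFER = timedelta(hours=2)        # assumed airport margin

# New-style agent: the user states the intent, the system derives the slots.
def get_me_to_my_flight_on_time() -> None:
    departure = flight_departure_from_email()
    pickup = departure - CHECK_IN_BUFFER - travel_time_to_airport()
    book_ride(pickup_time=pickup, destination="Airport")

get_me_to_my_flight_on_time()  # -> Ride booked to Airport at 11:30
```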
Halfway to AI agents
Google builds in a safety switch here: automation stops just before the final payment button, and the user must approve the transaction, the only sensible arrangement in a beta. Although the system rarely goes rogue and usually configures orders correctly, it does make mistakes stemming from missing location data or app permissions, requiring manual intervention in the first minutes of a task.
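In agent design this pattern is usually called a human-in-the-loop gate: certain categories of action can never execute without explicit approval. A minimal sketch of the idea follows; the SENSITIVE set is an invention for illustration, since Google has not published its actual policy.

```python
# Human-in-the-loop gate: the agent may queue any action, but
# sensitive ones block until a person explicitly approves them.
# The policy set below is an assumption, not Google's actual rules.
SENSITIVE = {"submit_payment", "change_address", "delete_account"}

def execute(action: str, confirm) -> bool:
    """Run an action, pausing for approval if it is sensitive."""
    if action in SENSITIVE and not confirm(action):
        print(f"Held for user approval: {action}")
        return False
    print(f"Executed: {action}")
    return True

# The agent assembles the whole order autonomously...
ask = lambda a: input(f"Allow '{a}'? [y/N] ").strip().lower() == "y"
for step in ["open_app", "add_item", "set_delivery", "submit_payment"]:
    execute(step, confirm=ask)  # ...but payment waits for a human tap
```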
Despite its sluggishness, the new feature on the Pixel 10 Pro and Galaxy S26 Ultra is more than a technological curiosity. It is proof that the operating system of the future will be built not around app icons we have to open but around an intelligent intermediary layer. Gemini's current slowness is the price of learning to navigate a world designed for humans.
One could venture to say that we are on the threshold of an era in which the smartphone ceases to be a tool we operate and becomes a coordinator of our needs. Today's nine-minute wait for an AI to order a pizza is a transitional stage. Once developers start adapting their apps to standards like MCP, the same operations will take seconds, and the user's role will shrink to stating a wish and authorizing the payment with biometrics.