The first complete open-source GUI Agent from StepFun, shipping both the model and the infrastructure. A plug-and-play engineering setup is included: no cloud dependencies, full privacy control.
We conducted comprehensive evaluations of the GELab-Zero-4B-preview model across multiple open-source benchmarks, covering various dimensions including GUI understanding, localization, and interaction. Below are the comparison results with other open-source models.
As AI experiences reach ever more consumer mobile devices, mobile Agent research is at a critical juncture, transitioning from "proof of concept" to "large-scale application." GUI-based approaches have emerged as the optimal solution at this stage: they are universally compatible with all applications and integrate at zero cost, without requiring adaptation from app vendors. This makes them well suited to the complex mobile ecosystem and to scaling Agent capabilities.
However, because mobile application ecosystems are highly fragmented, building a functional GUI Agent across diverse brands and device models runs into numerous engineering hurdles: multi-device ADB connections, dependency installation, permission configuration, inference service deployment, and task orchestration with replay capabilities. Agent developers and MCP users therefore sink significant effort into infrastructure work, diverting focus from the innovation that matters.
To address these challenges, we open-source GELab-Zero to accelerate GUI Agent innovation and application deployment. It comprises two primary components: the GELab-Zero-4B-preview model and the supporting infrastructure.
The infrastructure provides a one-click deployment experience on par with existing open-source GUI Agent MCPs, while keeping deployment fully local and the inference pipeline under your control. Key capabilities include:
Supports 4B-scale models running on consumer-grade hardware, balancing low latency with privacy preservation
Provides a unified deployment pipeline with automatic environment-dependency installation and device management
Distributes tasks across multiple devices with interaction trajectory recording for observability and reproducibility
Encompasses ReAct closed-loop, multi-agent collaboration, and scheduled task execution modes
These capabilities enable GELab-Zero to flexibly handle complex task flows in real-world scenarios and provide a solid foundation for subsequent extensions. For Agent developers, this infrastructure enables rapid testing and validation of new interaction ideas and strategies. For enterprise users, it can be reused directly to bring MCP capabilities into product operations quickly.
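As an illustration of what a ReAct-style closed loop looks like for a GUI Agent, the sketch below alternates between observing the screen and acting on it, recording a trajectory per step. This is not GELab-Zero's actual API: `query_model` is an injected hypothetical model call, and the action format (`type`/`value`) is assumed for illustration; device I/O uses plain adb commands.

```python
import subprocess
from typing import List

def action_to_adb(serial: str, action: dict) -> List[str]:
    """Translate a model-predicted action into an adb command (pure, testable)."""
    if action["type"] == "click":
        x, y = action["value"]
        return ["adb", "-s", serial, "shell", "input", "tap", str(x), str(y)]
    if action["type"] == "input":
        return ["adb", "-s", serial, "shell", "input", "text", action["value"]]
    raise ValueError(f"unsupported action type: {action['type']}")

def react_loop(serial: str, task: str, query_model, max_steps: int = 20) -> list:
    """Minimal observe -> reason -> act loop; the model call is injected."""
    trajectory = []  # recorded per step for observability and replay
    for step in range(max_steps):
        # Capture the current screen (exec-out avoids CR/LF mangling of PNG bytes)
        screenshot = subprocess.run(
            ["adb", "-s", serial, "exec-out", "screencap", "-p"],
            capture_output=True, check=True).stdout
        action = query_model(task, screenshot)  # hypothetical model inference
        trajectory.append({"step": step, "action": action})
        if action["type"] == "finish":
            break
        subprocess.run(action_to_adb(serial, action), check=True)
    return trajectory
```

Keeping the action-to-command translation pure makes it easy to unit-test without a connected device, which is also what makes trajectory replay straightforward.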
Experience the power of GELab-Zero GUI Agent in action
Task: Help me find any good recent sci-fi movies
Agent autonomously interprets subjective criteria ("good") and navigates movie browsing application to identify relevant sci-fi content
Task: Help me find a place where I can take my kids on the weekend
Agent autonomously analyzes family-friendly activities and provides personalized recommendations
Task: Claim meal vouchers on the enterprise welfare platform
Agent executes a multi-step task on the enterprise welfare platform: it accurately reads on-screen information, locates the meal voucher redemption entry in the app, and completes the voucher application
Task: Check if Metro Line 1 is operating normally, then navigate to the nearest entrance of Line 1 metro station
Agent queries the metro operation status to assess current conditions, then navigates to the nearest Line 1 station entrance
Task: Go to the nearest Hema Fresh Store on Ele.me and purchase: Red strawberries 300g, Peruvian Bianca blueberries 125g (18mm diameter), seasonal fresh yellow potatoes 500g, sweet baby pumpkin 750g, Hema large grain shrimp sliders, 2 bottles of Hema pure black soy milk 300ml, Little Prince macadamia nut cocoa crisp 120g, Hema spinach noodles, Hema five-spice beef, 5 bags of Haohuan snail Liuzhou river snail rice noodles (extra spicy extra smelly) 400g, m&m's milk chocolate beans 100g
Successfully completed comprehensive shopping task with multiple specific items across categories
Task: Search for 'how to learn financial management' on Zhihu and view the first answer with over 10k likes
Agent autonomously navigates knowledge-sharing platforms and filters high-quality content based on specified criteria
Task: Find a pair of white canvas shoes in size 37 on Taobao, priced under 100 yuan, then add the first item that meets the criteria to favorites
Agent demonstrates complex filtering capabilities, identifying products matching multiple specific criteria and executing favoriting action
Task: Go to Baicizhan and help me complete the vocabulary learning task
Agent autonomously operates educational apps and completes interactive quizzes
Deploy and run GUI Agent inference locally with our lightweight infrastructure
Deploy our optimized 4B parameter model locally on your machine
Seamlessly connect to your mobile device for real-time GUI control
Powerful inference infrastructure for GUI understanding and action generation
# Clone the repository
git clone https://github.com/stepfun-ai/gelab-zero
cd gelab-zero
# Install dependencies
pip install -r requirements.txt
# Run inference on a single task
python examples/run_single_task.py
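Before a task can run, the pipeline has to discover which devices are attached. A minimal sketch of parsing `adb devices` output is shown below; the parser is illustrative only, not GELab-Zero's actual device manager.

```python
from typing import List

def parse_adb_devices(output: str) -> List[str]:
    """Extract serials of ready devices from `adb devices` output.

    Devices in states like 'unauthorized' or 'offline' are skipped.
    """
    serials = []
    for line in output.strip().splitlines()[1:]:  # skip the header line
        parts = line.split()
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials

# Example against captured adb output:
sample = (
    "List of devices attached\n"
    "emulator-5554\tdevice\n"
    "0123456789ABCDEF\tunauthorized\n"
)
print(parse_adb_devices(sample))  # ['emulator-5554']
```

Filtering on the `device` state matters in multi-device setups: dispatching a task to an `unauthorized` or `offline` serial fails at the first adb call.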
While mainstream benchmarks predominantly focus on productivity applications (e.g., email), users' daily high-frequency usage centers on life service applications (e.g., food delivery, ride-hailing, social media, payment). These scenarios better reflect the practical value of contemporary GUI Agents.
We present AndroidDaily: a multi-dimensional dynamic benchmark oriented toward real-world scenarios. We focus on empirical analysis across six core dimensions of modern life (Food, Transportation, Shopping, Housing, Information Consumption, Entertainment), prioritizing popular applications that dominate these categories. This ensures benchmark tasks feature real-world interaction outcomes (e.g., transaction payments, service bookings) with tight online-offline integration characteristics.
Contains 3,146 actions in total. Each sample provides a task description and step-by-step screenshots, and the Agent must predict the action type and value (e.g., click coordinates, input text) at each step; evaluation is primarily per-step accuracy. Because this setup requires no complex engineering infrastructure, it enables rapid, cost-effective, large-scale model iteration and testing.
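The per-step scoring this protocol implies can be sketched as follows. The matching rules here (exact type match, a pixel radius for clicks, exact text for inputs) and the 50-pixel threshold are assumptions for illustration; the benchmark's actual criteria may differ.

```python
import math
from typing import List

def step_correct(pred: dict, gold: dict, click_radius: float = 50.0) -> bool:
    """Judge one predicted action against the ground-truth action."""
    if pred["type"] != gold["type"]:
        return False
    if gold["type"] == "click":
        (px, py), (gx, gy) = pred["value"], gold["value"]
        return math.hypot(px - gx, py - gy) <= click_radius  # within tolerance
    if gold["type"] == "input":
        return pred["value"].strip() == gold["value"].strip()
    return True  # actions like 'back' or 'home' carry no value to compare

def accuracy(preds: List[dict], golds: List[dict]) -> float:
    """Fraction of steps where the predicted action matches ground truth."""
    correct = sum(step_correct(p, g) for p, g in zip(preds, golds))
    return correct / len(golds)
```

A distance tolerance for clicks is the usual choice in GUI benchmarks, since any coordinate inside the target widget should count as a hit.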
Comparison of model accuracy on the AndroidDaily static benchmark. GELab-Zero-4B-preview demonstrates exceptional performance with 73.4% accuracy, significantly outperforming other state-of-the-art models.
+26.4% improvement over UI-TARS-1.5
3.7x better than GPT-4o
#1 on AndroidDaily Static Benchmark
Total Tasks: 235
Full Environment
Conducted in fully functional test environments (real devices or emulators), where the Agent must autonomously execute each task from start to finish; the overall task success rate is the evaluation metric. This setup provides the highest ecological validity, authentically reflecting the Agent's comprehensive capabilities in complex environments.
78 tasks (33.19%)
Ride-hailing, navigation, public transportation, etc.
61 tasks (25.96%)
E-commerce shopping, payment, order management, etc.
43 tasks (18.30%)
Message sending, social interactions, etc.
37 tasks (15.74%)
News reading, video watching, content bookmarking, etc.
16 tasks (6.81%)
Food delivery, in-store services, etc.
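The category shares above follow directly from the per-category task counts. A quick sketch reproducing them (counts taken from the list above; the category labels are paraphrased from the descriptions, and the total, 235, is simply their sum):

```python
# Per-category task counts from the dynamic benchmark breakdown
counts = {
    "Transportation": 78,  # ride-hailing, navigation, public transportation
    "Shopping": 61,        # e-commerce, payment, order management
    "Social": 43,          # message sending, social interactions
    "Information": 37,     # news reading, video watching, bookmarking
    "Food": 16,            # food delivery, in-store services
}

total = sum(counts.values())  # 235 tasks overall
shares = {k: round(100 * v / total, 2) for k, v in counts.items()}
print(total, shares)
```

Running this reproduces the published percentages (33.19%, 25.96%, 18.30%, 15.74%, 6.81%), confirming the breakdown is internally consistent.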