The first complete open-source GUI Agent from StepFun, shipping both the model and the infrastructure. A plug-and-play engineering setup is included: no cloud dependencies, full privacy control.
We conducted comprehensive evaluations of the GELab-Zero-4B-preview model across multiple open-source benchmarks, covering various dimensions including GUI understanding, localization, and interaction. Below are the comparison results with other open-source models.
As AI experiences reach ever more consumer mobile devices, mobile Agent research is at a critical juncture, transitioning from "proof of concept" to "large-scale application." GUI-based approaches have emerged as the optimal solution at this stage: they are universally compatible with all applications and integrate at zero cost, without requiring adaptation from app vendors. This makes them well suited to the complex mobile ecosystem and to scaling Agent capabilities.
However, because mobile application ecosystems are highly fragmented, building a functional GUI Agent across diverse brands and device models runs into numerous engineering hurdles: multi-device ADB connections, dependency installation, permission configuration, inference service deployment, and task orchestration with replay capabilities. Agent developers and MCP users therefore sink significant effort into infrastructure work, diverting focus from the innovation that matters.
To address these challenges, we open-source GELab-Zero to accelerate GUI Agent innovation and application deployment. It comprises two primary components: the GELab-Zero-4B-preview model and the supporting infrastructure.
The infrastructure provides a one-click deployment experience on par with existing open-source GUI Agent MCPs, while keeping deployment fully local and the inference pipeline under your control. Key capabilities include:
Supports 4B-scale models running on consumer-grade hardware, balancing low latency with privacy preservation
Provides a unified deployment pipeline with automatic environment-dependency installation and device management
Distributes tasks across multiple devices with interaction trajectory recording for observability and reproducibility
Encompasses ReAct closed-loop, multi-agent collaboration, and scheduled task execution modes
These capabilities enable GELab-Zero to flexibly handle complex task flows in real-world scenarios and provide a solid foundation for subsequent extensions. For Agent developers, this infrastructure enables rapid testing and validation of new interaction ideas and strategies. For enterprise users, it can be reused directly to bring MCP capabilities into product operations quickly.
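As an illustration of what a ReAct-style closed loop looks like for a GUI Agent, the sketch below alternates between observing the screen and acting on it, recording a trajectory per step. This is not GELab-Zero's actual API: `query_model` is an injected hypothetical model call, and the action format (`type`/`value`) is assumed for illustration; device I/O uses plain adb commands.

```python
import subprocess
from typing import List

def action_to_adb(serial: str, action: dict) -> List[str]:
    """Translate a model-predicted action into an adb command (pure, testable)."""
    if action["type"] == "click":
        x, y = action["value"]
        return ["adb", "-s", serial, "shell", "input", "tap", str(x), str(y)]
    if action["type"] == "input":
        return ["adb", "-s", serial, "shell", "input", "text", action["value"]]
    raise ValueError(f"unsupported action type: {action['type']}")

def react_loop(serial: str, task: str, query_model, max_steps: int = 20) -> list:
    """Minimal observe -> reason -> act loop; the model call is injected."""
    trajectory = []  # recorded per step for observability and replay
    for step in range(max_steps):
        # Capture the current screen (exec-out avoids CR/LF mangling of PNG bytes)
        screenshot = subprocess.run(
            ["adb", "-s", serial, "exec-out", "screencap", "-p"],
            capture_output=True, check=True).stdout
        action = query_model(task, screenshot)  # hypothetical model inference
        trajectory.append({"step": step, "action": action})
        if action["type"] == "finish":
            break
        subprocess.run(action_to_adb(serial, action), check=True)
    return trajectory
```

Keeping the action-to-command translation pure makes it easy to unit-test without a connected device, which is also what makes trajectory replay straightforward.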
Experience the power of GELab-Zero GUI Agent in action
Task: Help me find any good recent sci-fi movies
Agent autonomously interprets subjective criteria ("good") and navigates movie browsing application to identify relevant sci-fi content
Task: Help me find a place where I can take my kids on the weekend
Agent autonomously analyzes family-friendly activities and provides personalized recommendations
Task: Claim meal vouchers on the enterprise welfare platform
Agent executes a multi-step task on the enterprise welfare platform: it accurately reads on-screen information, locates the meal voucher redemption entry in the app, and completes the voucher application
Task: Check if Metro Line 1 is operating normally, then navigate to the nearest entrance of Line 1 metro station
Agent queries the metro operation status to assess current conditions, then navigates to the nearest Line 1 station entrance
Task: Go to the nearest Hema Fresh Store on Ele.me and purchase: Red strawberries 300g, Peruvian Bianca blueberries 125g (18mm diameter), seasonal fresh yellow potatoes 500g, sweet baby pumpkin 750g, Hema large grain shrimp sliders, 2 bottles of Hema pure black soy milk 300ml, Little Prince macadamia nut cocoa crisp 120g, Hema spinach noodles, Hema five-spice beef, 5 bags of Haohuan snail Liuzhou river snail rice noodles (extra spicy extra smelly) 400g, m&m's milk chocolate beans 100g
Successfully completed comprehensive shopping task with multiple specific items across categories
Task: Search for 'how to learn financial management' on Zhihu and view the first answer with over 10k likes
Agent autonomously navigates knowledge-sharing platforms and filters high-quality content based on specified criteria
Task: Find a pair of white canvas shoes in size 37 on Taobao, priced under 100 yuan, then add the first item that meets the criteria to favorites
Agent demonstrates complex filtering capabilities, identifying products matching multiple specific criteria and executing favoriting action
Task: Go to Baicizhan and help me complete the vocabulary learning task
Agent autonomously operates educational apps and completes interactive quizzes
Deploy and run GUI Agent inference locally with our lightweight infrastructure
Deploy our optimized 4B parameter model locally on your machine
Seamlessly connect to your mobile device for real-time GUI control
Powerful inference infrastructure for GUI understanding and action generation
# Clone the repository
git clone https://github.com/stepfun-ai/gelab-zero
cd gelab-zero
# Install dependencies
pip install -r requirements.txt
# Run inference on a single task
python examples/run_single_task.py
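Before a task can run, the pipeline has to discover which devices are attached. A minimal sketch of parsing `adb devices` output is shown below; the parser is illustrative only, not GELab-Zero's actual device manager.

```python
from typing import List

def parse_adb_devices(output: str) -> List[str]:
    """Extract serials of ready devices from `adb devices` output.

    Devices in states like 'unauthorized' or 'offline' are skipped.
    """
    serials = []
    for line in output.strip().splitlines()[1:]:  # skip the header line
        parts = line.split()
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials

# Example against captured adb output:
sample = (
    "List of devices attached\n"
    "emulator-5554\tdevice\n"
    "0123456789ABCDEF\tunauthorized\n"
)
print(parse_adb_devices(sample))  # ['emulator-5554']
```

Filtering on the `device` state matters in multi-device setups: dispatching a task to an `unauthorized` or `offline` serial fails at the first adb call.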
While mainstream benchmarks predominantly focus on productivity applications (e.g., email), users' daily high-frequency usage centers on life service applications (e.g., food delivery, ride-hailing, social media, payment). These scenarios better reflect the practical value of contemporary GUI Agents.
We present AndroidDaily: a multi-dimensional dynamic benchmark oriented toward real-world scenarios. We focus on empirical analysis across six core dimensions of modern life (Food, Transportation, Shopping, Housing, Information Consumption, Entertainment), prioritizing popular applications that dominate these categories. This ensures benchmark tasks feature real-world interaction outcomes (e.g., transaction payments, service bookings) with tight online-offline integration characteristics.
Contains 3,146 actions in total. Each sample provides a task description and step-by-step screenshots, and the Agent must predict the action type and value (e.g., click coordinates, input text) at each step; evaluation is primarily per-step accuracy. Because this setup requires no complex engineering infrastructure, it enables rapid, cost-effective, large-scale model iteration and testing.
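The per-step scoring this protocol implies can be sketched as follows. The matching rules here (exact type match, a pixel radius for clicks, exact text for inputs) and the 50-pixel threshold are assumptions for illustration; the benchmark's actual criteria may differ.

```python
import math
from typing import List

def step_correct(pred: dict, gold: dict, click_radius: float = 50.0) -> bool:
    """Judge one predicted action against the ground-truth action."""
    if pred["type"] != gold["type"]:
        return False
    if gold["type"] == "click":
        (px, py), (gx, gy) = pred["value"], gold["value"]
        return math.hypot(px - gx, py - gy) <= click_radius  # within tolerance
    if gold["type"] == "input":
        return pred["value"].strip() == gold["value"].strip()
    return True  # actions like 'back' or 'home' carry no value to compare

def accuracy(preds: List[dict], golds: List[dict]) -> float:
    """Fraction of steps where the predicted action matches ground truth."""
    correct = sum(step_correct(p, g) for p, g in zip(preds, golds))
    return correct / len(golds)
```

A distance tolerance for clicks is the usual choice in GUI benchmarks, since any coordinate inside the target widget should count as a hit.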
Comparison of model accuracy on the AndroidDaily static benchmark. GELab-Zero-4B-preview demonstrates exceptional performance with 73.4% accuracy, significantly outperforming other state-of-the-art models.
+26.4% improvement over UI-TARS-1.5
3.7x better than GPT-4o
#1 on AndroidDaily Static Benchmark
Total Tasks: 235
Full Environment
Conducted in fully functional test environments (real devices or emulators), where the Agent must autonomously execute each task from start to finish; the overall task success rate is the evaluation metric. This setup provides the highest ecological validity, authentically reflecting the Agent's comprehensive capabilities in complex environments.
78 tasks (33.19%)
Ride-hailing, navigation, public transportation, etc.
61 tasks (25.96%)
E-commerce shopping, payment, order management, etc.
43 tasks (18.30%)
Message sending, social interactions, etc.
37 tasks (15.74%)
News reading, video watching, content bookmarking, etc.
16 tasks (6.81%)
Food delivery, in-store services, etc.
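The category shares above follow directly from the per-category task counts. A quick sketch reproducing them (counts taken from the list above; the category labels are paraphrased from the descriptions, and the total, 235, is simply their sum):

```python
# Per-category task counts from the dynamic benchmark breakdown
counts = {
    "Transportation": 78,  # ride-hailing, navigation, public transportation
    "Shopping": 61,        # e-commerce, payment, order management
    "Social": 43,          # message sending, social interactions
    "Information": 37,     # news reading, video watching, bookmarking
    "Food": 16,            # food delivery, in-store services
}

total = sum(counts.values())  # 235 tasks overall
shares = {k: round(100 * v / total, 2) for k, v in counts.items()}
print(total, shares)
```

Running this reproduces the published percentages (33.19%, 25.96%, 18.30%, 15.74%, 6.81%), confirming the breakdown is internally consistent.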