ByteDance: UI-TARS 7B

OpenAI • text • vision

Provider IDbytedance/ui-tars-1.5-7b

UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, including desktop interfaces, web browsers, mobile systems, and games. Built by ByteDance, it builds upon the UI-TARS framework with reinforcement learning-based reasoning, enabling robust action planning and execution across virtual interfaces. This model achieves state-of-the-art results on a range of interactive and grounding benchmarks, including OSworld, WebVoyager, AndroidWorld, and ScreenSpot. It also demonstrates perfect task completion across diverse Poki games and outperforms prior models in Minecraft agent tasks. UI-TARS-1.5 supports thought decomposition during inference and shows strong scaling across variants, with the 1.5 version notably exceeding the performance of earlier 72B and 7B checkpoints.

Quick Summary

Best For:

High-volume, low-latency tasks where cost efficiency is paramount

Pricing:

$0.00/1M input tokens, $0.00/1M output tokens

Context Window:

128,000 tokens

Key Differentiator:

Cost-optimized for high-volume usage

Specifications

Context Window

128,000 tokens

Max Output Tokens

2,048 tokens

Streaming

Yes

JSON Mode

Vision

Yes

Tier

Affordable

Capabilities

text

vision

ByteDance: UI-TARS 7B

Best For:

Pricing:

Context Window:

Key Differentiator:

Social

Legal