On October 22, 2024 (local time), Anthropic announced an upgraded Claude 3.5 Sonnet and a new model, Claude 3.5 Haiku. The upgraded Claude 3.5 Sonnet offers stronger coding capabilities and a new feature called "computer use," which lets it operate a computer much as a person would.
With computer use, Claude 3.5 Sonnet can follow a user's instructions to move the cursor, click on relevant areas of the screen, and enter text through a virtual keyboard, mimicking how people interact with their computers. On the OSWorld benchmark, it scored 14.9%. That falls well short of the human level of 70-75%, but it is nearly double the 7.8% scored by the next-best AI system.
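The move, click, and type cycle described above can be illustrated with a minimal sketch. It is not Anthropic's implementation or API: `Action`, `FakeDesktop`, and `run_actions` are hypothetical names, and the "desktop" only records state in memory instead of driving a real GUI.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Action:
    """One hypothetical UI action a model might request."""
    kind: str                                   # "move", "click", or "type"
    position: Optional[Tuple[int, int]] = None  # target for "move"
    text: str = ""                              # payload for "type"


@dataclass
class FakeDesktop:
    """Stand-in for a real desktop (assumption): it records state only."""
    cursor: Tuple[int, int] = (0, 0)
    clicks: List[Tuple[int, int]] = field(default_factory=list)
    typed: str = ""

    def apply(self, action: Action) -> None:
        if action.kind == "move":
            if action.position is None:
                raise ValueError("move needs a position")
            self.cursor = action.position
        elif action.kind == "click":
            # A click lands wherever the cursor currently is.
            self.clicks.append(self.cursor)
        elif action.kind == "type":
            self.typed += action.text
        else:
            raise ValueError(f"unknown action kind: {action.kind}")

    def screenshot(self) -> str:
        # A real system would capture pixels; a text summary stands in here.
        return f"cursor={self.cursor} clicks={len(self.clicks)} typed={self.typed!r}"


def run_actions(desktop: FakeDesktop, actions: List[Action]) -> List[str]:
    """Apply each action, taking a fresh 'screenshot' after every step."""
    observations = []
    for action in actions:
        desktop.apply(action)
        observations.append(desktop.screenshot())
    return observations


desktop = FakeDesktop()
observations = run_actions(
    desktop,
    [Action("move", position=(120, 45)), Action("click"), Action("type", text="hello")],
)
```

In a real setup, each observation would be sent back to the model, which would then choose the next action; here the action list is fixed in advance to keep the sketch self-contained.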
The head of developer relations at Anthropic said that computer use marks a new paradigm in human-computer interaction and is a foundational capability that AI models should have. The upgraded Claude 3.5 Sonnet is available now, and the computer use feature is offered in beta.
Claude 3.5 Sonnet improves in several areas, most notably agentic coding and tool use. On the SWE-bench Verified test, its score rose from 33.4% to 49.0%, surpassing all publicly available models, including OpenAI's o1-preview and systems designed specifically for agentic coding.
However, computer use still needs work: Claude can be slow and error-prone, and it struggles with actions people find routine, such as dragging and zooming. It also observes the screen as a series of stitched-together screenshots rather than a continuous video stream, so it can miss short-lived actions or notifications. While recording demos, Claude at one point accidentally stopped a screen recording and at another wandered off to browse unrelated photos.
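To see why screenshot-based observation can miss brief events, consider the following sketch. The event names, durations, and screenshot times are made-up illustrative values, and `missed_events` is a hypothetical helper, not part of any Anthropic tooling.

```python
from typing import List, Tuple

# (name, start_seconds, end_seconds): when something is visible on screen
Event = Tuple[str, float, float]


def missed_events(events: List[Event], screenshot_times: List[float]) -> List[str]:
    """Return the events that no screenshot happened to capture."""
    missed = []
    for name, start, end in events:
        # An event is observed only if a screenshot falls inside its window.
        seen = any(start <= t <= end for t in screenshot_times)
        if not seen:
            missed.append(name)
    return missed


events = [
    ("save dialog", 0.0, 5.0),           # long-lived, easy to catch
    ("toast notification", 2.2, 2.8),    # visible for well under a second
]
screenshot_times = [0.0, 2.0, 4.0]       # assumed: one screenshot every 2 seconds

missed = missed_events(events, screenshot_times)
```

With these assumed timings, the notification appears and disappears between two screenshots, so only the long-lived dialog is ever observed.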
Despite these issues, Claude's current capabilities raise expectations for what comes next. AI that can operate a computer represents a new direction in AI development and promises to streamline tasks such as software development.