francedot/acu: A curated list of resources about AI agents for Computer Use, including research papers, projects, frameworks, and tools.

Klenance

  • Reinforcement Learning for Long-Horizon Interactive LLM Agents (Feb. 2025)

    • Novel RL approach (LOOP) for training interactive digital agents (IDAs) directly in their target environments
    • 32B parameter agent outperforms OpenAI o1 by 9 percentage points on AppWorld
  • Large Action Models: From Inception to Implementation (Dec. 2024)

    • Comprehensive framework for developing LAMs that can perform real-world actions beyond language generation
    • Details key stages including data collection, model training, environment integration, grounding and evaluation
  • Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation (Dec. 2024)

    • Novel reward-guided navigation approach
  • SpiritSight Agent: Advanced GUI Agent with One Look (Dec. 2024)

    • Single-shot GUI interaction approach
  • AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs (Dec. 2024)

    • Novel approach for automatic GUI functionality annotation
  • Simulate Before Act: Model-Based Planning for Web Agents (Dec. 2024)

    • Novel model-based planning approach using LLM world models
  • Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents (Dec. 2024)

    • Novel autonomous skill discovery framework for web agents
    • Code
  • Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents (Dec. 2024)

    • Novel framework for contextualizing web pages to enhance LLM agent decision making
  • Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL (Dec. 2024)

    • Novel value-based offline RL approach for training VLM device-control agents
  • Magentic-One (Nov. 2024)

    • Multi-agent system with orchestrator-led coordination
    • Strong performance on GAIA, WebArena, and AssistantBench
  • Agent Workflow Memory (Sep. 2024)

    • Novel workflow memory framework for agents
    • Code
  • The Impact of Element Ordering on LM Agent Performance (Sep. 2024)

    • Novel study on element ordering’s impact on agent performance
    • Code
  • Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (Aug. 2024)

    • Novel reasoning and learning framework
    • Website
  • OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models (Aug. 2024)

    • Open platform for web-based agent deployment
    • Code
  • Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (Jul. 2024)

    • Hierarchical architecture with flexible DOM distillation
    • Novel denoising method for web navigation
  • Apple Intelligence Foundation Language Models (Jul. 2024)

    • Vision-Language Model with Private Cloud Compute
    • Novel foundation model architecture
  • Tree Search for Language Model Agents (Jul. 2024)

    • Multi-step reasoning and planning with best-first tree search
    • Novel approach for LLM-based agents
  • DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (Jun. 2024)

    • Novel reinforcement learning approach
    • Code
  • Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (Jun. 2024)

    • Multi-agent collaboration for mobile device operation
    • Code
  • Octopus Series: On-device Language Models for Computer Control (Apr. 2024)

    • v4: Graph of language models with functional tokens integration (Apr. 2024)
    • v3: Sub-billion parameter multimodal model for edge devices (Apr. 2024)
    • v2: Super agent for Android and iOS (Apr. 2024)
    • v1: Function calling of software APIs (Apr. 2024)
    • Website
    • Code
  • AutoWebGLM: Bootstrap and Reinforce a Large Language Model-based Web Navigating Agent (Apr. 2024)

    • Novel approach for real-world web navigation and bilingual benchmark
    • Code
  • Cradle: Empowering Foundation Agents towards General Computer Control (Mar. 2024)

    • Focus on general computer control using Red Dead Redemption II as a case study
    • Code
  • Android in the Zoo: Chain-of-Action-Thought for GUI Agents (Mar. 2024)

    • Novel Chain-of-Action-Thought framework for Android interaction
    • Code
  • ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (Feb. 2024)

    • Vision-language model for computer control
    • Code
  • OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (Feb. 2024)

    • Vision-Language Model for PC interaction
    • Code
  • UFO: A UI-Focused Agent for Windows OS Interaction (Feb. 2024)

    • Specialized for Windows OS interaction
    • Code
  • CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation (Feb. 2024)

    • Novel comprehensive environment perception (CEP) approach for exhaustive GUI perception
    • Introduces conditional action prediction (CAP) for reliable action response
  • Intention-in-Interaction (IN3): Tell Me More! (Feb. 2024)

    • Novel benchmark for evaluating user intention understanding in agent designs
    • Introduces model experts for robust user-agent interaction
  • Dual-View Visual Contextualization for Web Navigation (Feb. 2024)

    • Novel approach for automatic web navigation with language instructions
    • Key idea: contextualizes HTML elements with their visual context in webpage screenshots
  • ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Feb. 2024)

    • Specialized for mobile UI and infographics understanding
    • Novel approach for visual interface comprehension
  • GPT-4V(ision) is a Generalist Web Agent, if Grounded (Jan. 2024)

    • Demonstrates GPT-4V capabilities for web interaction
    • Code
  • Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (Jan. 2024)

    • Visual perception for mobile device interaction
    • Code
  • WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models (Jan. 2024)

    • End-to-end approach for web interaction
    • Code
  • CogAgent: A Visual Language Model for GUI Agents (Dec. 2023)

    • Works across PC and Android platforms
    • Code
  • AppAgent: Multimodal Agents as Smartphone Users (Dec. 2023)

    • Focused on smartphone interaction
    • Code
  • LASER: LLM Agent with State-Space Exploration for Web Navigation (Sep. 2023)

    • Novel approach to web navigation
    • Code
  • AndroidEnv: A Reinforcement Learning Platform for Android (May 2021)

    • Reinforcement learning platform for Android interaction
    • Code