New Tsinghua Study: Fine-Tuning Diffusion Models with Human Feedback Without Reward Models

Overview of Recent Advances in Large Language Models and Their Applications

1. Large Language Models as Effective Teachers for Reinforcement Learning Agents

Recent research shows that large language models (LLMs) can greatly aid in solving complex sequential decision-making tasks by providing high-level instructions. However, deploying LLM-based agents directly in dynamic real-world settings is challenging, both because their problem-solving ability in such environments is limited and because querying a large model at every decision step is costly.

To overcome these challenges, a novel framework has been proposed that uses an LLM-based teacher agent to train smaller, specialized student agents. Knowledge is transferred from the teacher LLM to a local student model, enabling efficient training with far less data. Notably, after further training with environmental feedback, the student agents can outperform their teachers. Experiments in challenging MiniGrid environments confirmed significant gains in sample efficiency and performance over established baselines.
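
As a concrete illustration of the two phases, here is a minimal PyTorch sketch: a distillation step that matches the student's action distribution to the teacher LLM's suggestions, followed by a plain policy-gradient step on environment reward. The function names, the KL-based distillation loss, and the REINFORCE update are illustrative assumptions, not the paper's exact algorithm.

```python
import torch.nn.functional as F

def distillation_step(student, optimizer, obs, teacher_probs):
    """Phase 1: imitate the LLM teacher's action distribution (KL loss).
    `student` is assumed to be an nn.Module mapping observations to action logits;
    `teacher_probs` are soft action distributions extracted from the teacher LLM."""
    log_p = F.log_softmax(student(obs), dim=-1)
    loss = F.kl_div(log_p, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def reinforce_step(student, optimizer, obs, actions, returns):
    """Phase 2: fine-tune on environment reward only (simple REINFORCE here;
    the paper's RL algorithm may differ)."""
    log_p = F.log_softmax(student(obs), dim=-1)
    chosen = log_p.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```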

Paper: Large Language Model is a Good Policy Teacher for Training Reinforcement Learning Agents

2. Survey of Multimodal Large Language Models

While recent LLMs excel at text-based tasks, they often struggle with other data types such as images and audio. Multimodal models address these limitations by integrating diverse data types, leading to a more comprehensive understanding.

This survey clarifies the concept of multimodality and traces the evolution of multimodal algorithms. It reviews various multimodal applications developed by leading tech companies, providing a practical guide on technical aspects. The findings summarize current algorithms and common datasets, facilitating research and evaluation. The study also discusses applications of multimodal models and the challenges they face, aiming to enhance understanding of their potential across multiple domains.

Paper: Multimodal Large Language Models: A Survey

3. Visual In-Context Prompting

In-context prompting is known to boost the zero-shot capabilities of LLMs, yet it remains underexplored for visual tasks. Current visual prompting techniques focus mainly on segmenting the single object a prompt refers to, which falls short of generic tasks such as open-set segmentation and object detection.

This study introduces a versatile visual in-context prompting framework that supports both segmentation and detection. The researchers built a multifunctional prompt encoder on an encoder-decoder architecture, accommodating diverse prompt types such as strokes, boxes, and points, and accepting an arbitrary number of reference image segments as context. Joint training on the COCO and SA-1B datasets produced strong results, reaching 57.7 PQ on COCO and 23.2 PQ on ADE20K.
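
To make the idea of a single encoder handling heterogeneous prompts more concrete, here is a hedged PyTorch sketch of such an interface. The class, projection layers, and tensor shapes are illustrative assumptions and do not reproduce the paper's actual architecture.

```python
from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn

@dataclass
class VisualPrompt:
    points: Optional[torch.Tensor] = None           # (N, 2) click coordinates
    boxes: Optional[torch.Tensor] = None            # (M, 4) xyxy boxes
    strokes: Optional[torch.Tensor] = None          # (1, H, W) scribble mask
    reference_masks: Optional[torch.Tensor] = None  # (K, H, W) in-context segments

class PromptEncoder(nn.Module):
    """Maps heterogeneous visual prompts into a shared token space that a
    segmentation/detection decoder can attend to."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.point_proj = nn.Linear(2, dim)
        self.box_proj = nn.Linear(4, dim)
        self.mask_proj = nn.Conv2d(1, dim, kernel_size=16, stride=16)

    def forward(self, prompt: VisualPrompt) -> torch.Tensor:
        tokens = []
        if prompt.points is not None:
            tokens.append(self.point_proj(prompt.points.float()))
        if prompt.boxes is not None:
            tokens.append(self.box_proj(prompt.boxes.float()))
        for masks in (prompt.strokes, prompt.reference_masks):
            if masks is not None:                        # (K, H, W) -> (K, dim)
                feats = self.mask_proj(masks.unsqueeze(1).float())
                tokens.append(feats.flatten(2).mean(-1))
        return torch.cat(tokens, dim=0)  # prompt tokens consumed by the decoder
```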

Paper: Visual In-Context Prompting

4. Soulstyler: Guiding Image Style Transfer via Large Language Models

Image style transfer is an important problem in computer graphics and vision, yet existing methods typically require a reference stylized image, which makes it difficult to stylize specific objects independently.

To address this, the "Soulstyler" framework was introduced, allowing users to guide object stylization in images using simple text descriptions. This framework leverages an LLM to interpret text, identifying targets and styles. By integrating a CLIP-based semantic visual embedding encoder, the model effectively aligns textual and visual content. Additionally, a novel localization loss ensures that style transfer applies only to chosen objects, maintaining the background's original style. Experimental results confirm the model's capability to style target objects based on textual input without altering backgrounds.
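
The localization idea can be sketched as a masked loss: a style term evaluated only inside the target-object mask plus a term that pins the background to the original image. The function below is a minimal illustration under that assumption; `clip_style_score` stands in for a CLIP-based style objective and is not the paper's actual API.

```python
import torch.nn.functional as F

def localized_style_loss(stylized, original, target_mask, clip_style_score, bg_weight=1.0):
    """
    stylized, original: (B, 3, H, W) images
    target_mask:        (B, 1, H, W), 1 inside the object named in the text prompt
    clip_style_score:   callable returning a CLIP-based style loss for an image
    """
    # Style term: only the masked (target) region should match the described style.
    style_term = clip_style_score(stylized * target_mask)
    # Background term: everything outside the mask should stay as in the original.
    bg_term = F.l1_loss(stylized * (1 - target_mask), original * (1 - target_mask))
    return style_term + bg_weight * bg_term
```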

Paper: Soulstyler: Using Large Language Model to Guide Image Style Transfer for Target Object

5. Innovative Research: Fine-Tuning Diffusion Models with Human Feedback Without Reward Models

Training a reward model is resource-intensive: it requires large preference datasets, a suitable architecture, and manual hyperparameter tuning. Direct Preference Optimization (DPO) has proven effective for fine-tuning LLMs without a reward model, but applying it directly to diffusion models is harder, because evaluating the likelihood of a generated image involves the entire multi-step denoising process and therefore demands substantial GPU memory.

This research introduces Direct Preference for Denoising Diffusion Policy Optimization (D3PO), which fine-tunes diffusion models directly on human feedback without any reward model. Theoretical analysis shows that, although D3PO skips training an explicit reward model, it behaves as if guided by the optimal reward model learned from the human feedback data, while requiring only modest computational resources. In experiments, using the relative scale of objectives as a proxy for human preference produced results comparable to methods trained on ground-truth rewards, and D3PO also reduced image distortion and generated safer images.
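
In spirit, D3PO applies a DPO-style preference loss at the level of individual denoising steps, treating denoising as a multi-step decision process. The snippet below is a hedged sketch of that idea; the variable names and the exact parameterization are assumptions rather than the paper's precise objective.

```python
import torch.nn.functional as F

def preference_step_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """
    logp_*     : log-probability of the taken denoising step under the current model
    ref_logp_* : the same quantity under a frozen reference model
    Steps from the human-preferred sample are pushed up relative to the reference,
    steps from the dispreferred sample are pushed down -- no reward model needed.
    """
    adv_win = logp_win - ref_logp_win
    adv_lose = logp_lose - ref_logp_lose
    return -F.logsigmoid(beta * (adv_win - adv_lose)).mean()
```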

Paper: Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

6. AlignCoT: Enhancing LLM Performance through "Native" Style Alignment

Prompt engineering significantly influences LLM performance. Chain-of-thought (CoT), a widely used prompting technique, relies on manually created examples. However, the impact of the textual style of these examples on LLM outputs has not been thoroughly studied.

This research proposes AlignCoT, a method that improves LLM reasoning by aligning in-context examples with the LLM's native style, that is, the stylistic patterns the model itself exhibits in zero-shot settings. AlignCoT integrates seamlessly with other prompting techniques for further gains. Comprehensive experiments across various benchmarks showed that AlignCoT substantially outperforms human-written in-context examples; for instance, it improved GPT-3.5-turbo's performance on GSM8K by +2.5%. Combined with advanced prompting techniques, AlignCoT consistently achieved superior results.
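
A rough sketch of that recipe: elicit zero-shot chain-of-thought answers so the demonstrations are written in the model's own style, keep only those whose final answer is correct, and reuse them as few-shot context. The `chat` helper and the string-matching answer check are simplifying assumptions, not the paper's implementation.

```python
def build_native_style_demos(chat, seed_questions, gold_answers):
    """chat(prompt) -> str is an assumed wrapper around an LLM chat API."""
    demos = []
    for question, gold in zip(seed_questions, gold_answers):
        rationale = chat(f"Q: {question}\nA: Let's think step by step.")  # zero-shot CoT
        if str(gold) in rationale:  # crude correctness filter for illustration
            demos.append((question, rationale))
    return demos

def answer_with_native_demos(chat, demos, new_question):
    context = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in demos)
    return chat(f"{context}\n\nQ: {new_question}\nA:")
```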

Paper: Speak Like a Native: Prompting Large Language Models in a Native Style

7. Diffusion-DPO: Aligning Diffusion Models with Human Preferences

LLMs are fine-tuned through Reinforcement Learning from Human Feedback (RLHF), leveraging human comparisons to align closely with user preferences. This research introduces Diffusion-DPO, a method that directly optimizes diffusion models based on human preference data without requiring reward models.

Diffusion-DPO adapts the recently developed direct preference optimization (DPO), a simpler alternative to RLHF that optimizes a policy directly on preference data through a classification-style objective. By reformulating DPO in terms of a diffusion model's notion of likelihood, the study derives a differentiable training objective. Using Pick-a-Pic, a dataset of 851K crowdsourced pairwise preferences, the authors fine-tuned the state-of-the-art Stable Diffusion XL (SDXL)-1.0 base model with Diffusion-DPO. In evaluations, the fine-tuned model outperformed both the original SDXL-1.0 and the larger SDXL-1.0 pipeline that adds a refinement model, improving visual appeal and prompt alignment.
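
For reference, the standard DPO objective that Diffusion-DPO adapts can be written as follows, with the current model as policy, a frozen reference model, a preferred/dispreferred image pair (y_w, y_l) for prompt x, and temperature beta. Because these image log-likelihoods are intractable for diffusion models, the paper approximates them with per-step terms over the denoising chain to obtain the differentiable objective mentioned above.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```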

Paper: Diffusion Model Alignment Using Direct Preference Optimization

8. LEO: An Embodied Generalist Agent in a 3D World

Despite advancements in machine learning models aimed at creating generalist agents, their limited understanding and interaction with 3D environments hinder real-world task execution and further advancements toward general intelligence.

This research presents LEO, a multimodal, multitask generalist agent designed for perception, grounding, reasoning, planning, and action in 3D environments. LEO shares an LLM-based architecture across tasks and is trained in two stages: 3D vision-language alignment followed by 3D vision-language-action instruction tuning. The training data cover object-level and scene-level multimodal tasks that require deep understanding of, and interaction with, 3D scenes. Extensive experiments showcase LEO's strong capabilities across tasks including 3D captioning, question answering, embodied reasoning, navigation, and robotic manipulation.
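
The two-stage recipe can be summarized schematically as below. Stage names follow the summary above; the dataset and module names are placeholders, and the choice of which components are trainable in each stage is an assumption, not LEO's documented configuration.

```python
# Schematic of the two training stages described above (placeholder names).
TRAINING_STAGES = [
    {   # Stage 1: 3D vision-language alignment
        "name": "3d_vision_language_alignment",
        "data": ["object_level_captioning", "scene_level_captioning"],
        "trainable": ["3d_encoder", "projector"],     # assumption: LLM frozen here
    },
    {   # Stage 2: 3D vision-language-action instruction tuning
        "name": "3d_vla_instruction_tuning",
        "data": ["3d_qa", "embodied_reasoning", "navigation", "manipulation"],
        "trainable": ["3d_encoder", "projector", "llm_adapters"],  # assumption
    },
]
```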

Paper: An Embodied Generalist Agent in 3D World
