Nous Research, a private applied research group recognized for its contributions to the large language model (LLM) field, has introduced a new vision-language model called Nous Hermes 2 Vision, available on Hugging Face.
This open-source model builds on the earlier OpenHermes-2.5-Mistral-7B and extends its capabilities by allowing users to input images and extract text information from visual content. However, shortly after its launch, users reported excessive hallucination issues, prompting the company to rebrand the project as Hermes 2 Vision Alpha. A more stable version with fewer glitches is expected soon.
Nous Hermes 2 Vision Alpha
Named after the Greek messenger of the gods, Hermes, this vision model is crafted to navigate the complexities of human discourse with remarkable precision. It integrates the visual data provided by users with its learned knowledge, enabling it to deliver detailed, natural language responses. For example, the co-founder of Nous, known as Teknium on X, shared a screenshot demonstrating the model's ability to analyze an image of a burger, assessing its health implications.
Distinct Features of Nous Hermes 2 Vision
While ChatGPT, powered by GPT-4V, also supports image prompting, Nous Hermes 2 Vision sets itself apart with two primary enhancements:
1. Lightweight Architecture: Instead of relying on traditional 3B vision encoders, Nous Hermes 2 Vision employs SigLIP-400M. This not only simplifies the model's architecture, making it lighter, but also enhances performance on vision-language tasks.
2. Function Calling Capability: The model has been trained on a custom dataset featuring function calling, so users can supply a function schema in their prompt and receive structured output in return (see the sketch after this list). The model was also trained on additional datasets, including LVIS-INSTRUCT4V, ShareGPT4V, and dialogues from OpenHermes-2.5.
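To make the function-calling idea concrete, below is a minimal sketch of how such a prompt could be assembled. The ChatML layout is inherited from the OpenHermes-2.5 lineage, but the `<fn_call>` tag, the `extract_food_info` schema, and the `<image>` placeholder are illustrative assumptions, not the model's documented interface; check the Hugging Face model card for the exact format.

```python
# Sketch: building a function-calling prompt for Nous Hermes 2 Vision Alpha.
# The <fn_call> tag and schema below are assumptions for illustration only.
import json

# Hypothetical schema asking the model to return structured facts about an image.
schema = {
    "name": "extract_food_info",
    "description": "Extract nutritional details from a food image.",
    "parameters": {
        "type": "object",
        "properties": {
            "dish": {"type": "string"},
            "estimated_calories": {"type": "integer"},
            "is_healthy": {"type": "boolean"},
        },
        "required": ["dish", "is_healthy"],
    },
}

# ChatML-style prompt; "<image>" marks where the model's own preprocessing
# would splice in the SigLIP vision embeddings.
prompt = (
    "<|im_start|>system\n"
    "You are a helpful vision assistant. Respond only with a JSON object "
    "that matches the provided schema.<|im_end|>\n"
    "<|im_start|>user\n"
    "<image>\n"
    f"<fn_call>{json.dumps(schema)}</fn_call><|im_end|>\n"
    "<|im_start|>assistant\n"
)

print(prompt)
```

The resulting string would then be passed to the model's generation pipeline alongside the image; as Nguyen notes later in this piece, results depend heavily on providing a clear schema.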
Challenges Ahead
While Nous Hermes 2 Vision is available for research and development, early feedback indicates that it still has significant issues. Following its release, co-founder Quan Nguyen acknowledged problems related to hallucinations and the model's tendency to generate excessive EOS tokens, leading to its alpha designation.
“I see people talking about ‘hallucinations,’ and yes, the situation is concerning. I was aware of this since the underlying LLM is uncensored. I plan to release an updated version by the end of the month to address these issues,” Nguyen wrote on X.
Further inquiries about the model's problems went unanswered at the time of writing. However, Nguyen mentioned that the function calling feature performs well when users provide a clear schema, and indicated that he might develop a dedicated function-calling model based on user feedback.
To date, Nous Research has released 41 open-source models within its Hermes, YaRN, Capybara, Puffin, and Obsidian series, showcasing a variety of architectures and capabilities.