Why LLMs Can't Surpass a 1970s Technique, Yet Remain Valuable

This year, the MIT Data to AI Lab explored the use of large language models (LLMs) for anomaly detection in time series data, a task traditionally handled by other machine learning (ML) tools. Anomaly detection is crucial across industries for monitoring heavy machinery and catching potential issues before they escalate. We designed a framework that uses LLMs for this task and compared their performance against ten other methods, ranging from state-of-the-art deep learning techniques to the classic autoregressive integrated moving average (ARIMA) model from the 1970s. Surprisingly, the LLMs underperformed most of the other models; even ARIMA outclassed them on seven of the eleven datasets.
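To make the comparison concrete, here is a minimal sketch of the kind of classical baseline ARIMA provides: fit the model to a signal, then flag points whose one-step-ahead prediction errors are extreme. The (2, 1, 2) order and the three-sigma threshold are illustrative assumptions for this example, not the settings used in our study.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_anomalies(signal, order=(2, 1, 2), n_sigmas=3.0):
    """Flag points whose one-step-ahead prediction error is extreme."""
    fit = ARIMA(signal, order=order).fit()
    resid = fit.resid  # one-step-ahead forecast errors
    threshold = n_sigmas * resid.std()
    return np.where(np.abs(resid) > threshold)[0]

# Toy example: a noisy sine wave with a spike injected at index 825.
t = np.linspace(0, 20 * np.pi, 1000)
signal = np.sin(t) + 0.05 * np.random.randn(1000)
signal[825] += 3.0
print(arima_anomalies(signal))  # should flag the injected spike
```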

For those who envision LLMs as all-encompassing solutions, these results may seem discouraging, though they simply reflect the current limitations of the technology. Still, two findings stood out. First, LLMs outperformed some of the models, including certain transformer-based deep learning methods, which took us by surprise. More importantly, LLMs performed anomaly detection zero-shot, meaning they operated without prior examples or any fine-tuning. Using GPT-3.5 and Mistral in their off-the-shelf forms, we showed that LLMs can detect anomalies without a specialized model being trained for each signal, which significantly streamlines the process.
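As a rough illustration of what zero-shot detection looks like in practice, the sketch below serializes a window of readings into a prompt and asks an off-the-shelf model to point out anomalous positions. The prompt wording, the two-decimal rounding, and the use of the OpenAI chat API are assumptions made for this example, not a description of the study's exact pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_anomalies(values, model="gpt-3.5-turbo"):
    series = ", ".join(f"{v:.2f}" for v in values)
    prompt = (
        "The following is a time series of sensor readings:\n"
        f"{series}\n"
        "Reply with the 0-based indices of any anomalous values "
        "as a comma-separated list, or 'none' if there are no anomalies."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for a monitoring task
    )
    return response.choices[0].message.content

print(llm_anomalies([0.1, 0.2, 0.1, 9.8, 0.2, 0.1]))  # expect '3'
```

Note that nothing is trained here: the same call works for any signal, which is exactly the property that makes the zero-shot result interesting.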

Current anomaly detection methods involve a two-step process of training and then deploying an ML model, which can be complex and cumbersome. Operators often lack experience with ML, which raises questions about retraining frequency, data input, and signal management; these barriers frequently stall the deployment of trained models. LLMs, by contrast, let operators control anomaly detection through simple API queries, adding or removing signals and toggling the service on or off without relying on other teams. This autonomy may facilitate broader adoption of LLMs in industrial settings.
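The following hypothetical sketch shows what that operator workflow could look like: signals are added or removed with plain function calls, the service can be toggled, and detection is one query per signal. The `AnomalyMonitor` class is invented for illustration and reuses the `llm_anomalies` function from the earlier sketch.

```python
class AnomalyMonitor:
    """Hypothetical operator-facing wrapper around an LLM detector."""

    def __init__(self, detect_fn):
        self.detect_fn = detect_fn  # e.g., llm_anomalies from above
        self.signals = {}           # signal name -> latest readings
        self.enabled = True

    def add_signal(self, name, readings):
        self.signals[name] = readings

    def remove_signal(self, name):
        self.signals.pop(name, None)

    def run(self):
        if not self.enabled:
            return {}
        # One API query per signal; no training or redeployment step.
        return {name: self.detect_fn(readings)
                for name, readings in self.signals.items()}

monitor = AnomalyMonitor(llm_anomalies)
monitor.add_signal("turbine_vibration", [0.1, 0.2, 0.1, 9.8, 0.2])
print(monitor.run())
```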

While LLMs have sparked a reevaluation of anomaly detection, they still lag behind state-of-the-art deep learning models, and even ARIMA, in performance. This gap may stem from our decision not to fine-tune the LLMs or to build a foundation model designed explicitly for time series. To improve anomaly detection accuracy while preserving the advantages inherent to LLMs, we must tread carefully.

This means we should avoid:

1. Fine-tuning existing LLMs for specific signals, as this would compromise their zero-shot capabilities.

2. Developing a foundation LLM for time series with a fine-tuning layer for each new type of machinery, as this would return us to the complexity of training a model for every signal.

For LLMs to hold their own in anomaly detection or other ML tasks, they must either enable a new way of performing the task or open up possibilities that would otherwise be out of reach. The AI community also needs to establish safeguards so that efforts to improve LLM performance do not erode these foundational benefits.

In classical ML, it took nearly two decades to establish robust practices such as train/test/validate splits. Even with those methods, it remains hard to guarantee that a model's test performance will carry over to deployment, because of issues like label leakage and data biases. To avoid sliding back into similarly convoluted practices, we must define clear parameters for enhancing LLM capabilities in anomaly detection.
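As one concrete example of such a practice, evaluation splits for time series have to respect time order: a random shuffle leaks future values into training. This minimal sketch, with illustrative 60/20/20 proportions, shows a leakage-safe chronological split.

```python
def chronological_split(values, train=0.6, val=0.2):
    """Split a series in time order so no future data leaks into training."""
    n = len(values)
    i, j = int(n * train), int(n * (train + val))
    return values[:i], values[i:j], values[j:]

train_set, val_set, test_set = chronological_split(list(range(10)))
print(train_set, val_set, test_set)  # [0..5] [6, 7] [8, 9]
```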

Kalyan Veeramachaneni is the director of the MIT Data to AI Lab and co-founder of DataCebo.

Sarah Alnegheimish is a researcher at the MIT Data to AI Lab.

