ChatGPT is more accurate than human analysts at predicting whether a company's earnings will rise or fall, according to an academic paper, Accounting Today reported.
The large language model (LLM) tool was correct about 60 percent of the time, compared with about 53 percent for its human counterparts, according to the paper, published by academics at the University of Chicago's Booth School of Business, which used a ChatGPT model guided by prompts designed for financial analysis.
The researchers investigated whether an LLM can successfully perform financial statement analysis in a way similar to a professional human analyst. They provided standardized, anonymized financial statements to GPT-4, a version of ChatGPT released in March 2023, and instructed the model to analyze them to determine the direction of future earnings. Even without any narrative or industry-specific information, the researchers found that the LLM outperformed financial analysts in its ability to predict earnings changes, and it exhibited a relative advantage in the situations where human analysts tend to struggle.
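To make that setup concrete, the following is a minimal Python sketch of how an anonymized statement of this kind could be passed to GPT-4 through the OpenAI API. The prompt wording, figures and variable names here are illustrative assumptions, not the authors' actual materials.

```python
# A sketch of the task the paper describes: an anonymized, standardized
# statement is sent to GPT-4 with an instruction to predict the direction
# of next year's earnings. All strings and numbers below are made up for
# illustration; they are not the researchers' prompt or data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

statement = """
Financial statement extract (company and years anonymized):
  Revenue:          t-1: 4,120   t: 4,575
  Operating income: t-1:   512   t:   498
  Net income:       t-1:   301   t:   287
  Total assets:     t-1: 6,880   t: 7,145
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are a financial analyst. Analyze the statements "
                    "provided and answer only 'increase' or 'decrease'."},
        {"role": "user",
         "content": f"{statement}\n\nWill earnings increase or decrease "
                    "in year t+1?"},
    ],
)
print(response.choices[0].message.content)
```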
“Our results suggest that LLMs may take a central role in decision-making,” they wrote in their abstract.
The sample of financial statements spanned 1982 to 2021 and comprised 39,533 firm-year observations from 3,152 distinct firms. The researchers' target variable across all models was the direction of the change in future earnings, up or down. They tested the model's ability to predict the next year's earnings one month after earnings reports, as well as three and six months after.
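As a rough illustration of that target variable, here is how a directional label could be built from firm-year panel data using pandas; the column names and sample values are hypothetical, not drawn from the paper.

```python
# Construct a binary "earnings up next year" label from panel data.
# Column names ('firm_id', 'year', 'earnings') are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "firm_id":  ["A", "A", "A", "B", "B"],
    "year":     [2019, 2020, 2021, 2020, 2021],
    "earnings": [301, 287, 340, 88, 95],
})

df = df.sort_values(["firm_id", "year"])
# Next year's earnings for the same firm
df["next_earnings"] = df.groupby("firm_id")["earnings"].shift(-1)
# 1 if earnings rise next year, 0 if they fall
df["target_up"] = (df["next_earnings"] > df["earnings"]).astype("float")
# The last observed year per firm has no label
df.loc[df["next_earnings"].isna(), "target_up"] = float("nan")
print(df)
```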
For the human predictions, the researchers used consensus analyst forecasts, measured at the same one-, three- and six-month intervals after earnings reports. They found that human analysts were accurate 52.71 percent of the time one month later, 55.95 percent three months later, and 56.68 percent six months later.
ChatGPT scored only 49 percent without chain-of-thought prompting, which TechTarget defines as "a prompt engineering technique that aims to improve language models' performance on tasks requiring logic, calculation and decision-making by structuring the input prompt in a way that mimics human reasoning." The paper refers to this configuration as the "naive" model. With chain-of-thought prompting, accuracy rose to 60.31 percent, well above that of human analysts and on par with the 60.45 percent accuracy of a specialized neural network trained on the same data.
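The difference between the two setups comes down to the prompt. The sketch below contrasts the two styles, assuming the chain-of-thought version walks the model through analyst-style steps; neither string is the authors' exact wording.

```python
# Illustrative contrast between a "naive" prompt and a chain-of-thought
# prompt; both strings are paraphrases, not the paper's actual prompts.

naive_prompt = (
    "Analyze the financial statements below and state whether earnings "
    "will increase or decrease next year."
)

# Chain-of-thought: the prompt spells out intermediate reasoning steps
# (trend identification, ratio computation, interpretation) before asking
# for the final prediction.
cot_prompt = (
    "Analyze the financial statements below step by step: "
    "(1) identify notable trends in the line items, "
    "(2) compute key financial ratios, "
    "(3) interpret what those trends and ratios imply about performance, "
    "and (4) only then state whether earnings will increase or decrease "
    "next year, explaining your reasoning before the final answer."
)
```

Either string would slot into the user message of the earlier API sketch; only the instruction changes, not the financial data.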
"Although one must interpret our results with caution, we provide evidence consistent with large language models having human-like capabilities in the financial domain," the study concluded. "General purpose language models successfully perform a task that typically requires human expertise and judgment and do so based on data exclusively from the numeric domain. Therefore, our findings indicate the potential for LLMs to democratize financial information processing and should be of interest to investors and regulators."