By Igor Nikitin, ASA, MAAA
CEO, Nice Technologies LLC
This article explores the current state of AI and argues that cost-effective replacement of traditional actuarial models with AI is unlikely in the next decade, given the need for better algorithms and computing hardware.
Several people have asked whether AI will replace traditional actuarial modeling. Will we be able to ask a ChatGPT-like tool to run a quarterly valuation or prepare an annual statutory report? In short, I don’t think so for the next decade. AI will certainly make modeling more efficient, but it will not replace traditional modeling in the foreseeable future. Here are the reasons:
- AI models are inherently less cost-effective.
Traditional models calculate exactly what is needed, while AI models will always have overhead. In the current generation of large language models (LLMs), that overhead is very substantial. Although it may shrink with technological advances, it will likely remain non-trivial. We can measure this efficiency gap in terms of computing costs.
Suppose a traditional model evaluates 1 million policies and completes a run in 1 hour on a high-end machine. The hourly cost of a top-tier, compute-optimized machine on Amazon Web Services (AWS) is about $10. This gets us a very impressive 192 cores, 384 GiB of memory, and 50 Gbps of network bandwidth. One hour on such an AWS instance should easily handle a million policies. Thus, our hypothetical traditional model costs about $10 per run.
Instance name   On-Demand hourly rate   vCPU   Memory    Storage    Network performance
c7a.48xlarge    $9.85344                192    384 GiB   EBS Only   50,000 Megabit
Source: https://aws.amazon.com/ec2/pricing/on-demand/

Now, let’s consider the cost of ChatGPT as a proxy for an AI-based modeling approach. ChatGPT charges per million tokens of input and output. A token roughly equals one word or a small integer. It’s worth noting that floating point numbers typically count as three tokens. The table below summarizes the costs for the best and most cost-effective models as of November 2024:
ChatGPT Batch API pricing as of 11/21/2024
                                  Best model GPT-4o   Most cost-effective model GPT-4o mini
Cost per million input tokens     $1.25               $0.075
Cost per million output tokens    $5.00               $0.30
Source: https://openai.com/api/pricing/

Let’s ignore all current limitations and assume ChatGPT can take all inputs, produce the required outputs, and do it perfectly — unrealistic, but useful for this cost comparison. We will base the comparison on the best model and the most cost-effective model available today.
- Cost of input.
A million policies with 30 fields each, stored as a CSV, would be roughly 120 million tokens:
(3 tokens per floating point number x 30 fields per policy + 29 comma separators) x 1 million policies ≈ 120 million tokens
A 100MB assumptions file stored as a CSV would be roughly 25 million tokens. Since 100MB equals approximately 100 million characters, and ChatGPT averages about 4 characters per token, we have:
100 million characters / 4 characters per token = 25 million tokens
Therefore, the total input (the inforce file plus the assumptions, about 145 million tokens) would cost roughly $181 with the best model or $10.80 with the most cost-effective model.
Cost of input                 Best model GPT-4o   Most cost-effective model GPT-4o mini
145 million input tokens      $181                $10.80
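The input-token arithmetic above can be sketched in a few lines of Python. This is a minimal illustration of the article's approximations (3 tokens per floating point number, roughly 4 characters per token), not output from a real tokenizer:

```python
# Rough sketch of the input-token estimate; token counts use the article's
# approximations, not a real tokenizer.
TOKENS_PER_FLOAT = 3

def inforce_tokens(n_policies: int, n_fields: int) -> int:
    # Each CSV row: n_fields floats plus (n_fields - 1) comma separators.
    return (TOKENS_PER_FLOAT * n_fields + (n_fields - 1)) * n_policies

def csv_tokens(n_bytes: int, chars_per_token: int = 4) -> int:
    # ~4 characters per token is a common rule of thumb for English text.
    return n_bytes // chars_per_token

policies = inforce_tokens(1_000_000, 30)   # 119,000,000 tokens (~120M)
assumptions = csv_tokens(100_000_000)      # 25,000,000 tokens
total_input = policies + assumptions       # ~145 million tokens

# Batch API input prices per million tokens (as of 11/21/2024)
for name, price in [("GPT-4o", 1.25), ("GPT-4o mini", 0.075)]:
    print(f"{name}: ${total_input / 1e6 * price:,.2f}")
```

Small differences from the table figures reflect rounding the inforce file to 120 million tokens.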
- Cost of output.
Now, consider model outputs. If we want only a summary of results — for example, 10 monthly vectors over 50 years — this would be roughly 24,000 tokens:
(50 years x 12 months x 3 tokens per floating point number + 599 separators) x 10 vectors = 24,000 tokens
If we want one number per policy (e.g., a reserve), that would be about 3 million tokens:
3 tokens per floating point number x 1 million policies = 3 million tokens
If we want one monthly vector of 50 years per policy (e.g., expected cash flows), that would be 2.4 billion tokens:
(50 years x 12 months x 3 tokens per floating point number + 599 separators) x 1 million policies = 2.4 billion tokens
The table below translates the output tokens into the cost of output.
Cost of output                             Best model GPT-4o   Most cost-effective model GPT-4o mini
10 monthly vectors for 50 years            $0.12               $0.0072
1 floating point number per policy         $15                 $0.90
1 monthly vector for 50 years per policy   $12,000             $720
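The output-token estimates follow the same pattern. A minimal Python sketch, using the same 3-tokens-per-float approximation as for the inputs:

```python
# Sketch of the output-token estimates above (approximate token counts).
TOKENS_PER_FLOAT = 3
MONTHS = 50 * 12  # 600 monthly values per vector

def vector_tokens(n_vectors: int) -> int:
    # Each vector: 600 floats plus 599 separators.
    return (TOKENS_PER_FLOAT * MONTHS + (MONTHS - 1)) * n_vectors

summary = vector_tokens(10)                     # 23,990 tokens (~24,000)
one_number_each = TOKENS_PER_FLOAT * 1_000_000  # 3,000,000 tokens
one_vector_each = vector_tokens(1_000_000)      # ~2.4 billion tokens

# Batch API output prices per million tokens (as of 11/21/2024)
for name, price in [("GPT-4o", 5.00), ("GPT-4o mini", 0.30)]:
    print(f"{name}: ${one_vector_each / 1e6 * price:,.0f} per run")
```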
- Cost of intermediate calculations.
A 2023 paper from the University of Maryland shows that “chain-of-thought” processing significantly improves LLM accuracy in mathematical tasks and appears to be part of more recent ChatGPT versions.1 This means we must allow the LLM to produce intermediate steps, which we must pay for. In actuarial modeling, intermediate results are often monthly vectors of floating point numbers.
From our calculations, one vector per policy, for 1 million policies, costs between $720 and $12,000. A typical policy calculation might need between 50 and 1,000 intermediate vectors, depending on benefit optionality. For reserve projections or economic scenarios, the number of intermediate vectors could grow dramatically. Thus, per run, we should add between $36,000 and $12 million for intermediate reasoning costs.
Cost of intermediate output   Best model GPT-4o   Most cost-effective model GPT-4o mini
50 vectors per policy         $600,000            $36,000
1,000 vectors per policy      $12,000,000         $720,000

For example: 50 vectors x $720 per vector = $36,000, and 1,000 vectors x $12,000 per vector = $12,000,000.

Now let’s put together a total cost per model run, assuming we aggregate results into 10 monthly vectors:
Total cost per model run with 1 million policies   Traditional model on best AWS instance   Best model GPT-4o   Most cost-effective model GPT-4o mini
50 intermediate vectors per policy                 $10                                      $600,181            $36,011
1,000 intermediate vectors per policy              $10                                      $12,000,181         $720,011

This analysis shows that the same model can run for $10 using traditional techniques, compared to anywhere between $36,000 and $12 million using LLMs. AI will always be significantly more expensive than traditional actuarial modeling on a per-run basis, and as of today the cost difference is enormous. By comparison, employing a skilled modeling actuary at an annual salary of $100,000 to $300,000 is far more cost-effective than relying on an LLM-based approach in the current environment.
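The total-cost arithmetic can be assembled in a short Python sketch. Prices and per-vector costs come from the earlier tables; small rounding differences from the table figures are expected:

```python
# Rough total cost per run, assuming results are aggregated into
# 10 monthly vectors (inputs come from the earlier estimates).
INPUT_TOKENS = 145_000_000   # inforce file plus assumptions
SUMMARY_TOKENS = 24_000      # 10 monthly vectors over 50 years
PRICES = {"GPT-4o": (1.25, 5.00), "GPT-4o mini": (0.075, 0.30)}  # $/M in, out
VECTOR_COST = {"GPT-4o": 12_000.0, "GPT-4o mini": 720.0}  # per 1M vectors

def run_cost(model: str, vectors_per_policy: int) -> float:
    in_price, out_price = PRICES[model]
    cost = INPUT_TOKENS / 1e6 * in_price             # feeding the data in
    cost += SUMMARY_TOKENS / 1e6 * out_price         # final summary out
    cost += vectors_per_policy * VECTOR_COST[model]  # intermediate reasoning
    return cost

for model in PRICES:
    for n in (50, 1_000):
        print(f"{model}, {n} vectors/policy: ${run_cost(model, n):,.0f}")
```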
- Explaining AI output requires a traditional model.
AI outputs for complex tasks like actuarial modeling are hard to verify due to their opacity and the sheer volume of calculations. How can you be sure the AI made no mistakes? You’d want to see the formulas and the transformations from input to output. This is exactly what a traditional model provides. But if you need a traditional model to verify the AI results, why use the AI model at all, given its high costs?
This problem will be solved when society and the actuarial profession become comfortable fully trusting AI. For better or worse, I don’t anticipate this happening in the foreseeable future.
- There are significant technical hurdles to overcome.
- Current LLMs have insufficient context awareness.
Our analysis so far has ignored context limitations for simplicity, but it’s a major issue. The current generation of LLMs can handle a context of about 128,000 tokens, or roughly 43,000 floating point numbers. Everything (inputs and outputs) must fit within this limit, or the model will produce results without full awareness of the data.
Recall that our 1 million policy inforce file was about 120 million tokens. We would need to split this into at least 900 pieces (about 1,000 policies at a time) just to feed it into the LLM. Realistically, we’d need even smaller chunks to fit assumptions and outputs as well.
A simple mortality improvement assumption could also approach the context limit. For instance, if we cover 20 calendar years for ages 20 through 100, with two genders, four income bands, and three occupations, that’s just under 40,000 floating point numbers—nearly the entire context capacity of the best LLMs today.
81 ages x 20 improvement calendar years x 2 genders x 4 income bands x 3 occupation types = 38,880 floating point numbers
To perform a valuation-type run, an LLM would need a much larger context or be limited to evaluating policies in very small groups, perhaps 1-10.
To handle the entire 1 million inforce at once, we would need a context of around 2–2.5 billion tokens, or 15,000–20,000 times larger than today’s best model. The complexity of LLM processing grows roughly with the square of the context size. Doubling the context quadruples complexity, so increasing it 15,000 times would be about 225 million times more complex — and likely 225 million times more costly.
Currently, the most promising approach to addressing this issue is Retrieval-Augmented Generation (RAG). This method dynamically retrieves only the relevant data — such as evaluating 10 policies at a time and retrieving only the assumptions specific to those policies. While RAG reduces the number of input tokens and alleviates context size limitations, it does not reduce the number of output tokens and therefore has little effect on total run cost. Moreover, it introduces significant complexity in terms of chunking data, retrieving assumptions, and aggregating results. In essence, RAG solves the context size issue by introducing significant engineering costs to develop and maintain the infrastructure required for its implementation.
Other approaches are being explored to handle large context sizes, but none promise the orders-of-magnitude improvements we need. Radically better hardware and algorithms will be essential to achieve the desired “just ask ChatGPT” solution. It may take something like quantum computing, which is expected to be practical by 2040 “if all goes well,” and that is a big “if.”
- The current generation of LLMs struggles with complex mathematics.
The current generation of LLMs is not great at mathematical problems. A 2024 paper by Apple demonstrates “the performance of LLMs deteriorating as question complexity increases.”2 There are attempts to address this issue by pairing an LLM with a formal programming language, such as Google DeepMind’s AlphaProof.3 AlphaProof recently solved 4 out of 6 International Mathematical Olympiad problems, but it uses a more complex iterative training approach, which would greatly increase operating costs while still not guaranteeing perfect answers.
Conclusion
To cost-effectively replace traditional actuarial modeling with AI, the AI would need to reliably produce results at a cost lower than both the salaries of actuarial modeling staff and traditional cloud computing expenses. This is extremely unlikely within the next decade, as it requires major improvements in AI algorithms and computing power. For the foreseeable future, traditional models developed with AI assistance remain the most practical and cost-effective approach to actuarial modeling.
References
1. Chen, J., Chen, L., Huang, H., & Zhou, T. (2023). When do you need Chain-of-Thought Prompting for ChatGPT? [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2304.03262
2. Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2024). GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2410.05229
3. Google DeepMind. (2024, July 25). AI achieves silver-medal standard solving International Mathematical Olympiad problems. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/
© 2025 Nice Technologies LLC. All Rights Reserved.