
DEALBREAKER DATA: Why AI models should be more about data than algorithms

Artificial intelligence (AI) and the associated technology of machine learning (ML) have long since arrived in the insurance industry, and their enormous potential is on everyone’s lips. Feats that once seemed impossible – such as fully automatic image recognition – now appear within reach. Applying AI to a relevant set of claim types and classes in day-to-day business, however, is often a completely different matter. Nevertheless, interest in these supposedly new technologies is great – and rightly so. This is evidenced not only by the wealth of articles dominating the insurance industry’s trade journals; AI is also on the agenda of the sector’s relevant industry events. But in which areas do AI-based models already work successfully today, and above all: what does it take to put them into operation? An article by Michael Rodenberg, Managing Director of Eucon Digital GmbH.


The financial and insurance industry is already among the “digitally advanced” sectors, and a further digitalization surge is expected in the coming years. This is also suggested by investments in flexible IT infrastructures, which are needed to tackle digital transformation. At €4.7bn, insurers’ IT expenditure in 2018 was higher than ever before, with investments made primarily in application and system development, including the replacement of legacy systems, and in server and cloud solutions. The way for digital transformation is thus being paved at this very moment, with AI seen as the biggest driver. Experts have particularly high expectations of AI, anticipating positive effects on competitiveness, flexibility, product quality and quality of work as well as on productivity and efficiency. AI unlocks enormous potential in the insurance industry as well. In claims management, AI-based models – especially machine learning – are already performing reliably today and will improve the customer experience in the long run.


If the current hype leads you to believe that Machine Learning is a young technology, you are sorely mistaken. ML has already been used for decades in special application areas such as Optical Character Recognition (OCR). The first ML application that became known to a broad public was the spam filter in the 1990s. Today, ML is at the core of many cutting-edge technologies and performs well, for example, in smartphone speech recognition or the ranking of search results on the Internet.

But what can ML contribute to the insurance industry? Enormous potential can be leveraged, for example, with AI-based models for process automation. The so-called “OK case prediction” in claims management in particular ensures accelerated processes and more efficient claims handling, allowing processing times to be reduced by up to 80% by minimizing the need for manual checking. The manual processing of claims documents costs time and money and is not required for every claim. The decisive – and often difficult – question, however, is for which claims the check by claims handlers, appraisers or relevant service providers can be “skipped”. If the wrong proportion of cases is examined manually, the insurer leaves savings potential untapped. Intelligent claims prediction offers a solution to this problem. With the help of machine learning algorithms, the probability that a manual check is worthwhile – or that the claim can even be approved automatically – can be determined at an early stage. Specialists can concentrate on the complex cases, customers get feedback faster, and all subsequent processes are accelerated as well.
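The routing logic behind such a prediction can be sketched in a few lines. This is a purely illustrative example assuming a trained model that returns a probability; the function names, the threshold and the toy scoring rule are invented for this sketch and are not Eucon’s actual implementation.

```python
# Hypothetical sketch of routing claims via an "OK case" probability.
# THRESHOLD and score_claim are illustrative assumptions, not a real API.

THRESHOLD = 0.95  # assumed confidence required for automatic approval

def score_claim(claim: dict) -> float:
    """Stand-in for a trained ML model returning P(claim is an OK case).
    A toy heuristic is used here so the example runs without a real model."""
    base = 0.99 if claim["damage_type"] == "tap_water" else 0.7
    # Higher claim amounts reduce confidence in automatic approval.
    return max(0.0, base - claim["amount_eur"] / 100_000)

def route(claim: dict) -> str:
    """Approve automatically only when the predicted probability is high."""
    return "auto_approve" if score_claim(claim) >= THRESHOLD else "manual_review"

claims = [
    {"damage_type": "tap_water", "amount_eur": 800},
    {"damage_type": "storm", "amount_eur": 25_000},
]
decisions = [route(c) for c in claims]
```

The point of the threshold is exactly the trade-off described above: lowering it automates more cases but risks approving claims that a specialist should have seen.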

But how do AI-based models work? What is the basis for a successful AI-based model like the “OK case prediction”? Unfortunately, software that contains these algorithms is not sufficient for productive use in day-to-day business.


Machine Learning is essentially about two things: the learning algorithm and the data used to train it. Possible sources of error here are either selecting a bad algorithm or using bad data.

When asked about the importance of the algorithm compared to data, researchers Banko and Brill obtained impressive results at Microsoft in 2001. They showed that for a complex task such as natural language disambiguation, the machine learning algorithm matters less than the data: different, sometimes very simple algorithms produced similarly good results, provided they were trained with sufficient data. “However, these results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development.” Halevy, Norvig and Pereira made a similar case for the primacy of data in their 2009 article “The Unreasonable Effectiveness of Data”.


Data are crucial for the performance of AI-based models. However, not all data are equal. To train an ML model successfully, data must be available in the appropriate quality. An AI-based model learns from examples and data from completed cases, which is why historical data are needed for training. The OK case prediction, for example, uses claims that were previously assessed by experts during a manual check. In order to use these claims as a training set for the model, a technical understanding of the manual review process is required. Data scientists gain this understanding by talking to the reviewing experts. Understanding and correctly interpreting data are key prerequisites for deciding which data to include in the training set so as to create a high-performing model. In some special areas it is already possible for models to retrain themselves on the latest data; even then, data scientists are needed to evaluate the model and, where necessary, to adapt or develop it further manually.

In addition to the training data used for model creation and training, independent validation and test data are required to evaluate the model. In general, two factors are decisive for the suitability of the data: data quantity and data quality.
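The separation into independent training, validation and test data can be sketched as follows; the 70/15/15 proportions are an illustrative assumption, not a prescribed split.

```python
# Minimal sketch of a train/validation/test split using only the
# standard library. Proportions (70/15/15) are illustrative.
import random

def split_dataset(records, seed=42, train=0.7, validation=0.15):
    """Shuffle historical claims and split them into three independent sets."""
    rng = random.Random(seed)        # fixed seed for a reproducible split
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * validation)
    return (shuffled[:n_train],                 # used to fit the model
            shuffled[n_train:n_train + n_val],  # used for tuning/validation
            shuffled[n_train + n_val:])         # held out for final evaluation

train_set, val_set, test_set = split_dataset(list(range(1000)))
```

Keeping the test set untouched until the very end is what makes the final evaluation an honest estimate of how the model will perform on unseen claims.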


How much data is required for a meaningful and stable AI model? Many would like a simple answer to this, but unfortunately there is no general rule of thumb. Each case is individual. The amount of data required depends on various factors such as the data quality and the complexity of the task the model is supposed to map. Even simple tasks require thousands of data sets for machine learning methods to work reliably. For more complex problems, such as image or speech recognition, millions of examples may be required. In addition, a sufficiently large amount of data must be available not only for model development and validation but also for the ongoing evaluation of the model. This is because model performance decreases over time, making quality assurance indispensable during productive use of the AI model. The reason lies in changing environmental conditions and thus changing loss patterns: “Since the models are trained with historical data, each model begins to age immediately after creation and deviates more and more from reality as it ages. This means that model performance must be continuously evaluated in order to detect changes – and then intervene,” says Dr. Antje Fitzner, Data Scientist at Eucon Digital. Cases already assessed by the model cannot be used for its continuous further development: using them would amplify existing errors, as the model would increasingly confirm itself. A steady supply of new data is therefore needed to keep the model up to date and adapt it to changes.
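Continuous evaluation of a productive model can be as simple as comparing current accuracy on freshly verified cases against the accuracy measured at deployment. A minimal sketch, in which the baseline accuracy, the tolerance band and the sample labels are all illustrative assumptions:

```python
# Illustrative drift check: compare accuracy on recently verified cases
# against the accuracy at deployment time. Baseline and tolerance are
# invented for this sketch, not production settings.

def detect_drift(predictions, ground_truth, baseline=0.95, tolerance=0.03):
    """Return (drift flag, current accuracy) for a batch of verified cases."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    accuracy = correct / len(predictions)
    return accuracy < baseline - tolerance, accuracy

drifted, acc = detect_drift(
    predictions=["ok", "ok", "check", "ok", "check",
                 "ok", "ok", "ok", "ok", "ok"],
    ground_truth=["ok", "check", "check", "ok", "check",
                  "ok", "ok", "check", "ok", "ok"],
)
```

In this toy batch only 8 of 10 predictions match the experts’ verdicts, so the check flags drift and a retraining or manual intervention would be triggered.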

Only a prior analysis can answer the question of how much data is needed, though. By analyzing the performance of a model as the amount of training data decreases, it is possible to evaluate the minimum amount of data required to obtain a reliable model. “When conducting this test for one of our models for the OK case prediction, we found that the performance remained relatively constant up to a reduction to about 30% of the initial data volume. So a third of the data volume was already sufficient for a reliable model,” reports Janera Kronsbein, Product Manager for AI-based solutions at Eucon Digital. This outcome was possible thanks to the excellent data situation: “Given our history as a digitalization partner in claims management, we can incorporate millions of data records in the ML model of the OK case prediction. The database is updated daily with new data.”
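The reduction test described above can be sketched as a simple learning-curve analysis: re-score the model at shrinking fractions of the training data and find the smallest fraction whose score stays close to the full-data score. Since no real model is available here, a synthetic saturating curve stands in for the retraining step; its shape and the one-point tolerance are assumptions chosen so the example mirrors the ~30% plateau reported above.

```python
# Sketch of a data-reduction analysis. The saturating curve below is
# synthetic; in practice each point would come from actually retraining
# the model on that fraction of the data.
import math

def simulated_score(fraction: float) -> float:
    """Toy stand-in for retraining and scoring at a given data fraction:
    a curve that plateaus quickly, mimicking the behavior reported above."""
    return round(0.97 * (1 - math.exp(-20 * fraction)), 3)

fractions = [1.0, 0.7, 0.5, 0.3, 0.1]
curve = {f: simulated_score(f) for f in fractions}

# Smallest fraction whose score stays within one point of the full-data score:
full_score = curve[1.0]
minimal_fraction = min(f for f in fractions if full_score - curve[f] <= 0.01)
```

With this synthetic curve the score only drops noticeably below 30% of the data, so `minimal_fraction` lands at 0.3 – the kind of result the analysis above produced for the OK case prediction.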

Insurers too sometimes have large amounts of data, but often the information is not available in the quality required for AI-based models. And this brings us to the next important factor: data quality.


Even the largest data pool is useless if the quality is not right. The quality of the available data is key to the machine learning process, since the AI system learns from these data and examples. AI has to use the data to identify similarities that apply to OK cases, for instance. Only when the system has reliably learned these similarities can it automatically process new cases in the future.

Some insurers have entered data into their systems that are hardly usable. Others have large amounts of data in data lakes or warehouses. However, these are often not immediately usable for training AI models – at least not without prior data preparation or a professional extraction of technical data. Almost one in three insurance companies rates the quality of their captured and processed customer data as low or rather low. Above all, incomplete and duplicate customer data is a persistent problem. However, complete, structured and accurate (i.e. error-free) data sets are a key requirement for the application of AI, because the model blindly relies on the quality and accuracy of the data used for training.
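Basic quality gates of this kind – completeness checks and duplicate removal – can be sketched as follows; the field names and sample records are illustrative.

```python
# Illustrative pre-training quality gates: reject incomplete records and
# collapse duplicates. Field names are invented for this sketch.

REQUIRED = {"claim_id", "damage_type", "amount_eur"}

def quality_filter(records):
    """Keep only complete records and drop repeated claim IDs."""
    seen, clean = set(), []
    for rec in records:
        if not REQUIRED <= rec.keys():                      # missing fields
            continue
        if any(rec[f] in (None, "") for f in REQUIRED):     # empty fields
            continue
        if rec["claim_id"] in seen:                         # duplicate
            continue
        seen.add(rec["claim_id"])
        clean.append(rec)
    return clean

raw = [
    {"claim_id": 1, "damage_type": "storm", "amount_eur": 1200},
    {"claim_id": 1, "damage_type": "storm", "amount_eur": 1200},  # duplicate
    {"claim_id": 2, "damage_type": "tap_water"},                  # incomplete
]
cleaned = quality_filter(raw)
```

Of the three raw records, only the first survives – exactly the kind of shrinkage insurers see when incomplete and duplicate customer data are filtered out before training.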

Yet representativeness and age of the data are also important when it comes to data quality.

To begin with representativeness: the training data used must cover the full, broad spectrum of data and processes. Relevant characteristics in the context of the OK case prediction are, for example, the type of damage (tap water, storm, etc.) or the claim amount (in €). In the above example, data variety is ensured by the large number of losses from different areas, so that all factors are represented.

Now to the age of the data: The question as to how old data sets may be in order to guarantee an up-to-date model must also be examined beforehand in pretesting. The impact of age is influenced by various factors such as the dependence of the input data on general and seasonal trends. For the concrete use case of OK case prediction, no data sets older than 3-5 years are used. “Our analyses have shown that by including several cycles, we compensate for short-term fluctuations and at the same time map the most recent developments for long-term trends,” says Janera Kronsbein.
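The age cut-off can be sketched as a simple filter; the five-year bound is the upper end of the window mentioned above, while the concrete dates are invented for the example.

```python
# Illustrative age filter for training records. MAX_AGE_YEARS reflects the
# upper bound of the 3-5 year window mentioned above; dates are invented.
from datetime import date

MAX_AGE_YEARS = 5

def within_window(claim_date: date, today: date) -> bool:
    """True if the record is recent enough to enter the training set."""
    return (today - claim_date).days <= MAX_AGE_YEARS * 365

today = date(2020, 6, 1)
claim_dates = [date(2019, 11, 2), date(2016, 7, 30), date(2012, 4, 5)]
usable = [d for d in claim_dates if within_window(d, today)]
```

The 2012 claim falls outside the window and is excluded; the window still spans several yearly cycles, which is what compensates for seasonal fluctuations.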


AI-based models can achieve a great deal using machine learning algorithms. In claims management, they ensure more efficient claims handling and accelerated processes. A caveat: the results achieved are only as good as the data used to train a model. Apart from an understanding of data and its correct interpretation, one thing is particularly important here: a large amount of high-quality data.


This article was first published in German in issue no. 3/2020 "Schadenmanagement mittels Daten und KI" of the "Themendossier" industry magazine of Versicherungsforen Leipzig.

Written by Eucon Digital GmbH