Fraud Detection with Hadoop - dummies

Fraud Detection with Hadoop

By Dirk deRoos

The sheer volume of transactions makes it harder to spot fraud because of the volume of data, ironically, this same challenge can help create better fraud predictive models — an area where Hadoop shines.

In today’s interconnected world, the sheer volume and complexity of transactions makes it harder than ever to find fraud. What used to be called “finding a needle in a haystack” has become the task of “finding a specific needle in stacks of needles.”

Traditional approaches to fraud prevention aren’t particularly efficient. For example, the management of improper payments is often managed by analysts auditing what amounts to a very small sample of claims paired with requesting medical documentation from targeted submitters. The industry term for this model is pay and chase: Claims are accepted and paid out and processes look for intentional or unintentional overpayments by way of post-payment review of those claims.

So how is fraud detection done now? Because of the limitations of traditional technologies, fraud models are built by sampling data and using the sample to build a set of fraud-prediction and -detection models. When you contrast this model with a Hadoop-anchored fraud department that uses the full data set — no sampling — to build out the models, you can see the difference.

The most common recurring theme you see across most Hadoop use cases is that it assists business in breaking through the glass ceiling on the volume and variety of data that can be incorporated into decision analytics. The more data you have (and the more history you store), the better your models can be.

Mixing nontraditional forms of data with your set of historical transactions can make your fraud models even more robust. For example, if a worker makes a worker’s compensation claim for a bad back from a slip-and-fall incident, having a pool of millions of patient outcome cases that detail treatment and length of recovery helps create a detection pattern for fraud.

As an example of how this model can work, imagine trying to find out whether patients in rural areas recover more slowly than those in urban areas. You can start by examining the proximity to physiotherapy services. Is there a pattern correlation between recovery times and geographical location?

If your fraud department determines that a certain injury takes three weeks of recovery but that a farmer with the same diagnosis lives one hour from a physiotherapist and the office worker has a practitioner in her office, that’s another variable to add to the fraud-detection pattern.

When you harvest social network data for claimants and find a patient who claims to be suffering from whiplash is boasting about completing the rugged series of endurance events known as Tough Mudder, it’s an example of mixing new kinds of data with traditional data forms to spot fraud.

If you want to kick your fraud-detection efforts into a higher gear, your organization can work to move away from market segment modeling and move toward at-transaction or at-person level modeling.

Quite simply, making a forecast based on a segment is helpful, but making a decision based on particular information about an individual transaction is (obviously) better. To do this, you work up a larger set of data than is conventionally possible in the traditional approach. Only (a maximum of) 30 percent of the available information that may be useful for fraud modeling is being used.

For creating fraud-detection models, Hadoop is well suited to

  • Handle volume: That means processing the full data set — no data sampling.

  • Manage new varieties of data: Examples are the inclusion of proximity-to-care-services and social circles to decorate the fraud model.

  • Maintain an agile environment: Enable different kinds of analysis and changes to existing models.

Fraud modelers can add and test new variables to the model without having to make a proposal to your database administrator team and then wait a couple of weeks to approve a schema change and place it into their environment.

This process is critical to fraud detection because dynamic environments commonly have cyclical fraud patterns that come and go in hours, days, or weeks. If the data used to identify or bolster new fraud-detection models isn’t available at a moment’s notice, by the time you discover these new patterns, it could be too late to prevent damage.

Evaluate the benefit to your business of not only building out more comprehensive models with more types of data but also being able to refresh and enhance those models faster than ever. The company that can refresh and enhance models daily will fare better than those that do it quarterly.

You may believe that this problem has a simple answer — just ask your CIO for operational expenditure (OPEX) and capital expenditure (CAPEX) approvals to accommodate more data to make better models and load the other 70 percent of the data into your decision models.

You may even believe that this investment will pay for itself with better fraud detection; however, the problem with this approach is the high up-front costs that need to be sunk into unknown data, where you don’t know whether it contains any truly valuable insight.

Sure, tripling the size of your data warehouse, for example, will give you more access to structured historical data to fine-tune your models, but they can’t accommodate social media bursts. Traditional technologies aren’t as agile, either. Hadoop makes it easy to introduce new variables into the model, and if they turn out not to yield improvements to the model, you can simply discard the data and move on.