The Geometry of Fraud Detection

Fraud detection is about finding data points that fall outside the normal range of probability, often called outliers. In this blog, I look at some simple statistical ways to find outliers and try to visualize what an outlier looks like when a virtual area or virtual volume is used as its geometric representation. An understanding of these fundamental statistical concepts will lead to better fraud detection.

Probability

Let’s look at some basic mathematical concepts behind outliers before discussing the main points of this article. The most basic question to ask about an outlier is: what is the probability that it will occur? Let’s take a look at a near-normal distribution using a sample bell curve.

Note: Credit for the area under the bell curve graph goes to Josh Starmer of StatQuest.

The middle of the bell curve represents the mean of all events, and the curve slopes downward as values move more standard deviations away from the mean. The area under the curve over a range of values is the probability that an event falls within that range. Notice that as we move away from the colorful area, the probability that an event occurs there shrinks. Events in these tails are outliers, as their probability of occurring is a small percentage compared to the higher percentage in the colorful area. If this bell curve represents possible values for fraud detection, it is the outliers that are most likely to be flagged as possible fraud.
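
To put rough numbers on how quickly the tails thin out, here is a minimal Python sketch. It assumes a perfectly normal distribution, which real transaction data only approximates, and the function names are my own for illustration.

```python
import math

def normal_cdf(z: float) -> float:
    """Probability that a standard normal variable is <= z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_tail(k: float) -> float:
    """Probability of landing more than k standard deviations from the mean."""
    return 2.0 * (1.0 - normal_cdf(k))

# The area left in both tails shrinks quickly as k grows.
for k in (1, 2, 3):
    print(f"beyond {k} standard deviations: {two_sided_tail(k):.4f}")
```

Under that assumption, roughly 32% of events fall more than one standard deviation from the mean, about 4.6% more than two, and about 0.3% more than three.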

Percentiles

We can build upon the outlier use case and use percentiles to look for indicators of fraud. In Splunk, one can use the eventstats command to find the 90th percentile of any numerical variable across all events. In theory, if the variable’s value is above the 90th percentile, then it is an outlier. If there are two variables of interest, both above the 90th percentile of their respective datasets, then the case for declaring the event fraudulent gets stronger. For example, suppose we have transaction data containing the total amount transferred and the number of transfers done within a day. If a customer is above the 90th percentile for both the total amount transferred and the transfer frequency, then this looks like suspicious behavior and should be flagged as such. In SPL, this would look like the following.

index=transactions sourcetype=transfers | eventstats perc90(Amount) as percAmount perc90(Frequency) as percFrequency | where Amount>percAmount and Frequency>percFrequency | eval RiskScore_HighFreqAmount=30 | table Customer Amount Frequency percAmount percFrequency RiskScore_HighFreqAmount
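
For readers who want to try the rule outside Splunk, the same logic can be sketched in Python. The rows below are hypothetical, and the nearest-rank percentile helper is an assumption of this sketch; Splunk's perc90 may compute its cutoff differently.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical (customer, amount, frequency) rows, for illustration only.
transfers = [
    ("A", 120, 2), ("B", 300, 1), ("C", 80, 3), ("D", 4500, 4), ("E", 950, 2),
    ("F", 60, 5), ("G", 15000, 6), ("H", 700, 3), ("I", 22000, 9), ("J", 18000, 8),
]

perc_amount = percentile([amount for _, amount, _ in transfers], 90)
perc_frequency = percentile([freq for _, _, freq in transfers], 90)

# Keep only events above both cutoffs and attach the arbitrary risk score of 30.
flagged = [
    {"Customer": c, "Amount": a, "Frequency": f, "RiskScore_HighFreqAmount": 30}
    for c, a, f in transfers
    if a > perc_amount and f > perc_frequency
]
print(perc_amount, perc_frequency, flagged)
```

With this sample, only customer I exceeds both cutoffs, so only that row receives the risk score.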

The search ends by assigning a risk score that can be sent to a risk index for further processing. For the purposes of this example the risk score is arbitrary; in real life, your risk score is based on your own usage and normalization techniques, according to what suits your use case. In this example, all customers who are over the 90th percentile for both amount and frequency are flagged. Because the amount can be orders of magnitude larger than the frequency value, we can use the log function (base 10 by default in SPL) to bring the amount within the same magnitude as the frequency. There are other ways to do this, such as using a StandardScaler function in machine learning, but we will keep it simple by using log here.

index=transactions sourcetype=transfers | eval Amount=log(Amount) | eventstats perc90(Amount) as percAmount perc90(Frequency) as percFrequency | where Amount>percAmount and Frequency>percFrequency | eval RiskScore_HighFreqAmount=30 | table Customer Amount Frequency percAmount percFrequency RiskScore_HighFreqAmount
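
A quick way to check what the log transform does to the percentile comparison is to run it on a small sample. This Python sketch uses hypothetical amounts and a nearest-rank percentile helper, both assumptions made for illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical transfer amounts, for illustration only.
amounts = [60, 80, 120, 300, 700, 950, 4500, 15000, 18000, 22000]

raw_cutoff = percentile(amounts, 90)
log_amounts = [math.log10(a) for a in amounts]
log_cutoff = percentile(log_amounts, 90)

# Which positions exceed the cutoff before and after the transform?
above_raw = [a > raw_cutoff for a in amounts]
above_log = [x > log_cutoff for x in log_amounts]
print(above_raw == above_log)
```

Because log10 preserves the ordering of the amounts, the same positions sit above the cutoff in both versions.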

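Now, the Amount field is at a more reasonable magnitude, but the results are the same. This is expected: log is a monotonically increasing function, so it preserves the ordering of the amounts, and the same events land above the 90th percentile. Let us move on to what could be a better way to look for an outlier.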

Compute the Area

We know the area of a rectangle is length times width. What if we view our two variables as measurements, so that multiplying the two produces a virtual area? In this case, we multiply log(Amount) by Frequency. Among the events where both variables are above their 90th percentiles, the smallest product is the minimum virtual area that still counts as an outlier. Let’s continue with our example to get the minimum area that uses the 90th percentile of both variables.

index=transactions sourcetype=transfers | eval Amount=log(Amount) | eventstats perc90(Amount) as percAmount perc90(Frequency) as percFrequency | where Amount>percAmount and Frequency>percFrequency | eval area=Amount * Frequency | stats min(area) as min_area

Here the search returned a minimum area of 20.87. With this in mind, can we assume that any combination of Frequency * log(Amount) greater than 20.87 can be considered suspicious, because our training dataset told us this involved values from each variable above the 90th percentile? If so, we no longer need to use 90th percentiles to find possible fraud. We can now say that any virtual area less than 20.87 will be dismissed, but any combination of Frequency * log(Amount) greater than that number will lead to a risk score. This heuristic is different from the previous one, but it makes for an easier calculation, and with the log applied an extreme value in just one variable will not skew the results. Our new search will be as follows.

index=transactions sourcetype=transfers | eval Amount=log(Amount) | eval area=Amount * Frequency | where area > 20.87 | eval RiskScore_HighFreqAmount=30 | table Customer Amount Frequency area RiskScore_HighFreqAmount | sort - area

The first thing to notice is that the search is simpler: any event whose virtual area is greater than the threshold is considered suspicious. The next thing to notice is that one more customer (English) is listed here. Their original measurements were not both above the 90th percentile, but the combination of the two variables is enough to go over our threshold.

Why bother to do this? Think about the case where one variable is almost at the highest percentile while the other variable is at the 89th percentile. If we were simply checking whether both variables were above the 90th percentile, which in this case they are not, we would miss the potential outlier. However, the product of the two variables easily surpasses our threshold for a risk indicator. This allows us to be more flexible in discovering possible fraud, as one variable may be close to its threshold while another easily surpasses it.
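
The difference between the two rules is easier to see on a small example. The following Python sketch uses made-up customers and a nearest-rank percentile helper, both assumptions of this illustration rather than the dataset behind the searches above. It uses >= for the area test so that the event that set the threshold stays flagged, whereas the search above uses a strict >.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: 17 ordinary customers plus three notable ones.
rows = [(f"cust{i}", 100 + 50 * i, 1 + i % 4) for i in range(17)]
rows += [("P", 20000, 8), ("Q", 25000, 2), ("R", 15000, 10)]

# Log-scale the amount so it is comparable in magnitude to the frequency.
events = [(c, math.log10(a), f) for c, a, f in rows]
perc_amount = percentile([a for _, a, _ in events], 90)
perc_frequency = percentile([f for _, _, f in events], 90)

# Rule 1: both variables above their 90th percentiles.
both_high = [(c, a, f) for c, a, f in events if a > perc_amount and f > perc_frequency]
by_percentile = {c for c, _, _ in both_high}

# Rule 2: virtual area at or above the smallest area seen under rule 1.
min_area = min(a * f for _, a, f in both_high)
by_area = {c for c, a, f in events if a * f >= min_area}

print(sorted(by_percentile), round(min_area, 2), sorted(by_area))
```

Customer R sits just under the amount cutoff, so the percentile rule misses them, but their very high frequency pushes the virtual area over the threshold and the area rule picks them up.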

Visually, any rectangular shape can be used to compute the logical area to see if it is above a threshold.

Any combination that increases the area beyond our threshold will be given a risk score.

Using Cubic Volumes

What if we used three variables instead of two? We’ll use wire transfer data in the next example and add a suspicious country score. All countries will get an initial score of 1, and if they are suspicious, they will get a score of 2. A really suspicious country will get a score of 3. Our formula will now be amount * Frequency * suspicious, where amount is again log-scaled. This will compute a virtual cubic volume. If the virtual cubic volume is over a predefined threshold, the entity should get a risk score just as before. In the next SPL example, we artificially provide a suspicious risk rating using a case statement for each Country in the dataset, with a default rating of 1. (The assignment of 1, 2, and 3 could have been done in any way preferred by the writer of the SPL, as this example is just for illustration.)

index=transactions sourcetype=wire_transfer | eval amount=log(amount) | iplocation destIP | rename Country as DestCountry | eval suspicious=case(DestCountry=="United States", 1, DestCountry=="Ghana", 2, DestCountry=="North Korea", 3, 1=1, 1) | eval volume=amount * Frequency * suspicious | where volume > 40 | eval RiskScore_Volume=30 | table customer, suspicious, amount, Frequency, DestCountry, volume, RiskScore_Volume | sort - volume

The new things in this search are the use of iplocation to find the country name of the destination IP address, the use of the case function to assign scores to countries, and finally the use of a virtual volume to look for suspicious behavior. In deployment, instead of using a case function, it is better to use a lookup to determine the risk rating, as there are over 200 countries in the world. The use of 40 is an arbitrary predetermined threshold of fraud. We could have used the percentile method from above to show all events where Frequency>p_Frequency and suspicious>=p_suspicious and amount>p_amount to help compute the threshold, but my dataset was not aligned to the point where events had all 3 variables in the highest percentiles at the same time. That is why I arbitrarily chose 40 as the threshold after examining some of the highest values of these variables. This threshold should be fine-tuned over time with more training data. One could argue that using three variables instead of two can lead to fewer false positives, if a proper threshold is used. Visually, this method can be shown with these sample cubes.
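
The lookup idea can be sketched in Python with a dictionary standing in for the lookup table. The transfers, customer names, and ratings here are hypothetical, and unlisted countries fall back to a rating of 1 as described in the text.

```python
import math

# Hypothetical risk ratings; a real deployment would load these from a lookup.
COUNTRY_RISK = {"United States": 1, "Ghana": 2, "North Korea": 3}
VOLUME_THRESHOLD = 40  # the arbitrary threshold used in the search

def virtual_volume(amount, frequency, country):
    """log10(amount) * frequency * country rating, with a default rating of 1."""
    rating = COUNTRY_RISK.get(country, 1)
    return math.log10(amount) * frequency * rating

# Hypothetical (customer, amount, frequency, destination country) rows.
transfers = [
    ("alice", 5000, 4, "United States"),
    ("bob", 5000, 4, "North Korea"),
    ("carol", 90000, 9, "Canada"),
]

scored = [(c, virtual_volume(a, f, country)) for c, a, f, country in transfers]
flagged = sorted(
    [(c, v) for c, v in scored if v > VOLUME_THRESHOLD],
    key=lambda item: item[1],
    reverse=True,  # highest volume first, like the sort in the search
)
print(flagged)
```

Here alice and bob move the same amount at the same frequency, but only bob crosses the threshold because of the destination rating. Carol crosses it on amount and frequency alone.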

We can take this one step further and multiply four variables instead of three, as long as we have faith in the 4th variable to accurately indicate the probability of fraud. Although this cannot be visualized due to the limitations of three-dimensional space, mathematically, any number of dimensions can be used. Frankly, I think using more than four dimensions makes it difficult to tell whether any one variable is influencing the results without resorting to machine learning to figure out the correlation of each variable to all the others.
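
In code, moving to more dimensions is only a matter of multiplying more terms. A minimal Python sketch follows, with hypothetical values and function names of my own choosing, assuming each variable has already been scaled to a comparable magnitude.

```python
import math

def virtual_measure(values):
    """Product of N already-scaled variables: area for N=2, volume for N=3, and so on."""
    return math.prod(values)

def is_outlier(values, threshold):
    """True when the N-dimensional measure exceeds the chosen threshold."""
    return virtual_measure(values) > threshold

# Hypothetical four-variable event: log amount, frequency, country rating, fourth score.
event = [4.3, 4, 2, 1.5]
print(virtual_measure(event), is_outlier(event, threshold=40))
```

The threshold still has to be chosen per use case, and it generally needs revisiting whenever a variable is added, since each new factor rescales the product.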

Since multiple variables are being combined into a single number that is compared to a predetermined threshold, the question comes up: are risk scores even needed, or can a judgment of possible fraud be made on the spot? This depends on the efficacy of each variable for predicting fraud. A large training set of real-world tested data should be used to see at which point false positives or false negatives come into play too often. The best thing to do is to continue attaching risk scores to each result and adding them to a risk index for further summation to accurately predict fraud. After a few weeks of testing with real data, once it is understood what works and what does not, risk scores may be abandoned for some use cases where we are certain it is fraud, as a judgment of fraud can be made immediately from the multiplication of variables. For instance, if someone is withdrawing thousands of dollars from two different ATMs at the same time, we can be certain this is fraud.

Multiplying high percentile variables as indicators of fraud is an implicit way to calculate risk. In the traditional way, each variable would be part of a rule, and the rule would be given a risk score because of the outlier value of the variable. Then, all of the entity’s risk scores would be summed and compared to a threshold to determine fraud. In contrast, the virtual geometric area or volume method described here does the same thing by implicitly using the outlier values of the variables to ascertain risk all at once, making the approach simpler.
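
To make the contrast concrete, here is a small Python sketch that scores one hypothetical event both ways. The rule names, cutoffs, scores, and thresholds are all invented for illustration and are not taken from any real risk framework.

```python
import math

# Hypothetical per-rule cutoffs and risk scores, for illustration only.
RULES = [
    ("high_amount", lambda e: e["log_amount"] > 4.0, 30),
    ("high_frequency", lambda e: e["frequency"] > 5, 30),
    ("suspicious_country", lambda e: e["suspicious"] >= 2, 20),
]
RISK_THRESHOLD = 60
VOLUME_THRESHOLD = 40

def summed_risk(event):
    """Traditional approach: each matching rule adds its score to the entity's total."""
    return sum(score for _, matches, score in RULES if matches(event))

def virtual_volume(event):
    """Geometric approach: one product captures all three variables at once."""
    return event["log_amount"] * event["frequency"] * event["suspicious"]

event = {"log_amount": math.log10(20000), "frequency": 6, "suspicious": 2}
print(summed_risk(event) >= RISK_THRESHOLD, virtual_volume(event) > VOLUME_THRESHOLD)
```

Both approaches flag this event, but the summed version needs a cutoff and a score per rule, while the product needs only one threshold.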

Conclusion

Fraud detection is based on outliers, as they show a low probability of occurring. If multiple variables fall in very high percentiles at the same time, which should not happen in normal situations, then this could be an indication of fraud. Another way to calculate this is to multiply the influencing variables in the dataset by each other to form an N-dimensional shape. This allows an extremely high value of a variable to manifest itself in the resulting product, as its high value influences the final result. To keep any one variable’s high magnitude from dominating the final outcome, consider using the log function to reduce a high-magnitude variable to the same magnitude as the other variables. If the encompassing value of the N-dimensional shape is over a threshold value, then consider this event an outlier, as the entity (the customer in our case) is a candidate for fraud. A risk score can then be added to this set of actions.

In certain high-probability fraud use cases, we can abandon risk scores entirely if we are reasonably sure that the N variables used to create the encompassing shape are enough to predict fraud and that their high percentiles always lead to positive fraud detection. This should be done cautiously, with the use case in mind, after testing against real-world data that is already known to be fraudulent or non-fraudulent. However, to be conservative, it may make sense to use the logical geometric area or volume of multiple high percentile variables as an indicator of risk rather than a definite indicator of fraud, as we should have a plethora of tested use cases for positive fraud indicators before making absolute decisions.

Nimish Doshi

Nimish is Director, Technical Advisory for Industry Solutions, providing strategic, prescriptive, and technical perspectives to Splunk's largest customers, particularly in the Financial Services Industry. He has been an active author of Splunk blog entries and Splunkbase apps for a number of years.

The Geometry of Fraud Detection | Splunk (2024)
