Machine Learning for Fraud Detection - Proposal
- Name: Saurabh Batra
- Mattermost nick: saurabh
- Web Page: http://saurabhbatra96.github.io/
- Resume: http://saurabhbatra96.github.io/public/cv.pdf
- Location: India
- Typical working hours: 12 PM - 10 PM UTC+5:30
The project aims to build a new open-source fraud detection system. The 2 major steps involved are:
- experimenting with various anomaly detection techniques (see the ML section at the end) to figure out which one provides the required balance of precision (% of detected frauds which are actually fraudulent) and recall (% of all frauds detected);
- providing the technique as an independent web service (like https://www.mediawiki.org/wiki/ORES) which can handle requests to assess the authenticity of transactions.
- The web service uses the feedback from its decisions (new correct detection/wrong detection corrected by a human) to train the underlying model, improving its accuracy in the future.
- Use something like LIME (https://github.com/marcotcr/lime) to provide a justification as to why our classifier chose to mark a transaction as fraud.
- CiviCRM extension to interface directly with the web service.
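As a minimal sketch of the two metrics defined above, the following computes precision and recall from binary labels. The labels are made up for illustration (1 means fraud) and the function name `precision_recall` is hypothetical, not part of any existing codebase.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # frauds correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal transactions flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # frauds that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 3 real frauds, 3 transactions flagged, 2 of them correctly.
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the detection threshold would typically trade recall for precision, which is the balance the experimentation phase is meant to explore.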
Previous experience: I worked with Eileen for about a year in 2016, which included a GSoC project for CiviCRM, and I have discussed this proposal with Adam.
Possible Mentor(s): Eileen McNaughton, Adam Wight
I’m going to divide the work into 2 major phases:
Experimentation phase (May - mid June)
The experimentation phase will mainly consist of trying out the proposed techniques on the current dataset and comparing how they perform against each other and against the current fraud detection system. Tentative tasks include:
- (Week 1) Dataset procurement and cleaning
- (Week 1-2) Reading up on feature selection and applying it to the dataset
- (Week 2-5) Reading up on anomaly detection techniques and applying them; comparing precision and recall scores; deciding on the best technique for the web service
Architectural phase (June - August)
The architectural phase involves integrating the best-performing technique with a web service. Tentative tasks include:
- (Week 6) API design for the web service
- (Week 6-7) Setting up the bare-bones architecture for the web service
- (Week 7-8) Implementing the API (or at least the important parts of it)
- (Week 9-10) Integrating the API into the WMF transaction workflow
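To make the request and feedback flow concrete, here is a rough in-process sketch of what the service interface might look like. Everything in it is an assumption for illustration: the class name `FraudScoringService`, the method names, the response fields, and especially the amount-based placeholder rule, which only stands in for whichever model the experimentation phase selects. The actual API is to be designed in Week 6.

```python
from dataclasses import dataclass, field


@dataclass
class FraudScoringService:
    """Hypothetical stand-in for the proposed web service (not a real API)."""
    threshold: float = 0.5
    feedback: list = field(default_factory=list)

    def score(self, transaction: dict) -> float:
        # Placeholder rule only; a real service would call the trained model here.
        return 0.9 if transaction.get("amount", 0) > 1000 else 0.1

    def check(self, transaction: dict) -> dict:
        # Answer a request about a single transaction.
        s = self.score(transaction)
        return {"score": s, "is_fraud": s >= self.threshold}

    def record_feedback(self, transaction: dict, is_fraud: bool) -> None:
        # Keep human-reviewed labels so the model can be retrained later.
        self.feedback.append((transaction, is_fraud))


service = FraudScoringService()
result = service.check({"amount": 5000})
# A reviewer decides the flag was wrong; the correction is stored for retraining.
service.record_feedback({"amount": 5000}, is_fraud=False)
```

Keeping scoring and feedback as separate calls would let the CiviCRM extension query the service synchronously while corrections arrive later from human review.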
I'm currently a final-year B.Tech. student in Computer Science & Engineering at IIT Guwahati, India. I started contributing to CiviCRM in 2015 and ended up doing a GSoC project with Eileen in 2016. This project will be my top priority during my summer break, as I don't have any other pressing commitments during that time.
For the past year I've been working on a thesis project on data science and information retrieval which involves machine learning techniques similar to the ones I want to use here. In addition to that I have considerable experience working with open source organizations - I was an active contributor to CiviCRM and a GSoC participant back in 2016. Also, I'm comfortable adapting to new tech stacks and getting "code-ready" in a short period of time thanks to my internship at Google in 2017.
Machine Learning Techniques for Anomaly Detection
- Autoencoders: Autoencoders are neural nets that try to learn the underlying patterns in data in an unsupervised way. Outliers to these patterns are detected as anomalies. More details: https://shiring.github.io/machine_learning/2017/05/01/fraud.
- Logistic Regression: Logistic regression fits a model that describes the relationship between a binary dependent variable (fraud/not fraud) and a set of independent variables (features). It outputs a fraud probability for each transaction, and transactions whose probability exceeds a chosen threshold are flagged as fraud.
- Supervised Learning using Classifiers: The problem with using supervised learning is class imbalance: if, for example, an SVM guessed that transactions were never fraudulent, it would’ve been correct ~99.6% of the time on WMF’s transactions from 2017 while detecting no fraud at all. A workaround is to under-sample normal transactions so that frauds are not overwhelmingly outnumbered by normal transactions in the training set. An ensemble of classifiers (something which combines the outputs of multiple classifiers and then classifies the transaction as fraud/not fraud) should work even better than a single classifier.
- An interesting one (just read the dataset description and conclusions if you don’t want to go through the entirety of it): http://www.wipro.com/documents/comparative-analysis-of-machine-learning-techniques-for-detecting-insurance-claims-fraud.pdf
- Radar is a proprietary software that does exactly what we’re trying to achieve: https://stripe.com/radar
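The class-imbalance problem and the under-sampling workaround described in the supervised learning item can be sketched as follows. The data is synthetic (4 frauds in 1000 transactions, chosen only to mirror the ~99.6% figure), and the helper `undersample` and its `ratio` parameter are hypothetical names introduced for this example.

```python
import random


def undersample(rows, labels, ratio=1.0, seed=0):
    """Keep every fraud row and a random subset of normal rows.

    ratio is the number of normal rows kept per fraud row.
    """
    rng = random.Random(seed)
    fraud = [(r, l) for r, l in zip(rows, labels) if l == 1]
    normal = [(r, l) for r, l in zip(rows, labels) if l == 0]
    keep = min(len(normal), int(len(fraud) * ratio))
    sampled = fraud + rng.sample(normal, keep)
    rng.shuffle(sampled)
    return [r for r, _ in sampled], [l for _, l in sampled]


# Synthetic data: 0.4% fraud, so "never fraud" is right 99.6% of the time.
labels = [1] * 4 + [0] * 996
rows = list(range(len(labels)))

never_fraud = [0] * len(labels)
accuracy = sum(p == t for p, t in zip(never_fraud, labels)) / len(labels)
frauds_caught = sum(1 for p, t in zip(never_fraud, labels) if p == 1 and t == 1)

balanced_rows, balanced_labels = undersample(rows, labels, ratio=1.0)
print(accuracy, frauds_caught, len(balanced_labels), sum(balanced_labels))
```

The baseline scores high on accuracy yet catches no fraud, which is why precision and recall are the metrics of interest. Under-sampling should be applied to the training split only, so that evaluation still reflects the real class ratio.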