Building a statistical analysis tech stack

Kamil Krzyk · AzimoLabs · Jun 19, 2018

In “A data-driven approach to finding the truth” we introduced our statistical analysis project, Delivery Times Intelligence (DTI). In this article we’ll take a deep dive into the tech stack used to power DTI.

Azimo sends money from people in Europe to their loved ones in more than 190 countries. The goal of this project was to provide customers with accurate predictions of when their money will arrive.

Because there are so many variables in our product (sending country, receiving country, sending currency, receiving currency, delivery method and local banking systems to name just a few), it was necessary to do substantial exploratory data analysis before moving on to building software.

We use Jupyter Notebook for data analysis, an excellent browser-based tool designed for data science. Jupyter’s block structure enables code execution and data visualisation in the form of text or graphs, making prototyping lightweight and fast.

Additionally, blocks are capable of rendering Markdown and HTML, which enables us to add documentation and comments to our research as we work. Jupyter’s base language is Python, which is the most common language used in data science and machine learning today.

It is for this reason that we made Python the core language of our project. By following basic programmatic principles during the data exploration process, it’s possible to then move that code into reusable and testable classes, and to store the code in one place. The code can then be shared between our project and our Jupyter Notebook analysis without the need for code translation.
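As a minimal sketch of that idea (the class, module and column names below are hypothetical, not Azimo’s actual code), exploration code promoted into a reusable, testable class can be imported both by the production pipeline and by a Jupyter Notebook cell:

```python
# delivery_stats.py - hypothetical shared module, importable from both
# the production pipeline and a Jupyter Notebook without code translation.
import pandas as pd


class DeliveryTimeStats:
    """Reusable wrapper around statistics first prototyped in a notebook."""

    def __init__(self, transfers: pd.DataFrame):
        # Expects a "corridor" label and a delivery duration in hours per row.
        self.transfers = transfers

    def median_delivery_hours(self, corridor: str) -> float:
        """Median delivery time for one sending/receiving corridor."""
        subset = self.transfers[self.transfers["corridor"] == corridor]
        return subset["delivery_hours"].median()


if __name__ == "__main__":
    # The same call works identically inside a notebook cell.
    df = pd.DataFrame(
        {"corridor": ["PL->PH", "PL->PH", "UK->IN"], "delivery_hours": [4.0, 6.0, 12.0]}
    )
    print(DeliveryTimeStats(df).median_delivery_hours("PL->PH"))
```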

Our Python project consists of independent scripts that can be treated as atomic blocks. When assembled, they form pipelines which are used by three modules to carry out different tasks. Modules can be launched once or periodically as a service.
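A rough sketch of this structure, using made-up step names, shows how atomic scripts can be assembled into a pipeline that a module runs once or on a schedule:

```python
# pipeline.py - hypothetical sketch of atomic steps assembled into a pipeline.
from typing import Callable, Iterable

import pandas as pd


def load_transfers() -> pd.DataFrame:
    """Atomic step: load raw transfer data (stubbed here)."""
    return pd.DataFrame({"delivery_hours": [4.0, 6.0, None, 48.0]})


def drop_incomplete(df: pd.DataFrame) -> pd.DataFrame:
    """Atomic step: keep only rows with a known delivery time."""
    return df.dropna(subset=["delivery_hours"])


def run_pipeline(steps: Iterable[Callable]) -> pd.DataFrame:
    """Assemble atomic steps: the first step produces data, the rest transform it."""
    steps = list(steps)
    data = steps[0]()
    for step in steps[1:]:
        data = step(data)
    return data


if __name__ == "__main__":
    print(run_pipeline([load_transfers, drop_incomplete]))
```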

The code is dockerized and the images are stored in the GitLab registry. From there, the code can be built manually on Jenkins and then deployed to HashiCorp Nomad, which hosts it as a microservice on Azimo’s AWS infrastructure, according to the schedule specified by a developer. The service runs on the rkt (Rocket) container runtime and gets its credentials from HashiCorp Vault.

I will summarise each of the three modules below:

The calculation module is the cornerstone of the system. At the beginning of each day, it crunches the delivery time data of completed historical transfers in order to produce delivery time estimates for the current day. The majority of these operations are based on statistical analysis and good old-fashioned maths, though some simple unsupervised machine learning algorithms help to clean the data and pick the right samples for our calculations.
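The article does not name the algorithms involved, but as a hedged illustration of “simple unsupervised machine learning to clean the data”, one could drop outlying delivery times with scikit-learn’s IsolationForest before applying plain statistics:

```python
# Hypothetical sketch: filter outlying delivery times before computing stats.
# The actual algorithms and data schema used by DTI are not specified here.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

transfers = pd.DataFrame({"delivery_hours": [4, 5, 6, 5, 4, 72, 5, 6, 90, 5]})

# Unsupervised outlier detection: -1 marks anomalies, 1 marks inliers.
labels = IsolationForest(contamination=0.2, random_state=0).fit_predict(
    transfers[["delivery_hours"]]
)
clean = transfers[labels == 1]

# "Good old-fashioned maths": a simple percentile estimate on the cleaned data.
print(np.percentile(clean["delivery_hours"], 90))
```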

Delivery times are estimated for the most popular payout countries and aggregated by the sending country of the client, the time the transaction was created, the delivery method, payment time and all permutations of the above. Results go through a selection process and are accessible to the mobile and web applications via a Firebase API endpoint.
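To make the “all permutations” aggregation concrete, here is a minimal sketch using illustrative column names (not Azimo’s schema) that computes one estimate per combination of dimensions:

```python
# Hypothetical sketch: estimate delivery times for every permutation of
# aggregation dimensions. Column names and the estimator are illustrative.
from itertools import combinations

import pandas as pd

transfers = pd.DataFrame({
    "sending_country": ["PL", "PL", "UK", "UK"],
    "delivery_method": ["bank", "cash", "bank", "bank"],
    "delivery_hours": [5.0, 2.0, 12.0, 10.0],
})

dimensions = ["sending_country", "delivery_method"]
estimates = {}
for r in range(1, len(dimensions) + 1):
    for dims in combinations(dimensions, r):
        # 90th-percentile delivery time per group, as one possible estimator.
        estimates[dims] = transfers.groupby(list(dims))["delivery_hours"].quantile(0.9)

for dims, series in estimates.items():
    print(dims, series.to_dict())
```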

Additionally, the module calculates numerical thresholds for transfers that are judged to be taking too long. Those values are stored in Amazon’s Relational Database Service. Transaction IDs that are flagged as anomalies are streamed to a Kafka topic to be used by other systems. The performance of all corridors can be monitored thanks to automatically generated HTML reports, created out of Jupyter Notebooks and collected in Google Cloud Storage.
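As a sketch of the reporting step only (bucket and file names are made up, and credentials are assumed to come from the environment), a notebook can be rendered to HTML with nbconvert and uploaded to Google Cloud Storage like this:

```python
# Hypothetical sketch: render an already-executed corridor report notebook to
# HTML and upload it to Google Cloud Storage. Names are illustrative.
from google.cloud import storage
from nbconvert import HTMLExporter

# Convert the notebook into a standalone HTML page.
html_body, _ = HTMLExporter().from_filename("corridor_report.ipynb")

client = storage.Client()  # credentials provided by the environment
blob = client.bucket("dti-reports").blob("reports/corridor_report.html")
blob.upload_from_string(html_body, content_type="text/html")
```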

The hosting module regularly checks for changes to Google Cloud Storage and builds a simple data dashboard (hosted on the NGINX web server) out of the HTML reports it finds. The dashboard is protected by a VPN, which allows anyone who is interested to analyse the research and daily results remotely and securely.
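A minimal sketch of that loop, assuming an illustrative bucket name and NGINX document root, could poll the bucket and regenerate a static index page for NGINX to serve:

```python
# Hypothetical sketch: poll a GCS bucket for HTML reports and rebuild a static
# index page served by NGINX. Bucket name, prefix and paths are illustrative.
import time

from google.cloud import storage

NGINX_ROOT = "/usr/share/nginx/html"


def rebuild_dashboard(bucket_name: str = "dti-reports") -> None:
    client = storage.Client()
    blobs = list(client.list_blobs(bucket_name, prefix="reports/"))
    names = [b.name.split("/")[-1] for b in blobs]
    links = "\n".join(f'<li><a href="{n}">{n}</a></li>' for n in names)
    with open(f"{NGINX_ROOT}/index.html", "w") as f:
        f.write(f"<html><body><ul>{links}</ul></body></html>")
    for blob, name in zip(blobs, names):
        blob.download_to_filename(f"{NGINX_ROOT}/{name}")


if __name__ == "__main__":
    while True:  # periodic check for new or changed reports
        rebuild_dashboard()
        time.sleep(3600)
```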

The monitoring module focuses only on transfers that are currently in progress. It consumes data in real time from a Kafka topic, which is responsible for streaming transaction events. Transfers taking too long are flagged as anomalies. Remember the anomaly threshold values we mentioned earlier? The monitoring module calls the calculation module for those values in order to judge which transfers should be flagged.

As the topic only sends information about a transfer when its status changes, state is saved locally and synchronised with Amazon’s Relational Database Service so that it can be recovered if the service dies. Decisions are streamed back to a different topic. From there, they can be retrieved by other systems responsible for alerts, real-time analysis and visualisation.
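Here is a hedged sketch of that monitoring loop. The topic names, message fields and the in-memory threshold lookup are all illustrative; in DTI the thresholds come from the calculation module and state is synchronised with RDS:

```python
# Hypothetical sketch: consume transfer events, compare elapsed time against
# per-corridor thresholds, and publish anomaly decisions to another topic.
import json
import time

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "transaction-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

# Locally cached state; periodic synchronisation with RDS is not shown here.
open_transfers = {}
THRESHOLD_HOURS = {"PL->PH": 24, "UK->IN": 48}  # in DTI, fetched from the calculation module

for event in consumer:
    transfer = event.value  # assumed fields: id, corridor, created_at (epoch seconds)
    open_transfers[transfer["id"]] = transfer
    elapsed_hours = (time.time() - transfer["created_at"]) / 3600
    if elapsed_hours > THRESHOLD_HOURS.get(transfer["corridor"], 48):
        # Flag the transfer and stream the decision to a separate topic.
        producer.send("transfer-anomalies", {"id": transfer["id"], "anomaly": True})
```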

Thanks for reading this sneak peek at our DTI technology stack, a rapidly evolving project in which many changes are already planned. In our next post about DTI, we’ll be explaining exactly how the delivery times calculation works from a mathematical perspective. Stay tuned! 🤓

Towards financial services available to all

We’re working throughout the company to create faster, cheaper, and more available financial services all over the world, and here are some of the techniques that we’re utilizing. There’s still a long way ahead of us, and if you’d like to be part of that journey, check out our careers page.

