Statistics on Big Data with AWS

  • Sponsored by:
  • Project Lead: Dr. Ricardo Acevedo Cabra
  • Scientific Lead: M.Sc. Aurelien Ouattara
  • Term: Summer semester 2019

In the Amazon EU Supply Chain Data Science team, we improve customer experience and reduce costs by answering complex questions with enhanced analytical methods. With the computational advances and accessibility improvements made by Amazon Web Services, and through our efforts to move away from averages toward more granular information, we now extensively use large and extra-large datasets (hundreds of millions of observations or more) to perform our analyses and develop more precise statistical models.

While dealing with big datasets has historically been the focus of machine learning methods, causal inference (which lies within statistical and econometric methods) remains necessary for making business decisions. Today, the frameworks optimized for handling big data (such as Spark or TensorFlow) usually support statistical models only in their 'machine learning form' (i.e. focusing on prediction and error rates) and lack econometrics/statistics-specific outputs such as the covariance matrix, t-statistics, and p-values.
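To make the gap concrete, the quantities mentioned above follow directly from the normal equations of ordinary least squares. The sketch below (illustrative only, using NumPy on simulated data rather than the project's distributed setting) shows the algebra a statistics package must expose beyond point predictions; the project would implement the same computations on distributed matrices in Spark or TensorFlow.

```python
import numpy as np

# Simulated data: n observations, k regressors (first column is an intercept).
rng = np.random.default_rng(0)
n, k = 1000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(size=n)

# OLS via the normal equations.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y          # coefficient estimates

# The econometrics-specific outputs typically missing from ML interfaces:
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - k)      # unbiased residual variance
cov_beta = sigma2 * XtX_inv           # covariance matrix of the estimates
se = np.sqrt(np.diag(cov_beta))       # standard errors
t_stats = beta_hat / se               # t-statistics (p-values then follow
                                      # from the t-distribution with n-k df)
```

A prediction-focused API stops at `beta_hat`; the covariance matrix and t-statistics are what allow analysts to judge whether an estimated effect is statistically distinguishable from zero.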

The goal of this project is to fully exploit the big-data capabilities of Spark and TensorFlow to run statistical analyses at scale. We will develop full econometrics/statistics packages in Spark and TensorFlow for different statistical methods, using state-of-the-art AWS architecture and technologies.