Scalable Statistics with Large Datasets
This project was offered in the summer term 2020. You can no longer apply to it!
- Sponsored by: Amazon
- Project Lead: Dr. Ricardo Acevedo Cabra
- Scientific Lead: M.Sc. Aurelien Ouattara
- Term: Summer semester 2020
Results of this project are explained in detail in the final documentation and presentation.
In the Amazon EU Global Transportation Services team, we are improving customer experience and reducing costs by answering complex questions with enhanced analytical methods. Thanks to the computational developments and accessibility improvements made by Amazon Web Services, and through our efforts to move away from averages and look into more granular information, we now make extensive use of large and extra-large datasets (hundreds of millions of observations or more) to perform our analyses and develop more precise statistical models.
While dealing with big datasets has historically been the focus of machine learning methods, causal inference (which belongs to statistical and econometric methods) remains necessary for making business decisions. Today, the frameworks that have been optimized for handling big data (such as Spark or TensorFlow) usually support statistical models only in their ‘machine learning form’ (i.e. focusing on prediction and error rates) and lack econometrics- and statistics-specific outputs such as the covariance matrix, t-statistics, p-values, etc.
Our project therefore aims to fully utilize the big data handling capabilities of Spark and TensorFlow to run statistical analyses at scale. We will develop full econometrics/statistics packages in Spark and TensorFlow for different statistical methods, using state-of-the-art AWS architecture and technologies.
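To make the statistical outputs mentioned above concrete, here is a minimal sketch in plain Python of one common way to obtain them at scale. It is not the project's implementation. It assumes ordinary least squares, accumulates the sufficient statistics X'X, X'y and y'y chunk by chunk so that the full dataset never has to sit in memory, and then derives the coefficient covariance matrix, t-statistics and p-values. The function names (accumulate, invert, ols_summary) and the synthetic data are hypothetical, and the p-values use a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math
import random


def accumulate(chunks, k):
    """Sum X'X, X'y, y'y and the row count over chunks of (x_row, y) pairs."""
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    yty = 0.0
    n = 0
    for chunk in chunks:
        for x, y in chunk:
            for i in range(k):
                xty[i] += x[i] * y
                for j in range(k):
                    xtx[i][j] += x[i] * x[j]
            yty += y * y
            n += 1
    return xtx, xty, yty, n


def invert(a):
    """Gauss-Jordan inversion with partial pivoting (small k x k matrix).

    Assumes the matrix is non-singular; no rank check is done in this sketch.
    """
    k = len(a)
    m = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(k):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[k:] for row in m]


def ols_summary(xtx, xty, yty, n):
    """Return coefficients, covariance matrix, t-statistics and p-values."""
    k = len(xty)
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    # Residual sum of squares at the OLS solution: y'y - beta' X'y
    rss = yty - sum(b * v for b, v in zip(beta, xty))
    sigma2 = rss / (n - k)
    cov = [[sigma2 * inv[i][j] for j in range(k)] for i in range(k)]
    se = [math.sqrt(cov[i][i]) for i in range(k)]
    t_stats = [b / s for b, s in zip(beta, se)]
    # Two-sided p-value under a normal approximation (assumption: large n)
    p_values = [math.erfc(abs(t) / math.sqrt(2)) for t in t_stats]
    return beta, cov, t_stats, p_values


# Synthetic data for illustration only: y = 2 + 3 * x1 + noise
rng = random.Random(0)


def make_chunk(size):
    rows = []
    for _ in range(size):
        x1 = rng.gauss(0, 1)
        y = 2.0 + 3.0 * x1 + rng.gauss(0, 1)
        rows.append(([1.0, x1], y))  # leading 1.0 is the intercept column
    return rows


chunks = [make_chunk(500) for _ in range(4)]
xtx, xty, yty, n = accumulate(chunks, k=2)
beta, cov, t_stats, p_values = ols_summary(xtx, xty, yty, n)
print("coefficients:", beta)
print("t-statistics:", t_stats)
print("p-values:", p_values)
```

Because the accumulated quantities are plain sums, the per-chunk results can be computed independently and added together, which is what makes this formulation a natural fit for distributed engines. For very large or badly scaled data, computing the residual sum of squares as y'y - beta'X'y can lose precision, so a production version would likely need a more careful numerical scheme.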