Course Goals
The main goal of this course is to provide students with a unique opportunity to acquire the conceptual background and mathematical tools applicable to Big Data Analytics and Real Time Computation. The course will briefly review specific challenges of Big Data Analytics, such as extracting, unifying, updating, and merging information, and the specific demands of processing such data, which must be highly parallel and distributed. With these features in mind, we will then study more closely a number of mathematical tools for Big Data Analytics: regression analysis, linear estimation, calibration problems, and real-time processing of incoming (potentially infinite) data. We will see how these approaches can be transformed to conform to the demands of Big Data. We will also discuss why most widely used algorithmic languages are not well suited to solving such problems, and outline alternative approaches.
Course Ideas
Within traditional approaches to information processing we have to collect all the data into one array and apply a processing algorithm to it. If the raw data is distributed among many sites and its total volume is large, this immediately leads to technical problems: all the raw data must be moved to and stored at a single location, and the memory and computing power needed grow with the total volume of the data.
The course shows that instead of collecting all the raw data together and processing it all at once, we can naturally split the whole process into simple, highly independent pieces. Specifically:
- extract a piece of information in canonical form from each fragment of raw data;
- combine any two pieces of canonical information into one;
- obtain the final result from the single piece that remains.
A minimal sketch of these three steps appears below.
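For instance, here is a minimal Python sketch of the three steps for the simplest possible task, computing a mean. The names extract, combine, and finalize are purely illustrative, chosen for this example rather than taken from any particular framework:

    from functools import reduce

    # Illustrative: the "canonical information" for a mean is the
    # fixed-size pair (count, sum), extracted independently from
    # each fragment of raw data.

    def extract(fragment):
        """Step 1: turn a fragment of raw data into canonical form."""
        return (len(fragment), sum(fragment))

    def combine(a, b):
        """Step 2: merge two pieces of canonical information into one."""
        return (a[0] + b[0], a[1] + b[1])

    def finalize(piece):
        """Step 3: obtain the final result from the last remaining piece."""
        count, total = piece
        return total / count

    fragments = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]  # raw data spread over "sites"
    pieces = [extract(f) for f in fragments]          # each done independently
    print(finalize(reduce(combine, pieces)))          # 3.5

Because combine is associative, the pieces may be merged in any order, which is what makes the steps independent of one another.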
It turns out that information in canonical form often has a fixed size, which does not depend on the amount of raw information used to produce it. As a result, none of the separate steps of extracting canonical information from raw data, combining it, and obtaining the final result requires excessive amounts of memory or computing power. Once two pieces of canonical information are combined into one, the originals can be immediately discarded. Extracting and combining pieces of canonical information can be performed on different computers without any need for synchronization. This provides a wide range of natural options for massively parallel, distributed computing.
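To make the fixed-size claim concrete, the sketch below (our own illustration, assuming the same extract/combine/finalize naming as above) fits a simple linear regression y ≈ a + b·x. The canonical information is the five-number tuple of sufficient statistics (n, Σx, Σy, Σx², Σxy), whose size never depends on how much raw data produced it, and extraction of the pieces is farmed out to separate processes:

    from concurrent.futures import ProcessPoolExecutor
    from functools import reduce

    def extract(chunk):
        """Canonical information: fixed-size sufficient statistics."""
        n = len(chunk)
        sx = sum(x for x, _ in chunk)
        sy = sum(y for _, y in chunk)
        sxx = sum(x * x for x, _ in chunk)
        sxy = sum(x * y for x, y in chunk)
        return (n, sx, sy, sxx, sxy)

    def combine(a, b):
        # Associative and commutative: pieces may be merged in any
        # order, on any machine, with no synchronization.
        return tuple(u + v for u, v in zip(a, b))

    def finalize(stats):
        """Recover the regression coefficients from the statistics."""
        n, sx, sy, sxx, sxy = stats
        b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        a = (sy - b * sx) / n
        return a, b

    if __name__ == "__main__":
        chunks = [[(0.0, 1.0), (1.0, 3.0)],
                  [(2.0, 5.0)],
                  [(3.0, 7.0), (4.0, 9.0)]]
        with ProcessPoolExecutor() as pool:       # extraction runs in parallel
            pieces = list(pool.map(extract, chunks))
        print(finalize(reduce(combine, pieces)))  # (1.0, 2.0): y = 1 + 2*x

However many data points a chunk contains, its canonical piece is always five numbers, so the combining stage never needs more memory or bandwidth than the number of chunks dictates.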