Book Description
Big Data is here to stay, as more and more companies see the value of storing the data they generate, whether internally or externally. But as with every new technology, adopting it is not enough if no value is extracted from it. Analyzing these datasets is a fundamental step in unlocking the value held in data. In this process, Python has become the most widely used programming language for processing and analyzing data, thanks to its ease of use, rich ecosystem, and powerful libraries, and its adoption is still growing.
This course begins with an introduction to data manipulation in Python using Pandas, generating statistics, metrics, and plots. The next step is to perform the same analysis distributed across several computers, using Dask. Data aggregation for plots when the data does not all fit into memory will also be addressed. For truly large problems and datasets, an introduction to Hadoop (HDFS and YARN) will be presented. The rest of the course focuses on Spark and its interaction with the tools presented earlier.
By the end of the course, students will be able to bootstrap their own Python environment, read large files and process more data than fits into memory, connect to Hadoop systems and manipulate data stored there, and generate statistics, metrics, and graphs that represent the information in the dataset.
This approach differs from the more common approaches to Big Data problems, which typically rely on MapReduce or SQL-over-HDFS tools such as Hive or Impala. Instead, the course builds from the small, single-machine case up to the distributed one, exploiting the similar interfaces shared across the presented stack to make the final goal easier to understand and reach.
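As a minimal sketch of the "similar interfaces" idea, the same groupby-and-aggregate expression written for Pandas carries over to Dask nearly unchanged (the example data here is invented for illustration; the Dask portion is shown as comments since it requires the `dask` package):

```python
import pandas as pd

# Small in-memory example data (hypothetical, for illustration only)
df = pd.DataFrame({"city": ["A", "A", "B"], "sales": [10, 20, 30]})

# Pandas: eager computation on a single machine
means = df.groupby("city")["sales"].mean()
print(means)

# Dask: near-identical syntax, but lazy and distributable;
# the result is only materialized when .compute() is called.
# import dask.dataframe as dd
# ddf = dd.from_pandas(df, npartitions=2)
# print(ddf.groupby("city")["sales"].mean().compute())
```

This shared API surface is what lets the course scale the same analysis from a laptop to a cluster without rewriting it from scratch.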