Book Description
Big Data is here to stay, as more and more companies see the value of storing the data they generate, whether internally or externally. But as with every new technology, adopting it is not enough if no value is extracted from it. Analyzing these datasets is a fundamental step in unlocking the value held in data. In this process, Python has become the most widely used programming language for processing and analyzing data, thanks to its ease of use, rich ecosystem, and powerful libraries, and its adoption is still growing.
This course begins with an introduction to data manipulation in Python using Pandas, generating statistics, metrics, and plots. The next step is to perform the same analysis distributed across several computers, using Dask. Data aggregation for plots when the data does not all fit into memory will also be addressed. For truly large problems and datasets, an introduction to Hadoop (HDFS and YARN) will be presented. The rest of the course focuses on Spark and its interaction with the tools presented earlier.
By the end of the course, students will be able to bootstrap their own Python environment, read large files and process more data than fits into memory, connect to Hadoop systems and manipulate data stored there, and generate statistics, metrics, and graphs that represent the information in the dataset.
This approach differs from the more common approaches to Big Data problems, which typically rely on MapReduce or SQL-over-HDFS tools such as Hive or Impala. Instead, the course builds from the small, single-machine case up to the distributed one, exploiting the similar interfaces shared across the presented stack to make the final goal easier to understand and reach.
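As a minimal sketch of the "similar interfaces" idea, the same groupby-and-aggregate expression written for Pandas carries over to Dask nearly unchanged (the example data here is invented for illustration; the Dask portion is shown as comments since it requires the `dask` package):

```python
import pandas as pd

# Small in-memory example data (hypothetical, for illustration only)
df = pd.DataFrame({"city": ["A", "A", "B"], "sales": [10, 20, 30]})

# Pandas: eager computation on a single machine
means = df.groupby("city")["sales"].mean()
print(means)

# Dask: near-identical syntax, but lazy and distributable;
# the result is only materialized when .compute() is called.
# import dask.dataframe as dd
# ddf = dd.from_pandas(df, npartitions=2)
# print(ddf.groupby("city")["sales"].mean().compute())
```

This shared API surface is what lets the course scale the same analysis from a laptop to a cluster without rewriting it from scratch.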