Data Science with Spark
This training teaches you, as a data scientist, to use Spark for data science and to apply machine learning models. You will learn how to use Spark in combination with Python and Jupyter notebooks, and how to run it from the command line, notebooks, and scripts. The training also covers working with DataFrames, applying machine learning algorithms in Spark, and using Pipelines and cross validation.
Audience Profile: Data Science with Spark Training
You will benefit from the Data Science with Spark Training if:
You are interested in learning what Spark is and how to use it to do data science and apply machine learning models. This training is also for you if you want to learn more about H2O, GraphX, and other libraries built on top of Spark.
Achievements upon completion
Through instructor-led discussion and interactive, hands-on exercises, you will master the tools that Spark offers to perform large-scale data science.
You will learn:
- What Spark is
- How to use it in combination with Python and Jupyter notebooks
- What RDDs are and how to use them
- How stages and tasks influence your Spark jobs
- Which data format works best with Spark
- What DataFrames are and how to use them
- How to convert between pandas and Spark DataFrames
- How to use Spark's built-in machine learning libraries for regression, classification, clustering, and ALS (alternating least squares)
- How to use MLlib Pipelines and how to do cross validation
- The differences between the R, Python, and Scala libraries
You will have hands-on experience in:
- Using Spark from the command line, notebooks, and from scripts
- Loading and saving RDDs using Parquet
You will have the skills to:
- Work with DataFrames
- Apply machine learning algorithms in Spark
- Use Pipelines and cross validation
- Use GraphX
Please note that you need to bring your own laptop for this training.
This laptop should meet the following requirements:
- At least 8GB RAM
- 25GB of free hard disk space
- An accessible USB port