Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. This three-day training, taught in English, empowers data scientists to use Spark with Python from the command line, from scripts, and from Jupyter notebooks.
Through instructor-led discussion and interactive, hands-on exercises, you will master the tools that Spark offers to perform large-scale data science.
Q: Is Data Science with Spark training right for me?
- Yes - if you are interested in learning about Spark and how to use it
- Yes - if you perform data science and want to apply machine learning models
Q: What will I achieve by completing this training?
Master the tools that Spark offers to perform large-scale data science.
You will learn:
- What Spark is and the capabilities it offers
- How to use Spark in combination with Python and Jupyter notebooks
- How stages and tasks influence your Spark jobs
- The best data format to use with Spark
- What data frames are and how to use them
- How to convert between pandas and Spark data frames
- How to use Spark's built-in machine-learning libraries for regression, classification, clustering, and recommendation with ALS (alternating least squares)
- How to use Spark Structured Streaming
You will gain hands-on experience in:
- Using Spark from the command line, from notebooks, and from scripts
- Loading and saving DataFrames using Parquet
- Machine learning with Spark
- Spark Structured Streaming
You will develop skills to:
- Work with data frames
- Apply machine-learning algorithms in Spark
- Use pipelines and cross-validation
Q: What else should I know?
You will need to bring your own laptop to this training. It must meet the following requirements:
- At least 8 GB of RAM
- 25 GB of free hard disk space
- SSH client installed
- Ability to install software