Apache Spark is a powerful open-source processing engine built around speed, ease of use, and advanced analytics. This three-day training, taught in English, empowers data scientists to use Spark with Python. Through instructor-led discussion and interactive, hands-on exercises, you will master the tools that Spark offers to perform large-scale data science.
Q: Is the Data Science with Spark training right for me?
- Yes - if you are interested in learning about Spark and how to use it
- Yes - if you are a data science practitioner and want to apply your skills at scale
- Yes - if you want to learn how to use Spark’s machine learning and streaming capabilities
Q: What will I achieve by completing this training?
Master the tools that Spark offers to perform large-scale data science.
You will learn:
- All about Spark and the capabilities it offers
- How to use Spark in combination with Python and Jupyter notebooks
- How to optimize your Spark jobs by understanding stages and tasks
- Which data formats to use with Spark
- What Spark DataFrames are and how you should use them
- How to convert between pandas and Spark DataFrames
- How to use Spark's built-in machine-learning libraries for regression, classification, and recommendation
- How to use Spark Structured Streaming
You will gain hands-on experience in:
- Using Spark from the command line, notebooks, and from scripts
- Loading and saving DataFrames using CSV, Parquet and Apache Hive
- Machine Learning with Spark
- Spark Streaming
You will develop the skills to:
- Work with Spark DataFrames
- Apply machine-learning algorithms in Spark
- Use streaming algorithms
Q: What else should I know?
- You should know the basics of programming in Python
- You should be familiar with the basics of data manipulation and SQL
You will need to bring your own laptop to this training; it should meet the following requirements:
- 8GB RAM minimum
- 25GB of free hard disk space
- SSH client installed
- Ability to install software