  • Global training info
  • Category Data Science
  • Duration 3 Days
  • Time 09:00 - 17:00
  • Lunch Included

Data Science with Spark

Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. This three-day training, taught in Dutch and English, empowers data scientists to use Spark from the command line, from notebooks, and from scripts, in combination with Python and Jupyter notebooks. Through instructor-led discussion and interactive, hands-on exercises, you will master the tools that Spark offers to perform large-scale data science.

Audience Profile Data Science with Spark

You will benefit from Data Science with Spark training if:

  • you are interested in learning about Spark and how to use it
  • you perform data science and want to apply machine learning models
  • you want to learn more about H2O, GraphX, and other libraries built on top of Spark

Achievements Upon Completion

Benefit

Master the tools that Spark offers to perform large-scale data science.

You will learn:

  • all about Spark and the capabilities it offers
  • how to use Spark in combination with Python and Jupyter notebooks
  • the definition of RDDs and how to use them
  • how stages and tasks influence your Spark jobs
  • the best data format to use with Spark
  • the definition of data frames and how to use them
  • how to convert between pandas and Spark data frames
  • how to use Spark's built-in machine-learning libraries to do regression, classification, clustering, and collaborative filtering with ALS (alternating least squares)
  • how to use MLlib Pipelines
  • how to do cross validation
  • the differences between the R, Python, and Scala libraries

You will gain hands-on experience in:

  • using Spark from the command line, from notebooks, and from scripts
  • loading and saving RDDs using Parquet
  • MLlib
  • H2O

You will develop skills to:

  • work with data frames
  • apply machine-learning algorithms in Spark
  • use pipelines and cross-validation
  • use GraphX

Additional Information

Requirements

Bring your own laptop to this training. It must meet the following requirements:

  • at least 8 GB of RAM
  • 25 GB of free hard disk space
  • an accessible USB port

http://training.xebia.com/data-science/data-science-with-spark