Well, there are hundreds of blogs on this topic; this is a quick-reference cheat sheet for my day-to-day work needs, consolidated from different sources, so it will keep getting updated as I come across new material that aids my work. For those who want an understanding of Spark internals, hit this link.

Cheat Sheet Depicting Deployment Modes and Where Each Spark Component Runs

Spark Apps, Jobs, Stages and Tasks

The anatomy of a Spark application usually comprises Spark operations, which can be either transformations or actions on your data sets, using Spark's RDDs and DataFrames.
Spark RDD Cheat Sheet

Spark operators are either lazy transformations (which produce new RDDs) or actions (which trigger the computation).

Import/export:
- myRDD = sc.textFile(f) — read file f into an RDD
- myRDD.saveAsTextFile(f) — store the RDD in file f
- myRDD = sc.parallelize(l) — turn list l into an RDD

Transformations are only computed when an action requires a result to be returned to the driver program. With these two types of RDD operations, Spark can run more efficiently: a dataset created through a map operation will be used in a subsequent reduce operation, and only the result of the last reduce is returned to the driver.
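As a minimal PySpark sketch of the lazy transformation vs. action behaviour described above (assuming a local SparkContext):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-cheatsheet-demo")

# parallelize: turn a local list into an RDD (nothing is computed yet)
numbers = sc.parallelize([1, 2, 3, 4, 5])

# map is a lazy transformation: it only records the lineage
squares = numbers.map(lambda x: x * x)

# reduce is an action: it triggers the actual computation and
# returns only the final result to the driver
total = squares.reduce(lambda a, b: a + b)
print(total)  # 55

sc.stop()
```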
This post updates a previous, very popular post, 50+ Data Science, Machine Learning Cheat Sheets, by Bhavya Geethika. If we missed some popular cheat sheets, add them in the comments below.
Cheat sheets on Python, R, NumPy, SciPy, and Pandas
Data science is a multi-disciplinary field. Thus, there are thousands of packages and hundreds of programming functions out there in the data science world! An aspiring data enthusiast need not know them all. A cheat sheet or reference card is a compilation of commonly used commands that helps you pick up a language's syntax faster. Here are the most important ones, brainstormed and captured in a few compact pages.
Mastering data science involves an understanding of statistics and mathematics, programming knowledge (especially in R, Python, and SQL), and the ability to combine all of these with business understanding and human instinct to derive the insights that drive decisions.
Here are the cheat sheets by category:
Cheat sheets for Python:
Python is a popular choice for beginners, yet still powerful enough to back some of the world's most popular products and applications. Its design makes the programming experience feel almost as natural as writing in English. The Python basics and Python Debugger cheat sheets for beginners cover the important syntax needed to get started. Community-provided libraries such as NumPy, SciPy, scikit-learn, and pandas are heavily relied on, and the NumPy/SciPy/Pandas Cheat Sheet provides a quick refresher on these.
- Python Cheat Sheet by DaveChild via cheatography.com
- Python Basics Reference sheet via cogsci.rpi.edu
- OverAPI.com Python cheatsheet
- Python 3 Cheat Sheet by Laurent Pointal
Cheat sheets for R:
R's ecosystem has been expanding so much that a lot of referencing is needed. The R Reference Card covers most of the R world in a few pages. RStudio has also published a series of cheat sheets to make things easier for the R community. The data visualization with ggplot2 sheet seems to be a favorite, as it helps when you are creating graphs of your results.
At cran.r-project.org:
At Rstudio.com:
- R markdown cheatsheet, part 2
Others:
- DataCamp’s Data Analysis the data.table way
Cheat sheets for MySQL & SQL:
For a data scientist, the basics of SQL are as important as those of any other language. Both Pig and the Hive Query Language are closely related to SQL, the original Structured Query Language. SQL cheat sheets provide a five-minute quick guide to learning it, and then you may explore Hive and MySQL!
- SQL for dummies cheat sheet
Cheat sheets for Spark, Scala, Java:
Apache Spark is an engine for large-scale data processing. For certain applications, such as iterative machine learning, Spark can be up to 100x faster than Hadoop (using MapReduce). The essentials of Apache Spark cheatsheet explains its place in the big data ecosystem, walks through setup and creation of a basic Spark application, and explains commonly used actions and operations.
- Dzone.com’s Apache Spark reference card
- DZone.com’s Scala reference card
- Openkd.info’s Scala on Spark cheat sheet
- Java cheat sheet at MIT.edu
- Cheat Sheets for Java at Princeton.edu
Cheat sheets for Hadoop & Hive:
Hadoop emerged as an untraditional tool to solve what was thought to be unsolvable, by providing an open source software framework for the parallel processing of massive amounts of data. Explore the Hadoop cheat sheets to find useful commands for working with Hadoop on the command line. A sheet combining SQL and Hive functions is another one to check out.
Cheat sheets for web application framework Django:
Django is a free and open source web application framework written in Python. If you are new to Django, you can go over these cheat sheets to review the key concepts quickly and then dive deeper into each one.
- Django cheat sheet part 1, part 2, part 3, part 4
Cheat sheets for Machine learning:
We often find ourselves spending time wondering which algorithm is best, and then going back to our big books for reference. These cheat sheets give an idea of both the nature of your data and the problem you're working to address, and then suggest an algorithm for you to try.
- Machine Learning cheat sheet at scikit-learn.org
- Scikit-Learn Cheat Sheet: Python Machine Learning from yhat (added by GP)
- Patterns for Predictive Learning cheat sheet at Dzone.com
- Equations and tricks Machine Learning cheat sheet at Github.com
- Supervised learning superstitions cheatsheet at Github.com
Cheat sheets for Matlab/Octave
MATLAB (MATrix LABoratory) was developed by MathWorks in 1984 and has been the most popular language for numeric computation in academia. It is suitable for tackling basically every science and engineering task, with several highly optimized toolboxes. MATLAB is not open source; however, GNU Octave is a free re-implementation that follows the same syntactic rules, so most code is compatible with MATLAB.
Cheat sheets for Cross Reference between languages
(Part 1) Cluster Mode
This post covers cluster mode specific settings, for client mode specific settings, see Part 2.
The Problem
One morning, while doing some back-of-an-envelope calculations, I discovered that we could lower our AWS costs by using clusters of fewer, more powerful machines.
More cores, more memory, lower costs – it’s not every day a win/win/win comes along.
As you might expect, there was a catch. We had been using the AWS maximizeResourceAllocation setting to automatically set the size of our Spark executors and driver.
maximizeResourceAllocation allocates an entire node and its resources for the Spark driver. This worked well for us before. Our previous cluster of 10 nodes had been divided into 9 executors and 1 driver. 90% of our resources were processing data while 10% were dedicated to the various housekeeping tasks the driver performs.
However, allocating an entire node to the driver with our new cluster design wasted resources egregiously. A full 33% of our resources were devoted to the driver, leaving only 67% for processing data. Needless to say, our driver was significantly over-allocated.
Clearly, maximizeResourceAllocation wasn’t going to work for our new cluster. We were going to have to roll up our sleeves and manually configure our Spark jobs. Like any developer, I consulted the sacred texts (Google, Stack Overflow, Spark Docs). Helpful information abounded, but most of it was overly general. I had difficulty finding definite answers as to what settings I should choose.
The Solution
While calculating the specifics for our setup, I knew that the cluster specs might change again in the future. I wanted to build a spreadsheet that would make this process less painful. With a generous amount of guidance gleaned from this Cloudera blogpost, How to Tune Your Apache Spark Jobs Part 2, I built the following spreadsheet:
If you would like an easy way to calculate the optimal settings for your Spark cluster, download the spreadsheet from the link above. Below, I've listed the fields in the spreadsheet and detailed how each is intended to be used.
A couple of quick caveats:
- The generated configs are optimized for running Spark jobs in cluster deploy-mode
- The generated configs result in the driver being allocated as many resources as a single executor.
Configurable Fields
The fields shown above are configurable. The green-shaded fields should be changed to match your cluster's specs. It is not recommended that you change the yellow-shaded fields, but some use cases might require customization. More information about the default, recommended values for the yellow-shaded fields can be found in the Cloudera post.
Number of Nodes
The number of worker machines in your cluster. This can be as low as one machine.
Memory Per Node (GB)
The amount of RAM per node that is available for Spark’s use. If using Yarn, this will be the amount of RAM per machine managed by Yarn Resource Manager.
Cores Per Node
The number of cores per node that are available for Spark’s use. If using Yarn, this will be the number of cores per machine managed by Yarn Resource Manager.
Memory Overhead Coefficient
Recommended value: .1
The percentage of memory in each executor that will be reserved for spark.yarn.executor.memoryOverhead.
Executor Memory Upper Bound (GB)
Recommended value: 64
The upper bound for executor memory. Each executor runs in its own JVM; above roughly 64 GB of memory per JVM, garbage collection issues can cause slowness.
Executor Core Upper Bound
Recommended value: 5
The upper bound for cores per executor. More than 5 cores per executor can degrade HDFS I/O throughput. I believe this value can safely be increased if writing to a web-based "file system" such as S3, but significant increases to this limit are not recommended.
OS Reserved Cores
Recommended value: 1
Cores per machine to reserve for OS processes. Should be zero if only a percentage of the machine’s cores were made available to Spark (i.e. entered in the Cores Per Node field above).
OS Reserved Memory (GB)
Recommended value: 1
The amount of RAM per machine to reserve for OS processes. Should be zero if only a percentage of the machine’s RAM was made available to Spark (i.e. entered in the Memory Per Node field above).
Parallelism Per Core
Recommended value: 2
The level of parallelism per allocated core. This field is used to determine the spark.default.parallelism setting: the generally recommended parallelism is double the total number of allocated cores, hence the default of 2 here.
Note: Cores Per Node and Memory Per Node could also be used to optimize Spark for local mode. If your local machine has 8 cores and 16 GB of RAM and you want to allocate 75% of your resources to running a Spark job, setting Cores Per Node and Memory Per Node to 6 and 12 respectively will give you optimal settings. You would also want to zero out the OS Reserved settings. If Spark is limited to using only a portion of your system, there is no need to set aside resources specifically for the OS.
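As a sketch of that local-mode setup in PySpark, using the 6-core, 12 GB figures above (note that spark.driver.memory set this way only takes effect if the JVM has not already started, e.g. when launching with plain python rather than through spark-submit):

```python
from pyspark.sql import SparkSession

# Local mode: use 6 of the machine's 8 cores and 12 of its 16 GB of RAM.
spark = (
    SparkSession.builder
    .master("local[6]")                         # Cores Per Node = 6
    .config("spark.driver.memory", "12g")       # Memory Per Node = 12 GB
    .config("spark.default.parallelism", "12")  # 6 cores * Parallelism Per Core (2)
    .appName("local-tuning-example")
    .getOrCreate()
)
```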
Reference Table
Once the configurable fields on the left-hand side of the spreadsheet have been set to the desired values, the resultant cluster configuration will be reflected in the reference table.
There is some degree of subjectivity in selecting the Executors Per Node setting that will work best for your use case, so I elected to use a reference table rather than selecting the number automatically.
A good rule of thumb for selecting the optimal number of Executors Per Node is to pick the setting that minimizes Unused Memory Per Node and Unused Cores Per Node while keeping Total Memory Per Executor below the Executor Memory Upper Bound and Cores Per Executor below the Executor Core Upper Bound; a rough code sketch of this arithmetic follows the worked example below.
For example, take the reference table shown above:
- Executors Per Node: 1
  - Unused Memory Per Node: 0
  - Unused Cores Per Node: 0
  - Warning: Total Memory Per Executor exceeds the Executor Memory Upper Bound
  - Warning: Cores Per Executor exceeds the Executor Core Upper Bound
  - (That row has been greyed out since it has exceeded one of the upper bounds.)
- Executors Per Node: 5
  - Unused Memory Per Node: 0
  - Unused Cores Per Node: 1
  - Warning: Cores Per Executor exceeds the Executor Core Upper Bound
- Executors Per Node: 6
  - Unused Memory Per Node: 1
  - Unused Cores Per Node: 1
  - Total Memory Per Executor and Cores Per Executor are both below their respective upper bounds.
- Executors Per Node: all others
  - Either exceed the Executor Memory Upper Bound, exceed the Executor Core Upper Bound, or waste more resources than Executors Per Node = 6.
Executors Per Node = 6 is the optimal setting.
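The same rule of thumb can be expressed as a rough calculation. The Python sketch below is my reading of the spreadsheet's reference-table arithmetic, not its actual implementation; the cluster specs are placeholders you would replace with your own.

```python
# Placeholder cluster specs (replace with the green-shaded fields for your cluster).
MEMORY_PER_NODE_GB = 112
CORES_PER_NODE = 32
OS_RESERVED_MEMORY_GB = 1
OS_RESERVED_CORES = 1
EXECUTOR_MEMORY_UPPER_BOUND_GB = 64
EXECUTOR_CORE_UPPER_BOUND = 5

# Resources actually available to Spark on each node.
usable_memory = MEMORY_PER_NODE_GB - OS_RESERVED_MEMORY_GB
usable_cores = CORES_PER_NODE - OS_RESERVED_CORES

# Build one "reference table" row per candidate Executors Per Node value.
for executors_per_node in range(1, usable_cores + 1):
    cores_per_executor = usable_cores // executors_per_node
    total_memory_per_executor = usable_memory // executors_per_node
    unused_cores = usable_cores - cores_per_executor * executors_per_node
    unused_memory = usable_memory - total_memory_per_executor * executors_per_node

    warnings = []
    if total_memory_per_executor > EXECUTOR_MEMORY_UPPER_BOUND_GB:
        warnings.append("exceeds Executor Memory Upper Bound")
    if cores_per_executor > EXECUTOR_CORE_UPPER_BOUND:
        warnings.append("exceeds Executor Core Upper Bound")

    print(f"executors/node={executors_per_node}: "
          f"{cores_per_executor} cores, {total_memory_per_executor} GB per executor, "
          f"unused cores={unused_cores}, unused memory={unused_memory} GB "
          f"{warnings if warnings else ''}")
```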
Spark Configs
Now that we have selected an optimal number of Executors Per Node, we are ready to generate the Spark configs with which we will run our job. We enter the optimal number of executors in the Selected Executors Per Node field. The correct settings will be generated automatically.
spark.executor.instances
(Number of Nodes * Selected Executors Per Node) - 1
This is the number of total executors in your cluster. We subtract one to account for the driver. The driver will consume as many resources as we are allocating to an individual executor on one, and only one, of our nodes.
spark.yarn.executor.memoryOverhead
Equal to Overhead Memory Per Executor
The memory to be allocated for the memoryOverhead of each executor, in MB. Calculated from the values from the row in the reference table that corresponds to our Selected Executors Per Node.
spark.executor.memory
Equal to Memory Per Executor
The memory to be allocated for each executor. Calculated from the values from the row in the reference table that corresponds to our Selected Executors Per Node.
spark.yarn.driver.memoryOverhead
Equal to spark.yarn.executor.memoryOverhead
The memory to be allocated for the memoryOverhead of the driver, in MB.
spark.driver.memory
Equal to spark.executor.memory
The memory to be allocated for the driver.
spark.executor.cores
Equal to Cores Per Executor
The number of cores allocated for each executor. Calculated from the values from the row in the reference table that corresponds to our Selected Executors Per Node.
spark.driver.cores
Equal to spark.executor.cores
The number of cores allocated for the driver.
spark.default.parallelism
spark.executor.instances * spark.executor.cores * Parallelism Per Core
Default parallelism for Spark RDDs, Dataframes, etc.
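Putting the formulas above together, here is a minimal Python sketch of how the configs could be derived from the selected reference-table row. The cluster numbers are placeholders, and the heap/overhead split assumes the Memory Overhead Coefficient described earlier; the spreadsheet's exact rounding may differ.

```python
# Placeholder inputs (taken from the configurable fields and the selected
# reference-table row; replace with your own values).
number_of_nodes = 3
selected_executors_per_node = 6
cores_per_executor = 2
total_memory_per_executor_gb = 7
memory_overhead_coefficient = 0.1
parallelism_per_core = 2

# Split each executor's memory into memoryOverhead and heap (values in MB).
overhead_mb = int(total_memory_per_executor_gb * 1024 * memory_overhead_coefficient)
heap_mb = total_memory_per_executor_gb * 1024 - overhead_mb

# One executor slot is given up to the driver.
executor_instances = number_of_nodes * selected_executors_per_node - 1

spark_configs = {
    "spark.executor.instances": executor_instances,
    "spark.yarn.executor.memoryOverhead": overhead_mb,
    "spark.executor.memory": f"{heap_mb}M",
    "spark.yarn.driver.memoryOverhead": overhead_mb,
    "spark.driver.memory": f"{heap_mb}M",
    "spark.executor.cores": cores_per_executor,
    "spark.driver.cores": cores_per_executor,
    "spark.default.parallelism": executor_instances * cores_per_executor * parallelism_per_core,
}

for key, value in spark_configs.items():
    print(f"{key}={value}")
```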
Using Configs
Now that we have the proper numbers for our configs, using them is fairly simple. Below, I’ve demonstrated 3 different ways the configs might be used:
Add to spark-defaults.conf
Note: Will be used for submitted jobs unless overwritten by spark-submit args
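For illustration, the entries in spark-defaults.conf are plain key/value pairs; the numbers below are the placeholder values from the sketch above, not recommendations:

```
spark.executor.instances             17
spark.yarn.executor.memoryOverhead   716
spark.executor.memory                6452M
spark.yarn.driver.memoryOverhead     716
spark.driver.memory                  6452M
spark.executor.cores                 2
spark.driver.cores                   2
spark.default.parallelism            68
```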
Pass as software settings to an AWS EMR Cluster
Note: Will be added to spark-defaults.conf
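On EMR, the equivalent software settings are supplied as a JSON configuration object using the spark-defaults classification (again with the placeholder values from above):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.instances": "17",
      "spark.executor.memory": "6452M",
      "spark.executor.cores": "2",
      "spark.driver.memory": "6452M",
      "spark.driver.cores": "2",
      "spark.default.parallelism": "68"
    }
  }
]
```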