NERSC: Powering Scientific Discovery Since 1974

Spark Distributed Analytic Framework

Description and Overview

Apache Spark™ is a fast and general engine for large-scale data processing.

Availability

Spark is available on Edison in CCM mode. Any NERSC user with access to Edison can use Spark.

Package  Platform  Category                 Version  Module       Install Date  Date Made Default
Spark    edison    applications/ debugging  1.0.0    spark/1.0.0  2014-07-10

Spark data analytic framework

How to Use Spark

Follow the steps below to use Spark. The order of the commands matters: DO NOT load the spark module until you are inside a batch job.

Interactive mode

Submit an interactive batch job: 

qsub -I -q ccm_int -l walltime=00:30:00 -l mppwidth=48

Once you are inside the job, run the following commands. Make sure to pass the -V option to ccmlogin:

module load spark
ccmlogin -V

You are now on a CCM compute node. Start Spark with the standard Spark command:

start-all.sh

To connect to the Scala Spark Shell, do:

spark-shell --master $SPARKURL

To connect to the Python Spark Shell, do:

pyspark --master $SPARKURL
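
Both shell commands above depend on $SPARKURL being set. The page does not say where this variable gets defined, so the following is only a minimal sketch of a guard that fails early when it is missing. The function name check_spark_url is hypothetical and is not part of the spark module.

```shell
# Hypothetical helper, not provided by the spark module.
# Assumption: SPARKURL should already be present in the environment
# by the time you connect a shell to the cluster.
check_spark_url() {
    if [ -z "${SPARKURL:-}" ]; then
        # Report on stderr and signal failure to the caller.
        echo "SPARKURL is not set; is the spark module loaded?" >&2
        return 1
    fi
    echo "Spark master: ${SPARKURL}"
}
```

It could be chained in front of either shell, for example: check_spark_url && spark-shell --master "$SPARKURL"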

To shut down the Spark cluster, do:

stop-all.sh
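
The start, connect, and shutdown steps above can be sketched as a single wrapper, so that the cluster is stopped even if the command run against it fails. This is an illustration only: run_spark_session is a hypothetical name, and the sketch assumes start-all.sh and stop-all.sh are on the PATH after the spark module has been loaded.

```shell
#!/bin/bash
# Hypothetical wrapper, not provided by the spark module.
# Assumption: start-all.sh and stop-all.sh are on the PATH.
run_spark_session() {
    local status=0
    # Start the Spark cluster; give up if it does not come up.
    start-all.sh || return 1
    # Run the caller's command (for example a Spark shell) and keep its exit status.
    "$@" || status=$?
    # Always shut the cluster down, whatever the command returned.
    stop-all.sh
    return "$status"
}
```

For example: run_spark_session spark-shell --master "$SPARKURL"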

 

Further Information       

Official documentation is available from the Apache Spark web page.