
Spark Distributed Analytic Framework

Description and Overview

Apache Spark™ is a fast and general engine for large-scale data processing.

How to Use Spark

Follow the steps below to use Spark; note that the order of the commands matters. DO NOT load the spark module until you are inside a batch job.

Interactive mode

Submit an interactive batch job: 

qsub -I -q ccm_int -l walltime=00:30:00 -l mppwidth=48

Once you are inside the job, do the following (make sure to include the -V option):

module load spark
ccmlogin -V

Now you will land on a CCM compute node. You can start Spark with the standard startup command:

start-all.sh

You may want to wait for the Spark cluster to start up and reach a usable state before submitting a job (or starting a shell). The following command waits until 100% of the workers have come online, giving up after a maximum of 200 seconds. Note that in large jobs (hundreds of nodes), some workers may never come online due to system variations, so you are advised to use a percentage of less than 95% when submitting a large job (for example, waitforstartup.sh 90 200).

waitforstartup.sh 100 200

To connect to the Scala Spark Shell, do:

spark-shell --master $SPARKURL

To connect to the Python Spark Shell, do:

pyspark --master $SPARKURL
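
Inside the PySpark shell the SparkContext is already available as sc. As a quick sanity check that the cluster is working, you can sum the squares of some numbers across the workers (a minimal example, not part of the module):

>>> rdd = sc.parallelize(range(100))
>>> rdd.map(lambda x: x * x).sum()
328350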

To shut down the Spark cluster, do:

stop-all.sh

Batch mode

Submit the following batch script, changing the number of cores, walltime, and queue as needed. Use ccm_queue only for long runs; use ccm_int for debug runs:

#!/bin/bash -l

#this file is run.pbs

#PBS -q ccm_queue
#PBS -l mppwidth=48
#PBS -l walltime=00:30:00
#PBS -e mysparkjob_$PBS_JOBID.err
#PBS -o mysparkjob_$PBS_JOBID.out

cd $PBS_O_WORKDIR

module load spark

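# save the batch job's environment so that SSH sessions on the CCM nodes can pick it up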
env | tee $HOME/.ssh/environment

ccmrun sh $PWD/runspark.sh

This assumes you have the following runspark.sh file:

#!/bin/bash -l

start-all.sh

#wait for 100% of all workers to come online, give up after 200 seconds
waitforstartup.sh 100 200
RESULT=$?

if [ "$RESULT" == "0" ]
then
spark-submit --master $SPARKURL ./testspark.py
else
echo "ERROR: Spark didn't start correctly, giving up for now, please try again!"
fi

# No need to run stop-all.sh, since all processes will be killed at job exit.

To submit the job, do:

qsub run.pbs
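
The runspark.sh script above submits an application called testspark.py via spark-submit. The contents shown here are an illustrative sketch of what a minimal Spark 1.x Python application could look like, not a file shipped with the module:

#!/usr/bin/env python
# testspark.py - minimal PySpark application (illustrative example)
from pyspark import SparkContext

sc = SparkContext(appName="testspark")

# distribute some numbers across the workers and sum their squares
rdd = sc.parallelize(range(1000))
total = rdd.map(lambda x: x * x).sum()
print("sum of squares: %d" % total)

sc.stop()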

Further Information                                     

Official documentation is available from the Apache Spark web page (https://spark.apache.org).

 

Availability

Spark is available on Edison in CCM mode. Any NERSC user with access to Edison can use Spark.

Package  Platform  Category                Version        Module               Install Date  Date Made Default
Spark    edison    applications/debugging  1.0.0          spark/1.0.0          2014-07-10
Spark    edison    applications/debugging  1.0.2          spark/1.0.2          2014-08-16    2014-08-16
Spark    edison    applications/debugging  1.1.0          spark/1.1.0          2014-09-21    2014-10-31
Spark    edison    applications/debugging  1.1.0-shm      spark/1.1.0-shm      2014-10-28
Spark    edison    applications/debugging  1.2.1          spark/1.2.1          2015-02-17    2015-02-18
Spark    edison    applications/debugging  1.2.1-breeze   spark/1.2.1-breeze   2015-03-13
Spark    edison    applications/debugging  1.2.1-scratch  spark/1.2.1-scratch  2015-03-31    2015-04-02
Spark    edison    applications/debugging  1.3.1          spark/1.3.1          2015-04-27    2015-05-15
Spark    edison    applications/debugging  scratch        spark/scratch        2015-03-30

All entries are described as "Spark data analytic framework."