
Spark Distributed Analytic Framework

Description and Overview

Apache Spark™ is a fast and general engine for large-scale data processing.

Availability

Spark is available on Edison in CCM (Cluster Compatibility Mode). Any NERSC user with access to Edison can use Spark.

Package  Platform  Category                Version    Module           Install Date  Date Made Default
Spark    edison    applications/debugging  1.0.0      spark/1.0.0      2014-07-10
Spark    edison    applications/debugging  1.0.2      spark/1.0.2      2014-08-16    2014-08-16
Spark    edison    applications/debugging  1.1.0      spark/1.1.0      2014-09-21    2014-10-31
Spark    edison    applications/debugging  1.1.0-shm  spark/1.1.0-shm  2014-10-28

All versions: Spark data analytic framework.

How to Use Spark

Follow the steps below to use Spark; note that the order of the commands matters. DO NOT load the spark module until you are inside a batch job.

Interactive mode

Submit an interactive batch job (mppwidth=48 requests 48 cores, i.e. two 24-core Edison nodes):

qsub -I -q ccm_int -l walltime=00:30:00 -l mppwidth=48

Once you are inside the job, run the following (make sure to include the -V option, which carries your current environment, including the loaded spark module, onto the CCM node):

module load spark
ccmlogin -V

You will now land on a CCM compute node. Start the Spark standalone cluster with the usual Spark script:

start-all.sh

To connect to the Scala Spark shell, run:

spark-shell --master $SPARKURL

To connect to the Python Spark shell, run:

pyspark --master $SPARKURL
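
Once the shell is up, a quick computation confirms that work is being distributed to the cluster. This is a minimal sketch; sc is the SparkContext that pyspark creates for you:

# inside the pyspark shell; sc is the preconfigured SparkContext
rdd = sc.parallelize(range(1000))
print(rdd.map(lambda x: x * x).sum())   # sum of squares 0..999 = 332833500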

To shut down the Spark cluster, run:

stop-all.sh

Batch mode

Submit the following batch script, changing the number of cores, walltime, and queue as needed. Use ccm_queue only for long runs; use ccm_int for debug runs:

#!/bin/bash -l

# this file is run.pbs

#PBS -q ccm_queue
#PBS -l mppwidth=48
#PBS -l walltime=00:30:00
#PBS -e mysparkjob_$PBS_JOBID.err
#PBS -o mysparkjob_$PBS_JOBID.out

cd $PBS_O_WORKDIR

# set up modules: unload ALTD library tracking, load CCM support and Spark
module unload altd
module load ccm
module load spark/1.1.0

# put Spark's local scratch (shuffle) space on the in-memory file system
export SPARK_LOCAL_DIRS=/dev/shm

# make the current environment available to the SSH sessions that
# Spark's start scripts use to launch its daemons
env > ~/.ssh/environment

ccmrun sh $PWD/runspark.sh

This assumes you have the following runspark.sh file (a minimal testspark.py is sketched after the submission command below):

#!/bin/bash -l

# start the standalone Spark cluster, run the application, then shut down
start-all.sh
spark-submit --master $SPARKURL ./testspark.py
stop-all.sh

To submit the job, run:

qsub run.pbs
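
The script above submits testspark.py, which stands in for your own application. A minimal sketch of such a file (the name and contents are illustrative, not a NERSC-provided example):

# testspark.py -- illustrative placeholder application
from pyspark import SparkContext

sc = SparkContext(appName="testspark")

# count even and odd integers in a small distributed dataset
counts = (sc.parallelize(range(100000))
            .map(lambda x: (x % 2, 1))
            .reduceByKey(lambda a, b: a + b)
            .collect())
print(counts)

sc.stop()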

Further Information               

Official documentation is available from the Apache Spark web page.