
Data Competition

The data competition will be held on the morning of Wednesday, September 20 (9 am - noon), during the NERSC Data Day and NUG 2017.

Participants can pick from one of the two available data analysis challenges: NERSC Job Analysis or Astronomy Catalogs. Participants may start working on the challenge problems up to two weeks before the data competition day on Wednesday, September 20, when each participant will be given up to 5 minutes to present a few slides about their analysis technologies and results. A winner for each challenge will be picked by the NERSC Data Competition Committee based on the following criteria:

  • How well do you use the NERSC Data Analytics Tools stack? (You are encouraged to explore the tools described at https://www.nersc.gov/users/data-analytics/data-analytics-2)
  • What are the scalability and time to solution of the analysis method?
  • What is the accuracy of the analysis method?
  • What is the quality of visualization? 

Please contact Helen He (yhe@lbl.gov) or Debbie Bard (djbard@lbl.gov) to indicate your interest in participating in the data competition, and we will add you to a Slack channel for further communication and to answer any questions you may have.

Data Challenge #1: NERSC Job Analysis 

Archived job information provides abundant opportunities to analyze running-job patterns, system utilization, batch scheduler characteristics, job submission strategies, and more.

For this challenge, participants are given the historical Slurm job information on Cori and Edison from 1/1/2016 to 8/31/2017, stored in CSV format. You are free to analyze both Cori and Edison, or just one system. The data sets and a brief README file are available at /global/project/projectdirs/mpccc/data_competition.
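As a minimal sketch of getting started (the file name and column names such as Submit, Start, and End are assumptions; check the README in the data directory for the actual headers), the CSV can be loaded with pandas and queue wait times derived like this:

    import pandas as pd

    # Hypothetical file and column names -- check the README in
    # /global/project/projectdirs/mpccc/data_competition for the real headers.
    jobs = pd.read_csv("cori_jobs_2016-2017.csv",
                       parse_dates=["Submit", "Start", "End"])

    # Derived quantities: queue wait and actual run time, in hours.
    jobs["wait_hours"] = (jobs["Start"] - jobs["Submit"]).dt.total_seconds() / 3600.0
    jobs["run_hours"]  = (jobs["End"]   - jobs["Start"]).dt.total_seconds() / 3600.0

    # Drop obviously bad records (missing timestamps, negative durations).
    jobs = jobs.dropna(subset=["wait_hours", "run_hours"])
    jobs = jobs[(jobs["wait_hours"] >= 0) & (jobs["run_hours"] >= 0)]

    print(jobs[["wait_hours", "run_hours"]].describe())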

You are free to work on any one or more of the following open questions using Data Analytics and Machine Learning tools or to come up with your own creative questions and answers!

  • Baseline running jobs statistics (impressive visualizations are encouraged!)
  • What are the best times of day and week to submit a job? (Define and tell us the job size of particular interest to you; a sketch of one possible approach follows this list.)
  • What are the best size and length of jobs to submit? Is it better to bundle small jobs into one large job or to submit many small jobs?
  • How often do users overestimate wall time requests?
  • Is there a single job characteristic that minimizes queue wait time?
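For instance, the submission-time and wall-time questions above could be approached roughly as follows, continuing from the loading sketch and again using hypothetical column names (Submit, Timelimit) that should be checked against the README:

    # Continuing from the loading sketch above (hypothetical column names).
    jobs["submit_hour"] = jobs["Submit"].dt.hour
    jobs["submit_day"]  = jobs["Submit"].dt.day_name()

    # Median queue wait by hour of day and by day of week.
    wait_by_hour = jobs.groupby("submit_hour")["wait_hours"].median()
    wait_by_day  = jobs.groupby("submit_day")["wait_hours"].median()
    print(wait_by_hour.sort_values().head())
    print(wait_by_day.sort_values())

    # Wall-time overestimation: requested limit vs. actual run time.
    # "Timelimit" is assumed here to be the requested wall time in minutes.
    jobs["overestimate"] = (jobs["Timelimit"] / 60.0) / jobs["run_hours"].clip(lower=1e-3)
    print((jobs["overestimate"] > 2).mean(), "of jobs requested more than 2x the time they used")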

For reference, some existing NERSC job analysis results are available at:

  • Queues: https://my.nersc.gov/queues.php?machine=cori&full_name=Cori
  • Completed Jobs: https://my.nersc.gov/completedjobs.php
  • Job Size Chart: https://my.nersc.gov/jobsize.php
  • Queue Backlog: https://my.nersc.gov/backlog.php
  • Queue Wait Times: http://www.nersc.gov/users/computational-systems/queues/queue-wait-times/
  • Job Completion: https://my.nersc.gov/jobcompletion.php

Please contact Helen He (yhe@lbl.gov) to indicate your interest in working on this challenge and with any questions.

Data Challenge #2: Astronomy Catalogs

[Images: "This is a galaxy" (galaxy.png), "This is a star" (star.png), "But what is this?" (what.png)]


Astronomical images provide some of the richest (and most beautiful!) image data available to scientists. But identifying what we’re looking at in these images poses a real challenge for astronomers. Distant galaxies and faint stars both appear in these images as a handful of bright pixels against a noisy background, and astronomers have spent decades developing methods to distinguish between them. The best methods rely on measures of object size, shape and color. In this challenge, we supply a catalog of measurements made by the Sloan Digital Sky Survey from over 1 million astronomical objects observed over the ten-year telescope survey, and we ask you to develop a method to distinguish between stars and galaxies (and other object types) using a machine learning method.

For more background, please see the description of this (private) Kaggle competition:

https://inclass.kaggle.com/c/galaxy-star-separation

A “test” dataset will be provided on the day; your classifier will be evaluated on this testing set. Note that we are interested in how well optimised your *training* code is, not your classifier! Think about how to visualize your results: which variables are most important? Can you show a ROC curve?

You can download the csv file of the data here (updated Sept 15th; note that the previous version of this training data did not have quasars classified separately, they were included in the "other" category. If you have been working with this older dataset, we will provide you with the appropriate evaluation dataset on the day). Each line is one object; the first line gives the names of the measurements associated with each object. The last variable on each line is the "truth": the class of the object, based on spectroscopic data. The first variable ("type") is the estimate of this class from the simple SDSS classifier. See if you can do better than this classifier (a quick baseline check follows the object-type list below)!

The object types are defined as follows:

  • 1: Star
  • 2: Galaxy
  • 3: Quasar
  • 0: Other
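Before training anything, it is worth measuring how well the built-in SDSS "type" estimate already matches the spectroscopic truth, since that is the benchmark to beat. A minimal sketch with pandas (the file name is hypothetical, and it assumes the truth column uses the same numeric codes as above):

    import pandas as pd

    # File name is hypothetical; use the training CSV you downloaded.
    cat = pd.read_csv("sdss_training.csv")

    # "class" is the spectroscopic truth, "type" is the SDSS photometric estimate.
    # Assuming both use the numeric codes listed above (0-3).
    baseline_accuracy = (cat["type"] == cat["class"]).mean()
    print("SDSS benchmark accuracy: {:.3f}".format(baseline_accuracy))

    # Per-class breakdown of the benchmark.
    print(pd.crosstab(cat["class"], cat["type"], normalize="index"))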

 

Variable names and descriptions:

  • type: Output from the SDSS classifier (note that this is NOT the "truth" variable, but rather a benchmark to compare your results to).
  • ra: Right Ascension (coordinate): https://en.wikipedia.org/wiki/Equatorial_coordinate_system#Use_in_astronomy
  • dec: Declination (coordinate): https://en.wikipedia.org/wiki/Equatorial_coordinate_system#Use_in_astronomy
  • psfMag_u, psfMagErr_u; psfMag_g, psfMagErr_g; psfMag_r, psfMagErr_r; psfMag_i, psfMagErr_i; psfMag_z, psfMagErr_z: PSF magnitude and error in the 5 wavelength filter bands, u/g/r/i/z: http://skyserver.sdss.org/dr1/en/proj/advanced/color/sdssfilters.asp
  • modelMag_u, modelMagErr_u; modelMag_g, modelMagErr_g; modelMag_r, modelMagErr_r; modelMag_i, modelMagErr_i; modelMag_z, modelMagErr_z: Model magnitudes in the 5 filter bands: http://www.sdss.org/dr12/algorithms/magnitudes/
  • petroRad_u, pertroRadErr_u; petroRad_g, pertroRadErr_g; petroRad_r, pertroRadErr_r; petroRad_i, pertroRadErr_i; petroRad_z, pertroRadErr_z: Radius of the object in the 5 filter bands: http://skyserver.sdss.org/dr1/en/help/docs/algorithm.asp?key=mag_petro
  • q_u, qErr_u; q_g, qErr_g; q_r, qErr_r; q_i, qErr_i; q_z, qErr_z: Stokes parameter q, a measure of ellipticity.
  • u_u, uErr_u; u_g, uErr_g; u_r, uErr_r; u_i, uErr_i; u_z, uErr_z: Stokes parameter u, a measure of ellipticity.
  • mE1_u, mE1_g, mE1_r, mE1_i, mE1_z; mE2_u, mE2_g, mE2_r, mE2_i, mE2_z: Ellipticity parameters mE1 and mE2 in each of the 5 filter bands: http://www.sdss.org/dr13/algorithms/classify/#photo_adaptive
  • class: ! TRUTH PARAMETER ! Spectroscopic class.
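As one possible starting point (not the competition-winning approach), a random forest over the measurement columns gives a quick baseline and a ranking of feature importances. The file name is hypothetical, and the feature list should be matched against the CSV header:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    cat = pd.read_csv("sdss_training.csv").dropna()   # hypothetical file name

    # Use every measurement column as a feature; exclude the SDSS benchmark
    # estimate ("type") and the spectroscopic truth ("class").
    features = [c for c in cat.columns if c not in ("type", "class")]
    X, y = cat[features].values, cat["class"].values

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)
    print("Validation accuracy: {:.3f}".format(clf.score(X_val, y_val)))

    # Which measurements matter most?
    importances = sorted(zip(clf.feature_importances_, features), reverse=True)
    for imp, name in importances[:10]:
        print("{:24s} {:.3f}".format(name, imp))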

 

Evaluate your classifier using this dataset! (Note that if you used the "old" version of the training data, you will want to use this dataset to evaluate your network.) The columns are the same as in the training dataset.
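Continuing from the training sketch above, evaluation on the held-out set might look like this (the evaluation file name is hypothetical; the ROC curve treats one class, stars, as a one-vs-rest problem):

    import pandas as pd
    from sklearn.metrics import confusion_matrix, roc_curve, auc

    test = pd.read_csv("sdss_evaluation.csv").dropna()   # hypothetical file name

    X_test, y_test = test[features].values, test["class"].values
    y_pred = clf.predict(X_test)

    print(confusion_matrix(y_test, y_pred))

    # ROC curve for one class (star = 1) treated as one-vs-rest,
    # using the predicted probability for that class.
    star_index = list(clf.classes_).index(1)
    star_score = clf.predict_proba(X_test)[:, star_index]
    fpr, tpr, _ = roc_curve(y_test == 1, star_score)
    print("Star-vs-rest AUC: {:.3f}".format(auc(fpr, tpr)))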

Data Competition Award Winners

Participants presented their analysis results and methods on the morning of September 20 during the Data Hackathon and Data Competition event.  A panel of NERSC Data Competition judges selected the following winners:

Challenge #1: Juliette Ugirumurera and Liza Rebrova
Challenge #2: Yisha Sun and Grzegorz Muszynski

 

[Photo: DataCompetitionAward2017]

Left to Right: Jialin Liu, Helen He, Juliette Ugirumurera, Rebecca Hartman-Baker, Yisha Sun, Grzegorz Muszynski, Debbie Bard

Challenge-winning code

The winning code for the Astronomy challenge can be found in this GitHub repository, belonging to Yisha Sun. The code uses TensorFlow to set up and train the network. The script tf_script_yisha.py sets up the TensorFlow model and carries out the training. Further analysis of the results was performed using the scripts confusion_matrix.py (to plot the confusion matrix), plot_features_importance.py (to determine the importance of each of the features used in the network), and roc_curve.py (to plot the ROC curve).
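The repository is the authoritative source for the winning implementation. For readers who want a feel for the general pattern, the following is only a rough, hypothetical sketch of a small fully connected classifier in TensorFlow (tf.keras), not a reproduction of Yisha Sun's code; the feature count and training arrays are placeholders:

    import numpy as np
    import tensorflow as tf

    # Hypothetical shapes: n_features catalog measurements, 4 object classes (0-3).
    n_features, n_classes = 32, 4

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # X_train / y_train would come from the training CSV (see the sketches above);
    # random placeholders are used here just so the snippet runs end to end.
    X_train = np.random.rand(1024, n_features).astype("float32")
    y_train = np.random.randint(0, n_classes, size=1024)
    model.fit(X_train, y_train, epochs=5, batch_size=128, validation_split=0.1)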

The winning code for the Slurm log data challenge can be found in this IPython notebook, which belongs to Juliette Ugirumurera. The code uses scikit-learn to construct, train, and evaluate the network.

For both of these challenges (as in most machine learning problems), note that a significant amount of work was required to clean the dataset and scale the variables of interest. These codes give a nice demonstration of how to do this for the two datasets provided for these challenges!
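As an illustration of that cleaning-and-scaling step (again a hypothetical sketch, not taken from either winning notebook), a scikit-learn Pipeline keeps the scaler and the model together so the same transformation is applied to any new data:

    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("training.csv")                 # hypothetical file name

    # Typical cleaning: drop rows with missing values and remove sentinel values
    # (e.g. -9999 magnitudes) before scaling.
    df = df.dropna()
    df = df[(df.select_dtypes("number") > -999).all(axis=1)]

    # "type" and "class" are the hypothetical benchmark and truth columns.
    X = df.drop(columns=["type", "class"]).select_dtypes("number").values
    y = df["class"].values

    # The scaler is fit only on the training data; the pipeline re-applies the
    # same transformation whenever predict() is called on new data.
    pipe = Pipeline([("scale", StandardScaler()),
                     ("model", LogisticRegression(max_iter=1000))])
    pipe.fit(X, y)
    print("Training accuracy: {:.3f}".format(pipe.score(X, y)))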