NERSC: Powering Scientific Discovery Since 1974

2013 PDSF User Meeting Minutes

December 3

Attending

Shusu, Ernst, Mike, Craig, Iwona, Larry, Lisa

Outages/Downtimes

November 12 - 15: Mendel nodes offline for recabling

Various Dates: Rolling upgrades of various PDSF interactive nodes

Upcoming Downtimes

January: Possible project outage

December 16: Eliza 3, 8, 9 will no longer be accessible

Other Issues

The new Mendel rack is in place, and the new interactive nodes are open to a few beta testers.

Amount of available scratch now tracked properly by queue system.

Multicore jobs can be run on PDSF.

Larry gave an update on the Mendel maintenance. Everything is stable, based on the lack of errors in the logs and no more connection hiccups.

Slides

You can find the slides shown at the meeting here.

November 5

Attending

Alex, Mike, Jeff, Iwona, Larry, Lisa

Outages/Downtimes

October 8: Project offline for inode expansion

Upcoming Downtimes

November 6: Rolling upgrade of PDSF interactive nodes

November 12: Mendel compute nodes (48 of PDSF's 205) will be offline for a week for recabling

Other Issues

20 new Mendel nodes have been delivered. They will likely be permanently added to the cluster after the recabling finishes on 11/19.

The interactive nodes will be replaced with newer Mendel nodes after the holidays. We will be going from 4 nodes to 3, but keeping the number of cores constant.

Since 10/29 Iwona and Tony have been doing a rolling upgrade of the kernel and GPFS on the compute nodes.

Benchmarking is online. You can find the link here: PDSF Monitoring

cvmfs is available via parrot on carver. Please see the minutes for instructions.

The NERSC AFS cell is going away in a few months. Since STAR still needs this, Iwona and Jeff will investigate alternatives.

As of one week ago, ALICE is running on cvmfs only.

ALICE and STAR's project space is available and they will begin migrating there around the end of November.

Slides

You can find the slides shown at the meeting here.

October 1

Attending

Craig, Jeff, Yuen-Dat, Iwona

Outages/Downtimes

August 28: Project degraded

Upcoming Downtimes

Mid-November: Maintenance and Mendel increase

Other Issues

Since Lisa was out of town, the meeting was hosted by Iwona.

Slides

The slides shown at the meeting are here.

September 3

Attending

Craig, Alex, Yuen-Dat, Iwona, Lisa

Outages/Downtimes

August 12: Network issues

August 20: Partial outage for Mendel node upgrade

Upcoming Downtimes

October 10 (tentative): All day, maintenance

Future: Mendel node upgrade (partial outage)

Slides

The slides shown at the meeting are here.

August 6

Attending

Larry, Jeff, Alex, Craig, Lisa

Outages/Downtimes

July 26: eliza18 degraded

July 30: maintenance

July 31: /project offline for all of NERSC

Upcoming Downtimes

August 20: Mendel maintenance

Other Issues

There are several new groups that are considering joining PDSF on a trial basis.

ESnet workshop in 2 1/2 weeks; some networking measurements may be needed from PDSF.

Slides

The slides shown at the meeting are here.

July 16

Attending

Mike, Craig (phone), Lisa

Outages/Downtimes

July 5: Univa Grid Engine update

Upcoming Downtimes

July 30th maintenance.

August 20th tentative downtime for new Mendel nodes.

Other Issues

The other Daya Bay production site will be offline from July 17 to July 24, so PDSF uptime will be important.

Slides

The slides shown at the meeting are here.

July 2

Attending

Alex, Mike, Yuen-Dat, Chang Hyon, Jeff (phone), Lisa

Outages/Downtimes

June 19: Univa Grid Engine update

June 28: External voltage spike

Upcoming Downtimes

July 30th maintenance.

Other Issues

Jobs in Eqw state will be cleared after 1 business day.

New account creation for PDSF will follow existing NERSC notification procedure. This will result in all responsible people (PI, PI Proxy, Project Manager) getting an email telling them to approve the new users.

Benchmarking files will be put in each eliza. The web interface is being developed.

Slides

The slides shown at the meeting are here.

June 18

Attending

Craig, Alex, Chang Hyon, Iwona (phone), Larry (phone), Lisa

Outages/Downtimes

None

Upcoming Downtimes

July 30th maintenance.

Other Issues

Non-functioning modules have been cleaned out of SL53. Please email Lisa if you need more modules.

Instructions for dot file migration were mailed to users; the new dot files are now the defaults.

Some new functionality has been added to the NERSC pages. The completed jobs summary page has been fixed; you can see it here. You can see the login node usage here and completed jobs on PDSF here.

The PDSF Steering Committee meeting is tentatively scheduled for the end of July; there will be a Doodle poll.

Lisa is working on providing benchmarking information for the login nodes, project, and the elizas.

Slides

The slides shown at the meeting are here.

June 4

Attending

Mike, Alex, Jeff, Chang Hyon, Iwona (phone), Lisa

Outages/Downtimes

May 21st maintenance.

Upcoming Downtimes

None planned.

Other Issues

Old and decrepit modules are going to be cleaned out of SL53 on 6/5/13. Please email Lisa if something you need is suddenly missing.

Instructions for dot file migration will be mailed to PDSF users on Monday 6/10/13.

Some new functionality has been added to the NERSC pages. You can see the login node usage here and completed jobs on PDSF here.

Slides

The slides shown at the meeting are here.

May 21

Attending

Mike, Jeff, Hiroshi, Yuen-Dat, Yushu, Lisa

Outages/Downtimes

No significant outages in the past 2 weeks.

Upcoming Downtimes

May 21st: GPFS upgrade on elizas and interactive nodes.

Other Issues

Jeff asked about making 32-bit libraries available for STAR software on carver. Lisa referred him to consult@nersc.gov.

Yuen-Dat asked about compiling fluka on PDSF. There is already a ticket on this and it sounds like a solution has been found. Lisa will follow up.

Slides

The slides shown at the meeting are here.


May 7

Attending

Mike, Jeff, Keith, Yuen-Dat, Alex, Iwona, Lisa

Outages/Downtimes

No significant outages in the past 2 weeks.

Upcoming Downtimes

May 8th: rolling upgrade of compute nodes to new kernel/GPFS; may impact cluster performance.

Downtime proposed for May 21st to upgrade GPFS on the elizas and interactive nodes, and possibly for ALICE upgrades.

Other Issues

Lisa takes over from Yushu. Thanks to Yushu for all his hard work.

Eliza11 has limited I/O slots; Iwona will look into it.

Jeff asked for the failure rate of xrootd disks. Lisa will look into this. Jeff mentioned that it would be nice to automate this process.

The transition to a more unified dot file system was discussed. Lisa will write up documentation on how to migrate to the new files.

Slides

The slides shown at the meeting are here.