2013 PDSF User Meeting Minutes
December 3
Attending
Shusu, Ernst, Mike, Craig, Iwona, Larry, Lisa
Outages/Downtimes
November 12 - 15: Mendel nodes offline for recabling
Various Dates: Rolling upgrades of various PDSF interactive nodes
Upcoming Downtimes
January: Possible project outage
December 16: Eliza 3, 8, 9 will no longer be accessible
Other Issues
The new Mendel rack is in place, and the new interactive nodes are open to a few beta testers.
Amount of available scratch now tracked properly by queue system.
Multicore jobs can be run on PDSF.
Larry gave an update on the Mendel maintenance. Everything appears stable: the logs show no errors and there have been no further connection hiccups.
Slides
You can find the slides shown at the meeting here.
November 5
Attending
Alex, Mike, Jeff, Iwona, Larry, Lisa
Outages/Downtimes
October 8: Project offline for inode expansion
Upcoming Downtimes
November 6: Rolling upgrade of PDSF interactive nodes
November 12: Mendel compute nodes (48 of PDSF's 205) will be offline for a week for recabling
Other Issues
20 new Mendel nodes have been delivered. They will likely be permanently added to the cluster after the recabling finishes on 11/19.
The interactive nodes will be replaced with newer Mendel nodes after the holidays. We will be going from 4 nodes to 3, but keeping the number of cores constant.
Since 10/29 Iwona and Tony have been doing a rolling upgrade of the kernel and GPFS on the compute nodes.
Benchmarking is online. You can find the link here: PDSF Monitoring
cvmfs is available via parrot on carver. Please see the minutes for instructions.
The NERSC AFS cell is going away in a few months. Since STAR still needs this, Iwona and Jeff will investigate alternatives.
ALICE is running on only cvmfs as of 1 week ago.
ALICE and STAR's project space is available and they will begin migrating there around the end of November.
Slides
You can find the slides shown at the meeting here.
October 1
Attending
Craig, Jeff, Yuen-Dat, Iwona
Outages/Downtimes
August 28: Project degraded
Upcoming Downtimes
Mid-November: Maintenance and Mendel increase
Other Issues
Since Lisa was out of town, the meeting was hosted by Iwona.
Slides
The slides shown at the meeting are here.
September 3
Attending
Craig, Alex, Yuen-Dat, Iwona, Lisa
Outages/Downtimes
August 12: Network issues
August 20: Partial outage. Mendel nodes upgrade
Upcoming Downtimes
October 10 (tentative): All day, maintenance
Future: Mendel node upgrade (partial outage)
Slides
The slides shown at the meeting are here.
August 6
Attending
Larry, Jeff, Alex, Craig, Lisa
Outages/Downtimes
July 26: eliza18 degraded
July 30: maintenance
July 31: /project offline for all of NERSC
Upcoming Downtimes
August 20: Mendel maintenance
Other Issues
There are several new groups that are considering joining PDSF on a trial basis.
ESnet workshop in 2 1/2 weeks; some networking measurements may be needed from PDSF.
Slides
The slides shown at the meeting are here.
July 16
Attending
Mike, Craig (phone), Lisa
Outages/Downtimes
July 5: Univa Grid Engine update
Upcoming Downtimes
July 30th maintenance.
August 20th tentative downtime for new Mendel nodes.
Other Issues
The other Daya Bay production site will be offline from July 17 to July 24, so PDSF uptime will be important.
Slides
The slides shown at the meeting are here.
July 2
Attending
Alex, Mike, Yuen-Dat, Chang Hyon, Jeff (phone), Lisa
Outages/Downtimes
June 19: Univa Grid Engine update
June 28: External voltage spike
Upcoming Downtimes
July 30th maintenance.
Other Issues
Jobs in Eqw state will be cleared after 1 business day.
New account creation for PDSF will follow existing NERSC notification procedure. This will result in all responsible people (PI, PI Proxy, Project Manager) getting an email telling them to approve the new users.
Benchmarking files will be put in each eliza. The web interface is being developed.
Slides
The slides shown at the meeting are here.
June 18
Attending
Craig, Alex, Chang Hyon, Iwona (phone), Larry (phone), Lisa
Outages/Downtimes
None
Upcoming Downtimes
July 30th maintenance.
Other Issues
Non-functioning modules have been cleaned out of SL53. Please email Lisa if you need more modules.
Instructions for dot file migration were mailed to users; the new dot files are now the defaults.
Some new functionality has been added to the NERSC pages. The completed jobs summary page has been fixed; you can see it here. You can see the login node usage here and completed jobs on PDSF here.
The PDSF Steering Committee Meeting is tentatively scheduled for the end of July; there will be a Doodle poll.
Lisa is working on providing benchmarking information for the login nodes, project, and the elizas.
Slides
The slides shown at the meeting are here.
June 4
Attending
Mike, Alex, Jeff, Chang Hyon, Iwona (phone), Lisa
Outages/Downtimes
May 21st maintenance.
Upcoming Downtimes
None planned.
Other Issues
Old and decrepit modules are going to be cleaned out of SL53 on 6/5/13. Please email Lisa if something you need is suddenly missing.
Instructions for dot file migration will be mailed to PDSF users on Monday 6/10/13.
Some new functionality has been added to the NERSC pages. You can see the login node usage here and completed jobs on PDSF here.
Slides
The slides shown at the meeting are here.
May 21
Attending
Mike, Jeff, Hiroshi, Yuen-Dat, Yushu, Lisa
Outages/Downtimes
No significant outages in the past 2 weeks.
Upcoming Downtimes
May 21st: GPFS upgrade on the elizas and interactive nodes.
Other Issues
Jeff asked about making 32 bit libraries available for STAR software on carver. Lisa referred him to consult@nersc.gov.
Yuen-Dat asked about compiling fluka on PDSF. There is already a ticket on this and it sounds like a solution has been found. Lisa will follow up.
Slides
The slides shown at the meeting are here.
May 7
Attending
Mike, Jeff, Keith, Yuen-Dat, Alex, Iwona, Lisa
Outages/Downtimes
No significant outages in the past 2 weeks.
Upcoming Downtimes
May 8th rolling upgrade of compute nodes to new kernel/GPFS, may impact cluster performance.
Downtime proposed for May 21st to upgrade GPFS on elizas, interactive nodes, and possibly ALICE upgrades.
Other Issues
Lisa takes over from Yushu. Thanks to Yushu for all his hard work.
Eliza11 has limited I/O slots; Iwona will look into it.
Jeff asked for the failure rate of xrootd disks. Lisa will look into this. Jeff mentioned that it would be nice to automate this process.
The transition to a more unified dot file system was discussed. Lisa will write up documentation on how to migrate to the new files.
Slides
The slides shown at the meeting are here.