PDSF Users Meeting 12/8/09
Attending: Eric and Jay from PDSF and users Andrew, Jeff P., Marjorie, Doug, Keith and Jeff A.
Cluster status: Utilization has been relatively light, mostly STAR and some ATLASUCI jobs.
Outages: There were two problems on 12/7. In the morning the cluster was slow, probably because an ATLASUCI user started about 700 jobs at the same time and overloaded /common. The user was contacted, his account was temporarily disabled, and his jobs were allowed to finish. In the future the user will stagger job starts and use the appropriate io resource. In the afternoon GPFS problems affected many nodes on the cluster. The cause was related to a scan, and steps have been taken to avoid this problem in the future.
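As an illustration of staggering job starts, here is a minimal Python sketch that assigns a start delay to each job so that only a fixed number begin at once. The function name, the batch size of 50 and the 60-second interval are hypothetical values chosen for the example, not settings recommended at the meeting, and the sketch only computes delays; it does not submit anything to the batch system.

```python
def staggered_delays(n_jobs, batch_size, interval_s):
    """Return a start delay (in seconds) for each of n_jobs jobs.

    Jobs are grouped into batches of batch_size; each batch starts
    interval_s seconds after the previous one. All three parameters
    are illustrative assumptions, not site policy.
    """
    if n_jobs < 0 or batch_size < 1 or interval_s < 0:
        raise ValueError("n_jobs >= 0, batch_size >= 1, interval_s >= 0 required")
    # Job i belongs to batch i // batch_size.
    return [(i // batch_size) * interval_s for i in range(n_jobs)]


if __name__ == "__main__":
    # Hypothetical example: 700 jobs released 50 at a time, one minute apart.
    delays = staggered_delays(700, 50, 60)
    print("last batch starts after", max(delays), "seconds")
```

Each delay could then be applied by the submitting script, for example by sleeping between batches of submissions.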
New Hardware: Four new filesystems are now online (eliza5, eliza6, eliza14 and eliza15) and the group that will use them has been notified. The "diskvaults" webpage is being updated with the new information.
ATLAS Grid Node: There are discussions in progress between Iwona and her contact in ATLAS regarding the optimal way to configure a grid node for ATLAS.
Squid/Fuse: Jay is working on sorting out some issues related to chos.
- /eliza4 permissions problem: An ATLASUCI user was found to be writing to /eliza4, which is ATLAS storage. This was possible because the user had "atlas" as a secondary Unix group. All ATLASUCI users configured that way were reconfigured to be only in "atlasuci" and not "atlas". The user later removed his files and will work on /project instead.
- SGE accounting w.r.t. job arrays was discussed. The bug was reportedly fixed in the new release, but this has not been verified yet.
- There was a discussion about whether the io resource settings are too low. The resources are typically set at about 200, which can limit throughput. Jay agreed that we could try raising them somewhat.
- Retirement of old hardware came up. Jay said he would generate a list, and we'd bring it up again at the next meeting.
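On the io resource item above, the effect of the limit on throughput can be sketched with simple arithmetic. This assumes the io resource behaves like a consumable whose per-job requests are summed against the total, so that the number of jobs able to run at once is the total divided by the per-job request. The function name and the per-job request values are hypothetical; only the total of about 200 comes from the discussion.

```python
def max_concurrent_jobs(resource_total, per_job_request):
    """Upper bound on simultaneously running jobs for one io resource.

    Assumes a consumable resource: running jobs' requests are summed
    and may not exceed resource_total. Values are illustrative only.
    """
    if resource_total < 0 or per_job_request <= 0:
        raise ValueError("resource_total >= 0 and per_job_request > 0 required")
    return resource_total // per_job_request


if __name__ == "__main__":
    # Total of 200 as mentioned in the discussion; requests are hypothetical.
    for request in (1, 2, 4):
        print(request, max_concurrent_jobs(200, request))
```

Under this assumption, raising the total raises the ceiling on concurrent jobs proportionally, which is the trade-off against filesystem load that the adjustment would have to balance.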