NERSC: Powering Scientific Discovery Since 1974

Taylor Groves

Taylor Groves, Ph.D.
HPC Architecture and Performance Engineer
Advanced Technology Group
National Energy Research Scientific Computing Center
1 Cyclotron Rd
Mailstop: 59R4010A (office 59-3072B)
Berkeley, California 94720 US

Biographical Sketch

Taylor Groves is a member of NERSC's Advanced Technology Group, where he focuses on the modeling, analysis, and simulation of networks and distributed systems. Prior to joining NERSC, he was a graduate researcher at Sandia National Laboratories' Center for Computing Research.

He earned a B.S. in Computer Science from Texas State University. He holds an M.S. and a Ph.D. from the University of New Mexico, where he was part of the Scalable Systems Laboratory under Prof. Dorian Arnold.

For a complete CV, see TGroves-cv.pdf.

The most up-to-date list of publications is on Google Scholar.

Personal site: http://taylorgroves.com

Journal Articles

Kurt Ferreira, Ryan E. Grant, Michael J. Levenhagen, Scott Levy, Taylor Groves, "Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design", Concurrency and Computation: Practice and Experience, December 1, 2018

Taylor Groves, Ryan Grant, Aaron Gonzales, Dorian Arnold, "Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning", Transactions on Parallel and Distributed Systems, November 21, 2017, doi: 10.1109/TPDS.2017.2773483

Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. In this work we examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory and cache when applications and out-of-band services compete for memory resources. We then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that NiMC's impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes, even in applications that previously have been shown to be performance resilient in the presence of noise. Additionally, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights about the root cause of NiMC and facilitates dynamic selection of potential solutions. Lastly, we evaluate three potential techniques to reduce NiMC's impact, namely hardware offloading, core reservation and network throttling.

Conference Papers

George Michelogiannakis, Yiwen Shen, Min Yee Teh, Xiang Meng, Benjamin Aivazi, Taylor Groves, John Shalf, Madeleine Glick, Manya Ghobadi, Larry Dennison, Keren Bergman, "Bandwidth Steering in HPC using Silicon Nanophotonics", International Conference on High Performance Computing, Networking, Storage and Analysis (SC'19), November 17, 2019

Sudheer Chunduri, Taylor Groves, Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn Lockwood, Scott Parker, Steven Warren, Nathan Wichmann, Nicholas Wright, "GPCNeT: Designing a Benchmark Suite for Inducing and Measuring Contention in HPC Networks", International Conference on High Performance Computing, Networking, Storage and Analysis (SC'19), November 16, 2019

Network congestion is one of the biggest problems facing HPC systems today, affecting system throughput, performance, user experience and reproducibility. Congestion manifests as run-to-run variability due to contention for shared resources like filesystems or routes between compute endpoints. Despite its significance, current network benchmarks fail to proxy the real-world network utilization seen on congested systems. We propose a new open-source benchmark suite called the Global Performance and Congestion Network Tests (GPCNeT) to advance the state of the practice in this area. The guiding principles used in designing GPCNeT are described and the methodology employed to maximize its utility is presented. The capabilities of GPCNeT are evaluated by analyzing results from several of the world’s largest HPC systems, including an evaluation of congestion management on a next-generation network. The results show that systems of all technologies and scales are susceptible to congestion, and this work motivates the need for congestion control in next-generation networks.

Tiffany Connors, Taylor Groves, Tony Quan, Scott Hemmert, "Simulation Framework for Studying Optical Cable Failures in Dragonfly Topologies", Workshop on Scalable Networks for Advanced Computing Systems in conjunction with IPDPS, May 17, 2019

Nathan Hjelm, Matthew Dosanjh, Ryan Grant, Taylor Groves, Patrick Bridges, Dorian Arnold, "Improving MPI Multi-threaded RMA Communication Performance", ACM International Conference on Parallel Processing (ICPP), August 1, 2018

Kurt Ferreira, Ryan E. Grant, Michael J. Levenhagen, Scott Levy, Taylor Groves, "Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design", ExaMPI in association with SC17, November 12, 2017

Taylor Groves, Yizi Gu, Nicholas J. Wright, "Understanding Performance Variability on the Aries Dragonfly Network", HPCMASPA in association with IEEE Cluster, September 1, 2017

Matthew GF Dosanjh, Taylor Groves, Ryan E Grant, Ron Brightwell, Patrick G Bridges, "RMA-MT: a benchmark suite for assessing MPI multi-threaded RMA performance", Cluster, Cloud and Grid Computing (CCGrid), 2016 16th IEEE/ACM International Symposium on, IEEE, September 1, 2016, pp. 550-559

Taylor Groves, Ryan E Grant, Dorian Arnold, "NiMC: Characterizing and eliminating network-induced memory contention", Parallel and Distributed Processing Symposium, 2016 IEEE International, January 1, 2016, pp. 253-262

Taylor Groves, Ryan E Grant, Scott Hemmert, Simon Hammond, Michael Levenhagen, Dorian C Arnold, "(SAI) Stalled, Active and Idle: Characterizing Power and Performance of Large-Scale Dragonfly Networks", Cluster Computing (CLUSTER), 2016 IEEE International Conference on, January 1, 2016, pp. 50-59

Taylor Groves, Samuel K Gutierrez, Dorian Arnold, "A LogP Extension for Modeling Tree Aggregation Networks", Cluster Computing (CLUSTER), 2015 IEEE International Conference on, 2015, pp. 666-673

Joshua D Goehner, Taylor L Groves, Dorian C Arnold, Dong H Ahn, Gregory L Lee, "An Optimal Algorithm for Extreme Scale Job Launching", Trust, Security and Privacy in Computing and Communications (TrustCom), 2013 12th IEEE International Conference on, 2013, pp. 1115-1122

Taylor Groves, Dorian Arnold, Yihua He, "In-network, Push-based Network Resource Monitoring: Scalable, Responsive Network Management", Proceedings of the Third International Workshop on Network-Aware Data Management, 2013, 8

Xiao Chen, Jian Shen, Taylor Groves, Wu Jie, "Probability Delegation Forwarding in Delay Tolerant Networks", Computer Communications and Networks, 2009. ICCCN 2009. Proceedings of 18th International Conference on, IEEE, January 1, 2009

Book Chapters

Ryan E. Grant, Taylor Groves, Simon Hammond, K. Scott Hemmert, Michael Levenhagen, Ron Brightwell, "Handbook of Exascale Computing: Network Communications", Chapman and Hall, January 1, 2017, ISBN: 978-1466569003

Presentation/Talks

Taylor Groves, Networks, Damn Networks and Aries, NERSC CS/Data Seminar, October 6, 2017

A presentation on the performance of the Cori Aries network, with highlights of the monitoring and analysis efforts underway.

Doug Jacobsen, Taylor Groves, Global Aries Counter Collection and Analysis, Cray Quarterly Meeting, July 25, 2017

Taylor Groves, Characterizing Power and Performance in HPC Networks, Future Technologies Group at ORNL, January 10, 2017

Taylor Groves, Characterizing and Improving Power and Performance in HPC Networks, Advanced Technology Group, NERSC, January 8, 2017

Taylor Groves, Improving Power and Performance in HPC Networks, AMD Research - Austin, June 10, 2016

Reports

Taylor Groves, Ryan Grant, "Power Aware, Dynamic Provisioning of HPC Networks", Sandia National Labs report, 2015

Taylor Groves, Kurt B Ferreira, "Balancing Power and Time of MPI Operations", CCR, 2014

Taylor Groves, Jeff Knockel, Eric Schulte, "BFS vs CFS scheduler comparison", 2009

Posters

Taylor Groves, "Characterizing and Improving Power and Performance in HPC Networks (Doctoral Showcase)", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 1, 2016

Taylor Groves, Ryan Grant, Dorian Arnold, "Network-induced Memory Contention", Salishan Conference on High Speed Computing, Gleneden Beach, OR, April 1, 2016