Testing, Tuning, and Configuring NERSC's Data Direct Disk Array for Throughput

Rick Un

Purpose: to tune and configure our Data Direct disk array to maximize sustained transfer rates.

Pre-testing:

An ftp transfer from the H70 to a Winterhawk I produces read rates of 15-46 meg/sec. The higher number is a memory-to-network copy. A disk (non-cache) copy to the network (gig) takes 15-ish seconds for a 292 meg file (19 meg/sec).
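
The 19 meg/sec figure above is simply file size divided by elapsed time. A minimal sketch of that arithmetic in Python follows; the function name is only illustrative, and the 15-second time is the approximate value quoted above, so the result is approximate as well.

```python
def rate_meg_per_sec(size_meg, seconds):
    """Return the average transfer rate in meg/sec."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return size_meg / seconds

# 292 meg file copied from disk to the network in roughly 15 seconds
disk_rate = rate_meg_per_sec(292, 15)
print(f"{disk_rate:.1f} meg/sec")  # prints "19.5 meg/sec"
```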

TEST CASE 1:

setup: 
3 different machines concurrently ftp a 400 meg file to /dev/null on swift

result: 
test1: 22.17meg/s , 18.52meg/s , 26.27meg/s = (agg) 66.96 meg/sec
test2: 21.78meg/s , 18.76meg/s , 25.70meg/s = (agg) 66.24 meg/sec
test3: 24.08meg/s , 18.75meg/s , 24.16meg/s = (agg) 66.99 meg/sec

conclusion:  H70's gig ethernet can sustain 60+ meg/sec real bandwidth
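
The aggregate figures above are the sum of the per-stream rates. As a rough sketch of how a concurrent measurement like this could be scripted, the Python below times several transfers started together and sums their rates. It is not the procedure used for these tests (which ran ftp by hand from separate machines): the helper names are hypothetical, it runs everything from one driver process, and fake_transfer is a stand-in that only sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed_rate(transfer, size_meg):
    """Run one transfer callable and return its rate in meg/sec."""
    start = time.perf_counter()
    transfer()
    elapsed = time.perf_counter() - start
    return size_meg / elapsed

def run_concurrently(transfers, size_meg):
    """Start all transfers together; return (per-stream rates, aggregate)."""
    with ThreadPoolExecutor(max_workers=len(transfers)) as pool:
        futures = [pool.submit(timed_rate, t, size_meg) for t in transfers]
        rates = [f.result() for f in futures]
    return rates, sum(rates)

def fake_transfer():
    # Placeholder (assumption): replace with a real transfer, for example
    # a function that opens an ftplib.FTP session and calls storbinary().
    time.sleep(0.05)

rates, agg = run_concurrently([fake_transfer] * 3, size_meg=400)
print(", ".join(f"{r:.2f} meg/s" for r in rates), f"= (agg) {agg:.2f} meg/sec")
```

With the placeholder the printed rates are meaningless; only the structure (start together, time each stream, sum the rates) carries over.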

TEST CASE 2:

setup: 
3 different machines concurrently ftp a 400 meg file to /testfs2 (fc filesystem)
raid 3, same filesystem, no caching

result:
test1: 2.43 meg/s , 1.81 meg/s , 1.92 meg/s = (agg)  6.16 meg/sec 
test2: 1.72 meg/s , 1.66 meg/s , 1.74 meg/s = (agg)  5.12 meg/sec
test3: 2.41 meg/s , 2.08 meg/s , 3.44 meg/s = (agg)  7.93 meg/sec


TEST CASE 3:

setup: 2 different machines ftp a 400 meg file
again raid 3, same filesystem, no caching

test1: 4.16 meg/s , 3.93 meg/s  = (agg) 8.09 meg/sec
test2: 4.02 meg/s , 4.15 meg/s  = (agg) 8.17 meg/sec 
test3: 24.59 meg/s , 17.53 meg/s = (agg) 42.12 meg/sec 
#first removed all files
test4: 4.16 meg/s , 4.04 meg/s  = (agg) 8.20 meg/sec 
test5: 9.99 meg/s , 8.90 meg/s  = (agg) 18.89 meg/sec
test6: 3.01 meg/s , 3.00 meg/s  = (agg) 6.01 meg/sec
test7: 23.81 meg/s , 6.16 meg/s = (agg) 29.97 meg/sec
#again, removed all files

NOTE: The tests so far all used raid 3 with no caching.  Raid 5 
was actually 1-2 meg/sec slower, and caching did not seem to matter 
in most tests.  From now on, all tests use raid 3 with caching 
enabled.

setup: 2 different machines ftp a 400 meg file to same filesystem on a 
single LUN (hdisk)

test1: 22.68 meg/sec, 22.49 meg/sec     = (agg) 45.17 meg/sec
# note: did a rm of all files in fs before# 
test2: 5.05 meg/sec, 4.36 meg/sec       = (agg) 9.41 meg/sec
test3: 3.32 meg/sec, 3.44 meg/sec       = (agg) 6.76 meg/sec
test4: 21.75 meg/sec, 5.80 meg/sec      = (agg) 27.55 meg/sec

The files appear to be stored in the cache of the FC disks, because I still
notice writes even after the ftp transfer has finished.
                                                                
setup: 2 different machines ftp a 400 meg file to different filesystem 
on 2 different LUN's but within the same raid group

test1: 28.12 meg/sec, 6.12 meg/sec      = (agg) 34.24 meg/sec
# note: did a rm of all files in fs before# 
test2: 9.31 meg/sec, 6.15 meg/sec       = (agg) 15.46 meg/sec
test3: 6.87 meg/sec, 6.41 meg/sec       = (agg) 13.28 meg/sec
test4: 28.19 meg/sec, 27.56 meg/sec     = (agg) 55.75 meg/sec
# note did a rm of all files in fs before #

setup: 2 different machines ftp a 400 meg file to different filesystem 
on 2 different LUN's in 2 different raid groups

test1: 27.98 meg/sec, 17.88 meg/sec = (agg) 45.86 meg/sec 
#removed all files first 
test2: 16.52 meg/sec, 17.34 meg/sec = (agg) 33.86 meg/sec
test3: 17.48 meg/sec, 18.30 meg/sec = (agg) 35.78 meg/sec
test4: 17.08 meg/sec, 18.22 meg/sec = (agg) 35.30 meg/sec
test5: 27.42 meg/sec, 27.14 meg/sec = (agg) 54.56 meg/sec
#removed all files first, definitely flushing cache
# write continuing after ftp session

Summary:

Basically, caching does help performance, but usually only in the cases where the filesystems were empty; removing the files seems to flush the cache. This behavior shows up at every performance run made right after an "rm //*" is run on all tested filesystems.

Also, it is interesting to note that as a filesystem fills up with more files, its performance decreases. It is unclear whether this behavior also appears when using raw logical volumes. Yet, in general, raw LVs had faster average write speeds and did not seem to be using much cache, as seen in the monitor tool. However, it is unclear how to "flush" a raw LV as was done with the filesystems.

Most important, however, seems to be the fact that one can sustain almost full transfer rates to the fibre channel disks by writing to two separate LUNs in two separate raid groups. Additionally, raw LVs do not seem to suffer from what I suspect is locking contention or filesystem control block updates, which filesystems experience when performing multiple writes to the same filesystem (versus multiple writes to the same raw LV).
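
To make the comparison between the three raid 3, caching-enabled layouts easier to see, the Python sketch below recomputes the aggregate for each run from the per-stream rates listed earlier and reports the worst and mean aggregate per layout. The layout labels and the summarize helper are illustrative names only; the numbers are copied from the runs above.

```python
from statistics import mean

# Per-stream rates (meg/sec) copied from the raid 3, caching-enabled runs.
results = {
    "same filesystem, single LUN": [
        (22.68, 22.49), (5.05, 4.36), (3.32, 3.44), (21.75, 5.80),
    ],
    "2 LUNs, same raid group": [
        (28.12, 6.12), (9.31, 6.15), (6.87, 6.41), (28.19, 27.56),
    ],
    "2 LUNs, 2 raid groups": [
        (27.98, 17.88), (16.52, 17.34), (17.48, 18.30),
        (17.08, 18.22), (27.42, 27.14),
    ],
}

def summarize(runs):
    """Return (worst aggregate, mean aggregate) over a list of runs."""
    aggs = [sum(pair) for pair in runs]
    return round(min(aggs), 2), round(mean(aggs), 2)

summary = {name: summarize(runs) for name, runs in results.items()}
for name, (worst, avg) in summary.items():
    print(f"{name:30s} worst {worst:6.2f}  mean {avg:6.2f} meg/sec")
```

On these numbers the worst run rises from roughly 7 meg/sec (single LUN) to about 13 (same raid group) to about 34 (separate raid groups), and the mean aggregate rises from about 22 to about 30 to about 41 meg/sec. With only four or five runs per layout, this is an indication rather than a firm measurement.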

Some initial tests (not included above) writing to 2 different LUNs within the same raid group using raw LVs resulted in transfer rates of only 12 meg/sec, whereas a single transfer results in 16+ meg/sec. So there seems to be some type of contention here as well.