<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title> blog</title>
		<link>http://www.nersc.gov/users/computational-systems/genepool/updates-and-status/open-issues/</link>
		<atom:link href="http://www.nersc.gov/users/computational-systems/genepool/updates-and-status/open-issues/" rel="self" type="application/rss+xml" />
		<description></description>

		
		<item>
			<title>RESOLVED: Projectb filesystem outage July 9, 2012</title>
			<link>http://www.nersc.gov/users/computational-systems/genepool/updates-and-status/open-issues/resolved-projectb-filesystem-outage-july-9-2012/</link>
			<description>&lt;p&gt; &lt;/p&gt;
&lt;div&gt;
&lt;div&gt;The projectb filesystem had a hardware failure that potentially generated I/O errors.  The filesystem logs indicate that the earliest abnormal event on the filesystem occurred at 9:19AM and the filesystem was taken down for maintenance at 10:42AM.  The filesystem returned to service at 11:20AM.  Jobs running on the cluster would not have been able to read from or write to the projectb filesystem between 10:42AM and 11:20AM.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt; &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;Between 9:19AM and 10:42AM, one of the 20 GPFS controllers on projectb was down and did not fail over as it should have.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt; &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;&lt;strong&gt;This means:&lt;/strong&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;About 1 in 20 file I/O operations could have failed between 9:19AM and 10:42AM.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt; &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;If your job performed a large number of short reads and writes, it is more likely to have been affected.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt; &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;Any data that was successfully written (to a complete file) is good.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt; &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;Any data that was written with random I/O (e.g. fseek/fwrite) could be suspect and should be checked carefully.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt; &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;&lt;em&gt;Please check any data that was written between &lt;strong&gt;9:19AM and 10:42AM&lt;/strong&gt;.&lt;/em&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;&lt;em&gt;Please check any jobs that were running during this period if they were performing I/O on projectb.&lt;/em&gt;&lt;/div&gt;
&lt;/div&gt;
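&lt;div&gt;
&lt;div&gt;One way to find candidate files is to list everything whose modification time falls inside the affected window. The following is a minimal sketch, not an official NERSC tool: it assumes GNU find is available, that timestamps are given in the cluster's local time, and that DATA_DIR (a hypothetical placeholder) points at your own directory on projectb.&lt;/div&gt;
&lt;/div&gt;
&lt;pre&gt;
```shell
#!/bin/sh
# Sketch: list regular files last modified inside a time window.
# Assumes GNU find (the -newermt test is not part of POSIX find).

list_modified_between() {
    # $1 = directory to search
    # $2 = start of window, $3 = end of window (any timestamp GNU find can parse)
    # Matches files modified after $2 and not after $3.
    find "$1" -type f -newermt "$2" ! -newermt "$3" -print
}

# DATA_DIR is a hypothetical placeholder; it defaults to the current directory.
# The window below is the one reported for July 9, 2012.
list_modified_between "${DATA_DIR:-.}" "2012-07-09 09:19:00" "2012-07-09 10:42:00"
```
&lt;/pre&gt;
&lt;div&gt;
&lt;div&gt;The modification time records only the last write to a file, so a file that was written during the window and modified again afterwards will not be listed. Treat the output as a starting point rather than a complete list.&lt;/div&gt;
&lt;/div&gt;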
&lt;div&gt;
&lt;div&gt;&lt;em&gt;Please file a ticket with NERSC if you need help.  (&lt;a href=&quot;http://help.nersc.gov/&quot; target=&quot;_blank&quot;&gt;http://help.nersc.gov&lt;/a&gt;)&lt;/em&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;&lt;em&gt;&lt;br/&gt;&lt;/em&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;Sincerely,&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;Doug Jacobsen&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;NERSC Bioinformatics Computing Consultant&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt; &lt;/p&gt;</description>
			<pubDate>Mon, 09 Jul 2012 11:56:08 -0700</pubDate>
			
			
			<guid>http://www.nersc.gov/users/computational-systems/genepool/updates-and-status/open-issues/resolved-projectb-filesystem-outage-july-9-2012/</guid>
		</item>
		
		<item>
			<title>Important notice about using /house</title>
			<link>http://www.nersc.gov/users/computational-systems/genepool/updates-and-status/open-issues/important-notice-about-using-house/</link>
			<description>&lt;h2&gt;Description&lt;/h2&gt;
&lt;div&gt;
&lt;div&gt;There have recently been many NFS hangs on the gpint machines.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;The cause of the gpint hangs has been traced to a defect in the Isilon filesystem software. The hang occurs when a file being written is simultaneously opened for reading on the same host.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt; &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;This most frequently happens when people tail files being written by the same machine.&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt; &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;E.g.:  DO NOT DO THIS:&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;gpint17 $ ./somewritingProcess &amp;gt; outfile &amp;amp;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;gpint17 $ tail -f outfile&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt; &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;This very common (and desirable) operation has been determined to hang the filesystem on the host reading/writing the file.  We are working with the vendor to try to correct this situation, but in the meantime a work-around is to read the file from a different machine than the writer:&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt; &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;E.g.: THIS IS OK&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;gpint17 $ ./somewritingProcess &amp;gt; outfile &amp;amp;&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;gpint17 $ ssh gpintXX&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;gpintXX $ tail -f outfile&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt; &lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;Please note that these problems are not limited to the gpint machines; they can affect any machine connected to /house.  Please write back to me with any questions, and file a ticket at &lt;a href=&quot;http://help.nersc.gov/&quot; target=&quot;_blank&quot;&gt;http://help.nersc.gov&lt;/a&gt; if you run into any trouble.&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt; &lt;/p&gt;</description>
			<pubDate>Fri, 06 Jul 2012 12:31:02 -0700</pubDate>
			
			
			<guid>http://www.nersc.gov/users/computational-systems/genepool/updates-and-status/open-issues/important-notice-about-using-house/</guid>
		</item>
		
		<item>
			<title>Can&#39;t see tickets submitted by other users</title>
			<link>http://www.nersc.gov/users/computational-systems/genepool/updates-and-status/open-issues/can-t-see-tickets-submitted-by-other-users/</link>
			<description>&lt;p&gt; &lt;/p&gt;
&lt;h2&gt;Description&lt;/h2&gt;
&lt;p&gt;Currently only a user who submits a ticket can view and modify it.  This is because the ticket system does not have any concept of groups.&lt;/p&gt;
&lt;h2&gt;Solution&lt;/h2&gt;
&lt;p&gt;We are in the process of importing group information from our accounting database, which will enable users to see tickets submitted by other users in their group.&lt;/p&gt;
&lt;h2&gt;Status&lt;/h2&gt;
&lt;p&gt;In progress.  The change is being tested on the development test system.&lt;/p&gt;</description>
			<pubDate>Thu, 07 Jun 2012 23:03:33 -0700</pubDate>
			
			
			<guid>http://www.nersc.gov/users/computational-systems/genepool/updates-and-status/open-issues/can-t-see-tickets-submitted-by-other-users/</guid>
		</item>
		
		<item>
			<title>Make long queue work fairly across all groups</title>
			<link>http://www.nersc.gov/users/computational-systems/genepool/updates-and-status/open-issues/make-long-queue-work-fairly-across-all-groups/</link>
			<description>&lt;h2&gt;Description&lt;/h2&gt;
&lt;p&gt;Currently a single user or group of users can take up all available slots for the long queue, blocking other users.&lt;/p&gt;
&lt;h2&gt;Solution&lt;/h2&gt;
&lt;p&gt;We plan to limit the number of long job slots a group can use.&lt;/p&gt;
&lt;h2&gt;Status&lt;/h2&gt;
&lt;p&gt;Long queue slots have been limited to 320 per user.&lt;/p&gt;</description>
			<pubDate>Wed, 04 Apr 2012 08:00:00 -0700</pubDate>
			
			
			<guid>http://www.nersc.gov/users/computational-systems/genepool/updates-and-status/open-issues/make-long-queue-work-fairly-across-all-groups/</guid>
		</item>
		

	</channel>
</rss>