Re: Bug 15809921 on a pool *without* l2arc?
- From: "Robin P. Blanchard" <
- To: "<
- Subject: Re: Bug 15809921 on a pool *without* l2arc?
- Date: Fri, 26 Jul 2013 13:59:20 +0000
On Jul 24, 2013, at 16:50 PM, Victor Latushkin wrote:
> On 7/24/13 2:43 PM, Robin P. Blanchard wrote:
>> On Jul 24, 2013, at 16:32 PM, Robin P. Blanchard wrote:
>>> On Jul 24, 2013, at 15:50 PM, Victor Latushkin wrote:
>>>> On 7/24/13 1:41 PM, Robin P. Blanchard wrote:
>>>>> We have just discovered (on 11.0 SRU 13.4 + idr357) that we have
>>>>> seemingly tripped 15809921 / SUNBT7191375 on a pool that did *not* have
>>>>> l2arc devices. The pool, as indicated below in its trimmed zpool
>>>>> history, did at one time have a cache device enabled.
>>>>> Being unable to schedule a complete maintenance window back in March
>>>>> (when we first tripped this same bug on a different system, *with*
>>>>> l2arc), the customer elected to remove the l2arc device to mitigate the
>>>>> risk. It would seem, however, that said bug can be tripped *without*
>>>>> l2arc present. Or is it possible that the metadata became corrupted
>>>>> while there was l2arc present and -- despite the device's removal --
>>>>> only now was the bug triggered? The latter scenario is quite
>>>> Since an L2ARC device was present in the config at one time in the past, it
>>>> could have damaged this pool. The damage is only detected when the
>>>> corresponding space map is loaded.
>>> Thanks for the quick reply, Victor.
>>> I'm trying to get the full implications of your answer. Does this mean
>>> that any pool that had had l2arc at any point prior to 11.1 SRU 3.4 could
>>> have this sleeping bug? We were under the impression that the removal of
>>> cache devices would mitigate this bug (as stated back on 19 December 2012
>>> to this list "Solaris 11 System Reboots Continuously Because of a
>>> ZFS-Related Panic (7191375)"). Any help/suggestions you can provide would
>>> be most appreciated.
>> Does this also imply that the damage could be done, the cache removed, the
>> system patched to post 11.1 SRU 3.4 and the bug still strike?
> Damage can strike, not the bug.
More questions: what operations can "tickle" the (dormant) corruption? As we
attempt to formulate a mitigation plan, we need to determine the safest
method by which to isolate pools to individual nodes (so as to avoid impact
on other co-existing pools).
If dormant corruption exists:
01) Does zpool export trigger a panic?
02) Can we safely zfs snap, send and receive without triggering a panic?
03) Is it safe to run zdb -emm on a read/write imported pool? (Will this
only trigger a core dump if spacemap corruption is detected, or a panic?)
04) If a pool (that once had l2arc) has since been successfully scrubbed,
does that indicate the pool is "safe"?
Robin P. Blanchard