<?xml version='1.0' encoding='UTF-8'?>

<!DOCTYPE rfc [
 <!ENTITY nbsp    "&#160;">
 <!ENTITY zwsp   "&#8203;">
 <!ENTITY nbhy   "&#8209;">
 <!ENTITY wj     "&#8288;">
]>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude" category='std' docName='draft-ietf-nfsv4-layrec-04' number="9737" ipr='trust200902' obsoletes=''
 scripts='Common,Latin' updates="" sortRefs='true' submissionType='IETF' symRefs='true' tocDepth='3' tocInclude='true' consensus='true' version='3' xml:lang='en'>

  <front>

<!--[rfced] Title and Short Title

a) May we update the document title for conciseness by
removing "of" and rephrasing the text to reflect that
the errors are reported "in NFSv4" as shown below?

b) May we update the short title that spans the header
of the PDF file to more closely match the document title
as shown below?

c) We note that "LAYOUTRETURN" is mentioned in the title but
not in the Abstract or Introduction. Should "LAYOUTRETURN"
be included to those sections for consistency with the title?
If so, please provide the desired text.

Document Title
Original:
   Reporting of Errors via LAYOUTRETURN in NFSv4.2

Perhaps:
   Reporting Errors in NFSv4.2 via LAYOUTRETURN

...
Short Title
Original:
   LAYOUT_RECOVERY

Perhaps:
   Reporting Errors via LAYOUTRETURN
-->

  <title abbrev='LAYOUT_RECOVERY'>Reporting of Errors via LAYOUTRETURN in
  NFSv4.2</title>
  <seriesInfo name='RFC' value='9737'/>
  <author fullname='Thomas Haynes' initials='T.' surname='Haynes'>
    <organization abbrev='Hammerspace'>Hammerspace</organization>
    <address>
      <email>loghyr@gmail.com</email>
    </address>
  </author>
  <author fullname='Trond Myklebust' initials='T.' surname='Myklebust'>
    <organization abbrev='Hammerspace'>Hammerspace</organization>
    <address>
      <email>trondmy@hammerspace.com</email>
    </address>
  </author>
  <date year='2025' month='February'/>
  <area>WIT</area>
  <workgroup>nfsv4</workgroup>
  <keyword>NFSv4</keyword>

  <abstract>
    <t>

<!--[rfced] We note that "MDS" and "DS" are expanded as "metadata
server" and "data server", respectively, in RFC 8435. May we
expand these terms in the Abstract as shown below (option A) to
match RFC 8435?

After these terms are expanded, would you like to use the abbreviations?
There are 37 instances of "metadata server" and 2 instances of
"data server". If not, and it is desired to have the term written out,
should "MDS" and "DS" simply be removed since they are not used elsewhere
in the document (option B)? Please let us know your preference.

Original:
   The Parallel Network File System (pNFS) allows for a file's metadata
   (MDS) and data (DS) to be on different servers.  When the metadata
   server is restarted, the client can still modify the data file
   component.  During the recovery phase of startup, the metadata server
   and the data servers work together to recover state (which files are
   open, last modification time, size, etc.).

Perhaps A:
   The Parallel Network File System (pNFS) allows for a file's metadata
   and data to be on different servers (i.e., the metadata server (MDS)
   and the data server (DS)).

or

Perhaps B:
   The Parallel Network File System (pNFS) allows for a file's metadata
   and data to be on different servers.
-->
      The Parallel Network File System (pNFS) allows
      for a file's metadata (MDS) and data (DS) to be on different
      servers. When the metadata server is restarted, the client
      can still modify the data file component.

<!--[rfced] Please clarify "which files are open, last modification
time, size, etc.)". Are these files used by the servers during
the recovery phase?

Original:
   During the recovery phase of startup, the metadata server
   and the data servers work together to recover state
   (which files are open, last modification time, size, etc.).

Perhaps:
   During the recovery phase of startup, the metadata server
   and the data servers work together to recover state
   (the files used are "open", "last modification time",
   "size", etc.).
-->
      During the
      recovery phase of startup, the metadata server and the
      data servers work together to recover state (which files
      are open, last modification time, size, etc.). If the client
      has not encountered errors with the data files, then the state can be
      recovered and the resilvering of the data files can be avoided. With any
      errors, there is no means by which the client can report errors to the
      metadata server. As such, the metadata server has to
      assume that a file needs resilvering. This document presents an
      extension to RFC 8435 to allow the client to update the metadata
      and avoid the resilvering.
    </t>
  </abstract>

  <note removeInRFC='true'>
    <t>
      Discussion of this draft takes place
      on the NFSv4 working group mailing list (nfsv4@ietf.org),
      which is archived at
      <eref target='https://mailarchive.ietf.org/arch/browse/nfsv4/'/>.
      Working Group information can be found at
      <eref target='https://datatracker.ietf.org/wg/nfsv4/about/'/>.
    </t>
  </note>
</front>

<middle>

<section anchor='sec_intro' numbered='true' removeInRFC='false'  toc='default'>
  <name>Introduction</name>
  <t>
    In the Network File System version 4 (NFSv4) with a Parallel NFS
    (pNFS) Flexible File Layout <xref target='RFC8435' format='default'
    sectionFormat='of'/> server, during recovery after a restart,
    there is no mechanism for the client
    to inform the metadata server about an error that occurred during a
    WRITE operation (see <xref section="18.32" target='RFC8881' format='default'
    sectionFormat='of'/>) to the data servers in the period of
    the outage.
  </t>

  <t>
    Using the process detailed in <xref target='RFC8178' format='default'
    sectionFormat='of'/>, the revisions in this document become an
    extension of NFSv4.2 <xref target='RFC7862' format='default'
    sectionFormat='of'/>. They are built on top of the External Data
    Representation (XDR) <xref target='RFC4506' format='default'
    sectionFormat='of'/> generated from <xref target='RFC7863'
    format='default' sectionFormat='of'/>.
  </t>

  <section anchor='sec_defs' numbered='true' removeInRFC='false'  toc='default'>
    <name>Definitions</name>
    <t>
      See <xref section="1.1" target='RFC8435' format='default'
      sectionFormat='of'/> for a set of definitions.
    </t>
  </section>
  <section numbered='true' removeInRFC='false'  toc='default'>
    <name>Requirements Language</name>
        <t>
    The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL
    NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
    "<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as
    described in BCP&nbsp;14 <xref target="RFC2119"/> <xref target="RFC8174"/>
    when, and only when, they appear in all capitals, as shown here.
        </t>
  </section>
</section>

<section anchor='layout_state_recovery' numbered='true' removeInRFC='false'  toc='default'>
  <name>Layout State Recovery</name>
  <t>
    When a metadata server restarts, clients are provided a grace recovery period where
    they are allowed to recover any state that
    they had established. With open files, the client can send an OPEN operation (see
    <xref section="18.16" target='RFC8881' format='default' sectionFormat='of'/>)
    with a claim type of CLAIM_PREVIOUS (see <xref section="9.11" target='RFC8881' format='default' sectionFormat='of'/>). The client
    uses the RECLAIM_COMPLETE operation (see
    <xref section="18.51" target='RFC8881' format='default' sectionFormat='of'/>)
    to notify the metadata server that it is done reclaiming state.
  </t>
  <t>
    The NFSv4 Flexible File Layout Type allows for the client to mirror files
    (see <xref section="8" target='RFC8435' format='default' sectionFormat='of'/>).
    With client-side mirroring, it is important for the client to inform
    the metadata server of any I/O errors encountered with one of the mirrors.
    This is the only way for the metadata server to determine if one or more
    of the mirrors are corrupt and then repair the mirrors via resilvering
    (see <xref section="1.1" target='RFC8435' format='default' sectionFormat='of'/>).
    The client can use LAYOUTRETURN (see
    <xref section="18.44" target='RFC8881' format='default' sectionFormat='of'/>)
    and the ff_ioerr4 structure (see <xref section="9.1.1" target='RFC8435' format='default' sectionFormat='of'/>) to inform
    the metadata server of I/O errors.
  </t>
  <t>
    A problem arises when the metadata server restarts and the client has
    errors it needs to report but cannot do so. <xref section="12.7.4" target='RFC8881' format='default' sectionFormat='of'/> requires
    that the client <bcp14>MUST</bcp14> stop using layouts. While the
    intent there is that the client <bcp14>MUST</bcp14> stop doing I/O
    to the storage devices, it is also true that the layout stateids
    are no longer valid. The LAYOUTRETURN needs
    a layout stateid to proceed, and the client cannot get a layout
    during grace recovery (see <xref section="12.7.4" target='RFC8881' format='default' sectionFormat='of'/>) to
    recover layout state. As such, clients have no choice but to not recover
    files with I/O errors. In turn, the metadata server <bcp14>MUST</bcp14>
    assume that the mirrors are inconsistent and pick one for resilvering.
    It is a <bcp14>MUST</bcp14> because even if the metadata server can
    determine that the client did modify data during the outage, it <bcp14>MUST NOT</bcp14>
    assume those modifications were consistent.
  </t>
  <t>
    To fix this issue, the metadata server <bcp14>MUST</bcp14> accept
    the anonymous stateid of all zeros (see <xref section="8.2.3" target='RFC8881' format='default' sectionFormat='of'/>) for the lrf_stateid in LAYOUTRETURN (see <xref section="18.44.1" target='RFC8881' format='default' sectionFormat='of'/>).
    The client can use this anonymous stateid to
    inform the metadata server of errors
    encountered. The metadata server can then
    accurately resilver the file by picking the mirror(s) that does not
    have any associated errors.
  </t>
  <t>
    During the grace period, if the client sends an lrf_stateid
    in the LAYOUTRETURN with any value other than the
    anonymous stateid of all zeros, then the metadata server
    <bcp14>MUST</bcp14> now respond with an error of
    NFS4ERR_GRACE (see <xref section="15.1.9.2" target='RFC8881' format='default' sectionFormat='of'/>).
    After the grace period, if the client sends an lrf_stateid
    in the LAYOUTRETURN with a value of the anonymous stateid of all zeros, then the metadata server
    <bcp14>MUST</bcp14> now respond with an error of
    NFS4ERR_NO_GRACE (see <xref section="15.1.9.3" target='RFC8881' format='default' sectionFormat='of'/>).
  </t>
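  <t>
    As a non-normative illustration, the grace-period rules above can be
    sketched as a single check. This is a minimal sketch, assuming a stateid
    is modeled as a (seqid, other) pair with a 12-byte "other" field; the
    names Status, ANONYMOUS_STATEID, and check_layoutreturn_stateid are
    hypothetical and are not taken from any implementation. Only the error
    names come from RFC 8881.
  </t>
  <sourcecode type="python"><![CDATA[
```python
from enum import Enum

# Assumption: a stateid is modeled as (seqid, other); the anonymous
# stateid has a seqid of zero and an "other" field of twelve zero bytes.
ANONYMOUS_STATEID = (0, bytes(12))


class Status(Enum):
    NFS4_OK = "NFS4_OK"
    NFS4ERR_GRACE = "NFS4ERR_GRACE"
    NFS4ERR_NO_GRACE = "NFS4ERR_NO_GRACE"


def check_layoutreturn_stateid(lrf_stateid, in_grace):
    """Apply only the grace-period rules to a LAYOUTRETURN stateid.

    NFS4_OK here means "passes this check"; a real server would go on
    to validate a regular layout stateid in the usual way.
    """
    is_anonymous = lrf_stateid == ANONYMOUS_STATEID
    if in_grace and not is_anonymous:
        # Regular layout stateids are not usable during the grace period.
        return Status.NFS4ERR_GRACE
    if not in_grace and is_anonymous:
        # The anonymous stateid is only accepted during the grace period.
        return Status.NFS4ERR_NO_GRACE
    return Status.NFS4_OK
```
]]></sourcecode>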
  <t>
<!--[rfced] We are having trouble parsing this sentence. Are
words missing after "when a lrf_stateid with the value of the
anonymous stateid of all zeros", or should "when a lrf_stateid"
perhaps be "with an lrf_stateid"? Please review and let us
know how we may clarify.

Original:
   Also, when the metadata server builds the reply to the LAYOUTRETURN
   when a lrf_stateid with the value of the anonymous stateid of all
   zeros it MUST NOT bump the seqid of the lorr_stateid.

Perhaps:
   Also, when the metadata server builds the reply to the LAYOUTRETURN
   with an lrf_stateid with an anonymous stateid value of all
   zeros, it MUST NOT bump the seqid of the lorr_stateid.
-->

    Also, when the metadata server builds the reply to a LAYOUTRETURN
    whose lrf_stateid is the anonymous stateid of all zeros,
    it <bcp14>MUST NOT</bcp14> bump the seqid of the lorr_stateid.
  </t>
  <t>
    If the metadata server detects that the layout being returned in
    the LAYOUTRETURN does not match the current mirror instances found
    for the file, then it <bcp14>MUST</bcp14> ignore the LAYOUTRETURN and resilver the
    file in question.
  </t>
  <t>
    The metadata server <bcp14>MUST</bcp14> resilver any files
    that are neither explicitly recovered with a CLAIM_PREVIOUS nor
    have a reported error via a LAYOUTRETURN.
    The client has most likely restarted and lost any state.
  </t>
  <section anchor='sec_when_to_resilver' numbered='true' removeInRFC='false'  toc='default'>
    <name>When to Resilver</name>
    <t>
      A write intent occurs when a client opens a file and gets
      a LAYOUTIOMODE4_RW from the metadata server. The metadata server
      <bcp14>MUST</bcp14> track outstanding write intents, and when it
      restarts, it <bcp14>MUST</bcp14> track recovery of those
      write intents.
      The method that the metadata server uses to track write intents is
      implementation specific, i.e., outside of the scope of this document.
    </t>
    <t>
      The decision to resilver a file depends on how the client recovers the
      file before the grace period ends. If the client reclaims the file
      and reports no errors, the metadata server <bcp14>MUST NOT</bcp14>
      resilver the file. If the client reports an error on the file,
      then the file <bcp14>MUST</bcp14> be resilvered. If the client
      does not reclaim or report an error before the grace period ends,
      then under the old behavior, the metadata server <bcp14>MUST</bcp14>
      resilver the file.
    </t>
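    <t>
      The three cases above can be summarized, non-normatively, as a small
      decision function evaluated when the grace period ends. This is a
      sketch under stated assumptions: the function name and its two boolean
      parameters are hypothetical, and how a server learns these facts is
      implementation specific.
    </t>
    <sourcecode type="python"><![CDATA[
```python
def must_resilver(reclaimed, reported_error):
    """Decide, at the end of the grace period, whether to resilver a file.

    reclaimed:      the client reclaimed the file (e.g., CLAIM_PREVIOUS).
    reported_error: the client reported an I/O error via LAYOUTRETURN.
    Both parameter names are assumptions made for this illustration.
    """
    if reported_error:
        # A reported error always requires resilvering.
        return True
    if reclaimed:
        # Reclaimed with no errors reported: resilvering is not allowed.
        return False
    # Neither reclaimed nor reported: fall back to the old behavior.
    return True
```
]]></sourcecode>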
    <t>
      The resilvering process is broadly to:
    </t>
    <ol>
      <li>
        fence the file (see
        <xref section="2.2" target='RFC8435' format='default' sectionFormat='of'/>),
      </li>
      <li>
        record the need to resilver,
      </li>
      <li>
        release the write intent, and
      </li>
      <li>
        once there are no write intents on the file, start the resilvering process.
      </li>
    </ol>
    <t>
      The metadata server <bcp14>MUST NOT</bcp14> resilver a file if there
      are clients with outstanding write intents, i.e., multiple clients
      might have the file open with write intents. As the metadata server <bcp14>MUST</bcp14>
      track write intents, it <bcp14>MUST</bcp14> also track the need to
      resilver, i.e., if the metadata server restarts during the grace
      period, it <bcp14>MUST</bcp14> restart the file recovery if it
      replays the write intent, or else it <bcp14>MUST</bcp14> start
      the resilvering if it replays the resilvering intent.
    </t>
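    <t>
      The four resilvering steps and the write-intent constraint can be
      illustrated with a small in-memory model. This is a non-normative
      sketch: the class FileRecoveryState and its methods are hypothetical,
      and a real metadata server would need to persist this state so that it
      survives a further restart, which the sketch does not attempt.
    </t>
    <sourcecode type="python"><![CDATA[
```python
class FileRecoveryState:
    """Hypothetical per-file bookkeeping for write intents and resilvering."""

    def __init__(self, clients_with_write_intent):
        self.write_intents = set(clients_with_write_intent)
        self.fenced = False
        self.needs_resilver = False
        self.resilvering = False

    def report_error(self, client):
        """Handle an I/O error reported by a client for this file."""
        self.fenced = True          # step 1: fence the file
        self.needs_resilver = True  # step 2: record the need to resilver
        self.release(client)        # step 3: release the write intent

    def release(self, client):
        """Drop one client's write intent, then try to start resilvering."""
        self.write_intents.discard(client)
        # Step 4: resilvering starts only once no write intents remain.
        if self.needs_resilver and not self.write_intents:
            self.resilvering = True
```
]]></sourcecode>
    <t>
      In this model, an error reported by one client does not start the
      resilvering while another client still holds a write intent; the
      resilvering begins when the last intent is released.
    </t>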

    <t>
      Whether the metadata server prevents all I/O to
      the file until the resilvering is done, forces all I/O to go through
      the metadata server, or allows a proxy server to update the new data
      file as it is being resilvered is all an implementation choice. The
      constraint is that the metadata server is responsible for the
      reconstruction of the data file and for the consistency of the
      mirrors.
    </t>

    <t>
      If the metadata server does allow the client access to the
      file during the resilvering, then the client <bcp14>MUST</bcp14> have
      the same layout (set of mirror instances) after the metadata server
      restarts as before. One way that such a resilvering can occur is for a proxy
      server to be inserted into the layout. That server will be copying
      a good mirror instance to a new instance. As it gets I/O via the
      layout, it will be responsible for updating the copy it is performing.
      This requirement is that the proxy server <bcp14>MUST</bcp14>
      stay in the layout until the grace period is finished.
    </t>
  </section>

  <section anchor='sec_vers_mismatch' numbered='true' removeInRFC='false'  toc='default'>
    <name>Version Mismatch Considerations</name>
    <t>
      The metadata server has no expectations for the client to use this
      new functionality. Therefore, if the client does not use it, the
      metadata server will function normally.
    </t>
    <t>
      If the client does use the new functionality and the metadata server does
      not support it, then the metadata server <bcp14>MUST</bcp14> reply with
      an NFS4ERR_BAD_STATEID to the LAYOUTRETURN. If the client detects
      an NFS4ERR_BAD_STATEID error in this scenario, it should fall back to
      the old behavior of not reporting errors.
    </t>
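    <t>
      The client-side fallback described above can be sketched as follows.
      This is a non-normative illustration: send_layoutreturn is a
      hypothetical callable standing in for whatever mechanism a client uses
      to issue LAYOUTRETURN, and status values are modeled as plain strings.
    </t>
    <sourcecode type="python"><![CDATA[
```python
# Assumption: a stateid is modeled as (seqid, other), all zeros here.
ANONYMOUS_STATEID = (0, bytes(12))


def report_errors_after_restart(send_layoutreturn, errors):
    """Try to report I/O errors during the grace period.

    send_layoutreturn is a hypothetical callable taking (stateid, errors)
    and returning an NFSv4 status name as a string.
    Returns True if the errors were reported, False otherwise.
    """
    status = send_layoutreturn(ANONYMOUS_STATEID, errors)
    if status == "NFS4ERR_BAD_STATEID":
        # The server does not support the extension: fall back to the
        # old behavior of not reporting errors.
        return False
    return status == "NFS4_OK"
```
]]></sourcecode>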
  </section>
</section>

<section anchor='sec_security' numbered='true' removeInRFC='false'  toc='default'>
  <name>Security Considerations</name>
  <t>
    There are no new security considerations beyond those in
    <xref target='RFC7862' format='default' sectionFormat='of'/>.
  </t>
</section>

<section anchor='sec_iana' numbered='true' removeInRFC='false'  toc='default'>
  <name>IANA Considerations</name>
  <t>
    This document has no IANA actions.
  </t>
</section>

</middle>

<back>

<references>
  <name>References</name>

  <references>
  <name>Normative References</name>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4506.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7862.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7863.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8178.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8435.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8881.xml"/>

  </references>
</references>

<section numbered='false' toc='default'>
      <name>Acknowledgments</name>
      <t><contact fullname="Tigran Mkrtchyan"/>, <contact fullname="Jeff
      Layton"/>, and <contact fullname="Rick Macklem"/> provided reviews of
      the document.</t>
    </section>

<!-- [rfced] We note that the following terms appear as lowercase in
RFCs 8435 and 8881. Should these terms be made lowercase to match
use in those RFCs?

  Flexible File Layout
  Flexible File Layout Type
-->

<!-- [rfced] Please review the "Inclusive Language" portion of the online
Style Guide <https://www.rfc-editor.org/styleguide/part2/#inclusive_language>
and let us know if any changes are needed.  Updates of this nature typically
result in more precise language, which is helpful for readers.

Note that our script did not flag any words in particular, but this should
still be reviewed as a best practice.
-->

</back>
</rfc>