Determining if objects are permanently unavailable

You can determine if objects are permanently unavailable by making a request using the TSM administrative console.

About this task

This example is provided for your information only; this guide cannot help you identify all failure conditions that may result in unavailable objects or tape volumes. For information about TSM administration, see TSM Server documentation.

Steps

  1. Identify the object or objects that could not be retrieved by the Archive Node:
    1. From the service laptop, log in to the Admin Node:
      1. Enter the following command: ssh admin@grid_node_IP
      2. Enter the password listed in the Passwords.txt file.
      3. Enter the following command to switch to root: su -
      4. Enter the password listed in the Passwords.txt file.
    2. Go to the directory containing the audit log files: cd /var/local/audit/export

      The current audit log file is named audit.log. It is rotated daily, and each rotated file is named for its date in the format yyyy-mm-dd.txt. Rotated files are compressed after seven days.

    3. Search the relevant audit log file for messages indicating that a retrieval failure occurred. For example, enter: grep ARCE audit.log | less -n
      When a retrieval fails, the ARCE audit message (Archive Object Retrieve End) has a result field of ARUN (archive middleware unavailable) or GERR (general error). The following example line from the audit log shows that the ARCE message terminated with the result ARUN for CBID 498D8A1F681F05B3.
      [AUDT:[CBID(UI64):0x498D8A1F681F05B3][VLID(UI64):20091127][RSLT(FC32):ARUN][AVER(UI32):7][ATIM(UI64):1350613602969243][ATYP(FC32):ARCE][ANID(UI32):13959984][AMID(FC32):ARCI][ATID(UI64):4560349751312520631][ASQN(UI64):62][ASES(UI64):1350580983645305]]

      For more information about interpreting audit messages, see the Audit Message Reference.
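
      When an audit log contains many ARCE messages, a small script can pull out the failed retrievals. The following is a minimal sketch, not part of the StorageGRID Webscale software: it assumes each field has the [NAME(TYPE):value] form shown in the example line above, and the names parse_audit_line and failed_retrieval are hypothetical.

```python
import re

# One audit field, for example [CBID(UI64):0x498D8A1F681F05B3].
# Assumption: every field follows this [NAME(TYPE):value] layout.
FIELD_RE = re.compile(r"\[(\w{4})\((\w+)\):([^\[\]]*)\]")

# Result codes that the step above associates with a failed retrieval.
FAILURE_RESULTS = {"ARUN", "GERR"}


def parse_audit_line(line):
    """Map each four-character field name to its raw string value."""
    return {name: value for name, _type, value in FIELD_RE.findall(line)}


def failed_retrieval(line):
    """Return (CBID, VLID, RSLT) for a failed ARCE message, otherwise None."""
    fields = parse_audit_line(line)
    if fields.get("ATYP") != "ARCE" or fields.get("RSLT") not in FAILURE_RESULTS:
        return None
    cbid = fields.get("CBID", "")
    if cbid.lower().startswith("0x"):
        cbid = cbid[2:]  # drop the hex prefix used in the audit log
    return cbid, fields.get("VLID"), fields["RSLT"]


# The example audit message from the step above.
SAMPLE = (
    "[AUDT:[CBID(UI64):0x498D8A1F681F05B3][VLID(UI64):20091127]"
    "[RSLT(FC32):ARUN][AVER(UI32):7][ATIM(UI64):1350613602969243]"
    "[ATYP(FC32):ARCE][ANID(UI32):13959984][AMID(FC32):ARCI]"
    "[ATID(UI64):4560349751312520631][ASQN(UI64):62]"
    "[ASES(UI64):1350580983645305]]"
)

if __name__ == "__main__":
    print(failed_retrieval(SAMPLE))
```

      The CBID is returned without the 0x prefix so that it can be compared directly with the names that the TSM server reports.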

    4. Record the CBID of each object with a request failure. You might also want to record the following additional information used by the TSM to identify objects saved by the Archive Node:
      • File Space Name

        Go to Grid Topology > Archive Node > ARC > Target > Overview. The file space name is the Archive Node's node ID.

      • High Level Name

        Equivalent to the volume ID assigned to the object by the Archive Node. The volume ID takes the form of a date (20091127), and is recorded as the VLID of the object in archive audit messages.

      • Low Level Name

        Equivalent to the CBID assigned to an object by the StorageGRID Webscale system.

    5. Log out of the command shell: exit
  2. Check the TSM server to see if the objects identified in step 1 are permanently unavailable:
    1. Log in to the administrative console of the TSM server: dsmadmc

      Use the administrative user name and password that are configured for the ARC service, as described in the Installation Guide (and entered in the StorageGRID Webscale system at Grid Topology > Archive Node > ARC > Target > Configuration).

    2. Search the TSM activity log for evidence that the object is permanently unavailable, such as a data integrity error for that object. The following example searches the activity log for the past day for an object with CBID 498D8A1F681F05B3.
      > query actlog begindate=-1 search=498D8A1F681F05B3
      12/21/2008 05:39:15 ANR0548W Retrieve or restore 
      failed for session 9139359 for node DEV-ARC-20 (Bycast ARC) 
      processing file space /19130020 4 for file /20081002/ 
      498D8A1F681F05B3 stored as Archive - data 
      integrity error detected. (SESSION: 9139359)
      >

      Note that depending on the nature of the error, the CBID might not be recorded in the TSM activity log. It might be necessary to search the log for other TSM errors around the time of the request failure.

    3. If an entire tape is permanently unavailable, identify the CBIDs for all objects stored on that volume: query content TSM_Volume_Name
      where TSM_Volume_Name is the TSM name for the unavailable tape. The following is an example of the output for this command:
      > query content TSM_Volume_Name
      Node Name       Type Filespace  FSID Client's Name for File Name
      --------------- ---- ---------- ---- --------------------------------
      DEV-ARC-20      Arch /19130020  216  /20081201/ C1D172940E6C7E12
      DEV-ARC-20      Arch /19130020  216  /20081201/ F1D7FBC2B4B0779E
      

      The “Client’s Name for File” is the Archive Node volume ID (TSM “high level name”) followed by the object’s CBID (TSM “low level name”). That is, /Archive Node volume ID/ CBID or, in the first line of this example, /20081201/ C1D172940E6C7E12

      Recall also that the “Filespace Name” is the node ID of the Archive Node.

      You will need the CBID of each object stored on the volume and the node ID of the Archive Node to cancel the retrieval request in step 3.
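
      If the volume holds many objects, the node ID, volume ID, and CBID of each object can be collected from saved query content output with a short script. This is a minimal sketch that assumes the column layout of the example output above; real TSM output might wrap or be formatted differently, and the name parse_query_content is hypothetical.

```python
import re

# Assumed row layout, taken only from the example output above:
# node name, type, /filespace, FSID, /volume ID/, CBID.
ROW_RE = re.compile(
    r"^(\S+)\s+(\S+)\s+/(\d+)\s+(\d+)\s+/(\d+)/\s*([0-9A-Fa-f]{16})$"
)


def parse_query_content(output):
    """Return (node_id, volume_id, cbid) for each object row in the output."""
    objects = []
    for line in output.splitlines():
        match = ROW_RE.match(line.strip())
        if match:  # headers, separators, and prompts do not match
            _node_name, _type, node_id, _fsid, volume_id, cbid = match.groups()
            objects.append((node_id, volume_id, cbid))
    return objects


# The example output from the step above.
EXAMPLE_OUTPUT = """\
Node Name       Type Filespace  FSID Client's Name for File Name
--------------- ---- ---------- ---- --------------------------------
DEV-ARC-20      Arch /19130020  216  /20081201/ C1D172940E6C7E12
DEV-ARC-20      Arch /19130020  216  /20081201/ F1D7FBC2B4B0779E
"""

if __name__ == "__main__":
    for node_id, volume_id, cbid in parse_query_content(EXAMPLE_OUTPUT):
        print(node_id, volume_id, cbid)
```

      Rows that do not match the assumed layout are skipped silently, so compare the number of parsed objects with the number of rows that the TSM server reports.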

  3. For each object that is permanently unavailable, cancel the retrieval request and inform the StorageGRID Webscale system that the object copy was lost:
    Note: Use the ADE Console with caution. If the console is used improperly, it is possible to interrupt system operations and corrupt data. Enter commands carefully, and only use the commands documented in this procedure.
    1. If you are not already logged in to the Archive Node, log in as follows:
      1. Enter the following command: ssh admin@grid_node_IP
      2. Enter the password listed in the Passwords.txt file.
      3. Enter the following command to switch to root: su -
      4. Enter the password listed in the Passwords.txt file.
    2. Access the ADE console of the ARC service: telnet localhost 1409
    3. Cancel the request for the object: /proc/BRTR/cancel -c CBID

      where CBID is the identifier of the object that cannot be retrieved from the TSM.

      If the only copies of the object are on tape, the “bulk retrieval” request is canceled with the message “1 requests canceled”. If copies of the object exist elsewhere in the system, the object retrieval is processed by a different module, so the response is “0 requests canceled”.

    4. Notify the StorageGRID Webscale system that an object copy has been lost and an additional copy must be made of the indicated object: /proc/CMSI/Object_Lost CBID node_ID

      where CBID is the identifier of the object that cannot be retrieved from the TSM server, and node_ID is the node ID of the Archive Node where the retrieval failed.

      For Archive Nodes, you cannot use a range of CBIDs.

      In most cases, the StorageGRID Webscale system immediately begins to make additional copies of object data to ensure that the system's ILM policy is followed. However, in a StorageGRID Webscale system configured to use an ILM rule with only one active content placement instruction, copies of an object are not made, so a lost object cannot be recovered. In this case, running the Object_Lost command purges the lost object’s metadata from the StorageGRID Webscale system. For more information about ILM rules, see the Administrator Guide.

      When the Object_Lost command completes successfully, it returns the message CLOC_LOST_ANS returned result ‘SUCS’.
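
      Because each object requires its own cancel and Object_Lost command, it can help to prepare the command lines in advance and review them before entering them in the ADE Console. The following minimal sketch only builds the command strings documented in this step; it does not connect to the console. The name ade_commands is hypothetical, and the check that a CBID is 16 hexadecimal digits is an assumption based on the examples in this procedure.

```python
import re

# Assumption: a CBID is written as 16 hexadecimal digits, as in the examples.
CBID_RE = re.compile(r"[0-9A-Fa-f]{16}")


def ade_commands(cbid, node_id):
    """Return the cancel and Object_Lost command lines for one object."""
    if not CBID_RE.fullmatch(cbid):
        # Also rejects ranges, which cannot be used for Archive Nodes.
        raise ValueError(f"not a single CBID: {cbid!r}")
    return [
        f"/proc/BRTR/cancel -c {cbid}",
        f"/proc/CMSI/Object_Lost {cbid} {node_id}",
    ]


if __name__ == "__main__":
    # Hypothetical values borrowed from the examples in this procedure.
    for command in ade_commands("498D8A1F681F05B3", 19130020):
        print(command)
```

      Enter the generated commands one at a time, and confirm the response to each before continuing with the next object.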

    5. Exit the ADE Console: exit
    6. Log out of the Archive Node: exit
  4. Reset the value of Request Failures in the StorageGRID Webscale system:
    1. Go to Archive Node > ARC > Retrieve > Configuration, and select Reset Request Failure Count.
    2. Click Apply Changes.