Determining if objects are permanently unavailable

You can determine if objects are permanently unavailable by making a request using the TSM administrative console.

About this task

This example is provided for your information only. This procedure cannot help you identify every failure condition that might result in unavailable objects or tape volumes. For information about TSM administration, see the TSM server documentation.

Procedure

  1. Log in to an Admin Node:
    1. Enter the following command: ssh admin@Admin_Node_IP
    2. Enter the password listed in the Passwords.txt file.
  2. Identify the object or objects that could not be retrieved by the Archive Node:
    1. Go to the directory containing the audit log files: cd /var/local/audit/export
      The active audit log file is named audit.log. Once a day, the active audit.log file is saved, and a new audit.log file is started. The name of the saved file indicates when it was saved, in the format yyyy-mm-dd.txt. After a day, the saved file is compressed and renamed, in the format yyyy-mm-dd.txt.gz, which preserves the original date.
    2. Search the relevant audit log file for messages indicating that an archived object could not be retrieved. For example, enter: grep ARCE audit.log | less -n
      When an object cannot be retrieved from an Archive Node, the ARCE audit message (Archive Object Retrieve End) displays ARUN (archive middleware unavailable) or GERR (general error) in the result field. The following example line from the audit log shows that the ARCE message terminated with the result ARUN for CBID 498D8A1F681F05B3.
      [AUDT:[CBID(UI64):0x498D8A1F681F05B3][VLID(UI64):20091127][RSLT(FC32):ARUN][AVER(UI32):7]
      [ATIM(UI64):1350613602969243][ATYP(FC32):ARCE][ANID(UI32):13959984][AMID(FC32):ARCI]
      [ATID(UI64):4560349751312520631]]

      For more information, see the instructions for understanding audit messages.

    3. Record the CBID of each object that had a request failure.
      You might also want to record the following additional information, which the TSM server uses to identify objects saved by the Archive Node:
      • File Space Name: Equivalent to the Archive Node ID. To find the Archive Node ID, select Support. Then, in the Tools section of the menu, select Grid Topology. Then, select Archive Node > ARC > Target > Overview.
      • High Level Name: Equivalent to the volume ID assigned to the object by the Archive Node. The volume ID takes the form of a date (20091127), and is recorded as the VLID of the object in archive audit messages.
      • Low Level Name: Equivalent to the CBID assigned to an object by the StorageGRID system.
    4. Log out of the command shell: exit
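
The search in step 2 can also be sketched in Python. This is a minimal illustration, not part of the product: it assumes each audit message occupies a single line of the log file (the example above is wrapped for display), and the function names parse_audit_line and failed_retrievals are hypothetical. It keeps only ARCE messages whose result is ARUN or GERR and reports the CBID and the VLID, which correspond to the TSM Low Level Name and High Level Name.

```python
import re

# One field looks like [CBID(UI64):0x498D8A1F681F05B3]: a four-character
# code, a type in parentheses, then the value.
FIELD = re.compile(r"\[([A-Z0-9]{4})\(([A-Z0-9]+)\):([^\[\]]*)\]")

# Result codes that the procedure associates with a failed retrieval.
FAILURE_RESULTS = {"ARUN", "GERR"}


def parse_audit_line(line):
    """Return a dict mapping each field code to its raw string value."""
    return {code: value for code, _type, value in FIELD.findall(line)}


def failed_retrievals(lines):
    """Yield the CBID and volume ID of each failed ARCE message."""
    for line in lines:
        fields = parse_audit_line(line)
        if fields.get("ATYP") != "ARCE":
            continue
        if fields.get("RSLT") not in FAILURE_RESULTS:
            continue
        cbid = fields.get("CBID", "")
        if cbid.lower().startswith("0x"):
            cbid = cbid[2:]  # drop the hexadecimal prefix
        yield {
            "cbid": cbid.upper(),             # TSM Low Level Name
            "volume_id": fields.get("VLID"),  # TSM High Level Name
            "result": fields["RSLT"],
        }


# The example message from step 2, joined onto one line.
sample = (
    "[AUDT:[CBID(UI64):0x498D8A1F681F05B3][VLID(UI64):20091127]"
    "[RSLT(FC32):ARUN][AVER(UI32):7][ATIM(UI64):1350613602969243]"
    "[ATYP(FC32):ARCE][ANID(UI32):13959984][AMID(FC32):ARCI]"
    "[ATID(UI64):4560349751312520631]]"
)

for hit in failed_retrievals([sample]):
    print(hit)
```

To scan a real log, pass an open file object instead of the sample list. Saved logs with the .txt.gz extension would need to be opened with Python's gzip module in text mode.
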
  3. Check the TSM server to see if the objects identified in step 2 are permanently unavailable:
    1. Log in to the administrative console of the TSM server: dsmadmc
      Use the administrative user name and password that are configured for the ARC service. These are the credentials that were entered in the Grid Manager. (To see the user name, select Support. Then, in the Tools section of the menu, select Grid Topology. Then, select Archive Node > ARC > Target > Configuration.)
    2. Determine if the object is permanently unavailable.
      For example, you might search the TSM activity log for a data integrity error for that object. The following example shows a search of the activity log for the past day for an object with CBID 498D8A1F681F05B3.
      > query actlog begindate=-1 search=498D8A1F681F05B3
      12/21/2008 05:39:15 ANR0548W Retrieve or restore 
      failed for session 9139359 for node DEV-ARC-20 (Bycast ARC) 
      processing file space /19130020 4 for file /20081002/ 
      498D8A1F681F05B3 stored as Archive - data 
      integrity error detected. (SESSION: 9139359)
      >

      Note that depending on the nature of the error, the CBID might not be recorded in the TSM activity log. You might need to search the log for other TSM errors around the time of the request failure.

    3. If an entire tape is permanently unavailable, identify the CBIDs for all objects stored on that volume: query content TSM_Volume_Name
      where TSM_Volume_Name is the TSM name for the unavailable tape. The following is an example of the output for this command:
      > query content TSM_Volume_Name
      Node Name       Type Filespace  FSID Client's Name for File Name
      --------------- ---- ---------- ---- ----------------------------
      DEV-ARC-20      Arch /19130020  216  /20081201/ C1D172940E6C7E12
      DEV-ARC-20      Arch /19130020  216  /20081201/ F1D7FBC2B4B0779E

      The Client’s Name for File Name is the same as the Archive Node volume ID (or TSM high level name) followed by the object’s CBID (or TSM low level name). That is, the Client’s Name for File Name takes the form /Archive Node volume ID /CBID. In the first line of the example output, the Client’s Name for File Name is /20081201/ C1D172940E6C7E12.

      Recall also that the Filespace is the node ID of the Archive Node.

      You will need the CBID of each object stored on the volume and the node ID of the Archive Node to cancel the retrieval request.
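
The mapping described above, from each row of the query content output to an Archive Node ID, volume ID, and CBID, can be sketched as follows. This is an illustration only: it assumes the column layout shown in the example output, and the function name parse_query_content is hypothetical. Real TSM output might wrap or format rows differently, so treat the pattern as a starting point.

```python
import re

# Matches one data row of the example output, for instance:
# DEV-ARC-20      Arch /19130020  216  /20081201/ C1D172940E6C7E12
ROW = re.compile(
    r"^\s*(?P<node>\S+)\s+Arch\s+/(?P<filespace>\S+)\s+(?P<fsid>\d+)\s+"
    r"/(?P<volume_id>[^/\s]+)/\s*(?P<cbid>[0-9A-Fa-f]{1,16})\s*$"
)


def parse_query_content(text):
    """Return one dict per data row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


sample_output = """\
Node Name       Type Filespace  FSID Client's Name for File Name
--------------- ---- ---------- ---- ----------------------------
DEV-ARC-20      Arch /19130020  216  /20081201/ C1D172940E6C7E12
DEV-ARC-20      Arch /19130020  216  /20081201/ F1D7FBC2B4B0779E
"""

for row in parse_query_content(sample_output):
    # filespace is the Archive Node ID; cbid identifies the object.
    print(row["filespace"], row["volume_id"], row["cbid"])
```
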

  4. For each object that is permanently unavailable, cancel the retrieval request and issue a command to inform the StorageGRID system that the object copy was lost:
    Attention: Use the ADE Console with caution. If the console is used improperly, it is possible to interrupt system operations and corrupt data. Enter commands carefully, and only use the commands documented in this procedure.
    1. If you are not already logged in to the Archive Node, log in as follows:
      1. Enter the following command: ssh admin@grid_node_IP
      2. Enter the password listed in the Passwords.txt file.
      3. Enter the following command to switch to root: su -
      4. Enter the password listed in the Passwords.txt file.
    2. Access the ADE console of the ARC service: telnet localhost 1409
    3. Cancel the request for the object: /proc/BRTR/cancel -c CBID

      where CBID is the identifier of the object that cannot be retrieved from the TSM server.

      If the only copies of the object are on tape, the bulk retrieval request is canceled and the console returns the message "1 requests canceled". If copies of the object exist elsewhere in the system, the object retrieval is processed by a different module, so the console returns the message "0 requests canceled".

    4. Issue a command to notify the StorageGRID system that an object copy has been lost and that an additional copy must be made: /proc/CMSI/Object_Lost CBID node_ID

      where CBID is the identifier of the object that cannot be retrieved from the TSM server, and node_ID is the node ID of the Archive Node where the retrieval failed.

      You must enter a separate command for each lost object copy. Entering a range of CBIDs is not supported.

      In most cases, the StorageGRID system immediately begins to make additional copies of object data to ensure that the system's ILM policy is followed.

      However, if the ILM rule for the object specified that only one copy be made and that copy has now been lost, the object cannot be recovered. In this case, running the Object_Lost command purges the lost object’s metadata from the StorageGRID system.

      When the Object_Lost command completes successfully, the following message is returned:
      CLOC_LOST_ANS returned result ‘SUCS’
      Note: The /proc/CMSI/Object_Lost command is only valid for lost objects that are stored on Archive Nodes.
    5. Exit the ADE Console: exit
    6. Log out of the Archive Node: exit
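
Because step 4 requires one pair of commands per lost object copy, it can help to prepare the command text in advance and review it before typing anything into the ADE console. The sketch below only builds and prints the two command strings documented above; it does not connect to the console or run anything. The function name ade_commands is hypothetical. The sketch also assumes that CBIDs are entered as plain hexadecimal without a 0x prefix and that node IDs are numeric, as in the examples in this procedure.

```python
import re

# Assumptions for illustration: a CBID is 1 to 16 hexadecimal digits and
# a node ID is all digits, matching the examples in this procedure.
CBID_RE = re.compile(r"[0-9A-Fa-f]{1,16}")
NODE_ID_RE = re.compile(r"[0-9]+")


def ade_commands(cbid, node_id):
    """Return the two ADE console commands for one lost object copy."""
    if not CBID_RE.fullmatch(cbid):
        # Rejects ranges and lists, which the procedure does not support.
        raise ValueError(f"not a single hexadecimal CBID: {cbid!r}")
    if not NODE_ID_RE.fullmatch(str(node_id)):
        raise ValueError(f"not a numeric node ID: {node_id!r}")
    return [
        f"/proc/BRTR/cancel -c {cbid}",
        f"/proc/CMSI/Object_Lost {cbid} {node_id}",
    ]


# CBIDs and Archive Node ID taken from the query content example in step 3.
lost = [("C1D172940E6C7E12", "19130020"), ("F1D7FBC2B4B0779E", "19130020")]
for cbid, node_id in lost:
    for command in ade_commands(cbid, node_id):
        print(command)
```

Printing the commands rather than sending them keeps a person in control of every entry, which is consistent with the caution in step 4 about using the ADE console.
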
  5. Reset the value of Request Failures in the StorageGRID system:
    1. Go to Archive Node > ARC > Retrieve > Configuration, and select Reset Request Failure Count.
    2. Click Apply Changes.