Troubleshoot issues on your Azure Stack Edge Pro GPU device

APPLIES TO: Azure Stack Edge Pro - GPU, Azure Stack Edge Pro R, Azure Stack Edge Mini R

This article describes how to troubleshoot issues on your Azure Stack Edge Pro GPU device.

Run diagnostics

To diagnose and troubleshoot device errors, run the diagnostic tests. Follow these steps in the local web UI of your device.

  1. In the local web UI, go to Troubleshooting > Diagnostic tests. Select the test you want to run and select Run test. The test diagnoses any possible issues with your network, device, web proxy, time, or cloud settings. You are notified that the device is running tests.

  2. After the tests have completed, the results are displayed.

    If a test fails, then a URL for recommended action is presented. Select the URL to view the recommended action.

Collect Support package

A log package is composed of all the relevant logs that can help Microsoft Support troubleshoot any device issues. You can generate a log package via the local web UI.

Do the following steps to collect a Support package.

  1. In the local web UI, go to Troubleshooting > Support. Select Create support package. The system starts collecting the support package. The package collection may take several minutes.

  2. After the Support package is created, select Download Support package. A zipped package is downloaded to the path you chose. You can unzip the package and view the system log files.
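As a rough illustration of inspecting a downloaded package, the following Python sketch lists which of the log files named later in this article (pfirewall.log and HWIntrusion.txt) are present in the root of a zip archive. The function name `find_root_logs` and the in-memory demo archive are hypothetical; the real package contains many more files, and its exact layout is not described here.

```python
import io
import zipfile

# File names taken from this article; everything else in the package is ignored.
EXPECTED_ROOT_LOGS = ("pfirewall.log", "HWIntrusion.txt")


def find_root_logs(package, expected=EXPECTED_ROOT_LOGS):
    """Return {file name: True/False} for logs expected in the archive root.

    `package` can be a path to the downloaded zip or an open binary file object.
    """
    with zipfile.ZipFile(package) as archive:
        # Entries without a "/" live in the root folder of the archive.
        root_files = {name for name in archive.namelist() if "/" not in name}
    return {name: name in root_files for name in expected}


# Demo on an in-memory archive that stands in for a downloaded package.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("pfirewall.log", "#Version: 1.5\n")
    archive.writestr("logs/other.txt", "placeholder\n")
demo.seek(0)
print(find_root_logs(demo))
```

For a real package, pass the path of the downloaded zip file instead of the demo object.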

Gather advanced security logs

The advanced security logs can be software or hardware intrusion logs for your Azure Stack Edge Pro device.

Software intrusion logs

The software intrusion or the default firewall logs are collected for inbound and outbound traffic.

  • When the device is imaged at the factory, the default firewall logging is enabled. These logs are bundled in the support package by default when you create a support package via the local UI or via the Windows PowerShell interface of the device.

  • If only the firewall logs are needed in the support package to review any software (network) intrusion in the device, use the -Include FirewallLog option when creating the support package.

  • If no specific include option is provided, the firewall log is included in the support package by default.

  • In the support package, the firewall log is pfirewall.log and sits in the root folder. Here is an example of the software intrusion log for the Azure Stack Edge Pro device.

    #Version: 1.5
    #Software: Microsoft Windows Firewall
    #Time Format: Local
    #Fields: date time action protocol src-ip dst-ip src-port dst-port size tcpflags tcpsyn tcpack tcpwin icmptype icmpcode info path
    
    2019-11-06 12:35:19 DROP UDP 5.5.3.197 224.0.0.251 5353 5353 59 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe01:9e88 ff02::fb 5353 5353 89 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe01:9e88 ff02::fb 5353 5353 89 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe01:9e88 ff02::fb 5353 5353 89 - - - - - - 
    2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe01:9d87 ff02::fb 5353 5353 79 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP 5.5.3.193 224.0.0.251 5353 5353 59 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe08:20d5 ff02::fb 5353 5353 89 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe08:20d5 ff02::fb 5353 5353 89 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe01:9e8b ff02::fb 5353 5353 89 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe01:9e8b ff02::fb 5353 5353 89 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP 5.5.3.33 224.0.0.251 5353 5353 59 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe01:9e8b ff02::fb 5353 5353 89 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe01:9e8a ff02::fb 5353 5353 89 - - - - - - - RECEIVE
    2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe01:9e8b ff02::fb 5353 5353 89 - - - - - - - RECEIVE
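A log in this format can be summarized with a few lines of Python. The sketch below is one possible approach, not a supported tool: it reads the column names from the #Fields header and counts dropped packets per source address. The function name `parse_firewall_log` is hypothetical, and the sample input is copied from the example above.

```python
from collections import Counter


def parse_firewall_log(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields header."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            # zip() stops at the shorter sequence, so truncated lines still parse.
            yield dict(zip(fields, line.split()))


sample = """\
#Version: 1.5
#Software: Microsoft Windows Firewall
#Time Format: Local
#Fields: date time action protocol src-ip dst-ip src-port dst-port size tcpflags tcpsyn tcpack tcpwin icmptype icmpcode info path

2019-11-06 12:35:19 DROP UDP 5.5.3.197 224.0.0.251 5353 5353 59 - - - - - - - RECEIVE
2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe01:9e88 ff02::fb 5353 5353 89 - - - - - - - RECEIVE
2019-11-06 12:35:19 DROP UDP fe80::3680:dff:fe01:9e88 ff02::fb 5353 5353 89 - - - - - - - RECEIVE
""".splitlines()

entries = list(parse_firewall_log(sample))
drops_by_source = Counter(e["src-ip"] for e in entries if e["action"] == "DROP")
for source, count in drops_by_source.most_common():
    print(source, count)
```

To analyze a real log, pass an open file object for pfirewall.log instead of the sample lines.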
    

Hardware intrusion logs

To detect any hardware intrusion into the device, all chassis events, such as the opening or closing of the chassis, are currently logged.

  • The system event log from the device is read using the racadm cmdlet. These events are then filtered for chassis-related events into a HWIntrusion.txt file.

  • To get only the hardware intrusion log in the support package, use the -Include HWSelLog option when you create the support package.

  • If no specific include option is provided, the hardware intrusion log is included as a default in the support package.

  • In the support package, the hardware intrusion log is HWIntrusion.txt and sits in the root folder. Here is an example of the hardware intrusion log for the Azure Stack Edge Pro device.

    09/04/2019 15:51:23 system Critical The chassis is open while the power is off.
    09/04/2019 15:51:30 system Ok The chassis is closed while the power is off.
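The two sample entries suggest a simple layout: date, time, source, severity, and a free-text message. Assuming that layout holds for every line (the article does not specify the format beyond this example), a short Python sketch can pull out the critical events. The function name `parse_intrusion_line` is hypothetical.

```python
def parse_intrusion_line(line):
    """Split one HWIntrusion.txt line into its assumed five parts."""
    date, time, source, severity, message = line.split(maxsplit=4)
    return {
        "date": date,  # kept as text; the day/month order is not stated in the article
        "time": time,
        "source": source,
        "severity": severity,
        "message": message,
    }


sample = [
    "09/04/2019 15:51:23 system Critical The chassis is open while the power is off.",
    "09/04/2019 15:51:30 system Ok The chassis is closed while the power is off.",
]

events = [parse_intrusion_line(line) for line in sample]
critical = [e for e in events if e["severity"] == "Critical"]
for event in critical:
    print(event["date"], event["time"], event["message"])
```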
    

Use logs to troubleshoot

Any errors experienced during the upload and refresh processes are included in the respective error files.

  1. To view the error files, go to your share and select the share to view the contents.

  2. Select the Microsoft Data Box Edge folder. This folder has two subfolders:

    • Upload folder that has log files for upload errors.
    • Refresh folder for errors during refresh.

    Here is a sample log file for refresh.

    <root container="test1" machine="VM15BS020663" timestamp="03/18/2019 00:11:10" />
    <file item="test.txt" local="False" remote="True" error="16001" />
    <summary runtime="00:00:00.0945320" errors="1" creates="2" deletes="0" insync="3" replaces="0" pending="9" />
    
  3. When you see an error in this file (the error attribute of the file element in the sample), note the error code, which in this case is 16001. Look up the description of this error code in the following error reference.

    Error code Error description
    100 The container or share name must be between 3 and 63 characters.
    101 The container or share name must consist of only letters, numbers, or hyphens.
    102 The container or share name must consist of only letters, numbers, or hyphens.
    103 The blob or file name contains unsupported control characters.
    104 The blob or file name contains illegal characters.
    105 Blob or file name contains too many segments (each segment is separated by a slash -/).
    106 The blob or file name is too long.
    107 One of the segments in the blob or file name is too long.
    108 The file size exceeds the maximum file size for upload.
    109 The blob or file is incorrectly aligned.
    110 The Unicode encoded file name or blob is not valid.
    111 The name or the prefix of the file or blob is a reserved name that isn't supported (for example, COM1).
    2000 An etag mismatch indicates that there is a conflict between a block blob in the cloud and on the device. To resolve this conflict, delete one of those files – either the version in the cloud or the version on the device.
    2001 An unexpected problem occurred while processing a file after the file was uploaded. If you see this error, and the error persists for more than 24 hours, contact support.
    2002 The file is already open in another process and can't be uploaded until the handle is closed.
    2003 Couldn't open the file for upload. If you see this error, contact Microsoft Support.
    2004 Couldn't connect to the container to upload data to it.
    2005 Couldn't connect to the container because the account permissions are either wrong or out of date. Check your access.
    2006 Couldn't upload data to the account as the account or share is disabled.
    2007 Couldn't connect to the container because the account permissions are either wrong or out of date. Check your access.
    2008 Couldn't add new data as the container is full. Check the Azure specifications for supported container sizes based on type. For example, Azure File only supports a maximum file size of 5 TB.
    2009 Couldn't upload data because the container associated with the share doesn't exist.
    2997 An unexpected error occurred. This is a transient error that will resolve itself.
    2998 An unexpected error occurred. The error may resolve itself but if it persists for more than 24 hours, contact Microsoft Support.
    16000 Couldn't bring down this file.
    16001 Couldn't bring down this file since it already exists on your local system.
    16002 Couldn't refresh this file since it isn't fully uploaded.
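The lookup can also be scripted. The following Python sketch, offered as an illustration only, extracts the failed items from a refresh log like the sample above and maps each error code to its description. Because the sample contains several top-level elements, the sketch wraps them in a synthetic `<log>` element before parsing; that wrapper, the function name `failed_items`, and the partial `ERROR_DESCRIPTIONS` table (only the 16000 series from the reference above) are choices made for this example.

```python
import xml.etree.ElementTree as ET

# Subset of the error reference above; extend it with the other codes as needed.
ERROR_DESCRIPTIONS = {
    "16000": "Couldn't bring down this file.",
    "16001": "Couldn't bring down this file since it already exists on your local system.",
    "16002": "Couldn't refresh this file since it isn't fully uploaded.",
}


def failed_items(log_text):
    """Return (item, error code, description) for each file entry with an error."""
    # The log has several top-level elements, so wrap them in a synthetic root.
    wrapped = ET.fromstring(f"<log>{log_text}</log>")
    results = []
    for entry in wrapped.iter("file"):
        code = entry.get("error")
        if code:
            description = ERROR_DESCRIPTIONS.get(code, "Unknown error code")
            results.append((entry.get("item"), code, description))
    return results


sample = """
<root container="test1" machine="VM15BS020663" timestamp="03/18/2019 00:11:10" />
<file item="test.txt" local="False" remote="True" error="16001" />
<summary runtime="00:00:00.0945320" errors="1" creates="2" deletes="0" insync="3" replaces="0" pending="9" />
"""

failures = failed_items(sample)
for item, code, description in failures:
    print(f"{item}: {code} - {description}")
```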

Use error lists to troubleshoot

The error lists are compiled from identified scenarios and can be used for self-diagnosis and troubleshooting.

Azure Resource Manager

Here are the errors that may show up during the configuration of Azure Resource Manager to access your device.

Issue / Errors: General issues
Resolution:
  • Verify that the Edge device is configured properly.
  • Verify that the client is configured properly.

Issue / Errors:
    Add-AzureRmEnvironment: An error occurred while sending the request.
    At line:1 char:1
    + Add-AzureRmEnvironment -Name Az3 -ARMEndpoint "https://management.dbe ...
Resolution: This error means that your Azure Stack Edge Pro device is not reachable or not configured properly. Verify that the Edge device and the client are configured correctly. For guidance, see the General issues entry.

Issue / Errors:
    Service returned error. Check InnerException for more details: The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel.
Resolution: This error is likely due to one or more of the bring-your-own-certificate steps being performed incorrectly. You can find guidance here.

Issue / Errors:
    Operation returned an invalid status code 'ServiceUnavailable'
    Response status code does not indicate success: 503 (Service Unavailable).
Resolution: This error could be the result of any of these conditions:
  • ArmStsPool is in a stopped state.
  • Either of the Azure Resource Manager or Security Token Service websites is down.
  • The Azure Resource Manager cluster resource is down.
  Note: Restarting the appliance might fix the issue, but you should collect the support package so that you can debug it further.

Issue / Errors:
    AADSTS50126: Invalid username or password.
    Trace ID: 29317da9-52fc-4ba0-9778-446ae5625e5a
    Correlation ID: 1b9752c4-8cbf-4304-a714-8a16527410f4
    Timestamp: 2019-11-15 09:21:57Z: The remote server returned an error: (400) Bad Request.
    At line:1 char:1
Resolution: This error could be the result of any of these conditions:
  • For an invalid username and password, verify that the password was changed from the Azure portal by following the steps here, and then use the correct password.
  • For an invalid tenant ID, note that the tenant ID is a fixed GUID and should be set to c0257de7-538f-415c-993a-1b87a031879d.

Issue / Errors:
    connect-AzureRmAccount: AADSTS90056: The resource is disabled or does not exist. Check your app's code to ensure that you have specified the exact resource URL for the resource you are trying to access.
    Trace ID: e19bdbc9-5dc8-4a74-85c3-ac6abdfda115
    Correlation ID: 75c8ef5a-830e-48b5-b039-595a96488ff9
    Timestamp: 2019-11-18 07:00:51Z: The remote server returned an error: (400) Bad
Resolution: The resource endpoints used in the Add-AzureRmEnvironment command are incorrect.

Issue / Errors:
    Unable to get endpoints from the cloud.
    Please ensure you have network connection. Error detail: HTTPSConnectionPool(host='management.dbg-of4k6suvm.microsoftdatabox.com', port=30005): Max retries exceeded with url: /metadata/endpoints?api-version=2015-01-01 (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",),))
Resolution: This error appears mostly in a Mac/Linux environment and is due to the following issue:
  • A PEM format certificate wasn't added to the Python certificate store.
    Verify the device is configured properly

    1. From the local UI, verify that the device network is configured correctly.

    2. Verify that certificates are updated for all the endpoints as mentioned here.

    3. Get the Azure Resource Manager management and login endpoint from the Device page in local UI.

    4. Verify that the device is activated and registered in Azure.

    Verify the client is configured properly

    1. Validate that the correct PowerShell version is installed as mentioned here.

    2. Validate that the correct PowerShell modules are installed as mentioned here.

    3. Validate that Azure Resource Manager and login endpoints are reachable. You can try to ping the endpoints. For example:

      ping management.28bmdw2-bb9.microsoftdatabox.com
      ping login.28bmdw2-bb9.microsoftdatabox.com

      If they aren't reachable, add DNS / host file entries as mentioned here.

    4. Validate that client certificates are installed as mentioned here.

    5. If you're using PowerShell, enable the debug preference to see detailed messages by running this PowerShell command.

      $debugpreference = "continue"
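Because missing DNS or hosts file entries are a common cause of unreachable endpoints, it can help to check name resolution separately from ping. The Python sketch below is a minimal example of such a check; the function name `resolves` is hypothetical, and the endpoint names in the usage comment are the sample names from the client steps above.

```python
import socket


def resolves(hostname):
    """Return True if the local resolver (DNS or hosts file) can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False


# Example usage with the sample endpoint names from this article:
# for name in ("management.28bmdw2-bb9.microsoftdatabox.com",
#              "login.28bmdw2-bb9.microsoftdatabox.com"):
#     print(name, "resolves" if resolves(name) else "does NOT resolve")
```

A name that resolves can still be unreachable, so treat this check as a first step and not as a replacement for the connectivity test.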

    Blob Storage on device

    Here are the errors related to blob storage on an Azure Stack Edge Pro or Data Box Gateway device.

    Issue / Errors: Unable to retrieve child resources. The value for one of the HTTP headers is not in the correct format.
    Resolution: From the Edit menu, select Target Azure Stack APIs. Then, restart Azure Storage Explorer.

    Issue / Errors: getaddrinfo ENOTFOUND <accountname>.blob.<serialnumber>.microsoftdatabox.com
    Resolution: Check that the endpoint name <accountname>.blob.<serialnumber>.microsoftdatabox.com is added to the hosts file at this path: C:\Windows\System32\drivers\etc\hosts on Windows, or /etc/hosts on Linux.

    Issue / Errors: Unable to retrieve child resources. Details: self-signed certificate
    Resolution: Import the SSL certificate for your device into Azure Storage Explorer:
    1. Download the certificate from the Azure portal. For more information, see Download the certificate.
    2. From the Edit menu, select SSL Certificates and then select Import Certificates.

    Issue / Errors: AzCopy command appears to stop responding for a minute before displaying this error: Failed to enumerate directory https://… The remote name could not be resolved <accountname>.blob.<serialnumber>.microsoftdatabox.com
    Resolution: Check that the endpoint name <accountname>.blob.<serialnumber>.microsoftdatabox.com is added to the hosts file at: C:\Windows\System32\drivers\etc\hosts.

    Issue / Errors: AzCopy command appears to stop responding for a minute before displaying this error: Error parsing source location. The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel.
    Resolution: Import the SSL certificate for your device into the system's certificate store. For more information, see Download the certificate.

    Issue / Errors: AzCopy command appears to stop responding for 20 minutes before displaying this error: Error parsing source location https://<accountname>.blob.<serialnumber>.microsoftdatabox.com/<cntnr>. No such device or address.
    Resolution: Check that the endpoint name <accountname>.blob.<serialnumber>.microsoftdatabox.com is added to the hosts file at: /etc/hosts.

    Issue / Errors: AzCopy command appears to stop responding for 20 minutes before displaying this error: Error parsing source location… The SSL connection could not be established.
    Resolution: Import the SSL certificate for your device into the system's certificate store. For more information, see Download the certificate.

    Issue / Errors: The value for one of the HTTP headers is not in the correct format.
    Resolution: The installed version of the Microsoft Azure Storage Library for Python is not supported by Data Box. See Azure Data Box Blob storage requirements for supported versions.

    Issue / Errors: … [SSL: CERTIFICATE_VERIFY_FAILED] …
    Resolution: Before running Python, set the REQUESTS_CA_BUNDLE environment variable to the path of the Base64-encoded SSL certificate file (see how to Download the certificate). For example:
        export REQUESTS_CA_BUNDLE=/tmp/mycert.cer
        python
    Alternatively, add the certificate to the system's certificate store, and then set this environment variable to the path of that store. For example, on Ubuntu:
        export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
        python

    Issue / Errors: The connection times out.
    Resolution: Sign into the Azure Stack Edge Pro and then check that it's unlocked. Anytime the device restarts, it stays locked until someone signs in.
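Several of the resolutions above depend on the blob endpoint name being present in the hosts file. The Python sketch below shows one way to check that, assuming the standard hosts file layout of an IP address followed by one or more host names. The function name `missing_hosts_entries`, the account name, the serial number, and the IP address in the demo are all hypothetical placeholders.

```python
def missing_hosts_entries(hosts_text, endpoints):
    """Return the endpoint names that have no mapping in the given hosts file text."""
    mapped = set()
    for line in hosts_text.splitlines():
        # Drop comments, then split "IP name1 name2 ..." into its parts.
        parts = line.split("#", 1)[0].split()
        mapped.update(name.lower() for name in parts[1:])  # parts[0] is the IP address
    return [name for name in endpoints if name.lower() not in mapped]


# Hypothetical hosts file content and endpoint names, for illustration only.
hosts_text = """
127.0.0.1   localhost
# Azure Stack Edge blob endpoint
10.0.0.5    myaccount.blob.myserial.microsoftdatabox.com
"""
endpoints = [
    "myaccount.blob.myserial.microsoftdatabox.com",
    "otheraccount.blob.myserial.microsoftdatabox.com",
]
print(missing_hosts_entries(hosts_text, endpoints))
```

To check a real client, read the text from C:\Windows\System32\drivers\etc\hosts on Windows or /etc/hosts on Linux and pass your own endpoint names.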

    Troubleshoot IoT Edge errors

    Use the IoT Edge agent runtime responses to troubleshoot compute-related errors. Here is a list of possible responses:

    • 200 - OK
    • 400 - The deployment configuration is malformed or invalid.
    • 417 - The device doesn't have a deployment configuration set.
    • 412 - The schema version in the deployment configuration is invalid.
    • 406 - The IoT Edge device is offline or not sending status reports.
    • 500 - An error occurred in the IoT Edge runtime.

    For more information, see IoT Edge Agent.

    The following error is related to the IoT Edge service on your Azure Stack Edge Pro device.

    Compute modules have Unknown status and can't be used

    Error description

    All modules on the device show Unknown status and can't be used. The Unknown status persists through a reboot.

    Suggested solution

    Delete the IoT Edge service, and then redeploy the module(s). For more information, see Remove IoT Edge service.

    Modules show as running but are not working

    Error description

    The runtime status of the module shows as running, but the expected outcomes are not seen.

    This condition could occur because the module route configuration is not working or because edgehub is not routing messages as expected. Check the edgehub logs. If you see errors such as a failure to connect to the IoT Hub service, the most common cause is a connectivity issue: the AMQP port that the IoT Hub service uses by default for communication may be blocked, or the web proxy server may be blocking these messages.

    Suggested solution

    Take the following steps:

    1. To resolve the error, go to the IoT Hub resource for your device and then select your Edge device.
    2. Go to Set modules > Runtime settings.
    3. Add the Upstream protocol environment variable and assign it a value of AMQPWS. The messages configured in this case are sent over WebSockets via port 443.
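Before changing the protocol, you may want to confirm that a blocked port is really the cause. The Python sketch below is a generic TCP reachability check, not an IoT Edge tool. It assumes that AMQP over TLS uses its standard port 5671 (the article names only port 443 for WebSockets), and the hub host name in the usage comment is a hypothetical placeholder. The function name `port_open` is also hypothetical.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage (replace the placeholder with your IoT Hub host name):
# hub = "<your-iot-hub>.azure-devices.net"
# print("AMQP (5671):", port_open(hub, 5671))
# print("WebSockets (443):", port_open(hub, 443))
```

If port 443 is reachable and the AMQP port is not, sending messages over WebSockets as described in the steps above is a reasonable next thing to try. A successful TCP connection does not rule out a web proxy that blocks the traffic at a higher layer.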

    Modules show as running but do not have an IP assigned

    Error description

    The runtime status of the module shows as running, but the containerized app does not have an IP assigned.

    This condition occurs because the range of IPs that you provided for Kubernetes external service IPs is not sufficient. Extend this range so that each container or VM that you deployed is covered.

    Suggested solution

    In the local web UI of your device, do the following steps:

    1. Go to the Compute page. Select the port for which you enabled the compute network.
    2. Enter a static, contiguous range of IPs for Kubernetes external service IPs. You need one IP for the edgehub service. Additionally, you need one IP for each IoT Edge module and for each VM that you deploy.
    3. Select Apply. The changed IP range should take effect immediately.

    For more information, see Change external service IPs for containers.
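The sizing rule in step 2 (one IP for edgehub, plus one per IoT Edge module and one per VM) can be checked with a short calculation. The Python sketch below is illustrative only; the function names and the addresses in the demo are hypothetical.

```python
import ipaddress


def required_ips(module_count, vm_count):
    """One IP for the edgehub service, plus one per IoT Edge module and per VM."""
    return 1 + module_count + vm_count


def range_size(first_ip, last_ip):
    """Number of addresses in the contiguous range first_ip..last_ip, inclusive."""
    first = ipaddress.ip_address(first_ip)
    last = ipaddress.ip_address(last_ip)
    if last < first:
        raise ValueError("last_ip must not be lower than first_ip")
    return int(last) - int(first) + 1


def range_is_sufficient(first_ip, last_ip, module_count, vm_count):
    return range_size(first_ip, last_ip) >= required_ips(module_count, vm_count)


# Hypothetical example: 3 modules and 2 VMs need 6 IPs, but this range has only 5.
print(range_is_sufficient("10.0.0.10", "10.0.0.14", module_count=3, vm_count=2))
```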

    Configure static IPs for IoT Edge modules

    Problem description

    Kubernetes assigns dynamic IPs to each IoT Edge module on your Azure Stack Edge Pro GPU device. A method is needed to configure static IPs for the modules.

    Suggested solution

    You can specify fixed IP addresses for your IoT Edge modules via the K8s-experimental section as described below:

    {
      "k8s-experimental": {
        "serviceOptions" : {
          "loadBalancerIP" : "100.23.201.78",
          "type" : "LoadBalancer"
        }
      }
    }
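If you generate deployment settings from a script, the same fragment can be built programmatically. The Python sketch below produces the JSON shown above and rejects malformed addresses before they reach the deployment. The function name `static_ip_create_options` is hypothetical, and how the resulting JSON is merged into a module's create options is not covered here.

```python
import ipaddress
import json


def static_ip_create_options(ip):
    """Build the k8s-experimental fragment that requests a fixed load balancer IP."""
    ipaddress.ip_address(ip)  # raises ValueError if ip is not a valid address
    return {
        "k8s-experimental": {
            "serviceOptions": {
                "loadBalancerIP": ip,
                "type": "LoadBalancer",
            }
        }
    }


# Uses the sample address from the fragment above.
options = static_ip_create_options("100.23.201.78")
print(json.dumps(options, indent=2))
```

The address you choose should fall within the Kubernetes external service IP range configured on the device.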
    

    Expose Kubernetes service as cluster IP service for internal communication

    Problem description

    By default, the IoT service type is load balancer, and the service is assigned externally facing IP addresses. You may not want an external-facing IP address for your application. Instead, you may need to expose the pods within the Kubernetes cluster so that other pods can access them, rather than exposing them as an external load balancer service.

    Suggested solution

    You can use the create options via the K8s-experimental section. The following service option should work with port bindings.

    {
      "k8s-experimental": {
        "serviceOptions" : {
          "type" : "ClusterIP"
        }
      }
    }
    

    Next steps