Use curl to load data into HDFS on SQL Server Big Data Clusters

Applies to: yesSQL Server 2019 (15.x)

This article explains how to use curl to load data into HDFS on SQL Server 2019 Big Data Clusters.

Prerequisites

Obtain the service external IP

WebHDFS is started when deployment is completed, and its access goes through Knox. The Knox endpoint is exposed through a Kubernetes service called gateway-svc-external. To create the necessary WebHDFS URL to upload/download files, you need the gateway-svc-external service external IP address and the name of your big data cluster. You can get the gateway-svc-external service external IP address by running the following command:

kubectl get service gateway-svc-external -n <big data cluster name> -o json | jq -r .status.loadBalancer.ingress[0].ip

Note

The <big data cluster name> here is the name of the cluster that you specified in the deployment configuration file. The default name is mssql-cluster.

Construct the URL to access WebHDFS

Now, you can construct the URL to access the WebHDFS as follows:

https://<gateway-svc-external service external IP address>:30443/gateway/default/webhdfs/v1/

For example:

https://13.66.190.205:30443/gateway/default/webhdfs/v1/

List a file

To list file under hdfs:///product_review_data, use the following curl command:

curl -i -k -u root:<AZDATA_PASSWORD> -X GET 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/?op=liststatus'

Beginning with SQL Server 2019 CU5, when you deploy a new cluster with basic authentication all endpoints including gateway use AZDATA_USERNAME and AZDATA_PASSWORD. Endpoints on clusters that are upgraded to CU5 continue to use root as username to connect to gateway endpoint. This change does not apply to deployments using Active Directory authentication. See Credentials for accessing services through gateway endpoint in the release notes.

For endpoints that do not use root, use the following curl command:

curl -i -k -u <AZDATA_USERNAME>:<AZDATA_PASSWORD> -X GET 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/?op=liststatus'

Put a local file into HDFS

To put a new file test.csv from local directory to product_review_data directory, use the following curl command (the Content-Type parameter is required):

curl -i -L -k -u root:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/test.csv?op=create' -H 'Content-Type: application/octet-stream' -T 'test.csv'

Beginning with SQL Server 2019 CU5, when you deploy a new cluster with basic authentication all endpoints including gateway use AZDATA_USERNAME and AZDATA_PASSWORD. Endpoints on clusters that are upgraded to CU5 continue to use root as username to connect to gateway endpoint. This change does not apply to deployments using Active Directory authentication. See Credentials for accessing services through gateway endpoint in the release notes.

For endpoints that do not use root, use the following curl command:

curl -i -L -k -u <AZDATA_USERNAME>:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/test.csv?op=create' -H 'Content-Type: application/octet-stream' -T 'test.csv'

Create a directory

To create a directory test under hdfs:///, use the following command:

curl -i -L -k -u root:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/test?op=MKDIRS'

Beginning with SQL Server 2019 CU5, when you deploy a new cluster with basic authentication all endpoints including gateway use AZDATA_USERNAME and AZDATA_PASSWORD. Endpoints on clusters that are upgraded to CU5 continue to use root as username to connect to gateway endpoint. This change does not apply to deployments using Active Directory authentication. See Credentials for accessing services through gateway endpoint in the release notes.

For endpoints that do not use root, use the following curl command:

curl -i -L -k -u <AZDATA_USERNAME>:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/test?op=MKDIRS'

Next steps

For more information about SQL Server big data cluster, see What is SQL Server big data cluster?.