Use curl to load data into HDFS on SQL Server Big Data Clusters

Applies to: SQL Server 2019 (15.x)

This article explains how to use curl to load data into HDFS on SQL Server 2019 Big Data Clusters.

Important

The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.

Prerequisites

Obtain the service external IP

WebHDFS is started when deployment is completed, and its access goes through Knox. The Knox endpoint is exposed through a Kubernetes service called gateway-svc-external. To create the necessary WebHDFS URL to upload/download files, you need the gateway-svc-external service external IP address and the name of your big data cluster. You can get the gateway-svc-external service external IP address by running the following command:

kubectl get service gateway-svc-external -n <big data cluster name> -o json | jq -r .status.loadBalancer.ingress[0].ip

Note

The <big data cluster name> here is the name of the cluster that you specified in the deployment configuration file. The default name is mssql-cluster.

Construct the URL to access WebHDFS

Now, you can construct the URL to access the WebHDFS as follows:

https://<gateway-svc-external service external IP address>:30443/gateway/default/webhdfs/v1/

For example:

https://13.66.190.205:30443/gateway/default/webhdfs/v1/

Authentication with Active Directory

For deployments with Active Directory, use the authentication parameter with curl with Negotiate authentication.

To use curl with Active Directory authentication, run this command:

kinit <username>

The command generates a Kerberos token for curl to use. The commands demonstrated in the next sections specify the --anyauth parameter for curl. For URLs that require Negotiate authentication, curl automatically detects and uses the generated Kerberos token instead of username and password to authenticate to the URLs.

List a file

To list file under hdfs:///product_review_data, use the following curl command:

curl -i -k --anyauth -u root:<AZDATA_PASSWORD> -X GET 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/?op=liststatus'

Beginning with SQL Server 2019 (15.x) CU 5, when you deploy a new cluster with basic authentication all endpoints including gateway use AZDATA_USERNAME and AZDATA_PASSWORD. Endpoints on clusters that are upgraded to CU 5 continue to use root as username to connect to gateway endpoint. This change does not apply to deployments using Active Directory authentication. See Credentials for accessing services through gateway endpoint in the release notes.

For endpoints that do not use root, use the following curl command:

curl -i -k --anyauth -u <AZDATA_USERNAME>:<AZDATA_PASSWORD> -X GET 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/?op=liststatus'

Put a local file into HDFS

To put a new file test.csv from local directory to product_review_data directory, use the following curl command (the Content-Type parameter is required):

curl -i -L -k --anyauth -u root:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/test.csv?op=create' -H 'Content-Type: application/octet-stream' -T 'test.csv'

Beginning with SQL Server 2019 (15.x) CU 5, when you deploy a new cluster with basic authentication all endpoints including gateway use AZDATA_USERNAME and AZDATA_PASSWORD. Endpoints on clusters that are upgraded to CU 5 continue to use root as username to connect to gateway endpoint. This change does not apply to deployments using Active Directory authentication. See Credentials for accessing services through gateway endpoint in the release notes.

For endpoints that do not use root, use the following curl command:

curl -i -L -k --anyauth -u <AZDATA_USERNAME>:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/product_review_data/test.csv?op=create' -H 'Content-Type: application/octet-stream' -T 'test.csv'

Create a directory

To create a directory test under hdfs:///, use the following command:

curl -i -L -k --anyauth -u root:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/test?op=MKDIRS'

Beginning with SQL Server 2019 (15.x) CU 5, when you deploy a new cluster with basic authentication all endpoints including gateway use AZDATA_USERNAME and AZDATA_PASSWORD. Endpoints on clusters that are upgraded to CU 5 continue to use root as username to connect to gateway endpoint. This change does not apply to deployments using Active Directory authentication. See Credentials for accessing services through gateway endpoint in the release notes.

For endpoints that do not use root, use the following curl command:

curl -i -L -k --anyauth -u <AZDATA_USERNAME>:<AZDATA_PASSWORD> -X PUT 'https://<gateway-svc-external IP external address>:30443/gateway/default/webhdfs/v1/test?op=MKDIRS'

Next steps

For more information, see Introducing SQL Server Big Data Clusters.