How to mount S3 for HDFS tiering in a big data cluster
The following sections provide an example of how to configure HDFS tiering with an S3 Storage data source.
Prerequisites
- Deployed big data cluster
- Big data tools
- Create and upload data to an S3 bucket
  - Upload CSV or Parquet files to your S3 bucket. This is the external HDFS data that will be mounted to HDFS in the big data cluster.
Prepare a credential file with access keys
Open a command prompt on a client machine that can access your big data cluster.
Create a local file named filename.creds that contains your S3 account credentials using the following format:
fs.s3a.access.key=<Access Key ID of the key>
fs.s3a.secret.key=<Secret Access Key of the key>
For more information on how to create S3 access keys, see S3 access keys.
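The credential file can also be written from the shell. The following is a minimal sketch, assuming the key pair is already available in the environment variables S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY (hypothetical names chosen for this example). If they are unset, the placeholder text is written instead and must be edited by hand. Restricting the file permissions is a precaution, not a documented requirement.

```shell
#!/bin/sh
# Sketch: write filename.creds in the current directory, one key per line.
# S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY are hypothetical variable names;
# when unset, the placeholder text is written instead.
creds_file="filename.creds"

printf 'fs.s3a.access.key=%s\nfs.s3a.secret.key=%s\n' \
  "${S3_ACCESS_KEY_ID:-<Access Key ID of the key>}" \
  "${S3_SECRET_ACCESS_KEY:-<Secret Access Key of the key>}" > "$creds_file"

# The file holds a secret, so make it readable by the current user only.
chmod 600 "$creds_file"

# Count the fs.s3a.* lines without printing the secret values.
grep -c '^fs\.s3a\.' "$creds_file"
```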
Mount the remote HDFS storage
Now that you have prepared a credential file with access keys, you can start mounting. The following steps mount the remote HDFS storage in S3 to the local HDFS storage of your big data cluster.
Use kubectl to find the IP address of the controller-svc-external service endpoint in your big data cluster. Look for the value in the EXTERNAL-IP column.
kubectl get svc controller-svc-external -n <your-cluster-name>
Log in with mssqlctl using the external IP address of the controller endpoint with your cluster username and password:
mssqlctl login -e https://<IP-of-controller-svc-external>:30080/
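The lookup and the login URL can be combined in a small helper. This is a sketch only: controller_endpoint is a hypothetical function name, and it assumes the service is exposed through a load balancer that reports an IP address (not a hostname) in its status. Port 30080 is taken from the login command above.

```shell
#!/bin/sh
# Sketch: print the controller endpoint URL for a given cluster namespace.
# controller_endpoint is a hypothetical helper, not part of kubectl or mssqlctl.
controller_endpoint() {
  namespace="$1"
  # Read the first load-balancer IP of the service. This assumes the
  # external IP is reported as an IP address rather than a hostname.
  ip=$(kubectl get svc controller-svc-external -n "$namespace" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}') || return 1
  if [ -z "$ip" ]; then
    echo "no external IP found for controller-svc-external in $namespace" >&2
    return 1
  fi
  # 30080 is the port used in the login command.
  printf 'https://%s:30080/\n' "$ip"
}
```

With the function defined, the login step could be run as `mssqlctl login -e "$(controller_endpoint <your-cluster-name>)"`.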
Mount the remote HDFS storage in S3 using mssqlctl cluster storage-pool mount create. Replace the placeholder values before running the following command:
mssqlctl cluster storage-pool mount create --remote-uri s3a://<S3 bucket name> --mount-path /mounts/<mount-name> --credential-file <path-to-s3-credentials>/filename.creds
The mount create command is asynchronous. At this time, it returns no message indicating whether the mount succeeded. To check the status of your mounts, see the Get the status of mounts section.
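Because the command gives no immediate feedback, it can help to check the credential file before mounting. The sketch below uses a hypothetical helper named check_creds_file. It only verifies that each of the two keys starts a line and has a non-empty value; it does not confirm that the keys are valid for your bucket.

```shell
#!/bin/sh
# Sketch: check a credential file before passing it to --credential-file.
# check_creds_file is a hypothetical helper name.
check_creds_file() {
  file="$1"
  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    return 1
  fi
  for key in fs.s3a.access.key fs.s3a.secret.key; do
    # Require "<key>=<non-empty value>" at the start of a line.
    # The dots in the key are treated as wildcards by grep, which is
    # loose but sufficient for this check.
    if ! grep -q "^${key}=." "$file"; then
      echo "$file: missing or empty $key" >&2
      return 1
    fi
  done
}
```

One way to use it is to run `check_creds_file filename.creds` first and continue with the mount create command only if it succeeds. A file with both keys on a single line fails the check, since the second key does not start a line.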
If the mount succeeds, you can query the HDFS data and run Spark jobs against it. The data appears in the HDFS of your big data cluster at the mount path you specified in the mount create command.
Get the status of mounts
To list the status of all mounts in your big data cluster, use the following command:
mssqlctl cluster storage-pool mount status
To list the status of a mount at a specific path in HDFS, use the following command:
mssqlctl cluster storage-pool mount status --mount-path <mount-path-in-hdfs>
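Since mount create is asynchronous, the status command can be polled until the mount is ready. The following is a sketch under stated assumptions: wait_for_mount is a hypothetical helper, and because the format of the status output is not described here, the caller supplies the text pattern that indicates success. The attempt count and delay defaults are arbitrary.

```shell
#!/bin/sh
# Sketch: poll the mount status until its output matches a pattern.
# wait_for_mount is a hypothetical helper. The pattern is supplied by the
# caller because the status output format is not described in this article.
wait_for_mount() {
  mount_path="$1"
  pattern="$2"
  attempts="${3:-30}"   # number of status checks (arbitrary default)
  delay="${4:-10}"      # seconds between checks (arbitrary default)
  i=1
  while [ "$i" -le "$attempts" ]; do
    status=$(mssqlctl cluster storage-pool mount status --mount-path "$mount_path" 2>&1)
    if printf '%s\n' "$status" | grep -q -- "$pattern"; then
      printf '%s\n' "$status"
      return 0
    fi
    # Wait before the next check, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "gave up after $attempts checks of $mount_path" >&2
  return 1
}
```

A call might look like `wait_for_mount /mounts/<mount-name> '<text-that-indicates-success>'`, where the pattern is whatever your version of the tool prints for a completed mount.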
Delete the mount
To delete the mount, use the mssqlctl cluster storage-pool mount delete command, and specify the mount path in HDFS:
mssqlctl cluster storage-pool mount delete --mount-path <mount-path-in-hdfs>
For more information about SQL Server 2019 big data clusters, see "What are SQL Server 2019 big data clusters?"