Configure deployment settings for big data clusters

THIS TOPIC APPLIES TO: SQL Server. It does not apply to Azure SQL Database, Azure SQL Data Warehouse, or Parallel Data Warehouse.

This article explains how to configure big data cluster deployments by modifying deployment configuration files, with examples for changing the configuration in different scenarios. To customize your cluster deployment configuration file, you can use any JSON format editor, such as VSCode. To script these edits for automation purposes, use the mssqlctl bdc config section set command. For more information about how configuration files are used in deployments, see the deployment guidance.

Prerequisites

  • Install mssqlctl.

  • Each of the examples in this section assumes that you have created a copy of one of the standard configuration files. For more information, see Create a custom configuration file. For example, the following command creates a directory called custom that contains a JSON deployment configuration file based on the default aks-dev-test configuration:

    mssqlctl bdc config init --source aks-dev-test --target custom
    

Change cluster name

The cluster name is both the name of the big data cluster and the Kubernetes namespace that will be created on deployment. It is specified in the following portion of the deployment configuration file:

"metadata": {
    "kind": "Cluster",
    "name": "mssql-cluster"
},

The following command sends a key-value pair to the --json-values parameter to change the big data cluster name to test-cluster:

mssqlctl bdc config section set --config-profile custom -j "metadata.name=test-cluster"

Important

The name of your big data cluster must contain only lowercase alphanumeric characters and hyphens, with no spaces. All Kubernetes artifacts (containers, pods, stateful sets, services) for the cluster are created in a namespace with the same name as the cluster.
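As a pre-flight sanity check, the naming rule can be validated with a short script. The sketch below is illustrative only (it is not part of mssqlctl) and uses the Kubernetes DNS-1123 label convention, which the sample names such as mssql-cluster satisfy:

```python
import re

# Kubernetes namespace names are DNS-1123 labels: lowercase alphanumerics
# and hyphens, starting and ending with an alphanumeric character, at most
# 63 characters long. Illustrative pre-flight check only.
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$")

def is_valid_cluster_name(name: str) -> bool:
    return DNS_LABEL.match(name) is not None

print(is_valid_cluster_name("mssql-cluster"))  # sample name from the docs
print(is_valid_cluster_name("My Cluster"))     # uppercase and space are rejected
```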

Update endpoint ports

Endpoints are defined for the control plane as well as for individual pools. The following portion of the configuration file shows the endpoint definitions for the control plane:

"endpoints": [
    {
        "name": "Controller",
        "serviceType": "LoadBalancer",
        "port": 30080
    },
    {
        "name": "ServiceProxy",
        "serviceType": "LoadBalancer",
        "port": 30777
    }
]

The following example uses inline JSON to change the port for the Controller endpoint:

mssqlctl bdc config section set --config-profile custom -j "$.spec.controlPlane.spec.endpoints[?(@.name==""Controller"")].port=30000"
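If you prefer to edit the configuration file directly in a JSON editor, the same change amounts to finding the endpoint named Controller and updating its port. A minimal Python sketch of that edit, using the endpoint structure shown above (file reading and writing omitted):

```python
import json

# Control plane endpoint definitions as shown in the configuration file.
config = {
    "spec": {
        "controlPlane": {
            "spec": {
                "endpoints": [
                    {"name": "Controller", "serviceType": "LoadBalancer", "port": 30080},
                    {"name": "ServiceProxy", "serviceType": "LoadBalancer", "port": 30777},
                ]
            }
        }
    }
}

# Equivalent of the JSONPath filter [?(@.name=='Controller')].port = 30000
for endpoint in config["spec"]["controlPlane"]["spec"]["endpoints"]:
    if endpoint["name"] == "Controller":
        endpoint["port"] = 30000

print(json.dumps(config["spec"]["controlPlane"]["spec"]["endpoints"], indent=2))
```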

Configure pool replicas

The characteristics of each pool, such as the storage pool, are defined in the configuration file. For example, the following portion shows a storage pool definition:

"pools": [
    {
        "metadata": {
            "kind": "Pool",
            "name": "default"
        },
        "spec": {
            "type": "Storage",
            "replicas": 2,
            "storage": {
               "data": {
                  "className": "default",
                  "accessMode": "ReadWriteOnce",
                  "size": "15Gi"
               },
               "logs": {
                  "className": "default",
                  "accessMode": "ReadWriteOnce",
                  "size": "10Gi"
               }
            }
        }
    }
]

You can configure the number of instances in a pool by modifying the replicas value for each pool. The following example uses inline JSON to change these values for the storage and data pools to 10 and 4 respectively:

mssqlctl bdc config section set --config-profile custom -j "$.spec.pools[?(@.spec.type == ""Storage"")].spec.replicas=10"
mssqlctl bdc config section set --config-profile custom -j "$.spec.pools[?(@.spec.type == ""Data"")].spec.replicas=4"
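Each of the commands above selects a pool by its spec.type and updates spec.replicas. Edited by hand, the equivalent change looks like the following sketch (assuming a profile that contains pools of type Storage and Data):

```python
# Desired replica counts per pool type, matching the two commands above.
desired = {"Storage": 10, "Data": 4}

# Simplified pool definitions in the shape shown earlier in this article.
pools = [
    {"metadata": {"kind": "Pool", "name": "default"},
     "spec": {"type": "Storage", "replicas": 2}},
    {"metadata": {"kind": "Pool", "name": "default"},
     "spec": {"type": "Data", "replicas": 2}},
]

# Equivalent of $.spec.pools[?(@.spec.type == '<type>')].spec.replicas = <n>
for pool in pools:
    pool_type = pool["spec"]["type"]
    if pool_type in desired:
        pool["spec"]["replicas"] = desired[pool_type]

print([(p["spec"]["type"], p["spec"]["replicas"]) for p in pools])
```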

Configure storage

You can also change the storage class and characteristics that are used for each pool. The following example assigns a custom storage class to the storage pool and updates the size of the persistent volume claim for storing data to 32 Gi. The storage section must already exist in the configuration file before you can update it with the mssqlctl bdc config section set command; to add it, use a JSON patch file as described later in this article:

mssqlctl bdc config section set --config-profile custom -j "$.spec.pools[?(@.spec.type == ""Storage"")].spec.storage.data.className=storage-pool-class"
mssqlctl bdc config section set --config-profile custom -j "$.spec.pools[?(@.spec.type == ""Storage"")].spec.storage.data.size=32Gi"

Note

A configuration file based on kubeadm-dev-test does not have a storage definition for each pool, but this can be added manually if needed.

For more information about storage configuration, see Data persistence with SQL Server big data cluster on Kubernetes.

Configure storage without Spark

You can also configure the storage pools to run without Spark and create a separate Spark pool. This enables you to scale Spark compute power independently of storage. To see how to configure the Spark pool, see the JSON patch file example at the end of this article.

By default, the includeSpark setting for the storage pool is set to true, and the field is not present in the built-in configurations. You must add the includeSpark field to the storage pool configuration before you can change its value with the mssqlctl bdc config section set command:

mssqlctl bdc config section set --config-profile custom -j "$.spec.pools[?(@.spec.type == ""Storage"")].includeSpark=false"

Configure pod placement using Kubernetes labels

You can control pod placement on Kubernetes nodes that have specific resources to accommodate various types of workload requirements. For example, you might want to ensure that storage pool pods are placed on nodes with more storage, or that SQL Server master instances are placed on nodes with higher CPU and memory resources. In this case, you first build a heterogeneous Kubernetes cluster with different types of hardware and assign node labels accordingly. When deploying the big data cluster, you can then specify the same labels at the pool level in the cluster deployment configuration file. Kubernetes takes care of scheduling the pods on nodes that match the specified labels.

The following example shows how to edit a custom configuration file to include a node label setting for the SQL Server master instance. Because there is no nodeLabel key in the built-in configurations, you need to either edit a custom configuration file manually or create a patch file and apply it to the custom configuration file.

Create a file named patch.json in your current directory with the following contents:

{
  "patch": [
    {
      "op": "add",
      "path": "$.spec.pools[?(@.spec.type == 'Master')].spec",
      "value": {
        "nodeLabel": "<yourNodeLabel>"
      }
    }
  ]
}

mssqlctl bdc config section set --config-profile custom -p ./patch.json

JSON patch files

JSON patch files configure multiple settings at once. For more information about JSON patches, see JSON Patches in Python and the JSONPath Online Evaluator.
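To make the mechanics concrete, the sketch below shows how a "replace" operation resolves a simple dotted path against the configuration document. This is an illustrative simplification only: the real tool also supports JSONPath filter expressions such as [?(@.name=='Controller')], which this sketch does not handle.

```python
# Illustrative sketch of a "replace" patch operation over a plain dotted
# path. Not the real mssqlctl implementation.
def apply_replace(doc: dict, path: str, value) -> None:
    keys = path.lstrip("$.").split(".")
    target = doc
    for key in keys[:-1]:
        target = target[key]      # walk down to the parent object
    target[keys[-1]] = value      # replace the leaf value

# Control plane storage in the shape used later in the patch file example.
config = {
    "spec": {"controlPlane": {"spec": {"storage": {
        "data": {"className": "default", "size": "15Gi"}
    }}}}
}

apply_replace(config, "spec.controlPlane.spec.storage.data.className", "managed-premium")
print(config["spec"]["controlPlane"]["spec"]["storage"]["data"]["className"])
```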

The following patch.json file performs the following changes:

  • Updates the port of a single endpoint.
  • Updates all endpoints (port and serviceType).
  • Updates the control plane storage. These settings are applicable to all cluster components, unless overridden at the pool level.
  • Updates the storage class name in the control plane storage.
  • Updates pool storage settings for the storage pool.
  • Updates Spark settings for the storage pool.
  • Creates a Spark pool with 2 replicas for the cluster.

{
  "patch": [
    {
      "op": "replace",
      "path": "$.spec.controlPlane.spec.endpoints[?(@.name=='Controller')].port",
      "value": 30000
    },
    {
      "op": "replace",
      "path": "spec.controlPlane.spec.endpoints",
      "value": [
        {
          "serviceType": "LoadBalancer",
          "port": 30001,
          "name": "Controller"
        },
        {
            "serviceType": "LoadBalancer",
            "port": 30778,
            "name": "ServiceProxy"
        }
      ]
    },
    {
      "op": "replace",
      "path": "spec.controlPlane.spec.storage",
      "value": {
          "data": {
            "className": "managed-premium",
            "accessMode": "ReadWriteOnce",
            "size": "100Gi"
          },
          "logs": {
            "className": "managed-premium",
            "accessMode": "ReadWriteOnce",
            "size": "32Gi"
          }
        }
    },
    {
      "op": "replace",
      "path": "spec.controlPlane.spec.storage.data.className",
      "value": "managed-premium"
    },
    {
      "op": "add",
      "path": "$.spec.pools[?(@.spec.type == 'Storage')].spec.storage",
      "value": {
          "data": {
            "className": "managed-premium",
            "accessMode": "ReadWriteOnce",
            "size": "100Gi"
          },
          "logs": {
            "className": "managed-premium",
            "accessMode": "ReadWriteOnce",
            "size": "32Gi"
          }
        }
    },
    {
      "op": "replace",
      "path": "$.spec.pools[?(@.spec.type == 'Storage')].hadoop.spark",
      "value": {
        "driverMemory": "2g",
        "driverCores": 1,
        "executorInstances": 3,
        "executorCores": 1,
        "executorMemory": "1536m"
      }
    },
    {
      "op": "add",
      "path": "spec.pools/-",
      "value":
      {
        "metadata": {
          "kind": "Pool",
          "name": "default"
        },
        "spec": {
          "type": "Spark",
          "replicas": 2
        },
        "hadoop": {
          "yarn": {
            "nodeManager": {
              "memory": 12288,
              "vcores": 6
            },
            "schedulerMax": {
              "memory": 12288,
              "vcores": 6
            },
            "capacityScheduler": {
              "maxAmPercent": 0.3
            }
          },
          "spark": {
            "driverMemory": "2g",
            "driverCores": 1,
            "executorInstances": 2,
            "executorMemory": "2g",
            "executorCores": 1
          }
        }
      }
    }   
  ]
}

Tip

For more information about the structure and options for changing a deployment configuration file, see Deployment configuration file reference for big data clusters.

Use mssqlctl bdc config section set to apply the changes in the JSON patch file. The following example applies the patch.json file to the target custom deployment configuration profile.

mssqlctl bdc config section set --config-profile custom -p ./patch.json

Next steps

For more information about using configuration files in big data cluster deployments, see How to deploy SQL Server big data clusters on Kubernetes.