Question 1

What are the recommended best practices regarding file locations?

Accepted Answer

There is less flexibility in this regard compared to configuring SQL Server on bare-metal machines running Windows or Linux. In a Kubernetes environment, these artifacts are abstracted and need to be portable. Currently, two persistent volumes (PVs), one for data and one for logs, are provided per pod and can be configured. For more information, see Data persistence with SQL Server big data cluster in Kubernetes.

Question 2

Do I need to take transaction log backups on SQL Server Big Data Clusters?

Accepted Answer

You need to perform log backups only for user databases in the SQL Server master instance (depending on the recovery model or HA configuration). Data pool databases use the SIMPLE recovery model only. The same applies to the DW* databases created for PolyBase.

Question 3

How can I monitor if distributed queries are actually using the compute pool?

Accepted Answer

You can use the existing PolyBase DMVs that were enhanced for Big Data Cluster scenarios. For more information, see Monitor and troubleshoot PolyBase.

Question 4

Is it possible to configure and manage Big Data Cluster resources directly via kubectl to the Kubernetes API Server?

Accepted Answer

While you can modify some of the settings using the Kubernetes API or kubectl, doing so is neither supported nor recommended. You must execute all Big Data Cluster management operations via azdata.

Question 5

How can I backup data stored in HDFS?

Accepted Answer

You can use any solution that enables hardware-level storage snapshotting or copy/sync via webHDFS. You can also use azdata bdc hdfs cp. For more information, see azdata bdc hdfs.

Question 6

Is there a way to 'scale out' a stored proc? For example, by having it run on the compute pool?

Accepted Answer

Not at this time. One option is to deploy SQL Server in an Always On Availability Group. You can then use readable secondary replicas to run some processes (for example, ML training/scoring or maintenance activities).

Question 7

How to dynamically scale pods of a Pool?

Accepted Answer

This is not a supported scenario at this time.

Question 8

Is it possible to backup external tables stored in data pools?

Accepted Answer

The database in the data pool instance does not have any metadata about the external tables; it is like any other user database. You can back it up and restore it, but to avoid inconsistent results, you must ensure that the external table metadata in the metadata database in the SQL Master instance is in sync.

Question 9

Does the data pool provide sharding?

Accepted Answer

Data pool is a distributed table concept. Sharding is typically referenced as an OLTP concept - this is not currently supported.

Question 10

When should I use the data pool or the storage pool for raw data storage?

Accepted Answer

The term pool is reserved to describe a collection of homogeneous services or applications. For example, the data pool is a set of stateful SQL Server compute and storage, and the storage pool is a set of HDFS and Spark services. The SQL Server master is either a single instance or multiple instances that can be configured in an availability group. The SQL Server master instance is a regular SQL Server instance on Linux, and you can use any feature available on Linux there. You should start with the data model, the entities, and the services/applications that will primarily operate on each entity. All the data doesn't have to be stored in one place such as SQL Server, HDFS, or the data pool. Based on the data analysis, you might store most of the data in HDFS, process the data into a more efficient format, and expose it to other services. The remaining data would be stored in the SQL Master instance.

Question 11

Does SQL Server Big Data Cluster support GPU-based deep learning libraries and computations (PyTorch, Keras, specific image libraries, etc.)?

Accepted Answer

This is not a supported scenario at this time.

Question 12

Is there a way to configure multiple volume claims for a pool?

Accepted Answer

Each pod can have only two persisted volumes (PVs). You can abstract the volume at the OS level and use it for persistent storage. For example, you can create a RAID 0 OS partition using multiple disks and use that for the persistent volume with a local storage provisioner. There is no way to use more PVs per pod today. PVs are mapped to directories inside the container, and this mapping is fixed. For more information on persisted volumes, see Persistent Volumes in Kubernetes Documentation.

Question 13

If we configure multiple providers and multiple disks, will the HDFS config be updated with all the data volume claims?

Accepted Answer

You can configure storage pool to use a specific storage class at deployment time. See Data persistence with SQL Server big data cluster in Kubernetes.

Question 14

What are the options to access Ceph-based storage?

Accepted Answer

HDFS tiering allows us to integrate transparently with S3-based protocols. For more information, see How to mount S3 for HDFS tiering in a big data cluster.

Question 15

Is data in HDFS preserved after an upgrade?

Accepted Answer

Yes, data will be preserved since it is backed by persistent volumes and upgrade just deploys existing pods with new images.

Question 16

How does HDFS tiering control the cache?

Accepted Answer

With HDFS tiering, data is cached within the local HDFS running in Big Data Cluster, which allows users to attach to large data lakes without having to bring all the data in. A configurable amount of space is allocated to the cache, which defaults to 2% today. Data is maintained in the cache but will be removed if that threshold is exceeded. Security is also maintained from the lake, and all ACLs are applied. For more information, see Configure HDFS tiering on Big Data Clusters.

Question 17

Can we use SQL Server 2019 to virtualize Azure Data Lake Store Gen2? Will this integration take care of folder-level permissions?

Accepted Answer

Yes, you can virtualize data stored in ADLS Gen2 using HDFS tiering. Once HDFS tiering is mounted to ADLS Gen2, users can query the HDFS data and run Spark jobs against it. The mounted storage appears in the HDFS for Big Data Cluster at the location specified by --mount-path, and users can work with that mount path as if it were local storage. For more details, see Configure HDFS tiering on Big Data Cluster. For more information on HDFS tier permissions, see Manage HDFS permissions for SQL Server Big Data Clusters.

Question 18

What's the default high-availability and/or redundancy setting for the master node on Azure Kubernetes Service (AKS)?

Accepted Answer

The AKS control plane supports an uptime SLA that guarantees 99.95% availability. The AKS cluster nodes (worker nodes) use Availability Zones; for more information, see AKS Availability Zones. An Availability Zone (AZ) is a high-availability offering from Azure that protects applications and data from datacenter failures. AKS supports 99.9% availability for clusters that don't use Availability Zones. For more information, refer to SLA for Azure Kubernetes Service (AKS).
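
To put the two availability figures above in perspective, the following minimal sketch converts an SLA percentage into the downtime it permits. The function name and the 30-day period are assumptions made for illustration; the actual SLA terms define how availability is measured.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Return the maximum downtime, in minutes, that still meets the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare the two figures quoted in the answer (30-day period is an assumption).
for pct in (99.95, 99.9):
    print(f"{pct}% over 30 days -> {allowed_downtime_minutes(pct):.1f} minutes")
```

Under these assumptions, 99.95% allows roughly half the downtime that 99.9% does over the same period.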

Question 19

Is there a way to retain YARN and Spark Job History logs?

Accepted Answer

Restarting sparkhead won't cause the logs to be lost, because these logs are in HDFS. You should still see Spark history logs in the /gateway/default/sparkhistory UI. For Yarn container logs, you won't see those apps in the Yarn UI because the Yarn RM restarts, but the Yarn logs are still in HDFS and you can link to them from the Spark history server. You should always use the Spark history server as the entry point to diagnose your Spark apps.

Question 20

Is there a way to turn off the caching feature for any pools?

Accepted Answer

By default, 1% of the total HDFS storage is reserved for caching mounted data. Caching is a global setting across mounts. Currently, there is no exposed way to turn it off. However, the percentage can be configured via the hdfs-site.dfs.provided.cache.capacity.fraction setting. This setting controls the fraction of the total capacity in the cluster that can be used to cache data from Provided stores. To modify it, see How to configure Big Data Cluster settings post deployment. For more information, see Configure HDFS tiering on SQL Server Big Data Clusters.
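
As a rough illustration of what the fraction means in practice, the sketch below computes the cache size for a given total HDFS capacity. The function name is hypothetical, and the default of 0.01 simply mirrors the 1% stated in this answer; this is arithmetic only and does not read or change any cluster setting.

```python
def cache_capacity_bytes(total_hdfs_bytes: int, fraction: float = 0.01) -> int:
    """Estimate the bytes available for caching mounted data (illustrative only)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return int(total_hdfs_bytes * fraction)


TIB = 1024 ** 4
GIB = 1024 ** 3

# Example: a cluster with 10 TiB of total HDFS capacity (an assumed figure).
cache = cache_capacity_bytes(10 * TIB)
print(f"Cache budget: about {cache / GIB:.1f} GiB")
```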

Question 21

How to schedule SQL stored procedures in SQL Server 2019 Big Data Cluster?

Accepted Answer

You can use the SQL Server Agent service in the SQL Server master instance of the big data cluster.

Question 22

Does Big Data Cluster support native time series data scenarios, such as generated by IoT use-cases?

Accepted Answer

At this time InfluxDB in a Big Data Cluster is used only for storing monitoring data collected within the Big Data Cluster and is not exposed as an external endpoint.

Question 23

Can the provided InfluxDB be used as a time series database for customer data?

Accepted Answer

At this time InfluxDB in a Big Data Cluster is used only for storing monitoring data collected within the Big Data Cluster and is not exposed as an external endpoint.

Question 24

How do I add a database to the availability group?

Accepted Answer

In Big Data Cluster, the HA configuration creates an availability group called containedag, which also includes system databases that are replicated across replicas. Databases created as a result of CREATE DATABASE or RESTORE workflows are automatically added to the contained AG and seeded. Prior to SQL Server 2019 (15.0) CU2, you have to connect to the physical instance in Big Data Cluster, restore the database, and add it to containedag. For more information, see Deploy SQL Server Big Data Cluster with high availability.

Question 25

Can I configure core/memory resources for components running within Big Data Cluster?

Accepted Answer

At this time, you can set memory for the SQL instances using sp_configure, just like in SQL Server. For cores, you can use ALTER SERVER CONFIGURATION SET PROCESS AFFINITY. By default, containers see all CPUs on the host and we don't have a way to specify resource limits using Kubernetes at this time. For compute pool/data pool/storage pool, the configuration can be done using EXECUTE AT DATA_SOURCE statement from SQL Server master instance.

Question 26

What happens when one of the Kubernetes worker nodes shuts down or has an outage?

Accepted Answer

Pods that are not affinitized to the respective worker node will be moved to another node in the Kubernetes cluster provided there are sufficient resources. Otherwise, the pod(s) will be unavailable causing outages.

Question 27

Does Big Data Cluster re-balance automatically if I add a node to the Kubernetes cluster?

Accepted Answer

This action depends only on Kubernetes. Apart from pod placement using node labels, there is no other mechanism to control re-balancing Kubernetes resources from within Big Data Cluster.

Question 28

What is the consequence on Big Data Cluster resources when I remove a node from the Kubernetes cluster?

Accepted Answer

This action is equivalent to the host node being shut down. Kubernetes provides mechanisms to orchestrate this using a tainting process, which is typically followed for upgrades or node maintenance. For more information, see Kubernetes documentation for Taints and Tolerations.

Question 29

Does the Hadoop bundled with Big Data Cluster handle replication of the data?

Accepted Answer

Yes, replication factor is one of the available configurations for HDFS. For more information see Configure Persistent Volumes.

Question 30

Does Big Data Cluster overlap with Synapse in terms of functionality and integration?

Accepted Answer

It depends on your use cases and requirements. Big Data Cluster provides a full SQL Server surface area in addition to Microsoft-supported Spark and HDFS, on-premises. Big Data Cluster enables the SQL Server customer to be able to integrate into analytics/big data. Azure Synapse is purely an analytical platform offering a first class experience for customers as a managed service in the cloud, with a focus on scale out analytics. Azure Synapse is not targeting an operational workload as part of that. Big Data Cluster is aiming to provide in database analytical scenarios, much closer to the operational store.

Question 31

Is SQL Server using HDFS as its storage in SQL Server Big Data Clusters?

Accepted Answer

The SQL Server instance's database files are not stored in HDFS, however, SQL Server can query HDFS using external table interface.

Question 32

What are the available distribution options for storing data in the distributed tables in each data pool?

Accepted Answer

ROUND_ROBIN and REPLICATED. ROUND_ROBIN is the default. HASH is not available.
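
The difference between the two options can be pictured with a small conceptual model. This is only a sketch of the idea, not how SQL Server places rows internally; the function name and instance names are hypothetical.

```python
from itertools import cycle


def distribute(rows, instances, mode="ROUND_ROBIN"):
    """Model which rows land on which data pool instance (conceptual only)."""
    placement = {name: [] for name in instances}
    if mode == "ROUND_ROBIN":
        # Each row goes to exactly one instance, taking turns.
        for row, name in zip(rows, cycle(instances)):
            placement[name].append(row)
    elif mode == "REPLICATED":
        # Every instance holds a full copy of the rows.
        for name in instances:
            placement[name] = list(rows)
    else:
        # HASH and any other mode are not available in the data pool.
        raise ValueError(f"unsupported distribution: {mode}")
    return placement


rows = [1, 2, 3, 4, 5]
instances = ["data-0-0", "data-0-1"]  # hypothetical instance names
print(distribute(rows, instances, "ROUND_ROBIN"))
print(distribute(rows, instances, "REPLICATED"))
```

In this model, ROUND_ROBIN spreads storage and scan work across instances, while REPLICATED trades storage for having every row available locally on each instance.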

Question 33

Does Big Data Cluster have the Spark Thrift Server included? If so, is ODBC endpoint exposed to connect to Hive Metastore tables?

Accepted Answer

We currently expose the Hive Metastore (HMS) via the Thrift protocol. We document the protocol but haven't opened up an ODBC endpoint at this time. You can access it via the Hive Metastore HTTP protocol, for more information see Hive Metastore HTTP Protocol.

Question 34

Is it possible to ingest data from SnowFlake into a Big Data Cluster?

Accepted Answer

SQL Server on Linux (applies to the SQL Server Master instance in Big Data Cluster too) does not support the generic ODBC data source which allows you to install a 3rd party ODBC driver (SnowFlake, DB2, PostgreSQL etc) and query those. This feature is currently available only in SQL Server 2019 (15.0) on Windows. In Big Data Cluster, you can read the data via Spark using JDBC and ingest into SQL Server using the MSSQL Spark Connector.

Question 35

Is it possible to ingest data using a custom ODBC data source into a Big Data Cluster?

Accepted Answer

SQL Server on Linux (applies to SQL Server Master instance in Big Data Cluster too) does not support the generic ODBC data source which allows you to install a 3rd party ODBC driver (SnowFlake, DB2, PostgreSQL etc) and query those.

Question 36

How can you import data to the same table using PolyBase CTAS instead of creating NEW table every time you run the CTAS?

Accepted Answer

You can use the INSERT...SELECT approach to avoid the need for a new table every time.

Question 37

What would be the advantage/considerations to load data into Data pool instead of directly into the Master Instance as local tables?

Accepted Answer

If your SQL Server Master instance has enough resources to satisfy your analytic workload, it is always the fastest option. The data pool helps if you want to offload execution of your distributed queries to other SQL instances. You can also use the data pool to ingest data from Spark executors in parallel into different SQL instances, so load performance for large datasets generated from the Hadoop Distributed File System (HDFS) will typically be better than loading into a single SQL Server instance. However, this is hard to generalize, since you could still have multiple tables in a single SQL Server instance and insert into them in parallel. Performance depends on many factors, and there is no single guidance or recommendation in that regard.

Question 38

How can I monitor the data distribution within the data pool tables?

Accepted Answer

You can use EXECUTE AT to query DMVs like sys.dm_db_partition_stats to get the data in each local table.

Question 39

Is curl the only option to upload files to HDFS?

Accepted Answer

No, you can use azdata bdc hdfs cp. If you provide the root directory, the command recursively copies the whole tree. You can copy in or out with this command by swapping the source and target paths.

Question 40

How can I load data into the data pool?

Accepted Answer

You can use MSSQL Spark connector library to help with SQL and data pool ingestion. For a guided walk-through, see Tutorial: Ingest data into a SQL Server data pool with Spark jobs.

Question 41

If I have a lot of data on a (Windows) network path, which contains lots of folders/sub-folders and text files, how do I upload them to HDFS on Big data cluster?

Accepted Answer

Give azdata bdc hdfs cp a try. If you provide the root directory, the command recursively copies the whole tree. You can copy in or out with this command by swapping the source and target paths.
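
A minimal sketch of assembling such a copy command from a script is shown below. The --from-path and --to-path flag names and the hdfs: target prefix are given as recalled from the azdata reference and should be treated as assumptions to confirm with azdata bdc hdfs cp --help; the share and target paths are hypothetical. The sketch only builds and prints the command and does not run it.

```python
import shlex
from typing import List


def build_hdfs_copy_command(source: str, target: str) -> List[str]:
    """Build the argument list for a recursive copy (flag names are assumptions)."""
    return ["azdata", "bdc", "hdfs", "cp", "--from-path", source, "--to-path", target]


# Copy a Windows network folder into HDFS (both paths are hypothetical).
upload = build_hdfs_copy_command(r"\\fileserver\share\logs", "hdfs:/data/logs")
print(shlex.join(upload))

# Swapping the two paths copies out of HDFS instead.
download = build_hdfs_copy_command("hdfs:/data/logs", r"C:\temp\logs")
print(shlex.join(download))
```

Keeping the arguments as a list means the command could later be passed to subprocess.run without shell quoting issues.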

Question 42

Is it possible to increase the size of the storage pool on a deployed cluster?

Accepted Answer

There is no azdata interface to perform this operation at this time. You have the option to resize desired PVCs manually. Resizing is a complex operation, see Persistent Volumes in Kubernetes Documentation.

Question 43

When should I use linked servers vs PolyBase?

Accepted Answer

See main differences and use cases here: PolyBase FAQ.

Question 44

What are the supported data virtualization sources?

Accepted Answer

Big Data Cluster supports data virtualization from ODBC sources – SQL Server, Oracle, MongoDB, Teradata, etc. It also supports tiering of remote stores such as Azure Data Lake Store Gen2 and S3-compatible storage, as well as AWS S3A and the Azure Blob File System (ABFS).

Question 45

Can I use PolyBase to virtualize data stored in an Azure SQL database?

Accepted Answer

Yes, you can use PolyBase in Big Data Cluster to access data in Azure SQL Database.

Question 46

Why do the CREATE TABLE statements include the key word EXTERNAL? What does EXTERNAL do differently than the standard CREATE TABLE?

Accepted Answer

In general, the EXTERNAL keyword implies that the data is not in the SQL Server instance. For example, you can define a storage pool table on top of an HDFS directory. The data is stored in HDFS files, not in your database files, but the external table provides an interface to query the HDFS files as a relational table, as if the data were in the database.
This concept of accessing external data is called data virtualization. For more information, see Introducing data virtualization with PolyBase. For a tutorial on virtualizing data from CSV files in HDFS, see Virtualize CSV data from storage pool Big Data Clusters.

Question 47

What are the differences between data virtualization using SQL Server running within SQL Server Big Data Clusters vs SQL Server?

Accepted Answer

For a comparison, see PolyBase in Big Data Clusters vs. PolyBase in stand-alone instances.

Question 48

How can I easily tell that an external table is pointing to data pool vs storage pool?

Accepted Answer

You can determine the type of an external table by looking at the location prefix of its data source, for example, sqlserver://, oracle://, sqlhdfs://, or sqldatapool://.
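
The prefix check can be sketched as a small lookup. The mapping below covers only the four prefixes listed in this answer, and the function name and sample location strings are illustrative assumptions; in practice the location would come from the data source definition in the database.

```python
# Prefixes taken from the answer above; anything else is reported as unknown.
PREFIX_TO_KIND = {
    "sqldatapool://": "data pool",
    "sqlhdfs://": "storage pool",
    "sqlserver://": "SQL Server",
    "oracle://": "Oracle",
}


def classify_location(location: str) -> str:
    """Return the kind of source an external data source location points to."""
    lowered = location.strip().lower()
    for prefix, kind in PREFIX_TO_KIND.items():
        if lowered.startswith(prefix):
            return kind
    return "unknown"


# Sample locations (illustrative values only).
for loc in ("sqldatapool://controller-svc/default", "sqlhdfs://controller-svc/default"):
    print(loc, "->", classify_location(loc))
```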

Question 49

My Big Data Cluster deployment failed. How do I see what went wrong?

Accepted Answer

See Manage SQL Server Big Data Clusters with Azure Data Studio notebooks. Also see the troubleshooting topics in Troubleshoot Kubernetes.

Question 50

Is there a definitive list of everything that can be set in the Big Data Cluster config?

Accepted Answer

All the customizations that can be done at deployment time are documented here in Configure deployment settings for cluster resources and services. For Spark, see Configure Apache Spark and Apache Hadoop in Big Data Clusters.

Question 51

Can we deploy SQL Server Analysis Services together with SQL Server Big Data Clusters?

Accepted Answer

No. Specifically, SQL Server Analysis Services (SSAS) is not supported on SQL Server on Linux, so you will have to install a SQL Server instance on Windows server to run SSAS.

Question 52

Is Big Data Cluster supported for deployment in EKS or GKE?

Accepted Answer

Big Data Cluster can run on any Kubernetes stack based on version 1.13 or higher. However, we have not performed specific validations of Big Data Cluster on EKS or GKE.
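
A quick way to check the version requirement is to compare version numbers as integers rather than as strings, since a plain string comparison would rank "1.9" above "1.13". The sketch below assumes version strings of the form v1.13.5 and uses hypothetical function names.

```python
import re

MIN_VERSION = (1, 13)  # minimum Kubernetes version stated in the answer


def parse_k8s_version(text: str):
    """Extract (major, minor) from a string such as 'v1.13.5' or '1.21.2-eks-0389ca3'."""
    match = re.match(r"v?(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(text: str, minimum=MIN_VERSION) -> bool:
    """Return True if the version is at or above the minimum (numeric comparison)."""
    return parse_k8s_version(text) >= minimum


for version in ("v1.13.5", "1.9.0", "v1.21.2-eks-0389ca3"):
    print(version, meets_minimum(version))
```

Passing this check only confirms the version floor; it says nothing about whether a given distribution has been validated.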

Question 53

What versions of HDFS and Spark are running within Big Data Cluster?

Accepted Answer

Spark is 2.4 and HDFS is 3.2.1. For complete details on the open-source software included in Big Data Cluster, see Open-source software reference.

Question 54

How do I install libraries and packages in Spark?

Accepted Answer

You can add packages at job submission using the steps in the sample notebook for installing packages in Spark.

Question 55

Do I need to use SQL Server 2019 to use R and Python for SQL Server Big Data Clusters?

Accepted Answer

Machine Learning (ML) Services (R and Python) is available beginning in SQL Server 2017. ML Services is available in SQL Server Big Data Clusters as well. For more information, see What is SQL Server Machine Learning Services with Python and R?.

Question 56

How do SQL Server licenses work for SQL Server Big Data Clusters?

Accepted Answer

Refer to the licensing guide, which goes into much more detail (download the PDF).
For a summary, watch the video SQL Server Licensing: Big Data Clusters | Data Exposed.

Question 57

Does Big Data Cluster support Microsoft Entra ID (formerly Azure Active Directory)?

Accepted Answer

Not at this time.

Question 58

Can we connect to Big Data Cluster master using integrated authentication?

Accepted Answer

Yes, you can connect to various Big Data Cluster services using integrated authentication (with Active Directory). For more information, see Deploy SQL Server Big Data Cluster in Active Directory mode. Also see Security concepts for Big Data Clusters.

Question 59

How can I add new users for various services within Big Data Cluster?

Accepted Answer

In basic authentication mode (username/password), there is no support for adding multiple users for the controller or Knox gateway/HDFS endpoints. The only user supported for these endpoints is root. For SQL Server, you can add users using Transact-SQL as you would for any other SQL Server instance. If you deploy Big Data Cluster with AD authentication for its endpoints, multiple users are supported. For details on how to configure the AD groups at deployment time, see Deploy SQL Server Big Data Cluster in Active Directory mode.

Question 60

For Big Data Cluster to pull the latest container images, is there an outbound IP range I can restrict?

Accepted Answer

You can review the IP addresses used by the various services in Azure IP Ranges and Service Tags – Public Cloud. Note that these IP addresses rotate periodically.
In order for the controller service to pull the container images from the Microsoft Container Registry (MCR) you'll need to grant access to the IP addresses specified in the MicrosoftContainerRegistry section. Another option is to set up a private Azure Container Registry and configure the Big Data Cluster to pull from there. In that case you'll need to expose the IP addresses specified in the AzureContainerRegistry section. Instructions on how to do this and a script are provided in Perform an offline deployment of a SQL Server big data cluster.

Question 61

Can I deploy Big Data Cluster in an air-gapped environment?

Accepted Answer

Yes, for more details see Perform an offline deployment of a SQL Server big data cluster.

Question 62

Does the "Azure Storage encryption" feature also apply by default to AKS-based big data clusters?

Accepted Answer

This depends on the dynamic storage provisioner configuration in Azure Kubernetes Service (AKS). For more details, see Best practices for storage and backups in Azure Kubernetes Service (AKS).

Question 63

Can I rotate the keys for SQL Server and HDFS encryption in Big Data cluster?

Accepted Answer

Yes. For more information, see Key versions in Big Data Cluster.

Question 64

Can I rotate the passwords of autogenerated Active Directory objects?

Accepted Answer

Yes, you can rotate the passwords of autogenerated Active Directory objects with a feature introduced in SQL Server Big Data Clusters CU13. For more information, see AD password rotation.

SQL Server Big Data Clusters FAQ

Best practices