MPP, DWU and PolyBase in Azure SQL Data warehouse

Understanding some basic concepts in Azure SQL data warehouse can accelerate getting good grips on the functionality it offers.  Discussed below are some key terms.

MPP: Unlike the previous incarnation of on-premise SQL Server Data warehouse which uses SMP (Symmetric Multi Processing), Azure SQL Data warehouse is designed using MPP (Massively Parallel Processing).  In essence this design co-ordinates processing 1 task by multiple logical processors. Each logical processor consists of its own CPU and memory. Processors communicate with each other.

These processors can also be referred as nodes. Nodes can be classified into following 2 types.

Control node: Creates parallel query plan, co-ordinates query execution, data aggregation, etc.
Compute node: Nodes that do actual computation.

MPP has 2 design models -

  1. Share Nothing
  2. Shared Disk

In Share Nothing model,  used by Azure SQL Data warehouse, each node is independent and has its own data. This data is subset of rows of a table in Data warehouse. This enables massive scalability. Speaking of scalability, Azure SQL Data warehouse expresses it in terms of Data Warehouse Unit (DWU).

DWU: In essence, DWU is a function of memory, CPU and concurrency. Basic DWU, DW100 can have upto 24GB of RAM with lesser concurrency
1 DWU is approximately 7.5 DTU (Database Throughput Unit, used to express the horse power of an OLTP Azure SQL Database) in capacity although they are not exactly comparable.

 

Another important concept is the technology called as PolyBase.

PolyBase: It provides a scalable, T-SQL compatible query processing framework for combining data from RDBMS and Hadoop (or Azure Blob Storage) . It abstracts away many of the MapReduce/Hadoop technologies for SQL developers who are comfortable in the realms of  T-SQL.

As its name implies, it enables to query from or store at multiple (poly) places (base).  Its the actual engine that parallelize the query. While internally it does a lot of optimization, developers can use DMV (Data Management View) to monitor/query plan.