How to query JSON files stored in Azure Data Lake from SQL Server using External Data Source and Openrowset

Question

Please help! I am trying to query JSON documents stored in Azure Gen 2 data lake storage from SQL Server 2019 using external data source and openrowset similar to these examples:

https://learn.microsoft.com/en-us/archive/blogs/sqlserverstorageengine/bulk-importing-json-files-into-sql-server
https://learn.microsoft.com/en-us/sql/relational-databases/json/import-json-documents-into-sql-server?view=sql-server-ver15

but getting an error as if this syntax does not seem to work the same in SQL Server 2019 (Standard Edition). Please see my info below:

First, my Azure data lake folder structure is as follows :
Environment/Year/Month/Hour
i.e.: environment=production/year_added=????/month_added=??/hour_added=??/

Getting an error:
Msg 102, Level 15, State 1, Line 46
Incorrect syntax near 'fieldterminator'.
Msg 319, Level 15, State 1, Line 49
Incorrect syntax near the keyword 'with'. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be terminated with a semicolon.

Here is my code:
CREATE MASTER KEY ENCRYPTION BY PASSWORD ='************'
GO

CREATE DATABASE SCOPED CREDENTIAL ReflectDataLake
WITH
IDENTITY = 'SHARED ACCESS SIGNATURE',
-- Remove ? from the beginning of the SAS token
SECRET = '******srt=sco&sp=rwac&se=2017-02-01T00:55:34Z&st=2016-12-29T16:55:34Z***************' ;
GO
CREATE EXTERNAL DATA SOURCE ReflectDataLake
WITH
(--LOCATION = 'https://newinvoices.blob.core.windows.net/week3' ,
LOCATION = 'https://mycompanydata.blob.core.windows.net/translogs/environment%3Dproduction' ,
CREDENTIAL = ReflectDataLake ,
TYPE = BLOB_STORAGE
) ;
go

--Query
select top 10 *
from openrowset(
bulk 'year_added%3D2021/month_added%3D1/day_added%3D11/hour_added%3D11/part-00364-tid-3791755914132760586-da93543b-b80f-4a6c-b31e-53dfb2bce992-7591-3.c000.json' ,
data_source = 'ReflectDataLake',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b',
rowterminator = '0x0b' --> You need to override rowterminator to read classic JSON
) with (doc nvarchar(max)) as rows

I am hoping to get to a point where I can point to a folder and use the * wildcard and query across all .JSON files in that folder and also hoping to not have to use the command line NET utility to map as a network share drive.

Answer

There are two errors in your OPENROWSET statement. You cannot use FIELDTERMINATOR etc with OPENROWSET(BULK), you need to have a format file. That is, if you are importing a file with multiple records in it. But if you are looking for a single JSON document, you would just use the option SINGLE_CBLOB like the examples in the documentation you linked to.

The part

with (doc nvarchar(max)) as rows

has nothing to do with OPENROWSET. It appears that you mixed it up with OPENJSON where you can use a WITH clause to specify the columns you shred the document to. With OPENROWSET you must supply a table alias_

FROM OPENROWSET (BULK ...) AS T

and if you have SINGLE_CLOB, also a column alias:

FROM OPENROWSET (BULK ..., SINGLE_CLOB) AS T(c)

Answer

Thanks @Erland Sommarskog for the reply. I am unfamiliar with those various options and new to this whole process.

I tried this:

select *
from openrowset(
bulk '....json' ,
data_source = 'ReflectDataLake',
format = 'csv'
--, SINGLE_CLOB
--fieldterminator ='0x0b',
,fieldquote = '0x0b'
,rowterminator = '0x0b'
) as T

Using the SINGLE_CLOB option it will return the entire file as a one line string and not parse the JSON.

Correct. You would do:

SELECT @json = T.c
FROM OPENROWSET (BULK '<path'>, SINGLE_CLOB) AS T(c)

and then you would run a query like

SELECT ..
FROM  OPENJSON (@json)
WITH ...

to extract the data out of the JSON document. This seems to be covered in the pages you linked to in your original post.

FYI-
My folder structure in my BLOB container is similar to something like this /yyyy/mm/dd/HH/ then assume 1000 .json files in each of the HH level folders. The JSON in each file looks something like this format:

{"field1": "Value1", "field2": "2021-01-11T11:56:41.056Z", "field3": "ValueA", "field4": "2021-02-11T11:56:41.056Z",}
{"field1": "Value2", "field2": "2021-01-11T11:56:41.057Z", "field3": "ValueB", "field4": "2021-02-11T11:56:41.057Z",}
{"field1": "Value3", "field2": "2021-01-11T11:56:41.058Z", "field3": "ValueC", "field4": "2021-02-11T11:56:41.058Z",}
{"field1": "Value4", "field2": "2021-01-11T11:56:41.059Z", "field3": "ValueD", "field4": "2021-02-11T11:56:41.059Z",}
.....

How can I query these files, or even better yet all files in a specific folder (i.e: *.json)?

How to query JSON files stored in Azure Data Lake from SQL Server using External Data Source and Openrowset

2 answers