How to query JSON files stored in Azure Data Lake from SQL Server using External Data Source and Openrowset

JasonW-5564 161 Reputation points
2021-01-11T21:54:32.45+00:00

Please help! I am trying to query JSON documents stored in Azure Data Lake Storage Gen2 from SQL Server 2019 using an external data source and OPENROWSET, similar to these examples:

https://learn.microsoft.com/en-us/archive/blogs/sqlserverstorageengine/bulk-importing-json-files-into-sql-server
https://learn.microsoft.com/en-us/sql/relational-databases/json/import-json-documents-into-sql-server?view=sql-server-ver15

but I am getting errors, as if this syntax does not work the same way in SQL Server 2019 (Standard Edition). Please see my details below:

First, my Azure data lake folder structure is as follows:
Environment/Year/Month/Hour
i.e.: environment=production/year_added=????/month_added=??/hour_added=??/<lots of .json files here>

I am getting these errors:
Msg 102, Level 15, State 1, Line 46
Incorrect syntax near 'fieldterminator'.
Msg 319, Level 15, State 1, Line 49
Incorrect syntax near the keyword 'with'. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be terminated with a semicolon.

Here is my code:
CREATE MASTER KEY ENCRYPTION BY PASSWORD ='************'
GO

CREATE DATABASE SCOPED CREDENTIAL ReflectDataLake
WITH
IDENTITY = 'SHARED ACCESS SIGNATURE',
-- Remove ? from the beginning of the SAS token
SECRET = '******srt=sco&sp=rwac&se=2017-02-01T00:55:34Z&st=2016-12-29T16:55:34Z***************' ;
GO
CREATE EXTERNAL DATA SOURCE ReflectDataLake
WITH
(--LOCATION = 'https://newinvoices.blob.core.windows.net/week3' ,
LOCATION = 'https://mycompanydata.blob.core.windows.net/translogs/environment%3Dproduction' ,
CREDENTIAL = ReflectDataLake ,
TYPE = BLOB_STORAGE
) ;
GO

--Query
select top 10 *
from openrowset(
bulk 'year_added%3D2021/month_added%3D1/day_added%3D11/hour_added%3D11/part-00364-tid-3791755914132760586-da93543b-b80f-4a6c-b31e-53dfb2bce992-7591-3.c000.json' ,
data_source = 'ReflectDataLake',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b',
rowterminator = '0x0b' --> You need to override rowterminator to read classic JSON
) with (doc nvarchar(max)) as rows

I am hoping to get to the point where I can point to a folder, use the * wildcard, and query across all .json files in that folder. I am also hoping to avoid using the command-line NET USE utility to map the storage as a network share.


2 answers

  1. Erland Sommarskog 100.9K Reputation points MVP
    2021-01-11T22:36:09.897+00:00

    There are two errors in your OPENROWSET statement. You cannot use FIELDTERMINATOR etc. with OPENROWSET(BULK); for that you need a format file. That is, if you are importing a file with multiple records in it. But if you are reading a single JSON document, you would just use the option SINGLE_CLOB, like the examples in the documentation you linked to.

    The part

    with (doc nvarchar(max)) as rows
    

    has nothing to do with OPENROWSET. It appears that you mixed it up with OPENJSON, where you can use a WITH clause to specify the columns you shred the document into. With OPENROWSET you must supply a table alias:

    FROM OPENROWSET (BULK ...) AS T
    

    and if you have SINGLE_CLOB, also a column alias:

    FROM OPENROWSET (BULK ..., SINGLE_CLOB) AS T(c)
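
    Putting the pieces together, a minimal sketch (the file path and column names are made up for illustration; the data source is the one from your question): read the file as a single document with SINGLE_CLOB, then shred the string with OPENJSON:

    SELECT j.field1, j.field2
    FROM OPENROWSET(
            BULK 'year_added%3D2021/month_added%3D1/day_added%3D11/somefile.json',  -- hypothetical path
            DATA_SOURCE = 'ReflectDataLake',
            SINGLE_CLOB
         ) AS T(doc)
    CROSS APPLY OPENJSON(T.doc)
         WITH (field1 varchar(50) '$.field1',
               field2 datetime2   '$.field2') AS j;

    Note that this only works when each file holds one well-formed JSON document (a single object or an array); OPENJSON rejects anything else.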
    

  2. JasonW-5564 161 Reputation points
    2021-01-14T02:34:03.947+00:00

    Thanks @Erland Sommarskog for the reply. I am unfamiliar with those various options and am new to this whole process.

    I tried this:

    select *
    from openrowset(
    bulk '....json' ,
    data_source = 'ReflectDataLake',
    format = 'csv'
    --, SINGLE_CLOB
    --fieldterminator ='0x0b',
    ,fieldquote = '0x0b'
    ,rowterminator = '0x0b'
    ) as T

    With the SINGLE_CLOB option it returns the entire file as a one-line string and does not parse the JSON. When I comment that out and enable format='csv' and the rest of the options, I get "incorrect syntax near" fieldterminator, fieldquote, etc. I cannot figure out how to get it to parse the JSON. Any ideas? I have tried everything I can think of.

    FYI-
    My folder structure in my blob container is similar to /yyyy/mm/dd/HH/; assume 1,000 .json files in each of the HH-level folders. The JSON in each file looks something like this:

    {"field1": "Value1", "field2": "2021-01-11T11:56:41.056Z", "field3": "ValueA", "field4": "2021-02-11T11:56:41.056Z",}
    {"field1": "Value2", "field2": "2021-01-11T11:56:41.057Z", "field3": "ValueB", "field4": "2021-02-11T11:56:41.057Z",}
    {"field1": "Value3", "field2": "2021-01-11T11:56:41.058Z", "field3": "ValueC", "field4": "2021-02-11T11:56:41.058Z",}
    {"field1": "Value4", "field2": "2021-01-11T11:56:41.059Z", "field3": "ValueD", "field4": "2021-02-11T11:56:41.059Z",}
    .....
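
    Since each file holds one JSON object per line rather than a single document or array, here is a workaround sketch I am looking at (assuming each line really is a complete object; the file path and columns are placeholders): read the whole file with single_clob, split it on line feeds, and pass each line to openjson:

    select j.field1, j.field2
    from openrowset(
        bulk 'year_added%3D2021/month_added%3D1/day_added%3D11/hour_added%3D11/somefile.json',  -- placeholder path
        data_source = 'ReflectDataLake',
        single_clob
    ) as T(doc)
    cross apply string_split(replace(T.doc, char(13), ''), char(10)) as s  -- one row per line
    cross apply openjson(s.value)
        with (field1 varchar(50) '$.field1',
              field2 datetime2   '$.field2') as j
    where len(s.value) > 0;  -- skip empty trailing lines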

    How can I query these files, or better yet, all files in a specific folder (i.e., *.json)?
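
    From what I have read, OPENROWSET(BULK) in SQL Server 2019 takes a literal file path and does not accept wildcards, so for "all files in a folder" the only sketch I can come up with is dynamic SQL over a list of file names (dbo.FileList and dbo.RawDocs are hypothetical tables here; FileList would have to be populated outside SQL Server, e.g. from az storage blob list output):

    declare @path nvarchar(400), @sql nvarchar(max);

    declare c cursor local fast_forward for
        select BlobPath from dbo.FileList;  -- one row per .json blob in the folder
    open c;
    fetch next from c into @path;
    while @@fetch_status = 0
    begin
        -- SINGLE_CLOB exposes the file contents as a column named BulkColumn
        set @sql = N'insert into dbo.RawDocs (doc)
                     select BulkColumn
                     from openrowset(bulk ''' + replace(@path, '''', '''''') + N''',
                          data_source = ''ReflectDataLake'', single_clob) as T;';
        exec sys.sp_executesql @sql;
        fetch next from c into @path;
    end
    close c;
    deallocate c;

    Once the raw documents are in dbo.RawDocs, the string_split/openjson pattern above can shred them into columns.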