question

AlexLeBlanc-8988 asked:

Reading Avro format in Synapse Analytics

I am trying to read and process Avro files from ADLS using a Spark pool notebook in Azure Synapse Analytics. The Avro files are capture files produced by Event Hubs.

When I run df = spark.read.format("avro").load(<file path>) as I would in Databricks, I get the following error:
"
AnalysisException : 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 166, in load
return self.df(self.jreader.load(path))
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in
call

answer, self.gateway_client, self.target_id, self.name)
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
"

I have also tried creating a "dataset" with a linked service, but no luck with that either.
I have tried adding spark-avro_2.12 as a package, but I can't seem to install it; I can only install Python packages to my Spark pool.

Is there currently a way to read Avro files within Synapse Analytics? If not, are there plans to have Avro read capabilities built in in the near future? What other methods can I use to read Avro for the time being?

Any and all help is much appreciated, thank you!

azure-synapse-analytics

HimanshuSinha-MSFT answered:

Hello Alex,
Thanks for the question, and thanks for using the forum.

While I reach out to the Synapse team, I wanted to let you know that you can use Azure Data Factory (ADF) to read the Avro files. The doc below should give you some idea of the implementation; let me know if you have any questions.

https://docs.microsoft.com/en-us/azure/data-factory/format-avro
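For illustration, an ADF dataset pointing at Avro capture files in ADLS Gen2 might look roughly like the following JSON (a sketch only; the dataset name, linked service name, file system, and folder path are all hypothetical placeholders, not values from this thread):

```json
{
    "name": "AvroCaptureDataset",
    "properties": {
        "type": "Avro",
        "linkedServiceName": {
            "referenceName": "AdlsGen2LinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "capture-container",
                "folderPath": "eventhub-namespace/eventhub-name/0"
            }
        }
    }
}
```

The linked doc describes the full set of Avro dataset properties, including the optional compression codec.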


Thanks Himanshu

Please consider clicking "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members.


AlexLeBlanc-8988 answered:

Thank you @HimanshuSinha-MSFT for your reply.

If I understand correctly, at this time it is not possible to do a simple spark.read.format("avro") (as I would in Databricks), correct? But the feature may be available in the future?


And to clarify: by using Azure Data Factory, do you mean the separate ADF service, or the one integrated into Synapse? (Note: we are currently exploring Synapse and its viability for our solution. If we adopt it, we would want to use only the built-in data factory, not the external service.) I have already tried creating a dataset within Synapse, but I get this error message:

"Column: SystemProperties, Location: Source, Format: Avro, The data type 'System.Collections.Generic.Dictionary`2[System.String,System.Object]' is currently not supported by Avro format."

Any idea why I might be getting this error? Is Avro not supported for ADF datasets in Synapse either? If not, will it be in the future?

When searching this error, I found these forum threads, which seem to suggest that complex types in the Avro format are not supported. Has this been addressed since then?
17373472-support-more-complex-types-in-avro-format-like-di
error-importing-avro-file-generated-by-event-hubs-archive-using-copy-data-tool-i


Thank you,

Alex



Hi @HimanshuSinha-MSFT

As a quick follow up:

I have looked deeper into the documentation link you provided. It says that the Copy activity does not support Avro complex data types, but that these complex types can be read using data flows. It does not mention whether complex data types are supported by datasets. Is it safe to assume that they are not supported as of today?

So when you suggest using Data Factory, do you mean using data flows? I am a bit confused by the distinction between data flows and datasets: when I try to create a data flow, it requires that I read my data as a dataset, and previewing the data while creating that dataset gives me the error above, so I don't see how creating the data flow will work.

Thanks again,

Alex


Hello,
My sincere apologies for the delay in responding on my side.
Yes, at this time Azure Data Factory does not support complex types, but data flow does.
Please follow the link here:
https://docs.microsoft.com/en-us/azure/data-factory/format-avro#data-type-support

Thanks Himanshu

84039204 answered:

Azure Databricks easily reads Avro files:

%python
# Read the Event Hubs capture files (the path pattern follows the capture folder layout)
df = spark.read.format("avro").load("<path to your event hub>/0/2021/05/*/*/*/*.avro")
# The event payload is stored in the Body column as bytes: cast it to a string
js = df.select(df.Body.cast("string")).rdd.map(lambda x: x[0])
# Parse each payload string as JSON
data = spark.read.json(js)
display(data)
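The per-record transformation that the snippet above applies to each capture row can be illustrated with plain Python. Event Hubs capture stores the original event in the Body field as bytes; the payload below is a hypothetical example, not data from this thread:

```python
import json

# Hypothetical payload, as it would appear in a capture record's Body field (bytes)
body = b'{"deviceId": "sensor-1", "temperature": 21.5}'

# Mirror df.Body.cast("string") followed by spark.read.json:
# decode the bytes to a string, then parse the string as JSON
event = json.loads(body.decode("utf-8"))
print(event["deviceId"])  # sensor-1
```

The same decode-then-parse step is what the Spark job distributes across all capture files.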
