question

ChitraMarimuthu-5718 avatar image
0 Votes"
ChitraMarimuthu-5718 asked PRADEEPCHEEKATLA-MSFT commented

Azure databrciks pyspark code needs support based on the file extension received in datalake

In my azure data lake everyday I will receive any one of the below file (only the file extension will vary but the file name remains same).

File Name:
Receipts_currendate.xlsx(extension is in small letter) or Receipts_currentdate.XLSX(extension is in Caps letter).

In my databricks the current pyspark code reads a file from a data lake (through ADL mount)and is supported only for the file extension in Caps letter .but I need the code needs to support for the file extension with small letter as well .

Current code in databricks :

import os
from pyspark.sql.types import
from pyspark.sql.functions import

import pandas as pd
from datetime import date,datetime,timedelta
from pyspark.sql.window import Window
from pyspark.sql import SQLContext


currentdate=(date.today().strftime('%m_%d_%Y')) #Filepath='/dbfs'+var_raw_path+'Receipts/receipts_02_03_2021.XLSX' Filepath='/dbfs'+var_raw_path+'WFCReceipts/receipts_'+currentdate+'.XLSX'
df = pd.read_excel(Filepath,index_col=None,usecols=[*range(0, 9)])

azure-databricks
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

1 Answer

PRADEEPCHEEKATLA-MSFT avatar image
0 Votes"
PRADEEPCHEEKATLA-MSFT answered PRADEEPCHEEKATLA-MSFT commented

Hello @ChitraMarimuthu-5718,

Welcome to the Microsoft Q&A platform.

As per my repro, I'm able to read the excel file with file extension in small/Caps letter.

I had created two excel file with file extension in small & Caps letter in ADLS gen1 storage account.

  • Receipts_currendate.xlsx(extension is in small letter)

  • Receipts_currentdate.XLSX(extension is in Caps letter)

Note: Installed openpyxl package to read the excel files using pandas.

104194-image.png

Reason for install openpyxl package.

Pandas uses the xlrd as their default engine for reading excel files. However, xlrd has removed support for anything other than xls files in their latest release.

To solve this, do the following:

  • Install openpyxl: This is another excel package that still supports the xlsx format

  • Set the engine to “openpyxl” instead of the default “xlrd”

104241-image.png

Able to read the excel files with file extension in small/Caps letter using Azure Databricks.

104242-image.png

Hope this helps. Do let us know if you any further queries.


Please "Accept the answer" if the information helped you. This will help us and others in the community as well.


image.png (108.9 KiB)
image.png (69.5 KiB)
image.png (65.6 KiB)
· 3
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Hi PRADEEPCHEEKATLA,

Thank you for your answer. Actually in my ADLS some time user is placing the receipts file with the extension in Small letter instead of Caps letter .In databricks we had written a code to handle the file with extension in Caps letter only. But user is saying databricks should accept both .xlsx and .XLSX since both are the same extension type. How to modify the code to handle the same file with upper/lowercase sensitivity


Filename:
Receipts.XLSX or xlsx--->the file extension only varies but file name remains same.




104285-file-in-adls.png104138-databricks-code1.png


0 Votes 0 ·
file-in-adls.png (125.1 KiB)

Hello @ChitraMarimuthu-5718,

As per the above answer, I'm able to read the excel files with file extension in small/Caps letter using Azure Databricks.


104656-image.png


0 Votes 0 ·
image.png (65.6 KiB)

Hello @ChitraMarimuthu-5718,

Following up to see if the above suggestion was helpful. And, if you have any further query do let us know.

0 Votes 0 ·