question

5 Votes
MuhammadZainal-1174 asked Tka32-3742 answered

AzureML Datetime Issue

Hi,

I am running into an issue with retaining the datetime values in datasets that I have uploaded to AzureML.
The issue can be reproduced with the following steps:

  1. Create a pandas dataframe with a column of datetime strings and parse them accordingly

    import pandas as pd

    d = {"Date": ["2020-03-06", "2021-01-05", "2016-01-30", "2019-12-14"]}
    df = pd.DataFrame(data=d)
    df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")

104079-image.png

  2. Save this dataframe as a .parquet

  3. Upload to Azure Blob

  4. Create a Tabular Dataset object with the uploaded file

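    from azureml.core import Dataset, Workspace
    workspace = Workspace.from_config()  # assumption: a workspace config.json is available
    datastore = workspace.get_default_datastore()
    datastore_path = [(datastore, "filename.parquet")]
    azureml_df = Dataset.Tabular.from_parquet_files(path=datastore_path)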

Printing the dataframe results in the following:
104157-image.png
The datetime values are now different.
To investigate further, we can cast the datetime column to int:
104181-image.png
which gives us a 15-digit number.

We can also cast the original df to int:
104119-image.png
which instead gives us an 18-digit number.

This 18-digit number is the number of nanoseconds since the UNIX epoch. Creating the Tabular Dataset object through azureml-sdk strips three trailing zeroes from it (in effect, the value is divided by 1000), so an incorrect datetime is read. If you download the parquet file directly from Azure Blob, the values are still intact, which suggests the issue lies with AzureML and possibly with the Dataset method from_parquet_files. A simple workaround is to multiply this column by 1000 and convert it back to datetime, but I would like to know whether I'm missing a step when reading the parquet from AzureML or whether the problem is on Azure's side.
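
The arithmetic described above can be sketched locally without AzureML. This is only a simulation under the assumption that the integer value is divided by 1000 somewhere and then read back as nanoseconds; it does not show what the SDK actually does internally. The variable names are illustrative.

```python
import pandas as pd

# Same dates as in the reproduction steps, pinned to nanosecond resolution.
dates = pd.Series(["2020-03-06", "2021-01-05", "2016-01-30", "2019-12-14"])
original = pd.to_datetime(dates, format="%Y-%m-%d").astype("datetime64[ns]")

# Nanoseconds since the UNIX epoch, as obtained by casting to int.
ns_since_epoch = original.astype("int64")

# Assumption: the value loses three trailing zeroes (divided by 1000)
# and is then interpreted as nanoseconds again.
misread = pd.to_datetime(ns_since_epoch // 1000, unit="ns")

# Workaround from the question: multiply by 1000 and convert back.
restored = pd.to_datetime(
    misread.astype("datetime64[ns]").astype("int64") * 1000, unit="ns"
)

print(misread)
print(restored)
```

In this simulation the misread values all land in January 1970, and the restored column matches the original dates again.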

Regards,
Muhammad


azure-machine-learning

1 Vote
romungi-MSFT answered romungi-MSFT commented

@MuhammadZainal-1174 Thanks for the detailed explanation of the issue. I have tried to replicate it with the exact steps, but the date does not change to a different value as it does in your case. Here are the steps:

104253-image.png

With exactly the same steps, the date stays consistent.

104195-image.png

Maybe there is an issue with one of the SDK versions. Which version of the SDK are you using?
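
One way to collect the versions in question is sketched below. It assumes Python 3.8 or later (for `importlib.metadata`); `installed_versions` is a hypothetical helper name, and the package list is only a guess at what is relevant to this thread.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages mentioned in this thread; adjust the list as needed.
for name, version in installed_versions(
    ["azureml-core", "azureml-dataset-runtime", "pandas", "pyarrow"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in both the environment that shows the problem and one that does not makes it easier to compare the two.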



@romungi-MSFT Thanks for the prompt reply. I am using conda 4.10.1 to build my environment from the following environment.yml file.

 name: datetime_env
 channels:
   - defaults
 dependencies:
   - python=3.7
   - pandas=1.2.2
   - notebook==6.4.0
   - pip=21.1.1
   - pip:
       - azureml-core==1.30.0
       - azureml-dataset-runtime==1.30.0
       - catboost==0.26

@MuhammadZainal-1174 I have tried to replicate the scenario with a similar environment but did not see the same behavior. I would recommend posting the same details, or a reference to this thread, on the issues page of the Azure Python SDK repo so that the SDK team can take a look at it. Thanks!


0 Votes
RianFinnegan-2787 answered RianFinnegan-2787 published

I'm seeing the same issue running PythonScriptStep in AzureML Pipelines.

I suspect it has something to do with the PyArrow representation of datetimes.


1 Vote
HungNguyenThanh-8753 answered HungNguyenThanh-8753 edited

I'm facing the same problem. I'm using PythonScriptStep to create a pipeline and PipelineData to get the output of the pipeline. The output has the right datetime, but once I register that output data as a dataset in AzureML, the datetime is incorrect when I read it back.

166286-image.png

166278-image.png


As for the environment, I specified the following:

pyarrow 3.0.0
pandas 0.25.3
azureml-core 1.34.0
python 3.6.9







1 Vote
Tka32-3742 answered

I have the same issue when running the following code on the ML notebook.
177840-screenshot-2022-02-25-at-163144.png



It was fine until yesterday and suddenly started happening today.
We can get the expected timestamp by converting the column to datetime64 and then multiplying by 1000, but we would like to know why this happens, and we don't want unexpected behavior like this to appear in the future or inside the pipeline.
Please investigate whether something has changed in the AML dataset, in how the datatype is cast, etc.
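
Until the cause is known, the multiply-by-1000 workaround can be guarded so that it only applies when the column actually looks scaled down. The sketch below is an illustration, not an official fix: `rescale_if_truncated` and its `cutoff` parameter are hypothetical, and it assumes that no genuine timestamps in the data fall before the cutoff.

```python
import pandas as pd


def rescale_if_truncated(series, cutoff="1971-01-01"):
    """Multiply timestamps by 1000 only if every value is before `cutoff`.

    Hypothetical helper. Assumes real data never contains dates before
    `cutoff`; a column with missing values is returned unchanged.
    """
    ts = pd.to_datetime(series).astype("datetime64[ns]")
    if len(ts) > 0 and (ts < pd.Timestamp(cutoff)).all():
        # Values look like microseconds read as nanoseconds: scale back up.
        return pd.to_datetime(ts.astype("int64") * 1000, unit="ns")
    return ts


# Simulated "bad" column: microsecond counts interpreted as nanoseconds.
bad = pd.to_datetime(pd.Series([1583452800000000, 1609804800000000]), unit="ns")
print(rescale_if_truncated(bad))

# A column that is already correct passes through unchanged.
good = pd.to_datetime(pd.Series(["2020-03-06", "2021-01-05"]))
print(rescale_if_truncated(good))
```

Because the check is a no-op on correct data, the same call can stay in a pipeline whether or not the dataset read returns scaled-down values.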

