question

AravindYarram-8844 avatar image
1 Vote"
AravindYarram-8844 asked HakanAbdulovski-6931 commented

StreamAccessException was caused by UnexpectedException. Too many open files in system

I have the ADLSGen2 registered as a DataStore and trying to access the excel data from one of its folders from the Notebook. Following is the code

 from azureml.data.datapath import DataPath
 from azureml.data.data_reference import DataReference
 from azureml.core import Workspace, Datastore, Dataset
    
 ws = get_ws()
 dstore_name = 'ssss_aravind_store'
    
 aravind_dstore = Datastore.get(ws, dstore_name)
    
 raw_input_path = DataReference(
     datastore=aravind_dstore, 
     data_reference_name='ticket_raw_data_ref',
     path_on_datastore='semi-structured/ticket-incident-emails/raw_input_data_eng.xlsx')
 print('Raw DataReference:', raw_input_path)
    
 parent_keywords_path = DataReference(
     datastore=aravind_dstore, 
     data_reference_name='parent_keywords_data_ref',
     path_on_datastore='semi-structured/ticket-incident-emails/parent_keywords.xlsx')
 print('Parent Keywords DataReference:', parent_keywords_path)
    
 third_level_keywords_path = DataReference(
     datastore=aravind_dstore, 
     data_reference_name='parent_keywords_data_ref',
     path_on_datastore='semi-structured/ticket-incident-emails/third_level_keywords.xlsx')
 print('3rd Level Keywords DataReference:', third_level_keywords_path)
    
 cleaned_data_path = DataReference(
     datastore=aravind_dstore, 
     data_reference_name='ticket_phase1_data_ref',
     path_on_datastore='semi-structured/ticket-incident-emails/phase1_op_dummy_data.xlsx')
 print('Cleaned DataReference:', cleaned_data_path)
    
 raw_dset = Dataset.from_excel_files(raw_input_path, sheet_name= 'data', use_column_headers=True, infer_column_types=True)
 parent_kwords_dset = Dataset.from_excel_files(parent_keywords_path, sheet_name= 'New_keywords', use_column_headers=True, infer_column_types=True)
 level3_kwords_dset = Dataset.from_excel_files(third_level_keywords_path,  use_column_headers=True, infer_column_types=True)
 cleaned_dset = Dataset.from_excel_files(cleaned_data_path, sheet_name= 'phase1_op', use_column_headers=True, infer_column_types=True)


Error

 ---------------------------------------------------------------------------
 ExecutionError                            Traceback (most recent call last)
 <ipython-input-22-c4226f154137> in <module>
      44 # from_excel_files(path, sheet_name=None, use_column_headers=False, skip_rows=0, include_path=False, infer_column_types=True,
      45 # partition_format=None)
 ---> 46 raw_dset = Dataset.from_excel_files(raw_input_path, sheet_name= 'data', use_column_headers=True, infer_column_types=True)
      47 parent_kwords_dset = Dataset.from_excel_files(parent_keywords_path, sheet_name= 'New_keywords', use_column_headers=True, infer_column_types=True)
      48 level3_kwords_dset = Dataset.from_excel_files(third_level_keywords_path,  use_column_headers=True, infer_column_types=True)
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/data/_dataset_deprecation.py in wrapper(*args, **kwargs)
      20                 _warn_deprecation(target, replacement)  # only raise warning for top-level invocation
      21                 _warning_silenced_for = target
 ---> 22             result = func(*args, **kwargs)
      23             if _warning_silenced_for == target:
      24                 _warning_silenced_for = None
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/data/_loggerfactory.py in wrapper(*args, **kwargs)
     124             with _LoggerFactory.track_activity(logger, func.__name__, activity_type, custom_dimensions) as al:
     125                 try:
 --> 126                     return func(*args, **kwargs)
     127                 except Exception as e:
     128                     if hasattr(al, 'activity_info') and hasattr(e, 'error_code'):
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/core/dataset.py in from_excel_files(path, sheet_name, use_column_headers, skip_rows, include_path, infer_column_types, partition_format)
     661             include_path,
     662             infer_column_types,
 --> 663             partition_format)
     664 
     665     @staticmethod
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/data/_dataset_client.py in from_excel_files(path, sheet_name, use_column_headers, skip_rows, include_path, infer_column_types, partition_format)
     810             inference_arguments = dprep.InferenceArguments(day_first=True)
     811         dataflow = dprep.read_excel(
 --> 812             path, sheet_name, use_column_headers, inference_arguments, skip_rows, include_path)
     813         dataflow._name = sheet_name
     814         return _DatasetClient._get_dataset_from_dataflow(
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/dataprep/api/readers.py in read_excel(path, sheet_name, use_column_headers, inference_arguments, skip_rows, include_path, infer_column_types, verify_exists)
     186     df = df.read_excel(sheet_name, use_column_headers, skip_rows)
     187 
 --> 188     df = _handle_type_inference_and_path(df, inference_arguments, infer_column_types, include_path)
     189 
     190     if verify_exists:
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/dataprep/api/readers.py in _handle_type_inference_and_path(df, inference_arguments, infer_column_types, include_path)
      32         column_types_builder = df.builders.set_column_types()
      33         if use_inference_arguments:
 ---> 34             column_types_builder.learn(inference_arguments)
      35         else:
      36             column_types_builder.learn()
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/dataprep/api/builders.py in learn(self, inference_arguments)
     193             if inference_arguments is not None and not isinstance(inference_arguments, InferenceArguments):
     194                 raise ValueError('Unexpected inference arguments. Expected instance of InferenceArguments class')
 --> 195             self._conversion_candidates = self._run_type_inference(self._dataflow._get_steps())
     196             if inference_arguments is not None:
     197                 self._resolve_date_ambiguity(inference_arguments.day_first)
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/dataprep/api/builders.py in _run_type_inference(self, steps)
      79             inferences = self._engine_api.infer_types_with_span_context(InferTypesWithSpanContextMessageArguments(
      80                 blocks=steps_to_block_datas(steps),
 ---> 81                 span_context=to_dprep_span_context(span.get_context())
      82             ))
      83             return {col: _inference_info_from_result(inference) for col, inference in inferences.items()}
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/dataprep/api/_aml_helper.py in wrapper(op_code, message, cancellation_token)
      36             if len(changed) > 0:
      37                 engine_api_func().update_environment_variable(changed)
 ---> 38             return send_message_func(op_code, message, cancellation_token)
      39 
      40         return wrapper
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/dataprep/api/engineapi/api.py in infer_types_with_span_context(self, message_args, cancellation_token)
     183     @update_aml_env_vars(get_engine_api)
     184     def infer_types_with_span_context(self, message_args: typedefinitions.InferTypesWithSpanContextMessageArguments, cancellation_token: CancellationToken = None) -> Dict[str, typedefinitions.FieldInference]:
 --> 185         response = self._message_channel.send_message('Engine.InferTypesWithSpanContextMessage', message_args, cancellation_token)
     186         return {k: typedefinitions.FieldInference.from_pod(v) if v is not None else None for k, v in response.items()} if response is not None else None
     187 
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/dataprep/api/engineapi/engine.py in send_message(self, op_code, message, cancellation_token)
     180                 response = self._read_response()
     181                 if 'error' in response:
 --> 182                     raise_engine_error(response['error'])
     183                 elif response.get('id') == message_id:
     184                     return response['result']
    
 /anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/dataprep/api/errorhandlers.py in raise_engine_error(error_response)
       8     error_code = error_response['errorCode']
       9     if 'ScriptExecution' in error_code:
 ---> 10         raise ExecutionError(error_response)
      11     if 'Validation' in error_code:
      12         raise ValidationError(error_response)
    
 ExecutionError: 
 Error Code: ScriptExecution.StreamAccess.Unexpected
 Failed Step: 0....
 Error Message: ScriptExecutionException was caused by StreamAccessException.
   StreamAccessException was caused by UnexpectedException.
     Unexpected error when attempting 'GetHttpResourceStream' for 'https://stgaccount.dfs.core.windows.net/aravind/semi-structured/ticket-incident-emails/raw_input_data_eng.xlsx'.
       Too many open files in system
 | session_id=ff6......


azure-machine-learning
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

1 Answer

ramr-msft avatar image
0 Votes"
ramr-msft answered HakanAbdulovski-6931 commented

@AravindYarram-8844 Thanks for the question, Based on the error message i.e., ScriptExecutionException was caused by StreamAccessException.
For a service identity to access ADL-S gen 2 storage, it needs to pass two criteria:

The machine running under that service identity needs to pass the storage firewall rules.
With your firewall settings, this criterion should pass.
The specific service identity needs to have read access to the ADL-S storage.
For this, you should create an Azure App Identity and give it (at a minimum) read access to the ADL-S gen 2 instance. Then register the ADL-S instance as a datastore in your AML workspace using the client ID and secret of that app identity. Your code running in AML will then be able to access this data via the corresponding datastore registered in AML.

31668-sample.png

If possible could you please share the link to the sample that you are trying.

Please follow this doc to set up the permission to access data behind vnet on studio?
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-enable-virtual-network#configure-a-datastore-to-use-managed-identity



sample.png (61.1 KiB)
· 4
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

The read access is setup properly. I see this issue only intermittently.

0 Votes 0 ·

@AravindYarram-8844 Thanks for the details. We have forwarded to the product team to check on this.

0 Votes 0 ·

Is there a solution to this problem? I encounter same error. I try to access a parquet-file from azure blob storage that was saved there from a pipeline using a cluster instance. Another parquet-file in the same blob-storage was saved using a compute instance and I can access that using this exact code. So there seems to be a difference depending on if the file was saved with cluster/compute. Any guesses what it can be?

Just to be clear. We have created a pipeline that uses a cluster-instance. This pipeline works fine. In some steps I save intermittent files for analysis. I save these files in a blob-storage. I try to access these files using a notebook in AzureML.

If the file was saved with a step from the pipeline that was run on a cluster I can't access it.
If the file was saved with a step from the pipeline that was run from a compute-instance I can access it.

0 Votes 0 ·
Show more comments