
KutiKreszacsMatyasRBROPJDT-0321 asked romungi-MSFT answered

AzureML compute instance Spark dependencies missing

I'm trying to use the AzureML SDK's dataset.to_spark_dataframe() method and am hitting a strange error (see below).
The ClassNotFoundException suggests that some JARs are missing from the Spark classpath of the base environment. Some sources point specifically to hadoop-azure: https://community.cloudera.com/t5/Support-Questions/Class-org-apache-hadoop-fs-azure-NativeAzureFileSystem-not/m-p/270675

Is there a way to add these dependencies to the environment?

Error:
AzureMLException: AzureMLException:
Message: Execution failed in operation 'to_spark_dataframe' for Dataset(id='54df6c30-fb46-4c75-a084-d10c17cd3795', name='temperatures_parq', version=1, error_code=None, exception_type=Py4JJavaError)
InnerException An error occurred while calling o39.getFiles.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2667)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at com.microsoft.dprep.io.FileSystemStreamInfoHandler.globStatus(FileSystemStreamInfoHandler.scala:46)
at com.microsoft.dprep.io.StreamInfoFileSystem.globStatus(StreamInfoFileSystem.scala:206)
at com.microsoft.dprep.io.StreamInfoFileSystem.globStatus(StreamInfoFileSystem.scala:201)
at com.microsoft.dprep.execution.Storage$.expandHdfsPath(Storage.scala:44)
at com.microsoft.dprep.execution.executors.GetFilesExecutor$.$anonfun$getFiles$1(GetFilesExecutor.scala:18)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at com.microsoft.dprep.execution.executors.GetFilesExecutor$.getFiles(GetFilesExecutor.scala:12)
at com.microsoft.dprep.execution.LariatDataset$.getFiles(LariatDataset.scala:32)
at com.microsoft.dprep.execution.PySparkExecutor.getFiles(PySparkExecutor.scala:201)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2665)
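
On the question of adding the missing dependency, here is a minimal sketch. It assumes the notebook creates its own SparkSession and that hadoop-azure (the artifact that ships org.apache.hadoop.fs.azure.NativeAzureFileSystem) is the missing piece. The helper names and the version string 3.3.1 are hypothetical; the version should match the Hadoop build bundled with the installed Spark. spark.jars.packages is a standard Spark setting that resolves Maven coordinates when the session starts.

```python
# Sketch only: the Maven coordinate is real, but the version used below is an
# assumption and should match the Hadoop version bundled with your Spark.
def hadoop_azure_conf(hadoop_version: str) -> dict:
    """Return Spark settings that ask Spark to fetch hadoop-azure at startup."""
    return {
        # spark.jars.packages takes comma-separated Maven coordinates and
        # also pulls in their transitive dependencies.
        "spark.jars.packages": f"org.apache.hadoop:hadoop-azure:{hadoop_version}",
    }


def create_session(conf: dict):
    """Build a SparkSession with the given settings (requires pyspark)."""
    # Imported lazily so hadoop_azure_conf() can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("azureml-dataset")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = hadoop_azure_conf("3.3.1")  # hypothetical version; check your own
print(conf["spark.jars.packages"])
```

The setting is only read when the JVM starts, so create_session(conf) would need to run in a fresh kernel before any other SparkSession exists. Whether the SDK's data-prep runtime then uses that session's classpath is not confirmed in this thread.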



azure-machine-learning

1 Answer

romungi-MSFT answered

@KutiKreszacsMatyasRBROPJDT-0321 I think dataset.to_spark_dataframe() is now deprecated, since the Dataset class has been split into two classes, TabularDataset and FileDataset. A deprecation notice describing the changes is available here. Could you try the latest SDK with the TabularDataset class?

Example:

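 from azureml.core import Dataset, Workspace

 ws = Workspace.from_config()
 datastore = ws.get_default_datastore()  # assumption: the file is in the default datastore
 dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'train-dataset/tabular/iris.csv')])

 # take the first 3 rows of the dataset and convert them to a Spark DataFrame
 dataset.take(3).to_spark_dataframe()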





Thanks @romungi-MSFT for the example.
I was trying to use a registered Tabular dataset.
I've tried the following way:
sales = Dataset.get_by_name(ws, "sales").take(10).to_spark_dataframe()

It seems the issue is not with Spark itself, as I'm able to create a DataFrame from a file in the workspace.
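
Since Spark itself works here, one possible workaround is to avoid the failing to_spark_dataframe() path by loading the data through pandas first. This is a rough sketch, not something confirmed in this thread. The helper name to_spark_via_pandas is hypothetical, and it assumes the dataset is small enough to fit in driver memory. It relies only on TabularDataset.take(), TabularDataset.to_pandas_dataframe() and SparkSession.createDataFrame().

```python
def to_spark_via_pandas(dataset, spark, limit=None):
    """Load a tabular dataset into pandas, then hand the result to Spark.

    `dataset` is expected to offer take() and to_pandas_dataframe();
    `spark` is expected to offer createDataFrame().
    """
    if limit is not None:
        dataset = dataset.take(limit)
    # This step materializes the rows in driver memory.
    pandas_df = dataset.to_pandas_dataframe()
    return spark.createDataFrame(pandas_df)


# Intended use, with `ws` and `spark` already defined in the notebook:
# sales = to_spark_via_pandas(Dataset.get_by_name(ws, "sales"), spark, limit=10)
```

Because everything passes through the driver, this only suits previews or modest tables; it does not replace fixing the classpath for large datasets.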
