Currently, I'm trying to use the AzureML SDK's dataset.to_spark_dataframe() method and running into a strange error (see below).
The ClassNotFoundException suggests that some JARs are missing from the base environment's Spark classpath. Some sources point specifically at hadoop-azure: https://community.cloudera.com/t5/Support-Questions/Class-org-apache-hadoop-fs-azure-NativeAzureFileSystem-not/m-p/270675
Is there a way to add these dependencies to the environment?
Error:
AzureMLException: AzureMLException:
Message: Execution failed in operation 'to_spark_dataframe' for Dataset(id='54df6c30-fb46-4c75-a084-d10c17cd3795', name='temperatures_parq', version=1, error_code=None, exception_type=Py4JJavaError)
InnerException An error occurred while calling o39.getFiles.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2667)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at com.microsoft.dprep.io.FileSystemStreamInfoHandler.globStatus(FileSystemStreamInfoHandler.scala:46)
at com.microsoft.dprep.io.StreamInfoFileSystem.globStatus(StreamInfoFileSystem.scala:206)
at com.microsoft.dprep.io.StreamInfoFileSystem.globStatus(StreamInfoFileSystem.scala:201)
at com.microsoft.dprep.execution.Storage$.expandHdfsPath(Storage.scala:44)
at com.microsoft.dprep.execution.executors.GetFilesExecutor$.$anonfun$getFiles$1(GetFilesExecutor.scala:18)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at com.microsoft.dprep.execution.executors.GetFilesExecutor$.getFiles(GetFilesExecutor.scala:12)
at com.microsoft.dprep.execution.LariatDataset$.getFiles(LariatDataset.scala:32)
at com.microsoft.dprep.execution.PySparkExecutor.getFiles(PySparkExecutor.scala:201)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2665)
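For context: the missing class (org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure) normally ships in the hadoop-azure JAR. Outside of AzureML, in a plain Spark setup, I would expect something like the following in spark-defaults.conf to resolve it (the exact versions here are an assumption and would need to match the cluster's Hadoop build):

```
# Pull the Azure storage connector (hadoop-azure) and its azure-storage
# dependency onto the Spark classpath at startup.
# Versions below are illustrative; they must match the Hadoop version in use.
spark.jars.packages  org.apache.hadoop:hadoop-azure:3.2.4,com.microsoft.azure:azure-storage:8.6.6
```

What I can't figure out is how to apply the equivalent to the AzureML-managed environment that backs to_spark_dataframe().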