DongYuan-3685 asked:

Databricks-Connect also returns "module not found" for jobs with multiple Python files.

Currently I'm connecting to Databricks from local VS Code via databricks-connect, but every submission fails with a "module not found" error, which means the code in my other Python files is not found.
I tried:

- Moving the code into the folder with main.py
- Importing the file inside the function that uses it
- Adding the file via sparkContext.addPyFile

Does anyone have any experience with this? Or is there an even better way to work with Databricks for Python projects?

It seems my plain Python code is executed in the local Python environment and only the code directly related to Spark runs on the cluster, but the cluster does not load all my Python files, which then raises the error.
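For context (an illustration, not from the thread): Spark ships the functions you pass to map() by pickling them, and pickle stores a class defined in a module by reference rather than by value. The standard-library snippet below shows why the worker then has to be able to import the module itself:

```python
import pickle
from collections import OrderedDict

# pickle records only "module name + qualified name" for a class
# defined in a module; the class body itself is not in the payload.
payload = pickle.dumps(OrderedDict)
assert b"collections" in payload and b"OrderedDict" in payload

# Unpickling works only because this interpreter can import
# "collections". A Spark worker that cannot import your module
# (e.g. lib222) fails at exactly this step with ModuleNotFoundError.
restored = pickle.loads(payload)
assert restored is OrderedDict
```

This is why moving code around locally does not help: the worker-side deserializer needs the module on its own import path.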

azure-databricks

Hello,

Welcome to Microsoft Q&A platform.

For deeper investigation, could you please share the complete error message which you are experiencing?

DongYuan-3685 replied to PRADEEPCHEEKATLA-MSFT:

Yes, I have a folder with:

main.py
lib222.py
__init__.py

with class Foo in lib222.py. The main code is:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    sc = spark.sparkContext
    # sc.setLogLevel("INFO")

    print("Testing addPyFile isolation")
    sc.addPyFile("lib222.py")
    from lib222 import Foo
    print(sc.parallelize(range(10)).map(lambda i: Foo(2)).collect())


But I get "ModuleNotFoundError: No module named 'lib222'".

Also, when I print the Python version and some sys info, it seems the Python code is executed on my local machine instead of the remote driver.
My Databricks Runtime version is 6.6.


Thanks.


Detailed Error in answer section.

DongYuan-3685 replied to PRADEEPCHEEKATLA-MSFT:

Are you still following this question?


Hello @DongYuan-3685,


From the error message "ModuleNotFoundError: No module named 'lib222'", it looks like the "lib222" module is missing.

Could you please install the module named "lib222" and retry?
Jonas-4379 answered:

Hello,

I'm experiencing the same problem. My setup is as follows:

- Databricks 6.6 cluster
- Databricks-Connect 6.6
- All other dependencies and configuration unchanged since DB 6.2, when everything still worked fine (but that Runtime got deprecated, so I had to update)

I have a custom package with the following structure

Package_Folder.zip
  Package_Folder
    __init__.py
    Modules_Folder
      __init__.py
      Custom_Module.py

I added the ZIP file to the local Databricks-Connect SparkContext via sc.addPyFile(Path_to_Package_Folder.zip)

If I try to use a function from Custom_Module.py, I get the error

ModuleNotFoundError: No module named 'Package_Folder'

The error is wrapped in a longer chain of error messages rooted in

Exception has occurred: Py4JJavaError
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServeWithJobGroup.

which is raised during a joblibspark call

Do you have any idea what could cause this error? Thanks!
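One thing worth ruling out in a setup like this (a sketch under assumed paths, mirroring the names in the post, not the poster's actual build script): for "from Package_Folder.Modules_Folder import Custom_Module" to be importable from the ZIP, the archive entries must start with "Package_Folder/" and every package level needs an __init__.py. This standard-library snippet builds and inspects such an archive:

```python
import os
import tempfile
import zipfile

# Recreate the package layout from the post in a temp directory.
tmp = tempfile.mkdtemp()
root = os.path.join(tmp, "Package_Folder")
os.makedirs(os.path.join(root, "Modules_Folder"))
open(os.path.join(root, "__init__.py"), "w").close()
open(os.path.join(root, "Modules_Folder", "__init__.py"), "w").close()
with open(os.path.join(root, "Modules_Folder", "Custom_Module.py"), "w") as f:
    f.write("def hello():\n    return 'hi'\n")

zip_path = os.path.join(tmp, "Package_Folder.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Arcnames must be relative to tmp, not to root, so that
            # entries begin with "Package_Folder/" inside the archive.
            zf.write(full, os.path.relpath(full, tmp))

print(sorted(zipfile.ZipFile(zip_path).namelist()))
```

If the entries instead begin with "__init__.py" or "Modules_Folder/" (i.e. the archive was built from inside Package_Folder), the workers cannot resolve the top-level package name and raise exactly this ModuleNotFoundError.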


Hey @Jonas-4379,


From the error message "ModuleNotFoundError: No module named 'Package_Folder'", it looks like the module is missing.

Could you please install the module and retry?


Jonas-4379 replied to PRADEEPCHEEKATLA-MSFT:

Hey @PRADEEPCHEEKATLA-MSFT ,


Yes, I understand that; my problem is that the custom module should be made available on the cluster through the sc.addPyFile() call, but it isn't, which is weird. Usually the custom module should then be available on the Spark worker nodes (a Databricks Spark cluster, connected through databricks-connect), but somehow it isn't, and I was wondering if anybody knew what causes this problem.


Best,
Jonas



@Jonas-4379, For a deeper investigation and immediate assistance on this issue, if you have a support plan you may file a support ticket.

DongYuan-3685 answered:

Detailed error:

Exception has occurred: Py4JJavaError
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 10.139.64.8, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'lib222'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 462, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/databricks/spark/python/pyspark/worker.py", line 71, in read_command
    command = serializer._read_with_length(file)
  File "/databricks/spark/python/pyspark/serializers.py", line 185, in _read_with_length
    raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'lib222'


Jonas-4379 answered:

A few days ago, Databricks-Connect 7.1.0 was released, which seems to have solved this issue. I don't know what caused or fixed the problem, but if anybody runs into it, try updating to a Databricks cluster with Runtime version >= 7.1 and use the corresponding Databricks-Connect version. @DongYuan-3685, can you confirm this solution?
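The upgrade advice above relies on a documented constraint: the databricks-connect client's major.minor version must match the cluster's Databricks Runtime version. A trivial sketch of that check (the version strings below are illustrative, not read from a live environment):

```python
def compatible(client_version: str, runtime_version: str) -> bool:
    # databricks-connect requires the client and the Databricks
    # Runtime to agree on major.minor (e.g. client 7.1.x with DBR 7.1).
    return client_version.split(".")[:2] == runtime_version.split(".")[:2]

print(compatible("7.1.0", "7.1"))  # matching major.minor
print(compatible("6.6.0", "7.1"))  # mismatched: upgrade the client
```

Running `databricks-connect test` after changing versions is the usual way to confirm the pairing end to end.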


KamelST-0949 answered:

I have the exact same issue despite using databricks-connect 7.3 against a cluster on 7.3 LTS ML.

Can anyone explain the root cause of this issue? In the documentation it looks pretty straightforward, since we just need to call the addPyFile function...

Many thanks


ESNALGORKA-1556 answered:

I have the same issue as well, and I'm already using databricks-connect 7.3. Is there any solution or workaround?

Thanks,
Gorka


MichaUrban-2116 answered:

Databricks team / anyone here, have you resolved the problem? I'm facing the same issue now.
