How to correctly update a Maven library in Azure Databricks

Problem

Suppose you make a minor update to a library in the repository and, because it is only a small change for testing purposes, you do not change the version number. When you attach the library to your cluster again, your code changes are not included in the library.

Cause

One strength of Azure Databricks is the ability to install third-party or custom libraries, for example from a Maven repository. However, when a library is updated in the repository, there is no automated way to update the corresponding library in the cluster.

When you ask Azure Databricks to download a library so that it can be attached to a cluster, the following process occurs:

  1. In Azure Databricks, you request a library from a Maven repository.
  2. Azure Databricks checks the local cache for the library and, if it is not present, downloads the library from the Maven repository to the local cache.
  3. Azure Databricks then copies the library to DBFS (/FileStore/jars/maven/).
  4. On subsequent requests for the library, Azure Databricks uses the file that has already been copied to DBFS and does not download a new copy.
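
The steps above can be illustrated with a toy model. This is only a sketch of cache-first lookup keyed by coordinate, not Azure Databricks code: the function `resolve_library`, the dictionaries `remote_repo` and `dbfs_cache`, and the `com.example:libA` coordinate are all hypothetical, and the local cache and the DBFS copy are collapsed into a single dictionary for brevity.

```python
# Toy model of the cache-first lookup described above; not Databricks code.
# All names (resolve_library, remote_repo, dbfs_cache) are hypothetical.

def resolve_library(coordinate, remote_repo, dbfs_cache):
    """Return the library bytes for a Maven coordinate, preferring the cached copy."""
    if coordinate in dbfs_cache:
        # Step 4: a cached copy exists, so the repository is never consulted.
        return dbfs_cache[coordinate]
    # Steps 2-3: download once, then keep the copy for later requests.
    artifact = remote_repo[coordinate]
    dbfs_cache[coordinate] = artifact
    return artifact


remote_repo = {"com.example:libA:1.0.0-SNAPSHOT": b"build-1"}
dbfs_cache = {}

first = resolve_library("com.example:libA:1.0.0-SNAPSHOT", remote_repo, dbfs_cache)

# The artifact is republished under the same version string.
remote_repo["com.example:libA:1.0.0-SNAPSHOT"] = b"build-2"
second = resolve_library("com.example:libA:1.0.0-SNAPSHOT", remote_repo, dbfs_cache)

# A new version string misses the cache and triggers a fresh download.
remote_repo["com.example:libA:1.0.1-SNAPSHOT"] = b"build-2"
third = resolve_library("com.example:libA:1.0.1-SNAPSHOT", remote_repo, dbfs_cache)
```

In this model, `second` still holds the old build because the lookup is keyed only by the coordinate, while `third` picks up the new build because its coordinate has never been cached.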

Solution

To ensure that an updated version of a library (or a library that you have customized) is downloaded to a cluster, increment the build number or version number of the artifact in some way. For example, if you change libA_v1.0.0-SNAPSHOT to libA_v1.0.1-SNAPSHOT, the new library is downloaded. You can then attach it to your cluster.
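
If you bump versions often during testing, a small helper can do it consistently. The following is a minimal sketch, assuming version strings of the form MAJOR.MINOR.PATCH with an optional suffix such as -SNAPSHOT; the function name `bump_patch` is hypothetical and is not part of Maven or Azure Databricks.

```python
import re


def bump_patch(version):
    """Increment the patch component of a MAJOR.MINOR.PATCH[-SUFFIX] version string."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(-.+)?", version)
    if match is None:
        # Assumption: anything else is outside the scope of this sketch.
        raise ValueError(f"unsupported version format: {version!r}")
    major, minor, patch, suffix = match.groups()
    # Keep the suffix (for example "-SNAPSHOT") unchanged.
    return f"{major}.{minor}.{int(patch) + 1}{suffix or ''}"


new_version = bump_patch("1.0.0-SNAPSHOT")
```

The resulting string still has to be written into the artifact's build definition and the artifact republished before the library is requested again.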