Git integration for Azure Machine Learning

Git is a popular version control system that allows you to share and collaborate on your projects.

Azure Machine Learning fully supports Git repositories for tracking work. You can clone repositories directly onto your shared workspace file system, use Git on your local workstation, or use Git from a CI/CD pipeline.

When you submit a job to Azure Machine Learning and the source files are stored in a local Git repository, information about the repository is tracked as part of the training process.

Since Azure Machine Learning tracks information from a local git repo, it isn't tied to any specific central repository. Your repository can be cloned from GitHub, GitLab, Bitbucket, Azure DevOps, or any other git-compatible service.

Clone Git repositories into your workspace file system

Azure Machine Learning provides a shared file system for all users in the workspace. To clone a Git repository into this file share, we recommend that you create a compute instance and open a terminal. Once the terminal is open, you have access to a full Git client and can clone and work with repositories through the Git CLI.

We recommend that you clone the repository into your own user directory so that other users don't create conflicts directly on your working branch.

You can clone any Git repository that you can authenticate to (GitHub, Azure Repos, Bitbucket, and so on).

For more information about cloning, see the guide on how to use Git CLI.
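
The per-user layout recommended above can be sketched in Python as follows. This is only an illustration: the base folder passed to `clone_destination`, the user name, and both helper names are assumptions made for this example, not values defined by this article.

```python
from pathlib import PurePosixPath


def clone_destination(base_dir, user, repo_url):
    """Return <base_dir>/<user>/<repo-name> for the given clone URL."""
    # Treat ":" like "/" so SSH-style URLs (git@host:owner/repo.git) also work,
    # then keep the last path segment and drop a trailing ".git".
    name = repo_url.rstrip("/").replace(":", "/").split("/")[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return PurePosixPath(base_dir) / user / name


def clone_command(repo_url, destination):
    """Build the argument list for `git clone <url> <destination>`."""
    return ["git", "clone", repo_url, str(destination)]


# Hypothetical values; replace them with your own folder, alias, and repository.
url = "git@example.com:GitUser/azureml-example.git"
dest = clone_destination("/home/azureuser/cloudfiles/code/Users", "my-alias", url)
print(" ".join(clone_command(url, dest)))
```

Passing the resulting list to `subprocess.run(..., check=True)` would perform the clone. Keeping the command as a list avoids shell quoting problems with unusual repository names.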

Authenticate your Git Account with SSH

Generate a new SSH key

  1. Open the terminal window in the Azure Machine Learning Notebook Tab.

  2. Paste the text below, substituting in your email address.

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

This creates a new SSH key, using the provided email address as a label.

> Generating public/private rsa key pair.

  3. When you're prompted to "Enter a file in which to save the key", press Enter. This accepts the default file location.

  4. Verify that the default location is '/home/azureuser/.ssh' and press Enter. Otherwise, specify the location '/home/azureuser/.ssh'.

Tip

Make sure the SSH key is saved in '/home/azureuser/.ssh'. This file is saved on the compute instance and is accessible only by the owner of the compute instance.

> Enter a file in which to save the key (/home/azureuser/.ssh/id_rsa): [Press enter]

  5. At the prompt, type a secure passphrase. We recommend that you add a passphrase to your SSH key for added security.

> Enter passphrase (empty for no passphrase): [Type a passphrase]
> Enter same passphrase again: [Type passphrase again]
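
After the key is generated, it can be useful to confirm that both files exist and that the private key is readable only by its owner, because OpenSSH rejects private keys whose permissions are too open. The following Python sketch does that check. The function name `key_pair_problems` is made up for this example, and the check assumes a Linux file system such as the one on a compute instance.

```python
import stat
from pathlib import Path


def key_pair_problems(ssh_dir, key_name="id_rsa"):
    """Return a list of problems found with the key pair (empty if none)."""
    ssh_dir = Path(ssh_dir)
    private_key = ssh_dir / key_name
    public_key = ssh_dir / (key_name + ".pub")
    problems = []
    if not private_key.is_file():
        problems.append(f"missing private key: {private_key}")
    elif stat.S_IMODE(private_key.stat().st_mode) & 0o077:
        # Any group/other permission bit means the key is not owner-only.
        problems.append(f"private key is accessible by group or others: {private_key}")
    if not public_key.is_file():
        problems.append(f"missing public key: {public_key}")
    return problems
```

For example, `key_pair_problems("/home/azureuser/.ssh")` returns an empty list when the default key pair from the steps above is in place with owner-only permissions.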

Add the public key to your Git account

  1. In your terminal window, copy the contents of your public key file. If you renamed the key, replace id_rsa.pub with the public key file name.

cat ~/.ssh/id_rsa.pub

Tip

Copy and paste in the terminal

  • Windows: Ctrl-Insert to copy, and Ctrl-Shift-v or Shift-Insert to paste.
  • macOS: Cmd-c to copy and Cmd-v to paste.
  • Firefox/IE may not support clipboard permissions properly.

  2. Select the key output and copy it to the clipboard.
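
Clipboard problems can silently truncate a key, so a quick sanity check before pasting it into your Git account may help. The sketch below relies on the standard OpenSSH public key layout: a key type, base64 data, and an optional comment, where the decoded data begins with a length-prefixed copy of the key type. The function name `parse_public_key_line` is an assumption for this example.

```python
import base64
import struct


def parse_public_key_line(line):
    """Split an OpenSSH public key line into (key_type, blob, comment)."""
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        raise ValueError("expected '<type> <base64-data> [comment]'")
    key_type, b64_data = parts[0], parts[1]
    comment = parts[2] if len(parts) == 3 else ""
    # binascii.Error (a ValueError subclass) is raised for invalid base64 data.
    blob = base64.b64decode(b64_data, validate=True)
    if len(blob) < 4:
        raise ValueError("key data is too short")
    # The blob starts with a 4-byte big-endian length, then the key type name.
    (name_len,) = struct.unpack(">I", blob[:4])
    embedded_type = blob[4:4 + name_len].decode("ascii", errors="replace")
    if embedded_type != key_type:
        raise ValueError(f"key type mismatch: {key_type!r} vs {embedded_type!r}")
    return key_type, blob, comment
```

A truncated or corrupted copy usually fails the base64 decoding or the type comparison and raises `ValueError`. A line that parses cleanly is not proof that the key is complete, only that its header is intact.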

Clone the Git repository with SSH

  1. Copy the SSH Git clone URL from the Git repo.

  2. Paste the URL into the git clone command below so that it uses your SSH Git repo URL. The command looks something like:

git clone git@example.com:GitUser/azureml-example.git
Cloning into 'azureml-example'...

You will see a response like:

The authenticity of host 'example.com (192.30.255.112)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'example.com,192.30.255.112' (RSA) to the list of known hosts.

SSH may display the server's SSH fingerprint and ask you to verify it. Verify that the displayed fingerprint matches one of the fingerprints on the SSH public keys page.

SSH displays this fingerprint when it connects to an unknown host to protect you from man-in-the-middle attacks. Once you accept the host's fingerprint, SSH will not prompt you again unless the fingerprint changes.

  3. When you are asked if you want to continue connecting, type yes. Git clones the repo and sets up the origin remote to connect with SSH for future Git commands.
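
If your Git provider publishes its host public key rather than a fingerprint, the SHA256 fingerprint that SSH displays can be reproduced from the key's base64 data. The following Python sketch shows the computation: the fingerprint is the unpadded base64 encoding of the SHA-256 digest of the decoded key data. The function name `sha256_fingerprint` is an assumption for this example.

```python
import base64
import hashlib


def sha256_fingerprint(b64_key_data):
    """Return the OpenSSH-style SHA256 fingerprint for base64-encoded key data."""
    blob = base64.b64decode(b64_key_data)
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest as unpadded base64 prefixed with "SHA256:".
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

Pass the second field of a public key line (the base64 data) to the function and compare the result with the fingerprint shown in the prompt before you type yes.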

Track code that comes from Git repositories

When you submit a training run from the Python SDK or Machine Learning CLI, the files needed to train the model are uploaded to your workspace. If the git command is available on your development environment, the upload process uses it to check if the files are stored in a git repository. If so, then information from your git repository is also uploaded as part of the training run. This information is stored in the following properties for the training run:

Property                    Git command used to get the value  Description
azureml.git.repository_uri  git ls-remote --get-url            The URI that your repository was cloned from.
mlflow.source.git.repoURL   git ls-remote --get-url            The URI that your repository was cloned from.
azureml.git.branch          git symbolic-ref --short HEAD      The active branch when the run was submitted.
mlflow.source.git.branch    git symbolic-ref --short HEAD      The active branch when the run was submitted.
azureml.git.commit          git rev-parse HEAD                 The commit hash of the code that was submitted for the run.
mlflow.source.git.commit    git rev-parse HEAD                 The commit hash of the code that was submitted for the run.
azureml.git.dirty           git status --porcelain .           True, if the branch/commit is dirty; otherwise, false.

This information is sent for runs that use an estimator, machine learning pipeline, or script run.

If your training files are not located in a git repository on your development environment, or the git command is not available, then no git-related information is tracked.
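
To preview what would be recorded before you submit a run, you can run the same Git commands yourself. The Python sketch below approximates the behavior described above using the commands from the table. It is not the service's actual implementation, and the names `run_git` and `collect_git_properties` are assumptions for this example. It treats any failing command (for example, Git missing, no repository, or a detached HEAD) as "no Git information".

```python
import subprocess

# Property name -> Git command, copied from the table above.
GIT_PROPERTY_COMMANDS = {
    "azureml.git.repository_uri": ["git", "ls-remote", "--get-url"],
    "azureml.git.branch": ["git", "symbolic-ref", "--short", "HEAD"],
    "azureml.git.commit": ["git", "rev-parse", "HEAD"],
    "azureml.git.dirty": ["git", "status", "--porcelain", "."],
}


def run_git(args, cwd):
    """Run a Git command in cwd and return its trimmed standard output."""
    result = subprocess.run(args, cwd=cwd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def collect_git_properties(cwd=".", runner=run_git):
    """Return the Git properties for cwd, or {} if they can't be determined."""
    properties = {}
    try:
        for name, args in GIT_PROPERTY_COMMANDS.items():
            properties[name] = runner(args, cwd)
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or cwd is not inside a usable repository.
        return {}
    # Any output from `git status --porcelain` means uncommitted changes exist.
    properties["azureml.git.dirty"] = str(bool(properties["azureml.git.dirty"]))
    return properties
```

The `runner` parameter exists only so the function can be exercised without a real repository. In normal use, call `collect_git_properties("path/to/your/project")`.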

Tip

To check whether the git command is available in your development environment, open a shell session, command prompt, PowerShell, or other command-line interface and type the following command:

git --version

If Git is installed and on your path, you receive a response similar to git version 2.4.1. For more information on installing Git in your development environment, see the Git website.
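
The same check can be made from Python, for example at the top of a submission script. This is a small sketch; the function name `git_version` and its `which` parameter are assumptions for this example.

```python
import shutil
import subprocess


def git_version(which=shutil.which):
    """Return the output of `git --version`, or None if git is not on the path."""
    git_path = which("git")
    if git_path is None:
        return None
    result = subprocess.run(
        [git_path, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

A return value of `None` means no Git information would be tracked for runs submitted from that environment.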

View the logged information

The git information is stored in the properties for a training run. You can view this information using the Azure portal, Python SDK, and CLI.

Azure portal

  1. From the studio portal, select your workspace.
  2. Select Experiments, and then select one of your experiments.
  3. Select one of the runs from the RUN NUMBER column.
  4. Select Outputs + logs, and then expand the logs and azureml entries. Select the link that begins with ###_azure.

The logged information contains text similar to the following JSON:

"properties": {
    "_azureml.ComputeTargetType": "batchai",
    "ContentSnapshotId": "5ca66406-cbac-4d7d-bc95-f5a51dd3e57e",
    "azureml.git.repository_uri": "git@github.com:azure/machinelearningnotebooks",
    "mlflow.source.git.repoURL": "git@github.com:azure/machinelearningnotebooks",
    "azureml.git.branch": "master",
    "mlflow.source.git.branch": "master",
    "azureml.git.commit": "4d2b93784676893f8e346d5f0b9fb894a9cf0742",
    "mlflow.source.git.commit": "4d2b93784676893f8e346d5f0b9fb894a9cf0742",
    "azureml.git.dirty": "True",
    "AzureML.DerivedImageName": "azureml/azureml_9d3568242c6bfef9631879915768deaf",
    "ProcessInfoFile": "azureml-logs/process_info.json",
    "ProcessStatusFile": "azureml-logs/process_status.json"
}
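
When you have the properties as JSON, the Git-related entries can be separated from the rest by their key prefixes. The following Python sketch uses an abbreviated copy of the sample above, wrapped in braces so that it parses as a complete JSON document. The helper name `git_properties` is an assumption for this example.

```python
import json

# Key prefixes used by the Git properties listed earlier in this article.
GIT_PREFIXES = ("azureml.git.", "mlflow.source.git.")


def git_properties(properties):
    """Return only the Git-related entries from a run's properties."""
    return {k: v for k, v in properties.items() if k.startswith(GIT_PREFIXES)}


logged = json.loads("""
{
    "properties": {
        "_azureml.ComputeTargetType": "batchai",
        "azureml.git.branch": "master",
        "mlflow.source.git.branch": "master",
        "azureml.git.commit": "4d2b93784676893f8e346d5f0b9fb894a9cf0742",
        "azureml.git.dirty": "True",
        "ProcessInfoFile": "azureml-logs/process_info.json"
    }
}
""")

for key, value in git_properties(logged["properties"]).items():
    print(f"{key} = {value}")
```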

Python SDK

After submitting a training run, a Run object is returned. The properties attribute of this object contains the logged git information. For example, the following code retrieves the commit hash:

run.properties['azureml.git.commit']
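
Indexing the properties directly raises a KeyError when no Git information was tracked for the run. A more defensive lookup might look like the following sketch, which takes the properties dictionary (for example, `run.properties`) and summarizes it. The function name `describe_git_state` and the summary format are assumptions for this example.

```python
def describe_git_state(properties):
    """Summarize the Git properties of a run as a short string."""
    commit = properties.get("azureml.git.commit")
    if commit is None:
        return "no Git information was logged for this run"
    branch = properties.get("azureml.git.branch", "unknown branch")
    # The dirty flag is stored as the string "True" or "False".
    dirty = properties.get("azureml.git.dirty") == "True"
    suffix = " with uncommitted changes" if dirty else ""
    return f"{branch}@{commit[:7]}{suffix}"


sample = {
    "azureml.git.branch": "master",
    "azureml.git.commit": "4d2b93784676893f8e346d5f0b9fb894a9cf0742",
    "azureml.git.dirty": "True",
}
print(describe_git_state(sample))
```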

CLI

The az ml run CLI command can be used to retrieve the properties from a run. For example, the following command returns the properties for the last run in the experiment named train-on-amlcompute:

az ml run list -e train-on-amlcompute --last 1 -w myworkspace -g myresourcegroup --query '[].properties'

For more information, see the az ml run reference documentation.
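
If you want to use the CLI output from a script, one option is to call the command above and parse its JSON output. The sketch below assumes that the az CLI with the command shown above is installed and that you're signed in. The function name `last_run_properties` and its `runner` parameter are assumptions for this example.

```python
import json
import subprocess


def last_run_properties(experiment, workspace, resource_group,
                        runner=subprocess.check_output):
    """Return the properties of the last run in an experiment, or {} if none."""
    args = [
        "az", "ml", "run", "list",
        "-e", experiment,
        "--last", "1",
        "-w", workspace,
        "-g", resource_group,
        "--query", "[].properties",
        "--output", "json",
    ]
    # The query returns a JSON list with one properties object per run.
    runs = json.loads(runner(args))
    return runs[0] if runs else {}
```

For example, `last_run_properties("train-on-amlcompute", "myworkspace", "myresourcegroup").get("azureml.git.commit")` returns the commit hash, or `None` if no Git information was tracked.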
