Image

The image data source abstracts away the details of image representations and provides a standard API for loading image data. To read image files, specify the data source format as image.

df = spark.read.format("image").load("<path-to-image-data>")

Similar APIs exist for Scala, Java, and R.

You can import a nested directory structure (for example, use a path like /path/to/dir/) and you can use partition discovery by specifying a path with a partition directory (that is, a path like /path/to/dir/date=2018-01-02/category=automobile).
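To make the partition directory convention concrete, the following minimal sketch shows which column names and values the example path encodes. It is plain Python that only illustrates the key=value naming scheme, not Spark's implementation. The helper name partition_columns is hypothetical, and the sketch keeps every value as a string and does not attempt any type inference.

```python
def partition_columns(path):
    """Collect key=value directory segments from a path, in order."""
    columns = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        # Only segments of the form key=value count as partition directories.
        if sep and key:
            columns[key] = value
    return columns


print(partition_columns("/path/to/dir/date=2018-01-02/category=automobile"))
# {'date': '2018-01-02', 'category': 'automobile'}
```

With partition discovery, each such key would be expected to appear as an additional column next to the image column.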

Note

If you do not want to decode images, Azure Databricks recommends that you use the binary file data source.

Image structure

Image files are loaded as a DataFrame containing a single struct-type column called image with the following fields:

image: struct containing all the image data
  |-- origin: string representing the source URI
  |-- height: integer, image height in pixels
  |-- width: integer, image width in pixels
  |-- nChannels
  |-- mode
  |-- data

where the fields are:

  • nChannels: The number of color channels. Typical values are 1 for grayscale images, 3 for colored images (for example, RGB), and 4 for colored images with alpha channel.

  • mode: Integer flag that indicates how to interpret the data field. It specifies the data type and channel order in which the data is stored. The value is expected (but not enforced) to map to one of the OpenCV types displayed in the following table. OpenCV types are defined for 1, 2, 3, or 4 channels and several data types for the pixel values. Channel order specifies the order in which the colors are stored. For example, a typical three-channel image with red, green, and blue components has six possible orderings. Most libraries use either RGB or BGR. Three-channel OpenCV types are expected to be in BGR order, and four-channel types in BGRA order.

    Map of Type to Numbers in OpenCV (data types x number of channels)

    Type     C1   C2   C3   C4
    CV_8U     0    8   16   24
    CV_8S     1    9   17   25
    CV_16U    2   10   18   26
    CV_16S    3   11   19   27
    CV_32S    4   12   20   28
    CV_32F    5   13   21   29
    CV_64F    6   14   22   30
  • data: Image data stored in a binary format. Image data is represented as a 3-dimensional array with the dimension shape (height, width, nChannels) and array values of type t specified by the mode field. The array is stored in row-major order.
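The mode numbers follow a regular pattern (the depth code plus 8 for each channel beyond the first), and the row-major layout determines where each pixel sits in the data field. The sketch below illustrates both in plain Python. The helper names decode_mode and pixel_at are hypothetical, the sample bytes are made up for illustration, and pixel_at assumes 8-bit channel values (one byte per channel); other depths would need multi-byte decoding.

```python
# OpenCV depth names indexed by depth code (0 through 6).
DEPTHS = ["CV_8U", "CV_8S", "CV_16U", "CV_16S", "CV_32S", "CV_32F", "CV_64F"]


def decode_mode(mode):
    """Split a mode value into (depth name, number of channels)."""
    channel_index, depth = divmod(mode, 8)
    if depth >= len(DEPTHS) or not 0 <= channel_index < 4:
        raise ValueError(f"unexpected mode value: {mode}")
    return DEPTHS[depth], channel_index + 1


def pixel_at(data, width, n_channels, row, col):
    """Return the channel values of one pixel from row-major 8-bit data."""
    start = (row * width + col) * n_channels
    return tuple(data[start:start + n_channels])


# A made-up 2x2 image with three 8-bit channels in BGR order.
data = bytes([
    255, 0, 0,    0, 255, 0,      # row 0: blue, green
    0, 0, 255,    255, 255, 255,  # row 1: red, white
])

print(decode_mode(16))             # ('CV_8U', 3)
print(pixel_at(data, 2, 3, 1, 0))  # (0, 0, 255), which is red in BGR order
```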

Display image data

The Databricks display function supports displaying image data. See Images.

Notebook

The following notebook shows how to read and write data to image files.

Image data source notebook
