Reading large DBFS-mounted files using Python APIs

This article explains how to resolve an error that occurs when you read large DBFS-mounted files using local Python APIs.

Problem

If you mount a folder onto dbfs:// and read a file larger than 2 GB with a Python API such as pandas, you will see the following error:

/databricks/python/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3427)()
/databricks/python/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6883)()
IOError: Initializing from file failed

Cause

The error occurs because the Python method that reads the file takes the file length as a signed int argument. For a file larger than 2 GB, the length can exceed the maximum value a signed int can hold.
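The size limit described above can be checked before reading a file. The following is a minimal sketch, assuming the signed int in question is 32 bits wide (the article does not state the width); the names `MAX_SIGNED_INT32`, `exceeds_signed_int32`, and `file_exceeds_signed_int32` are hypothetical and not part of pandas or Databricks.

```python
import os

# Assumption: the "signed int" is 32 bits wide, so its largest value is
# 2**31 - 1 = 2,147,483,647 (just under 2 GiB when counted in bytes).
MAX_SIGNED_INT32 = 2**31 - 1


def exceeds_signed_int32(size_in_bytes):
    """Return True if a byte count cannot be stored in a 32-bit signed int."""
    return size_in_bytes > MAX_SIGNED_INT32


def file_exceeds_signed_int32(path):
    """Return True if the file at `path` is too large for a 32-bit signed length."""
    return exceeds_signed_int32(os.path.getsize(path))
```

A file for which `file_exceeds_signed_int32` returns True is a likely candidate for the error shown in the Problem section.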

Solution

Move the file from dbfs:// to the local file system (file://). Then read it using the Python API. For example:

  1. Copy the file from dbfs:// to file://:

    %fs cp dbfs:/mnt/large_file.csv file:/tmp/large_file.csv
    
  2. Read the file with the pandas API:

    import pandas as pd
    # Read the local copy and preview the first rows.
    pd.read_csv('file:/tmp/large_file.csv').head()
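The copy command above addresses the local file with a file: URI, while some Python APIs, such as the built-in `open()`, expect a plain file system path. The sketch below shows one way to convert between the two using only the standard library. It is written for Python 3, and `local_path_from_file_uri` is a hypothetical helper name, not a Databricks or pandas function.

```python
from urllib.parse import unquote, urlparse


def local_path_from_file_uri(uri):
    """Return the plain file system path for a file: URI (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        # Reject other schemes such as dbfs:, which are not local paths.
        raise ValueError("expected a file: URI, got %r" % uri)
    # Decode percent-escapes such as %20 back into literal characters.
    return unquote(parsed.path)


print(local_path_from_file_uri("file:/tmp/large_file.csv"))  # /tmp/large_file.csv
```

The returned path can then be passed to any API that takes a local file name.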