Reading a csv from Google Cloud Storage into a pandas DataFrame

Problem description · Votes: 13 · Answers: 6

I am trying to read a csv file from a Google Cloud Storage bucket into a pandas dataframe.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)

It shows the following error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

What am I doing wrong? I cannot find any solution that does not involve Google Datalab.

python pandas csv google-cloud-platform google-cloud-storage
6 Answers
33 votes

UPDATE

As of version 0.24 of pandas, read_csv supports reading directly from Google Cloud Storage. Simply provide the link to the bucket like this:

df = pd.read_csv('gs://bucket/your_path.csv')
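Applied to the bucket from the question, a minimal sketch (this assumes gcsfs is installed, e.g. via pip install gcsfs, and that default Google credentials are available):

import pandas as pd

# pandas hands gs:// URLs to gcsfs under the hood, so gcsfs must be
# installed and the environment must be able to authenticate.
df = pd.read_csv('gs://createbucket123/my.csv')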

For the sake of completeness, I will also leave three other options below:

  • Hand-crafted code
  • gcsfs
  • Dask

I will cover them below.

The hard way: do-it-yourself code

I have written some convenience functions for reading from Google Storage. To make it more readable, I added type annotations. If you happen to be on Python 2, simply remove these and the code will work all the same.

It works equally well on public and private data sets, assuming you are authorised. With this approach you do not need to download the data to your local drive first.

How to use it:

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)

The code:

from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
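For completeness, a usage sketch for get_bytestring as well (the project, bucket and path names are placeholders): since download_as_string() returns raw bytes, wrap them in an in-memory buffer before handing them to pandas.

from io import BytesIO

import pandas as pd

raw = get_bytestring('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(BytesIO(raw))  # parse the bytes without touching disk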

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

How to use it:

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
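For public data sets, gcsfs can also skip authentication entirely; a small sketch, assuming the bucket is world-readable:

import pandas as pd
import gcsfs

# token='anon' requests anonymous access, which only works on
# publicly readable buckets.
fs = gcsfs.GCSFileSystem(token='anon')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)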

Dask

Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It is great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy to use for newcomers.

It comes with its own read_csv.

How to use it:

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!

# df is now Dask dataframe, ready for distributed processing
# If you want to have the pandas version, simply:
df_pd = df.compute()

12 votes

Another option is to use TensorFlow, which has the ability to do a streaming read from Google Cloud Storage:

from tensorflow.python.lib.io import file_io
with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
  df = pd.read_csv(f)

Using tensorflow also gives you a convenient way to handle wildcards in the filename. For example:

Reading wildcard CSVs into Pandas

The following code will read all CSVs that match a specific pattern (e.g. gs://bucket/some/dir/train-*) into a Pandas dataframe:

import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io
import pandas as pd

def read_csv_file(filename):
  with file_io.FileIO(filename, 'r') as f:
    df = pd.read_csv(f, header=None, names=['col1', 'col2'])
    return df

def read_csv_files(filename_pattern):
  filenames = tf.gfile.Glob(filename_pattern)
  dataframes = [read_csv_file(filename) for filename in filenames]
  return pd.concat(dataframes)

Usage:

DATADIR='gs://my-bucket/some/dir'
traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))

3 votes

read_csv does not support gs://

From the documentation:

The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv

You can download the file or fetch it as a string in order to manipulate it.
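For example, a minimal sketch that fetches the question's file as a byte-string and parses it in memory, reusing the client setup from the question:

from io import BytesIO

import pandas as pd
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')

# download_as_string() returns the blob contents as bytes,
# which pandas can parse via an in-memory buffer.
df = pd.read_csv(BytesIO(blob.download_as_string()))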


2 votes

Starting from pandas==0.24.0, this is natively supported if you have gcsfs installed: https://github.com/pandas-dev/pandas/pull/22704

Until the official release, you can try it out with pip install pandas==0.24.0rc1.


1 vote

There are three ways of accessing files in GCS:

  1. Downloading the client library (this one is for you)
  2. Using the Cloud Storage Browser in the Google Cloud Platform Console
  3. Using gsutil, a command-line tool for working with files in Cloud Storage

Following step 1, set up GCS for your work. After that you have to:

import cloudstorage as gcs
from google.appengine.api import app_identity

Then you have to specify the cloud storage bucket name and create read/write functions to access your bucket:
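A minimal sketch of such a read function, modelled on the App Engine cloudstorage client (the read_file helper name and the default-bucket fallback are illustrative):

import os

import cloudstorage as gcs
from google.appengine.api import app_identity

def read_file(filename):
    # Use an explicit bucket if configured, else the app's default bucket.
    bucket_name = os.environ.get(
        'BUCKET_NAME', app_identity.get_default_gcs_bucket_name())
    # cloudstorage paths take the form /<bucket>/<object>.
    with gcs.open('/' + bucket_name + '/' + filename) as gcs_file:
        return gcs_file.read()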

You can find the rest of the read/write tutorial here.


1 vote

If I understood your question correctly, then maybe this link can help you get a better URL for your read_csv() function:

https://cloud.google.com/storage/docs/access-public-data
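For instance, a publicly shared object can be read over plain HTTPS; a sketch, assuming the object has been made public:

import pandas as pd

# Public objects are served at https://storage.googleapis.com/<bucket>/<object>,
# which pandas can read like any other HTTP URL.
df = pd.read_csv('https://storage.googleapis.com/createbucket123/my.csv')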
