Python code using gspread (sometimes) takes too long


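I recently wrote some code that uses a URL and table titles (whose ranges are protected, so the lookup does not fail) to retrieve tables from 5 very heavily loaded Google Sheets. The tables are not standardized, and each sheet can hold 1 to 5 of them. Sometimes a table starts at A3, sometimes at B5, and so on, which is why we use the title to locate it. I think it is best to post only the code that takes too long, with comments along the way: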

#Finds the sheet where the table is located (info stored in a BQ table)
hoja = query_select_keys_tablas(columnas='NOMBRE_HOJA',where='NOMBRE_TABLA',filtro=hoja_key).iloc[0,0]

gc = gspread.authorize(credentials=credentials)

#This is one of the lines that takes a bit of time, but it is understandable
ws = gc.open_by_key(url_pais).worksheet(hoja)

#LINE 1: One of three lines that takes over 10 seconds to run (depending on url and table).
table_location = ws.find(hoja_key)

table_start_row = table_location.row + 1
table_start_col = table_location.col

#LINE 2: Another line that takes 10 seconds. This is needed to understand where the table ends.
columns_in_sheet = ws.row_values(table_start_row)

#This step is needed because there are tables next to each other, but always separated by one empty column, so this way we can know where one starts and another ends.
columns_in_sheet = columns_in_sheet[table_start_col-1:]
try:
  table_end_col = columns_in_sheet.index('') + table_start_col - 1
except ValueError:
  table_end_col = len(columns_in_sheet) + table_start_col - 1

table_range = f'{get_column_letter(table_start_col)}{table_start_row}:{get_column_letter(table_end_col)}'

#LINE 3: This takes the longest I think.
table = ws.get(table_range,value_render_option = 'UNFORMATTED_VALUE', date_time_render_option = 'FORMATTED_STRING')

df = pd.DataFrame(table)
df.columns = df.iloc[0]
df = df[1:]
df = df[df[df.columns[0]] != '']

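First of all, am I doing something wrong? Maybe I am missing a parameter that would make all gspread methods faster. This is very strange, because the same code that sometimes takes 1 minute to run takes 4 seconds at other times, and it seems completely random. The only pattern I have noticed is that the first run with a given pair of variables is never fast, while the next one may or may not be faster.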
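To narrow down which call is slow on a given run, each step can be timed separately. The following is only a minimal sketch: the `timed` helper and the `timings` dictionary are hypothetical names, not part of gspread, and `time.sleep` stands in for a real call such as `ws.find(hoja_key)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, log):
    """Store the wall-clock duration (in seconds) of the wrapped block in log[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log[label] = time.perf_counter() - start


timings = {}
with timed("find", timings):
    time.sleep(0.01)  # stand-in for a network call such as ws.find(hoja_key)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Wrapping the three marked lines this way on several runs would show whether the variation comes from one specific request or from all of them.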

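If there is nothing I can do about that, is there a way to make this code more time-efficient by removing or changing one of the three marked lines? For example, I used to have a fourth line that retrieved the last row, the same way the last column is obtained. However, I found that I saved more time by specifying only the last column and then filtering out all the empty cells at the end.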
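The open-ended range idea described above can be sketched without any API call. The helper names below (`column_letter`, `open_ended_range`, `drop_trailing_empty_rows`) are hypothetical, and `column_letter` is a plain-Python stand-in for the `get_column_letter` used in the posted code.

```python
def column_letter(col):
    """Convert a 1-based column index to A1 letters (1 -> 'A', 27 -> 'AA')."""
    letters = ""
    while col > 0:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def open_ended_range(start_row, start_col, end_col):
    """Build an A1 range with no end row, e.g. 'B6:F', so no last-row lookup is needed."""
    return f"{column_letter(start_col)}{start_row}:{column_letter(end_col)}"


def drop_trailing_empty_rows(rows):
    """Remove rows at the end of the result whose cells are all empty strings."""
    rows = list(rows)
    while rows and all(cell == "" for cell in rows[-1]):
        rows.pop()
    return rows


table_range = open_ended_range(start_row=6, start_col=2, end_col=6)
cleaned = drop_trailing_empty_rows([["id", "x"], [1, "a"], ["", ""], ["", ""]])
```

With this approach the range has no end row, and the empty rows returned at the bottom are trimmed locally instead of being located with an extra request.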

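Although 1 or 2 minutes may not seem like much, this is only one part of a much larger script, and the analysts who run the whole script may need to run it 20 times in a day.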

Any ideas would be greatly appreciated!

python google-sheets google-sheets-api gspread
1 Answer

Modification points:

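  • Looking at the script you posted for retrieving table, it appears to make 3 requests to the Sheets API, as listed below. I suspect this is what increases the processing cost, and I think the number of requests can be reduced in this case.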

    1. table_location = ws.find(hoja_key)
    2. columns_in_sheet = ws.row_values(table_start_row)
    3. table = ws.get(table_range,value_render_option = 'UNFORMATTED_VALUE', date_time_render_option = 'FORMATTED_STRING')
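  • Although I am not sure whether it is related to the increased cost, as another modification I declared both the start and the end of the rows and columns.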

When these points are reflected in your script, how about the following modification?

From:

#LINE 1: One of three lines that takes over 10 seconds to run (depending on url and table).
table_location = ws.find(hoja_key)

table_start_row = table_location.row + 1
table_start_col = table_location.col

#LINE 2: Another line that takes 10 seconds. This is needed to understand where the table ends.
columns_in_sheet = ws.row_values(table_start_row)

#This step is needed because there are tables next to each other, but always separated by one empty column, so this way we can know where one starts and another ends.
columns_in_sheet = columns_in_sheet[table_start_col-1:]
try:
  table_end_col = columns_in_sheet.index('') + table_start_col - 1
except ValueError:
  table_end_col = len(columns_in_sheet) + table_start_col - 1

table_range = f'{get_column_letter(table_start_col)}{table_start_row}:{get_column_letter(table_end_col)}'

#LINE 3: This takes the longest I think.
table = ws.get(table_range,value_render_option = 'UNFORMATTED_VALUE', date_time_render_option = 'FORMATTED_STRING')

To:

values = ws.get_all_values()
searched = [{"row": i, "col": r.index(hoja_key)} for i, r in enumerate(values) if hoja_key in r]
if searched != []:
    table_start_row = searched[0]["row"] + 2
    table_end_row = len(values)
    table_start_col = searched[0]["col"] + 1
    table_end_col = max([len(r) for r in values])
    table_range = f'{get_column_letter(table_start_col)}{table_start_row}:{get_column_letter(table_end_col)}{table_end_row}'
    table = ws.get(table_range, value_render_option='UNFORMATTED_VALUE', date_time_render_option='FORMATTED_STRING')
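  • When this script is run, the Sheets API is called 2 times, and only the data range given by table_range is retrieved. I think this may reduce the processing cost of the script a little.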
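As a possible further step, not part of the modification above, the table could be cut out of the values already in memory, so that no second request is needed. This is a sketch under assumptions: `extract_table` is a hypothetical helper, `values` is a list of rows such as the one returned by `ws.get_all_values()`, and the cells will only be unformatted if your gspread version accepts `value_render_option` on that call. It follows the rule from the question that neighbouring tables are separated by one empty column.

```python
def extract_table(values, key):
    """Return the table (header row first) located directly below the cell equal to key."""
    for row_idx, row in enumerate(values):
        if key in row:
            col_idx = row.index(key)
            break
    else:
        return []  # title not found anywhere in the sheet

    body = values[row_idx + 1:]
    if not body:
        return []

    header = body[0][col_idx:]
    # The table ends at the first empty header cell, or at the end of the row.
    width = header.index("") if "" in header else len(header)

    table = []
    for row in body:
        cells = row[col_idx:col_idx + width]
        cells += [""] * (width - len(cells))  # pad rows shorter than the table
        table.append(cells)

    # Keep the header, then only data rows whose first cell is not empty.
    return [table[0]] + [r for r in table[1:] if r[0] != ""]


# Small in-memory example: two tables separated by one empty column.
values = [
    ["", "", "", "", "", ""],
    ["", "Sales", "", "", "Costs", ""],
    ["", "id", "amount", "", "id", "cost"],
    ["", 1, 10.5, "", 7, 3.0],
    ["", 2, 20.0, "", 8, 4.0],
    ["", "", "", "", 9, 5.0],
]
sales = extract_table(values, "Sales")
costs = extract_table(values, "Costs")
```

The trade-off is that the whole sheet is downloaded once, which may or may not be faster than a targeted range request on very large sheets; timing both variants is the only way to know.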