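I'm new to Stack Overflow, so I apologize in advance if I'm doing something wrong.
I have a spreadsheet on Google Sheets, for example, this one.
One of the cells contains a link inside an href tag. I want to get both the link and the text of that cell using the Google Sheets API or gspread.
I have already tried this solution, but the access token I get back is "None".
I also tried web scraping with BeautifulSoup, but it didn't work well.
For the bs4 approach, I tried this code, which I found here: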
from bs4 import BeautifulSoup
import requests
html = requests.get('https://docs.google.com/spreadsheets/d/1v8vM7yQ-27SFemt8_3IRiZr-ZauE29edin-azKpigws/edit#gid=0').text
soup = BeautifulSoup(html, "lxml")
tables = soup.find_all("table")
content = []
for table in tables:
    content.append([[td.text for td in row.find_all("td")] for row in table.find_all("tr")])
print(content)
I figured it out. Here is the complete code, in case anyone needs it:
import requests
import gspread
import urllib.parse
import pickle
spreadsheetId = "###" # Please set the Spreadsheet ID.
cellRange = "Yoursheetname!A1:A100" # Please set the range in A1 notation. Only the hyperlink of the first cell in this range is read below.
with open('token_sheets_v4.pickle', 'rb') as token:
    # get this file here
    # https://developers.google.com/identity/sign-in/web/sign-in
    credentials = pickle.load(token)
client = gspread.authorize(credentials)
# 1. Retrieve the access token.
access_token = client.auth.token
# 2. Request to the method of spreadsheets.get in Sheets API using `requests` module.
fields = "sheets(data(rowData(values(hyperlink))))"
url = "https://sheets.googleapis.com/v4/spreadsheets/" + spreadsheetId + "?ranges=" + urllib.parse.quote(cellRange) + "&fields=" + urllib.parse.quote(fields)
res = requests.get(url, headers={"Authorization": "Bearer " + access_token})
print(res)
# 3. Retrieve the hyperlink.
obj = res.json()
print(obj)
link = obj["sheets"][0]['data'][0]['rowData'][0]['values'][0]['hyperlink']
print(link)
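Since the question asks for the cell text as well as the link, here is a minimal sketch of reading both from the same kind of response. It assumes the field mask also requests `formattedValue`, and that cells without a link simply omit the `hyperlink` key (which is why plain indexing can raise `KeyError` on a range like A1:A100). The helper name `text_and_links` and the sample response are made up for illustration.

```python
# Hypothetical field mask: same as above, plus the displayed cell text.
FIELDS = "sheets(data(rowData(values(formattedValue,hyperlink))))"


def text_and_links(response):
    """Return (text, url) pairs for every cell present in the response.

    url is None for cells that have no hyperlink. Rows that come back
    empty are skipped, so positions are not preserved.
    """
    pairs = []
    for sheet in response.get("sheets", []):
        for grid in sheet.get("data", []):
            for row in grid.get("rowData", []):
                for cell in row.get("values", []):
                    pairs.append((cell.get("formattedValue", ""),
                                  cell.get("hyperlink")))
    return pairs


# Made-up response that mirrors the shape of res.json() above.
sample = {
    "sheets": [{
        "data": [{
            "rowData": [
                {"values": [{"formattedValue": "Docs",
                             "hyperlink": "https://example.com/docs"}]},
                {},  # an empty row
                {"values": [{"formattedValue": "plain text"}]},
            ]
        }]
    }]
}

for text, url in text_and_links(sample):
    print(text, url)
```

In the real script you would pass `res.json()` instead of `sample` and use `FIELDS` when building the request URL.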
Update!
A more elegant solution looks like this. Create the service:
import os
import pickle

from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

CLIENT_SECRET_FILE = 'secret/secret.json'
API_SERVICE_NAME = 'sheets'
API_VERSION = 'v4'
SCOPES = ['https://www.googleapis.com/auth/spreadsheets.readonly']

def Create_Service():
    cred = None
    pickle_file = f'secret/token_{API_SERVICE_NAME}_{API_VERSION}.pickle'
    # Reuse cached credentials if a token file already exists.
    if os.path.exists(pickle_file):
        with open(pickle_file, 'rb') as token:
            cred = pickle.load(token)
    # Refresh expired credentials, or run the OAuth flow if there are none.
    if not cred or not cred.valid:
        if cred and cred.expired and cred.refresh_token:
            cred.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRET_FILE, SCOPES)
            cred = flow.run_local_server()
        with open(pickle_file, 'wb') as token:
            pickle.dump(cred, token)
    try:
        service = build(API_SERVICE_NAME, API_VERSION, credentials=cred)
        print(API_SERVICE_NAME, 'service created successfully')
        return service
    except Exception as e:
        print('Unable to connect.')
        print(e)
        return None

service = Create_Service()
Then extract the links from every sheet in the spreadsheet as a convenient dictionary:
fields = "sheets(properties(title),data(startColumn,rowData(values(hyperlink))))"
print(service.spreadsheets().get(spreadsheetId=self.__spreadsheet_id,
fields=fields).execute())
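The call above returns nested JSON rather than a flat mapping. As a rough sketch, the response could be reduced to a dictionary keyed by sheet title as below. This assumes `startColumn` is omitted when it is 0 and that cells without links omit `hyperlink`; row indices are relative to the start of the returned grid because `startRow` is not in the field mask. `links_by_sheet` and the sample response are hypothetical.

```python
def links_by_sheet(response):
    """Map each sheet title to a list of (row, column, url) tuples.

    Indices are zero-based. The column index is shifted by the grid's
    startColumn; the row index is relative to the returned grid.
    """
    result = {}
    for sheet in response.get("sheets", []):
        title = sheet.get("properties", {}).get("title", "")
        found = []
        for grid in sheet.get("data", []):
            col_offset = grid.get("startColumn", 0)
            for r, row in enumerate(grid.get("rowData", [])):
                for c, cell in enumerate(row.get("values", [])):
                    if "hyperlink" in cell:
                        found.append((r, col_offset + c, cell["hyperlink"]))
        result[title] = found
    return result


# Made-up response in the shape requested by the fields string above.
sample_response = {
    "sheets": [
        {"properties": {"title": "Sheet1"},
         "data": [{"rowData": [
             {"values": [{"hyperlink": "https://example.com/a"}]},
             {},
             {"values": [{}, {"hyperlink": "https://example.com/b"}]},
         ]}]},
        {"properties": {"title": "Sheet2"},
         "data": [{"startColumn": 2, "rowData": [
             {"values": [{"hyperlink": "https://example.com/c"}]},
         ]}]},
    ]
}

print(links_by_sheet(sample_response))
```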
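So, how do fields work? We go to the Spreadsheet object description and look for its JSON representation. For example, to return the sheets objects from that JSON representation, we just use fields = "sheets", because the JSON representation of Spreadsheet has a field named "sheets".
OK, cool. We got the Sheet objects. How do we access the fields of a Sheet object? Just click on it in the documentation and look up its fields.
So how do we combine fields? It's easy. For example, to return the "properties" and "data" fields of the Sheet objects, I write the fields string like this: fields = "sheets(properties,data)". We simply list them like arguments in a normal function call, but without spaces.
The same applies to the objects nested inside the data field, and so on.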
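To make the nesting rule concrete, here is a small sketch that builds such a fields string from a nested dictionary. `build_fields` is a hypothetical helper, not part of any Google library; it only applies the "parent(child1,child2)" convention described above.

```python
def build_fields(spec):
    """Turn a nested dict into a fields string.

    A value of None marks a field returned whole; a dict lists the
    sub-fields to keep, written as parent(child1,child2) with no spaces.
    """
    parts = []
    for name, children in spec.items():
        if children:
            parts.append(f"{name}({build_fields(children)})")
        else:
            parts.append(name)
    return ",".join(parts)


spec = {
    "sheets": {
        "properties": {"title": None},
        "data": {
            "startColumn": None,
            "rowData": {"values": {"hyperlink": None}},
        },
    }
}

print(build_fields(spec))
# sheets(properties(title),data(startColumn,rowData(values(hyperlink))))
```

The printed string is the same one passed to `spreadsheets().get` earlier, so the dictionary form is just a more readable way of writing it.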
You can use the _spreadsheets_get(self, params=None) method in gspread/spreadsheet.py to achieve this.
Example:
params = {
    "spreadsheetId" : "spreadsheet_id_here",
    "ranges" : "Sheet1!A1:A1",
    "includeGridData" : True
}
print(spreadsheet._spreadsheets_get(params=params))
This returns a JSON object that contains the hyperlink-related data in the textFormatRuns section.
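As a rough sketch of reading that section, the helper below pulls the linked pieces of text out of a single cell. It assumes the cell carries `formattedValue` plus a `textFormatRuns` list whose entries have an optional `startIndex` (omitted when 0) and a `format.link.uri`, as I understand the CellData documentation. When the whole cell is one link, `textFormatRuns` may be absent and the `hyperlink` field is the place to look instead. `links_from_runs` and the sample cell are made up for illustration.

```python
def links_from_runs(cell):
    """Return (text, uri) for each linked run of text inside one cell."""
    text = cell.get("formattedValue", "")
    runs = cell.get("textFormatRuns", [])
    segments = []
    for i, run in enumerate(runs):
        start = run.get("startIndex", 0)
        # A run extends until the next run starts, or to the end of the text.
        if i + 1 < len(runs):
            end = runs[i + 1].get("startIndex", len(text))
        else:
            end = len(text)
        uri = run.get("format", {}).get("link", {}).get("uri")
        if uri:
            segments.append((text[start:end], uri))
    return segments


# Made-up cell with two links embedded in ordinary text.
sample_cell = {
    "formattedValue": "See docs and blog",
    "textFormatRuns": [
        {"format": {}},
        {"startIndex": 4, "format": {"link": {"uri": "https://example.com/docs"}}},
        {"startIndex": 8, "format": {}},
        {"startIndex": 13, "format": {"link": {"uri": "https://example.com/blog"}}},
    ],
}

print(links_from_runs(sample_cell))
```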