I want to extract all the questions and answers from this questionnaire, but I can't click the checkboxes.
<div class="freebirdFormviewerViewItemsCheckboxChoice"><label class="docssharedWizToggleLabeledContainer freebirdFormviewerViewItemsCheckboxContainer"><div class="docssharedWizToggleLabeledLabelWrapper exportLabelWrapper"><div class="quantumWizTogglePapercheckboxEl appsMaterialWizTogglePapercheckboxCheckbox docssharedWizToggleLabeledControl freebirdThemedCheckbox freebirdThemedCheckboxDarkerDisabled freebirdFormviewerViewItemsCheckboxControl isCheckedNext" jscontroller="EcW08c" jsaction="keydown:I481le;dyRcpb:dyRcpb;click:cOuCgd; mousedown:UX7yZ; mouseup:lbsD7e; mouseenter:tfO1Yc; mouseleave:JywGue; focus:AHmuwe; blur:O22p3e; contextmenu:mg9Pef;touchstart:p6p2H; touchmove:FwuNnf; touchend:yfqBxc(preventMouseEvents=true|preventDefault=true); touchcancel:JMtRjd;" jsshadow="" jsname="FkQz1b" aria-label="Conditions about promotions clearly shown" tabindex="0" aria-describedby=" i198" role="checkbox" aria-checked="false"><div class="quantumWizTogglePapercheckboxInk exportInk"></div><div class="quantumWizTogglePapercheckboxInnerBox exportInnerBox"></div><div class="quantumWizTogglePapercheckboxCheckMarkContainer"><div class="quantumWizTogglePapercheckboxCheckMark"><div class="quantumWizTogglePapercheckboxShort exportCheck"></div><div class="quantumWizTogglePapercheckboxLong exportCheck"></div></div></div></div><div class="docssharedWizToggleLabeledContent"><div class="docssharedWizToggleLabeledPrimaryText"><span dir="auto" class="docssharedWizToggleLabeledLabelText exportLabel freebirdFormviewerViewItemsCheckboxLabel">Conditions about promotions clearly shown</span></div></div></div></label></div>
From this markup I want to extract the text "Conditions about promotions clearly shown".
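For the extraction part alone, the label also appears in the `aria-label` attribute of the element with `role="checkbox"`, so no click is needed to read it. Below is a minimal sketch using only the standard library; the `sample` string is a shortened stand-in for the markup above, and `checkbox_labels` is a hypothetical helper name. With Selenium, the same HTML could presumably be taken from `driver.page_source`.

```python
from html.parser import HTMLParser


class CheckboxLabelParser(HTMLParser):
    """Collect the aria-label of every element with role="checkbox"."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Keep only checkbox-like elements that actually carry a label.
        if attributes.get("role") == "checkbox" and attributes.get("aria-label"):
            self.labels.append(attributes["aria-label"])


def checkbox_labels(html):
    """Return the checkbox labels found in an HTML string (hypothetical helper)."""
    parser = CheckboxLabelParser()
    parser.feed(html)
    return parser.labels


# Shortened stand-in for the checkbox markup shown in the question.
sample = (
    '<div class="freebirdFormviewerViewItemsCheckboxChoice">'
    '<div role="checkbox" aria-checked="false" '
    'aria-label="Conditions about promotions clearly shown"></div>'
    "</div>"
)
print(checkbox_labels(sample))
```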
I also need to click the checkbox, because ticking it is mandatory for moving on to the next page.
To click the checkboxes, I tried:
btn_check_boxes = driver.find_elements_by_class_name(
    "freebirdFormviewerViewItemsCheckboxChoice"
)
print("btn_check_boxes: ", btn_check_boxes)
for btn_check_box in btn_check_boxes:
    btn_check_box.click()
    break
But it didn't work, even though I do seem to be finding the elements:
...
published questionnaire
len_containers: 13
No question, NoSuchElementException
len_containers: 12
We also skip content_area.get_attribute("aria-label"): Other response
We also skip content_area.get_attribute("aria-label"): Other response
We also skip content_area.get_attribute("aria-label"): Other response
btn_check_boxes: [<selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="e620a782-0a7f-452e-a7bb-c975840fb4bd")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="b5009986-4f49-4d50-86c7-32a151c6f223")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="fc127bd8-5ebb-47f7-ae5b-ebcdb76af8cb")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="2456577c-b566-4503-92fc-e84828c73f9e")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="86648fb0-472a-419a-8752-cf50d49f147a")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="f2fa1ffa-bd19-4e32-91d0-2f2d54d2ae78")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="42b04359-2d23-4216-9404-eab63f881828")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="d3acae80-95b5-4c39-ba9f-bda78d6d15d4")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="7703effe-eab7-4f42-838e-62e29683d72a")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="2092ac6b-c798-4761-8632-21f1e0de2372")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="c121c982-0d03-43bf-a7a4-52a669c69011")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="1738790d-b311-420c-aae8-a0e290fa105f")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="c4de4cd3-12de-45dc-82a2-42cb4f52f16d")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", 
element="63dc8841-b58f-4323-aa60-3b851e7083df")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="a0c9129b-dfc8-46e5-bfd5-f50e69d80294")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="559839c2-13f5-4e69-a11c-b3030ee951f2")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="13badedc-909a-4b37-a4b8-63c7722e4dfb")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="71735d9a-1137-4de7-a921-175da9618a12")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="e1141178-e1cb-400c-b8b0-8fc26828f15e")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="a1aaa788-1e37-4fa1-b97e-c4f91b02e6a9")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="73871107-85cd-4842-83fb-a4fd1bd3dfc7")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="5313ec0f-3bb0-4fb0-a2e6-2137b6656392")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="1f85efd6-cd9e-4d75-85ff-cb5b8559c2f7")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="a0430c36-b0ff-484a-9880-87e1f7376480")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="207e8ec3-cbda-46de-96a2-95fb2390e4af")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="494fd699-dca5-4602-a6a4-af17a581f093")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="fb6a7103-76a9-4274-81ce-3d5631e20fc7")>, <selenium.webdriver.remote.webelement.WebElement 
(session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="88f7e564-a200-44a8-9c79-1358fde458f0")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="e26cd285-269f-4a63-bcc9-7a5f97ffca3c")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="1cbfcd32-9370-425a-8c7f-f70739d3e6f0")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="b17a18a5-f394-4f46-b182-a899bd334901")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="f23a034d-2279-441b-bfca-baf41a92269a")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="b2f3f154-9afa-4183-afb0-72ea33eab2df")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="7f94283b-d2ac-4657-b545-9a25a79d886d")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="8a800eff-fbc9-4fb6-b858-37b98034a4b5")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="5737d1ad-531f-45c9-b7ea-7f95965c5973")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="8c444406-16ae-4fe1-ab2b-73759cc27eed")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="c0d540d8-745c-4a01-ad63-60535c62a46b")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="659d8306-624b-4cca-801a-346386c3be90")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="7e95b2db-568a-4192-bbfe-cbccb88f2481")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="bb54e2fb-d597-4eff-b400-5cd450517552")>, 
<selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="da0a8b07-3b5e-4a6f-9351-7ee6c5ed5955")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="8e990bb9-335a-484d-9895-99e0051e0ebe")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="18d10ada-c4c3-4608-b780-0262544523fd")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="508eb37d-48d4-4eb0-9f55-8acd367e1c6e")>]
len_containers: 11
We also skip content_area.get_attribute("aria-label"): Other response
We also skip content_area.get_attribute("aria-label"): Other response
We also skip content_area.get_attribute("aria-label"): Other response
btn_check_boxes: [<selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="e620a782-0a7f-452e-a7bb-c975840fb4bd")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="b5009986-4f49-4d50-86c7-32a151c6f223")>, ... ver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="18d10ada-c4c3-4608-b780-0262544523fd")>, <selenium.webdriver.remote.webelement.WebElement (session="e5556bb6f3bd48b64f9f68b1acd09d0d", element="508eb37d-48d4-4eb0-9f55-8acd367e1c6e")>]
len_containers: 10 #
...
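But they don't seem to be clickable, because in the end nothing gets clicked. As you can see, my approach is also inefficient: I loop over my containers (the question/answer items, not the checkboxes, to get each question and its responses), and I seem to grab the checkboxes again on every iteration, which makes no sense. Once should be enough.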
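One way to avoid re-fetching the checkboxes for every container is to keep the per-container work (question and answer text) inside the loop and run page-level actions once, after it. The sketch below assumes the Selenium 3 `find_elements_by_class_name` helpers and the class names from the question; `scrape_page` and the `tick_checkboxes` callback are hypothetical names, not part of Selenium.

```python
def scrape_page(driver, tick_checkboxes):
    """Collect {question: [answers]} for one page, then tick checkboxes once.

    `tick_checkboxes` is a hypothetical callback that takes the driver and
    performs whatever page-level clicking is required.
    """
    questionnaire = {}
    containers = driver.find_elements_by_class_name(
        "freebirdFormviewerViewNumberedItemContainer"
    )
    for container in containers:
        # The plural lookup returns an empty list instead of raising.
        titles = container.find_elements_by_class_name(
            "freebirdFormviewerViewItemsItemItemTitle"
        )
        if not titles:
            continue
        labels = container.find_elements_by_class_name(
            "docssharedWizToggleLabeledLabelText"
        )
        questionnaire[titles[0].text] = [label.text for label in labels]
    # Page-level action: runs once per page, not once per container.
    tick_checkboxes(driver)
    return questionnaire
```

Whether one pass over the checkboxes satisfies every mandatory question depends on the form, so this is only a structural suggestion.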
My full code is:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
from selenium.common.exceptions import ElementNotInteractableException, NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common import exceptions
import pickle
import config
WDWTIME = 20
USER = config.username
PWD = config.password
def setup_chromedriver():
    """Some of the google forms need a login"""
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    # NOTE: chrome_options is never passed to the driver below, so Chrome
    # does not actually start in headless mode.
    # Raw string so the backslashes in the Windows path are not treated as escapes.
    driver = webdriver.Chrome(r"C:\Programs\chromedriver.exe")
    url = 'https://www.google.com/accounts/'
    driver.get(url)
    # Find login field
    login_field = WebDriverWait(driver, WDWTIME).until(
        EC.presence_of_element_located((By.ID, 'identifierId')))
    login_field.send_keys(USER)
    # Click next button
    driver.find_element_by_id('identifierNext').click()
    # Find password field
    time.sleep(4)
    driver.set_page_load_timeout(50)
    driver.set_script_timeout(50)
    password_field = WebDriverWait(driver, WDWTIME).until(
        EC.presence_of_element_located((By.ID, 'password')))
    password_field = password_field.find_element_by_tag_name('input')
    password_field.send_keys(PWD)
    # Click next button
    driver.find_element_by_id('passwordNext').click()
    driver.set_page_load_timeout(30)
    driver.set_script_timeout(30)
    return driver
def load_data():
    df = pd.read_csv("research_assistant_intern_recruitment_an.csv")
    filter_col = ["Link"]
    return df, filter_col
def get_published_questionnaire():
    """Gets the questions and related answers of a Google Form.

    Returns:
        dictionary: the dictionary of questions and answers successfully scraped.
    """
    print("published questionnaire")
    questionnaire = {}
    btns = driver.find_elements_by_css_selector(".appsMaterialWizButtonEl")
    # get "next" button, *warning* "request edit access" is also caught
    next_btns = driver.find_elements_by_class_name("appsMaterialWizButtonPaperbuttonContent.exportButtonContent")
    if next_btns:
        next_btns[-1].click()
    next_btns = driver.find_elements_by_class_name("appsMaterialWizButtonPaperbuttonContent.exportButtonContent")
    # we iterate to find questions and click on the next page while there is a button we can click on
    # *warning* for some google forms like
    # https://docs.google.com/forms/d/e/1FAIpQLScWOjVVIKX9Qis2d0vCVpo3RuYqgiZ9TkD4BZm_eTvgVdvGNg/formResponse
    # it creates an infinite loop
    while next_btns != []:
        containers = driver.find_elements_by_class_name(
            "freebirdFormviewerViewNumberedItemContainer"
        )
        len_containers = len(containers)
        for container in containers:
            time.sleep(0.5)
            len_containers -= 1
            print("len_containers: ", len_containers)
            try:
                time.sleep(0.5)
                question = container.find_element_by_class_name(
                    "freebirdFormviewerViewItemsItemItemTitle.exportItemTitle.freebirdCustomFont"
                )
            except NoSuchElementException:
                print("No question, NoSuchElementException")
                continue
            except exceptions.StaleElementReferenceException:
                print("No question, StaleElementReferenceException")
                continue
            responses = container.find_elements_by_class_name(
                "docssharedWizToggleLabeledLabelText"
            )
            extracted_text = [response.text for response in responses]
            questionnaire[question.text] = extracted_text
            # writing when compulsory
            content_areas = driver.find_elements_by_class_name(
                "quantumWizTextinputSimpleinputInput.exportInput"
            )
            for content_area in content_areas:
                skip = ["Document title", "Titre du document", "Adresse e-mail valide"]
                if content_area.get_attribute("aria-label") in skip and not content_area.get_attribute("aria-label").isspace():
                    print("We skip content_area.get_attribute(\"aria-label\"): ", content_area.get_attribute("aria-label"))
                else:
                    print("We also skip content_area.get_attribute(\"aria-label\"): ", content_area.get_attribute("aria-label"))
                    content_area.send_keys("10102015")
            content_areas = driver.find_elements_by_class_name(
                "quantumWizTextinputPaperinputInput.exportInput"
            )
            for content_area in content_areas:
                if content_area.get_attribute("type") == "date" and not content_area.get_attribute("type").isspace():
                    condition = content_area.get_attribute("type")
                    if condition == "date":
                        content_area.send_keys("10102015")
                elif content_area.get_attribute("max") and not content_area.get_attribute("max").isspace():
                    # renamed from `max` to avoid shadowing the built-in
                    max_value = content_area.get_attribute("max")
                    content_area.send_keys(max_value)
                elif content_area.get_attribute("aria-label") and not content_area.get_attribute("aria-label").isspace():
                    condition = content_area.get_attribute("aria-label")
                    print("content_area.get_attribute(\"aria-label\"): ", content_area.get_attribute("aria-label"))
                    if condition == "State (Two letter Abbreviation)":
                        content_area.send_keys("CA")
                    else:
                        content_area.send_keys("10102015")
            for content_area in content_areas:
                skip = ["Document title", "Titre du document", "Adresse e-mail valide"]
                if content_area.get_attribute("aria-label") in skip and not content_area.get_attribute("aria-label").isspace():
                    print("content_area.get_attribute(\"aria-label\"): ", content_area.get_attribute("aria-label"))
                else:
                    print("content_area.get_attribute(\"aria-label\"): ", content_area.get_attribute("aria-label"))
                    content_area.send_keys("10102015")
            btns_answers = driver.find_elements_by_css_selector(".appsMaterialWizToggleRadiogroupElContainer")
            for btn_answer in btns_answers:
                try:
                    driver.execute_script('arguments[0].scrollIntoView(true);', btn_answer)
                    btn_answer.click()
                except ElementNotInteractableException:
                    pass
                except exceptions.ElementClickInterceptedException:
                    continue
            # long answers
            content_areas = driver.find_elements_by_class_name(
                "quantumWizTextinputPapertextareaInput.exportTextarea"
            )
            for content_area in content_areas:
                content_area.send_keys(
                    "This restaurant is really good! Me and my boyfriend went there on our holiday "
                    "we had dinner there at 3 of February food was 100% And the service vas 150% "
                    "And i really want to thank Asie for a really good service as for his coworkers. "
                    "We highly recommended this restaurant!"
                )
            # check boxes
            btn_check_boxes = driver.find_elements_by_class_name(
                "docssharedWizToggleLabeledContainer.freebirdFormviewerViewItemsCheckboxContainer"
            )
            for btn_check_box in btn_check_boxes:
                btn_check_box.click()
                break
            # btn_check_box[-1].click()
            # # other weird check boxes
            btn_check_boxes = driver.find_elements_by_class_name(
                "docssharedWizToggleLabeledLabelText.exportLabel.freebirdFormviewerViewItemsCheckboxLabel"
            )
            for btn_check_box in btn_check_boxes:
                btn_check_box.click()
                break
            # Clicking on text. *warning* : doesn't work
            btn_check_boxes = driver.find_elements_by_class_name(
                "freebirdFormviewerViewItemsCheckboxChoice"
            )
            print("btn_check_boxes: ", btn_check_boxes)
            for btn_check_box in btn_check_boxes:
                btn_check_box.click()
                break
        # btns[-1].click()
        next_btns = driver.find_elements_by_class_name(
            "appsMaterialWizButtonPaperbuttonContent.exportButtonContent")
        if next_btns != []:
            next_btns[-1].click()
            next_btns = []
        else:
            continue
    print("questionnaire: ", questionnaire)
    return questionnaire
def get_backend_questionnaire():
    print("backend questionnaire")
    # sometimes we start with something that looks like a published page with a "next" button
    # if driver.find_element_by_id('identifierNext'):
    #     driver.find_element_by_id('identifierNext').click()
    questionnaire = {}
    # I get all the cards with questions and answers inside
    containers = driver.find_elements_by_class_name(
        "freebirdFormeditorViewItemContentWrapper"
    )
    driver.set_page_load_timeout(30)
    driver.set_script_timeout(30)
    # for each card
    for container in containers:
        try:
            # Get the question
            # question = container.find_element_by_class_name(
            #     "appsMaterialWizTextinputTextareaInput.exportTextarea"
            # )
            question = container.find_element_by_css_selector(".exportTextarea[aria-label='Intitulé de la question']")
        except NoSuchElementException:
            print("NoSuchElementException in " + str(container))
            continue
        # Get the answers
        responses = container.find_elements_by_css_selector(
            ".quantumWizTextinputSimpleinputInput.exportInput"
        )
        extracted_responses = [response.get_attribute("data-initial-value") for response in responses]
        questionnaire[question.text] = extracted_responses
    driver.set_page_load_timeout(30)
    driver.set_script_timeout(30)
    print("questionnaire backend: ", questionnaire)
    return questionnaire
def extract(driver, df, survey):
    count_questionnaires = 0
    result = []
    count_not_empty = 0.0
    print("survey: ", survey)
    # df = pd.DataFrame({"Link":["https://docs.google.com/forms/d/1_iRBtfJANF5MGWqoIMQUxBdeuAa4ePMltdIsVRmdY5Y/edit?usp=sharing"],
    #                    "Task":["Hotel ABC"]}) # debugging StaleElementReferenceException
    for location, task in zip(df.Link, df.Task):
        if task == survey:
            print("location: ", location)
            questionnaire = {}
            if "docs.google.com" in str(location):
                count_questionnaires += 1.0
                driver.get(location)
                # test if it is a published version
                try:
                    ask_access_btn = driver.find_elements_by_class_name(
                        "freebirdFormviewerViewNavigationHeaderButtonContent"
                    )
                except exceptions.UnexpectedAlertPresentException:
                    print("UnexpectedAlertPresentException")
                    # NOTE: this only references the function without calling it,
                    # and ask_access_btn stays undefined on this path
                    get_published_questionnaire
                if ask_access_btn:
                    questionnaire = get_published_questionnaire()
                else:
                    questionnaire = get_backend_questionnaire()
                if questionnaire not in [{}, {'': ''}]:
                    count_not_empty += 1.0
                result.append({str(count_questionnaires): questionnaire})
            count_questionnaires += 1
    print("count_questionnaires: ", count_questionnaires)
    if count_questionnaires != 0:
        print("count_not_empty/count_questionnaires: ", count_not_empty/count_questionnaires)
    return result
if __name__ == '__main__':
    # Need to log on to the Google account to access certain questionnaires.
    # Also set up chromedriver to run in headless state.
    driver = setup_chromedriver()
    published_questionnaires = []  # tracking published ones
    # Load CSV download of Google Sheet
    df, columns = load_data()
    surveys = ['Hotel ABC', "Airline XYZ", "The Ministry of Tourism of France"]
    for survey in surveys:
        result = extract(driver, df, survey)
        survey = survey.replace(" ", "_")
        with open("applicant" + survey + "_c.p", "wb") as pickle_out:
            pickle.dump(result, pickle_out)
    print("published_questionnaires: ", published_questionnaires)
The CSV I am loading is:
Link, Task
https://docs.google.com/forms/d/1j0nk_Oo-_pfJBM4UcWITDPXT97-qX5mZpb3uVyKS3CA/edit?usp=sharing,Hotel ABC
Try locating btn_check_boxes with .find_elements_by_css_selector('div.quantumWizTogglePapercheckboxCheckMark') and click each element through JavaScript, passing it as the argument to the script arguments[0].click();
btn_check_boxes = driver.find_elements_by_css_selector('div.quantumWizTogglePapercheckboxCheckMark')
for btn_check_box in btn_check_boxes:
    driver.execute_script('arguments[0].click();', btn_check_box)
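Since the question reports that clicks appeared to succeed while nothing was ticked, it may help to verify the result afterwards. The sketch below wraps the JavaScript click in a function and then counts elements whose `aria-checked` attribute reads "true", an attribute visible in the question's markup. `js_click_all` and `count_checked` are hypothetical helper names, and the code assumes the Selenium 3 `find_elements_by_css_selector` API used in this thread.

```python
def js_click_all(driver, css_selector="div.quantumWizTogglePapercheckboxCheckMark"):
    """Click every element matching css_selector through JavaScript.

    Returns the number of elements that were clicked.
    """
    elements = driver.find_elements_by_css_selector(css_selector)
    for element in elements:
        # A JavaScript click does not require the element to be
        # visible or unobstructed, unlike WebElement.click().
        driver.execute_script("arguments[0].click();", element)
    return len(elements)


def count_checked(driver):
    """Count checkboxes whose aria-checked attribute reports "true"."""
    boxes = driver.find_elements_by_css_selector("div[role='checkbox']")
    return sum(1 for box in boxes if box.get_attribute("aria-checked") == "true")
```

Calling `js_click_all(driver)` followed by `count_checked(driver)` gives a quick check that the form actually registered the clicks; if the count stays at zero, the selector is probably targeting an element that does not carry the click handler.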
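I believe you don't need to click anything except the Next button in order to extract the questions and their answers. The following Ruby code extracts all the questions and answers on a single page.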
Capybara.page.all(:xpath, '//div[contains(@class, "ItemContainer")]').each do |container|
  puts "Title: #{container.find('[role=heading]').text}"
  container.all('.docssharedWizToggleLabeledContent').each { |choice| puts choice.text }
  puts "\n"
end
We just need to wrap this in a loop that executes the block and exits when there is no Next button, for example