How to extract text from text nodes and isolate the <br> tags in table data using Selenium and Python

Question  Votes: 1  Answers: 4


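I'm having trouble printing table data on a single line. I can only identify the cell with css_selector("td"), and that prints Name, Address, City/State, and Phone stacked in one column, whereas I'm trying to produce Name, Address, City/State, Phone on the same line.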

HTML: (see the attached image)

This seems like a silly thing to get stuck on, but I've been stuck for a long time and still can't isolate the <br> tags.

Code:

for x in link:
    driver.get(x)
    try:
        i = 0
        while 0 < 20:
            name = driver.find_elements_by_xpath("/html/body/div[2]/div/div[1]/div/div/table/tbody/tr/td[1]/table/tbody/tr['"+str(i)+"']/td/strong")
            if name[i].is_displayed():
                print(name[i].text)

                i = i + 1
            else:
                i = i + 1
    except (NoSuchElementException, JavascriptException, IndexError):
        continue

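I settled on this approach in an attempt to simply return the text of the sibling nodes one at a time, again to no avail. driver.find_elements_by_css_selector("td") also returns the whole table data, but with the line breaks in it.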

javascript python selenium css-selectors webdriverwait
4 Answers
0
votes

Each <br> adds a newline (\n) to the text of the <td>; you can split on it or replace it.

tds = driver.find_elements_by_css_selector("td")
for td in tds:
    text = td.text.split('\n')
    print(text)  # list: ['text1', 'text2', 'text3', 'text4']

    text = td.text.replace('\n', ' ')
    print(text)  # str: 'text1 text2 text3 text4'

0
votes

If you are able to identify the parent <td> element with css_selector("td") to print the Name, Address, City/State, and Phone, you can use the following Locator Strategies:

  • Name

    print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "td>strong"))).get_attribute("innerHTML"))
    
  • Address

    print(driver.execute_script('return arguments[0].childNodes[3].textContent;', WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "td")))).strip())
    
  • City/State

    print(driver.execute_script('return arguments[0].childNodes[5].textContent;', WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "td")))).strip())
    
  • Phone

    print(driver.execute_script('return arguments[0].lastChild.textContent;', WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "td")))).strip())
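
The calls above depend on fixed childNodes positions. As a hedged alternative, one script can return every non-empty text node of the cell in a single call. This is a sketch only: the helper direct_text_nodes and the constant TEXT_NODES_JS are hypothetical names, and it assumes the name sits inside <strong> while the other fields are bare text nodes separated by <br>. The childNodes indices used above suggest that layout, but the HTML itself is only available as an image.

```python
# JavaScript run in the browser: keep only text nodes (nodeType 3) that are
# direct children of the element, trimmed, with empty ones dropped.
TEXT_NODES_JS = """
return Array.from(arguments[0].childNodes)
    .filter(function (node) { return node.nodeType === 3; })
    .map(function (node) { return node.textContent.trim(); })
    .filter(function (text) { return text.length > 0; });
"""


def direct_text_nodes(driver, element):
    """Return the non-empty direct text nodes of `element` as a list of str."""
    # execute_script hands a JavaScript array of strings back as a Python list.
    return driver.execute_script(TEXT_NODES_JS, element)
```

With a located cell td, ", ".join(direct_text_nodes(driver, td)) would put the address, city/state, and phone on one line. Under the same assumption, the name inside <strong> is still read separately, as in the Name item.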
    

0
votes

BeautifulSoup can also be used in this case.

>>> from bs4 import BeautifulSoup
>>> import requests
>>> contents = requests.get(url).text

>>> soup = BeautifulSoup(contents, 'lxml')

>>> Text = soup.find('body').text

Then add a condition that checks for 'br' tags and skips them.
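
The "skip the br tags" step can be sketched without a browser. The example below swaps in Python's standard-library html.parser in place of BeautifulSoup so that it runs with no third-party packages. The class name CellTextParser and the sample HTML are hypothetical stand-ins for the real page, which is only shown as an image.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text pieces of each <td>, ignoring the tags between them."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of text pieces per <td>
        self._current = None  # pieces of the <td> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current = []
        # <br> and <strong> are simply skipped; only text data is kept.

    def handle_endtag(self, tag):
        if tag == "td" and self._current is not None:
            self.rows.append(self._current)
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and text:
            self._current.append(text)


# Hypothetical markup shaped like the cell described in the question.
html = (
    "<table><tr><td><strong>Some Name</strong><br>123 Main St"
    "<br>Springfield, IL<br>555-0100</td></tr></table>"
)
parser = CellTextParser()
parser.feed(html)
for pieces in parser.rows:
    print(", ".join(pieces))
```

Each entry in parser.rows holds the fields of one cell in document order, so joining them yields one line per cell.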


0
votes

from selenium.common.exceptions import NoSuchElementException

for x in link:
    driver.get(x)
    try:
        names = driver.find_elements_by_css_selector("td")
        i = 0
        while i < len(names):
            # Each <br> inside the cell becomes a line break in .text
            address = names[i].text.splitlines()
            r = len(address)

            if r == 4:
                print(x, " | ", address[0], " | ", address[1], " | ", address[2], " | ", address[3])
            elif r == 3:
                print(x, " | ", address[0], " | ", address[1], " | ", address[2])
            i = i + 1
    except (NoSuchElementException, IndexError):
        continue

This did the job.
