I am scraping an FAQ page and need to find which tag on the page contains the answers

Question description. Votes: 0, Answers: 1
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re
req = requests.get('https://www.godrejproperties.com/nricorner/nri-faqs')
soup = BeautifulSoup(req.text, "html5lib")

# Collect the parent element of every text node that looks like a question.
list1 = []
for elem in soup(text=re.compile(r'\s*((?:how|How|Can|can|what|What|where|Where|describe|Describe|Who|who|When|when|Why|why|Should|should|is|Is|I|Do|do|Are|are|Will|will)[^.<>?]*?\s*\?)')):
    print(elem.parent)
    list1.append(elem.parent)

# Use the tag name of the second match as the tag that wraps questions.
# (.name returns only the name, without attributes, so find_all can use it.)
tag = list1[1].name
print(tag)

# Re-scan every element with that tag and keep the question sentences found in it.
Ques = []
for header in soup.find_all(tag):
    ffff = re.findall(r'\s*((?:how|How|Can|can|what|What|where|Where|describe|Describe|Who|who|When|when|Why|why|Should|should|is|Is|I|Do|do|Are|are|Will|will)[^.<>?]*?\s*\?)', str(header))
    if len(ffff) > 0:
        Ques.append(ffff)
# dtype=object lets NumPy accept match lists of different lengths.
Ques = np.array(Ques, dtype=object)
print(Ques)

Similarly, I need to find the answers on FAQ pages. I need an algorithm that detects which tag contains each answer, extracts its content, and saves it in a list. Later I need each question and its answer as a pair.

python pandas numpy web-scraping beautifulsoup
1 Answer
1
vote

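You can use XPath to get the details. As you can see from the HTML structure, all the questions and answers sit inside an accordion, so basically we need to traverse it by attribute. To get the answers directly, we can use the following XPath: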

//*[@class="ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom"]

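But you need to be careful, because this can pull other accordions into the data you capture. So validate the data against the question ID, which is also reflected in the answer ID. The questions can be selected with: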

//*[@class="ui-accordion-header ui-state-default ui-corner-all ui-accordion-icons"]
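
As a rough sketch of how the two XPath expressions and the ID check could fit together, the snippet below runs them against a small hand-written fragment instead of the live page. The sample markup, the id pattern (`...-header-N` paired with `...-panel-N`), and the helper names `text_of` and `pair_questions_and_answers` are all assumptions made for illustration; check the ids on the real page before relying on them. It uses the standard library's ElementTree, which supports only a subset of XPath and needs well-formed markup, so for the real page a full XPath engine such as lxml would be the more likely choice.

```python
import xml.etree.ElementTree as ET

HEADER_CLASS = "ui-accordion-header ui-state-default ui-corner-all ui-accordion-icons"
PANEL_CLASS = "ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom"

# Hypothetical, well-formed stand-in for the accordion markup.
# The last panel belongs to some other widget and has no matching question.
SAMPLE = f"""
<div>
  <h3 id="faq-header-0" class="{HEADER_CLASS}">Sample question one?</h3>
  <div id="faq-panel-0" class="{PANEL_CLASS}"><p>Sample answer one.</p></div>
  <h3 id="faq-header-1" class="{HEADER_CLASS}">Sample question two?</h3>
  <div id="faq-panel-1" class="{PANEL_CLASS}"><p>Sample answer two.</p></div>
  <div id="other-widget-panel-9" class="{PANEL_CLASS}"><p>Unrelated text.</p></div>
</div>
"""


def text_of(element):
    """Return all text inside an element with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def pair_questions_and_answers(markup):
    root = ET.fromstring(markup.strip())
    headers = root.findall(f".//*[@class='{HEADER_CLASS}']")
    panels = root.findall(f".//*[@class='{PANEL_CLASS}']")

    # Index answers by id so a panel is only paired with its own question,
    # not with whatever happens to sit at the same position in the list.
    panels_by_id = {panel.get("id"): panel for panel in panels}

    pairs = []
    for header in headers:
        # Assumed convention: the answer id mirrors the question id.
        panel_id = header.get("id", "").replace("-header-", "-panel-")
        panel = panels_by_id.get(panel_id)
        if panel is not None:
            pairs.append((text_of(header), text_of(panel)))
    return pairs


qa_pairs = pair_questions_and_answers(SAMPLE)
for question, answer in qa_pairs:
    print(question, "->", answer)
```

The unrelated panel matches the answer XPath but is dropped because no question id maps to it, which is the kind of validation described above.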

You can also use other XPath expressions or CSS selectors, or even go through article.
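
For the CSS selector route, here is a minimal sketch with BeautifulSoup, which the question already uses. It assumes the questions are `h3` elements and that each answer is a following sibling `div` carrying the `ui-accordion-content` class; the sample markup is made up, so the tag names and nesting need checking against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the accordion on the FAQ page.
SAMPLE_HTML = """
<div id="accordion">
  <h3 class="ui-accordion-header ui-state-default">Sample question one?</h3>
  <div class="ui-accordion-content ui-widget-content"><p>Sample answer one.</p></div>
  <h3 class="ui-accordion-header ui-state-default">Sample question two?</h3>
  <div class="ui-accordion-content ui-widget-content"><p>Sample answer two.</p></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

css_pairs = []
# A CSS class selector matches one class at a time, so the full class
# string from the XPath version is not needed here.
for header in soup.select("h3.ui-accordion-header"):
    # Next sibling div that carries the answer class, if there is one.
    panel = header.find_next_sibling("div", class_="ui-accordion-content")
    if panel is not None:
        css_pairs.append(
            (header.get_text(strip=True), panel.get_text(" ", strip=True))
        )

for question, answer in css_pairs:
    print(question, "->", answer)
```

Matching on a single class is less brittle than comparing the whole class attribute, since jQuery UI adds and removes state classes as panels open and close.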
