Creating new columns from scraped information

Question

I am trying to add information scraped from a website into columns. I have a dataset that looks like this:

COL1   COL2    COL3
...     ...    bbc.co.uk

and I would like to end up with a dataset that contains these new columns:

 COL1   COL2    COL3          Website Address   Last Analysis   Blacklist Status \  
...     ...    bbc.co.uk

IP Address  Server Location    City       Region

These new columns come from this website: https://www.urlvoid.com/scan/bbc.co.uk. I need to fill each column with the corresponding information.

For example:

  COL1   COL2    COL3          Website Address   Last Analysis   Blacklist Status \  
...     ...    bbc.co.uk         Bbc.co.uk         9 days ago       0/35

Domain Registration               IP Address       Server Location    City       Region
1996-08-01 | 24 years ago       151.101.64.81    (US) United States   Unknown    Unknown

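Unfortunately, I am running into problems creating the new columns and filling them with the information scraped from the website. I will probably need to check more websites, not only bbc.co.uk. Please see the code I am using below. I am sure there is a better (less messy) way to do this. I would be grateful if you could help me figure it out. Thanks.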

import requests
from bs4 import BeautifulSoup

# One list of labels and one list of values, collected row by row
tot_headers_all = []
tot_info_all = []

r = requests.get('https://www.urlvoid.com/scan/bbc.co.uk')
soup = BeautifulSoup(r.content, 'lxml')
tab = soup.select("table.table.table-custom.table-striped")
dat = tab[0].select('tr')
for d in dat:
    row = d.select('td')
    headers = row[0].text  # first cell: the field label
    info = row[1].text     # second cell: the field value
    tot_headers_all.append(headers.split("    "))
    tot_info_all.append(info.split("    "))

# Flatten the nested lists (labels and values separately)
flat_list_headers = [item for sublist in tot_headers_all for item in sublist]
flat_list_info = [item for sublist in tot_info_all for item in sublist]

Edit:

Since I need to run this check for every URL in a list, I should consider something like the following:

import pandas as pd

urls = original_dataset['URLS'].tolist()
for x in urls:
    df = pd.read_html("https://www.urlvoid.com/scan/" + x)[0]
...

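As the example above shows, next to the three columns that already exist (COL1, COL2 and COL3) I also need to add the fields that come from the scrape (Website Address, Last Analysis, Blacklist Status, ...). So for each URL I should end up with the information related to it (as with bbc.co.uk in the example).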

Answer 1

You can get the data in a simpler way by using the pandas read_html method. Here is my attempt:

import pandas as pd

df = pd.read_html("https://www.urlvoid.com/scan/bbc.co.uk/")[0]

df_transpose = df.T
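A plain transpose of a two-column label/value table gives two rows, one holding the labels and one holding the values. If the goal is a single row whose column names are the labels, one option is to set the label column as the index before transposing. The sketch below uses a hand-built stand-in for the table (values taken from the question) and assumes read_html returns the two columns under the integer names 0 and 1; the real page may differ.

```python
import pandas as pd

# Stand-in for pd.read_html(...)[0]: label in column 0, value in column 1.
# Assumption: the scraped table has no header row, so columns are named 0 and 1.
df = pd.DataFrame([
    ["Website Address", "Bbc.co.uk"],
    ["Last Analysis", "9 days ago"],
    ["Blacklist Status", "0/35"],
    ["IP Address", "151.101.64.81"],
])

# Make the labels the index, then transpose so each label becomes a column name.
one_row = df.set_index(0).T.reset_index(drop=True)
one_row.columns.name = None  # drop the leftover axis name "0"

print(one_row)
```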

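You now have the transposed data. You can drop any columns you do not need. After that, all that is left is to combine it with your existing dataset. Assuming you can load your dataset as a pandas DataFrame, you can simply use the concat function for this (axis=1 concatenates as columns):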

pd.concat([df_transpose, existing_dataset], axis=1)
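One thing to keep in mind with concat and axis=1 is that rows are matched on their index labels, not on their position. The small sketch below illustrates this with made-up frames (the names scraped, misaligned and aligned are only for illustration): a scraped row whose index label is 1, such as the values row of a transposed table, does not line up with row 0 of the dataset until its index is reset.

```python
import pandas as pd

existing_dataset = pd.DataFrame({"COL3": ["bbc.co.uk"]})  # index label 0

# Illustrative scraped row carrying index label 1
scraped = pd.DataFrame(
    {"Website Address": ["Bbc.co.uk"], "Blacklist Status": ["0/35"]},
    index=[1],
)

# Index labels differ, so the result has two half-empty rows
misaligned = pd.concat([existing_dataset, scraped], axis=1)

# Resetting the index makes both frames share label 0
aligned = pd.concat([existing_dataset, scraped.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```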

See the pandas documentation on merging/concatenation: http://pandas.pydata.org/pandas-docs/stable/merging.html
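For the case in the question's edit, where every URL in the dataset needs its own scan, a merge on the URL column may be easier to reason about than concat. The following is only a sketch under some assumptions: the helper names scan_to_row, fetch_table and add_scan_columns are hypothetical, the first table on the page is assumed to be the two-column report, and the site may throttle or block repeated requests.

```python
import pandas as pd

def scan_to_row(table, domain, key="COL3"):
    """Turn a two-column label/value table into one dict keyed by label."""
    row = dict(zip(table.iloc[:, 0], table.iloc[:, 1]))
    row[key] = domain  # remember which domain this row belongs to
    return row

def fetch_table(domain):
    # Network call, as in the answer above; assumes the report is the first table.
    return pd.read_html("https://www.urlvoid.com/scan/" + domain)[0]

def add_scan_columns(dataset, fetch=fetch_table, key="COL3"):
    """Scan each distinct domain once and left-merge the results onto dataset."""
    domains = dataset[key].dropna().unique()
    rows = [scan_to_row(fetch(d), d, key) for d in domains]
    if not rows:
        return dataset.copy()
    scans = pd.DataFrame(rows)
    return dataset.merge(scans, on=key, how="left")
```

With the dataset from the question's edit this might be called as add_scan_columns(original_dataset, key="URLS"). Passing fetch as a parameter makes it possible to try the merge logic with a stub that returns a small hand-built table before hitting the real site.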
