I would like to know why the lists all_links
and all_titles
do not receive any records from the lists titles
and links
. I have also tried the .extend()
method, but it didn't help.
import requests
from bs4 import BeautifulSoup

all_links = []
all_titles = []

def title_link(page_num):
    page = requests.get(
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
        % (page_num, page_num, page_num))
    soup = BeautifulSoup(page.content, 'html.parser')
    links = ['https://www.gumtree.pl' + link.get('href')
             for link in soup.find_all('a', class_="href-link tile-title-text")]
    titles = [flat.next_element for flat in soup.find_all('a', class_="href-link tile-title-text")]
    print(titles)

for i in range(1, 5+1):
    title_link(i)
    all_links = all_links + links
    all_titles = all_titles + titles
    i += 1

print(all_links)

import pandas as pd
df = pd.DataFrame(data={'title': all_titles, 'link': all_links})
df.head(100)

#df.to_csv("./gumtree_page_1.csv", sep=';', index=False, encoding='utf-8')
#df.to_excel('./gumtree_page_1.xlsx')
When I run your code, I get:
NameError Traceback (most recent call last)
<ipython-input-3-6fff0b33d73b> in <module>
16 for i in range(1,5+1):
17 title_link(i)
---> 18 all_links = all_links + links
19 all_titles = all_titles + titles
20 i+=1
NameError: name 'links' is not defined
This points to the problem: a variable named links
is not defined in the global scope (where it is being added to all_links
). You can read about Python scoping here. You need to return
the links and titles from title_link
. Something like the following:
def title_link(page_num):
    # your code here
    return links, titles

for i in range(1, 5+1):
    links, titles = title_link(i)
    all_links = all_links + links
    all_titles = all_titles + titles

print(all_links)
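To see why the NameError occurs in the first place, here is a minimal, self-contained sketch of the scoping behaviour (no network access needed; the list contents are placeholders):

```python
def title_link():
    links = ["a-link"]    # local to title_link
    titles = ["a-title"]  # also local; both vanish when the function returns

title_link()
try:
    print(links)          # links was never defined at module level
except NameError as e:
    print(e)              # name 'links' is not defined
```

Assigning to a name inside a function creates a local variable; once the function returns, that binding is gone, which is exactly why the loop in the question cannot see links or titles.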
This code shows confusion about scope. links
and titles
inside title_link
are local to that function. When the function ends, that data is gone and cannot be accessed from any other scope (such as main). Use a return
statement to get values out of a function. In this case you would need to return the pair titles
and links
as a tuple, but this hints at a basic design flaw.
A function should perform one task. A function like title_link
is overloaded and should be two separate functions: one to fetch the titles and one to fetch the links.
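If you did keep the split, it might look like the sketch below. The names get_links and get_titles are hypothetical, and the stdlib html.parser stands in for BeautifulSoup here only so the demo runs offline on a sample snippet; with bs4 each body would be a one-line find_all call:

```python
from html.parser import HTMLParser

# Sample of the markup the scraper targets (two ad tiles).
SAMPLE = (
    '<a class="href-link tile-title-text" href="/a-mieszkanie-1">Flat one</a>'
    '<a class="href-link tile-title-text" href="/a-mieszkanie-2">Flat two</a>'
)

class _TileParser(HTMLParser):
    """Collects hrefs and texts of anchors with the tile-title class."""
    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.texts = []
        self._in_tile = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "href-link tile-title-text":
            self.hrefs.append(attrs.get("href"))
            self._in_tile = True

    def handle_data(self, data):
        if self._in_tile:
            self.texts.append(data)
            self._in_tile = False

def get_links(html):
    """One task: return absolute ad links."""
    p = _TileParser()
    p.feed(html)
    return ["https://www.gumtree.pl" + h for h in p.hrefs]

def get_titles(html):
    """One task: return ad titles."""
    p = _TileParser()
    p.feed(html)
    return p.texts

print(get_links(SAMPLE))   # ['https://www.gumtree.pl/a-mieszkanie-1', ...]
print(get_titles(SAMPLE))  # ['Flat one', 'Flat two']
```

Each function now has a single responsibility, and each returns its result instead of relying on variables leaking out of a scope.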
Having said that, the function here looks like a premature abstraction, since the work can be done directly:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d"
data = {"title": [], "link": []}

for i in range(1, 6):
    page = requests.get(url % (i, i, i))
    soup = BeautifulSoup(page.content, "html.parser")
    titles = soup.find_all("a", class_="href-link tile-title-text")
    data["title"].extend([x.next_element for x in titles])
    data["link"].extend("https://www.gumtree.pl" + x.get("href") for x in titles)

df = pd.DataFrame(data)
print(df.head(100))
Other notes:
i += 1
is unnecessary; a for
loop advances automatically in Python. (1, 5+1)
is expressed more clearly as (1, 6)
. list.extend(other_list)
is preferable to list = list + other_list
; the latter is slow and memory-hungry because it creates a full copy of the list.

Try this:
import requests
from bs4 import BeautifulSoup

all_links = []
all_titles = []

def title_link(page_num):
    page = requests.get(
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
        % (page_num, page_num, page_num))
    soup = BeautifulSoup(page.content, 'html.parser')
    links = ['https://www.gumtree.pl' + link.get('href')
             for link in soup.find_all('a', class_="href-link tile-title-text")]
    titles = [flat.next_element for flat in soup.find_all('a', class_="href-link tile-title-text")]
    print(titles)
    return links, titles

for i in range(1, 5+1):
    links, titles = title_link(i)
    all_links.extend(links)
    all_titles.extend(titles)
    # i += 1 not needed in python

print(all_links)

import pandas as pd
df = pd.DataFrame(data={'title': all_titles, 'link': all_links})
df.head(100)
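The extend-based accumulation above mutates the existing lists instead of rebuilding them. A small, self-contained sketch of the difference (the list contents are placeholders):

```python
# Rebinding with + allocates a brand-new list; extend grows the existing one.
all_links = []
batch = ["link1", "link2"]

before = id(all_links)
all_links = all_links + batch      # copies everything into a new list object
print(id(all_links) == before)     # False: a different object was created

all_links = []
before = id(all_links)
all_links.extend(batch)            # appends in place, no full copy
print(id(all_links) == before)     # True: still the same object
print(all_links)                   # ['link1', 'link2']
```

For a handful of pages the cost difference is negligible, but the in-place form scales better and avoids rebinding the name on every iteration.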
I think you just need to return links
and titles
from title_link(page_num)
.
Edit: removed the manual incrementing, as per the comments. Edit: changed all_links = all_links + links
to all_links.extend(links)