Thanks for your attention, and sorry for my bad English. I have been trying to get the HTML from https://www.skiddle.com/festivals/dates.html, without success. I know that some parts of the page are loaded by a JS script, but I don't know how to get them. I also tried using a `Session`, with the same result. Please tell me what I need to use in my code, or what I should look into.
Thanks in advance!
Here is my code:
import requests
from bs4 import BeautifulSoup
import lxml
from selenium import webdriver
import time
import undetected_chromedriver
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36 Edg/121.0.0.0'
}
proxies = {
    'https': 'http://146.247.105.71:4827'
}

def get_location(url):
    response = requests.get(url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(response.text, 'lxml')
    print(soup, '\n\n\nlox\n\n\n')
    # options = undetected_chromedriver.ChromeOptions()
    # options.add_argument('--proxy-server=146.247.105.71:4827')
    # driver = undetected_chromedriver.Chrome(options=options)
    # driver.get(url)
    # time.sleep(5)
    # response = driver.page_source
    # driver.close()
    # driver.quit()
    # print(response)

def main():
    get_location(url='https://www.skiddle.com/festivals/dates.html')

if __name__ == '__main__':
    main()
I need the links to each festival's page.
Here is an example of how to print festival name + URL:
import requests
from bs4 import BeautifulSoup

url = "https://www.skiddle.com/festivals/dates.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for a in soup.select("li.margin-bottom-10 a"):
    print(f'{a.text:<50} {a["href"]}')
Prints:
...
Levitation '24 at Bedford Esquires /whats-on/Bedford/Bedford-Esquires/Levitation-24/37157298/
Day at Historic Centreville Park /whats-on/united-states/Historic-Centreville-Park/Day/36718089/
When We Were Young at Las Vegas USA https://www.skiddle.com/festivals/when-we-were-young/
When We Were Young at Las Vegas USA https://www.skiddle.com/festivals/when-we-were-young/
Damnation Festival 2024 at BEC Arena https://www.skiddle.com/festivals/damnation/
Hard Rock Hell at Vauxhall Holiday Park https://www.skiddle.com/festivals/hard-rock-hell/
...
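Note that some of the hrefs above are site-relative paths (`/whats-on/...`) while others are already absolute URLs. If you need full URLs for every festival page, the standard-library `urljoin` handles both cases. A minimal sketch, using two hypothetical sample hrefs copied from the output shapes above:

```python
from urllib.parse import urljoin

BASE = "https://www.skiddle.com/festivals/dates.html"

# Two hypothetical hrefs, one in each shape seen in the output:
# a site-relative path and an already-absolute URL.
hrefs = [
    "/whats-on/Bedford/Bedford-Esquires/Levitation-24/37157298/",
    "https://www.skiddle.com/festivals/when-we-were-young/",
]

# urljoin resolves relative paths against BASE and
# returns absolute URLs unchanged.
for href in hrefs:
    print(urljoin(BASE, href))
```

In the scraping loop above, you would apply the same call to each `a["href"]` before saving or requesting it.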