Trying to scrape a dynamic website in Python with requests_html


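I'm running into a problem when I try to scrape this website, and I can't tell what is going wrong. I first tried HTMLSession, but Python told me to use AsyncHTMLSession instead because the former cannot run inside the existing event loop. With AsyncHTMLSession I keep hitting the error shown below.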

url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm"
session = AsyncHTMLSession()
response = session.get(url)
await response.html.arender()
await session.close() 

print(response.html)
print(response.html.html)

This is the error I get:

AttributeError                            Traceback (most recent call last)
Cell In [12], line 4
      2 session = AsyncHTMLSession()
      3 response = session.get(url)
----> 4 await response.html.arender()
      5 await session.close() 
      7 print(response.html)

AttributeError: '_asyncio.Future' object has no attribute 'html'

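Any help would be greatly appreciated.

I have already added await to the render call. Passing an integer sleep argument to the render call and adding await asession.close() both lead to the same error.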

python web-scraping python-requests python-requests-html htmlsession
1 Answer

Load the HTML from a different URL (the plain document URL, not the Ajax-driven viewer URL), for example:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup  # needed for BeautifulSoup(...) below

# original_url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm'
new_url = "https://www.sec.gov/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"
}
soup = BeautifulSoup(requests.get(new_url, headers=headers).content, "html.parser")

balance_sheets = soup.select_one("#balance_sheets ~ table")

# for example, load the table into dataframe:
df = pd.read_html(StringIO(str(balance_sheets)))[0].fillna("")
print(df)

Prints:

                                                                                           0 1     2       3  4 5     6       7  8 
0                                                                                                                                 
1                                                                              (In millions)                                      
2                                                                                                                                 
3                                                                                                                                  
4                                                                                   June 30,    2023    2023       2022    2022    
5                                                                                                                                 
6                                                                                     Assets                                       
7                                                                            Current assets:                                       
8                                                                  Cash and cash equivalents       $   34704          $   13931   
9                                                                     Short-term investments           76558              90826    
10                                                                                                                                 
11                                                                                                                                 
12                                  Total cash, cash equivalents, and short-term investments          111262             104757   
13              Accounts receivable, net of allowance for doubtful accounts of $650 and $633           48688              44261   
14                                                                               Inventories            2500               3742   
15                                                                      Other current assets           21807              16924   
16                                                                                                                                 
17                                                                                                                                 
18                                                                      Total current assets          184257             169684   
19            Property and equipment, net of accumulated depreciation of $68,251 and $59,660           95641              74398   
20                                                       Operating lease right-of-use assets           14346              13148   
21                                                                        Equity investments            9879               6891   
22                                                                                  Goodwill           67886              67524   
23                                                                    Intangible assets, net            9366              11298   
24                                                                    Other long-term assets           30601              21897   
25                                                                                                                                 
26                                                                                                                                 
27                                                                              Total assets       $  411976          $  364840   
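
The URL swap used in the code above (dropping the `ix?doc=` viewer prefix) can be sketched as a small helper. This is only an illustration: `viewer_to_document_url` is a hypothetical name, and it assumes the viewer URL always carries the document path in a `doc` query parameter, as the URL in the question does.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def viewer_to_document_url(viewer_url: str) -> str:
    """Return the plain document URL named by the viewer URL's `doc` parameter.

    Hypothetical helper: it assumes a viewer URL of the form
    https://<host>/ix?doc=/<path>, as in the question.
    """
    parts = urlsplit(viewer_url)
    doc_paths = parse_qs(parts.query).get("doc")
    if not doc_paths:
        raise ValueError(f"no 'doc' query parameter in {viewer_url!r}")
    # Reuse the scheme and host, swap in the document path,
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, doc_paths[0], "", ""))


original_url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/0000789019/000095017023035122/msft-20230630.htm"
new_url = viewer_to_document_url(original_url)
print(new_url)
```

The result is the same `new_url` that the answer hard-codes, so it can be passed straight to `requests.get` with the same headers.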

...
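
Separately, the AttributeError in the question has a direct cause: with AsyncHTMLSession, `session.get(url)` returns an awaitable (the traceback shows an `_asyncio.Future`), so it must itself be awaited before `.html` exists, i.e. `response = await session.get(url)`. The sketch below reproduces the same pattern using only the standard library. `fake_get` and the object it returns are hypothetical stand-ins, not requests_html code.

```python
import asyncio
from types import SimpleNamespace


def fake_get(url):
    """Hypothetical stand-in for an async session's get().

    The blocking work is handed to an executor and the caller
    receives a Future, not the finished response.
    """
    loop = asyncio.get_running_loop()
    return loop.run_in_executor(
        None, lambda: SimpleNamespace(url=url, html="<html>...</html>")
    )


async def main():
    pending = fake_get("https://example.com")  # a Future, not a response
    try:
        pending.html  # the mistake from the question: no await yet
    except AttributeError as exc:
        error = str(exc)
    response = await pending  # awaiting yields the actual response object
    return error, response.html


error, html = asyncio.run(main())
print(error)
print(html)
```

In a notebook cell, where an event loop is already running, `await main()` would take the place of `asyncio.run(main())`.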