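Web scraping with Python: unable to access td elements

Question    Votes: 1    Answers: 4

I am trying to scrape this page: https://www.pro-football-reference.com/boxscores/

It is a page of American football game scores. I want to get the date, winner, and loser of each game. I have no trouble accessing the date, but I can't figure out how to isolate and retrieve the names of the winning and losing teams. Here is what I have so far...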

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup


#assigning url
my_url = 'https://www.pro-football-reference.com/boxscores/'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html,"html.parser")

games = page_soup.findAll("div",{"class":"game_summary expanded nohover"})


for game in games:
    date_block = game.findAll("tr",{"class":"date"})
    date_val = date_block[0].text
    winner_block = game.findAll("tr",{"class":"winner"})
    #here I need a line that returns the game winner, e.g. "Philadelphia Eagles"
    loser = game.findAll("tr",{"class":"loser"})

Here is the relevant HTML...

<div class="game_summary expanded nohover">
<table class="teams">
    <tbody>
        <tr class="date">
            <td colspan="3">Sep 6, 2018</td>
        </tr>
        <tr class="loser">
            <td><a href="/teams/atl/2018.htm">Atlanta Falcons</a></td>
            <td class="right">12</td>
            <td class="right gamelink">
                <a href="/boxscores/201809060phi.htm">Final</a>
            </td>
        </tr>
        <tr class="winner">
            <td><a href="/teams/phi/2018.htm">Philadelphia Eagles</a></td>
            <td class="right">18</td>
            <td class="right">
            </td>
        </tr>
    </tbody>
</table>
<table class="stats">
    <tbody>
        <tr>
            <td><strong>PassYds</strong></td>
            <td><a href="/players/R/RyanMa00.htm" title="Matt Ryan">Ryan</a>-ATL</td>
            <td class="right">251</td>
        </tr>
        <tr>
            <td><strong>RushYds</strong></td>
            <td><a href="/players/A/AjayJa00.htm" title="Jay Ajayi">Ajayi</a>-PHI</td>
            <td class="right">62</td>
        </tr>
        <tr>
            <td><strong>RecYds</strong></td>
            <td><a href="/players/J/JoneJu02.htm" title="Julio Jones">Jones</a>-ATL</td>
            <td class="right">169</td>
        </tr>
    </tbody>
</table>

I get an error saying that the ResultSet object has no attribute "td". Any help would be greatly appreciated.

python html web-scraping
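4 Answers
1
vote

Watch out for tie games. I think that is what causes your error: in a tie there is no winner, so you won't find a row with the winner class. The following code outputs the date and the winner.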

for game in games:
    date_block = game.find('tr',{'class':'date'})
    date_val = date_block.text
    winner_block = game.find('tr',{'class':'winner'})
    if winner_block:
        winner = winner_block.find('a').text
        print(date_val)
        print(winner)
    loser = game.findAll('tr',{'class':'loser'})

Output:

Sep 6, 2018
Philadelphia Eagles
Sep 9, 2018
New England Patriots
Sep 9, 2018
Tampa Bay Buccaneers
Sep 9, 2018
Minnesota Vikings
Sep 9, 2018
Miami Dolphins
Sep 9, 2018
Cincinnati Bengals
Sep 9, 2018
Baltimore Ravens
Sep 9, 2018
Jacksonville Jaguars
Sep 9, 2018
Kansas City Chiefs
Sep 9, 2018
Denver Broncos
Sep 9, 2018
Washington Redskins
Sep 9, 2018
Carolina Panthers
Sep 9, 2018
Green Bay Packers
Sep 10, 2018
New York Jets
Sep 10, 2018
Los Angeles Rams
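
For reference, the error message quoted in the question can be reproduced without any tie game: `findAll` returns a ResultSet (a list of tags), and attribute access such as `.td` only works on a single Tag, which is what `find` returns. A minimal sketch, assuming a trimmed copy of the question's HTML (the stats table and the game link cell are left out, and the variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (one game only).
html = """
<div class="game_summary expanded nohover">
<table class="teams"><tbody>
<tr class="date"><td colspan="3">Sep 6, 2018</td></tr>
<tr class="loser"><td><a href="/teams/atl/2018.htm">Atlanta Falcons</a></td><td class="right">12</td></tr>
<tr class="winner"><td><a href="/teams/phi/2018.htm">Philadelphia Eagles</a></td><td class="right">18</td></tr>
</tbody></table>
</div>
"""

game = BeautifulSoup(html, "html.parser").find("div", {"class": "game_summary"})

# find_all (alias: findAll) returns a ResultSet, i.e. a list of Tag objects,
# so looking up .td on it raises AttributeError.
rows = game.find_all("tr", {"class": "winner"})
try:
    rows.td
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# find returns the first matching Tag (or None), which does support .td and .a.
winner_row = game.find("tr", {"class": "winner"})
winner_name = winner_row.td.a.text if winner_row is not None else None
print(winner_name)
```

Indexing the ResultSet (`rows[0].td`) would also work, but `find` plus a `None` check covers the tie case in one step.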

0
votes

Your code looks pretty much correct.

import bs4

html = ''' ... '''
soup = bs4.BeautifulSoup(html, 'lxml')  # 'html.parser' works either way
print([elem.text for elem in soup.find_all('tr', {'class': 'loser'})])
# prints: ['\nAtlanta Falcons\n12\n\nFinal\n\n']

What exactly is going wrong?


0
votes

You can anchor your search at the "game_summaries" div:

import requests, json
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.pro-football-reference.com/boxscores/').text, 'html.parser')
def get_data(_soup_obj, _headers):
  _d = [(lambda x:[c.text for c in x.find_all('td')] if x is not None else [])(_soup_obj.find(a, {'class':b})) for a, b in _headers]
  if all(_d):
    [date], [t1, val, _], [t2, val2, _] = _d
    return {'date':date, 'winner':{'team':t1, 'score':int(val)}, 'loser':{'team':t2, 'score':int(val2)}}
  return {}

headers = [['tr', 'date'], ['tr', 'winner'], ['tr', 'loser']]
games = [get_data(i, headers) for i in d.find('div', {'class':'game_summaries'}).find_all('div', {'class':'game_summary'})]
print(json.dumps(games, indent=4))

Output:

[
  {
    "date": "Sep 6, 2018",
    "winner": {
        "team": "Philadelphia Eagles",
        "score": 18
    },
    "loser": {
        "team": "Atlanta Falcons",
        "score": 12
    }
 },
  {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "New England Patriots",
        "score": 27
    },
    "loser": {
        "team": "Houston Texans",
        "score": 20
    }
 },
 {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Tampa Bay Buccaneers",
        "score": 48
    },
    "loser": {
        "team": "New Orleans Saints",
        "score": 40
    }
 },
 {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Minnesota Vikings",
        "score": 24
    },
    "loser": {
        "team": "San Francisco 49ers",
        "score": 16
    }
 },
 {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Miami Dolphins",
        "score": 27
    },
    "loser": {
        "team": "Tennessee Titans",
        "score": 20
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Cincinnati Bengals",
        "score": 34
    },
    "loser": {
        "team": "Indianapolis Colts",
        "score": 23
    }
},
{},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Baltimore Ravens",
        "score": 47
    },
    "loser": {
        "team": "Buffalo Bills",
        "score": 3
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Jacksonville Jaguars",
        "score": 20
    },
    "loser": {
        "team": "New York Giants",
        "score": 15
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Kansas City Chiefs",
        "score": 38
    },
    "loser": {
        "team": "Los Angeles Chargers",
        "score": 28
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Denver Broncos",
        "score": 27
    },
    "loser": {
        "team": "Seattle Seahawks",
        "score": 24
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Washington Redskins",
        "score": 24
    },
    "loser": {
        "team": "Arizona Cardinals",
        "score": 6
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Carolina Panthers",
        "score": 16
    },
    "loser": {
        "team": "Dallas Cowboys",
        "score": 8
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Green Bay Packers",
        "score": 24
    },
    "loser": {
        "team": "Chicago Bears",
        "score": 23
    }
},
{
    "date": "Sep 10, 2018",
    "winner": {
        "team": "New York Jets",
        "score": 48
    },
    "loser": {
        "team": "Detroit Lions",
        "score": 17
    }
},
{
    "date": "Sep 10, 2018",
    "winner": {
        "team": "Los Angeles Rams",
        "score": 33
    },
    "loser": {
        "team": "Oakland Raiders",
        "score": 13
     }
  }
]
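
The same extraction can also be written without the lambda. The following is a sketch rather than a drop-in replacement: `parse_game` and `team_and_score` are hypothetical names, and the sample markup is hand-written in the shape of the question's HTML. The second game uses placeholder teams and a `draw` row class, which is an assumption borrowed from another answer on this page and not verified against the live site.

```python
from bs4 import BeautifulSoup

# Hand-written sample: one decided game (from the question) and one
# placeholder tie game ("Team A" / "Team B" are made up for illustration).
html = """
<div class="game_summaries">
<div class="game_summary expanded nohover"><table class="teams"><tbody>
<tr class="date"><td colspan="3">Sep 6, 2018</td></tr>
<tr class="loser"><td><a href="/teams/atl/2018.htm">Atlanta Falcons</a></td><td class="right">12</td><td class="right gamelink"><a href="/boxscores/201809060phi.htm">Final</a></td></tr>
<tr class="winner"><td><a href="/teams/phi/2018.htm">Philadelphia Eagles</a></td><td class="right">18</td><td class="right"></td></tr>
</tbody></table></div>
<div class="game_summary expanded nohover"><table class="teams"><tbody>
<tr class="date"><td colspan="3">Sep 9, 2018</td></tr>
<tr class="draw"><td><a>Team A</a></td><td class="right">21</td><td class="right"></td></tr>
<tr class="draw"><td><a>Team B</a></td><td class="right">21</td><td class="right"></td></tr>
</tbody></table></div>
</div>
"""


def team_and_score(row):
    """Read the team name and score from the first two cells of a row."""
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    return {"team": cells[0], "score": int(cells[1])}


def parse_game(game):
    """Return date, winner and loser for one game div, or {} if a row is missing."""
    rows = {name: game.find("tr", {"class": name}) for name in ("date", "winner", "loser")}
    if any(row is None for row in rows.values()):
        return {}  # e.g. a tie game has no winner or loser row
    return {
        "date": rows["date"].get_text(strip=True),
        "winner": team_and_score(rows["winner"]),
        "loser": team_and_score(rows["loser"]),
    }


container = BeautifulSoup(html, "html.parser").find("div", {"class": "game_summaries"})
games = [parse_game(g) for g in container.find_all("div", {"class": "game_summary"})]
print(games)
```

As in the output above, a game without a winner row comes back as an empty dict, which is where the `{}` entry in the list comes from.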

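0
votes

You may be running into the fact that there was a tie this week: the Pittsburgh/Cleveland game has no winner TD. Running this should output all the games, including tie games: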

for game in games:
    date_block = game.findAll("tr", {"class": "date"})
    date_val = date_block[0].text
    print("Game Date: %s" % date_val)
    # Test if a winner is defined
    if game.find("tr", {"class": "winner"}) is not None:
        winner_block = game.findAll("tr", {"class": "winner"})
        # Get the winner from the first TD and print text only
        winner = winner_block[0].findAll("td")
        print("Winner: %s" % winner[0].get_text())

        loser_block = game.findAll("tr", {"class": "loser"})
        # Get the loser from the first TD and print text only
        loser = loser_block[0].findAll("td")
        print("Loser: %s" % loser[0].get_text())
    else:
        # If no winner is listed, it must be a tie. Get both teams and print them.
        print("It's a tie!")
        draw_block = game.findAll("tr", {"class": "draw"})
        for team in draw_block:
            print("Draw: %s" % team.findAll("td")[0].get_text())
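
If the results are needed as data rather than printed, the same branching can be wrapped in a function. This is only a sketch under stated assumptions: `describe_game` and `first_cell_text` are hypothetical helper names, the tie markup is hand-written with placeholder teams, and the `draw` class is taken from the loop above rather than checked against the live page.

```python
from bs4 import BeautifulSoup

# Placeholder tie game; "Team A" and "Team B" are made up for illustration.
tie_html = """
<div class="game_summary expanded nohover"><table class="teams"><tbody>
<tr class="date"><td colspan="3">Sep 9, 2018</td></tr>
<tr class="draw"><td><a>Team A</a></td><td class="right">21</td></tr>
<tr class="draw"><td><a>Team B</a></td><td class="right">21</td></tr>
</tbody></table></div>
"""


def first_cell_text(row):
    """Text of the first <td> in a table row, stripped of whitespace."""
    return row.find("td").get_text(strip=True)


def describe_game(game):
    """Return (date, outcome, teams) for one game_summary div."""
    date = game.find("tr", {"class": "date"}).get_text(strip=True)
    winner = game.find("tr", {"class": "winner"})
    loser = game.find("tr", {"class": "loser"})
    if winner is not None and loser is not None:
        # Decided game: winner first, then loser.
        return date, "decided", [first_cell_text(winner), first_cell_text(loser)]
    # No winner row: assume a tie and collect every row marked "draw".
    draws = game.find_all("tr", {"class": "draw"})
    return date, "tie", [first_cell_text(row) for row in draws]


tie_game = BeautifulSoup(tie_html, "html.parser").find("div", {"class": "game_summary"})
result = describe_game(tie_game)
print(result)
```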