BeautifulSoup find_all('href') returns only part of the value

Problem description

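I'm trying to scrape actor IDs from an IMDb movie page. I only want the actors (not any of the crew), and this question is specifically about getting each person's internal ID. I already have the people's names, so I don't need help with those. I'm starting with this page (https://www.imdb.com/title/tt0084726/fullcredits?ref_=tt_cl_sm#cast) as a hard-coded URL so I can get the code right.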

Inspecting the links, I found that the links for actors look like this:

<a href="/name/nm0000638/?ref_=ttfc_fc_cl_t1"> William Shatner</a>
<a href="/name/nm0000559/?ref_=ttfc_fc_cl_t2"> Leonard Nimoy</a>
<a href="/name/nm0346415/?ref_=ttfc_fc_cl_t17"> Nicholas Guest</a> 

while the links for other contributors look like this:

<a href="/name/nm0583292/?ref_=ttfc_fc_dr1"> Nicholas Meyer </a>
<a href="/name/nm0734472/?ref_=ttfc_fc_wr1"> Gene Roddenberry</a> 

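This should let me tell cast apart from crew (such as directors or writers) by checking whether the href ends in "t[0-9]+$" rather than the same pattern with "dr" or "wr".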
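As a quick sanity check of that idea against the sample hrefs above, here is a minimal sketch. The names `cast_id`, `CAST_SUFFIX` and `PERSON_ID` are made up for illustration, and the leading underscore in the suffix pattern is an assumption that makes the match slightly stricter than "t[0-9]+$".

```python
import re

# Assumption: cast links end in "_t<number>", as in the sample hrefs above.
CAST_SUFFIX = re.compile(r'_t[0-9]+$')
# The person ID is the "nm..." segment right after "/name/".
PERSON_ID = re.compile(r'^/name/(nm[0-9]+)/')


def cast_id(href):
    """Return the person ID if href looks like a cast link, else None."""
    match = PERSON_ID.match(href)
    if match and CAST_SUFFIX.search(href):
        return match.group(1)
    return None


samples = [
    '/name/nm0000638/?ref_=ttfc_fc_cl_t1',  # actor
    '/name/nm0583292/?ref_=ttfc_fc_dr1',    # director
    '/name/nm0734472/?ref_=ttfc_fc_wr1',    # writer
]
print([cast_id(href) for href in samples])
```

This only works if the hrefs actually carry the "?ref_=..." suffix when they reach the script.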

Here is the code I'm running.

import urllib.request
from bs4 import BeautifulSoup
import re

movieNumber = 'tt0084726'
url = 'https://www.imdb.com/title/' + movieNumber + '/fullcredits?ref_=tt_cl_sm#cast'

def clearLists(n):
    return [[] for _ in range(n)]

def getSoupObject(urlInput):
    page = urllib.request.urlopen(urlInput).read()
    soup = BeautifulSoup(page, features="html.parser")
    return(soup)

def getPeopleForMovie(soupObject):
    listOfPeopleNames, listOfPeopleIDs, listOfMovieIDs = clearLists(3)

    #get all the tags with links in them
    link_tags = soupObject.find_all('a')

    #get the ids of people
    for linkTag in link_tags:
        link = str(linkTag.get('href'))
        #print(link)
        p = re.compile('t[0-9]+$')
        q = p.search(link)
        if link.startswith('/name/') and q != None:
            id = link[6:15]
            #print(id)
            listOfPeopleIDs.append(id)

    #return the names and IDs
    return listOfPeopleNames, listOfPeopleIDs

newSoupObject = getSoupObject(url)
pNames, pIds = getPeopleForMovie(newSoupObject)

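The code above returns an empty list of IDs. If you uncomment the print statements, you can see why: the values that end up in the "link" variable look like the ones below (with variations for the particular people):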

/name/nm0583292/
/name/nm0000638/

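That won't do. I only want the IDs of the actors so that I can use those IDs later. I've looked for other answers on Stack Overflow, but I haven't been able to find this particular problem.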

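This question (Beautifulsoup: parsing html – get part of href) is close to what I want to do, but it takes the information from the text between the tags rather than from the href in the tag's attributes.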

How can I make sure I get only the name IDs I want (actors only) from the page? (Also, feel free to suggest ways to tighten up the code.)

python html web-scraping beautifulsoup href
1 Answer

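The links you're trying to match appear to have been modified by JavaScript after the page loads, or the page may be served differently depending on variables other than the URL (such as cookies or headers).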
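One way to probe the headers hypothesis is to send browser-like headers with the request and compare the hrefs that come back. This is only a sketch: the helper names `build_request` and `fetch_html` and the header values are illustrative assumptions, and there is no guarantee the server responds any differently.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a Request carrying browser-like headers (values are illustrative)."""
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": user_agent,
            "Accept-Language": "en-US,en;q=0.9",
        },
    )


def fetch_html(url):
    """Fetch the page using the headers above; requires network access."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

Passing the result of `fetch_html(url)` to `BeautifulSoup` and printing a few hrefs would show whether the "?ref_=..." suffix appears under these headers.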

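However, since you're only after the links to people in the cast, an easier approach is simply to match the IDs of the people in the cast section. That's actually quite easy, because they all sit inside a single element, <table class="cast_list">.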

So:

import urllib.request
from bs4 import BeautifulSoup
import re

# it's Python, so use Python conventions, no uppercase in function or variable names
movie_number = 'tt0084726'
# The f-string is often more readable than a + concatenation
url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'


# this is overly fancy for something as simple as initialising some variables
# how about:
# a, b, c = [], [], []
# def clearLists(n):
#     return [[] for _ in range(n)]


# in an object-oriented program, assuming something is an object is the norm
def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    soup = BeautifulSoup(page, features="html.parser")
    # removed needless parentheses - arguably, even `soup` is superfluous:
    # return BeautifulSoup(page, features="html.parser")
    return soup


# keep two empty lines between functions, it's standard and for good reason
# it's easier to spot where a function starts and stops
# try using an editor or IDE that highlights your PEP8 mistakes, like PyCharm
# (that's just my opinion there, other IDEs than PyCharm will do as well)
def get_people_for_movie(soup_object):
    # removed unused variables, also 'list_of_people_ids' is needlessly verbose
    # since they go together, why not return people as a list of tuples, or a dictionary?
    # I'd prefer a dictionary as it automatically gets rid of duplicates as well
    people = {}

    # (put a space at the start of your comment blocks!)
    # get all the anchor tags inside the `cast_list` table
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')

    # the whole point of compiling the regex is to only have to do it once, 
    # so outside the loop
    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of people
    for link_tag in link_tags:
        # when present, the href attribute is already a string, so str() is unnecessary;
        # defaulting to '' keeps the regex search safe for anchors without an href
        href = link_tag.get('href', '')
        # matching and extracting part of the match can all be done in one step:
        match = id_regex.search(href)
        if match:
            # don't shadow Python built-ins like `id` with variable names!
            identifier = match.group(1)
            name = link_tag.text.strip()
            # just ignore the ones with no text, they're the thumbs
            if name:
                people[identifier] = name

    # return the names and IDs
    return people


def main():
    # don't do stuff globally, it'll just cause problems when reusing names in functions
    soup = get_soup(url)
    people = get_people_for_movie(soup)
    print(people)


# not needed here, but a good habit, allows you to import stuff without running the main
if __name__ == '__main__':
    main()

Result:

{'0000638': 'William Shatner', '0000559': 'Leonard Nimoy', '0001420': 'DeForest Kelley', etc.

And here is the code with a few more tweaks and without the commentary:

import urllib.request
from bs4 import BeautifulSoup
import re


def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    return BeautifulSoup(page, features="html.parser")


def get_people_for_movie(soup_object):
    people = {}

    link_tags = soup_object.find('table', class_='cast_list').find_all('a')

    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of the cast
    for link_tag in link_tags:
        match = id_regex.search(link_tag.get('href', ''))
        if match:
            name = link_tag.text.strip()
            if name:
                people[match.group(1)] = name

    return people


def main():
    movie_number = 'tt0084726'
    url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'

    people = get_people_for_movie(get_soup(url))
    print(people)


if __name__ == '__main__':
    main()
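The same scoping to the cast table can also be written as a single CSS selector and tried offline. The sketch below is a variation on the code above, not a replacement for it. `SAMPLE_HTML` is hypothetical, trimmed-down markup modelled on the links quoted in the question (the real page is larger and may differ), and `cast_from_html` is an illustrative name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one crew link outside the table, cast links inside it.
SAMPLE_HTML = """
<a href="/name/nm0583292/">Nicholas Meyer</a>
<table class="cast_list">
  <tr><td><a href="/name/nm0000638/"><img alt="thumb"></a></td>
      <td><a href="/name/nm0000638/"> William Shatner</a></td></tr>
  <tr><td><a href="/name/nm0000559/"> Leonard Nimoy</a></td></tr>
  <tr><td><a href="/title/tt0084726/characters/nm0000559">Spock</a></td></tr>
</table>
"""


def cast_from_html(html):
    """Map person ID digits to names for links inside the cast table."""
    soup = BeautifulSoup(html, features="html.parser")
    id_regex = re.compile(r'^/name/nm(\d+)/')
    people = {}
    # CSS selector: anchors inside table.cast_list whose href starts with /name/
    for link_tag in soup.select('table.cast_list a[href^="/name/"]'):
        name = link_tag.get_text(strip=True)
        match = id_regex.match(link_tag['href'])
        # anchors with no text are the thumbnails, so skip them
        if match and name:
            people[match.group(1)] = name
    return people


print(cast_from_html(SAMPLE_HTML))
```

With this sample, the crew link outside the table and the character link are both ignored, leaving only the two cast entries.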