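I'm new to coding in Python. For a personal project, I'm looking for different ways to retrieve birth and death dates from a list of Wikipedia pages. I'm using the wikipedia package.
One way I've tried to do this is by iterating over the Wikipedia summary and returning the index once I have counted four digits in a row.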
    import wikipedia as wp
    names = ('Zaha Hadid', 'Rem Koolhaas')
    wiki_summary = wp.summary(names)
    b_counter = 0
    i_b_year = []
    d_counter = 0
    i_d_year = []
    for i,x in enumerate(wiki_summary):
        if x.isdigit() == True:
            b_counter += 1
            if b_counter == 4:
                i_b_year = i
                break
            else:
                continue
        else:
            b_counter = 0
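So far this works for the first person in my list, but I'd like to go through all the names in my names list. Is there a way to find the index with a for loop and then use another for loop to iterate over names?
I know there are other approaches, such as parsing the page to find the bday tag, but I'd like to try a few different solutions.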
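The problem with what you are trying is that a person's summary may not contain the birth year and death year as its first two 4-digit numbers. For example, in Rem Koolhaas's Wikipedia summary his birth year is the first 4-digit number, but the second 4-digit number comes from this line: In 2005, he co-founded Volume Magazine together with Mark Wigley and Ole Bouman.
So the birth_year and death_year lists (i_b_year and i_d_year in the code) may not contain accurate information.
Here is code that does what you are trying to achieve: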
    import wikipedia as wp
    names = ('Zaha Hadid', 'Rem Koolhaas')
    i_b_year = []
    i_d_year = []
    for person_name in names:
        wiki_summary = wp.summary(person_name)
        birth_year_found = False
        death_year_found = False
        digits = ""
        for c in wiki_summary:
            if c.isdigit():
                if not birth_year_found:
                    digits += c
                    if len(digits) == 4:
                        # First run of four digits: treat it as the birth year
                        birth_year_found = True
                        i_b_year.append(int(digits))
                        digits = ""
                elif not death_year_found:
                    digits += c
                    if len(digits) == 4:
                        # Second run of four digits: treat it as the death year
                        death_year_found = True
                        i_d_year.append(int(digits))
                        break
            else:
                # A non-digit character interrupts the current run
                digits = ""
        # Append 0 as a placeholder when a year was not found
        if not birth_year_found:
            i_b_year.append(0)
        if not death_year_found:
            i_d_year.append(0)
    for i in range(len(names)):
        print(names[i], i_b_year[i], i_d_year[i])
Output:
Zaha Hadid 1950 2016
Rem Koolhaas 1944 2005
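Disclaimer: in the code above I append 0 whenever two 4-digit numbers are not found in a person's summary. As I already mentioned, nothing guarantees that a Wikipedia summary lists a person's birth year and death year as its first two 4-digit numbers, so the lists may contain wrong information.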
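For comparison, here is a minimal sketch of the same "first two 4-digit numbers" idea written with a regular expression instead of a character counter. The sample text and the first_two_years helper are made up for illustration (nothing is fetched from Wikipedia), and it uses None rather than 0 as the placeholder. Note that it only accepts standalone 4-digit numbers, so it would skip the digits inside something like 12345, which the counter above would not.

```python
import re

# Hypothetical sample text standing in for wp.summary(...); not real Wikipedia output.
sample_summary = (
    "Jane Example (3 May 1901 - 7 June 1980) was an architect. "
    "In 1955 she founded a studio."
)

def first_two_years(text):
    """Return the first two standalone 4-digit numbers in text, padded with None."""
    years = [int(y) for y in re.findall(r"\b\d{4}\b", text)]
    years = years[:2]
    # Pad so the caller always gets exactly two entries
    return years + [None] * (2 - len(years))

print(first_two_years(sample_summary))
```

The same caveat applies: the second number is only a death year if the summary happens to mention it before any other year.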
I'm not familiar with the wikipedia package, but it seems you could simply iterate over the names tuple:
    import wikipedia as wp
    names = ('Zaha Hadid', 'Rem Koolhaas')
    i_b_year = []
    for name in names: #This line is new
        wiki_summary = wp.summary(name) #Just changed names for name
        b_counter = 0
        d_counter = 0
        i_d_year = []
        for i,x in enumerate(wiki_summary):
            if x.isdigit() == True:
                b_counter += 1
                if b_counter == 4:
                    i_b_year.append(i) #I am guessing you want this list to increase with each name in names. Thus, 'append'.
                    break
                else:
                    continue
            else:
                b_counter = 0
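First of all, your code won't work, for a couple of reasons: the module has to be imported as import wikipedia (all lowercase), and the summary method accepts a single string (in your case a name), so you have to call it once for each name in your collection. That aside, let's try to achieve what you want to do.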
    import wikipedia as wp
    import re
    # First thing we see (at least for pages provided) is that dates all share the same format:
    # For those who are no longer with us 31 October 1950 – 31 March 2016
    # For those who are still alive 17 November 1944
    # So we have to build regex patterns to find those
    # First is the months pattern, since it's quite a big one
    MONTHS_PATTERN = r"January|February|March|April|May|June|July|August|September|October|November|December"
    # Next we build our date pattern; doubled curly braces give literal braces inside an f-string
    # (day of 1 or 2 digits, month name, year of exactly 4 digits)
    DATE_PATTERN = re.compile(fr"\d{{1,2}}\s({MONTHS_PATTERN})\s\d{{4}}")
    # Declare our set of names, great choice of architects BTW :)
    names = ('Zaha Hadid', 'Rem Koolhaas')
    # Since we're trying to get birthdays and dates of death, we will create a dictionary for storing values
    lifespans = {}
    # Iterate over them in a loop
    for name in names:
        lifespan = {'birthday': None, 'deathday': None}
        try:
            summary = wp.summary(name)
            # First we find the first date in summary, since it's most likely to be the birthday
            first_date = DATE_PATTERN.search(summary)
            if first_date:
                # If we've found a date, suppose it's the birthday
                bday = first_date.group()
                lifespan['birthday'] = bday
                # Let's check whether the person is no longer with us
                LIFESPAN_PATTERN = re.compile(fr"{bday}\s–\s{DATE_PATTERN.pattern}")
                lifespan_found = LIFESPAN_PATTERN.search(summary)
                if lifespan_found:
                    lifespan['deathday'] = lifespan_found.group().replace(f"{bday} – ", '')
                lifespans[name] = lifespan
            else:
                print(f'No dates were found for {name}')
        except wp.exceptions.PageError:
            # Handle not found page, so that code won't break
            print(f'{name} was not found on Wikipedia')
    # Print result
    print(lifespans)
Output for the names provided:
{'Zaha Hadid': {'birthday': '31 October 1950', 'deathday': '31 March 2016'}, 'Rem Koolhaas': {'birthday': '17 November 1944', 'deathday': None}}
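This approach is inefficient and has plenty of flaws. For example, a page may contain dates that match our regex but are not the birth and death dates. It's rather ugly (even though I did my best :)), and you would be better off parsing the tags.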
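To give an idea of what parsing the tags could look like, here is a rough sketch using only the standard library. It assumes the page HTML marks the birth date with an element whose class is bday, as the question suggests. The sample_html fragment and the BdayParser class are made up for illustration; for a real page you would feed it the page's HTML instead (I believe the wikipedia package exposes that as wp.page(name).html(), but treat that as an assumption and check the package docs).

```python
from html.parser import HTMLParser

class BdayParser(HTMLParser):
    """Collect the text of the first element whose class list includes 'bday'."""

    def __init__(self):
        super().__init__()
        self._in_bday = False
        self.bday = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.bday is None and "bday" in classes:
            self._in_bday = True

    def handle_data(self, data):
        # The first text node after the opening bday tag is taken as the date
        if self._in_bday:
            self.bday = data.strip()
            self._in_bday = False

# Hypothetical fragment imitating an infobox cell; not copied from a real page.
sample_html = (
    '<td>31 October 1950 <span style="display:none">'
    '(<span class="bday">1950-10-31</span>)</span></td>'
)

parser = BdayParser()
parser.feed(sample_html)
print(parser.bday)
```

If no element with that class is present, parser.bday simply stays None, which plays the same role as the None values in the lifespans dictionary.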
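If you are not happy with Wikipedia's date format, I suggest you look into datetime. Also, bear in mind that those regexes fit those two pages; I did not do any research into how dates are represented across Wikipedia, so if there are any inconsistencies, I suggest you stick with parsing the tags.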
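As a small sketch of the datetime suggestion: strings in the "31 October 1950" format can be turned into date objects with datetime.strptime. The helper names parse_wiki_date and full_years_between are my own, and the %B directive matches month names in the current locale, so this assumes an English (or default C) locale.

```python
from datetime import datetime

def parse_wiki_date(text):
    """Parse a 'day month-name year' string such as '31 October 1950' into a date."""
    return datetime.strptime(text, "%d %B %Y").date()

def full_years_between(start, end):
    """Number of completed years from start to end."""
    # Subtract one year if the anniversary has not been reached yet in the end year
    not_reached = (end.month, end.day) < (start.month, start.day)
    return end.year - start.year - int(not_reached)

birthday = parse_wiki_date("31 October 1950")
deathday = parse_wiki_date("31 March 2016")
print(birthday.isoformat())
print(full_years_between(birthday, deathday))
```

Once the values are date objects you can reformat them with strftime or compare them directly, instead of working with the raw strings.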