Scraping and sorting dates from a website with Python

Question (votes: 2, answers: 1)

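I am trying to sort the dates attached to some results from a website. I found that the dates sit inside <span class="f"> tags. Unfortunately, I cannot extract this information with the code below. I would like to know what is wrong with the code, and how to extract the dates and sort them in ascending/descending order.

What I have done so far is collect information from the website (the first 20 results) into a list. The list urls[] holds pieces of information (sentences) published at different times (months, days, minutes ago...). You can think of posts on Facebook or results on Google.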

urls=[]
for url in search(' " life " ', stop=20):
    urls.append(url) # this creates a list of results (sentences. For each of them I would like to report the date when it was published)

soup = BeautifulSoup(url)

for url in urls:
    url = soup.find_all('span', {'class':'f'})

            # <span class="f">2 days ago - </span>

    print(url)

I would expect results such as:

"Yesterday I went out with my friends"     2 days ago    the oldest result 
"I played basketball for several years"   20 hours ago  ....
.... 19 hours ago  ....
.... 5 hours ago   ....

... 

for each sentence. So I should end up with two arrays, one for the sentences and one for their dates, so that I can plot them.

The raw data is:

[image: screenshot of the raw data]

Could you tell me how to do this? Thanks.

python-3.x web-scraping
1 answer

Votes: 1

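This takes a few steps:

  • First, extract just the duration from each element by removing the span tags. You can do this with replace() and split(), or with a regular expression.
  • Split the durations into different categories (days, hours, etc.).
  • Within each category, sort the durations in reverse order (e.g. 2 hours ago should come before 1 hour ago).
  • Finally, join the categories (days, hours, etc.) into one string in the correct order (days should come before hours).

Here is a working implementation. Note that you could extend it to support minutes, months, etc. as well.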

elements = [
'<span class="f">21 hours ago - </span>',
'<span class="f">20 hours ago - </span>',
'<span class="f">2 days ago - </span>',
'<span class="f">1 day ago - </span>']

# extract the durations (eg. 21 hours ago) and store them in times list
times = [elem.replace('<span class="f">','').replace(' - </span>','') for elem in elements]

# categorize the times into days and hours
days = [time for time in times if "day" in time]
hours = [time for time in times if "hour" in time]

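# sort each category by its leading number, largest first
# (a plain string sort would place "3 hours ago" before "21 hours ago")
days.sort(key=lambda t: int(t.split()[0]), reverse=True)
hours.sort(key=lambda t: int(t.split()[0]), reverse=True)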

# join categories into a string, such that each time is on a new line
output = '\n'.join(days) + '\n' + '\n'.join(hours)
print(output)

Output:

2 days ago
1 day ago
21 hours ago
20 hours ago
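
The first step above mentions a regular expression as an alternative to replace(). A minimal sketch of that variant might look like the following; the pattern and the names AGE_PATTERN and parsed are illustrative assumptions, and the pattern only covers minutes, hours and days.

```python
import re

elements = [
    '<span class="f">21 hours ago - </span>',
    '<span class="f">3 hours ago - </span>',
    '<span class="f">2 days ago - </span>',
]

# Hypothetical pattern: capture the count and the unit (singular or plural)
# that come right before the word "ago".
AGE_PATTERN = re.compile(r'(\d+)\s+(minute|hour|day)s?\s+ago')

parsed = []
for elem in elements:
    match = AGE_PATTERN.search(elem)
    if match:  # skip elements that hold no relative date
        parsed.append((int(match.group(1)), match.group(2)))

print(parsed)
```

Because the count comes out as an int and the unit as a normalized word, the later categorizing and sorting steps no longer depend on the exact markup around the text.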


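Another, more scalable approach is to use a dictionary to convert each duration into a number of minutes, store these numeric durations in a separate list, and sort the original list of strings by that numeric list (also available at https://repl.it/@glhr/55552138):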

elements = [
'<span class="f">21 hours ago - </span>',
'<span class="f">20 hours ago - </span>',
'<span class="f">2 days ago - </span>',
'<span class="f">1 day ago - </span>']

# extract the durations (eg. 21 hours ago) and store them in times list
times = [elem.replace('<span class="f">','').replace(' - </span>','') for elem in elements]

minutes_per_duration = {"hours": 60, "hour": 60, "minute": 1, "minutes": 1, "day": 1440, "days": 1440}

duration_values = []

for time in times:
    duration = time.split(" ")[1] # eg. hours
    number = int(time.split(" ")[0]) # eg. 21
    minutes = minutes_per_duration[duration] # eg. 60 (for hours)
    total = minutes * number # 21 * 60 = 1260
    duration_values.append(total)

# sort times based on calculated duration values
output = '\n'.join(t for _, t in sorted(zip(duration_values, times), reverse=True))

print(output)

Output:

2 days ago
1 day ago
21 hours ago
20 hours ago
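
To get the two parallel lists the question asks for (sentences and their dates), the same minutes-based idea could be applied to pairs. The sketch below is only one possible way to do it: the names MINUTES_PER_UNIT, age_in_minutes and sort_oldest_first are hypothetical, the "week" and "month" entries are additions for illustration (a month is approximated as 30 days), and it assumes the date strings have already been pulled out of the spans, for example with get_text() on each span returned by soup.find_all('span', {'class': 'f'}).

```python
# Minutes per unit; "week" and "month" are illustrative additions,
# and a month is approximated as 30 days.
MINUTES_PER_UNIT = {"minute": 1, "hour": 60, "day": 1440, "week": 10080, "month": 43200}

def age_in_minutes(text):
    """Convert a string such as '21 hours ago' into a number of minutes."""
    number, unit = text.split()[:2]
    # rstrip("s") maps plural units ("hours") onto the singular dictionary keys
    return int(number) * MINUTES_PER_UNIT[unit.rstrip("s")]

def sort_oldest_first(sentences, ages):
    """Sort sentences and their ages together, oldest first."""
    pairs = sorted(zip(sentences, ages),
                   key=lambda pair: age_in_minutes(pair[1]),
                   reverse=True)
    sorted_sentences = [sentence for sentence, _ in pairs]
    sorted_ages = [age for _, age in pairs]
    return sorted_sentences, sorted_ages

sentences = ["I played basketball for several years",
             "Yesterday I went out with my friends"]
ages = ["20 hours ago", "2 days ago"]

sorted_sentences, sorted_ages = sort_oldest_first(sentences, ages)
for sentence, age in zip(sorted_sentences, sorted_ages):
    print(f"{sentence!r}  {age}")
```

Sorting the pairs rather than the two lists separately keeps each sentence attached to its own date; dropping reverse=True would give newest-first order instead.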