Following links in a Python assignment using BeautifulSoup

Question · Votes: -2 · Answers: 9

I have an assignment for a Python class where I have to start from the link at a given position on a page, then follow that link a given number of times. The first link on the page is said to be at position 1. Here is the link: http://python-data.dr-chuck.net/known_by_Fikret.html

[screenshot of the traceback error] I'm having trouble locating the link: an "index out of range" error comes up. Can anyone help me figure out how to find the link at the right position? Here is my code:

import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
count = int(raw_input('Enter count: '))+1
position = int(raw_input('Enter position: '))


tags = soup('a')
tags_lst = list()
for tag in tags:
    needed_tag = tag.get('href', None)
    tags_lst.append(needed_tag)
    for i in range(0,count):
        print 'retrieving: ',tags_lst[position]

OK, I wrote this code and it sort of works:

import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
count = int(raw_input('Enter count: '))+1
position = int(raw_input('Enter position: '))


tags = soup('a')
tags_lst = list()
for tag in tags:
    needed_tag = tag.get('href', None)
    tags_lst.append(needed_tag)
for i in range(0,count):    
    print 'retrieving: ',tags_lst[position]
    position = position + 1

I still get links other than the ones in the example, but when I print the whole list of links the positions match, so I don't know what is going on. It's weird.
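For reference, what the assignment actually wants is to re-fetch the page at every step and follow the link at the same position each time, rather than reading every URL off the first page. A minimal sketch of that loop, assuming the same Python 2 / BeautifulSoup 3 setup as the question (the prompt strings and the leading print are my own choices):

import urllib
from BeautifulSoup import BeautifulSoup

url = raw_input('Enter - ')
count = int(raw_input('Enter count: '))
position = int(raw_input('Enter position: '))

print 'retrieving: ', url
for i in range(count):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    tags = soup('a')
    # positions are 1-based in the assignment, list indexes are 0-based
    url = tags[position - 1].get('href', None)
    print 'retrieving: ', url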

python python-2.7 beautifulsoup
9 Answers
1 vote

[Edit: cut-and-pasted this line from the comments] Hi! I was working on a similar assignment and found your question because I had some doubts of my own. Here is my code; I think it works. I hope it will be helpful to you.

import urllib
from bs4 import BeautifulSoup

url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
count = 8
position = 18
tags_lst = []

for x in xrange(count-1):
    tags = soup('a')
    my_tags = tags[position-1]             # link at the 1-based position
    needed_tag = my_tags.get('href', None)
    tags_lst.append(needed_tag)
    url = str(needed_tag)                  # follow the link and load the next page
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
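The code above collects the followed URLs in tags_lst but never displays them; to see the whole chain at the end, one could add (my addition, not part of the original answer):

print tags_lst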

0 votes

Your BeautifulSoup import is wrong. I don't think it works with the code you've shown. Also, your nested loop is muddled. You can get the list of URLs you want by taking a slice of the fully retrieved list.

I hard-coded your URL in my code, since that's easier than typing it in on every run.

Try this:

import urllib
from bs4 import BeautifulSoup

#url = raw_input('Enter - ')
url = 'http://python-data.dr-chuck.net/known_by_Fikret.html'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# print soup
count = int(raw_input('Enter count: '))+1
position = int(raw_input('Enter position: '))


tags = soup('a')
# next line gets count tags starting from position
my_tags = tags[position: position+count]
tags_lst = []
for tag in my_tags:
    needed_tag = tag.get('href', None)
    tags_lst.append(needed_tag)
print tags_lst
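One caveat (my note, not the answerer's): the assignment counts link positions starting at 1, while Python list indexes start at 0, so a slice that honors the 1-based position would be:

my_tags = tags[position - 1 : position - 1 + count]

Note also that this answer only slices links off the first page; it does not follow each link to the next page, which is what the assignment text describes.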

0 votes

Almost all solutions to this assignment load the URL in two separate places. Instead, I defined a function that prints the relevant links for any given URL.

Initially, the function takes the Fikret.html URL as input. Each subsequent input relies on the refreshed URL found at the required position. The important line of the code is this one: url = allerretour(url)[position-1]. This gets the new URL that feeds the next round of the loop.

import urllib
from bs4 import BeautifulSoup
url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html' # raw_input('Enter URL : ')

position = 3 # int(raw_input('Enter position : '))
count = 4 #int(raw_input('Enter count : '))

def allerretour(url):
    print('Retrieving: ' + url)
    soup = BeautifulSoup(urllib.urlopen(url).read())
    link = list()
    for tag in soup('a'):
        link.append(tag.get('href', None))
    return(link)


for x in range(1, count + 2):
    url = allerretour(url)[position-1]
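A small follow-up (my addition): the function prints each URL as it fetches it, but the last link assigned to url is never itself fetched or printed, so one could show where the chain ends with a final line after the loop:

print('Last URL: ' + url)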

0 votes

Here is my solution:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter: ')
link_line = int(input("Enter position: ")) - 1  # relative to the first link
count = int(input("Enter count: "))

html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

while count >= 0:
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    print(url)
    url = tags[link_line].get("href", None)
    count = count - 1

0 votes

This is my answer, which worked for me on Python 2.7:

import urllib
from BeautifulSoup import *

URL = raw_input("Enter the URL:") #Enter main URL
link_line = int(raw_input("Enter position:")) - 1 #The position of link relative to first link
count = int(raw_input("Enter count:")) #The number of times to be repeated

while count >= 0:
    html = urllib.urlopen(URL).read()
    soup = BeautifulSoup(html)
    tags = soup('a')
    print URL
    URL = tags[link_line].get("href", None)
    count = count - 1

0 votes

Here is working code that gets the required output:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
n = 1
url = input('Enter - ')
count = int(input('Enter count: ')) + 1
pos = int(input('Enter position: '))
new = url
while n < count:
    if new == url:
        html = urllib.request.urlopen(url, context=ctx).read()
        print('Retrieving', url)
    html = urllib.request.urlopen(new, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    my_tags = tags[pos - 1]
    new = my_tags.get('href', None)
    print('Retrieving', new)
    n = n + 1
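One design note (my observation, not the answerer's): on the first pass new == url, so the starting page is fetched twice, once inside the if block and once immediately after it. The if block only needs the print call; dropping its urlopen line would save one request without changing the output.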

0 votes

I put together the solution below; I tested it today and it works well.

Importing the required modules:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import re  # imported in the original answer but never used below

Accessing the website:

url = "http://py4e-data.dr-chuck.net/known_by_Vairi.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
all_num_list = list()
link_position = 18
Process_repeat = 7

Retrieve all of the anchor tags

tags = soup('a')

while Process_repeat - 1  >= 0 :
    print("Process round", Process_repeat)
    target = tags[link_position - 1]
    print("target:", target)
    url = target.get('href', 2)
    print("Current url", url)
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    Process_repeat = Process_repeat - 1

0 votes
import urllib.error, urllib.request
from bs4 import BeautifulSoup

#url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = input('Enter link - ')
count = int(input('Enter count - '))
position = int(input('position - ') )
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

tags = soup('a')
my_tags = tags[position-1]
needed_tag = my_tags.get('href', None)
print("------ : ", tags[position-1].contents[0])

for x in range(count-1):

    url = str(needed_tag)
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    tags = soup('a')
    my_tags = tags[position-1]
    needed_tag = my_tags.get('href', None)
    print("------ : ", tags[position-1].contents[0])

0 votes

Try this one. You can leave the URL input as it is; a sample of the exercise link is also included in the code. Good luck!

import urllib.request
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter url ')
cn = input('Enter count: ')
cnint = int(cn)
pos = input('Enter position: ')
posint = int(pos)
html = urllib.request.urlopen(url, context=ctx).read()  # sample start page: http://py4e-data.dr-chuck.net/known_by_Fikret.html
soup = BeautifulSoup(html, 'html.parser')

tags_lst = list()
for x in range(0,cnint):
    tags = soup('a')
    my_tags = tags[posint-1]
    needed_tag = my_tags.get('href', None)
    url = str(needed_tag)
    html = urllib.request.urlopen(url,context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    print(my_tags.get('href', None))