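I am trying to replace the href of certain links with a generated URL. I tried doing this with regex, and I also tried the BeautifulSoup module, but without success.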
import random

class RandomChoiceNoImmediateRepeat(object):
    def __init__(self, lst):
        self.lst = lst
        self.last = None

    def choice(self):
        if self.last is None:
            self.last = random.choice(self.lst)
            return self.last
        else:
            nxt = random.choice(self.lst)
            # make a new choice as long as it's equal to the last.
            # (this never terminates if lst holds only one distinct value)
            while nxt == self.last:
                nxt = random.choice(self.lst)
            # Replace the last and return the choice
            self.last = nxt
            return nxt
import glob
from bs4 import BeautifulSoup
from googleapiclient.http import MediaFileUpload

# drive_service is assumed to be an authenticated Google Drive API client created elsewhere.
linkd = []  # created once, outside the loop, so earlier links are not discarded
for filename in glob.glob('/docs/*.txt'):
    file_metadata = { 'name': 'file.txt', 'mimeType': '*/*' }
    media = MediaFileUpload(filename, mimetype='*/*', resumable=True)
    file = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()
    link = 'https://drive.google.com/uc?export=download&id=' + file.get('id')
    linkd.append(link)

# One chooser shared by all anchors, so it can remember the previous pick.
gen = RandomChoiceNoImmediateRepeat(linkd)
for filename in glob.glob('/docs/htmlz/*.html'):
    with open(filename, "r") as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')
    for anchor in soup.findAll("a", attrs={ "class" : "downloadme" }):
        anchor['href'] = gen.choice()
    # the with block closes the file, so no explicit close() is needed
    with open(filename, "w") as html_file:
        html_file.write(str(soup))
First of all, the root cause is that re.sub expects a string or bytes-like object, but you passed it an object of a different type.
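To make the type mismatch concrete, here is a minimal sketch. The HTML snippet and the URL are made up for illustration and are not taken from the question. It shows that handing a bs4 Tag straight to re.sub raises a TypeError, while converting it with str() first works:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical one-link document, only for demonstration.
html = '<a class="downloadme" href="old.html">Download</a>'
soup = BeautifulSoup(html, "html.parser")
anchor = soup.find("a")  # a bs4.element.Tag, not a str

new_url = "https://example.com/file"  # placeholder URL
pattern = r'href="[^"]*"'
replacement = 'href="{}"'.format(new_url)

# Passing the Tag itself fails: re.sub only accepts str or bytes-like input.
try:
    re.sub(pattern, replacement, anchor)
    raised = False
except TypeError:
    raised = True

# Converting the Tag to its markup string first makes the call valid.
replaced = re.sub(pattern, replacement, str(anchor))
print(raised)
print(replaced)
```

Note that the regex route only yields a new string; it does not change the parsed soup object.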
Edit:
I created an example showing how you can access the elements of a bs4.element.ResultSet.
Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<tr class="hello">first_elem</tr><tr>second_elem</tr>', "html.parser")
trs = soup.find_all("tr")
print("Content: {} --> Type: {}".format(trs, type(trs)))
print("Content: {} --> Type: {}".format(trs[0], type(trs[0])))
print("Content: {} --> Type: {}".format(trs[0]["class"], type(trs[0]["class"])))
print("Content: {} --> Type: {}".format(trs[0]["class"][0], type(trs[0]["class"][0])))
Output:
>>> python3 ci/common/python_utils/test_file.py
Content: [<tr class="hello">first_elem</tr>, <tr>second_elem</tr>] --> Type: <class 'bs4.element.ResultSet'>
Content: <tr class="hello">first_elem</tr> --> Type: <class 'bs4.element.Tag'>
Content: ['hello'] --> Type: <class 'list'>
Content: hello --> Type: <class 'str'>
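As you can see above, .findAll returns a bs4.element.ResultSet that contains bs4.element.Tag elements. If you select the class attribute of a tag, you get a list, e.g. ['hello'], and you have to use the correct index, e.g. [0], to get a variable of type string (as shown in the last line of the output).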
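The same distinction applies to links. As a rough sketch, assuming the anchors to rewrite carry the class downloadme as in the question (the HTML and the URLs below are placeholders): class is a multi-valued attribute and comes back as a list, whereas href comes back as a plain string and can be assigned directly on the Tag, without any regex.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and links, only for demonstration.
html = (
    '<a class="downloadme" href="a.html">A</a>'
    '<a class="other" href="b.html">B</a>'
    '<a class="downloadme" href="c.html">C</a>'
)
links = ["https://example.com/1", "https://example.com/2"]

soup = BeautifulSoup(html, "html.parser")
anchors = soup.find_all("a", class_="downloadme")  # ResultSet of Tag objects

print(type(anchors[0]["class"]))  # multi-valued attribute: a list
print(type(anchors[0]["href"]))   # single-valued attribute: a str

# Assign a new href to each matching Tag, cycling through the links.
for index, anchor in enumerate(anchors):
    anchor["href"] = links[index % len(links)]

result = str(soup)  # serialized document with the rewritten hrefs
print(result)
```

Cycling through the list is used here only to keep the example deterministic; a random chooser such as the one in the question could be dropped in instead.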