Append list elements "randomly without repetition" to multiple HTML files

Question  Votes: 2  Answers: 1

I am trying to replace the href URLs with my results using a regex. I also tried the BeautifulSoup module, but without success.

import glob
import random

from bs4 import BeautifulSoup
from googleapiclient.http import MediaFileUpload

class RandomChoiceNoImmediateRepeat(object):
    def __init__(self, lst):
        self.lst = lst
        self.last = None

    def choice(self):
        # With fewer than two distinct items a repeat cannot be avoided,
        # so pick directly instead of looping forever.
        if self.last is None or len(set(self.lst)) < 2:
            self.last = random.choice(self.lst)
            return self.last
        nxt = random.choice(self.lst)
        # Make a new choice as long as it equals the last one.
        while nxt == self.last:
            nxt = random.choice(self.lst)
        # Remember the choice and return it.
        self.last = nxt
        return nxt

# drive_service is assumed to be an authenticated Google Drive API client created elsewhere.
# Upload every text file and collect all download links in one list.
linkd = []
for filename in glob.glob('/docs/*.txt'):
    file_metadata = { 'name': 'file.txt', 'mimeType': '*/*' }
    media = MediaFileUpload(filename, mimetype='*/*', resumable=True)
    file = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()
    linkd.append('https://drive.google.com/uc?export=download&id=' + file.get('id'))

# One generator shared by all anchors, so consecutive links differ.
gen = RandomChoiceNoImmediateRepeat(linkd)
for html_path in glob.glob('/docs/htmlz/*.html'):
    with open(html_path, "r") as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')
    for anchor in soup.find_all("a", attrs={ "class" : "downloadme" }):
        anchor['href'] = str(gen.choice())
    # Write the modified document back once all anchors are updated.
    with open(html_path, "w") as html_file:
        html_file.write(str(soup))


python html random beautifulsoup href
1 answer

0 votes

First of all, the root cause is that re.sub expects a string or bytes-like object, but you passed it a different type.
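To make that concrete, here is a small sketch that uses only the standard library. The HTML fragment, the links list and the next_link helper are made up for illustration and are not taken from the question. It shows re.sub raising TypeError when the text argument is not a string, and working once a real str is passed.

```python
import re

# Hypothetical page fragment and replacement links, for illustration only.
html = '<a class="downloadme" href="old1">A</a> <a class="downloadme" href="old2">B</a>'
links = ["https://example.com/1", "https://example.com/2"]

# Passing a non-string (here a list) as the text to search raises TypeError.
try:
    re.sub(r'href="[^"]*"', 'href="x"', [html])
    error = None
except TypeError as exc:
    error = exc

# With a real str it works; a function as the replacement lets each match
# receive a different link.
remaining = iter(links)

def next_link(match):
    return 'href="{}"'.format(next(remaining))

result = re.sub(r'href="[^"]*"', next_link, str(html))
print(type(error).__name__)
print(result)
```

The same applies to objects returned by BeautifulSoup: convert them with str() (or read the attribute you need) before handing them to re.sub.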

Edit:

I created an example that shows how to access the elements of a bs4.element.ResultSet.

Code:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<tr class="hello">first_elem</tr><tr>second_elem</tr>', "html.parser")
trs = soup.find_all("tr")
print("Content: {}  -->  Type: {}".format(trs, type(trs)))
print("Content: {}  -->  Type: {}".format(trs[0], type(trs[0])))
print("Content: {}  -->  Type: {}".format(trs[0]["class"], type(trs[0]["class"])))
print("Content: {}  -->  Type: {}".format(trs[0]["class"][0], type(trs[0]["class"][0])))

Output:

>>> python3 ci/common/python_utils/test_file.py 
Content: [<tr class="hello">first_elem</tr>, <tr>second_elem</tr>]  -->  Type: <class 'bs4.element.ResultSet'>
Content: <tr class="hello">first_elem</tr>  -->  Type: <class 'bs4.element.Tag'>
Content: ['hello']  -->  Type: <class 'list'>
Content: hello  -->  Type: <class 'str'>

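As you can see above, .findAll returns a bs4.element.ResultSet that contains bs4.element.Tag elements. If you select the class attribute of a tag, you get a list, e.g. ['hello'], and you have to use the proper index, e.g. [0], to get a value of type str (as shown in the last line of the output).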
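Applied to the question, this means assigning a plain str to anchor['href'] for each Tag in the ResultSet. The following is only a sketch of one possible way to do it, assuming all links have already been collected into a single list. The sample HTML, the links values and the pick_without_immediate_repeat helper are hypothetical and not part of the original post.

```python
import random
from bs4 import BeautifulSoup

# Hypothetical input: in the question these would be the Drive download URLs.
links = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
html = ('<a class="downloadme" href="#">1</a>'
        '<a class="downloadme" href="#">2</a>'
        '<a class="downloadme" href="#">3</a>'
        '<a class="other" href="#">4</a>')

def pick_without_immediate_repeat(items, last):
    """Return a random item that differs from `last` whenever that is possible."""
    candidates = [item for item in items if item != last] or list(items)
    return random.choice(candidates)

soup = BeautifulSoup(html, "html.parser")
last = None
for anchor in soup.find_all("a", attrs={"class": "downloadme"}):
    last = pick_without_immediate_repeat(links, last)
    anchor["href"] = last  # a plain str, not a Tag or a list

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
```

Filtering out the previous pick before calling random.choice avoids a retry loop and still works when the list holds only one link.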
