How to convert ordinary quotes to guillemets (French quotes), except inside tags

Question

Suppose we have the following text:

<a href="link">some link</a> How to transform "ordinary quotes" to «Guillemets»

We need to transform it into

<a href="link">some link</a> How to transform «ordinary quotes» to «Guillemets»

using regular expressions and Python.

I tried:

import re

content = '<a href="link">some link</a> How to transform "ordinary quotes" to «Guillemets»'

res = re.sub(r'(?:"([^>]*)")(?!>)', r'«\g<1>»', content)

print(res)

However, as @WiktorStribiżew noticed, this does not work if one or more tags have several attributes, so

<a href="link" target="_blank">some link</a> How to transform "ordinary quotes" to «Guillemets»

is transformed into

<a href=«link" target=»_blank">some link</a> How to transform «ordinary quotes» to «Guillemets»

Update

Note that the text

can be HTML, i.e.:

<div><a href="link" target="_blank">some link</a> How to transform "ordinary quotes" to «Guillemets»</div>

can be plain text with no HTML, i.e.:

How to transform "ordinary quotes" to «Guillemets»

can be plain text that contains some HTML tags, i.e.:

<a href="link" target="_blank">some link</a> How to transform "ordinary quotes" to «Guillemets»

Answer 1

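When you are holding a hammer, everything looks like a nail. You do not have to use a regular expression here. A simple state machine will do (assuming that anything inside <> is an HTML tag).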

# pos    - current position in the string
# q1, q2 - positions of the opening and closing quotation marks
s = ' How to transform "ordinary quotes" to «Guillemets» and " more <div><a href="link" target="_blank">some "bad" link</a>'
sl = list(s)    # mutable copy that receives the replacements
pos = 0
while True:
    tag_open = s.find('<', pos)
    q1 = s.find('"', pos)
    if q1 < 0:
        break   # no more quotation marks
    if 0 <= tag_open < q1:
        # a tag starts before the next quote: skip to the end of that tag
        pos = s.find('>', tag_open)
        if pos < 0:
            break   # unterminated tag: stop scanning
    else:
        q2 = s.find('"', q1 + 1)
        if q2 > 0 and (tag_open < 0 or q2 < tag_open):
            # both quotes lie before the next tag: replace the pair
            sl[q1] = '«'
            sl[q2] = '»'
            pos = q2 + 1
        else:
            pos = q1 + 1    # unpaired quote before the next tag: leave it
print(''.join(sl))

Explanation:

 Scan your string, 
   If not inside tag, 
       find first and second quotation marks,
       replace accordingly, 
       continue scanning from the second quotation marks 
   Else
       continue to end of tag
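
The scan described above can also be packaged as a function that builds a new string instead of editing a list in place. This is only a sketch under the same assumption (anything inside <> is a tag); the name to_guillemets is hypothetical and not part of the original answer.

```python
def to_guillemets(text: str) -> str:
    """Replace paired straight quotes outside <...> with « and »."""
    out = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == '<':
            # Inside a tag: copy everything up to the closing '>' untouched.
            end = text.find('>', pos)
            if end < 0:
                out.append(text[pos:])   # unterminated tag: copy the rest
                break
            out.append(text[pos:end + 1])
            pos = end + 1
        elif ch == '"':
            close = text.find('"', pos + 1)
            tag = text.find('<', pos + 1)
            if close >= 0 and (tag < 0 or close < tag):
                # Both quotes sit before the next tag: convert the pair.
                out.append('«' + text[pos + 1:close] + '»')
                pos = close + 1
            else:
                out.append(ch)           # unpaired quote: leave it alone
                pos += 1
        else:
            out.append(ch)
            pos += 1
    return ''.join(out)


sample = '<a href="link" target="_blank">some link</a> How to transform "ordinary quotes" to «Guillemets»'
print(to_guillemets(sample))
```

A quote whose partner lies beyond the next tag is left unchanged, which matches the behaviour of the loop above.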

Answer 2

This works for me:

res = re.sub(r'(?:"([^>]*)")(?!>)', r'«\g<1>»', content)

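From the documentation:

In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn't ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.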
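
To make the ambiguity described in the documentation concrete, here is a small sketch. The pattern and input are made up for illustration; only re.sub and re.error from the standard library are used.

```python
import re

# \g<1>0 means "group 1, followed by a literal 0".
result = re.sub(r'(a)', r'\g<1>0', 'a')
print(result)

# \10 is read as a reference to group 10, which this pattern does not define,
# so the replacement template is rejected.
error_raised = False
try:
    re.sub(r'(a)', r'\10', 'a')
except re.error as exc:
    error_raised = True
    print('error:', exc)
```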

Answer 3

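Would you be willing to do this in three passes: [a] swap out the quotes inside the HTML tags; [b] swap the remaining, obvious quotes; [c] restore the quotes inside the HTML tags?

Before complaining about speed, remember that lookaheads are expensive.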

[a] first = re.sub(r'<.*?>', lambda x: re.sub(r'"', '😀', x.group(0)), content)
[b] second = re.sub(r'"(.*?)"', r'«\1»', first)
[c] third = re.sub(r'😀', '"', second)

Regarding Louis's comment:

first = re.sub(r'<.*?>', lambda x: re.sub(r'"', '😀WILLSWAPSOON', x.group(0)), content)

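There are scenarios in which the strategy above will work, and perhaps the OP is working in one of them. Otherwise, if all of this is too much bother, the OP can turn to BeautifulSoup and start playing with it...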
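
For reference, the three passes can be wrapped in one function. This is a sketch, not a definitive implementation: the function name and the private-use placeholder character are assumptions made for illustration, and the guard at the top is one possible way to handle the placeholder already occurring in the input.

```python
import re

# Hypothetical placeholder: a private-use code point unlikely to appear in real text.
PLACEHOLDER = '\ue000'


def swap_quotes_three_pass(content: str) -> str:
    if PLACEHOLDER in content:
        raise ValueError('placeholder already occurs in the input')
    # [a] hide the quotes that sit inside <...>
    hidden = re.sub(r'<.*?>',
                    lambda m: m.group(0).replace('"', PLACEHOLDER),
                    content)
    # [b] convert the remaining quote pairs
    swapped = re.sub(r'"(.*?)"', r'«\1»', hidden)
    # [c] restore the hidden quotes
    return swapped.replace(PLACEHOLDER, '"')


html = '<div><a href="link" target="_blank">some link</a> How to transform "ordinary quotes" to «Guillemets»</div>'
print(swap_quotes_three_pass(html))
```

Like the snippets above, this treats anything between < and > on one line as a tag, so it shares their limitations on malformed markup.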
