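I have a txt file that I open in Python. I'm trying to remove symbols and sort the remaining words alphabetically. Removing periods, commas, and so on is not a problem. However, when I add a dash surrounded by spaces to the list alongside the other symbols, it doesn't get removed.
Here is an example of what I open: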
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
This is what I want (the period removed, along with the dashes that are not attached to a word):
content = "The quick brown fox who was hungry jumps over the 7-year old lazy dog"
But I either get this (all dashes removed):
content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"
Or this (dashes not removed):
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog"
Here is my full code. Adding content.replace() works, but it's not what I want:
f = open("article.txt", "r")
# Create variable (Like this removing " - " works)
content = f.read()
content = content.replace(" - ", " ")
# Create list
wordlist = content.split()
# Which symbols (If I remove the line "content = content.replace(" - ", " ")", the " - " in this list doesn't get removed here)
chars = [",", ".", "'", "(", ")", "‘", "’", " - "]
# Remove symbols
words = []
for element in wordlist:
    temp = ""
    for ch in element:
        if ch not in chars:
            temp += ch
    words.append(temp)
# Print words, sort alphabetically and do not print duplicates
for word in sorted(set(words)):
    print(word)
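It works like this. But when I remove content = content.replace(" - ", " "), the "whitespace + dash + whitespace" entry in chars doesn't get removed.
And if I replace it with "-" (no whitespace), I get this, which I don't want: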
content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"
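Is it possible to do this with a list like chars, or is .replace() my only option?
Also, is there a particular reason why Python sorts the capitalized words alphabetically first, and then the lowercase words separately?
Like this (I added the letters A, B, C to emphasize what I mean):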
7-year
A
B
C
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
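After
wordlist = content.split()
your list no longer contains any element with leading or trailing whitespace. Called without arguments, str.split() treats every run of whitespace as a separator and discards it, so the split list contains no ' - ', only a bare '-'. In addition, your inner loop tests one character at a time, and a single character can never equal the three-character string ' - '.
Docs: https://docs.python.org/3/library/stdtypes.html#str.split
- str.split(sep=None, maxsplit=-1)
  If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.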
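A small check of the behaviour described above, using the sample sentence from the question. It is only an illustrative sketch; the variable names standalone, has_spaced_dash, and removed are made up for this example.

```python
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."

# split() with no arguments splits on runs of whitespace and discards them
wordlist = content.split()

# The free-standing dashes survive as bare "-" elements with no spaces around them
standalone = [w for w in wordlist if w == "-"]

# No element contains a space, so " - " is never an element of the list
has_spaced_dash = " - " in wordlist

# The question's inner loop looks at one character at a time, so the
# three-character entry " - " in chars can never match anything
chars = [",", ".", "'", "(", ")", "‘", "’", " - "]
removed = [ch for w in wordlist for ch in w if ch in chars]

print(standalone)       # ['-', '-']
print(has_spaced_dash)  # False
print(removed)          # ['.']
```

Only the final period is ever filtered out, which is why the dashes stay in the output unless " - " is replaced before splitting.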
Replacing ' - ' seems right. Another way that stays close to your code is to drop the standalone '-' entries from the split list entirely:
chars = [",", ".", "'", "(", ")"] # modified
# Remove symbols
words = []
for element in wordlist:
    temp = ""
    if element == '-': # skip pure -
        continue
    for ch in element: # handle characters to be removed
        if ch not in chars:
            temp += ch
    words.append(temp)
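For a version that can be run without article.txt, the same idea might be wrapped in a function and applied to the sample sentence. This is a sketch rather than the answerer's code: the name clean_words is hypothetical, and the default chars tuple is an assumption that keeps the curly quotes from the question's original list.

```python
def clean_words(content, chars=(",", ".", "'", "(", ")", "‘", "’")):
    """Return the words of content without the given symbols or free-standing dashes."""
    words = []
    for element in content.split():
        if element == "-":  # a dash that stood alone between spaces
            continue
        # keep every character that is not in the removal list
        words.append("".join(ch for ch in element if ch not in chars))
    return words


content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
result = " ".join(clean_words(content))
print(result)
# The quick brown fox who was hungry jumps over the 7-year old lazy dog
```

The hyphen in 7-year is kept because "-" is not in chars; only elements that consist of a single dash are skipped.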
You can use re.sub like this:
>>> import re
>>> strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")  # raw string so \s reaches the regex engine unchanged
>>> content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
>>> strip_chars.sub("", content)
'The quick brown fox who was hungry jumps over the 7-year old lazy dog'
>>> strip_chars.sub("", content).split()
['The', 'quick', 'brown', 'fox', 'who', 'was', 'hungry', 'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog']
>>>
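As for the sorting part of the question: Python compares strings by Unicode code point, and in that ordering digits come before uppercase letters, which come before lowercase letters. That is why 7-year, A, B, C, and The appear ahead of brown. A minimal sketch of a case-insensitive alternative, assuming that is the ordering you want (the sample list and the names default_order and folded_order are made up for illustration):

```python
words = ["The", "brown", "A", "dog", "the", "B", "7-year"]

# Default sort compares Unicode code points: digits < uppercase < lowercase
default_order = sorted(set(words))

# Case-insensitive sort: compare casefolded copies; the original word breaks ties
folded_order = sorted(set(words), key=lambda w: (w.casefold(), w))

print(default_order)  # ['7-year', 'A', 'B', 'The', 'brown', 'dog', 'the']
print(folded_order)   # ['7-year', 'A', 'B', 'brown', 'dog', 'The', 'the']
```

Note that set() still treats "The" and "the" as different words. If they should count as duplicates, lowercase the words before building the set.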