I would like to clean a data set similar to this one.
I have a large dataset with more than 5 columns and 10,000 rows. Each column contains text information, and I want to encode the values in each column before sending them to a multi-class classifier.
There are small differences between field values that I would like to eliminate.
For example: if I have "Hello all this is sunday" and "This is sunday", I want to encode both of them as "This is sunday".
Is there any way to do this?
If your only requirement is that one string be a substring of the other, you can do the following:
a = 'Hello all this is sunday'
b = 'This is sunday'
def replace_str(a: str, b: str) -> str:
    # Return the shorter string if it occurs (case-insensitively)
    # inside the longer one, otherwise None.
    longest, shortest = (a, b) if len(a) > len(b) else (b, a)
    return shortest if shortest.lower() in longest.lower() else None

print(replace_str(a, b))
# This is sunday
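To apply the same substring idea across a whole column, a minimal sketch could map each cell to the shortest value that is contained in it (the `collapse_column` helper is my own illustration, not part of the answer above):

```python
def collapse_column(values):
    # Try the shortest distinct values first, so each cell collapses
    # to the most compact matching string.
    candidates = sorted(set(values), key=len)
    out = []
    for value in values:
        for candidate in candidates:
            if candidate.lower() in value.lower():
                out.append(candidate)
                break
        else:
            out.append(value)  # no substring match: keep the cell as-is
    return out

column = ['Hello all this is sunday', 'This is sunday']
print(collapse_column(column))  # ['This is sunday', 'This is sunday']
```

Note this is O(n²) in the number of distinct values, which is acceptable for a few thousand rows but worth revisiting for much larger columns.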
If you know the possible strings in advance, you can use the following code as an example:
a = "This is sunday"
b = "Hello all this is sunday"
if a.lower() in b.lower():
    b = a
print(b)
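The same check can be wrapped in a small helper that tries each known string in turn (the `normalise` function and the `canonical` list are illustrative assumptions, not from the answer):

```python
canonical = ['This is sunday', 'Send to India']  # assumed known phrases

def normalise(cell, canonical):
    # Return the first canonical phrase contained in the cell,
    # or the cell unchanged if none matches.
    for phrase in canonical:
        if phrase.lower() in cell.lower():
            return phrase
    return cell

print(normalise('Hello all this is sunday', canonical))  # This is sunday
```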
Well, this is tricky, and a bit of an art form, but read on for a fairly simple approach to handling data like what you have shared with us.
The basic approach (code below) is to break every cell into sub-phrases, score each sub-phrase across the whole column, and map each cell to its highest-scoring sub-phrase.

Example: send, send to and send to india are all sub-phrases (but not the only ones) of send to India carefully.
Example: you could award each sub-phrase one point for every cell it occurs in. If send to india occurs in 30 cells, by this rule it scores 30. [If a second rule also applies, say 2 extra points when the sub-phrase occurs at the start of a cell, and that rule applies to 5 of the cells matching send to india, then another 2 * 5 or 10 points are added to the score, for a total score of 40.]
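The arithmetic in that example, spelled out:

```python
occurrences = 30      # 1 point for each cell containing 'send to india'
start_bonus = 2 * 5   # 2 extra points for each of the 5 cells starting with it
total = occurrences + start_bonus
print(total)  # 40
```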
Example: the two sub-phrases send to india and send to india carefully both occur in a cell containing the text send to India carefully. But if the sub-phrase send to india has a score of 40 and the sub-phrase send to india carefully has a score of 18, the cell will be mapped to send to india, the sub-phrase with the higher score.
Below are the scoring rules currently implemented for sub-phrases. They are based on my own intuition and may not suit your use case. Caveat emptor!
- 1 point for every cell that contains the sub-phrase (favours more popular sub-phrases)
- 2 extra points if the sub-phrase occurs at the start of a cell
- the score is multiplied by the number of words in the sub-phrase (favours longer sub-phrases)
- the score is halved if the sub-phrase ends in certain words, e.g. "to", "and", "but" (more likely to be a nonsense phrase, or not specific enough)
These rules turn out to handle the use cases you mentioned in the question and comments:
from collections import Counter
column_of_cells = [
'This is an apple from Asia',
'This is an apple',
'Send to the market',
'Send to the market carefully',
'send to India',
'send to India safely',
'Packed to send to India safely',
'send to India on Blue Dart',
'If safe send to India'
]
def subphrases(text, minwords=1):
    """
    Lazily generate the sub-phrases of a given text.

    >>> list(subphrases("send to india safely"))
    ['send', 'to', 'india', 'safely', 'send to', 'to india',
     'india safely', 'send to india', 'to india safely',
     'send to india safely']
    >>> list(subphrases("send to india safely", minwords=2))
    ['send to', 'to india', 'india safely', 'send to india',
     'to india safely', 'send to india safely']
    """
    words = text.lower().split()
    for phrase_length in range(minwords, len(words) + 1):
        n_length_phrases = (' '.join(words[r:r + phrase_length])
                            for r in range(len(words) - phrase_length + 1))
        yield from n_length_phrases
# compile the set of unique sub-phrases across all cells in the column
phrase_bank = set()
for cell in column_of_cells:
    phrase_bank.update(subphrases(cell))

# compile scores for all sub-phrases
phrase_scores = Counter()
for phrase in phrase_bank:
    for cell in column_of_cells:
        lc_cell = cell.lower()
        if phrase in lc_cell:
            phrase_scores[phrase] += 1.0
            # higher score for starting a cell
            if lc_cell.startswith(phrase):
                phrase_scores[phrase] += 2.0

# prefer longer phrases
for phrase in phrase_scores:
    phrase_scores[phrase] = phrase_scores[phrase] * len(phrase.split())

# mark down a sub-phrase if it ends in certain words
negative_endwords = ['the', 'a', 'an', 'carefully', 'to', 'and', 'but']
for phrase in phrase_scores:
    last_word = phrase.split()[-1]
    if last_word in negative_endwords:
        phrase_scores[phrase] = phrase_scores[phrase] / 2

# sort by descending score
phrase_scores = phrase_scores.most_common()
print('Top 10 Sub-phrases')
print('==================')
print()
headings = f'{"Sub-phrase":33s}Score'
print(headings)
print('-' * len(headings))
for phrase, score in phrase_scores[:10]:
    print(f'{phrase:33s}{score:.1f}')
# remove scores-- rely on order now
mapping_phrases = [phrase for phrase, score in phrase_scores]
mappings = {}
for cell in column_of_cells:
    for phrase in mapping_phrases:
        if phrase in cell.lower():
            mappings[cell] = phrase
            break
# display mappings
print()
print('Mapping Cell Contents to Common Sub-phrases')
print('===========================================')
print()
headings = f'{"Cell text":33s}Maps to Sub-phrase'
print(headings)
print('-' * len(headings))
for cell, mapped_phrase in mappings.items():
    print(f'{cell:33s}{mapped_phrase}')
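Once the mappings dict has been built, producing the cleaned column is just a dictionary lookup; a sketch of the usage (the hard-coded `mappings` here is a hypothetical result of the scoring code, and the fall-back to the original cell is my addition in case a cell matched no sub-phrase):

```python
# Hypothetical mappings, shaped like the dict built by the code above
mappings = {
    'send to India safely': 'send to india',
    'Packed to send to India safely': 'send to india',
}
column_of_cells = ['send to India safely', 'Totally unrelated text']

# Replace each cell with its mapped sub-phrase, keeping unmatched cells
cleaned_column = [mappings.get(cell, cell) for cell in column_of_cells]
print(cleaned_column)  # ['send to india', 'Totally unrelated text']
```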
The trick is to craft sub-phrase scoring rules that favour the right sub-phrases. With each column running to as many as 1,000 distinct cells, you will probably need to adapt the code, adding or removing scoring rules. Different columns may need different rules. It is always a good idea to manually inspect at least a sample of the automatically selected mappings.
You should bear in mind that this process of removing "noise" from the data rests on assumptions about what the noise is; it is tampering with the data and has the potential to bias your machine-learning results.