How to map a set of sentences to common sub-phrases in Python

Question  Votes: 1  Answers: 3

I would like to clean a data set similar to this one.

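I have a large data set with more than 5 columns and 10,000 rows. Every column holds text, and I want to encode the values in each column before sending them to a multi-class classifier.

There are small variations between field values that I want to get rid of.

For example, if I have "Hello all this is sunday" and "This is sunday", I want to encode both of them as "This is sunday".

Is there any way to do this?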

python machine-learning data-science data-analysis data-cleaning
3 Answers
0
votes

If your only requirement is that one string is a substring of the other, you can do the following:

from typing import Optional

a = 'Hello all this is sunday'
b = 'This is sunday'


def replace_str(a: str, b: str) -> Optional[str]:
    # Return the shorter string if it occurs (ignoring case) in the longer one, else None.
    longest, shortest = (a, b) if len(a) > len(b) else (b, a)
    return shortest if shortest.lower() in longest.lower() else None


print(replace_str(a, b))

# Output: This is sunday

0
votes

If you know the possible strings in advance, you can use the following code as an example:

a = "This is sunday"
b = "Hello all this is sunday"
# Lowercase both sides so the comparison is truly case-insensitive.
if a.lower() in b.lower():
    b = a
    print(b)
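
The same idea can be applied to a whole column when the canonical strings are known in advance. The following is only a minimal sketch: the function name `normalize_cells` and the sample data are assumptions made for illustration, not part of the answer above. Longer phrases are tried first so that the most specific match wins, and cells with no match are left unchanged.

```python
from typing import Iterable, List


def normalize_cells(cells: Iterable[str], known_phrases: Iterable[str]) -> List[str]:
    """Replace each cell with a known phrase it contains (case-insensitive)."""
    # Try longer phrases first so the most specific match wins.
    ordered = sorted(known_phrases, key=len, reverse=True)
    result = []
    for cell in cells:
        lowered = cell.lower()
        # Fall back to the original cell when no known phrase matches.
        match = next((p for p in ordered if p.lower() in lowered), cell)
        result.append(match)
    return result


# Hypothetical sample data.
sample_cells = ["Hello all this is sunday", "This is sunday", "Send to India safely"]
print(normalize_cells(sample_cells, ["This is sunday", "send to India"]))
# prints ['This is sunday', 'This is sunday', 'send to India']
```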

0
votes

Hmm, this is quite tricky and somewhat of an art, but read on for a fairly simple approach that handles the data you shared with us.

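Basic approach (code below):

  1. Gather all possible sub-phrases from a column of text cells.

Example: "send", "send to" and "send to india" are all sub-phrases of "send to India carefully" (but not the only ones).

  2. Compute a score for each sub-phrase based on a series of rules.

Example: You could give a sub-phrase one point for each cell it occurs in. If "send to india" occurs in 30 cells, it scores 30 under this rule.

If a second rule also applied, giving 2 points whenever the sub-phrase occurs in the first position of a cell, and this rule applied to 5 of the cells matching "send to india", then another 2 * 5 = 10 points would be added to the score for "send to india". Its total score is now 40.

  3. The highest-scoring sub-phrase found in a given cell represents that cell in the final mapping.

Example: The two sub-phrases "send to india" and "send to india carefully" are both found in the cell with the text "send to India carefully". If the sub-phrase "send to india" has a score of 40 and the sub-phrase "send to india carefully" has a score of 18, the cell "send to India carefully" is mapped to "send to india", the higher-scoring sub-phrase.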
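
The arithmetic in steps 2 and 3 can be checked with a small sketch. The helper `score_two_rules` and the 30-cell column are hypothetical, built only to reproduce the numbers in the example (30 matching cells, 5 of which start with the phrase). The score of 18 for the longer phrase is taken from the example as given, not computed.

```python
def score_two_rules(cells, phrase):
    """1 point per cell containing the phrase, plus 2 per cell that starts with it."""
    score = 0
    for cell in cells:
        lowered = cell.lower()
        if phrase in lowered:
            score += 1
        if lowered.startswith(phrase):
            score += 2
    return score


# Hypothetical column: 30 cells contain the phrase, 5 of them start with it.
column = ["send to India"] * 5 + ["please send to India"] * 25
total = score_two_rules(column, "send to india")
print(total)  # 30 + 2 * 5 = 40

# Step 3: among the sub-phrases found in a cell, pick the highest-scoring one.
scores = {"send to india": total, "send to india carefully": 18}
cell = "send to India carefully"
best = max((p for p in scores if p in cell.lower()), key=scores.get)
print(best)  # send to india
```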

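The following scoring rules are currently implemented for sub-phrases. They are based on my own intuition and may not match your use case. Caveat emptor!

  • 1 point for each cell that contains the sub-phrase (more popular sub-phrases are favored)

  • 2 points for each cell that the sub-phrase starts (assumes this makes the sub-phrase more relevant)

  • The score is multiplied by the sub-phrase's length in words (longer sub-phrases are preferred because they tend to be more specific)

  • The score is halved if the sub-phrase ends in certain words, e.g. "to", "and", "but" (such a sub-phrase is more likely to be meaningless or not specific enough)

It turns out these rules work for the use cases you mentioned in your question and comments: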

from collections import Counter

column_of_cells = [
    'This is an apple from Asia',
    'This is an apple',
    'Send to the market',
    'Send to the market carefully',
    'send to India',
    'send to India safely',
    'Packed to send to India safely',
    'send to India on Blue Dart',
    'If safe send to India'
]


def subphrases(text, minwords=1):
    """
    lazily compile a list of sub-phrases in a given text

    >>> list(subphrases("send to india safely"))
    ['send', 'to', 'india', 'safely', 'send to', 'to india', 'india safely', 'send to india', 'to india safely', 'send to india safely']

    >>> list(subphrases("send to india safely", minwords=2))
    ['send to', 'to india', 'india safely', 'send to india', 'to india safely', 'send to india safely']
    """
    words = text.lower().split()
    # phrase lengths run from minwords up to the full length of the text
    for phrase_length in range(minwords, len(words) + 1):
        n_length_phrases = (' '.join(words[r:r + phrase_length])
                            for r in range(len(words) - phrase_length + 1))
        yield from n_length_phrases


# compile list of unique sub-phrases in all cells in the column
phrase_bank = set()
for cell in column_of_cells:
    phrase_bank.update(subphrases(cell))

# compile scores for all sub-phrases
# (note: `in` is a plain substring test, so 'a' also matches inside 'apple')
phrase_scores = Counter()
for phrase in phrase_bank:
    for cell in column_of_cells:
        lc_cell = cell.lower()
        if phrase in lc_cell:
            phrase_scores[phrase] += 1.0
        # higher score for starting a cell
        if lc_cell.startswith(phrase):
            phrase_scores[phrase] += 2.0

# prefer longer phrases
for phrase in phrase_scores:
    phrase_scores[phrase] = phrase_scores[phrase] * len(phrase.split())

# mark down sub-phrase if it ends in certain words
negative_endwords = ['the', 'a', 'an', 'carefully', 'to', 'and', 'but']
for phrase in phrase_scores:
    last_word = phrase.split()[-1]
    if last_word in negative_endwords:
        phrase_scores[phrase] = phrase_scores[phrase] / 2

# sort by descending score
phrase_scores = phrase_scores.most_common()

print('Top 10 Sub-phrases')
print('==================')
print()
headings = f'{"Sub-phrase":33s}Score'
print(headings)
print('-' * len(headings))
for phrase, score in phrase_scores[:10]:
    print(f'{phrase:33s}{score:.1f}')

# remove scores-- rely on order now
mapping_phrases = [phrase for phrase, score in phrase_scores]

# map each cell to the first (highest-scoring) sub-phrase it contains
mappings = {}
for cell in column_of_cells:
    for phrase in mapping_phrases:
        if phrase in cell.lower():
            mappings[cell] = phrase
            break

# display mappings
print()
print('Mapping Cell Contents to Common Sub-phrases')
print('===========================================')
print()
headings = f'{"Cell text":33s}Maps to Sub-phrase'
print(headings)
print('-' * len(headings))
for cell, mapped_phrase in mappings.items():
    print(f'{cell:33s}{mapped_phrase}')

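The trick is to write sub-phrase scoring rules that favor the right sub-phrases. When you scale up to thousands of different cells per column, you may need to tweak the code, adding or removing scoring rules.

Different columns of cells may need different rules. It is always a good idea to manually inspect at least a sample of the automatically chosen mappings.

You should keep in mind that removing "noise" from the data, based on assumptions about what counts as noise, is a form of tampering with the data and risks biasing the machine-learning results.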
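
One way to make that manual inspection easier is to invert the cell-to-phrase mapping, so that every cell collapsed into the same sub-phrase is listed together. This is a sketch only: `group_by_phrase` is a hypothetical helper, and `example_mappings` is made-up data in the same shape as a cell-to-phrase dictionary, not actual output of any scoring run.

```python
from collections import defaultdict


def group_by_phrase(mappings):
    """Invert {cell: phrase} into {phrase: [cells]} for easier review."""
    groups = defaultdict(list)
    for cell, phrase in mappings.items():
        groups[phrase].append(cell)
    return dict(groups)


# Hypothetical mappings, for illustration only.
example_mappings = {
    "send to India": "send to india",
    "send to India safely": "send to india",
    "This is an apple": "this is an apple",
}

for phrase, cells in group_by_phrase(example_mappings).items():
    print(f"{phrase!r} <- {len(cells)} cell(s)")
    for cell in cells:
        print("    " + cell)
```

Groups that contain unexpectedly many or unexpectedly varied cells are good candidates for a closer look.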
