Pandas and NLTK: replace empty cells with a substring from the adjacent column if that substring is among the NLTK tokens

Question

I have a table of PRODUCT NAMEs and MAKERs. Some of the MAKER cells are empty, so I want to write code that replaces the empty cells in the MAKER column with a substring of the product name.

To identify the substring I want to use, I am using the NLTK library.

This is the code I have written so far.

import numpy as np
import pandas as pd
import nltk  # word_tokenize needs NLTK's Punkt tokenizer data to be downloaded
from nltk.probability import FreqDist


a=('Nokia 3518','Nokia 3313','Samsung S9','Samsung S10','Samsung S4')
b=('Nokia','','','Samsung','')

df=pd.DataFrame({'Product Name':(a) , 'Maker':(b)})
df.replace('', np.nan, inplace=True)


# join all product names into one string for tokenizing
result = ' '.join(df['Product Name'])

tokens = nltk.word_tokenize(result)

# keep only words longer than 4 characters (shorter words are removed)
longwords = [wrd for wrd in tokens if len(wrd) > 4]

print(longwords)

#keeping words only that occur more than once and putting it in a dataframe
fdist = FreqDist(longwords)
x=list(filter(lambda x: x[1]>1,fdist.items())) 
print(x)

# putting the tokens in a dataframe (Nokia and Samsung)
dfb=pd.DataFrame(x)

print(dfb[0])

So far I have written the code that generates the tokens, but I don't know how to proceed.

Ultimately, I want the code to match substrings of the product names against the items in the tokens dataframe (dfb) and fill in the Maker column accordingly.

Answer 1

This is the simplest answer I found; add it to the original code:

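z = []
for name in df['Product Name']:
    # first token contained in the product name, or NaN if none matches,
    # so that z always has exactly one entry per row
    z.append(next((tok for tok in dfb[0] if tok in name), np.nan))

# fill only the empty Maker cells and keep the existing values
df['Maker'] = df['Maker'].fillna(pd.Series(z, index=df.index))

print(df)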
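
For larger tables, the same matching can probably be done without an explicit Python loop. The following is only a sketch under some assumptions: the `makers` list is a hypothetical stand-in for the tokens in `dfb[0]`, and the sample data is copied from the question. It builds one regular expression from the tokens, extracts the first match from each product name with `Series.str.extract`, and uses `fillna` so that only the empty Maker cells are touched.

```python
import re

import numpy as np
import pandas as pd

# Sample data from the question, with the empty cells already set to NaN.
df = pd.DataFrame({
    'Product Name': ['Nokia 3518', 'Nokia 3313', 'Samsung S9',
                     'Samsung S10', 'Samsung S4'],
    'Maker': ['Nokia', np.nan, np.nan, 'Samsung', np.nan],
})

# Hypothetical stand-in for the tokens stored in dfb[0].
makers = ['Nokia', 'Samsung']

# One alternation pattern with a single capture group;
# re.escape guards against regex metacharacters in a token.
pattern = '(' + '|'.join(re.escape(m) for m in makers) + ')'

# First matching token per row, or NaN when no token occurs in the name.
found = df['Product Name'].str.extract(pattern, expand=False)

# Fill only the missing Maker values; existing ones are left alone.
df['Maker'] = df['Maker'].fillna(found)

print(df)
```

Rows whose product name contains none of the tokens simply stay NaN, which may be easier to spot afterwards than a length mismatch.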
