How can I convert the text in a column to a standard form based on a custom dictionary?

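I am trying to make the education data in my dataset consistent, using a dictionary of university/college names. How can I run code against the dictionary and get the desired output? The data consists of abbreviations and colloquial names.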

Could someone provide an example in R? I am also willing to try it in Python; R is just my preference.

Here is a sample of my dictionary:

*University Name Dictionary
California Institute of Technology
New York University
Massachusetts Institute of Technology
Georgia Institute of Technology
Rutgers University
University of California, Berkley
University of California, Los Angeles

Here is my data:

*Education
Cal Tech
NYU
MIT
Ga Tech
Georgia Tech
Rutgers
Berkley
UCLA

This is what I want:

*Education      *New Education
Cal Tech        California Institute of Technology
NYU             New York University
MIT             Massachusetts Institute of Technology
Ga Tech         Georgia Institute of Technology
Georgia Tech    Georgia Institute of Technology
Rutgers         Rutgers University
Berkley         University of California, Berkley
UCLA            University of California, Los Angeles

Apologies if a solution already exists; I just could not find one. Any help would be appreciated.

python r nlp text-mining corpus
1 Answer

pandas has a replace(dictionary) method, where dictionary maps each value in your data to its replacement, for example:

 {"Cal Tech": "California Institute of Technology"} 

Because pandas.DataFrame was inspired by R, R probably has something similar.


data = {
    'Cal Tech': 'California Institute of Technology',
    'NYU': 'New York University',
    'MIT': 'Massachusetts Institute of Technology',
    'Ga Tech': 'Georgia Institute of Technology',
    'Georgia Tech': 'Georgia Institute of Technology',
    'Rutgers': 'Rutgers University',
    'Berkley': 'University of California, Berkley',
    'UCLA': 'University of California, Los Angeles',
}

import pandas as pd

df = pd.DataFrame({
    'Education': ['Cal Tech', 'NYU', 'MIT', 'Ga Tech', 'Georgia Tech', 'Rutgers', 'Berkley', 'UCLA']
})

df['New Education'] = df['Education'].replace(data)

print(df)

Result:

      Education                          New Education
0      Cal Tech     California Institute of Technology
1           NYU                    New York University
2           MIT  Massachusetts Institute of Technology
3       Ga Tech        Georgia Institute of Technology
4  Georgia Tech        Georgia Institute of Technology
5       Rutgers                     Rutgers University
6       Berkley      University of California, Berkley
7          UCLA  University of California, Los Angeles
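
One thing to keep in mind with replace() is that values missing from the dictionary pass through unchanged, so gaps in the lookup are easy to miss. A minimal sketch of an alternative, assuming the same kind of mapping: Series.map() returns NaN for anything it cannot find, which makes the unmapped entries easy to list. The 'Stanford' row and the shortened dictionary are made up for illustration.

```python
import pandas as pd

# Shortened lookup, for illustration only
data = {
    'NYU': 'New York University',
    'MIT': 'Massachusetts Institute of Technology',
}

# 'Stanford' is a hypothetical value that is not in the lookup
df = pd.DataFrame({'Education': ['NYU', 'MIT', 'Stanford']})

# map() yields NaN where no key matches
mapped = df['Education'].map(data)

# Collect the original values that still need a dictionary entry
unmapped = df.loc[mapped.isna(), 'Education'].tolist()

# Fall back to the original text for unmapped rows
df['New Education'] = mapped.fillna(df['Education'])

print(unmapped)
print(df)
```

The unmapped list can then be reviewed by hand and added to the dictionary.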

With regex=True, it can also replace a match that appears inside a longer string:

data = {
    'Cal Tech': 'California Institute of Technology',
    'NYU': 'New York University',
    'MIT': 'Massachusetts Institute of Technology',
    'Ga Tech': 'Georgia Institute of Technology',
    'Georgia Tech': 'Georgia Institute of Technology',
    'Rutgers': 'Rutgers University',
    'Berkley': 'University of California, Berkley',
    'UCLA': 'University of California, Los Angeles',
}

import pandas as pd

df = pd.DataFrame({
    'Education': ['I am from MIT']
})

df['New Education'] = df['Education'].replace(data, regex=True)

print(df)

Result:

       Education                                    New Education
0  I am from MIT  I am from Massachusetts Institute of Technology
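
A caveat with regex=True is that each key is treated as a regular expression and matches anywhere in the string, so 'MIT' would also match inside a word such as 'SMITH'. The sketch below is one possible way around that, assuming whole-word matches are what is wanted: each key is escaped with re.escape() and wrapped in \b word boundaries. The 'SMITH College' row and the shortened dictionary are made up for illustration.

```python
import re
import pandas as pd

# Shortened lookup, for illustration only
data = {
    'MIT': 'Massachusetts Institute of Technology',
    'UCLA': 'University of California, Los Angeles',
}

# Escape each key and require word boundaries on both sides
patterns = {r'\b' + re.escape(key) + r'\b': value for key, value in data.items()}

# 'SMITH College' is a hypothetical value that contains 'MIT' inside a word
df = pd.DataFrame({'Education': ['I am from MIT', 'SMITH College']})

df['New Education'] = df['Education'].replace(patterns, regex=True)

print(df)
```

Only the standalone 'MIT' is expanded; 'SMITH College' is left as it was.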

Documentation: pandas.DataFrame.replace()
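
The lookup in the examples above was typed by hand. Part of it might be derived from the list of full names instead. The following is only a rough sketch, assuming an abbreviation is formed from the first letter of each capitalized word; the acronym() helper is hypothetical and not part of pandas. It recovers keys such as NYU, MIT and UCLA, but nicknames like Cal Tech or Rutgers would still have to be added manually.

```python
import re

names = [
    'California Institute of Technology',
    'New York University',
    'Massachusetts Institute of Technology',
    'Georgia Institute of Technology',
    'Rutgers University',
    'University of California, Berkley',
    'University of California, Los Angeles',
]

def acronym(name):
    """Join the first letters of the capitalized words; words like 'of' are skipped."""
    words = re.findall(r'[A-Za-z]+', name)
    return ''.join(word[0] for word in words if word[0].isupper())

# Keep the first full name seen for each acronym
lookup = {}
for name in names:
    lookup.setdefault(acronym(name), name)

print(lookup)
```

The generated entries can be merged with the hand-written ones before calling replace().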
