My goal is to read a large CSV file and print out all the similar values. The data is all about hotels, and for simplicity I'll represent it here as a list of dicts:
S1 = [{'name': 'Holiday Inn A', 'price': '552'},
      {'name': 'Holiday Inn B', 'price': '568'},
      {'name': 'Holiday Inn C', 'price': '589'},
      {'name': 'Grand Palace', 'price': '768'},
      # and so on...
     ]
By that I mean I want to print every value whose name contains 'Holiday Inn'. This is the result I want:
Holiday Inn A
Holiday Inn B
Holiday Inn C
Here is my code:
import csv
name = []
value = []
linked = []
a = []
def filereader():
    line_count = 0
    with open('hotelRev.csv', 'r', encoding='utf-8') as fileIn:
        reader = csv.reader(fileIn)
        for row in reader:
            line_count = line_count + 1
            if line_count == 1:
                name.append(row)
            else:
                value.append(row)
    for x in name:
        for y in value:
            linked.append(dict(zip(x, y)))
filereader()
for row in linked:
    a.append(row['name'])
b = sorted(set(a))
for row in linked:
    print(row['name']['Holiday Inn'])
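Obviously this doesn't work. Does anyone know how to do this?
edit-1: By "similar" I mean grouping all the Holiday Inn entries into one big group so they are easier to call up and print. A direct sample from the dataset itself: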
Holiday Inn Express & Suites Austin South
Holiday Inn Express & Suites Baton Rouge East
Holiday Inn Express & Suites Bethlehem
Holiday Inn Express & Suites Bloomington
Holiday Inn Express & Suites Butte
Holiday Inn Express & Suites Carmel-north Indianapolis
Holiday Inn Express & Suites Carpinteria
Holiday Inn Express & Suites Columbus - Polaris Parkway
Holiday Inn Express & Suites Columbus Univ Area - Osu
Holiday Inn Express & Suites Denver Northeast - Brighton
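If possible, I'd like to find a way to print these out in as few lines as possible.
Here is a basic solution using sets. I don't think it will be efficient for very large datasets, but it can serve as a reference for building an efficient one.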
import pandas as pd
import re

df = pd.read_csv('HotelNames.csv')
search_terms = input('Enter search terms: ')
# Convert to lower case
search_terms = search_terms.lower()
# Replace special characters with spaces
search_terms = re.sub(r"[^a-zA-Z0-9]+", ' ', search_terms)
# Make a set of unique words from the string; split() with no argument
# drops the empty strings that leading or trailing spaces would produce
search_set = set(search_terms.split())
for i in range(len(df)):
    # Normalize the hotel name in the first column the same way
    name = df.iloc[i, 0]
    t = re.sub(r"[^a-zA-Z0-9]+", ' ', name).lower()
    hotel_set = set(t.split())
    # Find whether the searched terms are a subset of the hotel name in that particular row
    if search_set.issubset(hotel_set):
        print(name)
HotelNames.csv currently contains one column, the hotel name.
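If the file is too large to load comfortably, the same subset test can be applied while streaming rows with the standard csv module. The following is only a sketch under a few assumptions: the helper names words and matching_names are made up for illustration, the file is assumed to have a header row, and an in-memory string stands in for HotelNames.csv so the example runs on its own.

```python
import csv
import io
import re

def words(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matching_names(rows, search_terms):
    """Yield each hotel name (first column) that contains every search word."""
    wanted = words(search_terms)
    for row in rows:
        # Skip blank lines; "<=" on sets is the subset test
        if row and wanted <= words(row[0]):
            yield row[0]

# Stand-in for open('HotelNames.csv', newline='', encoding='utf-8').
# The header name "name" is an assumption; the rows come from the question.
sample = io.StringIO(
    "name\n"
    "Holiday Inn Express & Suites Butte\n"
    "Grand Palace\n"
    "Holiday Inn Express & Suites Bethlehem\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
found = list(matching_names(reader, "holiday inn"))
for name in found:
    print(name)
```

Because matching_names is a generator, only one row is held in memory at a time; the list() call here is just for the small demo.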
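I think what's missing is a precise definition of "similar". I suggest writing a function or method that returns a boolean indicating whether two strings match your definition of similar. Once you've settled that, the rest is just an if statement.
Some test strings for you to consider (you decide whether they are similar, and why):
"Holiday Inn" "Holiday In" "Holiday Inn" "holiday_inn" "Holiday Inn" "lollyday inn" "^*$%__Holiday Inn!" "Holiday Inn Suites San Francisco" etc.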
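As one possible starting point, here is a sketch of such a boolean function. It assumes one particular definition of "similar" (after lowercasing and stripping punctuation, the name begins with the same words as the query); normalize and is_similar are hypothetical names, and you may well want a different rule.

```python
import re

def normalize(text):
    """Lowercase, turn every non-alphanumeric run into a space, return the words."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

def is_similar(name, query):
    """True if the name's leading words equal the query's words (assumed rule)."""
    query_words = normalize(query)
    return normalize(name)[:len(query_words)] == query_words

candidates = [
    "Holiday Inn",
    "holiday_inn",
    "lollyday inn",
    "^*$%__Holiday Inn!",
    "Holiday Inn Suites San Francisco",
    "Grand Palace",
]
for candidate in candidates:
    if is_similar(candidate, "Holiday Inn"):
        print(candidate)
```

Under this rule "lollyday inn" and "Grand Palace" are rejected and the other four are accepted; whether that is the right outcome depends on the definition you settle on.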
One thing you may want to read up on is the concept of phonetic distance. Here is a Python library for it: https://github.com/jamesturk/jellyfish
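To give a feel for what a phonetic encoding does without depending on the library, here is a hand-rolled, simplified Soundex in plain Python. This is an illustrative sketch, not jellyfish's implementation; soundex and sounds_alike are names chosen for this example, and it only handles ASCII letters.

```python
def soundex(word):
    """Return a 4-character Soundex-style code for one ASCII word."""
    digit_for = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for letter in letters:
            digit_for[letter] = digit
    chars = [c for c in word.lower() if "a" <= c <= "z"]
    if not chars:
        return ""
    code = chars[0].upper()  # the first letter is kept as-is
    previous = digit_for.get(chars[0], "")
    for c in chars[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same digit
        digit = digit_for.get(c, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets this, so a repeated digit counts again
    return (code + "000")[:4]  # pad with zeros, truncate to four characters

def sounds_alike(a, b):
    """True if both phrases have the same sequence of per-word codes."""
    return [soundex(w) for w in a.split()] == [soundex(w) for w in b.split()]

print(soundex("Holiday"), soundex("Holliday"))
print(sounds_alike("Holiday Inn", "Holliday In"))
print(sounds_alike("Holiday Inn", "Grand Palace"))
```

Note that this encoding keeps the first letter, so "lollyday inn" does not match "Holiday Inn" here; a distance-based measure might treat that pair differently.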