How do I print similar string values?

Question (votes: 1, answers: 2)

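My goal is to read a large CSV file and print all the similar values in it. The file is all about hotels, so to keep things simple here I will use a list of dicts for this code: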

S1 = [{'name': 'Holiday Inn A','price': '552'},
{'name': 'Holiday Inn B','price': '568'},
{'name': 'Holiday Inn C','price': '589'},
{'name': 'Grand Palace','price': '768'},
# and so on...
]

What I mean is that I want to print every value whose name contains 'Holiday Inn'. This is the result I want:

Holiday Inn A
Holiday Inn B
Holiday Inn C

Here is my code:

import csv

name = []
value = []
linked = []
a = []

def filereader():
    line_count = 0
    with open('hotelRev.csv','r', encoding ='utf-8') as fileIn:
        reader = csv.reader(fileIn)
        for row in reader:
            line_count = line_count + 1
            if line_count == 1:
                name.append(row)
            else:
                value.append(row)

    for x in name:
        for y in value:
            linked.append(dict(zip(x,y)))

filereader()
for row in linked:
    a.append(row['name'])

b = sorted(set(a))

for row in linked:
    print(row['name']['Holiday Inn'])

Obviously this does not work. Does anyone know how to do this?

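Edit 1: by "similar" I mean grouping all the Holiday Inn entries into one big group so that they are easier to pull out and print. A direct sample from the dataset itself: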

Holiday Inn Express & Suites Austin South                             
Holiday Inn Express & Suites Baton Rouge East                         
Holiday Inn Express & Suites Bethlehem                                
Holiday Inn Express & Suites Bloomington                              
Holiday Inn Express & Suites Butte                                    
Holiday Inn Express & Suites Carmel-north Indianapolis                
Holiday Inn Express & Suites Carpinteria                              
Holiday Inn Express & Suites Columbus - Polaris Parkway               
Holiday Inn Express & Suites Columbus Univ Area - Osu                 
Holiday Inn Express & Suites Denver Northeast - Brighton

If possible, I would like to find a way to print these using as few lines of code as possible.

python python-3.x
2 answers

1 vote

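Here is a basic solution using sets. I don't think it is efficient for very large datasets, but you can use it as a reference for building an efficient solution.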

import re

import pandas as pd

df = pd.read_csv('HotelNames.csv')

def to_word_set(text):
    # Lowercase, turn every run of non-alphanumeric characters into a space,
    # then split on whitespace. split() with no argument drops empty strings,
    # so leading or trailing punctuation cannot add '' to the set.
    return set(re.sub(r"[^a-z0-9]+", ' ', str(text).lower()).split())

# Set of unique words in the search terms
search_set = to_word_set(input('Enter search terms: '))

# The hotel names are in the first column
for hotel_name in df.iloc[:, 0]:
    # Print the name if every searched word appears in that row's hotel name
    if search_set.issubset(to_word_set(hotel_name)):
        print(hotel_name)

HotelNames.csv currently contains a single column: the hotel name.
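The same word-subset test can also be sketched without pandas, directly against the list of dicts shown in the question. This is only a minimal sketch: `word_set` and `matching_names` are hypothetical helper names, and the sample rows are copied from the question rather than read from the real CSV.

```python
import re

def word_set(text):
    """Lowercase text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matching_names(rows, search_terms):
    """Return the names in rows that contain every word of search_terms."""
    wanted = word_set(search_terms)
    # `<=` on sets is the subset test, the same check as issubset()
    return [row['name'] for row in rows if wanted <= word_set(row['name'])]

# Sample data copied from the question
S1 = [{'name': 'Holiday Inn A', 'price': '552'},
      {'name': 'Holiday Inn B', 'price': '568'},
      {'name': 'Holiday Inn C', 'price': '589'},
      {'name': 'Grand Palace', 'price': '768'}]

print('\n'.join(matching_names(S1, 'Holiday Inn')))
```

With the real file, the rows could come from `csv.DictReader`, assuming the CSV has a header row with a `name` column.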


0 votes

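I think what is missing is an exact definition of "similar". I suggest writing a function or method that returns a boolean saying whether two strings match your definition of similar. Once you have worked that out, the rest can be done with an if statement.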

Some test strings for you to consider (you decide whether they are similar, and why):

"Holiday Inn" "Holiday In" "Holiday Inn" "holiday_inn" "Holiday Inn" "lollyday inn" "^ * $%__Holiday Inn!" "Holiday Inn Suites San Francisco", etc.
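One possible shape for such a boolean function, as a rough sketch only: it normalizes both strings and compares them with `difflib.SequenceMatcher` from the standard library. The names `normalize` and `is_similar` and the 0.8 threshold are assumptions made for illustration; the right definition and cutoff depend on your data.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def is_similar(a, b, threshold=0.8):
    """Return True if the normalized strings are at least `threshold` alike.

    The 0.8 default is an arbitrary starting point, not a recommendation.
    """
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

for candidate in ["Holiday In", "holiday_inn", "^*$%__Holiday Inn!", "Grand Palace"]:
    if is_similar("Holiday Inn", candidate):
        print(candidate)
```

Borderline strings such as "lollyday inn" land close to the cutoff, so whether they count as similar comes down to the threshold you pick.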

One concept you may want to learn about and get familiar with is phonetic distance. Here is a Python library for it: https://github.com/jamesturk/jellyfish
