Finding the words common to two files in Python, without including words that appear in only one

Question (Votes: 0, Answers: 2)

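I have to write a program that correlates smoking with lung cancer risk. The data for this comes in two files. My code currently pairs up whatever values sit on the same line of each file (e.g. America,23.3 with Spain,77.9, and Italy,24.2 with Russia,60.8). How can I modify my code so that it pairs the values for the same country and leaves out the countries that occur in only one file (it shouldn't use Germany, France, China, or Korea, because each appears in only one file)? Thank you so much for your help in advance :)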

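Smoking file:

Country, Percent Cigarette Smokers Data

America,23.3

Italy,24.2

Russia,23.7

France,14.9

England,17.9

Spain,17

Germany,21.7

Second file:

Lung cancer cases per 100,000

Spain,77.9

Russia,60.8

Korea,61.3

America,73.3

China,66.8

Vietnam,64.5

Italy,43.9

And my code: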

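import math


def readFiles(smoking_datafile, cancer_datafile):  # name assumed: the def line is missing from the post
    '''
        Reads the data from the provided file objects smoking_datafile
        and cancer_datafile. Returns the data read from each as a tuple
        of lists of the form (smoking_data, cancer_data).
    '''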

    # init
    smoking_data = []
    cancer_data = []
    empty_str = ''

    # read past file headers
    smoking_datafile.readline()
    cancer_datafile.readline()

    # read data files
    eof = False

    while not eof:

        # read line of data from each file
        s_line = smoking_datafile.readline()
        c_line = cancer_datafile.readline()

        # check if at end-of-file of both files
        if s_line == empty_str and c_line == empty_str:
            eof = True

        # check if end of smoking data file only
        elif s_line == empty_str:
            raise OSError('Unexpected end-of-file for smoking data file')

        # check if at end of cancer data file only
        elif c_line == empty_str:
            raise OSError('Unexpected end-of-file for cancer data file')

        # append line of data to each list
        else:
            smoking_data.append(s_line.strip().split(','))
            cancer_data.append(c_line.strip().split(','))

    # return list of data from each file
    return (smoking_data, cancer_data)


def calculateCorrelation(smoking_data, cancer_data):
    '''
        Calculates and returns the correlation value for the data
        provided in lists smoking_data and cancer_data
    '''    

    # init
    sum_smoking_vals = sum_cancer_vals = 0
    sum_smoking_sqrd = sum_cancer_sqrd = 0
    sum_products = 0

    # calculate intermediate correlation values
    num_values = len(smoking_data)

    for k in range(0,num_values):

        sum_smoking_vals = sum_smoking_vals + float(smoking_data[k][1])
        sum_cancer_vals = sum_cancer_vals + float(cancer_data[k][1])

        sum_smoking_sqrd = sum_smoking_sqrd +  \
                              float(smoking_data[k][1]) ** 2
        sum_cancer_sqrd = sum_cancer_sqrd +  \
                              float(cancer_data[k][1]) ** 2

        sum_products = sum_products + float(smoking_data[k][1]) *  \
                       float(cancer_data[k][1])

    # calculate and display correlation value
    numer = (num_values * sum_products) - \
            (sum_smoking_vals * sum_cancer_vals)

    denom = math.sqrt(abs( \
        ((num_values * sum_smoking_sqrd) - (sum_smoking_vals ** 2)) * \
        ((num_values * sum_cancer_sqrd) - (sum_cancer_vals ** 2)) \
        ))

    return numer / denom
2 Answers
1
vote

Let's focus on getting the data into a format that is easy to work with. The code below gives you a dictionary of the form:
smokers_cancer_data = {
    'America': {
        'smokers': 23.3,
        'cancer': 73.3
    },
    'Italy': {
        'smokers': 24.2,
        'cancer': 43.9
    },
    ...
}

Once you have this, you can pull out whatever values you need and run your calculations on them. See the code below.

def read_data(filename: str) -> dict:
    with open(filename, 'r') as file:
        next(file)  # Skip the header
        data = {}
        for line in file:
            cleaned_line = line.rstrip()
            # Skip blank lines
            if cleaned_line:
                data_item = cleaned_line.split(',')
                # Map country name -> numeric value
                data[data_item[0]] = float(data_item[1])
    return data


# Load data into python dictionaries
smokers_data = read_data('smokersData.txt')
cancer_data = read_data('lungCancerData.txt')


# Build one dictionary that is easy to work with
smokers_cancer_data = dict()
for (key, value) in smokers_data.items():
    if key in cancer_data:
        smokers_cancer_data[key] = {
            'smokers': smokers_data[key],
            'cancer' : cancer_data[key]  
        }

print(smokers_cancer_data)
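
As a side note, the same matching can be expressed with set operations on the dictionaries' key views, which also makes it easy to list the countries that get left out. This is only a sketch: the two dictionaries below are hand-typed stand-ins for what `read_data` would return for the files in the question, and the names `in_both` and `only_one` are made up for illustration.

```python
# Stand-ins for the dictionaries returned by read_data (values from the question).
smokers = {'America': 23.3, 'Italy': 24.2, 'Russia': 23.7, 'France': 14.9,
           'England': 17.9, 'Spain': 17.0, 'Germany': 21.7}
cancer = {'Spain': 77.9, 'Russia': 60.8, 'Korea': 61.3, 'America': 73.3,
          'China': 66.8, 'Vietnam': 64.5, 'Italy': 43.9}

# Key views support set operators: & is intersection, ^ is symmetric difference.
in_both = sorted(smokers.keys() & cancer.keys())
only_one = sorted(smokers.keys() ^ cancer.keys())

print(in_both)   # countries present in both files
print(only_one)  # countries present in exactly one file
```

The merged dictionary built above holds exactly the countries in `in_both`.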

For example, if you want to compute the sums of the smoker and cancer values:

smokers_total = 0
cancer_total = 0
for (key, value) in smokers_cancer_data.items():
    smokers_total += value['smokers']
    cancer_total += value['cancer']
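
The same loop pattern extends to the correlation the question asks about. Below is a minimal sketch, assuming the merged dictionary has the shape shown earlier. The function name `pearson_correlation` and the hand-typed `merged` sample are illustrative, not part of the original code, and the formula is the same sums-based one used in the question's `calculateCorrelation`. It assumes each column contains at least two distinct values, otherwise the denominator is zero.

```python
import math


def pearson_correlation(pairs):
    """Pearson correlation for a list of (x, y) number pairs."""
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_x_sq = sum(x * x for x, _ in pairs)
    sum_y_sq = sum(y * y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)

    numer = n * sum_xy - sum_x * sum_y
    denom = math.sqrt((n * sum_x_sq - sum_x ** 2) * (n * sum_y_sq - sum_y ** 2))
    return numer / denom


# Hand-typed stand-in for smokers_cancer_data (values from the question).
merged = {
    'America': {'smokers': 23.3, 'cancer': 73.3},
    'Italy': {'smokers': 24.2, 'cancer': 43.9},
    'Russia': {'smokers': 23.7, 'cancer': 60.8},
    'Spain': {'smokers': 17.0, 'cancer': 77.9},
}

# Each pair comes from one country, so the values are matched by name, not by line.
pairs = [(v['smokers'], v['cancer']) for v in merged.values()]
r = pearson_correlation(pairs)
print(round(r, 3))
```

Because every pair is built from a single dictionary entry, countries that appear in only one file never enter the calculation.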

1
vote

This returns a list of every country that has data in both files, along with its data:

l3 = []
with open('smoking.txt','r') as f1, open('cancer.txt','r') as f2:
    l1, l2 = f1.readlines(), f2.readlines()

for s1 in l1:
    for s2 in l2:
        if s1.split(',')[0] == s2.split(',')[0]:
            cty = s1.split(',')[0]
            smk = s1.split(',')[1].strip()
            cnr = s2.split(',')[1].strip()
            l3.append(f"{cty}: smoking: {smk}, cancer: {cnr}")

print(l3)

Output:

['America: smoking: 23.3, cancer: 73.3', 'Italy: smoking: 24.2, cancer: 43.9', 'Russia: smoking: 23.7, cancer: 60.8', 'Spain: smoking: 17, cancer: 77.9']
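
The nested loops above compare every line of one file with every line of the other and do not skip header or blank lines. One possible alternative, sketched here under some assumptions, is to load each file into a dictionary first and then do a single pass with a membership test. The helper name `parse_country_values` is made up for this example, and `io.StringIO` objects with assumed header text stand in for the real files so that the snippet runs on its own.

```python
import io


def parse_country_values(lines):
    """Map country -> value, skipping the header line and any blank lines."""
    values = {}
    rows = iter(lines)
    next(rows, None)  # skip the header line
    for line in rows:
        line = line.strip()
        if not line:
            continue
        country, _, number = line.partition(',')
        values[country.strip()] = float(number)
    return values


# In-memory stand-ins for the two files (header wording is assumed).
smoking_file = io.StringIO(
    "Country,Percent Cigarette Smokers\n"
    "America,23.3\nItaly,24.2\nRussia,23.7\nFrance,14.9\n"
    "England,17.9\nSpain,17\nGermany,21.7\n"
)
cancer_file = io.StringIO(
    "Country,Lung Cancer Cases per 100000\n"
    "Spain,77.9\nRussia,60.8\nKorea,61.3\nAmerica,73.3\n"
    "China,66.8\nVietnam,64.5\nItaly,43.9\n"
)

smoking = parse_country_values(smoking_file)
cancer = parse_country_values(cancer_file)

# Keep only countries found in both dictionaries, in smoking-file order.
matched = [
    (country, smoking[country], cancer[country])
    for country in smoking
    if country in cancer
]
print(matched)
```

With real files, passing the objects returned by `open(...)` to `parse_country_values` should work the same way, since the helper only iterates over lines.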