Diacritic-insensitive sorting with Polars in Python


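I'm a terrible coder and most of the time I don't know what I'm doing, but coding is really useful and time-saving for manipulating and managing language data, so I try.

My big question is whether there is a way to make .sort() insensitive to diacritics (accent marks such as á or ô). Currently, when I sort, I get "aeiou" before "áéíóú", when what I want is "[aá][eé][ií][oó][uú]" or "aáeéiíoóuú". I doubt it is relevant to the question, but here is my code (the locale setting is an attempt to fix the problem, suggested by Google Gemini):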

import os
import locale
import polars as pl

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')  # set the locale to en_US.UTF-8

# print the current working directory
print("Current working directory:", os.getcwd())

# Make paths for the lexicon and the sorted lexicon
LexiconPath = r'C:\Users\mawan\Documents\Code\WorldBuildingCode\KalagyonManNyal_Lexicon.csv'
UnsortedLexiconPath = r'C:\Users\mawan\Documents\Code\WorldBuildingCode\KalagyonManNyal_Lexicon_Sorted.csv'

# read the csv file name KalagyonManNyal_Lexicon and make it a DataFrame
df = pl.read_csv(LexiconPath)

df = df.fill_null("-")  # fill null values with a dash (fill_null returns a new DataFrame)
# print the DataFrame (Polars shows a truncated preview)
print(df)

# sort the DataFrame by the column "Lexeme" in alphabetical order

df_sorted = df.sort("Lexeme")

print(df_sorted)

df_sorted.write_csv(UnsortedLexiconPath, include_bom=True)  # write the sorted DataFrame to a new csv file

# print(df.filter(df.is_duplicated())) # print duplicated rows
# print(df.columns)

# make a dataframe that filters by the part of speech input by the user

PoS = input("Enter the part of speech you want to filter by: ")

df_PoS = df.filter(df['PoS'] == PoS)
print(df_PoS)

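So what I'm looking for is either a diacritic-insensitive sort or a way to give custom sorting instructions, so that I can count some diacritics as separate letters and others not.

Following a suggestion from the generative AI Gemini, I tried locale.setlocale(locale.LC_ALL, 'en_US.UTF-8'), but it didn't seem to change anything.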

python sorting python-polars diacritics accent-insensitive
1 Answer

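At first I thought en_US couldn't work because it is American English, which doesn't use the characters áéíóú, so it might not know how to sort them. But I tested it with pl_PL.UTF-8 and the Polish characters ąęść, and that didn't work either.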

Finally, in the documentation for sorting I found a method that works for me (even with en_US):

import locale
import functools

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

words = sorted(words, key=functools.cmp_to_key(locale.strcoll)) 
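
As far as I know, the same Sorting HOW TO also suggests locale.strxfrm() as a key function, which avoids the cmp_to_key wrapper. Below is a minimal sketch, assuming the en_US.UTF-8 locale is installed on the machine. locale_sorted is a hypothetical helper name, not part of any library, and the resulting order depends on the platform's collation tables.

```python
import locale


def locale_sorted(words, locale_name="en_US.UTF-8"):
    """Sort words with the collation rules of locale_name, if it is available.

    Note: setlocale changes process-wide state as a side effect.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The locale is not installed here; fall back to plain code-point order.
        return sorted(words)
    # strxfrm turns each string into a key that compares like strcoll would.
    return sorted(words, key=locale.strxfrm)


print(locale_sorted("ą a b ć c d".split()))
```

Only LC_COLLATE is set here because collation is the only category that sorting needs.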

Code I used to test the cmp_to_key approach. Only the last sort gives the expected result:

import locale
import functools

words = 'ą a b ć c d'.split()

print('before setlocal (no key)  :', sorted(words))
print('before setlocal (with key):', sorted(words, key=functools.cmp_to_key(locale.strcoll)))

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

print(' after setlocal (no key)  :', sorted(words))

print(' after setlocal (with key):', sorted(words, key=functools.cmp_to_key(locale.strcoll)))

Result:

before setlocal (no key)  : ['a', 'b', 'c', 'd', 'ą', 'ć']
before setlocal (with key): ['a', 'b', 'c', 'd', 'ą', 'ć']
 after setlocal (no key)  : ['a', 'b', 'c', 'd', 'ą', 'ć']
 after setlocal (with key): ['a', 'ą', 'b', 'c', 'ć', 'd']
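
The question also asks for a custom order in which some accented letters count as separate letters and others do not. That can be sketched without locale at all, using only the standard unicodedata module. This is an assumption-laden sketch rather than a tested Polars solution: strip_marks and make_sort_key are hypothetical helper names, and the alphabets are illustrative. Letters listed in the alphabet keep their own position, and any other accented letter is ranked as its base letter.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks, e.g. 'á' -> 'a' and 'ô' -> 'o'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def make_sort_key(alphabet):
    """Return a key function for sorted() based on a custom alphabet.

    Letters that appear in `alphabet` are distinct letters in that order.
    Accented letters that do not appear are ranked as their base letter.
    """
    alphabet = unicodedata.normalize("NFC", alphabet)
    rank = {letter: i for i, letter in enumerate(alphabet)}

    def letter_rank(ch):
        if ch in rank:
            return rank[ch]
        base = strip_marks(ch)
        if base in rank:
            return rank[base]
        # Unknown characters go after the alphabet, in code-point order.
        return len(rank) + ord(ch)

    def key(word):
        word = unicodedata.normalize("NFC", word.lower())
        # The word itself breaks ties, so 'a' comes before 'á'.
        return ([letter_rank(ch) for ch in word], word)

    return key


plain = make_sort_key("abcdefghijklmnopqrstuvwxyz")
print(sorted("áéíóúaeiou", key=plain))
# ['a', 'á', 'e', 'é', 'i', 'í', 'o', 'ó', 'u', 'ú']

# Illustrative alphabet where ñ is a separate letter after n, but ó is not.
with_n_tilde = make_sort_key("abcdefghijklmnñopqrstuvwxyz")
print(sorted(["o", "ñ", "n", "ó"], key=with_n_tilde))
# ['n', 'ñ', 'o', 'ó']
```

For the simpler fully diacritic-insensitive case in Polars, one possible approach (untested here) is to build a stripped helper key with something like pl.col("Lexeme").map_elements(strip_marks, return_dtype=pl.String) and pass that expression to df.sort.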