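I'm a terrible coder and most of the time I don't know what I'm doing, but coding is very useful and time-saving for manipulating and managing linguistic data, so I try.
My big question is whether there is a way to make .sort() insensitive to diacritics (accent marks such as á or ô). Currently, when I sort, I get "aeiou" before "áéíóú", when what I want is "[aá][eé][ií][oó][uú]" or "aáeéiíoóuú". Although I doubt it is relevant to the problem, here is my code (the locale setting is an attempted fix suggested by Google Gemini):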
import os
import locale
import polars as pl
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8') # set the locale to en_US.UTF-8
# print the current working directory
print("Current working directory:", os.getcwd())
# Paths for the lexicon and the sorted lexicon
LexiconPath = r'C:\Users\mawan\Documents\Code\WorldBuildingCode\KalagyonManNyal_Lexicon.csv'
SortedLexiconPath = r'C:\Users\mawan\Documents\Code\WorldBuildingCode\KalagyonManNyal_Lexicon_Sorted.csv'
# read the csv file named KalagyonManNyal_Lexicon into a DataFrame
df = pl.read_csv(LexiconPath)
df = df.fill_null("-") # fill null values with a dash (fill_null returns a new DataFrame, so keep the result)
# print the DataFrame
print(df)
# sort the DataFrame by the column "Lexeme" in alphabetical order
df_sorted = df.sort("Lexeme")
print(df_sorted)
df_sorted.write_csv(SortedLexiconPath, include_bom=True) # write the sorted DataFrame to a new csv file
# print(df.filter(df.is_duplicated())) # print duplicated rows
# print(df.columns)
# make a dataframe that filters by the part of speech input by the user
PoS = input("Enter the part of speech you want to filter by: ")
df_PoS = df.filter(df['PoS'] == PoS)
print(df_PoS)
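So what I'm looking for is either a diacritic-insensitive sort or a custom sort order, so that I can count some diacritics as separate letters and others not.
After the generative AI Gemini suggested it, I tried using locale.setlocale(locale.LC_ALL, 'en_US.UTF-8'), but it did not seem to change anything.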
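At first I thought en_US could not work because American English does not use the characters áéíóú, so it might not know how to sort them. But I tested pl_PL.UTF-8 with the Polish characters ąęść, and that did not work either.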
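A likely reason the locale setting alone has no visible effect: Python's built-in string comparison orders strings by Unicode code point and never consults the locale, so every accented letter lands after plain z. Polars' sort presumably behaves the same way, which is an assumption here rather than something checked against its documentation. A small check using only the standard library (the word list is made up for illustration):

```python
# Built-in comparison uses Unicode code points, not the locale's alphabet.
words = ["b", "á", "a", "z", "ą"]

plain_sorted = sorted(words)

# Show each letter next to its code point; accented letters have larger
# code points than any unaccented ASCII letter.
for ch in plain_sorted:
    print(ch, ord(ch))
```

This is why the accented entries cluster at the end of the sorted lexicon regardless of which locale is set.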
Finally, in the documentation on sorting I found an approach that works for me (even with en_US):
import locale
import functools
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
words = sorted(words, key=functools.cmp_to_key(locale.strcoll))
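The same key= mechanism also works without any locale, which may help with the second half of the question (some diacritics count as separate letters, others do not). The following is only a sketch under assumptions: the names SEPARATE_LETTERS, strip_marks and collation_key are made up for illustration, and treating ô as its own letter right after o is an arbitrary example choice, not a rule of any particular language.

```python
import unicodedata

# Hypothetical choice for illustration: accented letters listed here are
# treated as letters in their own right, placed right after their base letter.
SEPARATE_LETTERS = {"\u00f4": "o"}  # ô sorts as a separate letter after o


def strip_marks(text):
    """Remove combining marks, e.g. 'á' -> 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collation_key(word):
    """Build a sort key: one (base letter, rank) pair per character."""
    key = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch in SEPARATE_LETTERS:
            # Own letter: sorts after everything starting with the base letter.
            key.append((SEPARATE_LETTERS[ch], 1))
        else:
            # Diacritic ignored: sorts together with the base letter.
            key.append((strip_marks(ch), 0))
    # The original word is a tie-breaker so the order is deterministic.
    return (key, word)


words = ["ôl", "oz", "óa", "ob", "éa", "eb", "ea"]
result = sorted(words, key=collation_key)
print(result)  # ['ea', 'éa', 'eb', 'óa', 'ob', 'oz', 'ôl']
```

Here é and ó are interleaved with e and o, while every word starting with ô comes after all words starting with o. Adding more entries to SEPARATE_LETTERS extends the custom alphabet.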
Code used for testing the locale.strcoll approach. Only the last sorted() call gives the expected result:
import locale
import functools
words = 'ą a b ć c d'.split()
# before setlocale() the strcoll key gives the same order as the plain sort
print('before setlocale (no key)  :', sorted(words))
print('before setlocale (with key):', sorted(words, key=functools.cmp_to_key(locale.strcoll)))
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
# setting the locale alone changes nothing; only the strcoll key makes use of it
print(' after setlocale (no key)  :', sorted(words))
print(' after setlocale (with key):', sorted(words, key=functools.cmp_to_key(locale.strcoll)))
Result:
before setlocale (no key)  : ['a', 'b', 'c', 'd', 'ą', 'ć']
before setlocale (with key): ['a', 'b', 'c', 'd', 'ą', 'ć']
 after setlocale (no key)  : ['a', 'b', 'c', 'd', 'ą', 'ć']
 after setlocale (with key): ['a', 'ą', 'b', 'c', 'ć', 'd']