Diacritic-insensitive sorting with Polars in Python

Question

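I'm a poor coder and don't know what I'm doing most of the time, but coding is so useful and time-saving for manipulating and managing language data that I try anyway.

My main question is whether there is a way to make .sort() insensitive to diacritics (accent marks such as á or ô). Currently, when I sort, I get "aeiou" before "áéíóú", when what I want is "[aá][eé][ií][oó][uú]" or "aáeéiíoóuú". Although I doubt it is relevant to the question, here is my code (the locale setting is an attempt to fix the problem, suggested by Google Gemini):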

import os
import locale
import polars as pl

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')  # set the locale to en_US.UTF-8

# print the current working directory
print("Current working directory:", os.getcwd())

# Make paths for the lexicon and the sorted lexicon
LexiconPath = r'C:\Users\mawan\Documents\Code\WorldBuildingCode\KalagyonManNyal_Lexicon.csv'
UnsortedLexiconPath = r'C:\Users\mawan\Documents\Code\WorldBuildingCode\KalagyonManNyal_Lexicon_Sorted.csv'

# read the csv file name KalagyonManNyal_Lexicon and make it a DataFrame
df = pl.read_csv(LexiconPath)

df = df.fill_null("-")  # fill null values with a dash (fill_null returns a new DataFrame)
# print the DataFrame
print(df)

# sort the DataFrame by the column "Lexeme" in alphabetical order

df_sorted = df.sort("Lexeme")

print(df_sorted)

write_csv = df_sorted.write_csv(UnsortedLexiconPath, include_bom=True)  # write the sorted DataFrame to a new csv file

# print(df.filter(df.is_duplicated())) # print duplicated rows
# print(df.columns)

# make a dataframe that filters by the part of speech input by the user

PoS = input("Enter the part of speech you want to filter by: ")

df_PoS = df.filter(df['PoS'] == PoS)
print(df_PoS)

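So what I'm looking for is either a diacritic-insensitive sort or a way to define a custom sort order, so that I can count some letters with diacritics as separate letters and others not.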

Following a suggestion from the generative AI Gemini, I tried using

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

but it didn't seem to change anything.

Answer 1

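At first I thought en_US wouldn't work because it is American English, which doesn't use the characters áéíóú, so it might not be able to sort them. But I tested it with pl_PL.UTF-8 and the Polish characters ąęść, and that didn't work either.

Finally, in the documentation for sorting, I found a method that works for me (even with en_US):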

import locale
import functools

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

words = sorted(words, key=functools.cmp_to_key(locale.strcoll))

Code used for testing - only the last sort gives the expected result:

import locale
import functools

words = 'ą a b ć c d'.split()

print('before setlocal (no key)  :', sorted(words))
print('before setlocal (with key):', sorted(words, key=functools.cmp_to_key(locale.strcoll)))

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

print(' after setlocal (no key)  :', sorted(words))

print(' after setlocal (with key):', sorted(words, key=functools.cmp_to_key(locale.strcoll)))

Result:

before setlocal (no key)  : ['a', 'b', 'c', 'd', 'ą', 'ć']
before setlocal (with key): ['a', 'b', 'c', 'd', 'ą', 'ć']
 after setlocal (no key)  : ['a', 'b', 'c', 'd', 'ą', 'ć']
 after setlocal (with key): ['a', 'ą', 'b', 'c', 'ć', 'd']
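
The locale approach depends on which locales are installed, and it does not let you choose which accented letters count as separate letters. A locale-independent alternative is sketched below. It assumes a hand-written alphabet, and the names fold, make_sort_key and ALPHABET are hypothetical helpers for illustration, not part of Polars or the standard library. Letters listed in the alphabet keep their own position. Any other accented letter is folded onto its base letter using unicodedata.

```python
import unicodedata


def fold(ch):
    """Return ch with its combining marks removed (e.g. 'á' -> 'a')."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def make_sort_key(alphabet):
    """Build a sort key function from a custom alphabet (hypothetical helper).

    Letters listed in `alphabet` keep their own rank; any other accented
    letter is folded onto its base letter before its rank is looked up.
    """
    rank = {letter: i for i, letter in enumerate(alphabet)}

    def key(word):
        primary = []
        # NFC makes sure 'a' + combining accent is treated as one letter.
        for ch in unicodedata.normalize("NFC", word.lower()):
            if ch not in rank:
                ch = fold(ch)
            # Characters outside the alphabet sort after all known letters.
            primary.append(rank.get(ch, len(alphabet)))
        # The raw word breaks ties, so 'a' comes before 'á'.
        return (primary, word)

    return key


# Assumed alphabet: ñ is a separate letter after n; accented vowels are not listed.
ALPHABET = list("abcdefghijklmn") + ["ñ"] + list("opqrstuvwxyz")

words = ["ñu", "óa", "na", "oa", "áb", "ab", "nz", "ob"]
print(sorted(words, key=make_sort_key(ALPHABET)))
# ['ab', 'áb', 'na', 'nz', 'ñu', 'oa', 'óa', 'ob']
```

Here á and ó sort together with a and o, while ñ gets its own slot between n and o. To use this with the DataFrame from the question, one option (untested here, and assuming the Lexeme column has no nulls) would be to sort the rows from df.to_dicts() with this key and rebuild the frame with pl.DataFrame(...).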
