Segmentation instead of diarization for estimating the number of speakers

Question

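I am using pyannote's diarization to determine the number of speakers in an audio file, where the number of speakers cannot be known in advance. Here is the code that determines the number of speakers through diarization: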

from pyannote.audio import Pipeline
MY_TOKEN = ""  # huggingface_auth_token
audio_file = "my_audio.wav"
pipeline = Pipeline.from_pretrained("pyannote/[email protected]", use_auth_token=MY_TOKEN)
output = pipeline(audio_file, min_speakers=2, max_speakers=10)
results = []
for turn, _, speaker in list(output.itertracks(yield_label=True)):
    results.append(speaker)
num_speakers = len(set(results))
print(num_speakers)

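Using diarization just to estimate the number of speakers seems like overkill, and it is slow. So I tried splitting the audio into segments, embedding each segment, and clustering the embeddings to find the ideal number of clusters as the likely number of speakers. Under the hood, pyannote may well do something similar to estimate the number of speakers. Here is what I tried: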

from sklearn.cluster import SpectralClustering, KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from spectralcluster import SpectralClusterer
from resemblyzer import VoiceEncoder, preprocess_wav
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.audio import Model
from pyannote.audio import Audio
from pyannote.core import Segment
from pyannote.audio.pipelines import VoiceActivityDetection
import numpy as np


audio_file = "my_audio.wav"
MY_TOKEN = ""  # huggingface_token
embedding_model = PretrainedSpeakerEmbedding("speechbrain/spkrec-ecapa-voxceleb")
encoder = VoiceEncoder()
model = Model.from_pretrained("pyannote/segmentation", 
                              use_auth_token=MY_TOKEN)
pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
  # onset/offset activation thresholds
  "onset": 0.5, "offset": 0.5,
  # remove speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline(audio_file)
audio_model = Audio()

segments = list(vad.itertracks(yield_label=True))
embeddings = np.zeros(shape=(len(segments), 192))
#embeddings = np.zeros(shape=(len(segments), 256))

for i, diaz in enumerate(segments):
    print(i, diaz)
    waveform, sample_rate = audio_model.crop(audio_file, diaz[0])
    embed = embedding_model(waveform[None])
    #wav = preprocess_wav(waveform[None].flatten().numpy())
    #embed = encoder.embed_utterance(wav)
    embeddings[i] = embed
embeddings = np.nan_to_num(embeddings)

max_clusters = 10
silhouette_scores = []
# clustering = SpectralClusterer(min_clusters=2, max_clusters=max_clusters, custom_dist="cosine")
# labels = clustering.predict(embeddings)
# print(labels)

for n_clusters in range(2, max_clusters+1):
    # clustering = SpectralClustering(n_clusters=n_clusters, affinity='nearest_neighbors').fit(embeddings)
    # clustering = KMeans(n_clusters=n_clusters).fit(embeddings)
    clustering = AgglomerativeClustering(n_clusters).fit(embeddings)
    labels = clustering.labels_
    score = silhouette_score(embeddings, labels)
    print(n_clusters, score)
    silhouette_scores.append(score)

# Choose the number of clusters that maximizes the silhouette score
number_of_speakers = np.argmax(silhouette_scores) + 2  # add 2 to account for starting at n_clusters=2
print(number_of_speakers)

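The problem is that I don't get the same results as pyannote's diarization, especially when there are more than 2 speakers. pyannote's diarization seems to return more realistic numbers. How can I get the same results as pyannote's diarization using a faster process such as segmentation?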

Answer 1

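It is not surprising that the two methods give different results. Speaker diarization and speaker clustering are two different approaches to the same speaker-counting problem, and they make different assumptions about the data and the problem.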

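Speaker diarization relies on techniques such as speaker change detection and speaker embeddings to split the audio into regions that correspond to different speakers, and then assigns each segment a speaker label. This approach is robust to various sources of variability in the audio, such as overlapping speech, background noise, and speaker characteristics, but it can be computationally expensive.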

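Speaker clustering, on the other hand, assumes that the audio can be divided into a fixed number of non-overlapping segments and tries to group them, according to some similarity measure, into clusters that correspond to different speakers. This approach is faster than diarization but may be less accurate, especially when the number of speakers is not known in advance.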

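To improve the accuracy of the clustering approach, you may want to borrow some of the techniques used in diarization, such as voice activity detection (VAD) and speaker embeddings. For example, you can use a VAD algorithm to split the audio into speech and non-speech regions and then cluster only the speech regions. You can also use a pretrained speaker embedding model to extract features from the speech regions and use them as input to the clustering algorithm.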
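
One thing that may help before clustering is cleaning the embedding matrix. The code in the question calls np.nan_to_num, which turns any NaN embedding into an all-zero row, and those rows can then form a cluster of their own. The sketch below drops such rows instead, drops very short segments, and L2-normalises the rest so that Euclidean distance ranks pairs the same way cosine distance does. The function name prepare_embeddings and the min_duration cut-off are assumptions for illustration, not part of pyannote.

```python
import numpy as np


def prepare_embeddings(embeddings, durations, min_duration=1.0):
    """Drop unreliable rows, then L2-normalise what is left.

    `min_duration` (seconds) is a hypothetical cut-off to tune, not a
    pyannote setting. Returns the cleaned matrix and the boolean keep mask.
    """
    emb = np.asarray(embeddings, dtype=float)
    dur = np.asarray(durations, dtype=float)
    finite = np.isfinite(emb).all(axis=1)            # rows without NaN/inf
    # Norms computed on a NaN-free copy so the comparison below is safe.
    norms = np.linalg.norm(np.where(np.isfinite(emb), emb, 0.0), axis=1)
    keep = finite & (norms > 0) & (dur >= min_duration)
    return emb[keep] / norms[keep, None], keep


# Toy data: four 2-dimensional "embeddings" and their segment durations.
raw = np.array([[3.0, 4.0], [np.nan, 1.0], [0.0, 2.0], [1.0, 0.0]])
seconds = [2.5, 3.0, 0.2, 1.5]
clean, kept = prepare_embeddings(raw, seconds)
print(clean)
print(kept)
```

In the question's loop, the durations would come from each VAD segment (diaz[0].duration), and the cleaned matrix would replace embeddings in the clustering step. silhouette_score also accepts metric="cosine" if you prefer to keep the raw vectors.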

Overall, clustering alone is unlikely to reach the same accuracy as diarization, but by combining the two approaches you can get close.
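
As a rough illustration of counting speakers without fixing the number of clusters up front, here is a minimal sketch of average-linkage agglomerative clustering that stops merging once the closest pair of clusters is farther apart than a cosine-distance threshold. This is not claimed to be what pyannote does internally; count_speakers and the default threshold of 0.7 are assumptions, and the threshold would need tuning on audio with known speaker counts. The demo uses synthetic vectors, not real speaker embeddings.

```python
import numpy as np


def count_speakers(embeddings, threshold=0.7):
    """Estimate the number of clusters with a distance-based stop rule.

    Rows must be non-zero vectors. `threshold` is a hypothetical
    cosine-distance cut-off, not a value taken from any library.
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # unit-length rows
    dist = 1.0 - x @ x.T                               # pairwise cosine distance
    clusters = [[i] for i in range(len(x))]            # start: one cluster per row

    while len(clusters) > 1:
        best, best_pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best:
                    best, best_pair = d, (a, b)
        if best > threshold:                           # nothing close enough left
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return len(clusters)


# Synthetic demo: three well-separated directions plus small noise.
rng = np.random.default_rng(0)
centers = np.eye(3, 16)
demo = np.vstack([c + 0.05 * rng.standard_normal((20, 16)) for c in centers])
estimated = count_speakers(demo)
print(estimated)
```

The nested loops are quadratic per merge, which is fine for a few hundred segments; for longer recordings, sklearn's AgglomerativeClustering with n_clusters=None and a distance_threshold offers the same kind of stop rule.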
