The yml I use to create my local client:
version: '3.4'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.0
    restart: on-failure:0
    ports:
      - 8080:8080
      - 50051:50051
    environment:
      QUERY_DEFAULTS_LIMIT: 20
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: "./data"
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: text2vec-transformers
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: 0
I created a collection:
client.collections.create(
    name = "legal_sections",
    properties = [
        wvc.config.Property(
            name = "content",
            description = "The actual section chunk that the answer is to be extracted from",
            data_type = wvc.config.DataType.TEXT,
            index_searchable = True,
            index_filterable = True,
            skip_vectorization = True,
            vectorize_property_name = False)])
I build the data to upload, then upload it:
upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties = {
            'content': content
        },
        vector = vector
    ))
client.collections.get("Legal_sections").data.insert_many(upserts)
My custom vectors have length 1024:
upserts[0].vector.shape
output:
(1024,)
I grab an arbitrary uuid:
coll = client.collections.get("legal_sections")
for i in coll.iterator():
    print(i.uuid)
    break
output:
386be699-71de-4bad-9022-31173b9df8d2
I check the length of the vector stored for the object with this uuid:
coll.query.fetch_object_by_id('386be699-71de-4bad-9022-31173b9df8d2', include_vector=True).vector['default'].__len__()
output:
384
This should be 1024. What am I doing wrong?
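This is most likely a bug in weaviate (someone from weaviate can confirm). Every element of the embedding model's output has dtype np.float32.
This causes two problems:
collections.data.insert raises an error because float32 cannot be JSON serialized.
collections.data.insert_many just suppresses this error and silently encodes the content with the model given in the yml used to create the client. That model is multi-qa-MiniLM-L6-cos-v1, which produces 384-dimensional embeddings, so this would explain the stored length of 384.
If I convert the embedding with vector = [float(i) for i in vector], the code above works fine.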
That is, this:
upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties = {
            'content': content
        },
        vector = vector
    ))
becomes:
upserts = []
for content, vector in zip(docs, embeddings.encode(docs)):
    upserts.append(wvc.data.DataObject(
        properties = {
            'content': content
        },
        vector = [float(i) for i in vector]
    ))
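The conversion can also be wrapped in a small helper that checks the dimension before anything is sent, so a mismatch like 384 vs 1024 shows up early. This is only a sketch: to_plain_floats and expected_dim are hypothetical names, not part of the weaviate client, and array('f') is used as a stand-in for a float32 numpy array so the snippet runs with the standard library alone.

```python
import json
from array import array
from typing import Iterable, List


def to_plain_floats(vector: Iterable, expected_dim: int) -> List[float]:
    """Convert a numeric sequence into a list of built-in Python floats.

    Hypothetical helper, not part of the weaviate client.
    """
    # numpy arrays (and array.array) expose tolist(); fall back to list() otherwise
    values = vector.tolist() if hasattr(vector, "tolist") else list(vector)
    plain = [float(v) for v in values]
    # Fail loudly instead of letting a wrong-sized vector go through unnoticed
    if len(plain) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(plain)}")
    return plain


# array('f') holds 32-bit floats; it stands in for the model's float32 output
raw = array("f", [0.25, 0.5, 0.75])
plain = to_plain_floats(raw, expected_dim=3)

# Built-in floats serialize to JSON without a custom encoder
payload = json.dumps({"vector": plain})
```

In the upload loop above this would be used as vector = to_plain_floats(vector, expected_dim=1024), assuming the custom embedding model really does return 1024-dimensional vectors.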