如何在SQL Server中存储n维向量?

问题描述 投票:0回答:0

我想在 SQL Server 中存储一个大的

n
维向量(例如一个嵌入向量)作为与另一行关联的元数据。

在这个例子中,它将是一个384维的向量,例如:

[0.161391481757164,   -0.23294533789157867, -0.5648667216300964,  -0.3210797905921936,  -0.03274689242243767,  0.011770576238632202, -0.06612513959407806,
-0.14662186801433563, -0.17081189155578613,  0.2879514992237091,  -0.1932784765958786,   0.009713868610560894, 0.23330552875995636,   0.03551964834332466,
-0.20526213943958282,  0.06445703655481339, -0.3146169185638428,   0.5788811445236206,   0.09118294715881348, -0.0048667509108781815,-0.16503077745437622,
 0.25162017345428467, -0.36395764350891113, -0.34742429852485657,  0.0526515394449234,   0.08912508934736252,  0.48464590311050415,  -0.04224267974495888,
 0.32445403933525085, -0.6847451329231262,  -0.20959551632404327, -0.027657458558678627, 0.20439794659614563,  0.6859520077705383,   -0.4988805055618286,
-0.26204171776771545, -0.18842612206935883,  0.07067661732435226,  0.02633148804306984,  0.03182782977819443,  0.28935596346855164,  -0.0016041728667914867,
 0.14609676599502563, -0.36272501945495605,  0.10288259387016296, -0.3651926815509796,  -0.3823530375957489,   0.14052163064479828,   0.006418740376830101,
 0.11741586774587631, -0.6509529948234558,  -0.15997739136219025, -0.42837604880332947,  0.12351743131875992,  0.0485026054084301,    0.24820692837238312,
 0.46972623467445374, -0.47954055666923523, -0.5238635540008545,  -0.3543052673339844,   0.22626525163650513,  0.18406584858894348,   0.6463921070098877,
 0.11894208937883377, -0.07143554836511612,  0.004256516695022583, 0.10088140517473221,  0.3335645794868469,   0.16905969381332397,   0.056856121867895126,
 0.11355260014533997,  0.3708053231239319,  -0.7484591603279114,   0.17503942549228668, -0.3249044418334961,   0.5901510715484619,    0.41506800055503845,
 0.05852462351322174,  0.5119204521179199,   0.2750142216682434,  -0.2058306783437729,   0.8199670314788818,   0.16698679327964783,  -0.1572146713733673,
 0.014733579009771347 ,0.0168467964977026,   0.4688740372657776,  -0.07839230448007584,  0.49326324462890625, -0.29934313893318176,   0.21525822579860687,
 0.1396997570991516,  -0.3420834243297577,  -0.5197309851646423,   0.10842061042785645, -0.0338996984064579,   0.35846689343452454,  -0.1660442352294922,
 0.15579357743263245,  0.015674782916903496,-0.8510578870773315,  -0.07501569390296936, -0.1791406124830246,   0.14926102757453918,  -0.2269722819328308,
 0.42619261145591736,  0.09489753842353821, -0.13341256976127625,  0.3312526345252991,   0.22534190118312836,  0.0679713636636734,    0.17042726278305054,
 0.14300595223903656, -0.06654901057481766, -0.2170567661523819,  -0.454984188079834,   -0.5516679286956787,  -0.10752955824136734,  -0.05743071809411049,
 0.32108309864997864, -0.5445901155471802,  -0.43162357807159424,  0.08207866549491882,  0.0664522647857666,   0.4478979706764221,    0.2190810590982437,
-0.05722910910844803, -0.0932786613702774,   0.01758035272359848,  0.16166797280311584,  0.44004616141319275, -0.21601708233356476,   0.43121641874313354,
 0.32022470235824585, -0.014045504853129387,-0.24948528409004211, -0.4389941990375519,   0.3816317319869995,  -0.5687862038612366,    0.1088542640209198,
-0.403241366147995,    0.08174201846122742,  0.21350793540477753,  0.2396722435951233,   0.4973253607749939,   0.31202447414398193,  -0.5260801315307617,
-0.3351263403892517,  -0.04100760444998741,  0.6609364151954651,  -0.2047063261270523,   0.19385716319084167, -0.5661329627037048,   -0.27058693766593933,
-0.1637117713689804,   0.30641692876815796, -0.08894442766904831, -0.052735116332769394,-0.13839660584926605, -0.6741533875465393,    0.05569711700081825,
-0.04354270175099373,  0.20251914858818054,  0.24813368916511536,  0.1719648838043213,   0.26782000064849854,  0.3137670159339905,    0.18599936366081238,
 0.23953016102313995,  0.17769533395767212,  0.46293920278549194, -0.19122551381587982, -0.5595004558563232,   0.09755659103393555,   0.3125424385070801,
-0.5813230276107788,  -1.0698442459106445,  -0.09045401215553284, -0.08948248624801636, -0.051830895245075226,-0.0001317809073952958,-0.08400193601846695,
 0.25725823640823364, -0.10135184973478317,  0.07884480804204941,  0.2091679722070694,   0.3950233459472656,   0.2745698094367981,   -0.872776448726654,
-0.16590780019760132,  0.4308463931083679,  -0.24375642836093903, -0.02120584435760975,  0.05213866010308266, -0.19898287951946259,  -0.5506985187530518,
 0.40167248249053955,  0.1640072464942932,  -0.010167916305363178, 0.14038121700286865,  0.4958030879497528,  -0.7259818315505981,   -0.24387206137180328,
 0.08528701961040497,  0.03415993973612785, -0.16687284409999847,  0.3804749548435211,  -0.08561687171459198, -0.2752263844013214,    0.5883951783180237,
-0.3283255994319916,  -0.12724250555038452,  0.08751262724399567, -0.44206979870796204, -0.11079336702823639, -0.16302113234996796,   0.11022322624921799,
-0.09404750168323517, -0.256179541349411,    0.20473307371139526,  0.41829538345336914, -0.1095203086733818,   0.02342342585325241,  -0.18814104795455933,
-0.2540932893753052,   0.48397907614707947,  0.03593514859676361, -0.089835524559021,   -0.6478171944618225,  -0.1757517009973526,    0.0672023594379425,
 0.0695127546787262,  -0.6398074626922607,  -0.03958022966980934, -0.10351496934890747,  0.22433893382549286,  0.6756673455238342,   -0.2924160957336426,
 0.17503827810287476,  0.12915058434009552, -0.239552840590477,    0.15498916804790497, -0.4730042815208435,  -0.12289212644100189,  -0.004052990116178989,
 0.11593572050333023, -0.1965983510017395,   0.5210273265838623,  -0.18184830248355865,  0.2579534947872162,  -0.1920309066772461,   -0.389960378408432,
 0.04139290377497673, -0.11638019979000092, -0.10620912909507751, -0.5321099162101746,   0.13135096430778503, -0.07761876285076141,  -0.0830138698220253,
-0.01572849042713642,  0.31080499291419983, -0.41445496678352356,  0.1609737128019333,   0.5787453651428223,  -0.05459209159016609,   0.1318219006061554,
-0.06957206130027771,  0.15152350068092346, -0.07094550132751465, -0.196294367313385,    0.12644843757152557,  0.23419199883937836,   0.5845456719398499,
-0.19989481568336487, -0.19607964158058167, -0.19692276418209076, -0.08633144199848175, -0.004551170393824577, 0.09362921118736267,  -0.14167727530002594,
-0.14917594194412231,  0.31781134009361267,  0.18779256939888,     0.42154577374458313, -0.20578211545944214,  0.14142100512981415,  -0.5664211511611938,
 0.18177354335784912,  0.14776530861854553,  0.29254236817359924,  0.17831481993198395, -0.1894354224205017,  -0.2836195230484009,   -0.4065170884132385,
-0.14325398206710815,  0.17800962924957275,  0.7763587832450867,   0.5497004389762878,  -0.00946379080414772, -0.48568078875541687,  -0.022227048873901367,
-0.005903944373130798, 0.4351034462451935,   0.05010621249675751, -0.12799566984176636, -0.06675072759389877,  0.167253315448761,    -0.1653994619846344,
 0.21004730463027954,  0.2765181362628937,   0.5885812640190125,  -0.326379656791687,   -0.007390940561890602, 0.27159956097602844,  -0.043763305991888046,
-0.39229199290275574, -0.19412016868591309,  0.4250912666320801,   0.6105153560638428,  -0.06168382614850998, -0.5341082811355591,   -0.611929714679718,
 0.08125612139701843, -0.1779184639453888,   0.5319408774375916,  -0.23601730167865753,  0.22285249829292297, -0.32505497336387634,   0.2152460366487503,
 0.4679816663265228,   0.048206135630607605,-0.24099768698215485, -0.30208054184913635,  0.13667792081832886,  0.3552468717098236,   -0.12280546128749847,
-0.006191314198076725,-0.10851636528968811,  0.08330328017473221, -0.09545236080884933, -0.02249046228826046,  0.0003346469602547586,-0.12273653596639633,
-0.05594412609934807,  0.027804357931017876,-0.4045255482196808,  -0.18987023830413818, -0.0027474926318973303,0.30244430899620056,   0.2323288917541504,
-0.2729185223579407,   0.12836921215057373,  0.27967774868011475,  0.3031359016895294,   0.41273725032806396, -0.06173351779580116,   0.33845168352127075,
 0.26775869727134705, -0.2933143079280853,  -0.0485006645321846,   0.11777450144290924,  0.6205862760543823,  -0.07637807726860046,  -0.19466432929039001,
-0.3994691073894501,   0.15689416229724884, -0.11139731854200363, -0.2333720475435257,   0.2364773154258728,   0.30898618698120117,  -0.1263875812292099,
-0.231489360332489,    0.34536853432655334,  0.6001318097114563,  -0.44741731882095337,  0.07382357120513916, -0.019649405032396317, -0.1029537245631218,
 0.369470477104187,   -0.032077688723802567,-0.13972929120063782,  0.24549521505832672, -0.13091856241226196, -0.029257331043481827]

尝试#1 - n 维向量 → n 列

我的第一个想法是将 384 个

real
值存储在一个单独的表中,并使用原始行的键(垂直分区):

CREATE TABLE Embeddings (
   RowGUID uniquedientifier NOT NULL PRIMARY KEY,
   f1 real NOT NULL, 
   f2 real NOT NULL,
   f3 real NOT NULL,
   f4 real NOT NULL,
   f5 real NOT NULL,
   f6 real NOT NULL,
   f7 real NOT NULL,
   f8 real NOT NULL,
   f9 real NOT NULL,
   f10 real NOT NULL,
   ...snip...
   f384 real NOT NULL)
行GUID f1 f2 f3 f4 f5 f6 f7 ... f384
6ba7b814-9dad-11d1-80b4-00c04fd430c8 0.161391481757164 -0.23294533789157867 -0.5648667216300964 -0.3210797905921936 -0.03274689242243767 0.011770576238632202 -0.06612513959407806 ... -0.029257331043481827

这......有点......有效。但它很笨重。另外,我今天的向量恰好是 385 维的;但它们可能很快就会达到 1556 维,这超过了每个表的 SQL Server 最大值1,024 列。

Attempt#2 - n-dimensional → n-IEEE 32-bit floats → varbinary(n*4)

下一个想法是将 4 字节(32 位)的浮点数打包成一个

varbinary
列:

CREATE TABLE Embeddings (
   RowGUID uniquedientifier NOT NULL PRIMARY KEY,
   PackedVector varbinary(1516) NOT NULL -- 384 floats * 4 bytes = 1540 bytes
)
0x0000000100000002000000030000000400000005000000060000000700000008...0000017F
  \______/\______/\______/\______/\______/\______/\______/\______/   \______/
     f1      f2      f3      f4      f5      f6      f7      f8        f384

然后当需要读取每个

Single
时,使用
SUBSTRING
将 4 字节浮点数从 varbinary 中提取出来,然后将其转换为
real

DECLARE @f1 real = CAST(SUBSTRING(PackedVector, 0*4, 4) AS real);

除了两个缺点:

  • 缺点 #1:您不能将

    binary(4)
    转换为
    real
    (即使您可以将
    real
    转换为
    binary(4)
    ;只是不能反过来:

    也许可以使用

    decimal
    numeric
    解决它)。

  • 缺点#2:计算两个向量之间的欧氏距离的数学在概念上是有效的:

    DECLARE @target VARBINARY(1536) -- packed 384-dimensional vector
    
    SELECT TOP(10) RowGUID, SUM(POWER(CAST(SUBSTRING(Embedding, i*4+1, 4) AS real) - CAST(SUBSTRING(@target, i*4 + 1, 4) AS real), 2)) as distance
    FROM Embeddings
    CROSS APPLY (VALUES (0), (1), (2), ..., (383)) AS sequence(i) -- Fill in the values from 0 to 383
    GROUP BY RowGUID
    ORDER BY distance ASC
    

    但这会表现得很差(即使问题 #1 不存在)。

尝试 3 - 做 Joe Celko 做的事

很多年前,微软新闻组上有人有同样的问题

任何人都可以指点我参考或讨论存储的最佳方式 数据库中 120 到 480 个数字的向量?行似乎已经出来了 因为我们很快就会突破十亿行大关。一桌480 columns 太不规范了。单个 varchar(max) 列?这 似乎是目前最好的答案,除非有更有效的方法 存储它。

感谢任何帮助或意见,

然后--CELKO--回复:

我认为向量是一种特殊的数学结构,并且 您似乎在谈论某种清单。向量有一个固定的 维数等。这是一个猜测:

CREATE TABLE Vectors (
   vector_id CHAR(3) NOT NULL, --whatever
   dim_nbr   INTEGER NOT NULL,
   CHECK (dim_nbr BETWEEN 1 AND 480),
   PRIMARY KEY (vector_id, dim_nbr),
   dim_val   INTEGER NOT NULL
);

将向量的值放入行中:

嵌入

行GUID dimNumber dimValue
6ba7b814-9dad-11d1-80b4-00c04fd430c8 1 0.161391481757164
6ba7b814-9dad-11d1-80b4-00c04fd430c8 2 -0.23294533789157867
6ba7b814-9dad-11d1-80b4-00c04fd430c8 3 -0.5648667216300964
6ba7b814-9dad-11d1-80b4-00c04fd430c8 4 -0.3210797905921936
6ba7b814-9dad-11d1-80b4-00c04fd430c8 5 -0.03274689242243767
6ba7b814-9dad-11d1-80b4-00c04fd430c8 6 0.011770576238632202
6ba7b814-9dad-11d1-80b4-00c04fd430c8 7 -0.06612513959407806
... ... ...
6ba7b814-9dad-11d1-80b4-00c04fd430c8 384 -0.029257331043481827

这是可能最好的方法。

没有更好的了?

SQL Server 对向量的支持不是更好吗?我知道有 GEOSPATIAL/GEOGRAPHY 类型,但我收集到那些只适用于二维向量(例如 lattuitude+logitude)?难道不能滥用来解决问题吗?

由于目标是计算两个向量之间的欧几里得距离,是否有一种数据结构可以更好地实现数学运算? (varchar?xml?json?varbinary?变体?)

红利阅读

arrays sql-server vector word-embedding
© www.soinside.com 2019 - 2024. All rights reserved.