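I want to store a large n-dimensional vector (for example, an embedding vector) in SQL Server as metadata associated with another row.
In this example it is a 384-dimensional vector, such as: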
[0.161391481757164, -0.23294533789157867, -0.5648667216300964, -0.3210797905921936, -0.03274689242243767, 0.011770576238632202, -0.06612513959407806,
-0.14662186801433563, -0.17081189155578613, 0.2879514992237091, -0.1932784765958786, 0.009713868610560894, 0.23330552875995636, 0.03551964834332466,
-0.20526213943958282, 0.06445703655481339, -0.3146169185638428, 0.5788811445236206, 0.09118294715881348, -0.0048667509108781815,-0.16503077745437622,
0.25162017345428467, -0.36395764350891113, -0.34742429852485657, 0.0526515394449234, 0.08912508934736252, 0.48464590311050415, -0.04224267974495888,
0.32445403933525085, -0.6847451329231262, -0.20959551632404327, -0.027657458558678627, 0.20439794659614563, 0.6859520077705383, -0.4988805055618286,
-0.26204171776771545, -0.18842612206935883, 0.07067661732435226, 0.02633148804306984, 0.03182782977819443, 0.28935596346855164, -0.0016041728667914867,
0.14609676599502563, -0.36272501945495605, 0.10288259387016296, -0.3651926815509796, -0.3823530375957489, 0.14052163064479828, 0.006418740376830101,
0.11741586774587631, -0.6509529948234558, -0.15997739136219025, -0.42837604880332947, 0.12351743131875992, 0.0485026054084301, 0.24820692837238312,
0.46972623467445374, -0.47954055666923523, -0.5238635540008545, -0.3543052673339844, 0.22626525163650513, 0.18406584858894348, 0.6463921070098877,
0.11894208937883377, -0.07143554836511612, 0.004256516695022583, 0.10088140517473221, 0.3335645794868469, 0.16905969381332397, 0.056856121867895126,
0.11355260014533997, 0.3708053231239319, -0.7484591603279114, 0.17503942549228668, -0.3249044418334961, 0.5901510715484619, 0.41506800055503845,
0.05852462351322174, 0.5119204521179199, 0.2750142216682434, -0.2058306783437729, 0.8199670314788818, 0.16698679327964783, -0.1572146713733673,
0.014733579009771347 ,0.0168467964977026, 0.4688740372657776, -0.07839230448007584, 0.49326324462890625, -0.29934313893318176, 0.21525822579860687,
0.1396997570991516, -0.3420834243297577, -0.5197309851646423, 0.10842061042785645, -0.0338996984064579, 0.35846689343452454, -0.1660442352294922,
0.15579357743263245, 0.015674782916903496,-0.8510578870773315, -0.07501569390296936, -0.1791406124830246, 0.14926102757453918, -0.2269722819328308,
0.42619261145591736, 0.09489753842353821, -0.13341256976127625, 0.3312526345252991, 0.22534190118312836, 0.0679713636636734, 0.17042726278305054,
0.14300595223903656, -0.06654901057481766, -0.2170567661523819, -0.454984188079834, -0.5516679286956787, -0.10752955824136734, -0.05743071809411049,
0.32108309864997864, -0.5445901155471802, -0.43162357807159424, 0.08207866549491882, 0.0664522647857666, 0.4478979706764221, 0.2190810590982437,
-0.05722910910844803, -0.0932786613702774, 0.01758035272359848, 0.16166797280311584, 0.44004616141319275, -0.21601708233356476, 0.43121641874313354,
0.32022470235824585, -0.014045504853129387,-0.24948528409004211, -0.4389941990375519, 0.3816317319869995, -0.5687862038612366, 0.1088542640209198,
-0.403241366147995, 0.08174201846122742, 0.21350793540477753, 0.2396722435951233, 0.4973253607749939, 0.31202447414398193, -0.5260801315307617,
-0.3351263403892517, -0.04100760444998741, 0.6609364151954651, -0.2047063261270523, 0.19385716319084167, -0.5661329627037048, -0.27058693766593933,
-0.1637117713689804, 0.30641692876815796, -0.08894442766904831, -0.052735116332769394,-0.13839660584926605, -0.6741533875465393, 0.05569711700081825,
-0.04354270175099373, 0.20251914858818054, 0.24813368916511536, 0.1719648838043213, 0.26782000064849854, 0.3137670159339905, 0.18599936366081238,
0.23953016102313995, 0.17769533395767212, 0.46293920278549194, -0.19122551381587982, -0.5595004558563232, 0.09755659103393555, 0.3125424385070801,
-0.5813230276107788, -1.0698442459106445, -0.09045401215553284, -0.08948248624801636, -0.051830895245075226,-0.0001317809073952958,-0.08400193601846695,
0.25725823640823364, -0.10135184973478317, 0.07884480804204941, 0.2091679722070694, 0.3950233459472656, 0.2745698094367981, -0.872776448726654,
-0.16590780019760132, 0.4308463931083679, -0.24375642836093903, -0.02120584435760975, 0.05213866010308266, -0.19898287951946259, -0.5506985187530518,
0.40167248249053955, 0.1640072464942932, -0.010167916305363178, 0.14038121700286865, 0.4958030879497528, -0.7259818315505981, -0.24387206137180328,
0.08528701961040497, 0.03415993973612785, -0.16687284409999847, 0.3804749548435211, -0.08561687171459198, -0.2752263844013214, 0.5883951783180237,
-0.3283255994319916, -0.12724250555038452, 0.08751262724399567, -0.44206979870796204, -0.11079336702823639, -0.16302113234996796, 0.11022322624921799,
-0.09404750168323517, -0.256179541349411, 0.20473307371139526, 0.41829538345336914, -0.1095203086733818, 0.02342342585325241, -0.18814104795455933,
-0.2540932893753052, 0.48397907614707947, 0.03593514859676361, -0.089835524559021, -0.6478171944618225, -0.1757517009973526, 0.0672023594379425,
0.0695127546787262, -0.6398074626922607, -0.03958022966980934, -0.10351496934890747, 0.22433893382549286, 0.6756673455238342, -0.2924160957336426,
0.17503827810287476, 0.12915058434009552, -0.239552840590477, 0.15498916804790497, -0.4730042815208435, -0.12289212644100189, -0.004052990116178989,
0.11593572050333023, -0.1965983510017395, 0.5210273265838623, -0.18184830248355865, 0.2579534947872162, -0.1920309066772461, -0.389960378408432,
0.04139290377497673, -0.11638019979000092, -0.10620912909507751, -0.5321099162101746, 0.13135096430778503, -0.07761876285076141, -0.0830138698220253,
-0.01572849042713642, 0.31080499291419983, -0.41445496678352356, 0.1609737128019333, 0.5787453651428223, -0.05459209159016609, 0.1318219006061554,
-0.06957206130027771, 0.15152350068092346, -0.07094550132751465, -0.196294367313385, 0.12644843757152557, 0.23419199883937836, 0.5845456719398499,
-0.19989481568336487, -0.19607964158058167, -0.19692276418209076, -0.08633144199848175, -0.004551170393824577, 0.09362921118736267, -0.14167727530002594,
-0.14917594194412231, 0.31781134009361267, 0.18779256939888, 0.42154577374458313, -0.20578211545944214, 0.14142100512981415, -0.5664211511611938,
0.18177354335784912, 0.14776530861854553, 0.29254236817359924, 0.17831481993198395, -0.1894354224205017, -0.2836195230484009, -0.4065170884132385,
-0.14325398206710815, 0.17800962924957275, 0.7763587832450867, 0.5497004389762878, -0.00946379080414772, -0.48568078875541687, -0.022227048873901367,
-0.005903944373130798, 0.4351034462451935, 0.05010621249675751, -0.12799566984176636, -0.06675072759389877, 0.167253315448761, -0.1653994619846344,
0.21004730463027954, 0.2765181362628937, 0.5885812640190125, -0.326379656791687, -0.007390940561890602, 0.27159956097602844, -0.043763305991888046,
-0.39229199290275574, -0.19412016868591309, 0.4250912666320801, 0.6105153560638428, -0.06168382614850998, -0.5341082811355591, -0.611929714679718,
0.08125612139701843, -0.1779184639453888, 0.5319408774375916, -0.23601730167865753, 0.22285249829292297, -0.32505497336387634, 0.2152460366487503,
0.4679816663265228, 0.048206135630607605,-0.24099768698215485, -0.30208054184913635, 0.13667792081832886, 0.3552468717098236, -0.12280546128749847,
-0.006191314198076725,-0.10851636528968811, 0.08330328017473221, -0.09545236080884933, -0.02249046228826046, 0.0003346469602547586,-0.12273653596639633,
-0.05594412609934807, 0.027804357931017876,-0.4045255482196808, -0.18987023830413818, -0.0027474926318973303,0.30244430899620056, 0.2323288917541504,
-0.2729185223579407, 0.12836921215057373, 0.27967774868011475, 0.3031359016895294, 0.41273725032806396, -0.06173351779580116, 0.33845168352127075,
0.26775869727134705, -0.2933143079280853, -0.0485006645321846, 0.11777450144290924, 0.6205862760543823, -0.07637807726860046, -0.19466432929039001,
-0.3994691073894501, 0.15689416229724884, -0.11139731854200363, -0.2333720475435257, 0.2364773154258728, 0.30898618698120117, -0.1263875812292099,
-0.231489360332489, 0.34536853432655334, 0.6001318097114563, -0.44741731882095337, 0.07382357120513916, -0.019649405032396317, -0.1029537245631218,
0.369470477104187, -0.032077688723802567,-0.13972929120063782, 0.24549521505832672, -0.13091856241226196, -0.029257331043481827]
My first thought was to store the 384 real values in a separate table, keyed on the original row's key (vertical partitioning):
CREATE TABLE Embeddings (
RowGUID uniqueidentifier NOT NULL PRIMARY KEY,
f1 real NOT NULL,
f2 real NOT NULL,
f3 real NOT NULL,
f4 real NOT NULL,
f5 real NOT NULL,
f6 real NOT NULL,
f7 real NOT NULL,
f8 real NOT NULL,
f9 real NOT NULL,
f10 real NOT NULL,
...snip...
f384 real NOT NULL)
RowGUID | f1 | f2 | f3 | f4 | f5 | f6 | f7 | ... | f384 |
---|---|---|---|---|---|---|---|---|---|
6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 0.161391481757164 | -0.23294533789157867 | -0.5648667216300964 | -0.3210797905921936 | -0.03274689242243767 | 0.011770576238632202 | -0.06612513959407806 | ... | -0.029257331043481827 |
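This... sort of... works. But it is clunky. Also, my vectors today happen to be 384-dimensional, but they could soon reach 1536 dimensions, which exceeds SQL Server's maximum of 1,024 columns per table.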
The next idea was to pack the 4-byte (32-bit) floats into a varbinary column:
CREATE TABLE Embeddings (
RowGUID uniqueidentifier NOT NULL PRIMARY KEY,
PackedVector varbinary(1536) NOT NULL -- 384 floats * 4 bytes = 1536 bytes
)
0x0000000100000002000000030000000400000005000000060000000700000008...0000017F
  \______/\______/\______/\______/\______/\______/\______/\______/   \______/
     f1      f2      f3      f4      f5      f6      f7      f8        f384
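To make the packed layout concrete, here is a minimal T-SQL sketch of how such a value could be built from individual real values. It uses only the first three components of the example vector; the variable name @packed is hypothetical. It relies on the real-to-binary(4) direction, which SQL Server does allow.

```sql
-- Sketch only: pack three real values into one varbinary by concatenation.
-- Each CAST(... AS binary(4)) yields the 4 bytes of one 32-bit float.
DECLARE @packed varbinary(1536) =
      CAST(CAST(0.161391481757164    AS real) AS binary(4))
    + CAST(CAST(-0.23294533789157867 AS real) AS binary(4))
    + CAST(CAST(-0.5648667216300964  AS real) AS binary(4));

-- DATALENGTH reports the byte count: 3 floats * 4 bytes each.
SELECT @packed AS PackedVector, DATALENGTH(@packed) AS ByteCount;
```

In practice the packing would more likely be done in client code before the insert, since writing 384 casts by hand is impractical.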
Then, when it comes time to read each Single, use SUBSTRING to pull the 4-byte float out of the varbinary and cast it to real:
DECLARE @f1 real = CAST(SUBSTRING(PackedVector, 0*4 + 1, 4) AS real); -- SUBSTRING positions are 1-based
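Except for two downsides:
Downside #1: SQL Server does not support converting binary(4) to real (even though you can convert real to binary(4); just not the other way around. Maybe it can be worked around using decimal or numeric).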
Downside #2: The math to compute the Euclidean distance between two vectors works in concept:
DECLARE @target varbinary(1536); -- packed 384-dimensional vector
SELECT TOP(10) RowGUID,
    -- Sum of squared differences; it orders rows the same way as the true Euclidean distance
    SUM(POWER(CAST(SUBSTRING(PackedVector, i*4 + 1, 4) AS real) - CAST(SUBSTRING(@target, i*4 + 1, 4) AS real), 2)) AS distance
FROM Embeddings
CROSS APPLY (VALUES (0), (1), (2), ..., (383)) AS sequence(i) -- Fill in the values from 0 to 383
GROUP BY RowGUID
ORDER BY distance ASC;
But this would perform terribly (even if problem #1 didn't exist).
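For downside #1, one possible workaround (a different technique from the decimal/numeric idea mentioned above) is to go through int, since binary(4) to int is a permitted conversion, and then rebuild the float from its IEEE 754 sign, exponent, and mantissa bits. The sketch below assumes the 4 bytes hold a big-endian IEEE 754 single, as produced by casting real to binary(4). The variable names are hypothetical, and infinities and NaN are deliberately not handled.

```sql
-- Sketch only: decode a 32-bit IEEE 754 float from binary(4) using integer math.
DECLARE @bits binary(4) = CAST(CAST(0.15625 AS real) AS binary(4));
DECLARE @i int = CAST(@bits AS int);            -- binary(4) -> int is allowed

-- Bit 31 is the sign; a set top bit makes the int negative.
DECLARE @sign float = CASE WHEN @i < 0 THEN -1e0 ELSE 1e0 END;
-- Bits 30..23 are the biased exponent (mask 0x7F800000, then shift right 23).
DECLARE @exponent int = (@i & 2139095040) / 8388608;
-- Bits 22..0 are the fraction (mask 0x007FFFFF).
DECLARE @mantissa int = @i & 8388607;

SELECT CAST(
    CASE
        WHEN @exponent = 0   -- zero or subnormal: no implicit leading 1
            THEN @sign * (@mantissa / 8388608e0) * POWER(2e0, -126)
        ELSE                 -- normal number: implicit leading 1, bias 127
            @sign * (1e0 + @mantissa / 8388608e0) * POWER(2e0, @exponent - 127)
    END AS real) AS decoded;
```

This removes the conversion error but does nothing for downside #2: doing this arithmetic 384 times per row is still likely to be slow.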
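Many years ago, someone on the Microsoft newsgroups had the same question:
Can anyone point me to a reference or discuss the best way to store a vector of 120 to 480 numbers in the database? Rows seem to be out, since we would quickly exceed the billion-row mark. A table with 480 columns is too unnormalized. A single varchar(max) column? That seems to be the best answer at the moment, unless there is a more efficient way to store it.
Thanks for any help or comments,
And --CELKO-- replied:
I thought that a vector was a particular kind of mathematical structure, and you seem to be talking about some kind of list. Vectors have a fixed number of dimensions, etc. Here is a guess: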
CREATE TABLE Vectors (
    vector_id CHAR(3) NOT NULL, --whatever
    dim_nbr INTEGER NOT NULL,
    CHECK (dim_nbr BETWEEN 1 AND 480),
    PRIMARY KEY (vector_id, dim_nbr),
    dim_val INTEGER NOT NULL
);
Which puts the values of the vector into rows:
Embeddings
RowGUID | dimNumber | dimValue |
---|---|---|
6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 1 | 0.161391481757164 |
6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 2 | -0.23294533789157867 |
6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 3 | -0.5648667216300964 |
6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 4 | -0.3210797905921936 |
6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 5 | -0.03274689242243767 |
6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 6 | 0.011770576238632202 |
6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 7 | -0.06612513959407806 |
... | ... | ... |
6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 384 | -0.029257331043481827 |
This may be the best approach.
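With the row-per-dimension layout, the distance calculation no longer needs any binary unpacking. It becomes a join on the dimension number followed by an aggregate. The following is a sketch under assumptions: the column types and the @target table variable are hypothetical, chosen to match the table shown above, and the query vector is assumed to be loaded one row per dimension.

```sql
-- Sketch only: one row per (vector, dimension), as in the table above.
CREATE TABLE Embeddings (
    RowGUID   uniqueidentifier NOT NULL,
    dimNumber smallint         NOT NULL,
    dimValue  real             NOT NULL,
    PRIMARY KEY (RowGUID, dimNumber)
);

-- The query vector, stored in the same shape.
DECLARE @target TABLE (
    dimNumber smallint NOT NULL PRIMARY KEY,
    dimValue  real     NOT NULL
);

-- Ten nearest stored vectors by Euclidean distance.
SELECT TOP (10)
    e.RowGUID,
    SQRT(SUM(SQUARE(e.dimValue - t.dimValue))) AS distance
FROM Embeddings AS e
JOIN @target AS t
    ON t.dimNumber = e.dimNumber
GROUP BY e.RowGUID
ORDER BY distance ASC;
```

The trade-off is row count: every stored vector becomes 384 rows, so this still scans the whole table for each search.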
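Doesn't SQL Server have better support for vectors? I know there are the GEOSPATIAL/GEOGRAPHY types, but I gather those only work for two-dimensional vectors (e.g. latitude + longitude)? Couldn't they be abused to solve the problem?
Since the goal is to compute the Euclidean distance between two vectors, is there a data structure that makes the math easier? (varchar? xml? json? varbinary? variant?)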