PSQL query returns fewer results the more joins it has

I have the following psql table. The real table has roughly 2 billion rows.

 id  word      lemma     pos              textid  country_genre     
 1  Stuffing   stuff      vvg             190568  AN         
 2  her        her        appge           190568  AN         
 3  key        key        nn1             190568  AN         
 4  into       into       ii              190568  AN         
 5  the        the        at              190568  AN         
 6  lock       lock       nn1             190568  AN         
 7  she        she        appge           190568  AN         
 8  pushed     push       vvd             190568  AN         
 9  her        her        appge           190568  AN         
10  way        way        nn1             190568  AN         
11  into       into       ii              190568  AN         
12  the        the        appge           190568  AN         
13  house      house      nn1             190568  AN         
14  .                     .               190568  AN         
15  She        she        appge           190568  AN         
16  had        have       vhd             190568  AN         
17  also       also       rr              190568  AN         
18  cajoled    cajole     vvd             190568  AN         
19  her        her        appge           190568  AN         
20  way        way        nn1             190568  AN         
21  into       into       ii              190568  AN         
22  the        the        at              190568  AN         
23  home       home       nn1             190568  AN         
24  .                     .               190568  AN         
..  ...        ...        ..              ...     ..

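I want to create the following table, which shows every occurrence of the "way" construction with its surrounding words side by side, plus some data from the "country_genre", "lemma" and "pos" columns.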

country_genre word   word       word       lemma      pos        word       word     word       word       word       lemma      pos        word       word       
AN         lock   she        pushed     push       vvd        her        way      into       the        house      house      nn1        .          she
AN         had    also       cajoled    cajole     vvd        her        way      into       the        home       home       nn1        .          A          
AN         tried  to         force      force      vvi        her        way      into       the        palace     palace     nn1        ,          officials  

I use the following code (thanks to Bohemian: https://stackoverflow.com/a/47496945/3957383!):

copy(

 SELECT
   c1.id, c1.country_genre, c1.textid, c1.wordid, c1.word,  c2.word, c3.word,  c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word

 FROM

 orderedflatcorpus AS c1
 JOIN orderedflatcorpus AS c2 ON c1.id + 1 = c2.id
 JOIN orderedflatcorpus AS c3 ON c1.id + 2 = c3.id 
 JOIN orderedflatcorpus AS c4 ON c1.id + 3 = c4.id
 JOIN orderedflatcorpus AS c5 ON c1.id + 4 = c5.id
 JOIN orderedflatcorpus AS c6 ON c1.id + 5 = c6.id
 JOIN orderedflatcorpus AS c7 ON c1.id + 6 = c7.id
 JOIN orderedflatcorpus AS c8 ON c1.id + 7 = c8.id
 JOIN orderedflatcorpus AS c9 ON c1.id + 8 = c9.id
 JOIN orderedflatcorpus AS c10 ON c1.id + 9 = c10.id
 JOIN orderedflatcorpus AS c11 ON c1.id + 10 = c11.id

 WHERE

 c4.pos LIKE 'vv%'
 AND c5.pos = 'appge'
 AND c6.word = 'way'
 AND c7.pos LIKE 'i%'
 AND c8.word = 'the'
 AND c9.pos LIKE 'n%'
 )

 TO

 '/home/postgres/Results/OUTPUT.csv'
 DELIMITER E'\t'
 csv header;

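This query returns 18706 matching constructions.

However, if I use the following code, which extracts more context (21 words instead of 11) but is otherwise the same as the previous query, something worrying happens: I only get 18555 matching constructions.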

 copy(
 SELECT c1.id, c1.country_genre, c1.textid, c1.wordid, c1.word, c1.pos, c2.word, c2.pos, c3.word, c3.pos, c4.word, c4.pos, c5.word, c5.pos, c6.word, c6.pos, 
 c7.word, c7.pos, c8.word, c8.pos, c8.lemma, c9.word, c9.pos, c10.word, c10.pos, c11.word, c11.pos, c12.word, c12.pos, c13.word, c13.pos, c13.lemma, c14.word, 
 c14.pos, c15.word, c15.pos, c16.word, c16.pos, c17.word, c17.pos, c18.word, c18.pos, c19.word, c19.pos, c20.word, c20.pos, c21.word, c21.pos 

 FROM 

 orderedflatcorpus AS c1 
 JOIN orderedflatcorpus AS c2 ON c1.id + 1 = c2.id 
 JOIN orderedflatcorpus AS c3 ON c1.id + 2 = c3.id 
 JOIN orderedflatcorpus AS c4 ON c1.id + 3 = c4.id 
 JOIN orderedflatcorpus AS c5 ON c1.id + 4 = c5.id 
 JOIN orderedflatcorpus AS c6 ON c1.id + 5 = c6.id 
 JOIN orderedflatcorpus AS c7 ON c1.id + 6 = c7.id 
 JOIN orderedflatcorpus AS c8 ON c1.id + 7 = c8.id 
 JOIN orderedflatcorpus AS c9 ON c1.id + 8 = c9.id 
 JOIN orderedflatcorpus AS c10 ON c1.id + 9 = c10.id 
 JOIN orderedflatcorpus AS c11 ON c1.id + 10 = c11.id 
 JOIN orderedflatcorpus AS c12 ON c1.id + 11 = c12.id 
 JOIN orderedflatcorpus AS c13 ON c1.id + 12 = c13.id 
 JOIN orderedflatcorpus AS c14 ON c1.id + 13 = c14.id 
 JOIN orderedflatcorpus AS c15 ON c1.id + 14 = c15.id 
 JOIN orderedflatcorpus AS c16 ON c1.id + 15 = c16.id 
 JOIN orderedflatcorpus AS c17 ON c1.id + 16 = c17.id 
 JOIN orderedflatcorpus AS c18 ON c1.id + 17 = c18.id 
 JOIN orderedflatcorpus AS c19 ON c1.id + 18 = c19.id 
 JOIN orderedflatcorpus AS c20 ON c1.id + 19 = c20.id 
 JOIN orderedflatcorpus AS c21 ON c1.id + 20 = c21.id 

 WHERE 

 c8.pos LIKE 'vv%' 
 AND c9.pos = 'appge' 
 AND c10.word = 'way' 
 AND c11.pos LIKE 'i%' 
 AND c12.word = 'the' 
 AND c13.pos LIKE 'n%' 
 ) 
 TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header;

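I looked at the rows that are missing from the second query's output, but I could not detect any pattern in what was left out.

Does anyone know what might be going on here? Thanks!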

sql postgresql join data-loss
2 Answers
0 votes

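In Postgres, a plain JOIN is equivalent to INNER JOIN.

In other words, a row only appears in the result if every ON predicate in your query finds a match. Each JOIN you add brings another ON predicate that can fail to match, which reduces the number of results.

Try changing JOIN to LEFT JOIN (a.k.a. LEFT OUTER JOIN) to see whether you get more results.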
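To see the effect in isolation, here is a minimal sketch on a hypothetical six-row stand-in for orderedflatcorpus. The inline `corpus` rows and the choice of 'the' as the pattern word are made up for illustration. It counts matches for the same pattern with a short inner-join window, a longer inner-join window, and a longer window whose extra row is LEFT JOINed.

```sql
-- Hypothetical six-row stand-in for orderedflatcorpus (illustration only).
WITH corpus(id, word) AS (
    VALUES (1, 'her'), (2, 'way'), (3, 'into'),
           (4, 'the'), (5, 'house'), (6, '.')
)
SELECT
    -- Pattern word plus two following rows, all required.
    (SELECT count(*)
       FROM corpus c1
       JOIN corpus c2 ON c1.id + 1 = c2.id
       JOIN corpus c3 ON c1.id + 2 = c3.id
      WHERE c1.word = 'the') AS short_window_inner,
    -- One more required row: id 7 does not exist, so the match is dropped.
    (SELECT count(*)
       FROM corpus c1
       JOIN corpus c2 ON c1.id + 1 = c2.id
       JOIN corpus c3 ON c1.id + 2 = c3.id
       JOIN corpus c4 ON c1.id + 3 = c4.id
      WHERE c1.word = 'the') AS long_window_inner,
    -- Same extra row, but optional: the match survives with NULLs for c4.
    (SELECT count(*)
       FROM corpus c1
       JOIN corpus c2 ON c1.id + 1 = c2.id
       JOIN corpus c3 ON c1.id + 2 = c3.id
       LEFT JOIN corpus c4 ON c1.id + 3 = c4.id
      WHERE c1.word = 'the') AS long_window_left;
```

On this toy data the three counts are 1, 0 and 1: the only 'the' is at id 4, and the longer inner-join window needs an id 7 that is not there.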

Without seeing your sample data and examples of the rows missing from the result set, it is hard to say what else the problem might be.


0 votes

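Pirho's answer is correct, as far as it goes. However, merely changing the joins to outer joins may not solve your problem.

First, the WHERE conditions of the two queries are different, so I see no reason to assume that they will return the same result set.

The second query looks for 21 words in a row, not 19 and not 13. Hence, if the pattern is at the end of a document, it will not be found.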
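The same loss happens wherever any row in the window is missing, not only at the very end of the data. In the second query the pattern moved from c4 to c9 over to c8 to c13, so every match now needs 7 rows before it and 8 rows after it, compared with 3 before and 2 after in the first query. Assuming the id column might have gaps (the question does not say either way), a sketch like the following would list them. The inline `corpus` is a hypothetical stand-in, and on the real table this implies a full scan.

```sql
-- Hypothetical ids with two gaps: 4 is missing, and so are 7 and 8.
WITH corpus(id) AS (
    VALUES (1), (2), (3), (5), (6), (9)
)
SELECT id               AS gap_after,
       next_id - id - 1 AS missing_ids
  FROM (
        -- Pair every id with the next id that actually exists.
        SELECT id,
               lead(id) OVER (ORDER BY id) AS next_id
          FROM corpus
       ) AS t
 WHERE next_id - id > 1
 ORDER BY id;
```

On the toy data this returns two rows, (3, 1) and (6, 2). Any match whose 21-row window straddles such a gap would be dropped by the inner joins, while a match whose 11-row window avoids the gap would still be found.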

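Based on your conditions, 13 words need to be in a row (c13 is the last alias your conditions reference, which gives the count of 13). The remaining eight rows, c14 through c21, are optional, I'm guessing. This requires a mixture of inner and outer joins: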

FROM orderedflatcorpus c1 JOIN
     orderedflatcorpus c2
     ON c1.id + 1 = c2.id JOIN
     orderedflatcorpus c3
     ON c1.id + 2 = c3.id JOIN
     orderedflatcorpus c4
     ON c1.id + 3 = c4.id JOIN
     orderedflatcorpus c5
     ON c1.id + 4 = c5.id JOIN
     orderedflatcorpus c6
     ON c1.id + 5 = c6.id JOIN
     orderedflatcorpus c7
     ON c1.id + 6 = c7.id JOIN
     orderedflatcorpus c8
     ON c1.id + 7 = c8.id JOIN
     orderedflatcorpus c9
     ON c1.id + 8 = c9.id JOIN
     orderedflatcorpus c10
     ON c1.id + 9 = c10.id JOIN
     orderedflatcorpus c11
     ON c1.id + 10 = c11.id JOIN
     orderedflatcorpus c12
     ON c1.id + 11 = c12.id JOIN
     orderedflatcorpus c13
     ON c1.id + 12 = c13.id LEFT JOIN
     orderedflatcorpus c14
     ON c1.id + 13 = c14.id LEFT JOIN
     orderedflatcorpus c15
     ON c1.id + 14 = c15.id LEFT JOIN
     orderedflatcorpus c16
     ON c1.id + 15 = c16.id LEFT JOIN
     orderedflatcorpus c17
     ON c1.id + 16 = c17.id LEFT JOIN
     orderedflatcorpus c18
     ON c1.id + 17 = c18.id LEFT JOIN
     orderedflatcorpus c19
     ON c1.id + 18 = c19.id LEFT JOIN
     orderedflatcorpus c20
     ON c1.id + 19 = c20.id LEFT JOIN
     orderedflatcorpus c21
     ON c1.id + 20 = c21.id 
WHERE c8.pos LIKE 'vv%' AND
      c9.pos = 'appge' AND
      c10.word = 'way' AND
      c11.pos LIKE 'i%' AND
      c12.word = 'the' AND
      c13.pos LIKE 'n%' ;

Next, I suspect that country_genre and/or textid should be part of the join conditions.
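As a rough sketch of that last point, each ON clause can carry a textid comparison so that a window never runs across two texts. The four-row `corpus` below is hypothetical, and whether context should be allowed to cross a text boundary depends on what you want to count.

```sql
-- Hypothetical stand-in: ids 1-2 belong to text 100, ids 3-4 to text 200.
WITH corpus(id, word, textid) AS (
    VALUES (1, 'her',  100), (2, 'way', 100),
           (3, 'into', 200), (4, 'the', 200)
)
SELECT c1.id,
       c1.word AS word_1,
       c2.word AS word_2,
       c3.word AS word_3
  FROM corpus c1
  JOIN corpus c2                      -- required row, same text
    ON c1.id + 1 = c2.id
   AND c1.textid = c2.textid
  LEFT JOIN corpus c3                 -- optional context, same text only
    ON c1.id + 2 = c3.id
   AND c1.textid = c3.textid
 WHERE c2.word = 'way';
```

This returns one row, (1, her, way, NULL): id 3 exists, but it belongs to a different text, so the optional context column stays NULL instead of pulling in a word from the next document. Putting the same textid comparison on a required inner join would drop such a match altogether.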
