I have a corpus file that I need to compare "vertically" against another file and list the unique strings that remain. For example:
exclude.txt:
ed
s
ing
The second file is:
corpus.txt:
worked
working
works
tested
tests
find
found
Expected output:
work/ed,ing,s
test/ed,s
The other words (find and found) may optionally be returned as well.
I tried this:
import re

with open('/home/corpus.txt') as corpus:
    for i in corpus:
        i = i.strip('\n')
        # exclude.txt is reopened and rescanned for every corpus line
        with open('/home/exclude.txt') as exclude:
            for x in exclude:
                x = x.strip('\n')
                if i.endswith(x):
                    print(x, i, re.sub(r'(.*)' + x, r'\1', i) + '/' + x)
The output looks like this:
ed worked work/ed
ing working work/ing
s works work/s
ed tested test/ed
s tests test/s
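As you can see, I am not getting the expected output: each suffix is printed on its own line instead of being grouped under its stem. This Python code is also impractical if the corpus file is large.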
In TXR Lisp:
(let ((suffix-regex (flow "exclude.txt"
                      file-get-lines
                      (cons 'or)
                      regex-compile))
      (wh (hash)))
  (each ((w (file-get-lines "corpus.txt")))
    (iflet ((rng (r$ suffix-regex w)))
      (push [w rng] [wh [w 0..(from rng)]])))
  (dohash (w suffs wh)
    (put-line `@w/@{(reverse suffs) ","}`)))
$ txr condense.tl
work/ed,ing,s
test/ed,s
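For anyone who would rather stay in Python, the same idea (compile all suffixes into one end-anchored regex, then group the matched suffixes by stem in a hash) can be sketched roughly as below. This is only a minimal sketch, not a port of the TXR code: the names condense, stems and leftovers are made up for illustration, and the sample lists stand in for the two files from the question.

```python
import re
from collections import defaultdict


def condense(words, suffixes):
    """Return ({stem: [suffix, ...]}, [words that end in no listed suffix])."""
    suffixes = [s for s in suffixes if s]  # an empty suffix would match every word
    if not suffixes:
        return {}, list(words)
    # One compiled alternation anchored at the end of the word, e.g. (?:ed|s|ing)$
    pattern = re.compile("(?:" + "|".join(re.escape(s) for s in suffixes) + ")$")
    stems = defaultdict(list)
    leftovers = []
    for word in words:
        match = pattern.search(word)
        if match:
            # search() reports the leftmost match, so the longest listed suffix wins
            stems[word[:match.start()]].append(match.group())
        else:
            leftovers.append(word)
    return dict(stems), leftovers


# Sample data from the question; for real files, pass stripped lines instead,
# e.g. (line.strip() for line in open("corpus.txt")).
corpus = ["worked", "working", "works", "tested", "tests", "find", "found"]
exclude = ["ed", "s", "ing"]

stems, leftovers = condense(corpus, exclude)
for stem, found in stems.items():
    print(stem + "/" + ",".join(found))
print("unmatched:", ",".join(leftovers))
```

On the sample data this prints work/ed,ing,s and test/ed,s, followed by a line listing find and found. The corpus is read once and every word is tested against a single compiled pattern, so exclude.txt is not rescanned per word, which should matter for a large corpus.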