Extract words using the Linux command line


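I have a corpus file that I need to compare "vertically" against another file, listing the unique strings that remain once the matching suffixes are removed. For example: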

exclude.txt

ed
s
ing

The second file is:

corpus.txt

worked
working
works
tested
tests
find
found

Expected output:

work/ed,ing,s
test/ed,s

The other words (find and found) may optionally be returned as well.

I tried this:

import re

with open('/home/corpus.txt') as corpus:
    for i in corpus:
        i = i.strip('\n')
        # exclude.txt is re-read for every line of the corpus
        with open('/home/exclude.txt') as exclude:
            for x in exclude:
                x = x.strip('\n')
                if i.endswith(x):
                    print(x, i, re.sub(r'(.*)'+x, r'\1', i)+'/'+x)

The output looks like this:

ed worked work/ed
ing working work/ing
s works work/s
ed tested test/ed
s tests test/s

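As you can see, I'm not getting the expected output: each match is printed on its own line instead of being grouped by stem. This Python code is also impractical when the corpus file is large.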

python awk sed grep
1 answer

TXR Lisp:

(let ((suffix-regex (flow "exclude.txt"
                      file-get-lines
                      (cons 'or)
                      regex-compile))
      (wh (hash)))
  (each ((w (file-get-lines "corpus.txt")))
    (iflet ((rng (r$ suffix-regex w)))
      (push [w rng] [wh [w 0..(from rng)]])))
  (dohash (w suffs wh)
    (put-line `@w/@{(reverse suffs) ","}`)))

$ txr condense.tl
work/ed,ing,s
test/ed,s
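
Since the question was written in Python, here is a rough Python sketch of the same idea: build one end-anchored regex from the suffix list, then group each matching word under its stem. The function name `condense` and the decision to skip words that consist of nothing but a suffix are my own assumptions, not something the question specifies.

```python
import re
from collections import defaultdict


def condense(suffixes, words):
    """Return 'stem/suf1,suf2' strings for words ending in a listed suffix."""
    cleaned = [s.strip() for s in suffixes if s.strip()]
    if not cleaned:
        return []
    # One alternation anchored at the end of the word, so each word is
    # checked once instead of once per suffix.
    suffix_re = re.compile("(?:" + "|".join(map(re.escape, cleaned)) + ")$")

    groups = defaultdict(list)  # stem -> suffixes in order of appearance
    for word in words:
        word = word.strip()
        # search() returns the leftmost match; every match ends at the end
        # of the word, so the leftmost one is the longest listed suffix.
        match = suffix_re.search(word)
        if match and match.start() > 0:  # assumption: skip empty stems
            groups[word[:match.start()]].append(match.group())
    return [f"{stem}/{','.join(found)}" for stem, found in groups.items()]


corpus = ["worked", "working", "works", "tested", "tests", "find", "found"]
for line in condense(["ed", "s", "ing"], corpus):
    print(line)
```

With the sample data this prints work/ed,ing,s and test/ed,s. Because every line is stripped inside the function, open file objects for exclude.txt and corpus.txt can be passed in directly. The corpus is then read one line at a time and only the stem table is held in memory.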