Extract words using the Linux command line

Question

I have a corpus file that I need to compare "vertically" against another file, listing the unique strings that remain. For example:

exclude.txt:

ed
s
ing

The second file, corpus.txt:

worked
working
works
tested
tests
find
found

Expected output:

work/ed,ing,s
test/ed,s

The other words (find and found) may optionally be returned as well.

I tried something like this:

import re

with open ('/home/corpus.txt') as corpus:
    for i in corpus:
        i = i.strip('\n')
        with open ('/home/exclude.txt') as exclude:
            for x in exclude:
                x = x.strip('\n')
                if i.endswith(x):
                    print (x, i, re.sub(r'(.*)'+x, r'\1', i)+'/'+x)

The output looks like this:

ed worked work/ed
ing working work/ing
s works work/s
ed tested test/ed
s tests test/s

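As you can see, I don't get the expected output. This Python code is also impractical for a large corpus file, because it re-reads exclude.txt for every word in the corpus.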

Answer 1

TXR Lisp:

(let ((suffix-regex (flow "exclude.txt"
                      file-get-lines
                      (cons 'or)
                      regex-compile))
      (wh (hash)))
  (each ((w (file-get-lines "corpus.txt")))
    (iflet ((rng (r$ suffix-regex w)))
      (push [w rng] [wh [w 0..(from rng)]])))
  (dohash (w suffs wh)
    (put-line `@w/@{(reverse suffs) ","}`)))

$ txr condense.tl
work/ed,ing,s
test/ed,s
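
For readers who would rather stay in Python, the same idea (compile all suffixes into one regex, then group words by stem in a single pass) could be sketched roughly as below. This is not a translation of the TXR code, only an approximation of its approach; the function name condense and the in-memory sample lists are assumptions made for illustration.

```python
import re
from collections import defaultdict


def condense(words, suffixes):
    """Group words by stem, collecting the suffix stripped from each word.

    `condense` is a hypothetical helper name, not part of any library.
    """
    suffixes = [s for s in suffixes if s]  # ignore blank lines
    if not suffixes:
        return []
    # One alternation of all suffixes, anchored at the end of the word.
    # The lazy stem group makes the regex settle on the shortest stem,
    # i.e. the longest suffix that matches.
    alternation = "|".join(re.escape(s) for s in suffixes)
    suffix_re = re.compile(rf"(.+?)({alternation})$")

    stems = defaultdict(list)  # stem -> suffixes, in corpus order
    for word in words:
        m = suffix_re.match(word)
        if m:
            stems[m.group(1)].append(m.group(2))
    return [f"{stem}/{','.join(found)}" for stem, found in stems.items()]


# Sample data copied from the question.
corpus = ["worked", "working", "works", "tested", "tests", "find", "found"]
exclude = ["ed", "s", "ing"]
for line in condense(corpus, exclude):
    print(line)
# prints:
# work/ed,ing,s
# test/ed,s
```

To run it on the real files, pass generators of stripped lines, for example (line.strip() for line in open("corpus.txt")), so the corpus is streamed once rather than loaded whole; only the stem table is kept in memory.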
