How do I reverse tokenization after running tokens through a name finder?


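After using NameFinderME to find the names in a series of tokens, I would like to reverse the tokenization and rebuild the original text with the names modified. Is there a way to reverse the tokenization exactly as it was performed, so that the output has the same structure as the input?

Hello, my name is John. This is another sentence.

Find the sentences.

Hello, my name is John.  This is another sentence.

Tokenize the sentences.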

> Hello 
> my 
> name 
> is 
> John.
> 
> This 
> is 
> another 
> sentence.

So far, my code for analyzing the tokens above looks like this.

            TokenNameFinderModel model3 = new TokenNameFinderModel(modelIn3);
            NameFinderME nameFinder = new NameFinderME(model3);

            List<Span[]> spans = new List<Span[]>();
            foreach (string sentence in sentences)
            {
                String[] tokens = tokenizer.tokenize(sentence);

                Span[] nameSpans = nameFinder.find(tokens);
                string[] namedEntities = Span.spansToStrings(nameSpans, tokens);


                //I want to modify each of the named entities found
                //foreach(string s in namedEntities) { modifystring(s) };


                spans.Add(nameSpans);

            }

Desired output, possibly with the found names masked.

Hello, my name is XXXX. This is another sentence.
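One way to get output with exactly the input's structure is to skip detokenization and edit the original string by character offset. The sketch below is plain Java and illustrates only the idea: `OffsetMasking`, `tokenizeWithOffsets`, and `maskTokens` are hypothetical names, and the toy tokenizer stands in for a real one. With OpenNLP, the offsets would presumably come from `Tokenizer.tokenizePos`, which returns character `Span`s instead of strings, while the token indices would come from the spans returned by `NameFinderME.find`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: masks tokens by character offset.
class OffsetMasking {

    // Splits text into tokens and records each token's {start, end} character
    // offsets (end exclusive). A run of letters or digits forms one token;
    // every other non-whitespace character is a token of its own.
    static List<int[]> tokenizeWithOffsets(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                spans.add(new int[] {start, i});
            } else {
                spans.add(new int[] {i, i + 1});
                i++;
            }
        }
        return spans;
    }

    // Replaces the tokens in [firstToken, endToken) with the mask and copies
    // every other character, including whitespace, from the original text.
    static String maskTokens(String text, List<int[]> spans,
                             int firstToken, int endToken, String mask) {
        int from = spans.get(firstToken)[0];
        int to = spans.get(endToken - 1)[1];
        return text.substring(0, from) + mask + text.substring(to);
    }

    public static void main(String[] args) {
        // The double space after "John." is deliberate: it must survive.
        String text = "Hello, my name is John.  This is another sentence.";
        List<int[]> spans = tokenizeWithOffsets(text);
        // Token 5 is "John"; a name finder would report it as the span [5, 6).
        System.out.println(maskTokens(text, spans, 5, 6, "XXXX"));
    }
}
```

When a text contains several names, applying the replacements from the last span to the first keeps the earlier offsets valid.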

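The documentation links to this post, which describes how to use the detokenizer. I don't understand how the operations array relates to the original tokenization (if at all).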

https://issues.apache.org/jira/browse/OPENNLP-216

Create an instance of SimpleTokenizer.

    String sentence = "He said \"This is a test\".";
    SimpleTokenizer instance = SimpleTokenizer.INSTANCE;

Tokenize the sentence using the tokenize(String str) method of SimpleTokenizer.

    String tokens[] = instance.tokenize(sentence);

The operations array must hold the same number of operation names as the tokens array, i.e. the two array lengths must be equal. Store the operation name N times (tokens.length times) in the operations array.

    Operation operations[] = new Operation[tokens.length];
    String oper = "MOVE_RIGHT"; // please refer above list for the list of operations
    for (int i = 0; i < tokens.length; i++) {
        operations[i] = Operation.parse(oper);
    }
    System.out.println(operations.length);

Here the operations array length will be equal to the tokens array length.

Now create an instance of DetokenizationDictionary by passing the tokens and operations arrays to the constructor.

    DetokenizationDictionary detokenizeDict = new DetokenizationDictionary(tokens, operations);

Pass the DetokenizationDictionary instance to the DictionaryDetokenizer class to detokenize the tokens.

    DictionaryDetokenizer dictDetokenize = new DictionaryDetokenizer(detokenizeDict);

DictionaryDetokenizer.detokenize requires two parameters: (a) the tokens array and (b) the split marker.

    String st = dictDetokenize.detokenize(tokens, " ");

Output:
1 Answer

Use the Detokenizer:

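// detokenizer is assumed to be an instance of a Detokenizer implementation, e.g. DictionaryDetokenizer
String text = detokenizer.detokenize(myTokens, null);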
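To make clearer what a dictionary-driven detokenizer does, here is a simplified, self-contained sketch. It is not OpenNLP's implementation: `RuleDetokenizerSketch` and its `Op` enum are hypothetical stand-ins. As far as I understand the API, the operations in a `DetokenizationDictionary` are attached to token strings such as `.` or `(`, not to positions in the sentence being rebuilt, so the rules below are likewise keyed by token text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified illustration; not OpenNLP's implementation.
class RuleDetokenizerSketch {

    // Simplified stand-ins for the kinds of operations a detokenization
    // dictionary assigns to token strings.
    enum Op { ATTACH_LEFT, ATTACH_RIGHT }

    // Joins tokens with single spaces, except where a rule says that a token
    // attaches to its left or right neighbour.
    static String detokenize(String[] tokens, Map<String, Op> rules) {
        StringBuilder out = new StringBuilder();
        boolean suppressSpace = true; // no space before the first token
        for (String token : tokens) {
            Op op = rules.get(token);
            if (!suppressSpace && op != Op.ATTACH_LEFT) {
                out.append(' ');
            }
            out.append(token);
            // A token that attaches to the right swallows the following space.
            suppressSpace = (op == Op.ATTACH_RIGHT);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Op> rules = new HashMap<>();
        rules.put(".", Op.ATTACH_LEFT);
        rules.put(",", Op.ATTACH_LEFT);
        rules.put("(", Op.ATTACH_RIGHT);
        rules.put(")", Op.ATTACH_LEFT);
        String[] tokens = {"Hello", ",", "my", "name", "is", "XXXX", "."};
        System.out.println(detokenize(tokens, rules));
    }
}
```

Because the rules see only the tokens, any whitespace that tokenization discarded (a double space, a tab, a line break) cannot be recovered this way. A rule-based detokenizer therefore yields a plausible rendering of the text, not a guaranteed exact copy of the input.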
