How can I get enhanced dependency parses from the Stanford NLP tools?

Question

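I am working on a project on dependency parsing for Polish. We are trying to train the Stanford neural network dependency parser on Polish data (Universal Dependencies treebanks in .conllu format). The data is already tokenized and annotated, so we have trained neither the tokenizer nor the parser provided by CoreNLP. So far, by running the parser from the command line, we have had some success with the pl_lfg-ud treebank in standard dependencies. But we would also like to train the parser to reproduce the enhanced Universal Dependencies, which are also represented in the treebank. So far I have not found a way to do this in the documentation or the FAQs for NNDEP and CoreNLP, even though, as far as I understand, it is possible with the Stanford NLP parser. Does enhanced dependency parsing only work for English (or other officially supported languages), or am I just doing something wrong?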

I would be very grateful for any clues!

Answer 1

There is information on how to train a model here:

https://stanfordnlp.github.io/CoreNLP/depparse.html

Example command:

java -Xmx12g edu.stanford.nlp.parser.nndep.DependencyParser -trainFile fr-ud-train.conllu -devFile fr-ud-dev.conllu -model new-french-UD-model.txt.gz -embedFile wiki.fr.vec -embeddingSize 300 -tlp edu.stanford.nlp.trees.international.french.FrenchTreebankLanguagePack -cPOS

You will also need to train a part-of-speech model:

https://nlp.stanford.edu/software/pos-tagger-faq.html

https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/tagger/maxent/MaxentTagger.html

Example command:

java -mx1g edu.stanford.nlp.tagger.maxent.MaxentTagger -props myPropertiesFile.props

You can find the appropriate format for the training properties file in the documentation.

Example file:


## tagger training invoked at Sun Sep 23 19:24:37 PST 2018 with arguments:
                   model = english-left3words-distsim.tagger
                    arch = left3words,naacl2003unknowns,wordshapes(-1,1),distsim(/u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters,-1,1),distsimconjunction(/u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters,-1,1)
            wordFunction = edu.stanford.nlp.process.AmericanizeFunction
               trainFile = /path/to/training-data
         closedClassTags = 
 closedClassTagThreshold = 40
 curWordMinFeatureThresh = 2
                   debug = false
             debugPrefix = 
            tagSeparator = _
                encoding = UTF-8
              iterations = 100
                    lang = english
    learnClosedClassTags = false
        minFeatureThresh = 2
           openClassTags = 
rareWordMinFeatureThresh = 10
          rareWordThresh = 5
                  search = owlqn
                    sgml = false
            sigmaSquared = 0.0
                   regL1 = 0.75
               tagInside = 
                tokenize = true
        tokenizerFactory = 
        tokenizerOptions = 
                 verbose = false
          verboseResults = true
    veryCommonWordThresh = 250
                xmlInput = 
              outputFile = 
            outputFormat = slashTags
     outputFormatOptions = 
                nthreads = 1

There is a comprehensive list of example training properties files here:

https://github.com/stanfordnlp/CoreNLP/tree/master/scripts/pos-tagger
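
Since the treebank is in .conllu format and the example properties file above uses tagSeparator = _, one way to prepare tagger training data is to flatten each sentence into a line of word_TAG tokens. The following is a minimal sketch under that assumption, not part of CoreNLP; the function name `conllu_to_tagged_lines` and its parameters are hypothetical. It takes the UPOS column by default, on the assumption that this matches what the parser is trained with.

```python
def conllu_to_tagged_lines(conllu_text, tag_column=3, separator="_"):
    """Convert CoNLL-U text into one 'word_TAG word_TAG ...' line per sentence.

    tag_column is a 0-based column index: 3 is UPOS, 4 is XPOS.
    """
    sentences = []
    current = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if not cols[0].isdigit():
            # Skip multiword token ranges ("3-4") and empty nodes ("5.1").
            continue
        current.append(cols[1] + separator + cols[tag_column])
    if current:
        # Handle a file that does not end with a blank line.
        sentences.append(" ".join(current))
    return sentences
```

Writing the returned lines to a file, one per line, gives something that can be pointed to by trainFile. Word forms that themselves contain the separator character are not treated specially here and would be worth checking in the output.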

If you use the Java pipeline, you will need to either write a tokenizer or provide pre-tokenized text.
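
For the pre-tokenized route, a rough sketch of how the tokens already present in a .conllu file could be written out as plain text, one sentence per line with tokens separated by single spaces. The helper name `conllu_to_pretokenized` is hypothetical. The assumption is that the pipeline is then run with the CoreNLP properties tokenize.whitespace = true and ssplit.eolonly = true, so that tokens and sentence boundaries are taken from the text as given.

```python
def conllu_to_pretokenized(conllu_text):
    """Return the sentences of a CoNLL-U file as whitespace-tokenized lines."""
    sentences = []
    current = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line marks a sentence boundary.
            if current:
                sentences.append(" ".join(current))
                current = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Keep plain word lines only; skip ranges ("3-4") and empty nodes ("5.1").
            if cols[0].isdigit():
                current.append(cols[1])
    if current:
        sentences.append(" ".join(current))
    return "\n".join(sentences)
```

This keeps the syntactic words rather than the surface multiword tokens, so the text lines up with the units the treebank annotates.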

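You may be interested in our Python project, which has Polish models for tokenization, sentence splitting, lemmatization, and dependency parsing. You can also train your own models: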

https://github.com/stanfordnlp/stanfordnlp
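
On the enhanced dependencies mentioned in the question: in the CoNLL-U format the basic tree is stored in the HEAD and DEPREL columns (7 and 8), while the enhanced graph is stored in the DEPS column (9) as head:relation pairs separated by "|". The sketch below only reads both annotations side by side, which can help when checking what a given treebank contains. It makes no claim about what any parser trains on, and the function name `read_dependencies` is hypothetical.

```python
def read_dependencies(conllu_text):
    """Yield (token_id, form, basic, enhanced) for each token line.

    basic is the (head, deprel) pair from the HEAD and DEPREL columns.
    enhanced is a list of (head, deprel) pairs parsed from the DEPS column.
    """
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank separator or sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or "-" in cols[0]:
            continue  # malformed line or multiword token range such as "3-4"
        basic = (cols[6], cols[7])
        enhanced = []
        if cols[8] != "_":
            for pair in cols[8].split("|"):
                # Split on the first colon only: relations such as
                # "nsubj:xsubj" contain colons themselves.
                head, _, deprel = pair.partition(":")
                enhanced.append((head, deprel))
        yield cols[0], cols[1], basic, enhanced
```

Tokens whose enhanced list has more than one entry, or whose entry differs from the basic pair, are the places where the enhanced graph departs from the basic tree.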
