限制斯坦福NER的迭代次数

Question

我正在定制数据集上训练斯坦福NER CRF模型，但用于训练模型的迭代次数现已达到333次迭代 - 即。而这个培训过程现在已经持续了好几个小时。以下是终端中打印的信息 -

Iter 335 evals 400 <D> [M 1.000E0] 2.880E3 38054.87s |5.680E1| {6.652E-6} 4.488E-4 - 
Iter 336 evals 401 <D> [M 1.000E0] 2.880E3 38153.66s |1.243E2| {1.456E-5} 4.415E-4 -
 -

正在使用的属性文件如下所示 - 在某种程度上我可以将迭代次数限制为20。

location of the training file
trainFile = TRAIN5000.tsv
#location where you would like to save (serialize to) your
#classifier; adding .gz at the end automatically gzips the file,
#making it faster and smaller
serializeTo = ner-model_TRAIN5000.ser.gz

#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

#these are the features we'd like to train with
#some are discussed below, the rest can be
#understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk = true
printFeatures=true
flag useObservedSequencesOnly=true
featureDiffThresh=0.05

Answer 1

我尝试通过Stanford CoreNLP CRF classifier在https://nlp.stanford.edu/software/crf-faq.html上描述的IOB标记的标记化文本训练生物医学（BioNER）模型。

我的语料库 - 来自下载的来源 - 非常大（约1.5M行; 6个特征：GENE; ......）。由于培训似乎无限期地运行，我绘制了值的比率以了解进度：

抓住Java源代码，我发现TOL中指定的默认值tolerance（.../CoreNLP/src/edu/stanford/nlp/optimization/QNMinimizer.java;用于决定何时终止训练会话）的值为1E-6（0.000001）。

看看那个情节，我原来的训练课程永远不会完成。 [该图还显示设置更大的TOL值，例如tolerance=0.05将触发提前终止训练，因为TOL值是由训练开始时附近发生的“噪音”触发的。我在tolerance=0.05文件中用.prop条目证实了这一点;然而，TOL，0.01等的0.005值是“OK。”]

除非我在maxIterations=20属性文件中添加并更改了tolerance=值，否则@StanfordNLPHelp（此线程中的其他位置）所描述的“bioner.prop”添加到属性文件中似乎被忽略了;例如

tolerance=0.005
maxIterations=20    ## optional

在这种情况下，分类器快速训练模型（bioner.ser.gz）。 [当我将maxIterations线添加到我的.prop文件时，没有添加tolerance线，模型就像以前一样“永远”运行。

可在此处找到可包含在.prop文件中的参数列表：

https://nlp.stanford.edu/nlp/javadoc/javanlp-3.5.0/edu/stanford/nlp/ie/NERFeatureFactory.html

Answer 2

简短回答：使用tolerance（默认为1e-4）。还有另一个参数maxIterations被忽略了。

Answer 3

在你的prop文件中使用maxQNItr=21。它将运行多达20次迭代。得到了David's answer的帮助

Answer 4

将maxIterations=20添加到属性文件中。

限制斯坦福NER的迭代次数

问题描述投票：3回答：4

4个回答

最新问题

限制斯坦福NER的迭代次数

问题描述 投票：3回答：4

4个回答

最新问题

问题描述投票：3回答：4