Lambda does not find NLTK data downloaded via AWS CodeBuild

Question (votes: 1, answers: 1)

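I have a Lambda function that runs a script using NLTK. I use a pipeline to automate all the development steps: when a new commit lands in the GitHub repository, AWS CodeBuild builds the project and deploys it to my Lambda function.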

The script

  • Environment: Python 3.6.5
  • Uses NLTK with the stopwords and wordnet packages

I based my buildspec on this solution: Installing NLTK/WORDNET on AWS Lambda via CodeBuild

version: 0.2
phases:
 install:
   commands:
     - echo "install step"
     - apt-get update
     - apt-get install zip -y
     - apt-get install python3-pip -y
     - pip install --upgrade pip
     - pip install --upgrade awscli
     # Define directories
     - export HOME_DIR=`pwd`
     - export NLTK_DATA=$HOME_DIR/nltk_data
 pre_build:
   commands:
     - echo "pre_build step"
     - cd $HOME_DIR
     - virtualenv venv
     - . venv/bin/activate
     # Install modules
     - pip install -U requests
     # NLTK download
     - pip install -U nltk
     - python -m nltk.downloader -d $NLTK_DATA wordnet stopwords
     - pip freeze > requirements.txt
 build:
   commands:
     - echo 'build step'
     - cd $HOME_DIR
     - mv $VIRTUAL_ENV/lib/python3.6/site-packages/* .
     - sudo zip -r9 algo.zip .
     - aws s3 cp --recursive --acl public-read ./ s3://hilightalgo/
     # Put the zip on the lambda function
     - aws lambda update-function-code --function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight --zip-file fileb://algo.zip
 post_build:
   commands:
     - echo "Build: end"

All the steps run without errors, but when I invoke my Lambda function, it behaves as if the NLTK data were missing. Here is the result of the Lambda execution:

{"errorMessage":"\n**********************************************************************\n Resource \u001b[93mstopwords\u001b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \u001b[31m>>> import nltk\n >>> nltk.download('stopwords')\n \u001b[0m\n Attempted to load \u001b[93mcorpora/stopwords\u001b[0m\n\n Searched in:\n - '/home/sbx_user1060/nltk_data'\n - '/var/lang/nltk_data'\n - '/var/lang/share/nltk_data'\n - '/var/lang/lib/nltk_data'\n - '/usr/share/nltk_data'\n - '/usr/local/share/nltk_data'\n - '/usr/lib/nltk_data'\n - '/usr/local/lib/nltk_data'\n**********************************************************************\n","errorType":"LookupError","stackTrace":[" File \"/var/task/lambda_function.py\", line 13, in lambda_handler\n return preprocessing.find_sentences('twitter.txt', 'english')\n"," File \"./hilight_aglo_v2/preprocessing.py\", line 100, in find_sentences\n (data, data_stopwords) = sentence_tokenize(file, language)\n"," File \"./hilight_aglo_v2/preprocessing.py\", line 52, in sentence_tokenize\n stop_words = set(stopwords.words(language))\n"," File \"/var/task/nltk/corpus/util.py\", line 123, in __getattr__\n self.__load()\n"," File \"/var/task/nltk/corpus/util.py\", line 88, in __load\n raise e\n"," File \"/var/task/nltk/corpus/util.py\", line 83, in __load\n root = nltk.data.find('{}/{}'.format(self.subdir, self.__name))\n"," File \"/var/task/nltk/data.py\", line 699, in find\n raise LookupError(resource_not_found)\n"]}

I don't know why Lambda does not find the NLTK data. Does anyone have an idea how to solve this?

python-3.x amazon-web-services aws-lambda nltk aws-codebuild
1 answer (1 vote)

According to the error message, NLTK searches these directories for the corpus:

Searched in:
 - '/home/sbx_user1060/nltk_data'
 - '/var/lang/nltk_data'
 - '/var/lang/share/nltk_data'
 - '/var/lang/lib/nltk_data'
 - '/usr/share/nltk_data'
 - '/usr/local/share/nltk_data'
 - '/usr/lib/nltk_data'
 - '/usr/local/lib/nltk_data'

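However, in the Lambda execution environment, access to the file system is somewhat constrained; these directories may not even exist, let alone be readable by your code. Also, your code (the .zip archive you create) is extracted to /var/task, which is essentially the function's home directory.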
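One way to check this from inside the function is to log which of the searched directories exist and whether any of them holds the resource. The following is a minimal sketch, not part of the original answer: the helper name `locate_resource` is hypothetical, and it assumes the downloader's usual layout, in which a corpus sits under `corpora/` as an unpacked directory, a `.zip` file, or both.

```python
import os

def locate_resource(search_dirs, resource="corpora/stopwords"):
    """Return (directory, exists, has_resource) for each search directory."""
    report = []
    for base in search_dirs:
        target = os.path.join(base, *resource.split("/"))
        # Accept either the unpacked directory or the downloaded archive.
        has_resource = os.path.isdir(target) or os.path.isfile(target + ".zip")
        report.append((base, os.path.isdir(base), has_resource))
    return report

if __name__ == "__main__":
    # A few directories from the error message, plus the bundled location.
    dirs = [
        "/var/lang/nltk_data",
        "/usr/share/nltk_data",
        "/usr/local/share/nltk_data",
        "/var/task/nltk_data",
    ]
    for base, exists, found in locate_resource(dirs):
        print(f"{base}: exists={exists} has_stopwords={found}")
```

Calling this from the handler and reading the output in the function's logs would show whether `/var/task/nltk_data` is present while the default locations are not.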

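Fortunately, it seems you can tell nltk where to look for corpora by setting an environment variable. If I understand your build process correctly, you bundle the NLTK corpora into a subdirectory nltk_data, right alongside your Python code and the required libraries. So in the Lambda execution environment, they will end up at /var/task/nltk_data.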
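Whether the corpora really are in the bundle can be checked before deploying by listing the archive's contents. This is a minimal sketch, not from the original answer: the helper names `zip_has_prefix` and `check_bundle` are hypothetical, while the archive name `algo.zip` and the `nltk_data` directory come from the buildspec in the question.

```python
import zipfile

def zip_has_prefix(zip_path, prefix):
    """Return True if any archive entry starts with the given prefix."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Tolerate entries stored with a leading "./".
            if name.startswith("./"):
                name = name[2:]
            if name.startswith(prefix):
                return True
    return False

def check_bundle(zip_path="algo.zip"):
    # Lambda extracts the archive into /var/task, so an entry named
    # nltk_data/... ends up at /var/task/nltk_data/...
    return zip_has_prefix(zip_path, "nltk_data/corpora/stopwords")
```

If `check_bundle()` returns `False`, the data never made it into the archive, and no environment variable will help until the build is fixed.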

So, try setting the NLTK_DATA environment variable on your function at the end of your CodeBuild process:

aws lambda update-function-configuration \
--function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight \
--environment 'Variables={NLTK_DATA=/var/task/nltk_data}'
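
An alternative that leaves the function configuration untouched is to extend NLTK's search path in code before the first corpus access; `nltk.data.path` is a plain list of directories. The sketch below is an illustration under stated assumptions rather than part of the original answer: the helper names are hypothetical, and it relies on the `LAMBDA_TASK_ROOT` environment variable that Lambda sets to the code directory, falling back to `/var/task`.

```python
import os
import posixpath

def bundled_nltk_data_dir(env=None):
    """Return the directory where the bundled nltk_data should be unpacked."""
    env = os.environ if env is None else env
    # LAMBDA_TASK_ROOT points at the extracted deployment package.
    task_root = env.get("LAMBDA_TASK_ROOT", "/var/task")
    return posixpath.join(task_root, "nltk_data")

def add_bundled_nltk_data(search_path, env=None):
    """Put the bundled data directory at the front of a search-path list."""
    data_dir = bundled_nltk_data_dir(env)
    if data_dir not in search_path:
        search_path.insert(0, data_dir)
    return search_path
```

In `lambda_function.py`, this would be used as `add_bundled_nltk_data(nltk.data.path)` right after `import nltk` and before anything touches `stopwords`.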