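I am doing text mining in Python on Twitter data to study sentiment around the IPOs (initial public offerings) of Indian companies. I need help extracting tweets that contain several terms, all of them at once. For example, I want tweets that contain all three words "Mahindra", "Logistics" and "IPO". Is there a way to do this with the stream filter function in Python?
I have attached my code as well: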
if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    # This line filters the Twitter stream to capture data by the keywords: 'Mahindra', 'Logistics', 'IPO'
    stream.filter(track=['Mahindra,Logistics,IPO'])
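I can't comment on your question, so I have to post an answer.
I haven't worked with the Twitter API, but I do have an alternative. You can use Twitter Scraper and achieve the same goal without much coding.
Your code seems to be just an (incomplete) Python snippet, but it still looks familiar to me. I use the following script to get data from the Twitter Streaming API: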
# To run this code, first edit config.py with your configuration (Auth data), then install necessary modules, then:
#
# Call
#
# mkdir data
# python twitter_stream_download.py -q apple -d data
#
#
# It will produce the list of tweets for the query "apple"
# in the file data/stream_apple.json
# analyse tweets with jq:
# cat stream_apple.json | jq -s '.[] | {user: .user.name}'
import tweepy
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
import argparse
import string
import config
import json
def get_parser():
    """Get parser for command line arguments."""
    parser = argparse.ArgumentParser(description="Twitter Downloader")
    parser.add_argument("-q",
                        "--query",
                        dest="query",
                        help="Query/Filter",
                        default='-')
    parser.add_argument("-l",
                        "--lang",
                        dest="languages",
                        help="Languages",
                        default='en')
    parser.add_argument("-d",
                        "--data-dir",
                        dest="data_dir",
                        help="Output/Data Directory")
    return parser
class MyListener(StreamListener):
    """Custom StreamListener for streaming data."""

    def __init__(self, data_dir=".", query=""):
        # Let the tweepy base class set up its own state first.
        super().__init__()
        query_fname = format_filename(query)
        self.outfile = "%s/stream_%s.json" % (data_dir, query_fname)
        print("Writing to '{}'".format(self.outfile))

    def on_data(self, data):
        try:
            # Append each raw message to the output file.
            with open(self.outfile, 'a') as f:
                f.write(data)
                print(data)
                return True
        except Exception as e:
            print("Error on_data: %s" % str(e))
            time.sleep(5)
        return True

    def on_error(self, status):
        if status == 420:
            # Returning False in on_error disconnects the stream.
            print("rate limited - too many connection attempts. Please wait.")
            return False
        else:
            print(status)
            return True
def format_filename(fname):
    """Convert file name into a safe string.
    Arguments:
        fname -- the file name to convert
    Return:
        String -- converted file name
    """
    return ''.join(convert_valid(one_char) for one_char in fname)

def convert_valid(one_char):
    """Convert a character into '_' if invalid.
    Arguments:
        one_char -- the char to convert
    Return:
        Character -- converted char
    """
    valid_chars = "-_.%s%s" % (string.ascii_letters, string.digits)
    if one_char in valid_chars:
        return one_char
    else:
        return '_'
# Note: this helper is not referenced anywhere else in this script.
@classmethod
def parse(cls, api, raw):
    status = cls.first_parse(api, raw)
    setattr(status, 'json', json.dumps(raw))
    return status
if __name__ == '__main__':
    parser = get_parser()
    args = parser.parse_args()
    auth = OAuthHandler(config.consumer_key, config.consumer_secret)
    auth.set_access_token(config.access_token, config.access_secret)
    api = tweepy.API(auth)
    twitter_stream = Stream(auth, MyListener(args.data_dir, args.query))
    # 'async' is a reserved word since Python 3.7; tweepy 3.7+ names this parameter is_async.
    twitter_stream.filter(track=[args.query], languages=[args.languages], is_async=False)
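On the original question of requiring all three words: as far as I recall from Twitter's documentation for the streaming track parameter, commas between phrases act as a logical OR, while spaces inside a single phrase act as a logical AND. If that holds, passing one phrase such as 'Mahindra Logistics IPO' should ask for tweets containing all three terms. The sketch below builds such a phrase and adds a client-side check that could be called from on_data as a safety net. The helper names build_track_phrase, contains_all_terms and tweet_matches are hypothetical and not part of tweepy, and the check assumes the tweet text lives in the "text" field of the raw JSON.

```python
import json
import re


def build_track_phrase(terms):
    """Join terms with spaces so they form ONE track phrase (all terms required)."""
    return " ".join(term.strip() for term in terms)


def contains_all_terms(text, terms):
    """Return True if every term occurs in text as a whole word, ignoring case."""
    words = set(re.findall(r"\w+", text.lower()))
    return all(term.lower() in words for term in terms)


def tweet_matches(raw_json, terms):
    """Check the 'text' field of one raw tweet (a JSON string) against all terms."""
    tweet = json.loads(raw_json)
    return contains_all_terms(tweet.get("text", ""), terms)


terms = ["Mahindra", "Logistics", "IPO"]
# One phrase with spaces, instead of one comma-separated string.
track = [build_track_phrase(terms)]
```

With this, the filter call would take track=track, and on_data could write a message only when tweet_matches(data, terms) is true. Extended tweets may carry their full text in another field, so treat the client-side check as a starting point.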
First create an output directory, then a file config.py:
consumer_key = "7r..."
consumer_secret = "gp..."
access_token = "5Q..."
access_secret = "a3..."
Then call it like this:
python twitter_stream_download.py --query "#Logistics" -d data
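Once the run has written some data, the output can also be read back in Python instead of jq. This is a minimal sketch that assumes the file holds one JSON object per line, possibly with blank lines in between; iter_tweets and user_names are hypothetical helper names. Since format_filename turns '#' into '_', the query "#Logistics" should end up in data/stream__Logistics.json.

```python
import json


def iter_tweets(path):
    """Yield one parsed object per non-empty line of a line-delimited JSON file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines between messages
            try:
                yield json.loads(line)
            except ValueError:
                continue  # skip truncated or malformed lines


def user_names(path):
    """Collect user.name for every record that has a user object."""
    return [t["user"]["name"] for t in iter_tweets(path)
            if isinstance(t, dict) and "user" in t]
```

Calling user_names("data/stream__Logistics.json") would then give roughly the same information as the jq one-liner in the script header.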
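I ran into this exact problem (I needed to look for tweets more than a week old). Since the existing packages were too slow, I decided to create a small package called Twper. I think you might find it interesting. There is an example in the README that solves your exact problem.
Disclaimer: I am the author of this package, and it is relatively new, but I hope it helps.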