Extracting an image from a tweet and reading its text with tweepy and tesseract


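I am trying to add OCR to my Twitter monitor using tesseract. My question is: how do I get the image from a user's tweet and run OCR on it right away? I monitor certain Twitter accounts for their latest tweet, and if there is a new tweet that contains a URL, I open it in the browser. Now I also want to check whether the tweet contains an image and print the image's text content to the console. My code is as follows: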

import tweepy
import re
import webbrowser
import time
import urllib.request
from datetime import datetime
# a bunch of access keys
keys = [(example_keys)]

# which key is in use right now
key_index = 0
test = 0
url_store = ''



# Function to extract url from newest tweet 
def get_tweets(username, tweet_mode='extended'):
        # Authorization to consumer key and consumer secret 
        auth = tweepy.OAuthHandler(keys[key_index][0], keys[key_index][1]) 

        # Access to user's access key and access secret 
        auth.set_access_token(keys[key_index][2], keys[key_index][3]) 

        # Calling api 
        api = tweepy.API(auth) 

        # try to get latest tweet until rate limit is reached
        try:
            # Get newest tweet from profile
            tweets = api.user_timeline(screen_name=username, count=1)
            tweet = [tweet.text for tweet in tweets][0]
            print(tweet)



            global url_store
            # regex through tweet for url
            url = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(tweet))

            # check if url was found and isn't the same as the url from the last tweet
            if (url!=[] and url[0]!=url_store):
                # store url in variable
                url_store=url[0]
                # open the url in webbrowser
                webbrowser.open(url[0])

                # save the html dom to a text file
                urllib.request.urlretrieve(url[0], "test.txt")

        # when rate limit is reached
        except tweepy.TweepError:
            # select the next key from array
            changeKeys() 

        # right now function always returns false
        return False


def changeKeys():
        global key_index
        # increment key_index by 1 unless end of key array is reached -> start from the beginning
        if key_index >= len(keys) - 1:
            key_index = 0
        else:
            key_index += 1

def getIMG():
        # Not implemented yet: should find the images in a tweet and run OCR on them
        pass



# Driver code 
if __name__ == '__main__': 
    # boolean if url was found (right now its always false)
    found=False
    # loop until get_tweets returns True (it never does yet, so this runs forever)
    while not found:
        # get tweets from specific twitter handle
        found = get_tweets("Trump")
        time.sleep(0.02)
python twitter ocr tweepy
1 Answer

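This is a good question. Searching the tweet text with a regex is the wrong way to find images.

Every tweet contains "entities"; see https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object

You can use them to get the images directly from the tweet.

For example: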

tweet.entities['urls']

will give you all the URLs in the tweet.
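
To get from the entities to the OCR step the question asks about, something along these lines might work. It is only a sketch under a few assumptions: the helper names `image_urls` and `ocr_image` are made up for illustration, the OCR part assumes the `pytesseract` and `Pillow` packages plus the tesseract binary are installed, and the sample dictionary is hand-written in the shape of the documented entities object rather than taken from a real tweet.

```python
def image_urls(entities):
    """Return the HTTPS URLs of all photos listed in a tweet's entities dict."""
    return [
        media["media_url_https"]
        for media in entities.get("media", [])  # "media" is absent if no image
        if media.get("type") == "photo" and "media_url_https" in media
    ]


def ocr_image(url):
    """Download the image at `url` and return the text tesseract finds in it."""
    # Imported here so the URL helper above works without the OCR packages.
    import io
    import urllib.request

    import pytesseract          # assumption: pytesseract is installed
    from PIL import Image       # assumption: Pillow is installed

    with urllib.request.urlopen(url) as response:
        image = Image.open(io.BytesIO(response.read()))
    return pytesseract.image_to_string(image)


# Hand-written sample in the shape of the entities object (not real data).
sample_entities = {
    "urls": [{"expanded_url": "https://example.com/article"}],
    "media": [
        {
            "type": "photo",
            "media_url_https": "https://pbs.twimg.com/media/example.jpg",
        }
    ],
}

print(image_urls(sample_entities))
```

Inside `get_tweets` this could be used on the status object itself instead of its text, for example `for url in image_urls(tweets[0].entities): print(ocr_image(url))`. Note that `entities` lists only the first photo of a tweet; when a tweet has several, the full list is in `extended_entities`, which has the same `media` layout and is only present when the tweet has media.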
