我使用try / except来逐行读取文件时捕获问题。 try块包含一系列操作,最后一个操作通常是异常的原因。令人惊讶的是,我注意到即使引发异常,所有先前的操作都在try块中执行。尝试将我创建的字典转换为数据框时,这是一个问题,因为列表的长度不相等。
此代码会产生问题:
d = {'dates':[],'states':[], 'longitude':[], 'latitude':[], 'tweet_ids':[], 'user_ids':[], 'source':[]}
for file in f:
print("Processing file "+file)
t1 = file.split('/')[-1].split("_")
date = t1[0]
state_code = t1[1]
state = list(states_ref.loc[states_ref.code==state_code]['abbr'])[0]
collection = JsonCollection(file)
counter = 0
for tweet in collection.get_iterator():
counter += 1
try:
d['dates'].append(date)
d['states'].append(state)
t2 = tweet_parser.get_entity_field('geo', tweet)
if t2 == None:
d['longitude'].append(t2)
d['latitude'].append(t2)
else:
d['longitude'].append(t2['coordinates'][1])
d['latitude'].append(t2['coordinates'][0])
#note: the 3 lines bellow are the ones that can raise an exception
temp = tweet_parser.get_entity_field('source', tweet)
t5 = re.findall(r'>(.*?)<', temp)[0]
d['source'].append(t5)
except:
c += 1
print("Tweet {} in file {} had a problem and got skipped".format(counter, file))
print("This is a total of {} tweets I am missing from the {} archive I process.".format(c, sys.argv[1]))
next
tab = pd.DataFrame.from_dict(d)
我通过移动易于在顶部给出错误的操作来解决问题,但我想更好地理解为什么try / except表现得像这样。有任何想法吗?
此代码有效:
d = {'dates':[],'states':[], 'longitude':[], 'latitude':[], 'tweet_ids':[], 'user_ids':[], 'source':[]}
for file in f:
print("Processing file "+file)
t1 = file.split('/')[-1].split("_")
date = t1[0]
state_code = t1[1]
state = list(states_ref.loc[states_ref.code==state_code]['abbr'])[0]
collection = JsonCollection(file)
counter = 0
for tweet in collection.get_iterator():
counter += 1
try:
#note: the 3 lines bellow are the ones that can raise an exception
temp = tweet_parser.get_entity_field('source', tweet)
t5 = re.findall(r'>(.*?)<', temp)[0]
d['source'].append(t5)
d['dates'].append(date)
d['states'].append(state)
t2 = tweet_parser.get_entity_field('geo', tweet)
if t2 == None:
d['longitude'].append(t2)
d['latitude'].append(t2)
else:
d['longitude'].append(t2['coordinates'][1])
d['latitude'].append(t2['coordinates'][0])
except:
c += 1
print("Tweet {} in file {} had a problem and got skipped".format(counter, file))
print("This is a total of {} tweets I am missing from the {} archive I process.".format(c, sys.argv[1]))
next
tab = pd.DataFrame.from_dict(d)
在附加到目标对象之前,您始终可以使用时态对象来保存函数的输出。这样,如果某些内容失败,它会在将数据放入目标对象之前引发异常。
try:
#Put all data into a temporal Dictionary
#Can raise an exception here
temp = tweet_parser.get_entity_field('source', tweet)
t2 = tweet_parser.get_entity_field('geo', tweet)
tempDictionary = {
"source" : re.findall(r'>(.*?)<', temp)[0],
"latitude" : None if (t2 is None) else t2['coordinates'][1],
"longitude" : None if (t2 is None) else t2['coordinates'][0]
}
#Append data from temporal Dictionary
d['source'].append(tempDictionary['source'])
d['latitude'].append(tempDictionary['latitude'])
d['longitude'].append(tempDictionary['longitude'])
d['dates'].append(date)
d['states'].append(state)
except:
c += 1
print("Tweet {} in file {} had a problem and got skipped".format(counter, file))
print("This is a total of {} tweets I am missing from the {} archive I process.".format(c, sys.argv[1]))