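So I'm building a chatbot trained on a month of Reddit comments. The script I'm currently working on creates a database and loads it with some data from a JSON file.
When I run the code, it does manage to create the sqlite3 database, but it prints out these errors: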
Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
Extra data: line 1 column 16 (char 15)
Extra data: line 1 column 8 (char 7)
Extra data: line 1 column 11 (char 10)
Extra data: line 1 column 8 (char 7)
Extra data: line 1 column 9 (char 8)
Extra data: line 1 column 15 (char 14)
Extra data: line 1 column 9 (char 8)
Extra data: line 1 column 10 (char 9)
Extra data: line 1 column 17 (char 16)
Extra data: line 1 column 6 (char 5)
Extra data: line 1 column 12 (char 11)
Extra data: line 1 column 13 (char 12)
Extra data: line 1 column 13 (char 12)
Extra data: line 1 column 26 (char 25)
Extra data: line 1 column 21 (char 20)
Extra data: line 1 column 10 (char 9)
Extra data: line 1 column 16 (char 15)
Extra data: line 1 column 7 (char 6)
Extra data: line 1 column 20 (char 19)
Extra data: line 1 column 16 (char 15)
Extra data: line 1 column 10 (char 9)
Expecting value: line 1 column 1 (char 0)
Can anyone tell me what I can do to fix this?
BTW, here's the full code:
import sqlite3
import json
from datetime import datetime
import time
import ast
timeframe = '2015-01'
sql_transaction = []
start_row = 0
cleanup = 1000000
connection = sqlite3.connect('{}.db'.format(timeframe))
c = connection.cursor()
def create_table():
    c.execute("CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit TEXT, unix INT, score INT)")
def format_data(data):
    data = data.replace('\n', ' newlinechar ').replace('\r', ' newlinechar ').replace('"', "'")
    return data
def transaction_bldr(sql):
    global sql_transaction
    sql_transaction.append(sql)
    if len(sql_transaction) > 1000:
        c.execute('BEGIN TRANSACTION')
        for s in sql_transaction:
            try:
                c.execute(s)
            except:
                pass
        connection.commit()
        sql_transaction = []
def sql_insert_replace_comment(commentid, parentid, parent, comment, subreddit, time, score):
    try:
        sql = """UPDATE parent_reply SET parent_id = ?, comment_id = ?, parent = ?, comment = ?, subreddit = ?, unix = ?, score = ? WHERE parent_id =?;""".format(
            parentid, commentid, parent, comment, subreddit, int(time), score, parentid)
        transaction_bldr(sql)
    except Exception as e:
        print('s0 insertion', str(e))
def sql_insert_has_parent(commentid, parentid, parent, comment, subreddit, time, score):
    try:
        sql = """INSERT INTO parent_reply (parent_id, comment_id, parent, comment, subreddit, unix, score) VALUES ("{}","{}","{}","{}","{}",{},{});""".format(
            parentid, commentid, parent, comment, subreddit, int(time), score)
        transaction_bldr(sql)
    except Exception as e:
        print('s0 insertion', str(e))
def sql_insert_no_parent(commentid, parentid, comment, subreddit, time, score):
    try:
        sql = """INSERT INTO parent_reply (parent_id, comment_id, comment, subreddit, unix, score) VALUES ("{}","{}","{}","{}",{},{});""".format(
            parentid, commentid, comment, subreddit, int(time), score)
        transaction_bldr(sql)
    except Exception as e:
        print('s0 insertion', str(e))
def acceptable(data):
    if len(data.split(' ')) > 1000 or len(data) < 1:
        return False
    elif len(data) > 32000:
        return False
    elif data == '[deleted]':
        return False
    elif data == '[removed]':
        return False
    else:
        return True
def find_parent(pid):
    try:
        sql = "SELECT comment FROM parent_reply WHERE comment_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result != None:
            return result[0]
        else:
            return False
    except Exception as e:
        # print(str(e))
        return False
def find_existing_score(pid):
    try:
        sql = "SELECT score FROM parent_reply WHERE parent_id = '{}' LIMIT 1".format(pid)
        c.execute(sql)
        result = c.fetchone()
        if result != None:
            return result[0]
        else:
            return False
    except Exception as e:
        # print(str(e))
        return False
if __name__ == '__main__':
    create_table()
    row_counter = 0
    paired_rows = 0
    with open(r'C:\Users\hermans\Desktop\RedditBot\RC_2015-01.json', buffering=1000) as f:
        for row in f:
            # print(row)
            # time.sleep(555)
            row_counter += 1
            if row_counter > start_row:
                try:
                    row = json.loads(row)
                    parent_id = row['parent_id'].split('_')[1]
                    body = format_data(row['body'])
                    created_utc = row['created_utc']
                    score = row['score']
                    comment_id = row['id']
                    subreddit = row['subreddit']
                    parent_data = find_parent(parent_id)
                    existing_comment_score = find_existing_score(parent_id)
                    if existing_comment_score:
                        if score > existing_comment_score:
                            if acceptable(body):
                                sql_insert_replace_comment(comment_id, parent_id, parent_data, body, subreddit, created_utc, score)
                    else:
                        if acceptable(body):
                            if parent_data:
                                if score >= 2:
                                    sql_insert_has_parent(comment_id, parent_id, parent_data, body, subreddit, created_utc, score)
                                    paired_rows += 1
                            else:
                                sql_insert_no_parent(comment_id, parent_id, body, subreddit, created_utc, score)
                except Exception as e:
                    print(str(e))
            if row_counter % 100000 == 0:
                print('Total Rows Read: {}, Paired Rows: {}, Time: {}'.format(row_counter, paired_rows, str(datetime.now())))
            #if row_counter > start_row:
            #    if row_counter % cleanup == 0:
            #        print("Cleanin up!")
            #        sql = "DELETE FROM parent_reply WHERE parent IS NULL"
            #        c.execute(sql)
            #        connection.commit()
            #        c.execute("VACUUM")
            #        connection.commit()
And the JSON file (it contains way more comments than this, but at 200,000 lines I didn't want to paste it all...):
{
    "score_hidden": false,
    "name": "t1_cnas8zv",
    "link_id": "t3_2qyr1a",
    "body": "Most of us have some family members like this. *Most* of my family is like this. ",
    "downs": 0,
    "created_utc": "1420070400",
    "score": 14,
    "author": "YoungModern",
    "distinguished": null,
    "id": "cnas8zv",
    "archived": false,
    "parent_id": "t3_2qyr1a",
    "subreddit": "exmormon",
    "author_flair_css_class": null,
    "author_flair_text": null,
    "gilded": 0,
    "retrieved_on": 1425124282,
    "ups": 14,
    "controversiality": 0,
    "subreddit_id": "t5_2r0gj",
    "edited": false
} {
    "distinguished": null,
    "id": "cnas8zw",
    "archived": false,
    "author": "RedCoatsForever",
    "score": 3,
    "created_utc": "1420070400",
    "downs": 0,
    "body": "But Mill's career was way better. Bentham is like, the Joseph Smith to Mill's Brigham Young.",
    "link_id": "t3_2qv6c6",
    "name": "t1_cnas8zw",
    "score_hidden": false,
    "controversiality": 0,
    "subreddit_id": "t5_2s4gt",
    "edited": false,
    "retrieved_on": 1425124282,
    "ups": 3,
    "author_flair_css_class": "on",
    "gilded": 0,
    "author_flair_text": "Ontario",
    "subreddit": "CanadaPolitics",
    "parent_id": "t1_cnas2b6"
}
EDIT: I have now tried removing the try: except:, but now I get a new error that I don't understand, and it actually occurs earlier:
Traceback (most recent call last):
  File "C:\Users\hermans\Desktop\RedditBot\Current_Create_DB.py", line 121, in <module>
    row = json.loads(row)
  File "C:\Program Files (x86)\Python 3.5\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Program Files (x86)\Python 3.5\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Program Files (x86)\Python 3.5\lib\json\decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
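"And the JSON file (it contains way more comments than this, but at 200,000 lines I didn't want to paste it all...):"
What you have shown is not valid JSON. Snipping out a bunch of the data lines, we see the general problem: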
{
    "score_hidden": false,
} {
    "distinguished": null,
}
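The } { is a problem because your data contains multiple JSON objects (as the JSON standard calls them) one after another, rather than nested inside another layer (presumably a JSON array, again in the standard's terminology). It should instead look like this: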
[
    {
        "score_hidden": false,
    }, {
        "distinguished": null,
    }
]
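If rewriting a 200,000-line file as a single array is impractical, another option is to decode the objects one after another. Note that the script also calls json.loads on each physical line, so an object pretty-printed across several lines already fails on its first line (a lone {). The sketch below is an assumption about how back-to-back objects could be read, not part of the original script: iter_json_objects is a hypothetical helper built on the standard library's json.JSONDecoder.raw_decode, which parses one value and returns the index where it stopped. It works on the whole text in memory, which may not suit a very large dump.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from text holding values back to back."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample shaped like the pasted data: two objects, "} {" between.
sample = '{"id": "cnas8zv", "score": 14} {\n"id": "cnas8zw",\n"score": 3\n}'
comments = list(iter_json_objects(sample))
print([comment["id"] for comment in comments])
```

Each yielded dict could then go through the same per-comment logic the script already applies to row.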
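The errors you are getting are the JSON parser's details about how it failed to interpret the invalid JSON (because it is invalid). This becomes clear when you read the error message properly, by looking at the exception traceback. However, the way your code is written prevents you from doing that, because it only prints the exception message and then carries on as if nothing bad had happened: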
try:
    row = json.loads(row)
    # lots more code not relevant to the reported error
except Exception as e:
    print(str(e))
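Don't do this. You only make things harder for yourself. The way to solve problems with your code is to write less code at a time, and make sure it works before continuing. This kind of exception handling does the opposite, and it leads to posting large amounts of code on SO that is irrelevant to the problem, because you have lost the guidance that would have told you what is relevant :)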
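If you had left out this try/except block, your code would have bailed out immediately at the first error, but it would have told you something more useful. It would have pointed at the row = json.loads(row) line and labelled the error as a json.decoder.JSONDecodeError, which is a big hint. More importantly, code that keeps running after something has gone wrong, without a real attempt to fix the problem (or at least a proper decision that it can safely be ignored), has a chance to mess up your data further. In the long run this will cause you pain and suffering, so I'm trying to break you of this habit now :)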
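One way to act on this, sketched under assumptions: keep the try around only the one call that can fail, catch only json.JSONDecodeError, and re-raise with context instead of continuing. parse_row and its line_number parameter are hypothetical names for illustration, not part of the original script.

```python
import json


def parse_row(line, line_number):
    """Parse one line of input as JSON, failing loudly with its location."""
    try:
        return json.loads(line)
    except json.JSONDecodeError as exc:
        # Re-raise rather than print-and-continue, so bad input stops the run
        # and the traceback still points at the failing call.
        raise ValueError(
            "line {} is not valid JSON: {}".format(line_number, exc)
        ) from exc


# Hypothetical input: one complete object on a single line parses fine.
good = parse_row('{"id": "cnas8zv", "score": 14}', 1)
print(good["score"])
```

With this shape, a line holding only { stops the script at once with the line number in the message, rather than producing a stream of printed errors followed by a half-filled database.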