How to convert a JavaScript snippet extracted from YouTube into valid JSON


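Is there a way to process the following string so that it conforms to the JSON standard? (Keys and string values should be enclosed in double quotes.)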

{ 'VIDEO_ID': "3xOYjRcgibA", 'LIST_ID': "PLfKGJrRXSczdRU1RCcEOJ9TDtvWUA1VU2", 'WAIT_TO_DELAYLOAD_FRAME_CSS': true, 'IS_UNAVAILABLE_PAGE': false, 'DROPDOWN_ARROW_URL': "\/yts\/img\/pixel-vfl3z5WfW.gif", 'AUTONAV_EXTRA_CHECK': false, 'JS_PAGE_MODULES': [ 'www/watch', 'www/ypc_bootstrap', 'www/watch_speedyg', 'www/watch_autoplayrenderer', '' ], "text": [ 'It shouldn\'t replace here', 'And don't here' ], 'test': 5 }

The excerpt is taken from the JavaScript code of a YouTube web page.

The following Python code is my attempt to extract the string from the web page and convert it into a data structure:

import codecs
import requests
import json
import re

url  = 'https://www.youtube.com/watch?v=3xOYjRcgibA&list=PLfKGJrRXSczdRU1RCcEOJ9TDtvWUA1VU2'

page = requests.get(url)
html = page.content
html = html.decode('utf-8')
html = html.replace('\n','')
html = re.sub(' +',' ',html)

p = re.compile(r'yt\.setConfig\((.*?)\);')
matches = p.findall(html)
s = matches[0]
data = json.loads(s)

Output

C:\temp\work\python>temp.py
Traceback (most recent call last):
  File "C:\temp\work\python\temp.py", line 17, in <module>
    data = json.loads(s)
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python38-32\lib\json\__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python38-32\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Alex\AppData\Local\Programs\Python\Python38-32\lib\json\decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
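
The error above comes from the single-quoted keys: JSON only accepts double-quoted strings. A minimal sketch of that behaviour, using a hypothetical helper `is_valid_json` and two made-up one-key inputs (neither is part of the original script):

```python
import json


def is_valid_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Double-quoted key: accepted by the JSON parser.
double_quoted = '{"VIDEO_ID": "3xOYjRcgibA"}'
# Single-quoted key, as in the YouTube snippet: rejected.
single_quoted = "{'VIDEO_ID': \"3xOYjRcgibA\"}"

print(is_valid_json(double_quoted))
print(is_valid_json(single_quoted))
```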

The expected result can be achieved with the following Perl code:

use strict;
use warnings;
use feature 'say';

my $data = <DATA>;

$data =~ s/'(.*?)'([:, \]])/"$1"$2/gs;

say $data;

__DATA__
{ 'VIDEO_ID': "3xOYjRcgibA", 'LIST_ID': "PLfKGJrRXSczdRU1RCcEOJ9TDtvWUA1VU2", 'WAIT_TO_DELAYLOAD_FRAME_CSS': true, 'IS_UNAVAILABLE_PAGE': false, 'DROPDOWN_ARROW_URL': "\/yts\/img\/pixel-vfl3z5WfW.gif", 'AUTONAV_EXTRA_CHECK': false, 'JS_PAGE_MODULES': [ 'www/watch', 'www/ypc_bootstrap', 'www/watch_speedyg', 'www/watch_autoplayrenderer', '' ], "text": [ 'It shouldn\'t replace here', 'And don't here' ], 'test': 5 }

Output

{ "VIDEO_ID": "3xOYjRcgibA", "LIST_ID": "PLfKGJrRXSczdRU1RCcEOJ9TDtvWUA1VU2", "WAIT_TO_DELAYLOAD_FRAME_CSS": true, "IS_UNAVAILABLE_PAGE": false, "DROPDOWN_ARROW_URL": "\/yts\/img\/pixel-vfl3z5WfW.gif", "AUTONAV_EXTRA_CHECK": false, "JS_PAGE_MODULES": [ "www/watch", "www/ypc_bootstrap", "www/watch_speedyg", "www/watch_autoplayrenderer", "" ], "text": [ "It shouldn\'t replace here", "And don't here" ], "test": 5 }
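
For reference, the Perl substitution above could be ported to Python roughly as follows. This is only a sketch under assumptions: the names `SINGLE_QUOTED`, `_requote` and `js_object_to_dict` are hypothetical, the sample is a shortened copy of the string from the question, and the pattern is a heuristic that may misfire when a double-quoted value itself contains apostrophes. Unlike the Perl version, it also rewrites `\'` to `'`, because `\'` is not a valid JSON escape and `json.loads` would reject it.

```python
import json
import re

# Same pattern as the Perl substitution: a single-quoted run that is
# immediately followed by ':', ',', a space or ']'.
SINGLE_QUOTED = re.compile(r"'(.*?)'([:, \]])", re.S)


def _requote(match):
    """Turn one single-quoted run into a double-quoted JSON string."""
    body = match.group(1)
    body = body.replace("\\'", "'")           # \' is not a legal JSON escape
    body = re.sub(r'(?<!\\)"', r'\\"', body)  # escape bare double quotes
    return '"' + body + '"' + match.group(2)


def js_object_to_dict(text):
    """Requote the JavaScript object literal, then parse it as JSON."""
    return json.loads(SINGLE_QUOTED.sub(_requote, text))


# Shortened sample based on the string in the question.
sample = r"""{ 'VIDEO_ID': "3xOYjRcgibA", 'IS_UNAVAILABLE_PAGE': false, 'DROPDOWN_ARROW_URL': "\/yts\/img\/pixel-vfl3z5WfW.gif", 'JS_PAGE_MODULES': [ 'www/watch', '' ], "text": [ 'It shouldn\'t replace here', 'And don't here' ], 'test': 5 }"""

data = js_object_to_dict(sample)
print(data["VIDEO_ID"])
print(data["text"])
```

Parsing the result with `json.loads` right away is a cheap way to confirm that the rewritten text really is valid JSON.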
javascript python json youtube
1 Answer

Try this:

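import json
import re

import requests

url = 'https://www.youtube.com/watch?v=3xOYjRcgibA&list=PLfKGJrRXSczdRU1RCcEOJ9TDtvWUA1VU2'

page = requests.get(url)
html = page.content.decode('utf-8')
html = html.replace('\n', '')      # join the page into a single line
html = re.sub(' +', ' ', html)     # collapse runs of spaces

# Capture the argument of the first yt.setConfig(...) call
p = re.compile(r'yt\.setConfig\((.*?)\);')
matches = p.findall(html)          # avoid shadowing the built-in str
s = matches[0]
s = json.dumps(s)                  # encode the text as a JSON string literal
data = json.loads(s)               # decodes back to the same Python str
print(data)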

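json.dumps() encodes the extracted text as a valid JSON string literal, so json.loads() no longer raises an error. Note, however, that the round trip returns the original text as a Python str, not as a parsed dictionary, so the single-quoted keys still have to be converted before the data can be accessed by key.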
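
A quick check of what the json.dumps() / json.loads() round trip actually returns may be useful here. This is a minimal sketch with a made-up, shortened input string (an assumption, not the real page content):

```python
import json

# Shortened stand-in for the text captured from yt.setConfig(...).
s = "{ 'VIDEO_ID': \"3xOYjRcgibA\", 'test': 5 }"

encoded = json.dumps(s)       # a JSON string literal wrapping the whole text
decoded = json.loads(encoded)

print(type(decoded).__name__)  # str
print(decoded == s)            # True
```

The decoded value is the same text that went in, so a lookup such as `decoded['VIDEO_ID']` would fail with a TypeError.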
