Decoding a URL in BigQuery / Python (Unicode to UTF-8)

Question  Votes: 0  Answers: 1

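I am working with URIs in BigQuery, but some of our URIs carry Chinese utm parameters and I cannot decode them. I tried a user-defined function in BigQuery without success. Here is the sample UDF I tried: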

DECLARE uri STRING;
SET uri = 'https://www.random.cn/services/modifcation?utm_medium=cpc&utm_source=baidu&utm_term=%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF';


CREATE TEMP FUNCTION DecodeKeywords(encodedKeyword STRING) RETURNS STRING LANGUAGE js AS R"""
try {
    return decodeURIComponent(encodedKeyword);
  } catch (error) {
    // Handle the error gracefully
    return encodedKeyword ;
  }
""";

select uri, DecodeKeywords(uri) as decoded_uri  

I also tried the same approach in Python with urllib.parse.unquote (a solution generated by ChatGPT):

import urllib.parse

# Example URL with encoded characters
url = "https://www.random.cn/services/modifcation?utm_medium=cpc&utm_source=baidu&utm_term=%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF"

try:
  # Decode the URL with utf-8 encoding (common for web)
  decoded_url = urllib.parse.unquote(url, encoding='utf-8')
except UnicodeDecodeError:
  # If decoding with utf-8 fails, try using latin-1 encoding (fallback)
  decoded_url = urllib.parse.unquote(url, encoding='latin-1')

# Print the decoded URL
print(decoded_url)

However, neither attempt decodes the value into the correct keyword. If I paste the utm_term value

%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF

into ChatGPT, it shows the output "设施周期开始".

I tried a user-defined function in BigQuery as well as urllib.parse.unquote in Python.
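The `%2525` prefix in the utm_term value suggests it was percent-encoded more than once, which would explain why a single decode pass still returns percent signs. A small diagnostic sketch in Python, using only the standard library, that applies `unquote` one pass at a time; the variable names `term` and `layers` are illustrative and the choice of three passes is an assumption based on the `%2525` pattern:

```python
from urllib.parse import unquote

# utm_term value copied verbatim from the URL in the question.
term = (
    "%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F"
    "%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF"
)

# layers[0] is the raw value; each later entry is one more unquote pass.
layers = [term]
for _ in range(3):
    layers.append(unquote(layers[-1]))

for depth, value in enumerate(layers):
    print(depth, value)
```

Pass 1 still starts with `%25E8`, pass 2 starts with `%E8`, and only pass 3 yields readable text: 论文期刊投稿. That differs from the ChatGPT output quoted above, so the chatbot result is probably not a reliable reference for the expected keyword.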

python-3.x url google-bigquery bigquery-udf
1 Answer
0
votes

Could you try this function?

DECLARE uri STRING;
SET uri = 'https://www.random.cn/services/modifcation?utm_medium=cpc&utm_source=baidu&utm_term=%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF';

CREATE OR REPLACE TEMP FUNCTION DecodeKeywords(encodedKeyword STRING) RETURNS STRING LANGUAGE js AS R"""
try {
  return decodeURIComponent(encodedKeyword);  // Always decodes as UTF-8; there is no encoding argument
} catch (error) {
  // Handle the error gracefully
  return encodedKeyword;
}
""";

SELECT uri, DecodeKeywords(REGEXP_EXTRACT(uri, r'utm_term=(.*)')) AS decoded_term  
FROM your_table;
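
For comparison on the Python side, here is a minimal sketch that extracts utm_term from the full URL and keeps decoding until the value stops changing. It assumes the value was percent-encoded several times, as the `%2525` prefix suggests. `decode_utm_term` and `max_rounds` are hypothetical names introduced for illustration, not part of any library.

```python
from typing import Optional
from urllib.parse import parse_qs, unquote, urlsplit


def decode_utm_term(url: str, max_rounds: int = 5) -> Optional[str]:
    """Hypothetical helper: return utm_term with repeated percent-encoding removed."""
    # parse_qs already strips one layer of percent-encoding from each value.
    values = parse_qs(urlsplit(url).query).get("utm_term")
    if not values:
        return None
    term = values[0]
    # Decode again until the string is stable or the round cap is reached.
    for _ in range(max_rounds):
        decoded = unquote(term)
        if decoded == term:
            break
        term = decoded
    return term


url = (
    "https://www.random.cn/services/modifcation?utm_medium=cpc&utm_source=baidu"
    "&utm_term=%2525E8%2525AE%2525BA%2525E6%252596%252587%2525E6%25259C%25259F"
    "%2525E5%252588%25258A%2525E6%25258A%252595%2525E7%2525A8%2525BF"
)
print(decode_utm_term(url))
```

Decoding until stable is a design choice with a caveat: a keyword that legitimately contains a literal sequence such as `%41` would be decoded one step too far, so a fixed number of passes may be safer if the encoding depth is known. The same loop could be written around `decodeURIComponent` inside the JavaScript UDF.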