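I'm just getting started with web scraping in Python. I need to extract data from the following URL: https://comtradeplus.un.org/TradeFlow?Frequency=A&Flows=X&CommodityCodes=TOTAL&Partners=0&Reporters=all&period=2023&AggregateBy=none&BreakdownMode=plus
At the bottom of this page there is a "Download" button that, by default, creates a CSV file and saves it to the Windows Downloads folder. I don't need the whole database; for example, I need the product with commodity code 070110 for the country with reporter code 688.
Can anyone help me?
I've tried various approaches without success, since I'm new to web scraping.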
Here is an example of how to query the REST API and get the data back as JSON (loaded into a pandas DataFrame):
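import pandas as pd
import requests

# Public preview endpoint: C = commodities, A = annual, HS = classification
api_url = "https://comtradeapi.un.org/public/v1/preview/C/A/HS"

params = {
    "period": "2023",
    "reporterCode": "688",       # Serbia
    "partnerCode": "0",          # World
    "flowCode": "x",             # exports
    "cmdCode": "total,070110",   # all commodities plus HS code 070110
    "customsCode": "c00",
    "motCode": "0",
    "partner2Code": "0",
    "undefinednone": "",
    "breakdownMode": "plus",
    "includeDesc": "True",
    "countOnlyFalse": "",
}

response = requests.get(api_url, params=params, timeout=30)
response.raise_for_status()  # stop early on an HTTP error status
data = response.json()

# The records are under the "data" key of the JSON response
df = pd.DataFrame(data["data"])
print(df.head())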
This prints:
typeCode freqCode refPeriodId refYear refMonth period reporterCode reporterISO reporterDesc flowCode flowDesc partnerCode partnerISO partnerDesc partner2Code partner2ISO partner2Desc classificationCode classificationSearchCode isOriginalClassification cmdCode cmdDesc aggrLevel isLeaf customsCode customsDesc mosCode motCode motDesc qtyUnitCode qtyUnitAbbr qty isQtyEstimated altQtyUnitCode altQtyUnitAbbr altQty isAltQtyEstimated netWgt isNetWgtEstimated grossWgt isGrossWgtEstimated cifvalue fobvalue primaryValue legacyEstimationFlag isReported isAggregate
0 C A 20230101 2023 52 2023 688 SRB Serbia X Export 0 W00 World 0 W00 World H6 HS True TOTAL All Commodities 0 False C00 TOTAL CPC 0 0 TOTAL MOT -1 N/A 0.0 False -1 N/A 0.0 False NaN False 0.0 False None 3.093460e+10 3.093460e+10 0 False True
1 C A 20230101 2023 52 2023 688 SRB Serbia X Export 0 W00 World 0 W00 World H6 HS True 070110 Vegetables; seed potatoes, fresh or chilled 6 True C00 TOTAL CPC 0 0 TOTAL MOT 8 kg 114750.0 False 8 kg 114750.0 False 114750.0 False 0.0 False None 9.332200e+04 9.332200e+04 0 False True
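Since the question asks for only the 070110 product rather than the whole result, the response can be narrowed down and turned into CSV. The following is a minimal sketch, not part of the original answer: `select_commodity` and `KEEP_COLUMNS` are hypothetical names, the column selection is just an assumption about which fields are useful (all of them appear in the output above), and `sample_payload` is a small stand-in that mirrors two rows of that output so the snippet runs without a network call. In the real script you would pass `data` from the request instead.

```python
import pandas as pd

# Assumed selection of useful columns; every name appears in the output above.
KEEP_COLUMNS = [
    "refYear", "reporterCode", "reporterDesc", "flowDesc",
    "cmdCode", "cmdDesc", "netWgt", "primaryValue",
]


def select_commodity(payload, cmd_code, columns=KEEP_COLUMNS):
    """Return the rows of an API payload that match one commodity code."""
    df = pd.DataFrame(payload["data"])
    # Compare as strings so the leading zero in "070110" is not lost.
    mask = df["cmdCode"].astype(str) == cmd_code
    return df.loc[mask, columns].reset_index(drop=True)


# Stand-in payload with values copied from the two rows printed above.
sample_payload = {
    "data": [
        {"refYear": 2023, "reporterCode": 688, "reporterDesc": "Serbia",
         "flowDesc": "Export", "cmdCode": "TOTAL",
         "cmdDesc": "All Commodities",
         "netWgt": None, "primaryValue": 3.093460e+10},
        {"refYear": 2023, "reporterCode": 688, "reporterDesc": "Serbia",
         "flowDesc": "Export", "cmdCode": "070110",
         "cmdDesc": "Vegetables; seed potatoes, fresh or chilled",
         "netWgt": 114750.0, "primaryValue": 9.332200e+04},
    ]
}

potatoes = select_commodity(sample_payload, "070110")

# With no path argument, to_csv returns the CSV text;
# pass a file path as the first argument to write it to disk instead.
csv_text = potatoes.to_csv(index=False)
print(csv_text)
```

Filtering on the client side keeps the request identical to the one above; asking the API for `"cmdCode": "070110"` alone would probably work as well, but that variant is not shown in the output here.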