How do I fix this garbled text?


When I try to fetch the content of amazon.com, the Node.js request module prints text like this to the console:

?y?H>5Z??{???O???↔??????|◄???♦?<∟?h??j??B??43?!>ã?l???∟???        ??M??v6?
eP$r|???$;??☼?Thc???ea??l?p??k?▼??☺↓i???L?v7?x?6??M#tA??↕v?Z)?p1´?vQ??9?ET?1???J
?_c?☼↨u?Î%5↨??q??¶▬l??→1↨?$??h?_??J-;??r???+?▲?F?Hw?♦????lE?Qs?Hx??o9@??V3??
L?bk?fcb??????o?E9??]????"??}x♦7?7r→?z0KE??Z?▬?4?I?A??R↨???/s<???☻V`?f!????3?2;?
?????L???????!?OA9↕iC?/????r?0??U?♫M?♂?}y???=,e?M?↔Q[¶`xn?|B??w?D♫f?↓?↨☻n¶????
zH??4p??☻???O?☻♠????w↨?????P'???z?etXN'?U??`??Z??♀">♀j????????????5???!?????#u??
0X?i?zb?☺?[?&∟?>??‼??Q??+???}???z▲A???9§????O????♠????  ?∟?es???j??D0J?s?[?;U??!
???l0???u       i35_???∟x???2<RF???{???\d♥<?8?W?p>◄→]?????¶+???|(???☻z??♦??v??8⌂
,?"▲??∟?l???1?A?7zt??Q,?'??♥?n???♦,??r?N?H\?-?YA>)?♦??|X?C;I?q⌂]r↓?H??¶D??????>C
?X??? ?b???o?_+R?9??8??^??_???‼????_*v?↓?♣  ??"♠?♀!J1?Ib????u??Bg?a?S??↕?d1??&hZ
?H?↑?N♣???!?⌂|b?.0?&'▬?→?C*5ukp?▲4?☻>?7♣??,????2?\??$?X??4?T7???H7?$5?"?????,I?→
h??zy↕?▼???☻7??J]Ab1|rF?&^?↔??J]SG??<??►4?☺?↕?♥B?~P? 9∟?e|.BR?0♥?           ???]

However, when I fetch data from amazon.co.uk, it returns HTML as expected, with a structure like this:

<html><head>...</head><body>...</body></html>

How can I fix the first case? How do I get the HTML content? Is there a way to do this?

Code:

const rp = require('request-promise');
const url = 'http://www.amazon.co.uk/gp/product/B0085EY4MS';
const fs = require('fs');

rp(url)
  .then(function(html){

    fs.writeFile('mynewfile3.txt', html, function (err) {
      if (err) throw err;
      console.log('Saved!');
    });
  })
  .catch(function(err){

  });
node.js web-scraping request
1 Answer

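It turns out the problem is the encoding of the response. The body comes back compressed, and request does not decode compressed content unless you ask it to. Setting the gzip option makes request send an Accept-Encoding header and decode the response, which fixes the output. The updated code: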

const rp = require('request-promise');
const fs = require('fs');

const opts = {
    uri: "http://www.amazon.com/Fallout-76-PlayStation-4/dp/B07DD9571S",
    headers: {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    },
    gzip: true, // Added to fix the issue: request and decode compressed responses
};

rp(opts)
  .then(function(html){

    fs.writeFile('mynewfile3.txt', html, function (err) {
      if (err) throw err;
      console.log('Saved!');
    });
  })
  .catch(function(err){
    console.log(err);
  });

The headers are not required.
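
To see why the first output looks like noise, here is a minimal sketch using Node's built-in zlib module. It assumes the body is gzip-compressed, which the gzip: true fix suggests but the question does not prove. The helpers looksGzipped and decodeBody are hypothetical names for illustration, not part of request, and the body is simulated locally instead of being fetched from Amazon.

```javascript
const zlib = require('zlib');

// Hypothetical helper: a gzip stream always starts with the magic bytes 0x1f 0x8b.
function looksGzipped(buf) {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
}

// Hypothetical helper: gunzip the body only when it is actually compressed,
// then decode the bytes as UTF-8 text.
function decodeBody(buf) {
  const raw = looksGzipped(buf) ? zlib.gunzipSync(buf) : buf;
  return raw.toString('utf8');
}

// Simulate a compressed response body instead of hitting the network.
const original = '<html><head>...</head><body>...</body></html>';
const compressed = zlib.gzipSync(Buffer.from(original, 'utf8'));

// Printing the compressed bytes as text gives unreadable output like in the question.
const garbled = compressed.toString('utf8');

console.log(looksGzipped(compressed));            // true
console.log(decodeBody(compressed) === original); // true
console.log(garbled === original);                // false
```

With request itself, gzip: true does this decoding for you. Decoding by hand would only work on the raw bytes, so the body would have to arrive as a Buffer (encoding: null in request) and not as an already-decoded string.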
