"Date" column imported incorrectly after web scraping (rvest)


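I am trying to scrape multiple links/sources from an online social forum, but the posts come from different dates. For example, one forum thread may have been opened in December 2020 and another in July 2021, and it is essential for me to organize the online posts chronologically.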

# Load the required libraries
library(tidyverse)
library(rvest)
library(writexl)
library(purrr)
library(pacman)
library(httr)
library(lubridate)
library(readr) 
library(zoo)
#install.packages("pacman")
#install.packages("tidyverse")

Initialize vectors to store the scraped data

username <- vector() 
post <- vector() 
date <- vector()
user_status <- vector()
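
A side note on the initialization above, as a minimal base-R sketch: `vector()` creates an untyped (logical) vector, and `c()` picks its method from the first argument, so appending a `Date` to it yields plain numbers rather than dates. The names `date_untyped` and `date_typed` are hypothetical and only used for illustration.

```r
# Untyped start, as in the question: c() falls back to the default method,
# so the Date class is dropped and only the day count remains.
date_untyped <- vector()
date_untyped <- c(date_untyped, as.Date("2021-09-21"))
class(date_untyped)

# Typed start: the first argument is already a Date, so c() keeps the class.
date_typed <- as.Date(character())
date_typed <- c(date_typed, as.Date("2021-09-21"))
date_typed <- c(date_typed, NA)  # missing dates can still be appended
class(date_typed)
```

Starting from `as.Date(character())` would make the later `as.Date(..., origin = "1970-01-01")` conversion unnecessary.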

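The scraping code below runs without errors, but for some reason the "date" variable shows all dates starting from September 21, 2021. That is incorrect: the thread under url_2 below starts in November 2020, so I would expect the dataset to begin with posts written in November 2020, not September 2021.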

# Organize the data as follows: username, post, date, and user status.

# Loop through the pages of the forum thread
for (i in 1:100) {
  # Construct the URL for each source
  url_1 <- paste0("https://forums.hardwarezone.com.sg/threads/companies-may-exit-singapore-if-they-do-not-have-access-to-the-complementary-foreign-manpower-they-need-tan-see-leng.6817819/page-", i)

  url_2 <- paste0("https://forums.hardwarezone.com.sg/threads/glgt-you-can-see-that-f-b-jobs-are-really-not-on-top-of-the-minds-of-singaporeans.6404486/page-", i)

  url_3 <- paste0("https://forums.hardwarezone.com.sg/threads/disappointing-hard-truth-the-singaporean-worker-is-more-expensive-than-ft-coz-of-cpf-even-if-paid-same-wages-from-mom-data.6493727/page-", i)

  # Get the html content of all sources
  page1 <- GET(url_1)
  page2 <- GET(url_2)
  page3 <- GET(url_3)


  # Parse the html content
  soup <- read_html(page1) 
  soup <- read_html(page2) 
  soup <- read_html(page3) 

# Extract the section containing the messages
  section <- html_nodes(soup, "article.message")
  
  # Loop through each message in the section
  for (j in section) {
    # Append the username of the message author to the username vector
    username <- c(username, html_text(html_node(j, "a.username"))) 
    
    # Append the post content of the message to the post vector
    post <- c(post, html_text(html_node(j, "div.bbWrapper")))
    
    # Extract the date string of the message
    date_str <- html_text(html_node(j, "time.u-dt"))
    
    # Check if the date string is not empty
    if (date_str != "") { 
      # Convert the date string to a date object and append it to the date vector
      date <- c(date, as.Date(date_str, format = "%b %d, %Y"))
    } else {
      # If the date string is empty, append NA to the date vector
      date <- c(date, NA) 
    }
    
    # Append the user status of the message author to the user_status vector
    user_status <- c(user_status, html_text(html_node(j, "h5.userTitle.message-userTitle")))
  }
}
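
One thing that stands out in the loop above: `soup` is assigned three times in a row, so only `page3` is still in `soup` when the messages are extracted, and the posts from `url_1` and `url_2` are never read. Below is a minimal sketch of one way around that: parse each page separately and combine the results. It assumes each `time.u-dt` element carries an ISO 8601 `datetime` attribute, which is an assumption about the forum's markup rather than something confirmed in the question. The inline HTML snippets, `pages`, and `scrape_page` are hypothetical stand-ins so the example runs offline.

```r
library(rvest)

# Hypothetical stand-ins for three downloaded thread pages.
pages <- list(
  thread_1 = '<article class="message"><a class="username">user_a</a><time class="u-dt" datetime="2021-09-21T10:15:00+0800">Sep 21, 2021</time></article>',
  thread_2 = '<article class="message"><a class="username">user_b</a><time class="u-dt" datetime="2020-11-05T08:30:00+0800">Nov 5, 2020</time></article>',
  thread_3 = '<article class="message"><a class="username">user_c</a><time class="u-dt" datetime="2021-07-02T19:00:00+0800">Jul 2, 2021</time></article>'
)

# Parse one page and return one row per message.
scrape_page <- function(html) {
  messages <- html_nodes(read_html(html), "article.message")
  # Assumed attribute: the first 10 characters of `datetime` are YYYY-MM-DD.
  stamp <- html_attr(html_node(messages, "time.u-dt"), "datetime")
  data.frame(
    username = html_text(html_node(messages, "a.username")),
    date = as.Date(substr(stamp, 1, 10)),
    stringsAsFactors = FALSE
  )
}

# Keep every page's result instead of overwriting a single `soup` object.
posts <- do.call(rbind, lapply(pages, scrape_page))

# Sort on the Date column so the posts end up in chronological order.
posts <- posts[order(posts$date), ]
print(posts)
```

Reading the machine-readable attribute instead of the visible text also sidesteps locale-dependent parsing of month abbreviations with `%b`. Messages without a `time.u-dt` element simply get `NA` as their date, so no `if` check on the date string is needed.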

Create a data frame from the vectors

hardwarezone_posts <- data.frame(username, post, date, user_status)
# Convert the numeric day counts back to dates, then format them as dd/mm/yyyy text
hardwarezone_posts$date <- format(as.Date(hardwarezone_posts$date, origin = "1970-01-01"), "%d/%m/%Y")
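
Since chronological ordering is the goal, it may be worth noting that `format()` returns character strings, and `dd/mm/yyyy` strings sort alphabetically rather than by time. A small base-R sketch with made-up dates (the values and the names `dates` and `as_text` are purely illustrative):

```r
dates <- as.Date(c("2021-09-21", "2020-11-05", "2021-07-02"))

# Formatting turns the dates into text.
as_text <- format(dates, "%d/%m/%Y")
class(as_text)

# Text sorts by leading characters (the day), not by year and month.
sorted_as_text <- sort(as_text)

# Sorting the Date values first and formatting afterwards keeps the order.
sorted_as_dates <- format(sort(dates), "%d/%m/%Y")

print(sorted_as_text)
print(sorted_as_dates)
```

Keeping the column as `Date` until after sorting, and formatting only for display or export, avoids the problem.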

# Print a sample of the data

dput(hardwarezone_posts[1:6,c(1,2,3)])

Output:

structure(list(username = c("jonesftw", "matrix05", "whitecabbage", 
"walceab", "jonesftw", "Ianyhowtalk"), post = c("\n\t\n\t\n\t\t\n\t\t\n\t\t\tOn 14 September 2021, the Ministry of Manpower (MOM) mounted a 12-hour long enforcement operation at 22 locations island-wide as part of an investigation involving a syndicate suspected of bringing foreigners into Singapore on work passes obtained through false declarations. A total of 18 persons were arrested. The investigation is ongoing.\nModus Operandi\n2MOM began its investigations in July 2021 upon obtaining information of a foreigner’s attempts to acquire a work pass illegally. Through detailed analyses over a few months, MOM uncovered a potential syndicate suspected of setting up several shell companies to apply for work passes, even though they had no legitimate business operations.\n3Such syndicates typically recruit Singapore citizens and Singapore permanent residents to receive CPF contributions as “phantom local workers” in order to illegally inflate the companies’ quota to hire foreigners. Based on the inflated quota, the companies would apply for work passes for the foreigners through false declarations and collect kickbacks from them. These foreigners would then enter and remain in Singapore via these illegally obtained work passes. These practices undermine the integrity of our work pass framework.\nPenalties\n4Under the Employment of Foreign Manpower Act (EFMA), individuals convicted of obtaining work passes for a business that does not exist, is not in operation, or does not require the employment of foreigners may be liable to a fine not exceeding $6,000, imprisonment for up to two years, or both, per charge. If convicted for six or more charges, caning will also be imposed.\n5 Employers who hire foreigners seeking illegal employment may be liable to a fine not exceeding $30,000, imprisonment for up to 12 months, or both, per charge. 
Upon conviction, they will be barred from employing foreigners.\n6Foreigners who undertake employment without a valid work pass may be liable to a fine not exceeding $20,000, imprisonment for up to two years, or both. Upon conviction, they will be permanently barred from working in Singapore.\n7Members of the public who are aware of suspicious employment activities such as companies employing foreigners without valid work passes, persons receiving CPF contributions from unknown companies, or know of persons or employers who contravene the EFMA should report the matter to MOM at 64385122 or [email protected]. All information will be kept strictly confidential.\n\t\t\n\t\tClick to expand...\n\t\n\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t18 Arrested for Suspected Illegal Labour Importation\n\t\t\t\t\t\n\t\t\t\t\n\n\t\t\t\tOn 14 September 2021, the Ministry of Manpower (MOM) mounted a 12-hour long enforcement operation at 22 locations island-wide as part of an investigation involving a syndicate suspected of bringing foreigners into Singapore on work passes obtained...\n\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\twww.mom.gov.sg\n\t\t\t\t\n\t\t\t\n\t\t\n\t", 
"Dr Tan is doing enforcement. Jo closes 1( or 3) eyes", "All these 18 people should face 24 strokes of the cane EVERYDAY for the rest of their lives for being traitors", 
"you know what? this practice had been ongoing since many donkey years ago so why they only acting now? cause of PSP debate that brings fire to PAP doorsteps then they must act act a bit? I think whole MOM should be sacked, acting blur for so many donkey years showing that they are lacking the skills and trust of citizens. Why am i paying tax to pay their high salaries?", 
"\n\t\t\t{\n\t\t\t\t\"lightbox_close\": \"Close\",\n\t\t\t\t\"lightbox_next\": \"Next\",\n\t\t\t\t\"lightbox_previous\": \"Previous\",\n\t\t\t\t\"lightbox_error\": \"The requested content cannot be loaded. Please try again later.\",\n\t\t\t\t\"lightbox_start_slideshow\": \"Start slideshow\",\n\t\t\t\t\"lightbox_stop_slideshow\": \"Stop slideshow\",\n\t\t\t\t\"lightbox_full_screen\": \"Full screen\",\n\t\t\t\t\"lightbox_thumbnails\": \"Thumbnails\",\n\t\t\t\t\"lightbox_download\": \"Download\",\n\t\t\t\t\"lightbox_share\": \"Share\",\n\t\t\t\t\"lightbox_zoom\": \"Zoom\",\n\t\t\t\t\"lightbox_new_window\": \"New window\",\n\t\t\t\t\"lightbox_toggle_sidebar\": \"Toggle sidebar\"\n\t\t\t}\n\t\t\t\n\t\t\n\n\n\n\t\t\n\n\n\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t18 people in Singapore arrested for illegal labour importation\n\t\t\t\t\t\n\t\t\t\t\n\n\t\t\t\tInvestigation involves syndicate suspected of bringing foreigners here on work passes obtained through false declarations. Read more at straitstimes.com.\n\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\twww.straitstimes.com\n\t\t\t\t\n\t\t\t\n\t\t\n\t", 
"Last time don’t raid, last year don’t raid. Why raid now.\ngo google and search \nwork permit singapore \neasy PR singapore\nPR singapore\na lot can be find, a lot can be catch. Why only now. ???"
), date = structure(c(18891, 18891, 18891, 18891, 18891, 18891
), class = "Date")), row.names = c(NA, 6L), class = "data.frame")
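
For reference, the `date` column in the output above holds the value 18891 in all six rows. A quick base-R check of what that day count stands for (the name `decoded` is hypothetical):

```r
# Days since 1970-01-01, as stored inside a Date object.
decoded <- as.Date(18891, origin = "1970-01-01")
format(decoded, "%Y-%m-%d")
```

This matches the September 21, 2021 date described in the question.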
r web-scraping dplyr rvest lubridate