Difficulty extracting GenBank accession numbers from species and strain names by web scraping (BeautifulSoup or Selenium)

Question (votes: 0, answers: 1)

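I need to extract specific information from a web page using BeautifulSoup and/or Selenium. I am trying to pull out the record for a particular organism from an NCBI Assembly search results page, but I am running into difficulties.

I tried this: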

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

# Find elements containing the text "JCM 5058"
elements = driver.find_elements(By.XPATH, "//*[contains(text(), 'JCM 5058')]")

if elements:
  print("Text 'JCM 5058' found on the webpage.")
  # Loop through elements and extract text
  text_to_print = ""
  for element in elements:
    text_to_print += element.text + "\n"  # Add newline for readability
  # Print the extracted text
  print(text_to_print)

else:
  print("Text 'JCM 5058' not found on the webpage.")

This is the output I got:

Text 'JCM 5058' found on the webpage.

JCM 5058
("Streptomyces anthocyanicus"[Organism] AND ("Streptomyces anthocyanicus"[Organism] OR JCM 5058[All Fields])) AND (latest[filter] AND all[filter] NOT anomalous[filter])
Streptomyces anthocyanicus JCM 5058 AND (latest[filter] AND all[f... (6)

But the matching section looks like this on the web page:

ASM1465115v1

Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 5058
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
Relation to type material: assembly from type material
GenBank assembly accession: GCA_014651155.1 (latest)
RefSeq assembly accession: GCF_014651155.1 (latest)
IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]

I would like to extract all of this information and print it, ideally as a table.

python selenium-webdriver beautifulsoup biopython
1 Answer
1 vote

I found an answer while working on this, but I am not sure it is the right approach:

from selenium import webdriver
from bs4 import BeautifulSoup

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

# Grab the rendered HTML (with the default page load strategy, driver.get() returns after the page's load event)
page_source = driver.page_source

# Use BeautifulSoup to parse the page source
soup = BeautifulSoup(page_source, 'html.parser')

# Find all div elements containing assembly information
assembly_divs = soup.find_all("div", class_="rprt")

# Loop through each div and check if it contains the desired information
for div in assembly_divs:
    if "JCM 5058" in div.get_text():
        # Print the assembly information
        print(div.get_text().strip())
        break
else:
    print("No matched section found on the webpage.")

# Close the browser
driver.quit()

This prints:

Select item 81211415.ASM1465115v1Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)Infraspecific name: Strain: JCM 5058Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)Date: 2020/09/12Assembly level: ScaffoldGenome representation: fullRelation to type material: assembly from type materialGenBank assembly accession: GCA_014651155.1 (latest) RefSeq assembly accession: GCF_014651155.1 (latest) IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]
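The flattened output above runs the fields together, which makes it hard to turn into a table. If the record text is available with one field per line instead, as in the page text quoted in the question (for example, Selenium's element.text usually keeps visible line breaks), a minimal sketch like the following could split it into label/value pairs. The function name parse_assembly_report and the "Assembly name" key are hypothetical names chosen for illustration, and the sample text is copied from the question rather than fetched from NCBI.

```python
def parse_assembly_report(text):
    """Split 'Label: value' lines into a dict.

    Only the first ': ' on a line is treated as the separator, so a value
    such as 'Strain: JCM 5058' is kept whole.
    """
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        label, sep, value = line.partition(": ")
        if sep:
            record[label] = value
        elif "Assembly name" not in record:
            # Assumption: the first line without a label is the assembly name.
            record["Assembly name"] = line
    return record


# Sample text taken from the question (shortened).
sample = """\
ASM1465115v1

Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 5058
GenBank assembly accession: GCA_014651155.1 (latest)
RefSeq assembly accession: GCF_014651155.1 (latest)
"""

record = parse_assembly_report(sample)

# Print a simple two-column table.
for label, value in record.items():
    print(f"{label:<30}{value}")
```

The resulting dict can also be handed to other tools (for instance csv.DictWriter) if a file is preferred over printed output.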

Another simple approach is:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Open a Chrome browser
driver = webdriver.Chrome()

# Load the webpage
driver.get("https://www.ncbi.nlm.nih.gov/assembly/?term=Streptomyces+anthocyanicus+JCM+5058")

# Find the element containing the GenBank assembly accession using XPath
genbank_element = driver.find_element(By.XPATH, "//dl[contains(., 'JCM 5058')]/following-sibling::dl[6]")

# Extract the GenBank assembly accession text
genbank_accession = genbank_element.text.split(": ")[1]

# Print the GenBank assembly accession
print(genbank_accession)

# Close the browser
driver.quit()

This prints:

GCA_014651155.1 (latest)
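The positional XPath above depends on the exact order of the dl elements, and the result still carries the "(latest)" suffix. As an alternative sketch, the accessions can be pulled out of any text with a regular expression, assuming they follow the usual GCA_/GCF_ prefix, nine digits, and a version number. The name extract_accessions is hypothetical, and the sample string is a fragment of the flattened output shown earlier.

```python
import re

# GCA_ = GenBank assembly, GCF_ = RefSeq assembly; nine digits plus a version.
ACCESSION_RE = re.compile(r"\bGC[AF]_\d{9}\.\d+\b")


def extract_accessions(text):
    """Return all assembly accessions found in text, in order of appearance."""
    return ACCESSION_RE.findall(text)


flattened = (
    "GenBank assembly accession: GCA_014651155.1 (latest) "
    "RefSeq assembly accession: GCF_014651155.1 (latest) "
    "IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]"
)

accessions = extract_accessions(flattened)
print(accessions)
```

The same function works on driver.page_source or on div.get_text(), so it does not matter whether the text was flattened on the way.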