How to avoid inserting duplicate data into MySQL

Problem description  Votes: 0  Answers: 1

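I wrote this code to scrape car information (title, make, model, transmission, year, price) from ebay.com and save it in MySQL. I want to skip inserting a row when all of its items (title, make, model, ...) match another row that is already stored. The row should be skipped only when every item matches, because some titles are similar on their own, or some models, and so on.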

Code:

import requests
from bs4 import BeautifulSoup
import re
import mysql.connector

conn = mysql.connector.connect(user='root', password='******', 
host='127.0.0.1', database='web_scraping')
cursor = conn.cursor()
url = 'https://www.ebay.com/b/Cars-Trucks/6001?_fsrp=0&_sacat=6001&LH_BIN=1&LH_ItemCondition=3000%7C1000%7C2500&rt=nc&_stpos=95125&Model%2520Year=2020%7C2019%7C2018%7C2017%7C2016%7C2015'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
ebay_cars = soup.find_all('li', class_='s-item')
for car_info in ebay_cars:
    title_div = car_info.find('div', class_='s-item__wrapper clearfix')
    title_sub_div = title_div.find('div', class_='s-item__info clearfix')
    title_p = title_sub_div.find('span', class_='s-item__price')
    title_tag = title_sub_div.find('a', class_='s-item__link')
    title_maker = title_sub_div.find(
        'span', class_='s-item__dynamic s-item__dynamicAttributes1')
    title_model = title_sub_div.find(
        'span', class_='s-item__dynamic s-item__dynamicAttributes2')
    title_trans = title_sub_div.find(
        'span', class_='s-item__dynamic s-item__dynamicAttributes3')



    # The steps below run once per listing, so they belong inside the for loop.
    name_of_car = re.sub(r'\d{4}', '', title_tag.text)
    maker_of_car = re.sub(r'Make: ', '', title_maker.text)
    model_of_car = re.sub(r'Model: ', '', title_model.text)
    try:
        if title_trans.text.startswith('Transmission: '):
            trans_of_car = re.sub(r'Transmission: ', '', title_trans.text)
        else:
            trans_of_car = ''
    except AttributeError:
        # find() returned None: this listing has no third attribute.
        trans_of_car = ''
    year_of_car = re.findall(r'\d{4}', title_tag.text)
    year_of_car = ''.join(str(x) for x in year_of_car)

    price_of_car = title_p.text
    print(name_of_car, trans_of_car)
    sql = ('INSERT INTO car_info(Title, Maker, Model, Transmission, Year, Price) '
           'VALUES (%s, %s, %s, %s, %s, %s)')
    cursor.execute(sql, (name_of_car, maker_of_car, model_of_car, trans_of_car,
                         year_of_car, price_of_car))



conn.commit()
conn.close()
python mysql sql sql-update sql-insert
1 Answer
2
votes

One option is to use not exists:

insert into car_info (title, maker, model, transmission, year, price) 
select v.*
from (select %s title, %s maker, %s model, %s transmission, %s year, %s price) v
where not exists (
    select 1
    from car_info c
    where 
        (c.title, c.maker, c.model, c.transmission, c.year, c.price)
         = (v.title, v.maker, v.model, v.transmission, v.year, v.price)
);
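
From Python, the statement above could be wrapped roughly as follows. This is only a sketch: the names INSERT_IF_NEW and insert_if_new are made up for illustration, and the cursor is assumed to be a DB-API cursor using the %s placeholder style, such as the one mysql.connector creates in the question.

```python
# Sketch: run the NOT EXISTS insert through a DB-API cursor.
# INSERT_IF_NEW and insert_if_new are hypothetical names, not part of any library.

INSERT_IF_NEW = """
INSERT INTO car_info (title, maker, model, transmission, year, price)
SELECT v.*
FROM (SELECT %s title, %s maker, %s model, %s transmission, %s year, %s price) v
WHERE NOT EXISTS (
    SELECT 1
    FROM car_info c
    WHERE (c.title, c.maker, c.model, c.transmission, c.year, c.price)
        = (v.title, v.maker, v.model, v.transmission, v.year, v.price)
)
"""


def insert_if_new(cursor, car):
    """Insert one car unless an identical row is already stored.

    `car` is a sequence (title, maker, model, transmission, year, price).
    Returns True when a row was inserted and False when it was skipped.
    """
    cursor.execute(INSERT_IF_NEW, tuple(car))
    # For INSERT ... SELECT, rowcount is the number of rows actually inserted.
    return cursor.rowcount == 1
```

Inside the scraping loop this would replace the plain cursor.execute(sql, ...) call, for example insert_if_new(cursor, (name_of_car, maker_of_car, model_of_car, trans_of_car, year_of_car, price_of_car)).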

But a simpler approach is to create a unique key on all columns of the table, like so:

create unique index idx_car_info_uniq
    on car_info(title, maker, model, transmission, year, price); 

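This prevents any program from inserting duplicates into the table. You can gracefully ignore the error that would otherwise be raised by using the on duplicate key syntax: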

insert into car_info (title, maker, model, transmission, year, price) 
values (%s, %s, %s, %s, %s, %s)
on duplicate key update title = values(title);
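
On the Python side, the scraped rows could then be sent with that statement in one call. A minimal sketch, assuming the unique index above already exists and that the cursor is a DB-API cursor with %s placeholders (as with mysql.connector); UPSERT_SQL and save_cars are hypothetical names chosen for this example.

```python
# Sketch: insert scraped rows and let the unique index drop exact duplicates.
# UPSERT_SQL and save_cars are illustrative names, not part of any library.

UPSERT_SQL = (
    "INSERT INTO car_info (title, maker, model, transmission, year, price) "
    "VALUES (%s, %s, %s, %s, %s, %s) "
    "ON DUPLICATE KEY UPDATE title = VALUES(title)"
)


def save_cars(cursor, cars):
    """Send every car to MySQL; rows that already exist are left unchanged.

    `cars` is an iterable of (title, maker, model, transmission, year, price).
    Returns the number of rows submitted, not the number actually inserted.
    """
    rows = [tuple(car) for car in cars]
    if rows:
        # executemany is the standard DB-API call for running one statement
        # with many parameter tuples.
        cursor.executemany(UPSERT_SQL, rows)
    return len(rows)
```

With this shape, the loop in the question would only collect tuples into a list, and save_cars(cursor, collected_rows) followed by conn.commit() would run once after the loop.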

1
votes

You can store the result of this query in a variable:

SELECT COUNT(*) FROM car_info WHERE Title = <titleValue> AND Maker = <makerValue> AND Model = <modelValue> AND Transmission = <transmissionValue> AND Year = <yearValue> AND Price = <priceValue>

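Then, if the value of the variable is:

  • 1, skip the INSERT, because the table already contains this entry.
  • 0, run the INSERT, because the table does not contain this entry yet.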

This is just one way to do it.
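
A rough Python sketch of this check-then-insert flow, with the conditions joined by AND. The names COUNT_SQL, INSERT_SQL and insert_unless_present are invented for the example, and the cursor is assumed to be a DB-API cursor with %s placeholders, like the mysql.connector cursor in the question.

```python
# Sketch: count matching rows first, then insert only when none exist.
# COUNT_SQL, INSERT_SQL and insert_unless_present are illustrative names.

COUNT_SQL = (
    "SELECT COUNT(*) FROM car_info "
    "WHERE Title = %s AND Maker = %s AND Model = %s "
    "AND Transmission = %s AND Year = %s AND Price = %s"
)
INSERT_SQL = (
    "INSERT INTO car_info (Title, Maker, Model, Transmission, Year, Price) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def insert_unless_present(cursor, car):
    """Insert `car` only if no identical row is stored yet.

    `car` is a sequence (title, maker, model, transmission, year, price).
    Returns True when the row was inserted and False when it was skipped.
    """
    params = tuple(car)
    cursor.execute(COUNT_SQL, params)
    (count,) = cursor.fetchone()
    if count > 0:
        # The table already holds this exact entry.
        return False
    cursor.execute(INSERT_SQL, params)
    return True
```

Because the check and the insert are two separate statements, two programs writing at the same time could both pass the check, so this variant suits a single scraper better than concurrent writers.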


0
votes

Declare all of the table's columns together as the primary key. See https://www.mysqltutorial.org/mysql-primary-key/
