Splitting scraped data


Using Selenium, I scraped the following data from an Expedia web page:

['Select and show fare information for Eurowings flight, departing at 3:20pm from Helsinki, arriving at 4:30pm in Berlin, Priced at $53 One way per traveler.  2 hours 10 minutes total travel time, Nonstop.', 'Select and show fare information for Air Baltic flight, departing at 5:20am from Helsinki, arriving at 7:00pm in Berlin, Priced at $84 One way per traveler, 3 left at this price.  14 hours 40 minutes total travel time, One stop, Layover for 12 hours 0 minutes in Riga.', 'Select and show fare information for Air Baltic flight, departing at 9:20pm from Helsinki, arriving at 7:55am in Berlin, Priced at $94 One way per traveler. Arrives 1 day later. 11 hours 35 minutes total travel time, One stop, Layover for 8 hours 55 minutes in Riga.', 'Select and show fare information for Scandinavian Airlines flight, departing at 10:05am from Helsinki, arriving at 3:15pm in Berlin, Priced at $142 One way per traveler.  6 hours 10 minutes total travel time, One stop, Layover for 3 hours 25 minutes in Copenhagen.• Scandinavian Airlines 1707 and 1673 operated by Sas Connect', 'Select and show fare information for Lufthansa flight, departing at 7:30pm from Helsinki, arriving at 8:05am in Berlin, Priced at $150 One way per traveler. Arrives 1 day later. 13 hours 35 minutes total travel time, One stop, Layover for 9 hours 35 minutes in Munich.', 'Select and show fare information for KLM flight, departing at 1:55pm from Helsinki, arriving at 5:40pm in Berlin, Priced at $163 One way per traveler.  4 hours 45 minutes total travel time, One stop, Layover for 0 hours 55 minutes in Amsterdam.', 'Select and show fare information for KLM flight, departing at 7:00am from Helsinki, arriving at 11:05am in Berlin, Priced at $163 One way per traveler.  
5 hours 5 minutes total travel time, One stop, Layover for 1 hour 0 minutes in Amsterdam.', 'Select and show fare information for KLM flight, departing at 10:40am from Helsinki, arriving at 3:40pm in Berlin, Priced at $163 One way per traveler, 1 left at this price.  6 hours 0 minutes total travel time, One stop, Layover for 2 hours 0 minutes in Amsterdam.• KLM 1166 and 1829 operated by KLM Cityhopper', 'Select and show fare information for KLM flight, departing at 10:40am from Helsinki, arriving at 5:40pm in Berlin, Priced at $163 One way per traveler.  8 hours 0 minutes total travel time, One stop, Layover for 4 hours 5 minutes in Amsterdam.• KLM 1166 operated by KLM Cityhopper', 'Select and show fare information for KLM flight, departing at 1:55pm from Helsinki, arriving at 9:05pm in Berlin, Priced at $163 One way per traveler.  8 hours 10 minutes total travel time, One stop, Layover for 4 hours 2

Now I want to split the data and create a DataFrame:

data_e = {
    "Airline":lst_airline_e,
    "Price":lst_prices_e,
    "Departure Time": departure_time_expedia,
    "Arrival Time": arrival_time_expedia,
    "Duration":lst_duration_e,
    "No of Stops":lst_stops_e,
    "Layover Time":lst_layover_e
}
df_m = pd.DataFrame.from_dict(data_e, orient='index')
df_m

Is it possible to split the data?

python split
1 Answer

As long as the data always follows the same format, you can split it with regular expressions.

Try this:

import re

data = [
    'Select and show fare information for Eurowings flight, departing at 3:20pm from Helsinki, arriving at 4:30pm in Berlin, Priced at $53 One way per traveler. 2 hours 10 minutes total travel time, Nonstop.',
    'Select and show fare information for Air Baltic flight, departing at 5:20am from Helsinki, arriving at 7:00pm in Berlin, Priced at $84 One way per traveler, 3 left at this price. 14 hours 40 minutes total travel time, One stop, Layover for 12 hours 0 minutes in Riga.',
    'Select and show fare information for Air Baltic flight, departing at 9:20pm from Helsinki, arriving at 7:55am in Berlin, Priced at $94 One way per traveler. Arrives 1 day later. 11 hours 35 minutes total travel time, One stop, Layover for 8 hours 55 minutes in Riga.',
    'Select and show fare information for Scandinavian Airlines flight, departing at 10:05am from Helsinki, arriving at 3:15pm in Berlin, Priced at $142 One way per traveler. 6 hours 10 minutes total travel time, One stop, Layover for 3 hours 25 minutes in Copenhagen.• Scandinavian Airlines 1707 and 1673 operated by Sas Connect',
    'Select and show fare information for Lufthansa flight, departing at 7:30pm from Helsinki, arriving at 8:05am in Berlin, Priced at $150 One way per traveler. Arrives 1 day later. 13 hours 35 minutes total travel time, One stop, Layover for 9 hours 35 minutes in Munich.',
    'Select and show fare information for KLM flight, departing at 1:55pm from Helsinki, arriving at 5:40pm in Berlin, Priced at $163 One way per traveler. 4 hours 45 minutes total travel time, One stop, Layover for 0 hours 55 minutes in Amsterdam.',
    'Select and show fare information for KLM flight, departing at 7:00am from Helsinki, arriving at 11:05am in Berlin, Priced at $163 One way per traveler. 5 hours 5 minutes total travel time, One stop, Layover for 1 hour 0 minutes in Amsterdam.',
    'Select and show fare information for KLM flight, departing at 10:40am from Helsinki, arriving at 3:40pm in Berlin, Priced at $163 One way per traveler, 1 left at this price. 6 hours 0 minutes total travel time, One stop, Layover for 2 hours 0 minutes in Amsterdam.• KLM 1166 and 1829 operated by KLM Cityhopper',
    'Select and show fare information for KLM flight, departing at 10:40am from Helsinki, arriving at 5:40pm in Berlin, Priced at $163 One way per traveler. 8 hours 0 minutes total travel time, One stop, Layover for 4 hours 5 minutes in Amsterdam.• KLM 1166 operated by KLM Cityhopper'
]

# Initialize lists to store parsed data
airlines = []
prices = []
departure_times = []
arrival_times = []
durations = []
num_stops = []
layover_times = []

for item in data:
    # Extract airline name
    airline_match = re.search(r'for (.+?) flight', item)
    if airline_match:
        airlines.append(airline_match.group(1))
    else:
        airlines.append("N/A")

    # Extract price
    price_match = re.search(r'Priced at \$([\d.]+)', item)
    if price_match:
        prices.append(float(price_match.group(1)))
    else:
        prices.append(0.0)

    # Extract departure time and arrival time
    time_matches = re.findall(r'(\d{1,2}:\d{2}[ap]m) from (.+?), arriving at (\d{1,2}:\d{2}[ap]m) in', item)
    if time_matches:
        departure_times.append(time_matches[0][0])
        arrival_times.append(time_matches[0][2])
    else:
        departure_times.append("N/A")
        arrival_times.append("N/A")

    # Extract duration
    duration_match = re.search(r'(\d+ hours? \d+ minutes?) total travel time', item)
    if duration_match:
        durations.append(duration_match.group(1))
    else:
        durations.append("N/A")

    # Extract number of stops (only distinguishes "One stop" from nonstop)
    stops_match = re.search(r'One stop', item)
    if stops_match:
        num_stops.append(1)
    else:
        num_stops.append(0)

    # Extract layover time
    layover_match = re.search(r'Layover for (\d+ hours? \d+ minutes?)', item)
    if layover_match:
        layover_times.append(layover_match.group(1))
    else:
        layover_times.append("N/A")

# Create a data table
data_table = {
    'Airline': airlines,
    'Price': prices,
    'Departure Time': departure_times,
    'Arrival Time': arrival_times,
    'Duration': durations,
    'Number of Stops': num_stops,
    'Layover Time': layover_times
}

# Print the data table
for key, value in data_table.items():
    print(f'{key}: {value}')

Output:

Airline: ['Eurowings', 'Air Baltic', 'Air Baltic', 'Scandinavian Airlines', 'Lufthansa', 'KLM', 'KLM', 'KLM', 'KLM']
Price: [53.0, 84.0, 94.0, 142.0, 150.0, 163.0, 163.0, 163.0, 163.0]
Departure Time: ['3:20pm', '5:20am', '9:20pm', '10:05am', '7:30pm', '1:55pm', '7:00am', '10:40am', '10:40am']
Arrival Time: ['4:30pm', '7:00pm', '7:55am', '3:15pm', '8:05am', '5:40pm', '11:05am', '3:40pm', '5:40pm']
Duration: ['2 hours 10 minutes', '14 hours 40 minutes', '11 hours 35 minutes', '6 hours 10 minutes', '13 hours 35 minutes', '4 hours 45 minutes', '5 hours 5 minutes', '6 hours 0 minutes', '8 hours 0 minutes']
Number of Stops: [0, 1, 1, 1, 1, 1, 1, 1, 1]
Layover Time: ['N/A', '12 hours 0 minutes', '8 hours 55 minutes', '3 hours 25 minutes', '9 hours 35 minutes', '0 hours 55 minutes', '1 hour 0 minutes', '2 hours 0 minutes', '4 hours 5 minutes']
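
The durations and times extracted above are still plain strings, which makes them awkward to sort or compare. A minimal sketch of one way to normalize them follows. The helper names to_minutes and to_24_hour are hypothetical (they are not part of the answer's code), and the time conversion assumes an English AM/PM locale.

```python
import re
from datetime import datetime


def to_minutes(text):
    """Convert e.g. '1 hour 0 minutes' to minutes; return None if it does not match."""
    match = re.fullmatch(r'(\d+) hours? (\d+) minutes?', text)
    if match is None:
        return None  # covers placeholders such as 'N/A'
    hours, minutes = (int(part) for part in match.groups())
    return hours * 60 + minutes


def to_24_hour(text):
    """Convert e.g. '3:20pm' to '15:20' (assumes an English AM/PM locale)."""
    return datetime.strptime(text, '%I:%M%p').strftime('%H:%M')


sample_durations = ['2 hours 10 minutes', '1 hour 0 minutes', 'N/A']
print([to_minutes(d) for d in sample_durations])  # [130, 60, None]
print(to_24_hour('3:20pm'), to_24_hour('11:05am'))  # 15:20 11:05
```

The same helpers could be applied to the Duration, Layover Time, Departure Time and Arrival Time lists before building a table from them.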

You can also create a pandas DataFrame from data_table, like this:

import pandas as pd

pd.DataFrame(data_table)

Output:

   Airline                Price  Departure Time  Arrival Time  Duration             Number of Stops  Layover Time
0  Eurowings              53.0   3:20pm          4:30pm        2 hours 10 minutes   0                N/A
1  Air Baltic             84.0   5:20am          7:00pm        14 hours 40 minutes  1                12 hours 0 minutes
2  Air Baltic             94.0   9:20pm          7:55am        11 hours 35 minutes  1                8 hours 55 minutes
3  Scandinavian Airlines  142.0  10:05am         3:15pm        6 hours 10 minutes   1                3 hours 25 minutes
4  Lufthansa              150.0  7:30pm          8:05am        13 hours 35 minutes  1                9 hours 35 minutes
5  KLM                    163.0  1:55pm          5:40pm        4 hours 45 minutes   1                0 hours 55 minutes
6  KLM                    163.0  7:00am          11:05am       5 hours 5 minutes    1                1 hour 0 minutes
7  KLM                    163.0  10:40am         3:40pm        6 hours 0 minutes    1                2 hours 0 minutes
8  KLM                    163.0  10:40am         5:40pm        8 hours 0 minutes    1                4 hours 5 minutes
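
As an alternative to running several separate searches per string, the same fields can be captured in one pass with a single pattern that uses named groups. This is only a sketch: FLIGHT_PATTERN and parse_flight are hypothetical names, and the pattern assumes every string follows the layout of the sample data (one optional layover, whole or decimal dollar prices). Strings with a different layout return None instead of a record.

```python
import re

# One pattern with named groups. The layover part is optional because
# nonstop flights do not have one.
FLIGHT_PATTERN = re.compile(
    r'for (?P<airline>.+?) flight, '
    r'departing at (?P<departure>\d{1,2}:\d{2}[ap]m) from (?P<origin>.+?), '
    r'arriving at (?P<arrival>\d{1,2}:\d{2}[ap]m) in (?P<destination>.+?), '
    r'Priced at \$(?P<price>\d+(?:\.\d+)?).*?'
    r'(?P<duration>\d+ hours? \d+ minutes?) total travel time, '
    r'(?P<stops>[^,.]+)'
    r'(?:, Layover for (?P<layover>\d+ hours? \d+ minutes?) in (?P<layover_city>[^.•]+))?'
)


def parse_flight(text):
    """Return a dict of fields for one scraped string, or None if it does not match."""
    match = FLIGHT_PATTERN.search(text)
    if match is None:
        return None
    record = match.groupdict()  # unmatched optional groups come back as None
    record['price'] = float(record['price'])
    return record


sample = [
    'Select and show fare information for Eurowings flight, departing at 3:20pm from Helsinki, arriving at 4:30pm in Berlin, Priced at $53 One way per traveler. 2 hours 10 minutes total travel time, Nonstop.',
    'Select and show fare information for Air Baltic flight, departing at 5:20am from Helsinki, arriving at 7:00pm in Berlin, Priced at $84 One way per traveler, 3 left at this price. 14 hours 40 minutes total travel time, One stop, Layover for 12 hours 0 minutes in Riga.',
]
records = [parse_flight(text) for text in sample]
for record in records:
    print(record)
```

Because each record is a dict with the same keys, a list of such records can be passed straight to pd.DataFrame(records) to get one row per flight.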