Splitting strings while looping through a DataFrame

Question  Votes: 2  Answers: 5

I am trying to use Python to loop through a DataFrame column whose values are formatted like this:

Town 1, AL, USA
Town 2, AL, USA
Town 3, AK, USA
Town 4, CA, USA
Town 5, DE, USA
Town 6, MI, USA

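I have been trying to use the split() method on the original DataFrame (which also includes crime description and URL columns) as well as on the column by itself, both as a DataFrame and as a Series object. None of these objects has a split() method available.

The desired output is another column containing the STATE abbreviation, so as I understand it I am looking for the equivalent of df.split(', ') followed by taking index [1] of that split for the Series or DataFrame. (Please correct me if I am wrong.)

How can I do this?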

python string pandas split dataframe
5 Answers
7
Votes

You can use the vectorized string methods, e.g. df["col"].str.split(", ").str[1]

>>> df
               col
0  Town 1, AL, USA
1  Town 2, AL, USA
2  Town 3, AK, USA
3  Town 4, CA, USA
4  Town 5, DE, USA
5  Town 6, MI, USA
>>> df["col"].str.split(", ")
0    [Town 1, AL, USA]
1    [Town 2, AL, USA]
2    [Town 3, AK, USA]
3    [Town 4, CA, USA]
4    [Town 5, DE, USA]
5    [Town 6, MI, USA]
Name: col, dtype: object
>>> df["col"].str.split(", ").str[1]
0    AL
1    AL
2    AK
3    CA
4    DE
5    MI
Name: col, dtype: object
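
To get the extra column the question asks for, the expression above can be assigned back to the frame. This is a minimal sketch, assuming the column is named `col` as in this answer. The `state` column name and the malformed last row are my own additions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"col": ["Town 1, AL, USA", "Town 2, AL, USA", "Town 3, AK, USA", "Nowhere"]}
)

# Split each string on ", " and keep the second piece (index 1) as a new column.
# Rows with fewer than two pieces get a missing value instead of raising an error.
df["state"] = df["col"].str.split(", ").str[1]

print(df)
```

Unlike a plain Python `x.split(", ")[1]`, the `.str[1]` accessor does not fail on a row that has no second piece, which can matter with messy address data.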

3
Votes

Use .apply() to run a function on each element of the column

import pandas as pd

data=[
    'Town 1, AL, USA',
    'Town 2, AL, USA',
    'Town 3, AK, USA',
    'Town 4, CA, USA',
    'Town 5, DE, USA',
    'Town 6, MI, USA',
]

df = pd.DataFrame( data )

print(df)

# Split on ', ' (comma plus space) so the state has no leading space
df['state'] = df[0].apply(lambda x: x.split(', ')[1])

print(df)

Result

                 0
0  Town 1, AL, USA
1  Town 2, AL, USA
2  Town 3, AK, USA
3  Town 4, CA, USA
4  Town 5, DE, USA
5  Town 6, MI, USA

                 0 state
0  Town 1, AL, USA    AL
1  Town 2, AL, USA    AL
2  Town 3, AK, USA    AK
3  Town 4, CA, USA    CA
4  Town 5, DE, USA    DE
5  Town 6, MI, USA    MI

Edit:

By the way: I searched the internet for "pandas split column to new columns", and you can even split the column into 3 new columns this way:

def split_more(x):
    # Return the pieces as a Series so that apply() expands them into columns
    return pd.Series( x.split(', ') )

df[ ['town', 'state','country'] ] = df[0].apply(split_more)

print(df)

Result:

                 0    town state country
0  Town 1, AL, USA  Town 1    AL     USA
1  Town 2, AL, USA  Town 2    AL     USA
2  Town 3, AK, USA  Town 3    AK     USA
3  Town 4, CA, USA  Town 4    CA     USA
4  Town 5, DE, USA  Town 5    DE     USA
5  Town 6, MI, USA  Town 6    MI     USA
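
The same three-column result can also be obtained without `apply()`, using the `expand=True` option of `str.split`. This is a sketch of an alternative, not the approach this answer uses. The `addr` column name is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame({"addr": ["Town 1, AL, USA", "Town 2, AL, USA", "Town 3, AK, USA"]})

# expand=True returns a DataFrame with one column per piece
# instead of a Series of lists, so it can be assigned to three columns at once.
df[["town", "state", "country"]] = df["addr"].str.split(", ", expand=True)

print(df)
```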

2
Votes

Series have string methods that can be accessed through their str attribute. For example, you can use df['addr'].str.extract

In [34]: df = pd.read_table('data', sep='-', header=None, names=['addr'])

In [35]: df
Out[35]: 
              addr
0  Town 1, AL, USA
1  Town 2, AL, USA
2  Town 3, AK, USA
3  Town 4, CA, USA
4  Town 5, DE, USA
5  Town 6, MI, USA

In [36]: df[['Town', 'State', 'Country']] = df['addr'].str.extract(r'([^,]+),\s*([^,]+),\s*([^,]+)')

In [38]: del df['addr']

yields

In [39]: df
Out[39]: 
     Town State Country
0  Town 1    AL     USA
1  Town 2    AL     USA
2  Town 3    AK     USA
3  Town 4    CA     USA
4  Town 5    DE     USA
5  Town 6    MI     USA
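
A variation on the same idea is to use named groups in the regular expression, in which case `str.extract` names the resulting columns itself. This is a sketch, assuming the `addr` column from this answer. The `\s*` after each comma is my addition so that the captured values carry no leading space.

```python
import pandas as pd

df = pd.DataFrame({"addr": ["Town 1, AL, USA", "Town 3, AK, USA"]})

# Each (?P<name>...) group becomes a column called `name` in the result.
pattern = r"(?P<Town>[^,]+),\s*(?P<State>[^,]+),\s*(?P<Country>[^,]+)"
parts = df["addr"].str.extract(pattern)

print(parts)
```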

0
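Votes

Based on comparing different approaches with %timeit, I have found that list comprehensions are usually the winner when working with strings in a column.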

In [1]: %paste 
import pandas as pd

data=[
    'Town 1, AL, USA',
    'Town 2, AL, USA',
    'Town 3, AK, USA',
    'Town 4, CA, USA',
    'Town 5, DE, USA',
    'Town 6, MI, USA',
]

df = pd.DataFrame(data)
df

## -- End pasted text --
Out[1]: 
                 0
0  Town 1, AL, USA
1  Town 2, AL, USA
2  Town 3, AK, USA
3  Town 4, CA, USA
4  Town 5, DE, USA
5  Town 6, MI, USA

%timeit tests:

In [2]: %timeit df['state'] = [x.split(',')[1] for x in df[0]]
1000 loops, best of 3: 350 µs per loop

In [3]: %timeit df['state'] = df[0].apply(lambda x: x.split(',')[1])
1000 loops, best of 3: 671 µs per loop

In [4]: %timeit df['state'] = df[0].str.split(", ").str[1]
100 loops, best of 3: 1.1 ms per loop
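
The timings above come from a six-row frame, so it may be worth repeating the comparison on data closer to your own size. The following sketch uses the standard `timeit` module outside IPython. The `addr` column name, the 3000-row frame and the function names are assumptions made for the example, and no particular ranking is implied.

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    {"addr": ["Town 1, AL, USA", "Town 3, AK, USA", "Town 6, MI, USA"] * 1000}
)

def with_comprehension():
    return [x.split(", ")[1] for x in df["addr"]]

def with_apply():
    return df["addr"].apply(lambda x: x.split(", ")[1])

def with_str_accessor():
    return df["addr"].str.split(", ").str[1]

# Time each approach over the same number of calls and report the mean per call.
runs = 20
for func in (with_comprehension, with_apply, with_str_accessor):
    seconds = timeit.timeit(func, number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e3:.3f} ms per call")
```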

0
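Votes

The function split_str_columns_df loops over all the given string columns and splits them in one call.

It also generates new columns from the splits and deletes the old ones.

You choose your separator: " " or "," or ....

Just set it in the function definition shown further down: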

new = df[col].str.split(" ", n = 1, expand = True) 

Or, if you want to split on ", " into 3 columns (n = 2), you will have to adjust the function a little to add a third column:

new = df[col].str.split(", ", n = 2, expand = True) 
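
One way the three-column adjustment could look is sketched below. The function name `split_str_columns`, its `sep` and `n` parameters, and the `_a`, `_b`, `_c` suffix scheme are hypothetical generalisations, not the answer's own code. Unlike the answer's function, this version works on a copy and leaves the input frame untouched.

```python
import numpy as np
import pandas as pd

def split_str_columns(dataframe, str_columns, sep=", ", n=2):
    """Hypothetical generalisation: split each column into n + 1 new columns."""
    df = dataframe.copy()  # leave the caller's frame unchanged
    suffixes = "abcdefghijklmnopqrstuvwxyz"
    for col in str_columns:
        parts = df[col].str.split(sep, n=n, expand=True)
        # Guarantee n + 1 columns even if no row contains n separators.
        parts = parts.reindex(columns=range(n + 1))
        for i in range(n + 1):
            df[f"{col}_{suffixes[i]}"] = parts[i]
        df = df.drop(columns=[col])
    return df

addresses = pd.DataFrame({"addr": ["Town 1, AL, USA", "Town 2, AL", np.nan]})
result = split_str_columns(addresses, ["addr"])
print(result)
```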

Example Data: (the whole example data is located at the end of this post)

data_df.head(3)

    Rating          Score    Ocupation
0   RATINGSTUFE F   NaN      Animator Senior
1   RATINGSTUFE B   4.0      Animator
2   NaN             7.0      Art administrator

Call the function: split_str_columns_df(data_df,columns)

The columns I want to split are 'Rating' and 'Ocupation'

columns=['Rating','Ocupation']
dff=split_str_columns_df(data_df,columns)

Output:

   Score     Rating_a Rating_b Ocupation_a    Ocupation_b
0    NaN  RATINGSTUFE        F    Animator         Senior
1    4.0  RATINGSTUFE        B    Animator           None
2    7.0          NaN      NaN         Art  administrator

The definition of split_str_columns_df(data_df,columns) that I used is:

def split_str_columns_df(dataframe, str_columns):
    '''Split each string column on " " into two new columns and remove the original.

    If the column's name is 'Name', the two new columns will be 'Name_a' and 'Name_b'.
    Note: the DataFrame passed in is modified in place and also returned.'''
    df = dataframe
    for col in str_columns:
        # Split at the first separator only, giving at most two pieces
        new = df[col].str.split(" ", n=1, expand=True)
        # First piece goes to '<col>_a', the remainder to '<col>_b'
        df[col + '_a'] = new[0]
        df[col + '_b'] = new[1]
        # Drop the original column
        df.drop(columns=[col], inplace=True)
    return df

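Be aware:

  1. When a NaN value is split, both new columns get NaN (columns Rating_a and Rating_b).
  2. If a row contains only one word, the second column gets None after the split (column Ocupation_b).
  3. Note that the original columns Rating and Ocupation are deleted, and we now have Rating_a, Rating_b, Ocupation_a and Ocupation_b.

Data used to generate the example: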
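
The first two points can be checked on a small Series. This is only a sketch, with values borrowed from the example data. Whether a missing piece is shown as None or NaN may depend on the pandas version, so the example relies on `isna()` rather than on a specific marker.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Animator Senior", "Animator", np.nan])

# n=1 splits at the first space only; expand=True gives one column per piece.
parts = s.str.split(" ", n=1, expand=True)

print(parts)
# Row 1 has a single word, so its second piece is missing.
# Row 2 was NaN to begin with, so both pieces are missing.
print(parts.isna())
```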

import numpy as np
import pandas as pd

data_df=pd.DataFrame(['RATINGSTUFE F', 'RATINGSTUFE B',np.nan, 'RATINGSTUFE L',
   'RATINGSTUFE G', np.nan, 'RATINGSTUFE M', 'RATINGSTUFE L',
   'RATINGSTUFE F', 'RATINGSTUFE M'], columns=['Rating'])

data_df['Score']=[np.nan,4,7,4,9,4,3,1,2,5]
data_df['Ocupation']=['Animator Senior', 'Animator', 'Art administrator', 'Animator Junior', 'Dancer', 'Colorist Junior', 'Ceramics artist', 'Chief creative officer','Colorist', 'Dancer']