How do I split a DataFrame column into more columns, conditional on another column's value?


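I'm stuck because I can't split a DataFrame column into more columns conditional on another column's value. I have a pandas DataFrame that I generated directly from a ".csv" file with more than 100K rows.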

Excerpt 1:

I want to split the dca column into more columns on ',' (comma). The number of splits should be limited by the value in n_mppts.
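As a rough illustration of that goal, here is a minimal sketch on made-up data. The three sample rows and the `dca_mppt_` column prefix are assumptions for illustration, not taken from the real file. It splits `dca` on commas and blanks out every piece whose position is at or beyond that row's `n_mppts`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real file has more columns and rows.
df = pd.DataFrame({
    "n_mppts": [2, 3, 1],
    "dca": ["2.3,2.3,0,0", "1.0,1.5,2.0,0", None],
})

# Split on commas into one column per piece (missing pieces stay empty).
pieces = df["dca"].str.split(",", expand=True)

# Convert every piece to a number; anything unparsable becomes NaN.
pieces = pieces.apply(pd.to_numeric, errors="coerce")

# Keep only the first n_mppts pieces of each row, blank out the rest.
positions = np.arange(pieces.shape[1])
keep = positions[None, :] < df["n_mppts"].to_numpy()[:, None]
pieces = pieces.where(keep).add_prefix("dca_mppt_")

result = pd.concat([df, pieces], axis=1)
print(result)
```

This avoids a row-wise `apply`, which may matter on a few hundred thousand rows, but it only works while `dca` holds comma-separated strings.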

Edited 2023-04-12:

With the following code I can successfully split the column in the DataFrame generated from this .csv file (thanks to @Abdulmajeed for the solution):

def split_dca(row):
    values = row['dca'].split(',') if row['dca'] else []
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    values = values[:row['n_mppts']]
    return pd.Series(values)

df_dca_dcv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418643 entries, 0 to 418642
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   pipe_id      418643 non-null  int64 
 1   date         418643 non-null  object
 2   inverter_id  418643 non-null  object
 3   n_mppts      418643 non-null  int64 
 4   dca          418538 non-null  object
 5   dcv          418538 non-null  object
dtypes: int64(2), object(4)
memory usage: 19.2+ MB

df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('{', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('}', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str)

Excerpt 2:

mppts_dca = df_dca_dcv.apply(split_dca, axis=1)
mppts_dca['dca_mppt_0'] = pd.to_numeric(mppts_dca[0], errors='coerce')
mppts_dca['dca_mppt_1'] = pd.to_numeric(mppts_dca[1], errors='coerce')
mppts_dca['dca_mppt_2'] = pd.to_numeric(mppts_dca[2], errors='coerce')
mppts_dca['dca_mppt_3'] = pd.to_numeric(mppts_dca[3], errors='coerce')
mppts_dca['dca_mppt_4'] = pd.to_numeric(mppts_dca[4], errors='coerce')
mppts_dca['dca_mppt_5'] = pd.to_numeric(mppts_dca[5], errors='coerce')
mppts_dca['dca_mppt_6'] = pd.to_numeric(mppts_dca[6], errors='coerce')
mppts_dca['dca_mppt_7'] = pd.to_numeric(mppts_dca[7], errors='coerce')
mppts_dca['dca_mppt_8'] = pd.to_numeric(mppts_dca[8], errors='coerce')
df_dca_dcv = pd.concat([df_dca_dcv, mppts_dca], axis=1)
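
The nine near-identical `pd.to_numeric` lines above could probably be collapsed into one statement. This is a minimal sketch, assuming `mppts_dca` has integer column labels as produced by `apply(split_dca, axis=1)`; the small frame below is a hypothetical stand-in. Unlike the original code, it renames the columns instead of keeping both the numbered and the named copies.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by df_dca_dcv.apply(split_dca, axis=1).
mppts_dca = pd.DataFrame({0: ["2.3", "2.6"], 1: ["2.3", "nan"], 2: [None, "x"]})

# Convert every column at once, then rename 0, 1, 2, ... to dca_mppt_0, dca_mppt_1, ...
mppts_dca = mppts_dca.apply(pd.to_numeric, errors="coerce").add_prefix("dca_mppt_")
print(mppts_dca)
```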

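Excerpt 3:

However, when I generate the DataFrame from a pandas SQL query that specifies inverter_id = a2, I run into a problem and the current solution no longer works (the same problem occurs for other inverter_id values):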

df_dca_dcv = pd.read_sql_query("select pipe_id,created_at as date,inverter_id,n_mppts,dca,dcv from inverters where inverter_id = 'a2' order by pipe_id, inverter_id, date;", con=con) # connected to a postgreSQL db
df_dca_dcv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16507 entries, 0 to 16506
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   pipe_id      16507 non-null  object        
 1   date         16507 non-null  datetime64[ns]
 2   inverter_id  16507 non-null  object        
 3   n_mppts      16507 non-null  int64         
 4   dca          16428 non-null  object        
 5   dcv          16428 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 773.9+ KB

The dca column's Dtype is still object, but its values are now enclosed in "[]" instead of "{}" (unlike Excerpt 1), and when I run this:

df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('[', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace(']', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str)

I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[6], line 2
      1 df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace("[", "")
----> 2 df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace("]", "")

File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)

File ~\Anaconda3\lib\site-packages\pandas\core\accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
    179 if obj is None:
    180     # we're accessing the attribute of the class, i.e., Dataset.geo
    181     return self._accessor
--> 182 accessor_obj = self._accessor(obj)
    183 # Replace the property with the accessor object. Inspired by:
    184 # https://www.pydanny.com/cached-property.html
    185 # We need to use object.__setattr__ because we overwrite __setattr__ on
    186 # NDFrame
    187 object.__setattr__(obj, self._name, accessor_obj)

File ~\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py:177, in StringMethods.__init__(self, data)
    174 def __init__(self, data):
    175     from pandas.core.arrays.string_ import StringDtype
--> 177     self._inferred_dtype = self._validate(data)
    178     self._is_categorical = is_categorical_dtype(data.dtype)
    179     self._is_string = isinstance(data.dtype, StringDtype)

File ~\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py:231, in StringMethods._validate(data)
    228 inferred_dtype = lib.infer_dtype(values, skipna=True)
    230 if inferred_dtype not in allowed_types:
--> 231     raise AttributeError("Can only use .str accessor with string values!")
    232 return inferred_dtype

AttributeError: Can only use .str accessor with string values!
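
The message suggests that, at the point of the failing call, the column did not hold strings. The traceback points at the second `.str.replace` call, so the first call appears to have run; it may have replaced the non-string values with NaN, leaving nothing string-like for the second call. One way to check what the column really holds is to count the Python types stored in it. The sketch below uses a hypothetical two-row stand-in for the query result (a list of `Decimal` values and a `None`), an assumption based on the sample output shown later in the question.

```python
from decimal import Decimal
import pandas as pd

# Hypothetical stand-in for the 'dca' column returned by read_sql_query.
dca = pd.Series([[Decimal("2.3"), Decimal("2.3"), Decimal("0")], None], name="dca")

# Count how many cells hold each Python type.
type_counts = dca.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

If the count shows `list` rather than `str`, string methods such as `.str.replace` have no text to act on, and it is probably simpler to work with the lists directly.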

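I then moved the ".astype(str)" call ahead of the ".str.replace(...)" calls. However, when I now look at the DataFrame

Excerpt 4:

the dca column values are formatted differently from those in Excerpt 2 (e.g. "Decimal('2.2'),Decimal('2.2'..."). When I go on to run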

mppts_dca = df_dca_dcv.apply(split_dca, axis=1)
df_dca_dcv = pd.concat([df_dca_dcv, mppts_dca], axis=1)
df_dca_dcv['date'] = df_dca_dcv['date'].astype('datetime64[ns]')
df_dca_dcv['dca_mppt_0'] = pd.to_numeric(df_dca_dcv[0], errors='coerce')
df_dca_dcv['dca_mppt_1'] = pd.to_numeric(df_dca_dcv[1], errors='coerce')

the dca values are not carried over to the newly split columns, (I think) because "pd.to_numeric(" cannot parse "Decimal(...)":
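
If the column really holds Python lists of `Decimal` objects, the round trip through strings may not be needed at all. The following is a minimal sketch under that assumption; `split_list_column` and the sample frame are hypothetical names and data introduced here for illustration, not part of the original code. It takes the first `n_mppts` items of each list, converts them to `float`, and pads short rows with NaN.

```python
from decimal import Decimal
import pandas as pd

def split_list_column(df, col, count_col, prefix):
    """Spread the first `count_col` items of each list in `col` over float columns."""
    width = int(df[count_col].max())  # assumes df is non-empty

    def first_n(values, n):
        # Cells without a list (e.g. None from a database NULL) become all-NaN rows.
        if not isinstance(values, (list, tuple)):
            values = []
        floats = [float(v) for v in values[:n]]
        return floats + [float("nan")] * (width - len(floats))

    rows = [first_n(v, n) for v, n in zip(df[col], df[count_col])]
    names = [f"{prefix}{i}" for i in range(width)]
    return pd.DataFrame(rows, index=df.index, columns=names)

# Hypothetical sample shaped like the query result.
df = pd.DataFrame({
    "n_mppts": [2, 1, 2],
    "dca": [
        [Decimal("2.3"), Decimal("2.3"), Decimal("0")],
        [Decimal("6"), Decimal("5.9"), Decimal("0")],
        None,
    ],
})
df = pd.concat([df, split_list_column(df, "dca", "n_mppts", "dca_mppt_")], axis=1)
print(df)
```

The same helper could be called a second time for `dcv` with a different prefix.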

Excerpt 5:

I tried all of the following methods to convert the dca column to strings:

METHOD1: df_dca_dcv['dca'] = df_dca_dcv['dca'].map(str) #produced same output format as before
METHOD2: df_dca_dcv['dca'] = df_dca_dcv['dca'].apply(str) #produced same output format as before
METHOD3: df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str) #generated the following error:

ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv['dca'].values.astype(str)
ValueError: setting an array element with a sequence

METHOD4: df_dca_dcv['dca'] = df_dca_dcv['dca'].values.astype(str) #generated same error as METHOD3

METHOD5: df_dca_dcv['dca'] = df_dca_dcv['dca'].applymap(str)  #generated the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv['dca'].applymap(str)
File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'applymap'

METHOD6:
def convert_float_string(row):
    float_list = row['dca']
    if len(float_list) > 0:
        string_list = ["%.2f" % i for i in float_list]
    else:
        string_list = float('NaN')
    return string_list
df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1) #generated the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1)
File ~\Anaconda3\lib\site-packages\pandas\core\frame.py:8839, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   8828 from pandas.core.apply import frame_apply
   8830 op = frame_apply(
   8831     self,
   8832     func=func,
   (...)
   8837     kwargs=kwargs,
   8838 )
-> 8839 return op.apply().__finalize__(self, method="apply")
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:727, in FrameApply.apply(self)
    724 elif self.raw:
    725     return self.apply_raw()
--> 727 return self.apply_standard()
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:851, in FrameApply.apply_standard(self)
    850 def apply_standard(self):
--> 851     results, res_index = self.apply_series_generator()
    853     # wrap results
    854     return self.wrap_results(results, res_index)
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:867, in FrameApply.apply_series_generator(self)
    864 with option_context("mode.chained_assignment", None):
    865     for i, v in enumerate(series_gen):
    866         # ignore SettingWithCopy here in case the user mutates
--> 867         results[i] = self.f(v)
    868         if isinstance(results[i], ABCSeries):
    869             # If we have a view on v, we need to make a copy because
    870             #  series_generator will swap out the underlying data
    871             results[i] = results[i].copy(deep=False)
Cell In[8], line 1, in <lambda>(row)
----> 1 df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1)
Cell In[6], line 3, in convert_float_string(row)
      1 def convert_float_string(row):
      2     float_list = row['dca']
----> 3     if len(float_list) > 0:
      4         string_list = ["%.2f" % i for i in float_list]
      5     else:
TypeError: object of type 'NoneType' has no len()
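
Some of these results have plausible explanations. Calling `str` on a list formats each element with its `repr`, which would explain why METHOD1 and METHOD2 keep the "Decimal('...')" text. `applymap` is a `DataFrame` method, so METHOD5 fails on a `Series`, where `map` is the per-element equivalent. METHOD6 fails on rows where `dca` is `None`. The sketch below uses a hypothetical two-row stand-in for the column, and `to_csv_text` is a name introduced here for illustration. It converts each element on its own and guards against missing rows.

```python
from decimal import Decimal
import pandas as pd

# Hypothetical stand-in for the 'dca' column: one list of Decimals and one NULL.
dca = pd.Series([[Decimal("2.3"), Decimal("0")], None])

# str() on a list formats each element with repr(), hence the Decimal('...') text.
as_repr = dca.map(str)

def to_csv_text(values):
    # Cells that are not lists (None, NaN) are returned as NaN.
    if not isinstance(values, (list, tuple)):
        return float("nan")
    # Convert each element on its own, then join with commas.
    return ",".join(str(v) for v in values)

as_csv = dca.map(to_csv_text)
print(as_repr.iloc[0])
print(as_csv.iloc[0])
```

Strings produced this way have the same comma-separated shape as the `.csv` case, so the existing `split_dca` function should be able to consume them.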

...If I simply skip converting dca to strings and use

df_dca_dcv['dca'] = df_dca_dcv['dca'].replace("[", "")
df_dca_dcv['dca'] = df_dca_dcv['dca'].replace("]", "")

the replacement does not happen.
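
This is expected behavior for `Series.replace`: without `regex=True` it matches whole cell values, not substrings, so a cell is only changed if it is exactly equal to "[". The sketch below shows the difference on a hypothetical column of strings. In the question the cells appear to be lists, not strings, so neither call would strip brackets there.

```python
import pandas as pd

# Hypothetical column of bracketed strings.
s = pd.Series(["[2.3,2.3]", "[0,0]"])

# Series.replace looks for cells equal to "[", so nothing changes here.
whole_value = s.replace("[", "")

# Series.str.replace works inside each string; regex=False treats "[" literally.
substring = s.str.replace("[", "", regex=False).str.replace("]", "", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```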

I would appreciate any suggestions on how to solve this.

Edited 2023-04-19:

Here is the output requested by @Corralien:

{'pipe_id': {0: '10755', 1: '10755', 2: '10755', 3: '10755', 4: '10755'}, 'date': {0: Timestamp('2022-09-06 12:15:58.451439'), 1: Timestamp('2022-09-06 12:21:16.626511'), 2: Timestamp('2022-09-06 12:26:31.371399'), 3: Timestamp('2022-09-06 12:47:13.346493'), 4: Timestamp('2022-09-06 12:52:37.908956')}, 'inverter_id': {0: 'a2', 1: 'a2', 2: 'a2', 3: 'a2', 4: 'a2'}, 'n_mppts': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}, 'dca': {0: [Decimal('2.3'), Decimal('2.3'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 1: [Decimal('2.6'), Decimal('2.6'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 2: [Decimal('2.9'), Decimal('2.9'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 3: [Decimal('6'), Decimal('5.9'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 4: [Decimal('3.9'), Decimal('3.9'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')]}, 'dcv': {0: [Decimal('388.3'), Decimal('432.7'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 1: [Decimal('390.7'), Decimal('432.7'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 2: [Decimal('388.2'), Decimal('430.7'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 3: [Decimal('390.4'), Decimal('435.7'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 4: [Decimal('382.9'), Decimal('424.3'), Decimal('0'), Decimal('0'), Decimal('0'), 
Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')]}}
python pandas string split tuples
1 Answer

I'm not sure I understood your question correctly, but you can use a custom function with apply, as in the example below. Hope it helps.

def split_dca(row):
    values = row['dca'].split(',') if row['dca'] else []
    
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    
    values = values[:row['n_mppts']]
    
    return pd.Series(values)

split_columns = df.apply(split_dca, axis=1)