I'm stuck because I cannot split a DataFrame column into more columns conditioned on another column's value. I have a pandas DataFrame that I generated directly from a '.csv' file with more than 100K rows.
Excerpt 1:
I want to split the dca column into more columns on ',' (comma). The number of splits is limited by the value in n_mppts.
Edit 2023-04-12:
I can successfully split the column in the DataFrame generated from this .csv file using the following code (thanks to @Abdulmajeed's solution):
def split_dca(row):
    values = row['dca'].split(',') if row['dca'] else []
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    values = values[:row['n_mppts']]
    return pd.Series(values)
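As a quick sanity check, the function can be exercised on a tiny made-up frame (the column names mirror the real data, but the values here are invented):

```python
import pandas as pd

def split_dca(row):
    # Split the comma-separated string into a list (empty list for null/empty cells)
    values = row['dca'].split(',') if row['dca'] else []
    # Pad with NaN up to n_mppts, then truncate to exactly n_mppts entries
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    values = values[:row['n_mppts']]
    return pd.Series(values)

# Hypothetical miniature frame shaped like the CSV-derived data
df = pd.DataFrame({'n_mppts': [2, 3], 'dca': ['1.1,2.2', '3.3']})
print(df.apply(split_dca, axis=1))
```

Rows whose list is shorter than the widest row are padded with NaN by the row-wise apply, which is exactly the behaviour the padding inside the function relies on.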
df_dca_dcv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418643 entries, 0 to 418642
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pipe_id 418643 non-null int64
1 date 418643 non-null object
2 inverter_id 418643 non-null object
3 n_mppts 418643 non-null int64
4 dca 418538 non-null object
5 dcv 418538 non-null object
dtypes: int64(2), object(4)
memory usage: 19.2+ MB
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('{', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('}', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str)
Excerpt 2:
mppts_dca = df_dca_dcv.apply(split_dca, axis=1)
mppts_dca['dca_mppt_0'] = pd.to_numeric(mppts_dca[0], errors='coerce')
mppts_dca['dca_mppt_1'] = pd.to_numeric(mppts_dca[1], errors='coerce')
mppts_dca['dca_mppt_2'] = pd.to_numeric(mppts_dca[2], errors='coerce')
mppts_dca['dca_mppt_3'] = pd.to_numeric(mppts_dca[3], errors='coerce')
mppts_dca['dca_mppt_4'] = pd.to_numeric(mppts_dca[4], errors='coerce')
mppts_dca['dca_mppt_5'] = pd.to_numeric(mppts_dca[5], errors='coerce')
mppts_dca['dca_mppt_6'] = pd.to_numeric(mppts_dca[6], errors='coerce')
mppts_dca['dca_mppt_7'] = pd.to_numeric(mppts_dca[7], errors='coerce')
mppts_dca['dca_mppt_8'] = pd.to_numeric(mppts_dca[8], errors='coerce')
df_dca_dcv = pd.concat([df_dca_dcv, mppts_dca], axis=1)
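As an aside, the nine near-identical pd.to_numeric lines above can be collapsed into a loop over the integer-labelled columns that apply produced (same behaviour, just less repetition; the frame below is a toy stand-in for the real mppts_dca):

```python
import pandas as pd

# Toy stand-in for mppts_dca: apply(split_dca, axis=1) yields integer-labelled columns
mppts_dca = pd.DataFrame({0: ['1.5', 'x'], 1: ['2.0', '3.5']})

# One dca_mppt_<i> column per positional column; unparseable values become NaN
for i in list(mppts_dca.columns):
    mppts_dca[f'dca_mppt_{i}'] = pd.to_numeric(mppts_dca[i], errors='coerce')
```

The list(...) snapshot matters: the loop adds columns while iterating, so iterating over the live .columns index would also visit the newly added string-labelled columns.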
Excerpt 3:
However, when I generate the DataFrame from a pandas SQL query that selects inverter_id = 'a2', I run into a problem that makes the current solution fail (the problem also occurs for other inverter_id values):
df_dca_dcv = pd.read_sql_query("select pipe_id,created_at as date,inverter_id,n_mppts,dca,dcv from inverters where inverter_id = 'a2' order by pipe_id, inverter_id, date;", con=con) # connected to a postgreSQL db
df_dca_dcv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16507 entries, 0 to 16506
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pipe_id 16507 non-null object
1 date 16507 non-null datetime64[ns]
2 inverter_id 16507 non-null object
3 n_mppts 16507 non-null int64
4 dca 16428 non-null object
5 dcv 16428 non-null object
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 773.9+ KB
The dca column's Dtype is still object, but its values are now enclosed in '[]' rather than '{}' (unlike Excerpt 1), and when I execute:
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace('[', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace(']', '')
df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str)
I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[6], line 2
1 df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace("[", "")
----> 2 df_dca_dcv['dca'] = df_dca_dcv['dca'].str.replace("]", "")
File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
5568 if (
5569 name not in self._internal_names_set
5570 and name not in self._metadata
5571 and name not in self._accessors
5572 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5573 ):
5574 return self[name]
-> 5575 return object.__getattribute__(self, name)
File ~\Anaconda3\lib\site-packages\pandas\core\accessor.py:182, in CachedAccessor.__get__(self, obj, cls)
179 if obj is None:
180 # we're accessing the attribute of the class, i.e., Dataset.geo
181 return self._accessor
--> 182 accessor_obj = self._accessor(obj)
183 # Replace the property with the accessor object. Inspired by:
184 # https://www.pydanny.com/cached-property.html
185 # We need to use object.__setattr__ because we overwrite __setattr__ on
186 # NDFrame
187 object.__setattr__(obj, self._name, accessor_obj)
File ~\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py:177, in StringMethods.__init__(self, data)
174 def __init__(self, data):
175 from pandas.core.arrays.string_ import StringDtype
--> 177 self._inferred_dtype = self._validate(data)
178 self._is_categorical = is_categorical_dtype(data.dtype)
179 self._is_string = isinstance(data.dtype, StringDtype)
File ~\Anaconda3\lib\site-packages\pandas\core\strings\accessor.py:231, in StringMethods._validate(data)
228 inferred_dtype = lib.infer_dtype(values, skipna=True)
230 if inferred_dtype not in allowed_types:
--> 231 raise AttributeError("Can only use .str accessor with string values!")
232 return inferred_dtype
AttributeError: Can only use .str accessor with string values!
I anticipated that with the '.astype(str)' operation and then performed the '.str.replace(...)' operations. However, when I now look at the DataFrame
Excerpt 4:
the dca column values are in a different format than in Excerpt 2 (e.g. "Decimal('2.2'),Decimal('2.2'..."). When I continue with
mppts_dca = df_dca_dcv.apply(split_dca, axis=1)
df_dca_dcv = pd.concat([df_dca_dcv, mppts_dca], axis=1)
df_dca_dcv['date'] = df_dca_dcv['date'].astype('datetime64[ns]')
df_dca_dcv['dca_mppt_0'] = pd.to_numeric(df_dca_dcv[0], errors='coerce')
df_dca_dcv['dca_mppt_1'] = pd.to_numeric(df_dca_dcv[1], errors='coerce')
the dca values are not carried over into the newly split columns, (I think) because pd.to_numeric cannot read 'Decimal(...)':
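The dump in the later edit suggests dca comes back from PostgreSQL as Python lists of Decimal (which is also why the .str accessor raised above: the cells are lists, not strings). One way around this, sketched here under that assumption with invented values, is to skip the string round-trip entirely and convert the Decimals to floats element-wise:

```python
import pandas as pd
from decimal import Decimal

# Toy frame shaped like the SQL result: dca holds lists of Decimal (or None for NULLs)
df = pd.DataFrame({
    'n_mppts': [2, 2],
    'dca': [[Decimal('2.3'), Decimal('2.3'), Decimal('0')], None],
})

def split_decimal_dca(row):
    # Convert each Decimal to float directly; None/NULL rows become an empty list
    values = [float(v) for v in row['dca']] if row['dca'] else []
    values += [float('nan')] * (row['n_mppts'] - len(values))
    return pd.Series(values[:row['n_mppts']])

mppts = df.apply(split_decimal_dca, axis=1)
```

This is a variant of the split_dca function above, with the str.split replaced by a float() conversion per list element.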
Excerpt 5:
I tried all of the following methods to convert the dca column to strings:
METHOD1: df_dca_dcv['dca'] = df_dca_dcv['dca'].map(str) #produced same output format as before
METHOD2: df_dca_dcv['dca'] = df_dca_dcv['dca'].apply(str) #produced same output format as before
METHOD3: df_dca_dcv['dca'] = df_dca_dcv['dca'].astype(str) #generated the following error:
ValueError Traceback (most recent call last)
Cell In[6], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv['dca'].values.astype(str)
ValueError: setting an array element with a sequence
METHOD4: df_dca_dcv['dca'] = df_dca_dcv['dca'].values.astype(str) #generated same error as METHOD3
METHOD5: df_dca_dcv['dca'] = df_dca_dcv['dca'].applymap(str) #generated the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[7], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv['dca'].applymap(str)
File ~\Anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
5568 if (
5569 name not in self._internal_names_set
5570 and name not in self._metadata
5571 and name not in self._accessors
5572 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5573 ):
5574 return self[name]
-> 5575 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'applymap'
METHOD6:
def convert_float_string(row):
    float_list = row['dca']
    if len(float_list) > 0:
        string_list = ["%.2f" % i for i in float_list]
    else:
        string_list = float('NaN')
    return string_list
df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1) #generated the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[8], line 1
----> 1 df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1)
File ~\Anaconda3\lib\site-packages\pandas\core\frame.py:8839, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
8828 from pandas.core.apply import frame_apply
8830 op = frame_apply(
8831 self,
8832 func=func,
(...)
8837 kwargs=kwargs,
8838 )
-> 8839 return op.apply().__finalize__(self, method="apply")
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:727, in FrameApply.apply(self)
724 elif self.raw:
725 return self.apply_raw()
--> 727 return self.apply_standard()
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:851, in FrameApply.apply_standard(self)
850 def apply_standard(self):
--> 851 results, res_index = self.apply_series_generator()
853 # wrap results
854 return self.wrap_results(results, res_index)
File ~\Anaconda3\lib\site-packages\pandas\core\apply.py:867, in FrameApply.apply_series_generator(self)
864 with option_context("mode.chained_assignment", None):
865 for i, v in enumerate(series_gen):
866 # ignore SettingWithCopy here in case the user mutates
--> 867 results[i] = self.f(v)
868 if isinstance(results[i], ABCSeries):
869 # If we have a view on v, we need to make a copy because
870 # series_generator will swap out the underlying data
871 results[i] = results[i].copy(deep=False)
Cell In[8], line 1, in <lambda>(row)
----> 1 df_dca_dcv['dca'] = df_dca_dcv.apply(lambda row: convert_float_string(row), axis=1)
Cell In[6], line 3, in convert_float_string(row)
1 def convert_float_string(row):
2 float_list = row['dca']
----> 3 if len(float_list) > 0:
4 string_list = ["%.2f" % i for i in float_list]
5 else:
TypeError: object of type 'NoneType' has no len()
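The traceback shows len() being called on None: rows where dca is NULL in the database come back as None, not as an empty list. A minimal guard fixes METHOD6 (a sketch on invented data, keeping the same "%.2f" format as above):

```python
import pandas as pd
from decimal import Decimal

def convert_float_string(row):
    float_list = row['dca']
    # Guard: NULLs from the database arrive as None, which has no len()
    if float_list is not None and len(float_list) > 0:
        return ["%.2f" % i for i in float_list]
    return float('NaN')

# Toy frame: one list-valued row, one NULL row
df = pd.DataFrame({'dca': [[Decimal('2.3'), Decimal('0')], None]})
out = df.apply(convert_float_string, axis=1)
```

With the default result_type, DataFrame.apply leaves list returns as-is, so out is an object Series holding a list of strings for the first row and NaN for the second.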
...and if I simply skip converting dca to strings and use
df_dca_dcv['dca'] = df_dca_dcv['dca'].replace("[", "")
df_dca_dcv['dca'] = df_dca_dcv['dca'].replace("]", "")
the replacement does not happen.
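That last outcome is expected: Series.replace matches whole cell values (or regexes), not substrings, so replace('[', '') only changes cells that are exactly '['. A quick demonstration on made-up data:

```python
import pandas as pd

s = pd.Series(['[1.1,2.2]', '['])
# Series.replace matches the full cell value: only the bare '[' cell changes
whole = s.replace('[', '')
# str.replace works on substrings (regex=False takes '[' literally)
sub = s.str.replace('[', '', regex=False)
```

For substring removal on string cells, str.replace is the right tool; Series.replace is for swapping out entire values.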
Any suggestions on how to solve this would be greatly appreciated.
Edit 2023-04-19:
Here is the output requested by @Corralien:
{'pipe_id': {0: '10755', 1: '10755', 2: '10755', 3: '10755', 4: '10755'}, 'date': {0: Timestamp('2022-09-06 12:15:58.451439'), 1: Timestamp('2022-09-06 12:21:16.626511'), 2: Timestamp('2022-09-06 12:26:31.371399'), 3: Timestamp('2022-09-06 12:47:13.346493'), 4: Timestamp('2022-09-06 12:52:37.908956')}, 'inverter_id': {0: 'a2', 1: 'a2', 2: 'a2', 3: 'a2', 4: 'a2'}, 'n_mppts': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2}, 'dca': {0: [Decimal('2.3'), Decimal('2.3'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 1: [Decimal('2.6'), Decimal('2.6'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 2: [Decimal('2.9'), Decimal('2.9'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 3: [Decimal('6'), Decimal('5.9'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 4: [Decimal('3.9'), Decimal('3.9'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')]}, 'dcv': {0: [Decimal('388.3'), Decimal('432.7'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 1: [Decimal('390.7'), Decimal('432.7'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 2: [Decimal('388.2'), Decimal('430.7'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 3: [Decimal('390.4'), Decimal('435.7'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')], 4: [Decimal('382.9'), Decimal('424.3'), Decimal('0'), Decimal('0'), Decimal('0'), 
Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0'), Decimal('0')]}}
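Given that dump, dca already arrives as Python lists of Decimal, so one possible shortcut (a sketch, assuming every row holds a list rather than None) is to expand the lists with the DataFrame constructor and skip string handling altogether:

```python
import pandas as pd
from decimal import Decimal

# Toy frame mirroring the dumped structure (values invented, lists truncated)
df = pd.DataFrame({
    'n_mppts': [2, 2],
    'dca': [[Decimal('2.3'), Decimal('2.3'), Decimal('0')],
            [Decimal('2.6'), Decimal('2.6'), Decimal('0')]],
})

# Expand the lists into columns, cast Decimal to float, keep the first n_mppts entries
expanded = pd.DataFrame(df['dca'].tolist(), index=df.index).astype(float)
n = df['n_mppts'].max()
expanded = expanded.iloc[:, :n].add_prefix('dca_mppt_')
df = pd.concat([df, expanded], axis=1)
```

Rows where dca is None would need to be filled with empty lists first (e.g. via a mask) before the tolist() expansion.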
I am not sure if I understood your question correctly, but you can use a custom function with apply; example below. Hope it helps.
def split_dca(row):
    values = row['dca'].split(',') if row['dca'] else []
    values += [float('NaN')] * (row['n_mppts'] - len(values))
    values = values[:row['n_mppts']]
    return pd.Series(values)

split_columns = df.apply(split_dca, axis=1)