I want to remove the duplicated rows from this xarray:
<xarray.QFDataArray (dates: 61, tickers: 4, fields: 6)>
array([[[ 4.9167, nan, ..., 2.1695, nan],
[ 4.9167, nan, ..., 2.1695, nan],
[ 4.9167, nan, ..., 2.1695, nan],
[ 4.9167, nan, ..., 2.1695, nan]],
[[ 5. , nan, ..., 2.1333, 70.02 ],
[ 5. , nan, ..., 2.1333, 70.02 ],
[ 5. , nan, ..., 2.1333, 70.02 ],
[ 5. , nan, ..., 2.1333, 70.02 ]],
...,
[[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]],
[[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]]])
Coordinates:
* tickers (tickers) object BloombergTicker:0000630D US Equity ... BloombergTicker:0000630D US Equity
* fields (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
* dates (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
In the example above, the ticker is repeated 4 times. My goal is to get an output that looks like this:
<xarray.QFDataArray (dates: 61, tickers: 1, fields: 6)>
array([[[ 4.9167, nan, ..., 2.1695, nan],
[ 5. , nan, ..., 2.1333, 70.02 ],
...,
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]]])
Coordinates:
* tickers (tickers) object BloombergTicker:0000630D US Equity
* fields (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
* dates (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
Note that the tickers dimension is reduced from 4 to 1.
Here is the code (library imports excluded):
def _get_historical_data_cache():
    path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'cached_values_v2_clean.cache')
    data = cached_value(_get_historical_data_bloomberg, path)  # load from the cache; if unavailable, fetch directly from the data provider
    return data

def _slice_by_ticker():
    tickers = _get_historical_data_cache().indexes['tickers']
    for k in tickers:
        slice = _get_historical_data_cache().loc[:, k, :]  # it gives me duplicated tickers.
From the data provider I get a 3D data array (xarray) with the dimensions dates, tickers and fields. My goal is to "slice" this cube, in my case by ticker, so that on each iteration I get a 2D data array (or a 3D xarray like the desired output shown above) containing a single ticker and its corresponding data (dates and fields).
Here is what the xarray looks like on the first iteration (also shown above). The problem is that the unique ticker is duplicated:
In[2]: slice
Out[2]:
<xarray.QFDataArray (dates: 61, tickers: 4, fields: 6)>
array([[[ 4.9167, nan, ..., 2.1695, nan],
[ 4.9167, nan, ..., 2.1695, nan],
[ 4.9167, nan, ..., 2.1695, nan],
[ 4.9167, nan, ..., 2.1695, nan]],
[[ 5. , nan, ..., 2.1333, 70.02 ],
[ 5. , nan, ..., 2.1333, 70.02 ],
[ 5. , nan, ..., 2.1333, 70.02 ],
[ 5. , nan, ..., 2.1333, 70.02 ]],
...,
[[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]],
[[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]]])
Coordinates:
* tickers (tickers) object BloombergTicker:0000630D US Equity ... BloombergTicker:0000630D US Equity
* fields (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
* dates (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
When I tried the solution proposed by Ryan, here is the code:
def _slice_by_ticker():
    tickers = _get_historical_data_cache().indexes['tickers']
    for k in tickers:
        slice = _get_historical_data_cache().loc[:, k, :]  # it gives me duplicated tickers.

        # get unique ticker values as a numpy array
        unique_tickers = np.unique(slice.tickers.values)
        da_reindexed = slice.reindex(tickers=unique_tickers)
And here is the error:
ValueError: cannot reindex or align along dimension 'tickers' because the index has duplicate values
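For reference, the failure can be reproduced on a minimal toy DataArray (the ticker and field names below are made up for illustration, not taken from the real cache):

```python
import numpy as np
import xarray as xr

# Toy stand-in for the cached 3D array: 3 dates x 2 tickers x 2 fields,
# with the same ticker label deliberately repeated along 'tickers'.
da = xr.DataArray(
    np.arange(12.0).reshape(3, 2, 2),
    dims=("dates", "tickers", "fields"),
    coords={"tickers": ["AAPL", "AAPL"], "fields": ["PX_LAST", "PX_VOLUME"]},
)

unique_tickers = np.unique(da.tickers.values)

# Label-based reindexing must align against the existing index, and
# xarray refuses to align a duplicated index, hence the ValueError.
err = None
try:
    da.reindex(tickers=unique_tickers)
except ValueError as exc:
    err = exc
print(err)
```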
Thanks for your help! :)
It sounds like you want to reindex your DataArray. (See the xarray docs on reindexing.)
Below I will assume that da is the name of the original DataArray.
import numpy as np
# get unique ticker values as numpy array
unique_tickers = np.unique(da.tickers.values)
da_reindexed = da.reindex(tickers=unique_tickers)
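On a toy DataArray whose tickers index is unique (illustrative names, not the question's real data), the reindexing mechanics look like this; note that this approach requires the existing index itself to be duplicate-free, which is why it fails on the question's array:

```python
import numpy as np
import xarray as xr

# Hypothetical 3 dates x 2 tickers x 2 fields array with a *unique* index.
da = xr.DataArray(
    np.arange(12.0).reshape(3, 2, 2),
    dims=("dates", "tickers", "fields"),
    coords={"tickers": ["AAPL", "MSFT"], "fields": ["PX_LAST", "PX_VOLUME"]},
)

# Reindexing keeps only the labels passed in; the tickers dim shrinks to 1.
da_reindexed = da.reindex(tickers=["AAPL"])
print(da_reindexed.sizes)
```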
Found the answer.
First I tried this:
slice_clean = (slice[:, :1]).rename('slice_clean')
slice.reindex_like(slice_clean)
This gave me the same error as shown above:
ValueError: cannot reindex or align along dimension 'tickers' because the index has duplicate values
Then I tried this:
slice = slice[:, :1]
And it worked!
<xarray.QFDataArray (dates: 61, tickers: 1, fields: 6)>
array([[[ 4.9167, nan, ..., 2.1695, nan]],
[[ 5. , nan, ..., 2.1333, 70.02 ]],
...,
[[ nan, nan, ..., nan, nan]],
[[ nan, nan, ..., nan, nan]]])
Coordinates:
* tickers (tickers) object BloombergTicker:0000630D US Equity
* fields (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
* dates (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
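This works because `slice[:, :1]` is positional (integer-based) indexing, which never consults the label index, so the duplicates are harmless. A minimal sketch on a made-up two-column array (and note the implicit assumption that every duplicated column carries identical values, as in the question, since `[:, :1]` silently keeps the first column regardless):

```python
import numpy as np
import xarray as xr

# Toy array mimicking the duplicated 'tickers' coordinate (names illustrative).
da = xr.DataArray(
    np.arange(12.0).reshape(3, 2, 2),
    dims=("dates", "tickers", "fields"),
    coords={"tickers": ["AAPL", "AAPL"], "fields": ["PX_LAST", "PX_VOLUME"]},
)

# Positional slicing bypasses label alignment entirely; keeping only
# column 0 leaves a tickers dimension of size 1.
slice_clean = da[:, :1]
print(slice_clean.sizes)
```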