Extracting multiple DataFrames from a dictionary


I am using scprep for some single-cell RNA sequencing analysis. I call scprep.stats.differential_expression_by_cluster(data, clusters), where clusters is the output of scikit-learn's KMeans.

According to the documentation, the output is dict(pd.DataFrame).

My output looks like this:

 {0:                               difference   rank
 C1qb (ENSMUSG00000036905)       0.176254      0
 C1qa (ENSMUSG00000036887)       0.145618      1
 C1qc (ENSMUSG00000036896)       0.120607      2
 Crybb1 (ENSMUSG00000029343)     0.105344      3
 Tyrobp (ENSMUSG00000030579)     0.098916      4
 ...                                  ...    ...
 mt-Co3 (ENSMUSG00000064358)   -68.884323  16091
 Malat1 (ENSMUSG00000092341)   -77.371274  16092
 Tuba1a (ENSMUSG00000072235)   -91.835869  16093
 Tmsb4x (ENSMUSG00000049775)  -101.908864  16094
 mt-Atp6 (ENSMUSG00000064357) -120.025289  16095

 [16096 rows x 2 columns], 1:                               difference   rank
 Tmsb4x (ENSMUSG00000049775)   127.537848      0
 Tuba1a (ENSMUSG00000072235)    91.644383      1
 Tubb2b (ENSMUSG00000045136)    48.972048      2
 mt-Atp6 (ENSMUSG00000064357)   41.105186      3
 Stmn1 (ENSMUSG00000028832)     40.466334      4
 ...                                  ...    ...
 Meg3 (ENSMUSG00000021268)      -2.904875  16091
 Hmgb2 (ENSMUSG00000054717)     -4.784257  16092
 Vim (ENSMUSG00000026728)       -5.001676  16093
 Dbi (ENSMUSG00000026385)       -6.704505  16094
 Fabp7 (ENSMUSG00000019874)    -12.319859  16095

 [16096 rows x 2 columns], 2:                              difference   rank
 Gria2 (ENSMUSG00000033981)     1.688701      0
 Pou3f2 (ENSMUSG00000095139)    1.167767      1
 Pou3f3 (ENSMUSG00000045515)    0.999804      2
 Cldn5 (ENSMUSG00000041378)     0.971778      3
 Robo2 (ENSMUSG00000052516)     0.877576      4

When I try pd.DataFrame.from_dict(dict), I get this error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-383-630287ba17f3> in <module>
----> 1 df = pd.DataFrame.from_dict(diff)

~/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in from_dict(cls, data, orient, dtype, columns)
   1188             raise ValueError("only recognize index or columns for orient")
   1189 
-> 1190         return cls(data, index=index, columns=columns, dtype=dtype)
   1191 
   1192     def to_numpy(self, dtype=None, copy=False):

~/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    409             )
    410         elif isinstance(data, dict):
--> 411             mgr = init_dict(data, index, columns, dtype=dtype)
    412         elif isinstance(data, ma.MaskedArray):
    413             import numpy.ma.mrecords as mrecords

~/anaconda/lib/python3.6/site-packages/pandas/core/internals/construction.py in init_dict(data, index, columns, dtype)
    255             arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
    256         ]
--> 257     return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    258 
    259 

~/anaconda/lib/python3.6/site-packages/pandas/core/internals/construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype)
     75     # figure out the index, if necessary
     76     if index is None:
---> 77         index = extract_index(arrays)
     78     else:
     79         index = ensure_index(index)

~/anaconda/lib/python3.6/site-packages/pandas/core/internals/construction.py in extract_index(data)
    356 
    357         if not indexes and not raw_lengths:
--> 358             raise ValueError("If using all scalar values, you must pass an index")
    359 
    360         if have_series:

ValueError: If using all scalar values, you must pass an index

I tried various things, such as pd.DataFrame.from_dict(dict, orient='index'), which gave me the following output:

                                                   0
0                                 difference   ran...
1                                 difference   ran...
2                                difference   rank...
3                                difference   rank...
4                                 difference   ran...
5                                 difference   ran...
6                                 difference   ran...
7                                difference   rank...
8                                 difference   ran...
9                                 difference   ran...
10                                difference   ran...
11                                difference   ran...
12                                difference   ran...
13                                difference   ran...
14                                difference   ran...
15                                difference   ran...
16                                difference   ran...
17                                difference   ran...
18                               difference   rank...
19                               difference   rank...
20                                difference   ran...
21                                difference   ran...
22                               difference   rank...
23                               difference   rank...
24                               difference   rank...
25                                difference   ran...

I would like to end up with 26 separate CSV files, each with the gene names as rows and 'difference' and 'rank' as columns.

I looked at the source code on GitHub and found that the result is built like this:

result = {cluster : differential_expression(
        select.select_rows(data, idx=clusters==cluster),
        select.select_rows(data, idx=clusters!=cluster),
        measure = measure, direction = direction,
        gene_names = gene_names, n_jobs = n_jobs)
              for cluster in np.unique(clusters)}

How can I get the output I want?

Thanks


python pandas dictionary scikit-learn
1 Answer

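The dictionary already holds one DataFrame per cluster, so there is no need to convert it. pd.DataFrame.from_dict fails because it expects each dictionary value to be a column or a row, not a whole DataFrame. You can retrieve each DataFrame from the dictionary and save it to its own file (Excel or CSV):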
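A minimal sketch of that loop, assuming the result is a plain dict that maps each cluster label to a DataFrame indexed by gene name, as in the output shown in the question. The helper name `save_cluster_tables`, the `cluster_<label>.csv` file naming, and the two-cluster toy `results` dict are illustrative assumptions, not part of scprep; in practice `results` would be the return value of `scprep.stats.differential_expression_by_cluster(data, clusters)`.

```python
import os
import tempfile

import pandas as pd


def save_cluster_tables(results, out_dir):
    """Write each DataFrame in `results` to its own CSV file.

    `results` is assumed to be a dict of {cluster_label: DataFrame}.
    Returns a dict of {cluster_label: path_to_csv}.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for cluster, df in results.items():
        path = os.path.join(out_dir, f"cluster_{cluster}.csv")
        # The index (gene names) is written as the first column, labelled "gene".
        df.to_csv(path, index_label="gene")
        paths[cluster] = path
    return paths


# Toy stand-in for the dict returned by differential_expression_by_cluster.
results = {
    0: pd.DataFrame(
        {"difference": [0.176254, -120.025289], "rank": [0, 1]},
        index=["C1qb (ENSMUSG00000036905)", "mt-Atp6 (ENSMUSG00000064357)"],
    ),
    1: pd.DataFrame(
        {"difference": [127.537848, -12.319859], "rank": [0, 1]},
        index=["Tmsb4x (ENSMUSG00000049775)", "Fabp7 (ENSMUSG00000019874)"],
    ),
}

out_dir = tempfile.mkdtemp()  # replace with a real output directory
paths = save_cluster_tables(results, out_dir)
```

With 26 clusters this produces 26 files, each with gene names as rows and `difference` and `rank` as columns. To write Excel files instead, `df.to_excel(path)` with an `.xlsx` path should work the same way, provided an Excel writer engine such as openpyxl is installed.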
