I have data from a database table that I exported to a file like the one below. It has about 100k records, with duplicates based on id:
id | dp_1 | pp_1 | Phone  |
---|------|------|--------|
1  | dp1  |      | phone1 |
1  |      | pp1  | phone1 |
2  | dp2  | pp2  | phone2 |
2  |      |      | phone4 |
3  | dp3  | pp3  | phone3 |
3  | dp3  |      | phone3 |
4  |      | pp4  |        |
4  | dp4  |      |        |
I would like the result to look like this:
id | dp_1 | pp_1 | Phone           |
---|------|------|-----------------|
1  | dp1  | pp1  | phone1 - phone1 |
2  | dp2  | pp2  | phone2 - phone4 |
3  | dp3  | pp3  | phone3          |
4  | dp4  | pp4  |                 |
I wrote this SQL:
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id, DP_1, PP_1, phone ORDER BY id DESC) AS [rn]
    FROM table1
)
SELECT * INTO #temp FROM cte WHERE [rn] = 1 ORDER BY id
How can I achieve this in Python or with a SQL query? I am using Anaconda.
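I don't understand why ids 1 and 3 follow different phone logic (one repeats the number, the other does not). This answer can either repeat the phone (as for id 1) or return DISTINCT values (as for id 3). You can switch the logic by uncommenting the GROUP BY.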
--Sample Data
WITH VTE AS (
SELECT *
FROM (VALUES (1,'dp1',NULL,'phone1'),
(1,NULL,'pp1','phone1'),
(2,'dp2','pp2','phone2'),
(2,NULL,NULL,'phone4'),
(3,'dp3','pp3','phone3'),
(3,'dp3',NULL,'phone3')) V(id, dp_1, pp_1, phone))
--And the answer
SELECT id,
MAX(dp_1) AS dp_1,
MAX(pp_1) AS pp_1,
STUFF((SELECT ' - ' + sq.phone
FROM VTE sq
WHERE sq.id = VTE.id
AND phone <> ''
--GROUP BY sq.phone --If you only want to display unique phones, uncomment the GROUP BY.
FOR XML PATH('')),1,3,'') AS [phone]
FROM VTE
GROUP BY id;
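For readers who want to see the two phone behaviours side by side outside SQL, here is a minimal pure-Python sketch. The helper name `combine_phones` and its `distinct` flag are made up for illustration; `distinct=False` mirrors the query as written, and `distinct=True` mirrors it with the GROUP BY uncommented. Unlike the query above, the sample rows here also include id 4 from the question.

```python
from collections import defaultdict

def combine_phones(rows, distinct=False, sep=" - "):
    """rows: iterable of (id, phone) pairs. Returns {id: joined phone string}."""
    grouped = defaultdict(list)
    for row_id, phone in rows:
        phones = grouped[row_id]      # touch the key so ids with no phone still appear
        if not phone:                 # skip NULL/empty phones, like `phone <> ''`
            continue
        if distinct and phone in phones:
            continue                  # drop repeats only when distinct is requested
        phones.append(phone)
    return {row_id: sep.join(phones) for row_id, phones in grouped.items()}

# Hypothetical sample rows taken from the question's table (id, Phone only)
rows = [(1, "phone1"), (1, "phone1"), (2, "phone2"), (2, "phone4"),
        (3, "phone3"), (3, "phone3"), (4, None), (4, None)]

with_repeats = combine_phones(rows)
unique_only = combine_phones(rows, distinct=True)
```

Neither mode alone reproduces the question's expected output, since id 1 keeps the repeated phone there while id 3 does not.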
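In Python, your best option is pandas. I also use numpy to pick the unique values of Phone in your case.
First, I just create your table (reading it from SQL is a separate question, I guess):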
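import numpy as np
import pandas as pd

# Each phone keeps a trailing space, which acts as the separator when the strings are concatenated
df = pd.DataFrame(data={'id': [1, 1, 2, 2, 3, 3],
                        'dp_1': ['dp1', np.nan, 'dp2', np.nan, 'dp3', 'dp3'],
                        'pp_1': [np.nan, 'pp1', 'pp2', np.nan, 'pp3', np.nan],
                        'Phone': ['phone1 ', 'phone1 ', 'phone2 ', 'phone4 ', 'phone2 ', 'phone3 ']})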
Then I create a function that will be applied within each group:
def unique_sum(str_list):
    # np.unique sorts and de-duplicates; np.sum then concatenates the remaining strings
    return np.sum(np.unique(str_list))
Then apply the groupby. I hope this is what you need:
df.groupby('id').aggregate({'dp_1': 'last', 'pp_1': 'last', 'Phone': unique_sum})
pp_1 Phone dp_1
id
1 pp1 phone1 dp1
2 pp2 phone2 phone4 dp2
3 pp3 phone2 phone3 dp3
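If you want the " - " separator from the question rather than relying on trailing spaces, a variation along these lines may help. This is only a sketch under assumptions: the helper name `join_phones` is hypothetical, the sample frame is rebuilt from the question's table (including id 4), and phones are de-duplicated, so id 1 comes out as a single phone1.

```python
import pandas as pd

# Sample data rebuilt from the question's table; None marks an empty cell
df = pd.DataFrame({
    "id":    [1, 1, 2, 2, 3, 3, 4, 4],
    "dp_1":  ["dp1", None, "dp2", None, "dp3", "dp3", None, "dp4"],
    "pp_1":  [None, "pp1", "pp2", None, "pp3", None, "pp4", None],
    "Phone": ["phone1", "phone1", "phone2", "phone4",
              "phone3", "phone3", None, None],
})

def join_phones(phones):
    """Join the distinct non-null phones of one group with ' - '."""
    values = phones.dropna().drop_duplicates()
    return " - ".join(values)

# 'first' picks the first non-null value in each group
result = df.groupby("id").agg(
    dp_1=("dp_1", "first"),
    pp_1=("pp_1", "first"),
    Phone=("Phone", join_phones),
).reset_index()

print(result)
```

Dropping the `drop_duplicates()` call would keep repeated phones instead, which gives phone1 - phone1 for id 1 but also phone3 - phone3 for id 3.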
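This query gives your expected result, except that duplicate phones are collapsed (id 1 yields phone1 rather than phone1 - phone1):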
;With cte( id,dp_1,pp_1,Phone)
AS
(
SELECT 1 , 'dp1' , NULL , 'phone1' UNION ALL
SELECT 1 , NULL , 'pp1' , 'phone1' UNION ALL
SELECT 2 , 'dp2' , 'pp2' , 'phone2' UNION ALL
SELECT 2 , NULL , NULL , 'phone4' UNION ALL
SELECT 3 , 'dp3' , 'pp3' , 'phone3' UNION ALL
SELECT 3 , 'dp3' , NULL , 'phone3'
)
SELECT
DISTINCT id ,
MAX(dp_1)OVER(PARTITION BY id ORDER BY id) AS dp_1 ,
MAX(pp_1)OVER(PARTITION BY id ORDER BY id) AS pp_1,
STUFF((SELECT DISTINCT ' - ' + Phone FROM cte i WHERE i.id=o.id
FOR XML PATH ('')),1,3,'') AS Phone -- strip all 3 characters of the leading ' - '
FROM cte o
Result
id dp_1 pp_1 Phone
--------------------------------
1 dp1 pp1 phone1
2 dp2 pp2 phone2 - phone4
3 dp3 pp3 phone3