How to filter out rows that do not start with a digit (CSV, PySpark). Edited: contains only digits

Question (votes: 1, answers: 1)

CSV file

One column in df contains some rows that do not start with a digit, and I want to remove them. I tried the code below, but neither attempt works.

import re
df = sqlContext.read.csv("/FileStore/tables/mtmedical_V6-16623.csv", header='true', inferSchema="true")

df.show()

import pyspark.sql.functions as f
w=df.filter(df['_c0'].isdigit()) #error1
w=df.filter(df['_c0'].startswith(('1','2','3','4','5','6','7','8','9'))) #error2
w.show()

Errors:

'Column' object is not callable #no1
py4j.Py4JException: Method startsWith([class java.util.ArrayList]) does not exist #no2

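Here is the table. You can see that the row right after row 7 has a value in column '_c0' that does not start with a digit. How can I remove rows like this?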

+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------------------------------------------+--------------------+--------------------+
|                 _c0|         description|   medical_specialty|                 age|              gender|sample_name (What has been done to patient = Treatment)|       transcription|            keywords|
+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------------------------------------------+--------------------+--------------------+
|                   1| A 23-year-old wh...| Allergy / Immuno...|                  23|              female|                                     Allergic Rhinitis |SUBJECTIVE:,  Thi...|allergy / immunol...|
|                   2| Consult for lapa...|          Bariatrics|                null|                male|                                    Laparoscopic Gas...|PAST MEDICAL HIST...|bariatrics, lapar...|
|                   3| Consult for lapa...|          Bariatrics|                  42|                male|                                    Laparoscopic Gas...|"HISTORY OF PRESE...| at his highest h...|
|                   4| 2-D M-Mode. Dopp...| Cardiovascular /...|                null|                null|                                    2-D Echocardiogr...|2-D M-MODE: , ,1....|cardiovascular / ...|
|                   5|  2-D Echocardiogram| Cardiovascular /...|                null|                male|                                    2-D Echocardiogr...|1.  The left vent...|cardiovascular / ...|
|                   6| Morbid obesity. ...|          Bariatrics|                  30|                male|                                    Laparoscopic Gas...|PREOPERATIVE DIAG...|bariatrics, gastr...|
|                   7| Liposuction of t...|                null|                null|                null|                                                   null|                null|                null|
|", Bariatrics,31,...|       1.  Deformity| right breast rec...|2.  Excess soft t...| anterior abdomen...|                                   3.  Lipodystrophy...|POSTOPERATIVE DIA...|       1.  Deformity|
|                   8|  2-D Echocardiogram| Cardiovascular /...|                null|                male|                                    2-D Echocardiogr...|2-D ECHOCARDIOGRA...|cardiovascular / ...|
python pyspark data-cleaning
1 answer
Votes: 1
df.filter((f.col('_c0')).isin([x for x in range(1,df.count()+1)]))
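As an alternative to building a list of every expected id, a regular expression on the column may be simpler. The sketch below is not from the original answer: the helper name `keep_numeric_rows` and the constant `DIGITS_ONLY` are made up for illustration, and it assumes `_c0` was read as a string column (which is likely here, since it mixes numbers and text). It also shows why the two attempts in the question fail: `Column` has no `isdigit` method, so `df['_c0'].isdigit` is treated as a nested-field lookup that returns another `Column`, and calling it raises "'Column' object is not callable"; `Column.startswith` takes a single string or column, not a tuple like Python's `str.startswith`.

```python
import re

# Pattern meaning "the whole value consists of digits only".
# Use r"^[0-9]" instead to keep anything that merely starts with a digit.
DIGITS_ONLY = r"^[0-9]+$"


def keep_numeric_rows(df, column="_c0", pattern=DIGITS_ONLY):
    """Hypothetical helper: keep the rows of a PySpark DataFrame whose
    `column` matches `pattern`. Rows where the column is null are dropped too,
    because rlike returns null for them."""
    # Imported inside the function so the plain-Python check below
    # also runs on a machine without Spark installed.
    from pyspark.sql import functions as F

    return df.filter(F.col(column).cast("string").rlike(pattern))


# Quick check of the pattern itself in plain Python,
# using values like the ones shown in the question's table.
samples = ["1", "7", '", Bariatrics,31,', "8"]
kept = [s for s in samples if re.search(DIGITS_ONLY, s)]
print(kept)  # ['1', '7', '8']
```

With the DataFrame from the question, this would be called as `w = keep_numeric_rows(df)` followed by `w.show()`. Note that the stray row looks like the tail of a quoted field that spans several lines, so filtering it out discards part of record 7; if that matters, reading the file with the CSV reader's `multiLine` option might be worth trying first.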