Replacing values in a DataFrame

Question

I am trying to replace 'yyyy-MM' with 'yyyy-MM' + '-01'. Below is my code, but I am not getting the correct result. Note that I am working on Databricks:

from pyspark.sql.functions import col, concat, lit, when

# Create new columns with replaced values
new_df = clinicaltrial_df.withColumn(
    'Start_new',
    when(
        col('Start').contains('-'),
        col('Start')
    ).otherwise(
        concat(col('Start'), lit('-01'))
    )
).withColumn(
    'Complete_new',
    when(
        col('Completion').contains('-'),
        col('Completion')
    ).otherwise(
        concat(col('Completion'), lit('-01'))
    )
)

# Show the DataFrame
new_df.show(5)

Answer 1

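Your code appends '-01' to the values in the `Start` and `Completion` columns only if they do not already contain a '-'. A 'yyyy-MM' value already contains a hyphen, so it is left unchanged. What you actually want is to target strings formatted as 'yyyy-MM' and make sure they become 'yyyy-MM-01'. To do that, you need to identify the strings that match the 'yyyy-MM' format exactly.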
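To see why the original condition never appends anything to a 'yyyy-MM' value, the same branch logic can be mimicked in plain Python. This is only a minimal sketch: the `append_day_original` helper is hypothetical and mirrors the `when`/`otherwise` branches on a single string, it is not Spark code.

```python
def append_day_original(value: str) -> str:
    """Mirror the question's when/otherwise logic on a single string."""
    # when(col.contains('-'), col): values with a hyphen are kept unchanged
    if '-' in value:
        return value
    # otherwise(concat(col, lit('-01'))): only hyphen-free values get '-01'
    return value + '-01'


for sample in ['2022-01', '2022-05-01', '2023']:
    print(sample, '->', append_day_original(sample))
```

Under this logic '2022-01' stays as it is, while a bare year such as '2023' becomes '2023-01', which is the opposite of the intended behavior.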

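You can use the `regexp_replace` function from PySpark's `sql.functions` module. It searches for a regular expression pattern and replaces the matched part of the string with a specified replacement string. In your case, you can look for strings that match the year-and-month pattern ('yyyy-MM') and do not end with a day, then append '-01' to them so they are normalized to a full date ('yyyy-MM-01').

Here is how to adjust your code: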

from pyspark.sql.functions import regexp_replace

# Adjust the DataFrame: append '-01' only to values that are exactly 'yyyy-MM'.
# '$1' in the replacement refers to the first capture group (Java regex syntax).
new_df = clinicaltrial_df.withColumn(
    'Start_new',
    regexp_replace('Start', r'^(\d{4}-\d{2})$', '$1-01')
).withColumn(
    'Complete_new',
    regexp_replace('Completion', r'^(\d{4}-\d{2})$', '$1-01')
)

# Show the modified DataFrame
new_df.show(5)

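This approach changes only the strings whose format is exactly 'yyyy-MM', which matches the stated requirement, and it relies on regular expression pattern matching to perform the transformation.

Example

Here is an example of the solution above applied to some dummy data: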

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col, concat, lit, when
from pyspark.sql.types import StructType, StructField, StringType

# Initialize SparkSession (not necessary if you're running this in Databricks as it's already initialized)
spark = SparkSession.builder.appName("example").getOrCreate()

# Define schema for the DataFrame
schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Start", StringType(), True),
    StructField("Completion", StringType(), True)
])

# Sample data
data = [
    ("001", "2022-01", "2022-12-31"),
    ("002", "2022-05-01", "2023-05"),
    ("003", "2023", "2023-11"),
    ("004", "2022-07", "2022-09-15")
]

# Create DataFrame
clinicaltrial_df = spark.createDataFrame(data, schema=schema)

# Showing original DataFrame
clinicaltrial_df.show()
# +---+----------+----------+
# | ID|     Start|Completion|
# +---+----------+----------+
# |001|   2022-01|2022-12-31|
# |002|2022-05-01|   2023-05|
# |003|      2023|   2023-11|
# |004|   2022-07|2022-09-15|
# +---+----------+----------+

# Apply the solution: '$1' refers to the captured 'yyyy-MM' part of the match
new_df = clinicaltrial_df.withColumn(
    'Start_new',
    regexp_replace('Start', r'^(\d{4}-\d{2})$', '$1-01')
).withColumn(
    'Complete_new',
    regexp_replace('Completion', r'^(\d{4}-\d{2})$', '$1-01')
)

# Show the modified DataFrame
new_df.show(truncate=False)
# +---+----------+----------+----------+------------+
# |ID |Start     |Completion|Start_new |Complete_new|
# +---+----------+----------+----------+------------+
# |001|2022-01   |2022-12-31|2022-01-01|2022-12-31  |
# |002|2022-05-01|2023-05   |2022-05-01|2023-05-01  |
# |003|2023      |2023-11   |2023      |2023-11-01  |
# |004|2022-07   |2022-09-15|2022-07-01|2022-09-15  |
# +---+----------+----------+----------+------------+

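Note: The day is appended only when the string contains both a year and a month, in that order. Values that contain only a year ('YYYY'), that do not follow the 'YYYY-MM' year/month order, or that use a separator other than '-' (for example, 'YYYY/MM') are left unchanged.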
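If those remaining values should also be normalized, one possible extension is sketched below in plain Python. Everything here is an assumption for illustration: the `normalize_partial_date` helper is hypothetical, and defaulting a bare year to January 1st is a choice the original answer does not make.

```python
import re


def normalize_partial_date(value: str) -> str:
    """Normalize 'YYYY', 'YYYY-MM' and 'YYYY/MM' to 'YYYY-MM-DD' (hypothetical helper)."""
    # 'YYYY-MM' or 'YYYY/MM' -> 'YYYY-MM-01'
    year_month = re.fullmatch(r'(\d{4})[-/](\d{2})', value)
    if year_month:
        return f'{year_month.group(1)}-{year_month.group(2)}-01'
    # 'YYYY' -> 'YYYY-01-01' (assumed default: first day of the year)
    if re.fullmatch(r'\d{4}', value):
        return f'{value}-01-01'
    # Anything else, including full dates, is returned unchanged
    return value


for sample in ['2023', '2022/07', '2022-07', '2022-09-15']:
    print(sample, '->', normalize_partial_date(sample))
```

The same patterns could in principle be applied column-wise in Spark, but whether a missing month or day should be filled in at all depends on how the dates are used downstream.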

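Regex pattern breakdown

If you are not familiar with regular expression patterns, here is a breakdown of the pattern used in the code above.

`regexp_replace` is used with the regex pattern `r'^(\d{4}-\d{2})$'`:
- `^` asserts the start of the string.
- `(\d{4}-\d{2})` matches and captures a group of four digits (the year), followed by a hyphen, then two digits (the month).
- `$` asserts the end of the string.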
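The pattern can be tried out locally without a Spark session by using Python's built-in `re` module. This is a rough sketch rather than the Spark code itself: the `append_day` helper is hypothetical, and Python writes the reference to the captured group as `\1` in the replacement string.

```python
import re

# Same pattern as in the Spark code: the whole string must be 'yyyy-MM'
YEAR_MONTH = re.compile(r'^(\d{4}-\d{2})$')


def append_day(value: str) -> str:
    """Append '-01' when the value is exactly 'yyyy-MM'; otherwise return it unchanged."""
    # r'\1' inserts the text captured by the first group, followed by '-01'
    return YEAR_MONTH.sub(r'\1-01', value)


for sample in ['2022-01', '2022-05-01', '2023', '2022/07']:
    print(sample, '->', append_day(sample))
```

Only '2022-01' is rewritten (to '2022-01-01'); the other samples fail the anchored match and pass through untouched.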
