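I'm not an expert in regular expressions and am looking for help. Thanks in advance.
I want to extract a formatted substring from the description column. Examples below.
From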
my testing on 456897 - Carol M. Smith, Ph.D.
my testing on 435670 - Ms. Paulina M. Hall
my testing on 980765 - Mr. John Smith
my testing on 14567 - Mrs. Lena C. Callum
my testing on 555777 - Dr. Paul F. Fairlake
234567 - Mr. Ryan M. Palmer, Sr.
123456 - Joyce R. Hilton, Ph.D.
To
my testing on 456897 - C. Smith
my testing on 435670 - Ms. P. Hall
my testing on 980765 - Mr. J. Smith
my testing on 14567 - Mrs. L. Callum
my testing on 555777 - Dr. P. Fairlake
234567 - Mr. R. Palmer
123456 - J. Hilton
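My query works for the first and last records. However, the ones with titles are a bit trickier.
For records with a title, I need to keep the title, followed by the initial of the first name and the last name.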
SELECT description,
CASE
WHEN REGEXP_LIKE(description, '(Mr\.|Ms\.|Mrs\.|Dr\.)') THEN REGEXP_REPLACE(description, '(Ms\.|Mr\.|Mrs\.|Dr\.[A-Z][a-z]+ [A-Z]\.)')
WHEN NOT REGEXP_LIKE(description, '(Mr\.|Ms\.|Mrs\.|Dr\.)') THEN REGEXP_REPLACE(description, '(\w)\w*\W+(\w)\w*\W+(\w+),.*', '\1. \3')
ELSE 'some other validation needed'
END AS order_regex
from mytable;
Thanks again for any suggestions. K
For this exact example, you can use something like this:
select
  description,
  CASE
    -- Titled names: keep the title, first initial and last name; the optional (,.*)? also drops a suffix such as ", Sr."
    WHEN REGEXP_LIKE(description, '(Mr\.|Ms\.|Mrs\.|Dr\.)') THEN
      REGEXP_REPLACE(description, '(Ms\.|Mr\.|Mrs\.|Dr\.) ([A-Z])[a-zA-Z. ]+ ([A-Za-z]+)(,.*)?', '\1 \2. \3')
    -- Untitled names that end in a suffix such as ", Ph.D."
    WHEN REGEXP_LIKE(description, ', (Ph\.D\.|Sr\.)') THEN
      REGEXP_REPLACE(description, '([A-Z])[a-z]+ ([A-Z]\.)? ([A-Z][a-z]+), (Ph\.D\.|Sr\.)', '\1. \3')
    ELSE 'some other validation needed'
  END AS order_regex
from t1
Edit: a more general version for multi-part names:
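select
  description,
  CASE
    -- Titled names: keep the title, first initial and last name; the optional (,.*)? also drops a suffix such as ", Sr."
    WHEN REGEXP_LIKE(description, '(Mr\.|Ms\.|Mrs\.|Dr\.)') THEN
      REGEXP_REPLACE(description, '(Ms\.|Mr\.|Mrs\.|Dr\.) ([A-Z])[a-zA-Z. ]+ ([A-Za-z]+)(,.*)?', '\1 \2. \3')
    -- Untitled names with any number of middle parts, ending in a suffix such as ", Ph.D."
    WHEN REGEXP_LIKE(description, ', (Ph\.D\.|Sr\.)') THEN
      REGEXP_REPLACE(description, '([A-Z])[a-zA-Z. ]* ([A-Z][a-z]+), (Ph\.D\.|Sr\.)', '\1. \2')
    ELSE 'some other validation needed'
  END AS order_regex
from t1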
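But in general, names are hard to parse, and I'm afraid a simple set of regular expressions won't cover every case.
I would do it like this: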
select
   t1.*
  ,regexp_replace(
     t1.description
    ,'([^-]+)-\s*((Mr|Ms|Mrs|Dr)[.]\s*)?(\w)\w*(\s[a-zA-Z.]*)*\s(\w+)(,.*|$)'
    ,'\1- \2\4. \6'
   ) subs
from t1
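Since the replacement string refers to groups 1, 2, 4 and 6, it can help to look at what each of those groups holds for a single row. Below is a rough sketch of one way to do that, assuming Oracle 11g or later, where REGEXP_SUBSTR accepts a sixth argument that selects a subexpression. The CTE name s and the column aliases d, p, prefix, title, first_initial and last_name are made up for illustration only.

```sql
-- Sketch: inspect individual capture groups for one sample row.
-- The sixth argument of REGEXP_SUBSTR picks which subexpression to return.
with s as (
  select '234567 - Mr. Ryan M. Palmer, Sr.' as d
        ,'([^-]+)-\s*((Mr|Ms|Mrs|Dr)[.]\s*)?(\w)\w*(\s[a-zA-Z.]*)*\s(\w+)(,.*|$)' as p
  from dual
)
select
   regexp_substr(d, p, 1, 1, null, 1) as prefix         -- group 1: text before the "-"
  ,regexp_substr(d, p, 1, 1, null, 2) as title          -- group 2: title, dot and trailing whitespace
  ,regexp_substr(d, p, 1, 1, null, 4) as first_initial  -- group 4: first letter of the first name
  ,regexp_substr(d, p, 1, 1, null, 6) as last_name      -- group 6: last name
from s;
```

Groups 3, 5 and 7 (the bare title, the middle words and the trailing suffix) are captured as well but are not used in the replacement, which is why they are left out here.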
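A short description of this regular expression:
([^-]+)-
- finds the first part of the string, ending with "-" (subexpression #1)
\s*
- any number of whitespace characters
((Mr|Ms|Mrs|Dr)[.]\s*)?
- checks whether Mr|Ms|Mrs|Dr followed by a dot is present and captures it, together with any trailing whitespace, as subexpression #2 (the bare title inside it is subexpression #3)
(\w)\w*
- finds the first name and captures its first letter as subexpression #4
(\s[a-zA-Z.]*)*
- any number of words between the first name and the last name (subexpression #5)
\s(\w+)(,.*|$)
- finds the last name (the last word before "," or before the end of the string) and captures it as subexpression #6; a trailing ", ..." suffix, if present, is subexpression #7
Full test case: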
with t1 as (
  select 'my testing on 456897 - Carol M. Smith, Ph.D. ' description from dual union all
  select 'my testing on 435670 - Ms. Paulina M. Hall' from dual union all
  select 'my testing on 980765 - Mr. John Smith' from dual union all
  select 'my testing on 14567 - Mrs. Lena C. Callum' from dual union all
  select 'my testing on 555777 - Dr. Paul F. Fairlake' from dual union all
  select '234567 - Mr. Ryan M. Palmer, Sr.' from dual union all
  select '123456 - Joyce R. Hilton, Ph.D.' from dual
)
select
   t1.*
  ,regexp_replace(
     t1.description
    ,'([^-]+)-\s*((Mr|Ms|Mrs|Dr)[.]\s*)?(\w)\w*(\s[a-zA-Z.]*)*\s(\w+)(,.*|$)'
    ,'\1- \2\4. \6'
   ) subs
from t1;
DBFiddle: https://dbfiddle.uk/HNHHzGR4
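One thing to keep in mind: REGEXP_REPLACE returns the input unchanged when the pattern does not match, so names the pattern does not cover pass through silently. A possible way to surface them, sketched below, is to run the same pattern through REGEXP_LIKE and list the rows that fail. The '999999 - Madonna' row is a made-up example and not part of the original data; a single-word name like that should not match, because the pattern expects at least a first and a last name.

```sql
-- Sketch: list rows the pattern does not match, so they can be reviewed by hand.
-- The 'Madonna' row is hypothetical test data added for illustration.
with t1 as (
  select 'my testing on 980765 - Mr. John Smith' description from dual union all
  select '999999 - Madonna' from dual
)
select t1.description
from t1
where not regexp_like(
   t1.description
  ,'([^-]+)-\s*((Mr|Ms|Mrs|Dr)[.]\s*)?(\w)\w*(\s[a-zA-Z.]*)*\s(\w+)(,.*|$)'
);
```

Rows returned by this query are the ones that would need a different pattern or manual handling.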