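I want to access and process the fourth column of a CSV file. In particular, I want to exclude rows that do not meet a specific requirement (rows that do not contain a 3-character country code).
My dataset: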
Luxembourg,LUX,2017,9294689.12
Aruba,ABW,2017,927865.82
Nepal,NPL,2017,9028196.37
Bangladesh,BGD,2017,88057460.51
Costa Rica,CRI,2017,8695008.05
Chile,CHL,2017,84603249.72
Cook Islands,COK,2017,82045.41
World,OWIDWRL,1755,9361520
India,INDIA,1763,0
Asia and Pacific (other),,2017,5071156099
World,OWID_WRL,1752,9354192
Middle East,,1751,0
International transport,,1751,0
India,IND,1751,0
Europe (other),,1751,0
China,CHN,1751,0
Asia and Pacific (other),,1751,0
Americas (other),,1751,0
Africa,,1751,0
Thanks in advance.
I have already sorted the data file by year, but I don't know how to access the 4th column using awk or sed.
Expected dataset:
Luxembourg,LUX,2017,9294689.12
Aruba,ABW,2017,927865.82
Nepal,NPL,2017,9028196.37
Bangladesh,BGD,2017,88057460.51
Costa Rica,CRI,2017,8695008.05
Chile,CHL,2017,84603249.72
Cook Islands,COK,2017,82045.41
awk --re-interval -F, 'tolower($2) ~ /^[a-z]{3}$/' country.txt
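You could also check the length of the field, but this approach ensures that it contains exactly 3 letters.
--re-interval is a gawk option that lets you use interval expressions such as {3} in the regex; older gawk versions do not treat braces as interval operators by default.
-F, tells awk that the input field separator is a comma.
tolower($2) ~ /^[a-z]{3}$/ is shorthand for tolower($2) ~ /^[a-z]{3}$/ {print}.
tolower($2) lowercases the value of the second field, and ~ is the regex match operator. The regex checks for the start of the string (^), then [a-z] repeated {3} times, then the end of the string ($).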
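To illustrate the remark about checking the length instead, here is a minimal sketch comparing the two filters. The rows "Test,A1B" and the variable names are made up for the comparison and are not part of the asker's data; the letter check is spelled out three times so it does not depend on interval support.

```shell
# Hypothetical sample rows; "A1B" is invented to show where the two checks differ.
sample='Aruba,ABW,2017,927865.82
Test,A1B,2017,1
India,INDIA,1763,0
Middle East,,1751,0'

# Length check only: keeps any 3-character code, including "A1B".
by_length=$(printf '%s\n' "$sample" | awk -F, 'length($2) == 3')

# Letter check: keeps only codes made of exactly 3 letters.
by_letters=$(printf '%s\n' "$sample" | awk -F, 'tolower($2) ~ /^[a-z][a-z][a-z]$/')

printf 'length check:\n%s\n\nletter check:\n%s\n' "$by_length" "$by_letters"
```

The length check lets the "A1B" row through, while the regex keeps only the "ABW" row, which is why the regex is the stricter choice here.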
awk 'BEGIN{FS=","} $2~/^[a-zA-Z]{3}$/' Input_file
If you are using an old awk in which the {3} interval does not work, try the following instead:
awk 'BEGIN{FS=","} $2~/^[a-zA-Z][a-zA-Z][a-zA-Z]$/' Input_file
Explanation: the code above with comments added.
awk '                          ##Start of the awk program.
BEGIN{                         ##BEGIN section, executed before Input_file is read.
  FS=","                       ##Set the field separator to a comma.
}                              ##End of the BEGIN section.
$2~/^[a-zA-Z]{3}$/             ##Condition: the 2nd field consists of exactly 3 letters.
                               ##awk works as condition then action; if the condition is TRUE, the action runs.
                               ##No action is given here, so by default the line is printed.
' Input_file                   ##Name of the input file.
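Since the question also asks how to access the fourth column, the following is a rough sketch of how the same filter could be combined with an action that reads $4. Summing the values is only an assumed example of "processing"; the question does not say what should be computed, and the variable names are illustrative.

```shell
# A few sample rows copied from the question.
data='Luxembourg,LUX,2017,9294689.12
Aruba,ABW,2017,927865.82
World,OWID_WRL,1752,9354192
India,INDIA,1763,0
Middle East,,1751,0
China,CHN,1751,0'

# Keep rows with a 3-letter code, then read the 4th field as a number.
# The sum is an assumed illustration of processing, not the asker's stated goal.
result=$(printf '%s\n' "$data" | awk -F, '
  $2 ~ /^[a-zA-Z][a-zA-Z][a-zA-Z]$/ {
    count++          # rows that passed the filter
    total += $4      # $4 is the 4th comma-separated field
  }
  END { printf "%d rows, total %.2f\n", count, total }
')
echo "$result"
```

With these six rows, three pass the filter (LUX, ABW, CHN), so the script prints "3 rows, total 10222554.94". Replacing the action with { print $4 } would list the fourth column of the matching rows instead.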