Unable to capture the failing record when awk processes a huge text file

Problem description

Below is my production code. The job fails intermittently, returning a non-zero return code every other day or so. The input to the job is a huge .txt file.

#!/bin/ksh  
. /opt/coy/coyvars
. /opt/coy/coyenv.sh

typeset -Z2 inc
File="/opt/coy/data/output/mq/listdly.txt"
work_dir="/opt/coy/data/output/mq/"
log_dir="/opt/coy/logs/"
log_fname="coydata_sensold_d.log"
err_fname="sensold_error.log"
tmp_file=$work_dir"corder_rpt_daily.txt"

rm -f $log_dir$log_fname

print "Beginning $0 script" >$log_dir$log_fname

rm -f /opt/coy/data/output/mq/*lc_coy_dly_0*

find /opt/coy/data/input/archive/ -name '*lc_coy_dly_0*'  -type f -mtime +1 -exec rm -f {} \;

cd /opt/coy/data/input/archive
last_date=`find . -name 'd_corder_rpt_*.Z' | cut -c 23-30 |sort -u | tail -1`
find . -name 'd_corder_rpt_*.Z' | cut -c 18-21 |sort -u > $File
cons_date=[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]

while read storenum
do

    c_file1=$(ls d_corder_rpt_${storenum}_${cons_date}.txt.Z)
    uncompress /opt/coy/data/input/archive/$c_file1  2>> $log_dir$log_fname
    c_file2=$(ls d_corder_rpt_${storenum}_${cons_date}.txt)
    sort -u +12n -t"," -c /opt/coy/data/input/archive/$c_file2 -o /opt/coy/data/input/archive/$c_file2 2>/dev/null
    cat /opt/coy/data/input/archive/$c_file2 >> $tmp_file 2>> $log_dir$log_fname
    compress -f /opt/coy/data/input/archive/$c_file2 2>> $log_dir$log_fname

done < $File

print "Temp merged data file completed" >>$log_dir$log_fname

# Remove store list file
rm -f $File

###format date to YYYY-MM-DD ###
end_dt=$last_date

end_dt_yr=$(echo ${end_dt}|cut -c1-4)
end_dt_mon=$(echo ${end_dt}|cut -c5-6)
end_dt_day=$(echo ${end_dt}|cut -c7-8)

oldfile=`ls /opt/coy/data/input/archive/${end_dt}_lc_coy_dly_??.dat.Z |tail -1`

if [[ -z $oldfile ]]
  then
    file_name=$end_dt"_lc_coy_dly_01"
    inc=01
else
  num=`echo $oldfile | cut -c 49-50`
  (( inc = num + 1 ))
  file_name=$end_dt"_lc_coy_dly_"$inc
fi

while read -r line
  do
awk ' BEGIN {FS= ","}     # set the field delimiter to comma
    $2 ~ /[mM]/  {     ### Check if it is an item level "M" ($2 - second field)

    ###Check if UPC code exists###
    if  ($13 ~ /[1-9]+/ )
      {
        UPCNext=$13
        if ( UPCNext != UPCPrev )
          {
            UPCPrev=UPCNext
            $3=substr($3,5,4)"-"substr($3,1,2)"-"substr($3,3,2)  ###reformat date to yyyy-mm-dd
            #$9=sprintf("%7.7f",$9)       ### format ninth field to  0.0000001
            #$9=substr($9,2)              ### and cut first zero according to requirements #8s
            ### STP11702 - TDT,II - Added fields 14-19 to the end of the record
            printf ("%10s|%05s|%014s|%08d|%08d|%09.3f|%08d|%08d|%08d|%08d|%13s|%08d|%08d|%8s|%s|%s| \n",
                    $3,$1,$13,$5,$6,$7,$8,$9,$10,$11,$14,$15,$16,$17,$18,$19)
          }
        else
          {
            printf ("%s\t Duplicate UPC \n",$0) >>$log_dir$log_fname
          }
      }
    else
      {
        printf ("%s\t No UPC  for this item\n",$0) >>$log_dir$log_fname
      }
    }
             ' 2>>$log_dir$err_fname
 done<$tmp_file>$work_dir$file_name".dat"

 rc="$?"
 
 echo "Value of rc after awk is: $rc" >>$log_dir$log_fname
 ###create manifest file consisting of record count and size ###
 size=`ls -l $work_dir$file_name.dat | awk '{print $5}'`
 rownum=`wc -l<$work_dir$file_name.dat`
 printf "%4s|%s|%s|%s\n" $end_dt_yr-$end_dt_mon-$end_dt_day $inc $size $rownum >$work_dir$file_name.mft

 # compresses and archives the Merged data file
 compress -f $work_dir$file_name.dat
 cp $work_dir$file_name.dat.Z /opt/coy/data/input/archive/$file_name.dat.Z

 # removes the temporary merged data file
 rm -f $tmp_file

 #uncompresses the merged data file for pick up by Tibco
 uncompress $work_dir$file_name.dat

 case "$rc" in
 0) print "Script $0 completed successfully" >>$log_dir$log_fname;;
 *) print "Awk part in $0 Failed!" >>$log_dir$log_fname;;
 esac

 exit $rc

However, I made a small change to this script, using

' 2>>$log_dir$err_fname
to capture the standard error returned by awk and write it to a custom error file.

 cat sensolid_error.log
    awk: 0602-562 Field $() is not correct.
     The input line number is 3.567832e+04.
     The source line number is 20.

However, the captured stderr does not tell me which record awk was processing when it failed. I need to identify that input line among the huge number of lines in the input txt file.
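
(For reference, awk reports the record number in scientific notation: 3.567832e+04 is roughly record 35678 of whatever stream awk was reading.) A minimal sketch for pulling that record out of the merged temp file for inspection; the path is the tmp_file built in the script, and the +1 offset is an assumption that the outer `read` has already consumed the first line of the file before awk takes over stdin:

    # 3.567832e+04 ~= 35678 -- the record number relative to where awk began reading
    tmp_file=/opt/coy/data/output/mq/corder_rpt_daily.txt
    rec=35678
    # Assumption: `while read -r line` consumed one line before awk started reading,
    # so awk's record N corresponds to line N+1 of the temp file.
    sed -n "$((rec + 1))p" "$tmp_file"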

Sample input records look like this:

1,M,03282024,127722,16,0,0,15,0,0,0,0,4157014707,003B004009011,0,0,20240218,A,U
1,M,03282024,154230,7,0,0,8,0,0,0,1,68826755008,003B004011008,0,0,20231004,A,N
1,M,03282024,127747,5,0,0,8,0,0,0,1,2900007906,003B004002011,0,0,20231104,A,N
38,I,03282024,247657,,,,,,,,Y
38,I,03282024,247658,,,,,,,,Y
38,I,03282024,247664,,,,,,,,Y
1,M,03282024,165805,3,0,0,3,0,0,0,0,4133321301,694B011001009,0,0,20231010,A,U
1,M,03282024,165815,4,0,0,5,0,0,0,0,3980003678,694B010001008,0,0,20231010,A,U
1,M,03282024,165817,4,0,0,5,0,0,0,0,3980001361,694B010001007,0,0,20231010,A,U
1,M,03282024,224743,3,0,0,3,0,0,0,0,3980091156,694B010001006,0,0,20231010,A,N

Can you suggest a way to find out why awk is failing and on which input line it fails?

linux awk ksh txt
1 Answer

These input lines have only 12 fields

38,I,03282024,247658,,,,,,,,Y
38,I,03282024,247664,,,,,,,,Y

but you are trying to test field 13

if  ($13 ~ /[1-9]+/ )
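
A minimal sketch of two ways to act on this, assuming the comma-delimited layout shown in the question (the paths reuse the tmp_file and error-log names from the script): first list every short record together with its record number, then guard the main awk block so such records are logged and skipped instead of tripping the field error.

    # List every record with fewer than 13 fields, prefixed by its record number,
    # so the offending rows can be inspected directly.
    awk -F, 'NF < 13 { print FNR ": " $0 }' /opt/coy/data/output/mq/corder_rpt_daily.txt

    # Guard the main program: log short records with their record number, then
    # skip them, so the $13 reference never sees a 12-field line.
    awk -F, '
        NF < 13 {
            printf ("record %d has only %d fields: %s\n", NR, NF, $0) >> "/opt/coy/logs/sensold_error.log"
            next
        }
        $2 ~ /[mM]/ { print $13 }   # placeholder for the existing reformatting block
    ' /opt/coy/data/output/mq/corder_rpt_daily.txt

Either way the record number ends up next to the offending data, which is the information the original error message only reported in scientific notation.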
