awk field separator within the XML
I have an XML file with the following data.
<record record_no = "2" error_code="100">"18383531";"22677833";"21459732";"41001";"394034";"0208";"prime lending - ;corporate - 2201";"";"prime lending - lacey - 2508";"prime lending - lacey - 2508";"1";"rrvc";"tiffany poe";"heidi";"bundy";"000002274";"2.0";"18.0";"2";"362661";"rejected irs";"a1aaa";"20160720";"1021";"hedi & bundy";"4985045838";"ppassess";"web";"3683000826";"823";"ic w2";"";"";"";"";"rapid_20160801_monthly.txt";"20160720102100";"";"20160803095309";"286023";"rgt";"1";"14702324400223";"14702324400223";"0";"omcprocessed"
I'm using the following code:
cat rr_00404.fin.bc_lerr.xml.bc | awk 'BEGIN { FS=OFS=";" } /<record/ { gsub(/&quot;/,"\""); gsub(/.*="|">.*/,"",$1); print $1,$40,$43,$46,"'base_err_xml'","0",$7; }'
The idea is the following:
- replace &quot; with "
- extract error_code
- print the values quoted with " and separated by ;
- use sqlldr to load them (no need to worry about this part).
Problems to solve:
- there is a ; within the text, e.g. "prime lending - ;corporate - 2201"
- there's an & within the text, e.g. "hedi & bundy"
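To make the first problem concrete, here is a minimal sketch showing that a plain ; separator sees one field too many. The helper name count_naive_fields and the three-value sample are illustrative assumptions, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: count the fields awk sees when it splits on every ';',
# including the ones that sit inside quoted values.
count_naive_fields() {
    printf '%s\n' "$1" | awk -F';' '{ print NF }'
}

# Three quoted values; the middle one contains a literal ';'.
sample='"0208";"prime lending - ;corporate - 2201";""'

count_naive_fields "$sample"   # prints 4, although there are only 3 values
```

Every field after the embedded ; is shifted by one, which is why the column numbers in the print statement stop lining up.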
Output:
100;"20160803095309";"1";"1";"base_err_xml";"0";"prime lending
100;"286023";"14702324400223";"omcprocessed";"base_err_xml";"0";"prime lending - corporate - 2201"
100;"286024-1";"";"omcprocessed";"base_err_xml";"0";"prime lending - corporate - 2201"
awk is the wrong tool for this job, at least without preprocessing. Here, I'd use xmlstarlet for a first pass (decoding XML entities and splitting the attributes off into separate fields), and GNU awk for a second (reading the fields and performing whatever transforms or logic you need):
#!/bin/sh
# reads XML on stdin; puts record_no in first field, error code in second,
# ...record content in remainder of output line.
xmlstarlet sel -t -m '//record' \
  -v ./@record_no -o ';' \
  -v ./@error_code -o ';' \
  -v . -n
...and, cribbed from the GNU awk documentation...
#!/bin/env gawk -f
# must be GNU awk for the FPAT feature
BEGIN {
    FPAT = "([^;]*)|(\"[^\"]*\")"
}
{
    print "NF = ", NF
    for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
    }
}
Here, all we're doing with gawk is showing how the fields split, but obviously you can modify the script for whatever needs you have.
A subset of the output of the above for your given input file (when extended to be valid XML) is quoted below:
$1 = <2>
$2 = <100>
$9 = <"prime lending - ;corporate - 2201">
Note, then, that $1 is record_no, $2 is error_code, and $9 correctly contains the semicolon as literal content.
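As one possible adaptation, the sketch below prints the columns the question was after instead of dumping every field. It assumes the input is the ;-separated output of the xmlstarlet pass, so the 40th, 43rd, 46th and 7th record values land in $42, $45, $48 and $9. The function name select_fields is hypothetical, and the literal "base_err_xml" and "0" columns are copied from the question's print statement.

```shell
#!/bin/sh
# Hypothetical helper: reads the ';'-separated lines produced by the
# xmlstarlet pass on stdin and prints a selection of columns.
# Requires GNU awk, because FPAT is a gawk extension.
select_fields() {
    gawk '
        BEGIN {
            # A field is either a run of non-semicolons or a quoted string.
            FPAT = "([^;]*)|(\"[^\"]*\")"
            OFS = ";"
        }
        {
            # $1 is record_no and $2 is error_code, so record value N is
            # found in field N + 2.
            print $2, $42, $45, $48, "\"base_err_xml\"", "\"0\"", $9
        }
    '
}
```

If the embedded semicolon should be dropped from the last column, a gsub(/;/, "", $9) before the print would be one way to do it.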
Obviously, you can encapsulate both of these components in shell functions to avoid the need for separate files.
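A minimal sketch of that arrangement follows. The function names records_to_lines and show_fields are hypothetical; the commands inside them are the two components shown earlier in this answer, and both xmlstarlet and GNU awk are assumed to be installed.

```shell
#!/bin/sh
# Sketch only: function names are illustrative.

# First pass: decode entities and emit record_no;error_code;content,
# one line per record element.
records_to_lines() {
    xmlstarlet sel -t -m '//record' \
        -v ./@record_no -o ';' \
        -v ./@error_code -o ';' \
        -v . -n
}

# Second pass: split with FPAT so that a semicolon inside a quoted
# value stays part of its field.
show_fields() {
    gawk '
        BEGIN { FPAT = "([^;]*)|(\"[^\"]*\")" }
        {
            print "NF = ", NF
            for (i = 1; i <= NF; i++) {
                printf("$%d = <%s>\n", i, $i)
            }
        }
    '
}

# Usage (the input must be well-formed XML):
#   records_to_lines < input.xml | show_fields
```

Keeping the two passes as separate functions means the gawk half can be swapped for a different transform without touching the XML handling.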