awk field separator within the XML

I have an XML file with the following data.
<record record_no = "2" error_code="100">"18383531";"22677833";"21459732";"41001";"394034";"0208";"prime lending - ;corporate - 2201";"";"prime lending - lacey - 2508";"prime lending - lacey - 2508";"1";"rrvc";"tiffany poe";"heidi";"bundy";"000002274";"2.0";"18.0";"2";"362661";"rejected irs";"a1aaa";"20160720";"1021";"hedi & bundy";"4985045838";"ppassess";"web";"3683000826";"823";"ic w2";"";"";"";"";"rapid_20160801_monthly.txt";"20160720102100";"";"20160803095309";"286023";"rgt";"1";"14702324400223";"14702324400223";"0";"omcprocessed"

I'm using the following code:
    cat rr_00404.fin.bc_lerr.xml.bc | awk 'BEGIN { FS=OFS=";" } /<record/ { gsub(/&quot;/,"\""); gsub(/.*="|">.*/,"",$1); print $1,$40,$43,$46,"'base_err_xml'","0",$7; }'

The idea is the following:
- Replace &quot; with "
- Extract the error_code
- Print the values quoted with " and separated by ;
- Use sqlldr to load (no need to worry about this).

Problems to solve:
- There is a ; within the text, e.g. prime lending - ;corporate - 2201
- There's an &amp; in the data
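To illustrate the first problem in isolation, here is a minimal sketch using a made-up three-field line (an assumption for demonstration, not a line from the real file). With a plain ; separator, awk counts the semicolon inside the quoted text as a delimiter:

```shell
#!/bin/sh
# Hypothetical sample: three quoted fields, the middle one containing a literal ';'.
line='"0208";"prime lending - ;corporate - 2201";""'

# Naive split: every ';' is treated as a separator, quoted or not,
# so awk sees one more field than the line really has.
naive_count=$(printf '%s\n' "$line" | awk -F';' '{ print NF }')
echo "fields seen with -F';': $naive_count"
```

Every field after the embedded semicolon is shifted by one position, which is why $40, $43 and $46 pick up the wrong columns.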
Output:

100;"20160803095309";"1";"1";"base_err_xml";"0";"prime lending 
100;"286023";"14702324400223";"omcprocessed";"base_err_xml";"0";"prime lending - corporate - 2201"
100;"286024-1";"";"omcprocessed";"base_err_xml";"0";"prime lending - corporate - 2201"
awk is the wrong tool for this job, at least without preprocessing. Here, I use XMLStarlet for a first pass (decoding XML entities and splitting the attributes off into separate fields), and GNU awk for a second (reading those fields and performing whatever transforms or logic you need):
    #!/bin/sh
    # Reads XML on stdin; puts record_no in the first field, the error code in the second,
    # ...and the record content in the remainder of the output line.
    xmlstarlet sel -t -m '//record' \
      -v ./@record_no -o ';' \
      -v ./@error_code -o ';' \
      -v . -n

...and, cribbed from the GNU awk documentation...
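    #!/usr/bin/gawk -f
    # Must be GNU awk for the FPAT feature; adjust the path above if gawk lives elsewhere.
    BEGIN {
      # A field is either a run of non-semicolons or a double-quoted string.
      FPAT = "([^;]*)|(\"[^\"]*\")"
    }
    {
      print "NF = ", NF
      for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
      }
    }

Here, we're just having gawk show how the fields are split, but obviously you can modify the script for whatever needs you have.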
A subset of the output of the above for your given input file (when extended to be valid XML) is quoted below:

    $1 = <2>
    $2 = <100>
    $9 = <"prime lending - ;corporate - 2201">

Note, then, that $1 is the record_no, $2 is the error_code, and $9 correctly contains the semicolon as literal content.
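As a sketch of how the second pass might be adapted to the columns the question was printing: because the first pass prepends record_no and error_code, original field N becomes $(N+2), so $40, $43, $46 and $7 become $42, $45, $48 and $9. The function name `select_fields` is made up for this example, and the quoted "base_err_xml" and "0" constants are an assumption based on the output shown in the question.

```shell
#!/bin/sh
# Sketch only: expects lines shaped like  record_no;error_code;"f1";"f2";...
# as produced by the xmlstarlet first pass. Requires GNU awk for FPAT.
select_fields() {
  gawk '
    BEGIN {
      FPAT = "([^;]*)|(\"[^\"]*\")"   # unquoted run, or a double-quoted string
      OFS = ";"
    }
    {
      # $2 is error_code; the rest are the original $40, $43, $46 and $7, shifted by 2.
      print $2, $42, $45, $48, "\"base_err_xml\"", "\"0\"", $9
    }
  '
}
```

It would be used as the last stage of the pipeline, reading the first pass's output on stdin.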
Obviously, you can encapsulate both of these components in shell functions to avoid the need for separate files.
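One way that encapsulation might look, as a rough sketch: the function names `extract_records` and `show_fields` and the file name `input.xml` are placeholders, and both xmlstarlet and gawk are assumed to be on the PATH.

```shell
#!/bin/sh
# First pass: decode XML entities and emit  record_no;error_code;content  per record.
extract_records() {
  xmlstarlet sel -t -m '//record' \
    -v ./@record_no -o ';' \
    -v ./@error_code -o ';' \
    -v . -n
}

# Second pass: split on ';' while keeping quoted fields intact (GNU awk only).
show_fields() {
  gawk '
    BEGIN { FPAT = "([^;]*)|(\"[^\"]*\")" }
    {
      print "NF = ", NF
      for (i = 1; i <= NF; i++) printf("$%d = <%s>\n", i, $i)
    }
  '
}

# Usage (input.xml is a placeholder):
#   extract_records <input.xml | show_fields
```

Keeping the two passes as separate functions makes it easy to swap the second one for whatever field selection is actually needed.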