awk field separator within the XML


I have an XML file with the following data.

<record record_no = "2" error_code="100">&quot;18383531&quot;;&quot;22677833&quot;;&quot;21459732&quot;;&quot;41001&quot;;&quot;394034&quot;;&quot;0208&quot;;&quot;prime lending - ;corporate  - 2201&quot;;&quot;&quot;;&quot;prime lending - lacey - 2508&quot;;&quot;prime lending - lacey - 2508&quot;;&quot;1&quot;;&quot;rrvc&quot;;&quot;tiffany poe&quot;;&quot;heidi&quot;;&quot;bundy&quot;;&quot;000002274&quot;;&quot;2.0&quot;;&quot;18.0&quot;;&quot;2&quot;;&quot;362661&quot;;&quot;rejected irs&quot;;&quot;a1aaa&quot;;&quot;20160720&quot;;&quot;1021&quot;;&quot;hedi &amp; bundy&quot;;&quot;4985045838&quot;;&quot;ppassess&quot;;&quot;web&quot;;&quot;3683000826&quot;;&quot;823&quot;;&quot;ic w2&quot;;&quot;&quot;;&quot;&quot;;&quot;&quot;;&quot;&quot;;&quot;rapid_20160801_monthly.txt&quot;;&quot;20160720102100&quot;;&quot;&quot;;&quot;20160803095309&quot;;&quot;286023&quot;;&quot;rgt&quot;;&quot;1&quot;;&quot;14702324400223&quot;;&quot;14702324400223&quot;;&quot;0&quot;;&quot;omcprocessed&quot; 

I'm using the following code:

    cat rr_00404.fin.bc_lerr.xml.bc | awk 'BEGIN { FS=OFS=";" } /<record/ { gsub(/&quot;/,"\""); gsub(/.*="|">.*/,"",$1); print $1,$40,$43,$46,"'base_err_xml'","0",$7; }'

The idea is the following:

  1. Replace &quot; with "
  2. Extract the error_code
  3. Print the values, quoted with " and separated by ;
  4. Use sqlldr to load them (no need to worry about this).

Problems to solve:

  1. There is a ; within the text, e.g. prime lending - ;corporate  - 2201
  2. There's an &amp;
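The first problem can be reproduced in isolation. The following is a minimal sketch using a made-up three-field sample (the function name `count_naive_fields` is hypothetical, not from the original script): splitting on every `;` reports one field too many as soon as a quoted value contains a semicolon.

```shell
# Hypothetical three-field sample; the second field contains a literal semicolon.
sample='"0208";"prime lending - ;corporate  - 2201";""'

# Count fields the naive way, treating every ';' as a separator.
count_naive_fields() {
    printf '%s\n' "$1" | awk -F';' '{ print NF }'
}

count_naive_fields "$sample"   # prints 4, although the line holds 3 fields
```

Every field after the embedded semicolon is therefore shifted by one, which is why the first output line is truncated and shows the wrong columns.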

Output:

    100;"20160803095309";"1";"1";"base_err_xml";"0";"prime lending 
    100;"286023";"14702324400223";"omcprocessed";"base_err_xml";"0";"prime lending - corporate  - 2201"
    100;"286024-1";"";"omcprocessed";"base_err_xml";"0";"prime lending - corporate  - 2201"

awk is the wrong tool for this job, at least without preprocessing. Here, we use XMLStarlet for a first pass (decoding XML entities and splitting the attributes off into separate fields), and GNU awk for a second (reading those fields and performing whatever transforms or logic you need):

    #!/bin/sh
    # Reads XML on stdin; puts record_no in the first field, error code in the second,
    # ...and the record content in the remainder of the output line.

    xmlstarlet sel -t -m '//record' \
      -v ./@record_no -o ';' \
      -v ./@error_code -o ';' \
      -v . -n

...and, cribbed from the GNU awk documentation...

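    #!/usr/bin/gawk -f
    # Must be GNU awk, for the FPAT feature. Adjust the interpreter path if
    # gawk lives elsewhere; "#!/bin/env gawk -f" fails on Linux because the
    # kernel passes "gawk -f" to env as a single argument.

    BEGIN {
        # A field is either a run of non-semicolons or a double-quoted string.
        FPAT = "([^;]*)|(\"[^\"]*\")"
    }

    {
        print "NF = ", NF
        for (i = 1; i <= NF; i++) {
            printf("$%d = <%s>\n", i, $i)
        }
    }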

Here, we're only using gawk to show how the fields are split; obviously, you can modify that script for whatever needs you have.


A subset of the output of the above for your given input file (when extended to be valid XML) is quoted below:

    $1 = <2>
    $2 = <100>
    $9 = <"prime lending - ;corporate  - 2201">

Note, then, that $1 is the record_no, $2 is the error_code, and $9 correctly contains the semicolon as literal content.
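As a sketch of how the splitting script might be adapted to the columns in the question: because the xmlstarlet pass prepends record_no and error_code, every original field number shifts by two, so the question's $40, $43, $46 and $7 become $42, $45, $48 and $9. The function name `select_fields` is made up for illustration, and the literal "base_err_xml" and "0" columns are copied from the output shown in the question.

```shell
# Hypothetical helper: reads lines produced by the xmlstarlet pass on stdin
# and prints the columns the question asks for. Requires GNU awk for FPAT.
select_fields() {
    gawk '
        BEGIN {
            FPAT = "([^;]*)|(\"[^\"]*\")"   # unquoted text or a quoted string
            OFS = ";"
        }
        {
            # $2 is error_code; the rest are the original fields shifted by two.
            print $2, $42, $45, $48, "\"base_err_xml\"", "\"0\"", $9
        }
    '
}

# Usage (assumes the first-pass output is piped in):
#   xmlstarlet sel ... < input.xml | select_fields
```

Since the quoted fields keep their surrounding quotes, the output should already be in the quoted, semicolon-separated form intended for sqlldr.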


Obviously, you can encapsulate both of these components in shell functions to avoid the need for separate files.
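One way that encapsulation might look is sketched below. The function names `extract_records` and `show_fields` are hypothetical. The `-T` flag (plain-text output) is an addition that the standalone command above omits; it is included here on the assumption that literal characters such as & are wanted in the output rather than re-escaped entities.

```shell
# First pass: flatten each <record> to "record_no;error_code;content".
# -T asks xmlstarlet for plain-text output (an addition, see the note above).
extract_records() {
    xmlstarlet sel -T -t -m '//record' \
        -v ./@record_no -o ';' \
        -v ./@error_code -o ';' \
        -v . -n
}

# Second pass: quote-aware field splitting (needs GNU awk for FPAT).
show_fields() {
    gawk '
        BEGIN { FPAT = "([^;]*)|(\"[^\"]*\")" }
        {
            print "NF = ", NF
            for (i = 1; i <= NF; i++)
                printf("$%d = <%s>\n", i, $i)
        }
    '
}

# Usage: extract_records < input.xml | show_fields
```

Both functions read stdin and write stdout, so either pass can be swapped out or tested on its own.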

