javascript - Analysis of a large json log file in node.js


I have the following JSON file:

sensorlogs.json {"arr":[{"utctime":10000001,"s1":22,"s2":32,"s3":42,"s4":12}, {"utctime":10000002,"s1":23,"s2":33,"s4":13}, {"utctime":10000003,"s1":24,"s2":34,"s3":43,"s4":14}, {"utctime":10000005,"s1":26,"s2":36,"s3":44,"s4":16}, {"utctime":10000006,"s1":27,"s2":37,"s4":17}, {"utctime":10000004,"s1":25,"s2":35,"s4":15}, ... {"utctime":12345678,"s1":57,"s2":35,"s3":77,"s4":99} ]} 

Sensors s1, s2, s3, etc. transmit at different frequencies (note that s3 transmits every 2 seconds), and the timestamps can be out of order.

How can I achieve the following -

analyzing s1: s = [[10000001, 22], [10000002, 23], .. [12345678, 57]]
s1 had 2 missing entries

analyzing s2: s = [[10000001, 32], [10000002, 33], .. [12345678, 35]]
s2 had 0 missing entries

analyzing s3: s = [[10000001, 42], [10000003, 43], .. [12345678, 77]]
s3 had 0 missing entries

analyzing s4: s = [[10000001, 12], [10000003, 13], .. [12345678, 99]]
s4 had 1 missing entries

sensorlogs.json is 16 GB.

Missing entries can be found based on the difference between consecutive UTC timestamps, since each sensor transmits at a known frequency.
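For example, once one sensor's readings sit in a single sorted array, the gap check itself is small. Here is a minimal sketch (assuming entries of the form [utctime, value] and a known period in seconds; countMissing is just an illustrative name):

function countMissing(entries, periodSeconds) {
  var missing = 0;
  for (var i = 1; i < entries.length; i++) {
    // entries[i] = [utctime, value]; a gap of 2 * period means exactly one entry was skipped
    var gap = entries[i][0] - entries[i - 1][0];
    missing += Math.round(gap / periodSeconds) - 1;
  }
  return missing;
}

// e.g. for s3 (period 2): countMissing([[10000001, 42], [10000003, 43], [10000007, 44]], 2) === 1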

I cannot use multiple large arrays for the analysis due to memory constraints, so I have to make multiple passes over the same JSON log file and use a single large array for the analysis.

What I have till now is the following -

var fs = require('fs');
var lr = require('readline');
var async = require('async');

var filePath = 'sensorlogs.json';
var tmpObj = {};
var arrStrm = [];
var currSensor;
var result = [];

// 1. Extract keys from the log file
console.log("Extracting keys... \n");
var stream = fs.createReadStream(filePath);
var lineReader = lr.createInterface({
  input: stream
});

lineReader.on('line', function (line) {
  getKeys(line); // extract keys from the JSON
});

stream.on('end', function () {
  // obj -> arr
  for (var key in tmpObj)
    arrStrm.push(key);

  // 2. Validate individual sensors
  console.log("Validating sensor data ...\n");

  // synchronous execution of the sensors in the array
  async.each(arrStrm, function (key) {
    currSensor = key;
    console.log("Validating " + currSensor + "...\n");

    stream = fs.createReadStream(filePath);
    lineReader = lr.createInterface({
      input: stream
    });

    lineReader.on('line', function (line) {
      processLine(line); // build the array for the current sensor
    });
    stream.on('end', function () {
      processSensor(currSensor); // process the data for the current sensor
    });
  });
});

function getKeys(line) {
  var pos;
  if (((pos = line.indexOf('[')) >= 0) || ((pos = line.indexOf(']')) >= 0))
    return;
  if (line[line.length - 1] == '\r') line = line.substr(0, line.length - 1); // discard CR (0x0D)
  if (line[line.length - 1] == ',') line = line.substr(0, line.length - 1); // discard the trailing comma
  // console.log(line);

  if (line.length > 1) { // ignore empty lines
    var obj = JSON.parse(line); // parse the JSON
    for (var key in obj) {
      if (key != "debug") {
        if (tmpObj[key] == undefined)
          tmpObj[key] = [];
      }
    }
  }
}

Of course this doesn't work, and I am not able to find anything on the net that explains how this can be implemented.

Note: I can choose a language of my choice to develop this tool (C/C++, C#/Java/Python), but I am going with JavaScript because of its capability of parsing JSON arrays (and my interest in getting better at JS as well). Would you suggest an alternate language if JavaScript isn't the best language to make such a tool?

Edit: Some important info that was either not clear or that I did not include earlier, but looks important to include in the question -

  1. The data in the JSON logs is not streaming live; it is stored in a JSON file on the hard disk.
  2. The data is not stored in chronological order, which means the timestamps might not be in the correct order. Each sensor's data needs to be sorted based on the timestamps after it has been stored in an array.
  3. I cannot use a separate array for each sensor (that would be the same as storing the entire 16 GB JSON in RAM), so to save memory only one array should be used at a time. And yes, there are more than 4 sensors in the log; this is just a sample (there are roughly 20, to give an idea).

I have modified the JSON and the expected output above.

One solution might be to make multiple passes over the JSON file, storing one sensor's data with timestamps in an array at a time, then sorting the array and analyzing it for data corruption and gaps. That is what I am trying to do in the code above.
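In other words, each pass would look roughly like the sketch below (an outline only, not a working tool: analyzeSensor and done are illustrative names, and it skips the wrapper bracket lines the same way getKeys does):

var fs = require('fs');
var readline = require('readline');

// One pass over the log for a single sensor: collect [utctime, value] pairs,
// then sort them so out-of-order timestamps can be analyzed for gaps.
function analyzeSensor(filePath, sensorName, done) {
  var s = []; // the single large array for this pass
  var rl = readline.createInterface({ input: fs.createReadStream(filePath) });

  rl.on('line', function (line) {
    if (line.indexOf('[') >= 0 || line.indexOf(']') >= 0) return; // skip the wrapper lines
    line = line.trim().replace(/,$/, ''); // drop the trailing comma
    if (line.length < 2) return; // ignore empty lines
    var obj = JSON.parse(line);
    if (obj[sensorName] !== undefined)
      s.push([obj.utctime, obj[sensorName]]);
  });

  rl.on('close', function () {
    s.sort(function (a, b) { return a[0] - b[0]; }); // timestamps may be out of order
    done(s); // hand the sorted array over for the gap analysis
  });
}

One such pass would be run per sensor, so only one sensor's array is ever held in memory at a time.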

So you have a big fat sensor log of 16 GB wrapped in JSON.

To start with, an entire JSON file of 16 GB isn't realistic to parse in one go, because the opening and closing brackets break the regularity and turn into annoying characters inside the array. We know the file has a beginning and an end, and moreover, without them the program can work on chunks of the file or on a stream plugged directly into the device. So let's assume what is being processed is:

{"utctime":10000001,"s1":22,"s2":32,"s3":42,"s4":12}, {"utctime":10000002,"s1":23,"s2":33,"s4":13}, {"utctime":10000003,"s1":24,"s2":34,"s3":43,"s4":14}, ... {"utctime":12345678,"s1":57,"s2":35,"s3":77,"s4":99}, 

Adding or detecting the missing comma at the end shouldn't be difficult.

Now every line is formatted the same way and can be interpreted as JSON. The problem is: did the sensors output data when they were expected to? If we are sure they speak at the right time and at the right frequency but sometimes miss a write (case 1), all is well. If they start to make slight slips on the time frame (case 2), some kind of heuristic to recover the proper line-to-line frequency is needed and the analysis gets longer.

If we are not processing in realtime, a first and easy validation check of the file would be to tell whether on every freq-th line the expected sensor data is found, right?

In any case, since it's a big file, it has to be processed line by line whenever possible.

In the following program, I considered only case 1, and that we are processing a continuous stream.

#!/usr/bin/python
import json

sensors = {}
sensors['s1'] = [1]  # frequencies
sensors['s2'] = [1]
sensors['s3'] = [2]
sensors['s4'] = [1]

# append a data array and an error counter at sensors[i]
# so it holds [freq, err, data]
for k, v in sensors.iteritems():
    sensors[k].extend([0, []])
frq = 0; err = 1; dat = 2
print list(sorted(sensors.items()))
s = list(sorted(sensors.keys()))

with open('./sensors.json', "r") as stream:
    i = 0
    for line in stream:
        if not line.rstrip(): continue  # skip blank lines
        j = json.loads(line[:-2])  # skip the comma and \n
        t = j["utctime"]
        for k in s:
            sensor = sensors[k]
            if i % sensor[frq] == 0:  # every nth iteration
                v = j.get(k)
                if v is None:
                    sensor[err] += 1
                    print k, "has", sensor[err], "missing entries"
                sensor[dat].append([t, v])  # append the sensor data
                # filling up memory...
        i += 1

for k, v in sorted(sensors.iteritems()):
    print k, sensors[k][dat]
for k, v in sorted(sensors.iteritems()):
    print k, 'had', sensors[k][err], "missing entries"

To handle case 2, you would invert the None check with the modulus check, to verify whether a sensor wrote when it wasn't supposed to, and try to detect shifts.
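In node.js terms (the question's target language), that inverted check might look something like the following sketch, with illustrative names and the frequencies expressed as line counts, as in the program above:

// Hypothetical sketch of the "case 2" check: flag a sensor that wrote on a line
// where its frequency says it shouldn't have, which hints at a shifted time frame.
function checkUnexpectedWrites(obj, lineIndex, frequencies, unexpected) {
  Object.keys(frequencies).forEach(function (name) {
    var wroteAnyway = (lineIndex % frequencies[name] !== 0) && (obj[name] !== undefined);
    if (wroteAnyway)
      unexpected[name] = (unexpected[name] || 0) + 1; // count out-of-turn writes per sensor
  });
}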

Last note: as your program is short on memory, perhaps keeping the entire data in memory isn't a good idea. If it's intended to use separate arrays for each sensor for further processing, it might be wiser to write them out to files.
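One way to do that (again only a sketch, still in node.js, with hypothetical file names and a simple CSV layout) is to stream each sensor's pairs to its own file as the log is read, instead of accumulating them in RAM:

var fs = require('fs');

var outputs = {}; // one write stream per sensor, created lazily

// append one [utctime, value] sample to the sensor's own file, e.g. "s3.csv"
function writeSample(sensorName, utctime, value) {
  if (!outputs[sensorName])
    outputs[sensorName] = fs.createWriteStream(sensorName + '.csv');
  outputs[sensorName].write(utctime + ',' + value + '\n');
}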

