javascript - Analysis of a large JSON log file in Node.js
I have the following JSON file:
sensorlogs.json

{"arr":[{"utctime":10000001,"s1":22,"s2":32,"s3":42,"s4":12},
{"utctime":10000002,"s1":23,"s2":33,"s4":13},
{"utctime":10000003,"s1":24,"s2":34,"s3":43,"s4":14},
{"utctime":10000005,"s1":26,"s2":36,"s3":44,"s4":16},
{"utctime":10000006,"s1":27,"s2":37,"s4":17},
{"utctime":10000004,"s1":25,"s2":35,"s4":15},
...
{"utctime":12345678,"s1":57,"s2":35,"s3":77,"s4":99}
]}
Sensors s1, s2, s3, etc. transmit at different frequencies (note that s3 transmits every 2 seconds, and the timestamps can be out of order).
How can I achieve the following output -
analyzing s1: s = [[10000001, 22], [10000002, 23], .. [12345678, 57]]
s1 had 2 missing entries

analyzing s2: s = [[10000001, 32], [10000002, 33], .. [12345678, 35]]
s2 had 0 missing entries

analyzing s3: s = [[10000001, 42], [10000003, 43], .. [12345678, 77]]
s3 had 0 missing entries

analyzing s4: s = [[10000001, 12], [10000003, 13], .. [12345678, 99]]
s4 had 1 missing entries
sensorlogs.json is 16 GB.
The missing entries can be found based on the difference between consecutive UTC timestamps, since each sensor transmits at a known frequency.
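For a single sensor, once its [timestamp, value] pairs are sorted, the missing count falls out of the consecutive timestamp differences. A minimal sketch of that idea (in JavaScript, since this is a Node.js question; the function name and the transmitPeriod parameter are mine, not part of any existing tool):

// Count the missing entries for one sensor from its sorted [utctime, value] pairs.
// transmitPeriod is the sensor's known transmission interval, e.g. 2 for s3.
function countMissing(pairs, transmitPeriod) {
    var missing = 0;
    for (var i = 1; i < pairs.length; i++) {
        var delta = pairs[i][0] - pairs[i - 1][0];
        // every period skipped beyond the expected one is one missing entry
        missing += Math.round(delta / transmitPeriod) - 1;
    }
    return missing;
}

// e.g. countMissing([[10000001, 42], [10000003, 43], [10000007, 44]], 2) === 1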
I cannot use multiple large arrays for the analysis due to memory constraints, so I have to make multiple passes over the same JSON log file and use a single large array for the analysis.
What I have so far is the following -
var fs = require('fs');
var lr = require('readline');
var async = require('async');

var filePath = './sensorlogs.json';
var tmpObj = {};
var arrStrm = [];
var currSensor;
var result = [];

//1. Extract the keys from the log file
console.log("Extracting keys ... \n");
var stream = fs.createReadStream(filePath);
var lineReader = lr.createInterface({ input: stream });

lineReader.on('line', function (line) {
    getKeys(line); // extract the keys from the JSON
});

stream.on('end', function () {
    // obj -> arr
    for (var key in tmpObj)
        arrStrm.push(key);

    //2. Validate the individual sensors
    console.log("Validating sensor data ...\n");

    // serial execution of the sensors in the array, one pass per sensor
    async.eachSeries(arrStrm, function (key, callback) {
        currSensor = key;
        console.log("Validating " + currSensor + "...\n");
        stream = fs.createReadStream(filePath);
        lineReader = lr.createInterface({ input: stream });
        lineReader.on('line', function (line) {
            processLine(line); // build the array for the current sensor
        });
        stream.on('end', function () {
            processSensor(currSensor); // process the data for the current sensor
            callback();
        });
    });
});

function getKeys(line) {
    var pos;
    if (((pos = line.indexOf('[')) >= 0) || ((pos = line.indexOf(']')) >= 0))
        return;
    if (line[line.length - 1] == '\r') line = line.substr(0, line.length - 1); // discard CR (0x0D)
    if (line[line.length - 1] == ',') line = line.substr(0, line.length - 1); // discard ,
    if (line.length > 1) { // ignore empty lines
        var obj = JSON.parse(line); // parse the JSON
        for (var key in obj) {
            if (key != "debug") {
                if (tmpObj[key] == undefined)
                    tmpObj[key] = [];
            }
        }
    }
}
Of course this doesn't work, and I was not able to find anything on the net that explains how this can be implemented.
Note: I can choose a language of my choice to develop this tool (C/C++, C#/Java/Python), but I am going with JavaScript because of its capability of parsing JSON arrays (and my interest in getting better at JS as well). Would you suggest an alternate language if JavaScript isn't the best language to build such a tool?
Edit: Some important info that was either not clear or that I did not include earlier, but looks important to include in the question -
- The data in the JSON logs is not streaming live; it is stored in a JSON file on the hard disk.
- The data is not stored in chronological order, which means the timestamps might not be in the correct order. Each sensor's data needs to be sorted on the timestamps after it has been stored in the array.
- I cannot use separate arrays for each sensor (that would be the same as storing the entire 16 GB JSON in RAM), so to save memory, only one array should be used at a time. And yes, there are more than 4 sensors in the log; this is just a sample (roughly 20, to give an idea).
I have modified the JSON and the expected output accordingly.
One solution might be to make multiple passes over the JSON file, storing one sensor's data with timestamps in the array at a time, then sorting the array and analyzing the data for corruption and gaps. That is what I am trying to do in the code above; a sketch of the whole approach follows.
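A minimal sketch of that multi-pass approach, assuming one record per line (analyzeSensor, analyzeAll and the hard-coded frequency table are illustrative names, not a finished tool):

var fs = require('fs');
var readline = require('readline');

// One full pass over the log for a single sensor: collect, sort, analyze.
function analyzeSensor(filePath, key, transmitPeriod, done) {
    var pairs = []; // the single large array, reused on every pass
    var rl = readline.createInterface({ input: fs.createReadStream(filePath) });
    rl.on('line', function (line) {
        line = line.trim().replace(/,$/, '');          // discard the trailing comma
        if (line[0] !== '{' || line.indexOf('[') >= 0) // skip blank/wrapper lines
            return;
        var obj = JSON.parse(line);
        if (obj[key] !== undefined)
            pairs.push([obj.utctime, obj[key]]);
    });
    rl.on('close', function () {
        pairs.sort(function (a, b) { return a[0] - b[0]; }); // timestamps arrive out of order
        var missing = 0;
        for (var i = 1; i < pairs.length; i++)
            missing += Math.round((pairs[i][0] - pairs[i - 1][0]) / transmitPeriod) - 1;
        console.log(key + " had " + missing + " missing entries");
        done();
    });
}

// Run the passes strictly one after another, so only one array is alive at a time.
function analyzeAll(filePath, sensors) {
    var keys = Object.keys(sensors);
    (function next(i) {
        if (i < keys.length)
            analyzeSensor(filePath, keys[i], sensors[keys[i]], function () { next(i + 1); });
    })(0);
}

analyzeAll('./sensorlogs.json', { s1: 1, s2: 1, s3: 2, s4: 1 });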
So you have a big fat sensor log of 16 GB wrapped in JSON.
To start, parsing the entire 16 GB JSON file as a whole isn't realistic, because the opening and closing brackets break the line-by-line regularity and turn into annoying characters in the array. We know the file has a beginning and an end, and moreover, without them the program can work on chunks of the file, or on a stream plugged directly into the device. So let's assume what we are processing is:
{"utctime":10000001,"s1":22,"s2":32,"s3":42,"s4":12}, {"utctime":10000002,"s1":23,"s2":33,"s4":13}, {"utctime":10000003,"s1":24,"s2":34,"s3":43,"s4":14}, ... {"utctime":12345678,"s1":57,"s2":35,"s3":77,"s4":99},
Adding or detecting the missing comma at the end shouldn't be difficult.
Now every line is formatted the same way and can be interpreted as JSON. The problem is: did the sensors output their data when expected? If we are sure they speak at the right time and at the right frequency but sometimes miss a write (case 1), all is well. If they start to make slight slips on the time frame (case 2), some kind of heuristic to recover the proper line-to-line frequency is needed, and the analysis takes longer.
If we are not processing in realtime, a first and easy validation pass over the file would tell whether on every freq-th line the expected sensor's data is found, right?
In any case, since it's a big file, it has to be processed line by line whenever possible.
In the following program, I considered only case 1, and that we are processing a continuous stream.
#!/usr/bin/python

import json

sensors = {}
sensors['s1'] = [1]  # frequencies
sensors['s2'] = [1]
sensors['s3'] = [2]
sensors['s4'] = [1]

# append a data array and an error counter at sensors[k],
# so each entry holds [freq, err, data]
for k, v in sensors.iteritems():
    sensors[k].extend([0, []])
frq = 0; err = 1; dat = 2

print list(sorted(sensors.items()))
s = list(sorted(sensors.keys()))

with open('./sensors.json', "r") as stream:
    i = 0
    for line in stream:
        if not line.rstrip():
            continue                  # skip blank lines
        j = json.loads(line[:-2])     # skip the comma and \n
        t = j["utctime"]
        for k in s:
            sensor = sensors[k]
            if i % sensor[frq] == 0:  # every nth iteration
                v = j.get(k)
                if v is None:
                    sensor[err] += 1
                    print k, "has", sensor[err], "missing entries"
                sensor[dat].append([t, v])  # append the sensor data
                                            # filling up memory...
        i += 1

for k, v in sorted(sensors.iteritems()):
    print k, sensors[k][dat]
for k, v in sorted(sensors.iteritems()):
    print k, 'had', sensors[k][err], "missing entries"
To handle case 2, you would invert the None check with the modulus check, to verify whether a sensor wrote when it wasn't supposed to, and try to detect shifts.
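In Node.js terms, an inverted check of that kind might look like the following sketch (checkLine, freqs and offSchedule are made-up names; this keeps the case-1 premise that line i should carry sensor k exactly when i % freq === 0):

// Flag sensors that wrote when they were NOT expected to: a hint of a
// time-frame slip (case 2) rather than a plain missed write (case 1).
function checkLine(obj, lineIndex, freqs, offSchedule) {
    Object.keys(freqs).forEach(function (k) {
        var expected = (lineIndex % freqs[k] === 0);
        var present = (obj[k] !== undefined);
        if (present && !expected)
            offSchedule[k] = (offSchedule[k] || 0) + 1;
    });
}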
Last note: this program will run short on memory, so perhaps keeping the entire data in memory isn't a good idea. If the intent is to use separate arrays for each sensor for further processing, it might be wiser to write them to files.
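A minimal sketch of that file-per-sensor variant (Node.js; the file naming is made up):

var fs = require('fs');

// Stream each sensor's [utctime, value] pairs to its own file instead of RAM.
var outs = {};
function writePair(key, utctime, value) {
    if (!outs[key])
        outs[key] = fs.createWriteStream('./' + key + '.tsv');
    outs[key].write(utctime + '\t' + value + '\n');
}
function closeAll() {
    Object.keys(outs).forEach(function (k) { outs[k].end(); });
}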