scala - Find aggregated sum of repetition in Spark


Input:

name1     name2
arjun     deshwal
nikhil    choubey
anshul    pandyal
arjun     deshwal
arjun     deshwal
deshwal   arjun

Code used in Scala:

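import org.apache.spark.sql.functions.{count, lit}

// Load the CSV with a header row, then count rows per (name1, name2) pair.
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load(file_path)
val result = df.groupBy("name1", "name2")
  .agg(count(lit(1)).alias("cnt"))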

Output I am getting:

nikhil    choubey   1
anshul    pandyal   1
deshwal   arjun     1
arjun     deshwal   3

Required output:

nikhil    choubey   1
anshul    pandyal   1
deshwal   arjun     4

or

nikhil    choubey   1
anshul    pandyal   1
arjun     deshwal   4

I would approach this using a Set, which has no order and therefore compares only on the content of the set:

scala> val data = Array(
     |     ("arjun",   "deshwal"),
     |     ("nikhil",  "choubey"),
     |     ("anshul",  "pandyal"),
     |     ("arjun",   "deshwal"),
     |     ("arjun",   "deshwal"),
     |     ("deshwal", "arjun")
     | )
data: Array[(String, String)] = Array((arjun,deshwal), (nikhil,choubey), (anshul,pandyal), (arjun,deshwal), (arjun,deshwal), (deshwal,arjun))

scala> val distdata = sc.parallelize(data)
distdata: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:29

scala> val distdatasets = distdata.map(tup => (Set(tup._1, tup._2), 1)).countByKey()
distdatasets: scala.collection.Map[scala.collection.immutable.Set[String],Long] = Map(Set(nikhil, choubey) -> 1, Set(arjun, deshwal) -> 4, Set(anshul, pandyal) -> 1)
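The result above is keyed by Set, so the two-column shape of the required output is lost. As a rough sketch of one way to get it back, the same Set key can be used for grouping while keeping the first pair seen in each group as its representative row. This version uses plain Scala collections rather than Spark so it runs on its own; the names UnorderedPairCounts and countPairs are made up for illustration.

```scala
object UnorderedPairCounts {
  // Group pairs by their unordered content (a Set ignores order),
  // then report the first-seen orientation of each pair with its count.
  def countPairs(data: Seq[(String, String)]): Seq[((String, String), Int)] =
    data
      .groupBy { case (a, b) => Set(a, b) } // (a, b) and (b, a) share a key
      .values
      .map(rows => (rows.head, rows.size))  // rows keep their input order
      .toSeq

  def main(args: Array[String]): Unit = {
    val data = Seq(
      ("arjun",   "deshwal"),
      ("nikhil",  "choubey"),
      ("anshul",  "pandyal"),
      ("arjun",   "deshwal"),
      ("arjun",   "deshwal"),
      ("deshwal", "arjun")
    )
    // Row order is not guaranteed, because the grouping goes through a Map.
    countPairs(data).foreach { case ((a, b), cnt) => println(s"$a $b $cnt") }
  }
}
```

With the sample input, the arjun/deshwal group has four rows and is reported as (arjun, deshwal), since that orientation appears first.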

Hope this helps.

