scala - Find aggregated sum of repetitions in Spark -
Input:
name1 name2
arjun deshwal
nikhil choubey
anshul pandyal
arjun deshwal
arjun deshwal
deshwal arjun
Code used in Scala:

val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load(file_path)

val result = df.groupBy("name1", "name2")
  .agg(count(lit(1)).alias("cnt"))
Output I am getting:
nikhil choubey 1
anshul pandyal 1
deshwal arjun 1
arjun deshwal 3
Required output:
nikhil choubey 1
anshul pandyal 1
deshwal arjun 4
or
nikhil choubey 1
anshul pandyal 1
arjun deshwal 4
My approach uses a Set as the key: a Set does not preserve insertion order, so pairs are compared on their contents rather than on the order of the names:
scala> val data = Array(
     |   ("arjun", "deshwal"),
     |   ("nikhil", "choubey"),
     |   ("anshul", "pandyal"),
     |   ("arjun", "deshwal"),
     |   ("arjun", "deshwal"),
     |   ("deshwal", "arjun")
     | )
data: Array[(String, String)] = Array((arjun,deshwal), (nikhil,choubey), (anshul,pandyal), (arjun,deshwal), (arjun,deshwal), (deshwal,arjun))

scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:29

scala> val distDataSets = distData.map(tup => (Set(tup._1, tup._2), 1)).countByKey()
distDataSets: scala.collection.Map[scala.collection.immutable.Set[String],Long] = Map(Set(nikhil, choubey) -> 1, Set(arjun, deshwal) -> 4, Set(anshul, pandyal) -> 1)
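The Set trick can be checked in plain Scala without Spark: a two-element Set compares equal regardless of insertion order, so it works as an order-insensitive grouping key. A small local sketch of the same counting logic:

```scala
// Local (non-Spark) version of the Set-keyed count:
// Set("a", "b") == Set("b", "a"), so swapped name pairs
// collapse into the same group.
val pairs = Seq(
  ("arjun", "deshwal"),
  ("nikhil", "choubey"),
  ("anshul", "pandyal"),
  ("arjun", "deshwal"),
  ("arjun", "deshwal"),
  ("deshwal", "arjun")
)

val counts: Map[Set[String], Int] =
  pairs.groupBy { case (a, b) => Set(a, b) }
       .map { case (key, group) => key -> group.size }

// counts(Set("arjun", "deshwal")) is 4
```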
Hope this helps.
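If you would rather stay in the DataFrame API instead of dropping to RDDs, the same order-insensitive key can be built by sorting the two name columns before grouping. A minimal sketch, assuming `df` is the DataFrame from the question and using Spark's built-in least and greatest column functions (untested here, since it needs a running Spark session):

```scala
import org.apache.spark.sql.functions.{least, greatest, count, lit}

// Put the lexicographically smaller name first so that
// ("arjun", "deshwal") and ("deshwal", "arjun") fall into
// the same group before counting.
val result = df
  .groupBy(
    least(df("name1"), df("name2")).alias("n1"),
    greatest(df("name1"), df("name2")).alias("n2"))
  .agg(count(lit(1)).alias("cnt"))
```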