Apache Spark - Need term frequency in a new column of a DataFrame (Scala)
I have a DataFrame:
+----------+--------+---------+-------------+----+
|article_id|     sen|    token|          ner| pos|
+----------+--------+---------+-------------+----+
|         1|example1|standford| organisation| nnp|
|         1|example1|       is|            o|  vp|
|         1|example1|       is|     location| adp|
|         2|example2|standford|organisation2|nnp2|
|         2|example2|       is|           o2| vp2|
|         2|example2|     good|    location2|adp2|
+----------+--------+---------+-------------+----+
I need a new column called "term_frequency" that, for article_id 1, gives me:
- 2 next to "is"
- 1 next to "standford"
because I need to map each token to the number of times it occurs within its article_id.
I guess it would be something like:
df2.withColumn("termfrequency", 'token.map(s => (s, 1).reduceByKey(_ + _)))
or creating a new UDF.
The DataFrame schema is as follows:
root
 |-- article_id: long (nullable = true)
 |-- sen: string (nullable = true)
 |-- token: string (nullable = true)
 |-- ner: string (nullable = true)
 |-- pos: string (nullable = true)
You can get the results by grouping on the two columns, article_id and token (shown here in PySpark):
>>> from pyspark.sql import Row
>>> lines = sc.textFile("data.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> articles = parts.map(lambda p: Row(article_id=int(p[0]), sen=p[1], token=p[2], ner=p[3], pos=p[4]))
>>> df = articles.toDF()
>>> df.count()
6
>>> df.groupBy("article_id", "token").count().show()
+----------+---------+-----+
|article_id|    token|count|
+----------+---------+-----+
|         1|standford|    1|
|         1|       is|    2|
|         2|standford|    1|
|         2|       is|    1|
|         2|     good|    1|
+----------+---------+-----+
You can then register the grouped result as a new table, register the original DataFrame as a table as well, and join the two to get a single table again:
>>> freq = df.groupBy("article_id", "token").count()
>>> sqlContext.registerDataFrameAsTable(df, "terms")
>>> sqlContext.registerDataFrameAsTable(freq, "freq")
>>> sqlContext.sql("select * from terms join freq on terms.article_id = freq.article_id and terms.token = freq.token").show()
+----------+-------------+----+--------+---------+----------+---------+-----+
|article_id|          ner| pos|     sen|    token|article_id|    token|count|
+----------+-------------+----+--------+---------+----------+---------+-----+
|         1| organisation| nnp|example1|standford|         1|standford|    1|
|         1|            o|  vp|example1|       is|         1|       is|    2|
|         1|     location| adp|example1|       is|         1|       is|    2|
|         2|organisation2|nnp2|example2|standford|         2|standford|    1|
|         2|           o2| vp2|example2|       is|         2|       is|    1|
|         2|    location2|adp2|example2|     good|         2|     good|    1|
+----------+-------------+----+--------+---------+----------+---------+-----+
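If you want to sanity-check the expected counts without a Spark session, the same group-then-join logic can be modelled in plain Python. This is only a rough sketch of the semantics, not Spark code: the rows are copied from the question, and the helper name add_term_frequency is made up for illustration.

```python
from collections import Counter

# Sample rows copied from the question: (article_id, sen, token, ner, pos)
rows = [
    (1, "example1", "standford", "organisation", "nnp"),
    (1, "example1", "is", "o", "vp"),
    (1, "example1", "is", "location", "adp"),
    (2, "example2", "standford", "organisation2", "nnp2"),
    (2, "example2", "is", "o2", "vp2"),
    (2, "example2", "good", "location2", "adp2"),
]


def add_term_frequency(rows):
    """Hypothetical helper: append a per-article token count to every row."""
    # Step 1: count occurrences of each (article_id, token) pair,
    # which plays the role of the grouped count on the two columns.
    freq = Counter((article_id, token) for article_id, _, token, _, _ in rows)
    # Step 2: look the count up again for every original row,
    # which plays the role of the join on article_id and token.
    return [row + (freq[(row[0], row[2])],) for row in rows]


with_tf = add_term_frequency(rows)
for row in with_tf:
    print(row)
```

Each printed row carries its term frequency as the last element, so both "is" rows of article 1 end in 2 and every other row ends in 1, matching the count column of the joined table.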
Hope this helps!