Apache Spark - Need term frequency in a new column of a DataFrame (Scala)


I have a DataFrame:

+----------+--------+---------+-------------+----+
|article_id|     sen|    token|          ner| pos|
+----------+--------+---------+-------------+----+
|         1|example1|standford| organisation| nnp|
|         1|example1|       is|            o|  vp|
|         1|example1|       is|     location| adp|
|         2|example2|standford|organisation2|nnp2|
|         2|example2|       is|           o2| vp2|
|         2|example2|     good|    location2|adp2|
+----------+--------+---------+-------------+----+

I need a new column called "term_frequency" that gives me:

  • 2 next to "is"
  • and 1 next to "standford"

    since I need to map each token to the number of times it occurs within its article_id.

I guess something like:

df2.withColumn("termFrequency", 'token.map(s => (s,1).reduceByKey(_ + _))) or by creating a new UDF.

The DataFrame schema is as follows:

root
 |-- article_id: long (nullable = true)
 |-- sen: string (nullable = true)
 |-- token: string (nullable = true)
 |-- ner: string (nullable = true)
 |-- pos: string (nullable = true)

You can get the counts by grouping on two columns, article_id and token:

>>> from pyspark.sql import Row
>>> lines = sc.textFile("data.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> articles = parts.map(lambda p: Row(article_id=int(p[0]), sen=p[1], token=p[2], ner=p[3], pos=p[4]))
>>> df = articles.toDF()
>>> df.count()
6
>>> # count how often each token occurs within each article
>>> freq = df.groupBy("article_id", "token").count()
>>> freq.show()
+----------+---------+-----+
|article_id|    token|count|
+----------+---------+-----+
|         1|standford|    1|
|         1|       is|    2|
|         2|standford|    1|
|         2|       is|    1|
|         2|     good|    1|
+----------+---------+-----+

You can then register the counts as a new table, register the original DataFrame as a table too, and join the two on article_id and token to get a single table again (freq here is the grouped count DataFrame from the previous step):

>>> sqlContext.registerDataFrameAsTable(df, "terms")
>>> sqlContext.registerDataFrameAsTable(freq, "freq")
>>> sqlContext.sql("select * from terms join freq on terms.article_id = freq.article_id and terms.token = freq.token").show()
+----------+-------------+----+--------+---------+----------+---------+-----+
|article_id|          ner| pos|     sen|    token|article_id|    token|count|
+----------+-------------+----+--------+---------+----------+---------+-----+
|         1| organisation| nnp|example1|standford|         1|standford|    1|
|         1|            o|  vp|example1|       is|         1|       is|    2|
|         1|     location| adp|example1|       is|         1|       is|    2|
|         2|organisation2|nnp2|example2|standford|         2|standford|    1|
|         2|           o2| vp2|example2|       is|         2|       is|    1|
|         2|    location2|adp2|example2|     good|         2|     good|    1|
+----------+-------------+----+--------+---------+----------+---------+-----+

Hope this helps!
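
As a sanity check on the logic, the two steps used in the answer (count rows per (article_id, token), then join the counts back onto the original rows) can be sketched in plain Python without Spark. This is only an illustration under assumptions: the function name add_term_frequency and the dict-based rows are made up for this example and are not part of any Spark API, and only the two relevant columns from the question's data are included.

```python
from collections import Counter


def add_term_frequency(rows):
    """Return copies of the rows with a term_frequency field added.

    term_frequency is the number of rows sharing the same
    (article_id, token) pair. Hypothetical helper, not a Spark function.
    """
    # Step 1: count rows per (article_id, token), like groupBy(...).count().
    counts = Counter((r["article_id"], r["token"]) for r in rows)
    # Step 2: attach the matching count to every original row, like the join.
    return [
        dict(r, term_frequency=counts[(r["article_id"], r["token"])])
        for r in rows
    ]


# The article_id and token columns from the DataFrame in the question.
rows = [
    {"article_id": 1, "token": "standford"},
    {"article_id": 1, "token": "is"},
    {"article_id": 1, "token": "is"},
    {"article_id": 2, "token": "standford"},
    {"article_id": 2, "token": "is"},
    {"article_id": 2, "token": "good"},
]

for row in add_term_frequency(rows):
    print(row)
```

Each input row is kept, so "is" in article 1 appears twice, both times with the same count, which matches the shape of the joined table above.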

