Compare Rows To Make Noun Chunk in PySpark

I have a Spark dataframe where each row is the token from a sentence and includes its part of speech. I am trying to find the best way to compare one row to the next in order to create the longest noun chunk.
+------+-----------+---------------------------+--------+-------+-------+-----+
|REV_ID|    SENT_ID|                   SENTENCE|TOKEN_ID|  TOKEN|  LEMMA|  POS|
+------+-----------+---------------------------+--------+-------+-------+-----+
|     1|          1|Ice hockey game took hours.|       1|    Ice|    ice| NOUN|
|     1|          1|Ice hockey game took hours.|       2| hockey| hockey| NOUN|
|     1|          1|Ice hockey game took hours.|       3|   game|   game| NOUN|
|     1|          1|Ice hockey game took hours.|       4|   took|   take| VERB|
|     1|          1|Ice hockey game took hours.|       5|  hours|   hour| NOUN|
+------+-----------+---------------------------+--------+-------+-------+-----+
I know a for loop isn't efficient, but I'm not sure how else to get my intended results, shown below:
+------+-----------+---------------------------+--------+-------+-------+-----+----------------+
|REV_ID|    SENT_ID|                   SENTENCE|TOKEN_ID|  TOKEN|  LEMMA|  POS|      NOUN_CHUNK|
+------+-----------+---------------------------+--------+-------+-------+-----+----------------+
|     1|          1|Ice hockey game took hours.|       1|    Ice|    ice| NOUN| ice hockey game|
|     1|          1|Ice hockey game took hours.|       2| hockey| hockey| NOUN| ice hockey game|
|     1|          1|Ice hockey game took hours.|       3|   game|   game| NOUN| ice hockey game|
|     1|          1|Ice hockey game took hours.|       4|   took|   take| VERB|            NULL|
|     1|          1|Ice hockey game took hours.|       5|  hours|   hour| NOUN|            hour|
+------+-----------+---------------------------+--------+-------+-------+-----+----------------+

Try this with window functions: a running count of non-NOUN tokens gives each consecutive run of NOUN rows its own group id, and collecting the lemmas within each group builds the chunk.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("SENT_ID").orderBy("TOKEN_ID")   # running window per sentence
w1 = Window.partitionBy("SENT_ID", "list")              # one partition per noun group

(df
 # Running count of non-NOUN tokens: consecutive NOUN rows share the same value,
 # so it serves as a group id; the id is then nulled out for non-NOUN rows.
 .withColumn("list", F.sum(F.when(F.col("POS") == "NOUN", F.lit(0)).otherwise(F.lit(1))).over(w))
 .withColumn("list", F.expr("IF(POS != 'NOUN', null, list)"))
 # Join the lemmas of each group into the chunk; non-NOUN rows stay null.
 .withColumn("NOUN_CHUNK",
             F.when(F.col("list").isNotNull(),
                    F.array_join(F.collect_list("LEMMA").over(w1), " "))
              .otherwise(F.lit(None)))
 .drop("list").orderBy("SENT_ID", "TOKEN_ID").show())
#+------+-------+--------------------+--------+------+------+----+---------------+
#|REV_ID|SENT_ID|            SENTENCE|TOKEN_ID| TOKEN| LEMMA| POS|     NOUN_CHUNK|
#+------+-------+--------------------+--------+------+------+----+---------------+
#|     1|      1|Ice hockey game t...|       1|   Ice|   ice|NOUN|ice hockey game|
#|     1|      1|Ice hockey game t...|       2|hockey|hockey|NOUN|ice hockey game|
#|     1|      1|Ice hockey game t...|       3|  game|  game|NOUN|ice hockey game|
#|     1|      1|Ice hockey game t...|       4|  took|  take|VERB|           null|
#|     1|      1|Ice hockey game t...|       5| hours|  hour|NOUN|           hour|
#+------+-------+--------------------+--------+------+------+----+---------------+
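To reproduce this end to end, here is a minimal sketch that recreates the sample DataFrame from the question (values copied from the table above; it assumes an active SparkSession named spark):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows copied from the question's table.
sentence = "Ice hockey game took hours."
rows = [
    (1, 1, sentence, 1, "Ice", "ice", "NOUN"),
    (1, 1, sentence, 2, "hockey", "hockey", "NOUN"),
    (1, 1, sentence, 3, "game", "game", "NOUN"),
    (1, 1, sentence, 4, "took", "take", "VERB"),
    (1, 1, sentence, 5, "hours", "hour", "NOUN"),
]
df = spark.createDataFrame(
    rows, ["REV_ID", "SENT_ID", "SENTENCE", "TOKEN_ID", "TOKEN", "LEMMA", "POS"]
)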

Related

Pyspark combine different rows based on a column

I have a dataframe
+----------------+------------+-----+
| Sport|Total_medals|count|
+----------------+------------+-----+
| Alpine Skiing| 3| 4|
| Alpine Skiing| 2| 18|
| Alpine Skiing| 4| 1|
| Alpine Skiing| 1| 38|
| Archery| 2| 12|
| Archery| 1| 72|
| Athletics| 2| 50|
| Athletics| 1| 629|
| Athletics| 3| 8|
| Badminton| 2| 5|
| Badminton| 1| 86|
| Baseball| 1| 216|
| Basketball| 1| 287|
|Beach Volleyball| 1| 48|
| Biathlon| 4| 1|
| Biathlon| 3| 9|
| Biathlon| 1| 61|
| Biathlon| 2| 23|
| Bobsleigh| 2| 6|
| Bobsleigh| 1| 60|
+----------------+------------+-----+
Is there a way for me to combine the value of counts from multiple rows if they are from the same sport?
For example, if Sport = Alpine Skiing I would have something like this:
+----------------+-----+
| Sport|count|
+----------------+-----+
| Alpine Skiing| 61|
+----------------+-----+
where count is equal to 4 + 18 + 1 + 38 = 61. I would like to do this for all sports; any help would be appreciated.
You need to groupby on the Sport column and then aggregate the count column with the sum() function.
Example:
import pyspark.sql.functions as F
grouped_df = df.groupby('Sport').agg(F.sum('count'))
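Applied to the sample above, Alpine Skiing comes out as 4 + 18 + 1 + 38 = 61. A slightly fuller sketch (the alias and the filter are optional, shown only for illustration):
import pyspark.sql.functions as F

# Group by Sport and sum the count column; the alias keeps the column name tidy.
grouped_df = df.groupby('Sport').agg(F.sum('count').alias('count'))

# Sanity check on one sport: Alpine Skiing should give 4 + 18 + 1 + 38 = 61.
grouped_df.filter(F.col('Sport') == 'Alpine Skiing').show()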

Taking sum in spark-scala based on a condition

I have a data frame like this. How can I take the sum of the column Sales where the Rank is greater than 3, per 'M'?
+---+-----+----+
| M|Sales|Rank|
+---+-----+----+
| M1| 200| 1|
| M1| 175| 2|
| M1| 150| 3|
| M1| 125| 4|
| M1| 90| 5|
| M1| 85| 6|
| M2| 1001| 1|
| M2| 500| 2|
| M2| 456| 3|
| M2| 345| 4|
| M2| 231| 5|
| M2| 123| 6|
+---+-----+----+
Expected Output --
+---+-----+----+---------------+
| M|Sales|Rank|SumGreaterThan3|
+---+-----+----+---------------+
| M1| 200| 1| 300|
| M1| 175| 2| 300|
| M1| 150| 3| 300|
| M1| 125| 4| 300|
| M1| 90| 5| 300|
| M1| 85| 6| 300|
| M2| 1001| 1| 699|
| M2| 500| 2| 699|
| M2| 456| 3| 699|
| M2| 345| 4| 699|
| M2| 231| 5| 699|
| M2| 123| 6| 699|
+---+-----+----+---------------+
I have tried a windowed sum like this, but it gives the total sum of Sales per M:
df.withColumn("SumGreaterThan3", sum("Sales").over(Window.partitionBy(col("M"))))
To replicate the same DF:
val df = Seq(
("M1",200,1),
("M1",175,2),
("M1",150,3),
("M1",125,4),
("M1",90,5),
("M1",85,6),
("M2",1001,1),
("M2",500,2),
("M2",456,3),
("M2",345,4),
("M2",231,5),
("M2",123,6)
).toDF("M","Sales","Rank")
The partition alone is enough to define the window; you just have to make the summation conditional by combining sum and when.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("M")
df.withColumn("SumGreaterThan3", sum(when('Rank > 3, 'Sales).otherwise(0)).over(w).alias("sum")).show
This gives you the expected results.
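Since the rest of the thread is in PySpark, here is a rough PySpark equivalent of the same idea (assuming a DataFrame df with the M, Sales and Rank columns shown above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("M")

# Sum Sales only where Rank > 3; other rows contribute 0, and every row in the
# partition gets the same windowed total.
df.withColumn(
    "SumGreaterThan3",
    F.sum(F.when(F.col("Rank") > 3, F.col("Sales")).otherwise(0)).over(w),
).show()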

Apache Spark visualization

I'm new to Apache Spark and trying to learn visualization in Apache Spark/Databricks at the moment. If I have the following CSV datasets:
Patient.csv
+---+---------+------+---+-----------------+-----------+------------+-------------+
| Id|Post_Code|Height|Age|Health_Cover_Type|Temperature|Disease_Type|Infected_Date|
+---+---------+------+---+-----------------+-----------+------------+-------------+
| 1| 2096| 131| 22| 5| 37| 4| 891717742|
| 2| 2090| 136| 18| 5| 36| 1| 881250949|
| 3| 2004| 120| 9| 2| 36| 2| 878887136|
| 4| 2185| 155| 41| 1| 36| 1| 896029926|
| 5| 2195| 145| 25| 5| 37| 1| 887100886|
| 6| 2079| 172| 52| 2| 37| 5| 871205766|
| 7| 2006| 176| 27| 1| 37| 3| 879487476|
| 8| 2605| 129| 15| 5| 36| 1| 876343336|
| 9| 2017| 145| 19| 5| 37| 4| 897281846|
| 10| 2112| 171| 47| 5| 38| 6| 882539696|
| 11| 2112| 102| 8| 5| 36| 5| 873648586|
| 12| 2086| 151| 11| 1| 35| 1| 894724066|
| 13| 2142| 148| 22| 2| 37| 1| 889446276|
| 14| 2009| 158| 57| 5| 38| 2| 887072826|
| 15| 2103| 167| 34| 1| 37| 3| 892094506|
| 16| 2095| 168| 37| 5| 36| 1| 893400966|
| 17| 2010| 156| 20| 3| 38| 5| 897313586|
| 18| 2117| 143| 17| 5| 36| 2| 875238076|
| 19| 2204| 155| 24| 4| 38| 6| 884159506|
| 20| 2103| 138| 15| 5| 37| 4| 886765356|
+---+---------+------+---+-----------------+-----------+------------+-------------+
And coverType.csv
+--------------+-----------------+
|cover_type_key| cover_type_label|
+--------------+-----------------+
| 1| Single|
| 2| Couple|
| 3| Family|
| 4| Concession|
| 5| Disable|
+--------------+-----------------+
Which I've managed to load as DataFrames (PatientDF and coverTypeDF):
val PatientDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/Patient.csv")
.load()
val coverTypeDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/covertype.csv")
.load()
How do I generate a bar chart visualization to show the distribution of different Disease_Type values in my dataset?
How do I generate a bar chart visualization to show the average Post_Code of each cover type, with string labels for the cover type?
How do I extract the year (YYYY) from Infected_Date (stored as unix seconds since 1/1/1970 UTC), ordering the result in descending order of the year and average age?
To display charts natively with Databricks, you need to use the display function on a DataFrame. For number one, we can accomplish what you'd like by aggregating the DataFrame on disease type.
display(PatientDF.groupBy("Disease_Type").count())
Then you can use the charting options to build a bar chart. You can do the same for your 2nd question, but instead of .count() use .avg("Post_Code"), joining in the cover-type labels first (see the sketch below).
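For the 2nd question, the string labels come from joining coverTypeDF onto PatientDF before aggregating. A sketch, assuming both DataFrames are accessible from a Python cell as in the snippets here:
from pyspark.sql.functions import avg

# Attach the human-readable cover-type label, then average Post_Code per label.
labelled = PatientDF.join(
    coverTypeDF,
    PatientDF["Health_Cover_Type"] == coverTypeDF["cover_type_key"],
    "left",
)
display(labelled.groupBy("cover_type_label").agg(avg("Post_Code")))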
For the third question you need the year function, after converting the unix seconds in Infected_Date to a timestamp, plus an orderBy.
from pyspark.sql.functions import *
display(PatientDF.select(year(to_timestamp("Infected_Date")).alias("year")).orderBy("year"))
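And a sketch that carries the third question through the grouping and ordering as well (average Age per year, years and averages sorted in descending order; the cast treats the unix seconds in Infected_Date as a timestamp):
from pyspark.sql.functions import avg, col, year

display(
    PatientDF
    .withColumn("year", year(col("Infected_Date").cast("timestamp")))
    .groupBy("year")
    .agg(avg("Age").alias("avg_age"))
    .orderBy(col("year").desc(), col("avg_age").desc())
)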

Find continuous data in pyspark dataframe

I have a dataframe that looks like
key | value | time | status
x | 10 | 0 | running
x | 15 | 1 | running
x | 30 | 2 | running
x | 15 | 3 | running
x | 0 | 4 | stop
x | 40 | 5 | running
x | 10 | 6 | running
y | 10 | 0 | running
y | 15 | 1 | running
y | 30 | 2 | running
y | 15 | 3 | running
y | 0 | 4 | stop
y | 40 | 5 | running
y | 10 | 6 | running
...
I want to end up with a table that looks like
key | start | end | status | max value
x | 0 | 3 | running| 30
x | 4 | 4 | stop | 0
x | 5 | 6 | running| 40
y | 0 | 3 | running| 30
y | 4 | 4 | stop | 0
y | 5 | 6 | running| 40
...
In other words, I want to partition by key, sort by time, split the rows into windows that share the same status, keep the first and last time of each window, and compute an aggregate over it (e.g. the max of value).
Using pyspark ideally.
Here is one approach you can take.
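To make the steps below runnable, the sample data from the question can be recreated with something like this (a sketch; assumes an active SparkSession named spark):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows copied from the question's example.
data = [
    ("x", 10, 0, "running"), ("x", 15, 1, "running"), ("x", 30, 2, "running"),
    ("x", 15, 3, "running"), ("x", 0, 4, "stop"), ("x", 40, 5, "running"),
    ("x", 10, 6, "running"),
    ("y", 10, 0, "running"), ("y", 15, 1, "running"), ("y", 30, 2, "running"),
    ("y", 15, 3, "running"), ("y", 0, 4, "stop"), ("y", 40, 5, "running"),
    ("y", 10, 6, "running"),
]
df = spark.createDataFrame(data, ["key", "value", "time", "status"])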
First create a column to determine if the status has changed for a given key:
from pyspark.sql.functions import col, lag
from pyspark.sql import Window
w = Window.partitionBy("key").orderBy("time")
df = df.withColumn(
    "status_change",
    (col("status") != lag("status").over(w)).cast("int")
)
df.show()
#+---+-----+----+-------+-------------+
#|key|value|time| status|status_change|
#+---+-----+----+-------+-------------+
#| x| 10| 0|running| null|
#| x| 15| 1|running| 0|
#| x| 30| 2|running| 0|
#| x| 15| 3|running| 0|
#| x| 0| 4| stop| 1|
#| x| 40| 5|running| 1|
#| x| 10| 6|running| 0|
#| y| 10| 0|running| null|
#| y| 15| 1|running| 0|
#| y| 30| 2|running| 0|
#| y| 15| 3|running| 0|
#| y| 0| 4| stop| 1|
#| y| 40| 5|running| 1|
#| y| 10| 6|running| 0|
#+---+-----+----+-------+-------------+
Next fill the nulls with 0 and take the cumulative sum of the status_change column, per key:
from pyspark.sql.functions import sum as sum_ # avoid shadowing builtin
df = df.fillna(0).withColumn(
    "status_group",
    sum_("status_change").over(w)
)
df.show()
#+---+-----+----+-------+-------------+------------+
#|key|value|time| status|status_change|status_group|
#+---+-----+----+-------+-------------+------------+
#| x| 10| 0|running| 0| 0|
#| x| 15| 1|running| 0| 0|
#| x| 30| 2|running| 0| 0|
#| x| 15| 3|running| 0| 0|
#| x| 0| 4| stop| 1| 1|
#| x| 40| 5|running| 1| 2|
#| x| 10| 6|running| 0| 2|
#| y| 10| 0|running| 0| 0|
#| y| 15| 1|running| 0| 0|
#| y| 30| 2|running| 0| 0|
#| y| 15| 3|running| 0| 0|
#| y| 0| 4| stop| 1| 1|
#| y| 40| 5|running| 1| 2|
#| y| 10| 6|running| 0| 2|
#+---+-----+----+-------+-------------+------------+
Now you can aggregate over the key and status_group. You can also include status in the groupBy since it will be the same for each status_group. Finally select only the columns you want in your output.
from pyspark.sql.functions import min as min_, max as max_
df_agg = df.groupBy("key", "status", "status_group")\
    .agg(
        min_("time").alias("start"),
        max_("time").alias("end"),
        max_("value").alias("max_value")
    )\
    .select("key", "start", "end", "status", "max_value")\
    .sort("key", "start")
df_agg.show()
#+---+-----+---+-------+---------+
#|key|start|end| status|max_value|
#+---+-----+---+-------+---------+
#| x| 0| 3|running| 30|
#| x| 4| 4| stop| 0|
#| x| 5| 6|running| 40|
#| y| 0| 3|running| 30|
#| y| 4| 4| stop| 0|
#| y| 5| 6|running| 40|
#+---+-----+---+-------+---------+

Apache Spark group by combining types and sub types

I have this dataset in spark,
val sales = Seq(
("Warsaw", 2016, "facebook","share",100),
("Warsaw", 2017, "facebook","like",200),
("Boston", 2015,"twitter","share",50),
("Boston", 2016,"facebook","share",150),
("Toronto", 2017,"twitter","like",50)
).toDF("city", "year","media","action","amount")
I can now group this by city and media like this,
val groupByCityAndYear = sales
.groupBy("city", "media")
.count()
groupByCityAndYear.show()
+-------+--------+-----+
| city| media|count|
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| twitter| 1|
|Toronto| twitter| 1|
| Warsaw|facebook| 2|
+-------+--------+-----+
But how can I combine media and action together into one column, so that the expected output is:
+-------+--------+-----+
| Boston|facebook|    1|
| Boston|   share|    2|
| Boston| twitter|    1|
|Toronto| twitter|    1|
|Toronto|    like|    1|
| Warsaw|facebook|    2|
| Warsaw|   share|    1|
| Warsaw|    like|    1|
+-------+--------+-----+
Combine the media and action columns into an array column, explode it, then do a groupBy count:
sales.select(
$"city", explode(array($"media", $"action")).as("mediaAction")
).groupBy("city", "mediaAction").count().show()
+-------+-----------+-----+
| city|mediaAction|count|
+-------+-----------+-----+
| Boston| share| 2|
| Boston| facebook| 1|
| Warsaw| share| 1|
| Boston| twitter| 1|
| Warsaw| like| 1|
|Toronto| twitter| 1|
|Toronto| like| 1|
| Warsaw| facebook| 2|
+-------+-----------+-----+
Or, assuming media and action don't intersect (the two columns have no common elements):
sales.groupBy("city", "media").count().union(
sales.groupBy("city", "action").count()
).show
+-------+--------+-----+
| city| media|count|
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| twitter| 1|
|Toronto| twitter| 1|
| Warsaw|facebook| 2|
| Boston| share| 2|
| Warsaw| share| 1|
| Warsaw| like| 1|
|Toronto| like| 1|
+-------+--------+-----+
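Since the thread is otherwise PySpark-flavoured, the explode-based approach translates directly. A sketch, assuming a PySpark DataFrame sales with the same columns:
from pyspark.sql import functions as F

# Put media and action into one array column, explode it into separate rows,
# then count occurrences per (city, mediaAction) pair.
(sales
 .select("city", F.explode(F.array("media", "action")).alias("mediaAction"))
 .groupBy("city", "mediaAction")
 .count()
 .show())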