Concatenate Dataframe rows based on timestamp value - scala

I have a Dataframe with text messages and a timestamp value for each row.
Like so:
+--------------------------+---------------------+
| message | timestamp |
+--------------------------+---------------------+
| some text from message 1 | 2019-08-03 01:00:00 |
+--------------------------+---------------------+
| some text from message 2 | 2019-08-03 01:01:00 |
+--------------------------+---------------------+
| some text from message 3 | 2019-08-03 01:03:00 |
+--------------------------+---------------------+
I need to concatenate the messages by creating time windows of X minutes, so that, for example, they look like this:
+---------------------------------------------------+
| message |
+---------------------------------------------------+
| some text from message 1 some text from message 2 |
+---------------------------------------------------+
| some text from message 3 |
+---------------------------------------------------+
After doing the concatenation I have no use for the timestamp column so I can drop it or keep it with any value.
I have been able to do this by iterating through the entire Dataframe, adding up timestamp diffs, and inserting into a new Dataframe whenever the time window is reached. It works, but it's ugly, and I am looking for pointers on how to accomplish this in Scala in a more functional/elegant way.
I looked at Window functions, but since I am not doing aggregations it appears I have no way to access the content of the groups once the WindowSpec is created, so I didn't get very far.
I also looked at the lead and lag functions, but I couldn't figure out how to use them without also resorting to a for loop.
Any ideas or pointers on how to accomplish this are appreciated.

You can use the window datetime function (not to be confused with Window functions) to generate time windows, followed by a groupBy to aggregate messages using concat_ws:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("message1", "2019-08-03 01:00:00"),
("message2", "2019-08-03 01:01:00"),
("message3", "2019-08-03 01:03:00")
).toDF("message", "timestamp")
val duration = "2 minutes"
df.
  groupBy(window($"timestamp", duration)).
  agg(concat_ws(" ", collect_list($"message")).as("message")).
  show(false)
// +------------------------------------------+-----------------+
// |window |message |
// +------------------------------------------+-----------------+
// |[2019-08-03 01:00:00, 2019-08-03 01:02:00]|message1 message2|
// |[2019-08-03 01:02:00, 2019-08-03 01:04:00]|message3 |
// +------------------------------------------+-----------------+
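Since you mentioned the timestamp is of no further use afterwards, here is a small follow-up sketch (result is just an arbitrary name here) that drops the generated window column and keeps only the concatenated messages:
// Same aggregation as above, keeping only the message column.
val result = df.
  groupBy(window($"timestamp", duration)).
  agg(concat_ws(" ", collect_list($"message")).as("message")).
  drop("window")

result.show(false)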

Related

pyspark dataframe: add a new indicator column with random sampling

I have a Spark dataframe with the following schema:
StructType(List(StructField(email_address,StringType,true),StructField(subject_line,StringType,true)))
I want to randomly sample 50% of the population into control and test groups. Currently I am doing it the following way:
df_segment_ctl = df_segment.sample(False, 0.5, seed=0)
df_segment_tmt = df_segment.join(df_segment_ctl, ["email_address"], "leftanti")
But I am certain there must be a better way, namely to instead create a column like the following:
+--------------------+----------+---------+
|       email_address|segment_id|group    |
+--------------------+----------+---------+
|xxxxxxxxxx#gmail.com|       1.1|treatment|
|   xxxxxxx#gmail.com|       1.6|control  |
+--------------------+----------+---------+
Any help is appreciated. I am new to this world
UPDATE:
I don't want to split the dataframe in two; I just want to add an indicator column.
UPDATE:
Is it possible to have multiple splits elegantly? Suppose instead of two groups I want a single control group and two treatment groups:
+--------------------+----------+---------+
|       email_address|segment_id|group    |
+--------------------+----------+---------+
|xxxxxxxxxx#gmail.com|       1.1|treat_1  |
|   xxxxxxx#gmail.com|       1.6|control  |
|     xxxxx#gmail.com|       1.6|treat_2  |
+--------------------+----------+---------+
You can split the Spark dataframe using randomSplit, like below:
df_segment_ctl, df_segment_tmt = df_segment.randomSplit(weights=[0.5,0.5], seed=0)

Need some data after by grouping on key in spark/scala

I have a problem in Spark (v2.2.2) / Scala (v2.11.8). I mostly work in the Scala/Spark functional style.
I have a list of persons with a rented_date, like below.
These are CSV files which I will convert into Parquet and read as dataframes.
Table: Person
+-------------------+-----------+
| ID |report_date|
+-------------------+-----------+
| 123| 2011-09-25|
| 111| 2017-08-23|
| 222| 2018-09-30|
| 333| 2020-09-30|
| 444| 2019-09-30|
+-------------------+-----------+
I want to find, by grouping on ID, the start_date of the address for the period during which the person rented it.
Table: Address
+-------------------+----------+----------+
| ID |start_date|close_date|
+-------------------+----------+----------+
| 123|2008-09-23|2009-09-23|
| 123|2009-09-24|2010-09-23|
| 123|2010-09-24|2011-09-23|
| 123|2011-09-30|2012-09-23|
| 123|2012-09-24| null|
| 111|2013-09-23|2014-09-23|
| 111|2014-09-24|2015-09-23|
| 111|2015-09-24|2016-09-23|
| 111|2016-09-24|2017-09-23|
| 111|2017-09-24| null|
| 222|2018-09-24| null|
+-------------------+----------+----------+
Example: for ID 123 the rented_date is 2011-09-20, which in the Address table falls in the period (start_date, close_date) = (2010-09-24, 2011-09-23) (row 3 in Address). From there I have to fetch the start_date 2010-09-24.
I have to do this on the entire dataset by joining the tables, i.e., fetch the start_date from the Address table into the Person table.
I also need to handle rows where close_date is null.
Sometimes the rented date will not fall into any of the periods; in that case we need to take the row where rented_date < close_date.
Apologies if the table formatting does not render properly.
Thanks in advance.
First of all:
"I have a list of persons with a rented_date, like below. These are CSV files which I will convert into Parquet and read as dataframes."
There is no need to convert them; you can read the CSV directly with Spark:
spark.read.csv("path")
// or, equivalently
spark.read.format("csv").load("path")
I am not sure what your expectation for the null fields is, so I would filter them out for now:
val dfAddressNotNull = dfAddress.filter($"close_date".isNotNull) // dfAddress being the Address dataframe read above
Of course, now you need to join them together, and since the data in Address is the relevant one I would do a left join:
val joinedDf = dfAddressNotNull.join(dfPerson, Seq("ID"), "left")
Now you have addresses and persons combined.
If you now filter like this:
joinedDf.filter($"report_date" >= $"start_date" && $"report_date" < $"close_date")
you should get something close to what you want to achieve.
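Putting the pieces together, here is a minimal end-to-end sketch of the above (the CSV paths and the header option are assumptions on my side; adjust them to your files):
import spark.implicits._

// Hypothetical file paths; adjust to your environment.
val dfPerson  = spark.read.option("header", "true").csv("/path/to/person.csv")
val dfAddress = spark.read.option("header", "true").csv("/path/to/address.csv")

// Drop the open-ended periods (close_date is null) for now.
val dfAddressNotNull = dfAddress.filter($"close_date".isNotNull)

// Attach the address periods to each person, then keep only the period
// that contains the person's report_date.
val result = dfAddressNotNull
  .join(dfPerson, Seq("ID"), "left")
  .filter($"report_date" >= $"start_date" && $"report_date" < $"close_date")
  .select($"ID", $"report_date", $"start_date")

result.show(false)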

"Enrich" Spark DataFrame from another DF (or from HBase)

I am not sure this is the right title so feel free to suggest an edit. Btw, I'm really new to Scala and Spark.
Basically, I have a DF df_1 looking like this:
| ID | name | city_id |
| 0 | "abc"| 123 |
| 1 | "cba"| 124 |
...
The city_id is a key in a huge HBase:
123; New York; ....
124; Los Angeles; ....
etc.
The result should be df_1:
| ID | name | city_id |
| 0 | "abc"| New York|
| 1 | "cba"| Los Angeles|
...
My approach was to create an external Hive table on top of HBase with the columns I need. But then again I do not know how to join them in the most efficient manner.
I suppose there is a way to do it dirrectly from HBase, but again I do not know how.
Any hint is appreciated. :)
There is no need to create an intermediate Hive table over HBase. Spark SQL can work with all kinds of external data sources directly; just load the HBase data into a dataframe with an HBase data source.
Once you have the proper HBase dataframe, use the following sample Spark/Scala code to get the joined dataframe:
import spark.implicits._

val df = Seq((0, "abc", 123), (1, "cda", 124), (2, "dsd", 125), (3, "gft", 126), (4, "dty", 127))
  .toDF("ID", "name", "city_id")
val hbaseDF = Seq((123, "New York"), (124, "Los Angeles"), (125, "Chicago"), (126, "Seattle"))
  .toDF("city_id", "city_name")

df.join(hbaseDF, Seq("city_id"), "inner").drop("city_id").show()
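For the HBase side, one option (an assumption on my part; use whichever HBase data source your cluster provides) is the Hortonworks spark-hbase connector (shc), where you describe the table mapping in a catalog and load it as a regular dataframe. The table, column family, and column names below are hypothetical:
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Hypothetical mapping: HBase table "cities" whose row key is the city id and
// whose "info" column family holds the city name.
val catalog =
  """{
    |"table":{"namespace":"default", "name":"cities"},
    |"rowkey":"key",
    |"columns":{
    |"city_id":{"cf":"rowkey", "col":"key", "type":"int"},
    |"city_name":{"cf":"info", "col":"name", "type":"string"}
    |}
    |}""".stripMargin

val hbaseDF = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()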

Is there a better way to go about this process of trimming my spark DataFrame appropriately?

In the following example, I want to take only the x Ids with the highest counts for each query, where x is determined by a variable called howMany.
For the following example, given this Dataframe:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if howMany is 2:
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of Scala altogether and wastes a lot of runtime. Being new, I am unsure about the best way to go about this.
My current method is to first get a distinct list of the query column and create an iterator. Second, I loop through the list using the iterator and trim the dataframe to only the current query using df.select($"eachColumnName"...).where($"query".equalTo(iter.next())). I then .limit(howMany), then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I have an empty dataframe, add each of these results to it one by one, and return this newly created dataframe.
df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
queryDF = middleDF.sort(col("count").desc).limit(howMany).select(col("query"), col("Ids")).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
emptyDF.union(queryDF) // Assuming emptyDF is made
}
emptyDF
I would do this using Window functions to get the rank, then groupBy to aggregate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val howMany = 2
val newDF = df
  .withColumn("rank", row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
  .where($"rank" <= howMany)
  .groupBy($"query")
  .agg(
    collect_list($"Id").as("Ids"),
    max($"count").as("count")
  )
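Since you mention that removing the count column afterwards is trivial, that is indeed just a drop on the result (trimmedDF is an arbitrary name):
// Keep only query and the collected Ids.
val trimmedDF = newDF.drop("count")
trimmedDF.show(false)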

Spark explode multiple columns of row in multiple rows

I have a problem with converting one row with 3 columns into 3 rows.
For example:
ID | String   | colA | colB | colC
1  | sometext | 1    | 2    | 3
I need to convert it into:
ID | String   | resultColumn
1  | sometext | 1
1  | sometext | 2
1  | sometext | 3
I just have a DataFrame that matches the first schema (table):
val df: DataFrame
Note: I can do it using RDDs, but is there another way? Thanks.
Assuming that df has the schema of your first snippet, I would try:
df.select($"ID", $"String", explode(array($"colA", $"colB",$"colC")).as("resultColumn"))
If you further want to keep the column names, you can use a trick that consists in creating a column of arrays, where each inner array contains the value and the column name. First create your expression:
val expr = explode(array(array($"colA", lit("colA")), array($"colB", lit("colB")), array($"colC", lit("colC"))))
then use getItem (since you cannot use a generator inside nested expressions, you need two selects here):
df.select($"ID", $"String", expr.as("tmp"))
  .select($"ID", $"String", $"tmp".getItem(0).as("resultColumn"), $"tmp".getItem(1).as("columnName"))
It is a bit verbose though; there might be a more elegant way to do this.
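For completeness, a minimal self-contained sketch of the first approach, using sample data made up to match the question's table:
import org.apache.spark.sql.functions._
import spark.implicits._

// Sample data matching the question's first table.
val df = Seq((1, "sometext", 1, 2, 3)).toDF("ID", "String", "colA", "colB", "colC")

// explode turns the array of the three column values into one row per value.
df.select($"ID", $"String", explode(array($"colA", $"colB", $"colC")).as("resultColumn"))
  .show()
// Produces, roughly:
// +---+--------+------------+
// | ID|  String|resultColumn|
// +---+--------+------------+
// |  1|sometext|           1|
// |  1|sometext|           2|
// |  1|sometext|           3|
// +---+--------+------------+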