PySpark 2.1:
I've created a dataframe with a timestamp column that I convert to a unix timestamp. However, the column derived from the unix timestamp is incorrect: as the timestamp increases, the unix timestamp should also increase, but this is not the case. You can see an example of this in the code below. Note that when you sort by the timestamp column and by the unix_ts column you get different orders.
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([
("a", "1", "2018-01-08 23:03:23.325359"),
("a", "2", "2018-01-09 00:03:23.325359"),
("a", "3", "2018-01-09 00:03:25.025240"),
("a", "4", "2018-01-09 00:03:27.025240"),
("a", "5", "2018-01-09 00:08:27.021240"),
("a", "6", "2018-01-09 03:03:27.025240"),
("a", "7", "2018-01-09 05:03:27.025240"),
], ["person_id", "session_id", "timestamp"])
df = df.withColumn("unix_ts",F.unix_timestamp(F.col("timestamp"), "yyyy-MM-dd HH:mm:ss.SSSSSS"))
df.orderBy("timestamp").show(10,False)
df.orderBy("unix_ts").show(10,False)
Output:
+---------+----------+--------------------------+----------+
|person_id|session_id|timestamp |unix_ts |
+---------+----------+--------------------------+----------+
|a |1 |2018-01-08 23:03:23.325359|1515474528|
|a |2 |2018-01-09 00:03:23.325359|1515478128|
|a |3 |2018-01-09 00:03:25.025240|1515477830|
|a |4 |2018-01-09 00:03:27.025240|1515477832|
|a |5 |2018-01-09 00:08:27.021240|1515478128|
|a |6 |2018-01-09 03:03:27.025240|1515488632|
|a |7 |2018-01-09 05:03:27.025240|1515495832|
+---------+----------+--------------------------+----------+
+---------+----------+--------------------------+----------+
|person_id|session_id|timestamp |unix_ts |
+---------+----------+--------------------------+----------+
|a |1 |2018-01-08 23:03:23.325359|1515474528|
|a |3 |2018-01-09 00:03:25.025240|1515477830|
|a |4 |2018-01-09 00:03:27.025240|1515477832|
|a |5 |2018-01-09 00:08:27.021240|1515478128|
|a |2 |2018-01-09 00:03:23.325359|1515478128|
|a |6 |2018-01-09 03:03:27.025240|1515488632|
|a |7 |2018-01-09 05:03:27.025240|1515495832|
+---------+----------+--------------------------+----------+
Is this a bug or am I doing something/implementing this wrong?
Also, you can see that both 2018-01-09 00:03:23.325359 and 2018-01-09 00:08:27.021240 produce the same unix_ts of 1515478128.
The problem seems to be that Spark's unix_timestamp internally uses Java's SimpleDateFormat to parse dates, and SimpleDateFormat does not support microseconds (see e.g. here). Furthermore, unix_timestamp returns a long, so it only has a granularity of seconds.
One fix would be to just parse without the microsecond information, and add the microseconds back in separately:
from pyspark.sql import functions as f

df = spark.createDataFrame([
("a", "1", "2018-01-08 23:03:23.325359"),
("a", "2", "2018-01-09 00:03:23.325359"),
("a", "3", "2018-01-09 00:03:25.025240"),
("a", "4", "2018-01-09 00:03:27.025240"),
("a", "5", "2018-01-09 00:08:27.021240"),
("a", "6", "2018-01-09 03:03:27.025240"),
("a", "7", "2018-01-09 05:03:27.025240"),
], ["person_id", "session_id", "timestamp"])
# parse the timestamp up to the seconds place
df = df.withColumn("unix_ts_sec",f.unix_timestamp(f.substring(f.col("timestamp"), 1, 19), "yyyy-MM-dd HH:mm:ss"))
# extract the microseconds
df = df.withColumn("microsec", f.substring(f.col("timestamp"), 21, 6).cast('int'))
# add to get full epoch time accurate to the microsecond
df = df.withColumn("unix_ts", f.col("unix_ts_sec") + 1e-6 * f.col("microsec"))
Side note: I don't have easy access to Spark 2.1, but using Spark 2.2 I get nulls for unix_ts as originally written. You seem to have hit some sort of a Spark 2.1 bug giving you nonsense timestamps.
I have a Spark dataframe which has two columns: one is ID and the second is COL_DATETIME, as you can see in the dataframe given below. How can I filter the dataframe based on COL_DATETIME to get the oldest month's data? I want to achieve the result dynamically because I have 20-odd dataframes.
INPUT DF:-
import org.apache.spark.sql.functions._
import spark.implicits._
val data = Seq(
  (1, "2020-07-02 00:00:00.0"), (2, "2020-08-02 00:00:00.0"), (3, "2020-09-02 00:00:00.0"),
  (4, "2020-10-02 00:00:00.0"), (5, "2020-11-02 00:00:00.0"), (6, "2020-12-02 00:00:00.0"),
  (7, "2021-01-02 00:00:00.0"), (8, "2021-02-02 00:00:00.0"), (9, "2021-03-02 00:00:00.0"),
  (10, "2021-04-02 00:00:00.0"), (11, "2021-05-02 00:00:00.0"), (12, "2021-06-02 00:00:00.0"),
  (13, "2021-07-22 00:00:00.0"))
val dfFromData1 = data.toDF("ID","COL_DATETIME").withColumn("COL_DATETIME",to_timestamp(col("COL_DATETIME")))
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|1 |2020-07-02 00:00:00.0|
|2 |2020-08-02 00:00:00.0|
|3 |2020-09-02 00:00:00.0|
|4 |2020-10-02 00:00:00.0|
|5 |2020-11-02 00:00:00.0|
|6 |2020-12-02 00:00:00.0|
|7 |2021-01-02 00:00:00.0|
|8 |2021-02-02 00:00:00.0|
|9 |2021-03-02 00:00:00.0|
|10 |2021-04-02 00:00:00.0|
|11 |2021-05-02 00:00:00.0|
|12 |2021-06-02 00:00:00.0|
|13 |2021-07-22 00:00:00.0|
+------+---------------------+
OUTPUT:-
DF1:- Oldest month data
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|1 |2020-07-02 00:00:00.0|
+------+---------------------+
DF2:- Latest months' data after removing the oldest month's data from the original DF.
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|2 |2020-08-02 00:00:00.0|
|3 |2020-09-02 00:00:00.0|
|4 |2020-10-02 00:00:00.0|
|5 |2020-11-02 00:00:00.0|
|6 |2020-12-02 00:00:00.0|
|7 |2021-01-02 00:00:00.0|
|8 |2021-02-02 00:00:00.0|
|9 |2021-03-02 00:00:00.0|
|10 |2021-04-02 00:00:00.0|
|11 |2021-05-02 00:00:00.0|
|12 |2021-06-02 00:00:00.0|
|13 |2021-07-22 00:00:00.0|
+------+---------------------+
Logic/approach:-
Step 1:- Calculate the minimum datetime of the COL_DATETIME column for the given dataframe and assign it to a mindate variable.
Let's assume I will get
var mindate = "2020-07-02 00:00:00.0"
val mindate = dfFromData1.select(min("COL_DATETIME")).first()
print(mindate)
result:-
mindate : org.apache.spark.sql.Row = [2020-07-02 00:00:00.0]
[2020-07-02 00:00:00.0]
Step 2:- Get the end date of the month using mindate. I haven't coded this part to get enddatemonth from mindate yet.
val enddatemonth = "2020-07-31 00:00:00.0"
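One way I could code this step (a rough sketch, not tested; it relies on last_day from org.apache.spark.sql.functions):
// last_day returns the last day of the month for a given date, so applying it to
// min(COL_DATETIME) should give the month-end of the oldest month in the dataframe.
val enddatemonth = dfFromData1
  .select(last_day(to_date(min(col("COL_DATETIME")))))
  .first()
  .getDate(0)   // java.sql.Date, e.g. 2020-07-31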
Step 3:- Now I can use the enddatemonth variable to filter the Spark dataframe into DF1 and DF2 based on the conditions.
Even when I tried to filter the dataframe based on mindate I got an error:
val DF1 = dfFromData1.where(col("COL_DATETIME") <= enddatemonth)
val DF2 = dfFromData1.where(col("COL_DATETIME") > enddatemonth)
Error: <console>:166: error: type mismatch;
 found   : org.apache.spark.sql.Row
 required: org.apache.spark.sql.Column
        val DF1 = dfFromData1.where(col("COL_DATETIME") <= mindate)
Thanks...!!
A similar approach, but I find it cleaner to just deal with months.
Idea: just as we have an epoch for seconds, compute one for months.
val dfWithEpochMonth = dfFromData1.
withColumn("year",year($"COL_DATETIME")).
withColumn("month",month($"COL_DATETIME")).
withColumn("epochMonth", (($"year" - 1970 - 1) * 12) + $"month")
Now your df will look like:
+---+-------------------+----+-----+----------+
| ID| COL_DATETIME|year|month|epochMonth|
+---+-------------------+----+-----+----------+
| 1|2020-07-02 00:00:00|2020| 7| 595|
| 2|2020-08-02 00:00:00|2020| 8| 596|
| 3|2020-09-02 00:00:00|2020| 9| 597|
| 4|2020-10-02 00:00:00|2020| 10| 598|
Now, you can calculate min epochMonth and filter directly.
val minEpochMonth = dfWithEpochMonth.select(min("epochMonth")).first().apply(0).toString().toInt
val df1 = dfWithEpochMonth.where($"epochMonth" <= minEpochMonth)
val df2 = dfWithEpochMonth.where($"epochMonth" > minEpochMonth)
You can drop unnecessary columns.
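For example (a one-line sketch):
val result1 = df1.drop("year", "month", "epochMonth")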
To address your error message:
val mindate = dfFromData1.select(min("COL_DATETIME")).first()
val mindateString = mindate.apply(0).toString()
Now you can use mindateString to filter.
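For example (a rough sketch; to_timestamp makes the comparison against the timestamp column explicit):
val DF1 = dfFromData1.where(col("COL_DATETIME") <= to_timestamp(lit(mindateString)))
val DF2 = dfFromData1.where(col("COL_DATETIME") > to_timestamp(lit(mindateString)))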
I have a Dataframe which has the following structure and data
Source:
Column1(String), Column2(String), Date
-----------------------
1, 2, 01/01/2021
A, B, 02/01/2021
M, N, 05/01/2021
I want to transform it to the following (the first two columns are replicated and, for each source row, the date is incremented up to a fixed date, 07/01/2021 in this example):
Result:
1, 2, 01/01/2021
1, 2, 02/01/2021
1, 2, 03/01/2021
1, 2, 04/01/2021
1, 2, 05/01/2021
1, 2, 06/01/2021
1, 2, 07/01/2021
A, B, 02/01/2021
A, B, 03/01/2021
A, B, 04/01/2021
A, B, 05/01/2021
A, B, 06/01/2021
A, B, 07/01/2021
M, N, 05/01/2021
M, N, 06/01/2021
M, N, 07/01/2021
Any idea how this can be achieved in Scala Spark?
I found this link, Replicate Spark Row N-times, but there is no hint on how a particular column can be incremented during replication.
We can use the sequence function (available in Spark 2.4+) to generate the list of dates in the required range, then explode the resulting array to get the dataframe in the required format.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")
// Sample dataframe
val df = List(("1", "2", "01/01/2021"),
("A", "B", "02/01/2021"),
("M", "N", "05/01/2021"))
.toDF("Column1(String)", "Column2(String)", "Date")
df
.withColumn("Date",explode_outer(sequence(to_date('Date,"dd/MM/yyyy"),
to_date(lit("07/01/2021"),"dd/MM/yyyy"))))
.withColumn("Date",date_format('Date,"dd/MM/yyyy"))
.show(false)
/*
+---------------+---------------+----------+
|Column1(String)|Column2(String)|Date |
+---------------+---------------+----------+
|1 |2 |01/01/2021|
|1 |2 |02/01/2021|
|1 |2 |03/01/2021|
|1 |2 |04/01/2021|
|1 |2 |05/01/2021|
|1 |2 |06/01/2021|
|1 |2 |07/01/2021|
|A |B |02/01/2021|
|A |B |03/01/2021|
|A |B |04/01/2021|
|A |B |05/01/2021|
|A |B |06/01/2021|
|A |B |07/01/2021|
|M |N |05/01/2021|
|M |N |06/01/2021|
|M |N |07/01/2021|
+---------------+---------------+----------+ */
Here is some sample data:
val df4 = sc.parallelize(List(
("A1",45, "5", 1, 90),
("A2",60, "1", 1, 120),
("A6", 30, "9", 1, 450),
("A7", 89, "7", 1, 333),
("A7", 89, "4", 1, 320),
("A2",60, "5", 1, 22),
("A1",45, "22", 1, 1)
)).toDF("CID","age", "children", "marketplace_id","value")
Thanks to @Shu for this piece of code:
val df5 = df4.selectExpr("CID","""to_json(named_struct("id", children)) as item""", "value", "marketplace_id")
+---+-----------+-----+--------------+
|CID|item |value|marketplace_id|
+---+-----------+-----+--------------+
|A1 |{"id":"5"} |90 |1 |
|A2 |{"id":"1"} |120 |1 |
|A6 |{"id":"9"} |450 |1 |
|A7 |{"id":"7"} |333 |1 |
|A7 |{"id":"4"} |320 |1 |
|A2 |{"id":"5"} |22 |1 |
|A1 |{"id":"22"}|1 |1 |
+---+-----------+-----+--------------+
When you do df5.dtypes you get:
(CID,StringType), (item,StringType), (value,IntegerType), (marketplace_id,IntegerType)
The column item is of string type; is there a way it can be of a JSON/object type (if that is a thing)?
EDIT 1:
I will describe what I am trying to achieve here; the above two steps remain the same.
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("CID").orderBy(desc("value"))
val sorted_list = df5.withColumn("item", collect_list("item").over(w)).groupBy("CID").agg(max("item") as "item")
Output:
+---+-------------------------+
|CID|item |
+---+-------------------------+
|A6 |[{"id":"9"}] |
|A2 |[{"id":"1"}, {"id":"5"}] |
|A7 |[{"id":"7"}, {"id":"4"}] |
|A1 |[{"id":"5"}, {"id":"22"}]|
+---+-------------------------+
Now whatever is inside the [ ] is a string, which is causing a problem for one of the tools we are using.
Pardon me if this is a basic question; I am new to Scala and Spark.
Store the JSON data using a struct type; check the code below.
scala> dfa
.withColumn("item_without_json",struct($"cid".as("id")))
.withColumn("item_as_json",to_json($"item_without_json"))
.show(false)
+---+-----------+-----+--------------+-----------------+------------+
|CID|item |value|marketplace_id|item_without_json|item_as_json|
+---+-----------+-----+--------------+-----------------+------------+
|A1 |{"id":"A1"}|90 |1 |[A1] |{"id":"A1"} |
|A2 |{"id":"A2"}|120 |1 |[A2] |{"id":"A2"} |
|A6 |{"id":"A6"}|450 |1 |[A6] |{"id":"A6"} |
|A7 |{"id":"A7"}|333 |1 |[A7] |{"id":"A7"} |
|A7 |{"id":"A7"}|320 |1 |[A7] |{"id":"A7"} |
|A2 |{"id":"A2"}|22 |1 |[A2] |{"id":"A2"} |
|A1 |{"id":"A1"}|1 |1 |[A1] |{"id":"A1"} |
+---+-----------+-----+--------------+-----------------+------------+
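Applied to the collect_list step from your edit, keeping item as a struct means the aggregated column becomes an array of structs rather than an array of JSON strings (a rough sketch based on your df4; adjust the ordering logic as needed):
import org.apache.spark.sql.functions.{collect_list, struct}

val structured = df4
  .withColumn("item", struct($"children".as("id")))
  .groupBy($"CID")
  .agg(collect_list($"item").as("items"))   // items: array<struct<id: string>>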
Based on the comment you made about having the dataset converted to JSON, you would use:
df4
  .select(collect_list(struct($"CID".as("id"))).as("items"))
  .write
  .json(path)   // path is your output location
The output will look like:
{"items":[{"id":"A1"},{"id":"A2"},{"id":"A6"},{"id":"A7"}, ...]}
If you need the result in memory to pass down to a function, use toJSON instead of write.json(...).
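For instance (a rough sketch of the in-memory variant):
// toJSON gives a Dataset[String] with one JSON document per row, which can be
// collected and handed to a downstream function instead of being written out.
val jsonDocs: Array[String] = df4
  .select(collect_list(struct($"CID".as("id"))).as("items"))
  .toJSON
  .collect()
// jsonDocs(0) will look like: {"items":[{"id":"A1"},{"id":"A2"}, ...]}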
The below code gives a count vector for each row in the DataFrame:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.fit(df)
cvModel.transform(df).show(false)
The result is:
+---+---------------+-------------------------+
|id |words |features |
+---+---------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
How do I get the total count of each word, like:
+---+------+------+
|id |words |counts|
+---+------+------+
|0 |a | 3 |
|1 |b | 3 |
|2 |c | 2 |
+---+------+------+
Shankar's answer only gives you the actual frequencies if the CountVectorizer model keeps every single word in the corpus (e.g. no minDF or vocabSize limitations). If the vocabulary is restricted, you can instead use Summarizer to sum the feature Vectors directly. Note: Summarizer requires Spark 2.3+.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer.metrics

// You need to select normL1 and another item (like mean) because, for some reason, Spark
// won't allow a single Vector to be selected at a time (at least in 2.4)
val totalCounts = cvModel.transform(df)
  .select(metrics("normL1", "mean").summary($"features").as("summary"))
  .select("summary.normL1", "summary.mean")
  .as[(Vector, Vector)]
  .first()
  ._1
You'll then have to zip totalCounts with cvModel.vocabulary to get the words themselves.
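For example (a sketch; the vector indices line up with cvModel.vocabulary):
// Pair each vocabulary term with its summed count from the normL1 vector.
val wordCounts: Array[(String, Double)] = cvModel.vocabulary.zip(totalCounts.toArray)
wordCounts.foreach { case (word, count) => println(s"$word -> $count") }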
You can simply explode and groupBy to get the count of each word:
import org.apache.spark.sql.expressions.Window

cvModel.transform(df).withColumn("words", explode($"words"))
  .groupBy($"words")
  .agg(count($"words").as("counts"))
  .withColumn("id", row_number().over(Window.orderBy("words")) - 1)
  .show(false)
Output:
+-----+------+---+
|words|counts|id |
+-----+------+---+
|a    |3     |0  |
|b    |3     |1  |
|c    |2     |2  |
+-----+------+---+
I have the following DataFrame in Spark and Scala:
nodeId typeFrom typeTo date
1 A G 2016-10-12T12:10:00.000Z
2 B A 2016-10-12T12:00:00.000Z
3 A B 2016-10-12T12:05:00.000Z
4 D C 2016-10-12T12:30:00.000Z
5 G D 2016-10-12T12:35:00.000Z
I want to make pairs of nodeId for those cases when typeFrom and typeTo values are the same.
The expected output for the above-shown example is the following one:
nodeId_1 nodeId_2 type date
1 2 A 2016-10-12T12:10:00.000Z
3 2 A 2016-10-12T12:05:00.000Z
2 3 B 2016-10-12T12:00:00.000Z
4 5 C 2016-10-12T12:30:00.000Z
5 1 G 2016-10-12T12:35:00.000Z
I don't know how to make pairs of nodeId:
df
  .filter($"typeFrom" === $"typeTo")
  .???
You can use a self-join, matching typeFrom with typeTo:
val df = Seq(
(1, "A", "G", "2016-10-12T12:10:00.000Z"),
(2, "B", "A", "2016-10-12T12:00:00.000Z"),
(3, "A", "B", "2016-10-12T12:05:00.000Z"),
(4, "D", "C", "2016-10-12T12:30:00.000Z"),
(5, "G", "D", "2016-10-12T12:35:00.000Z")
).toDF("nodeId", "typeFrom", "typeTo", "date")
df.as("df1").join(
df.as("df2"),
$"df1.typeFrom" === $"df2.typeTo"
).select(
$"df1.nodeId".as("nodeId_1"), $"df2.nodeId".as("nodeId_2"), $"df1.typeFrom".as("type"), $"df1.date"
).show(truncate=false)
// +--------+--------+----+------------------------+
// |nodeId_1|nodeId_2|type|date |
// +--------+--------+----+------------------------+
// |1 |2 |A |2016-10-12T12:10:00.000Z|
// |2 |3 |B |2016-10-12T12:00:00.000Z|
// |3 |2 |A |2016-10-12T12:05:00.000Z|
// |4 |5 |D |2016-10-12T12:30:00.000Z|
// |5 |1 |G |2016-10-12T12:35:00.000Z|
// +--------+--------+----+------------------------+