spark OneHotEncoder - how to exclude user-defined category? - scala

Consider the following spark dataframe:
df.printSchema()
root
|-- predictor: double (nullable = true)
|-- label: double (nullable = true)
|-- date: string (nullable = true)
df.show(6)
+---------+-----+--------+
|predictor|label|    date|
+---------+-----+--------+
|     4.23| 6.33|20160510|
|     4.77| 7.18|20160510|
|     4.09| 5.94|20160511|
|     4.23| 6.33|20160511|
|     4.77| 7.18|20160512|
|     4.09| 5.94|20160512|
+---------+-----+--------+
Essentially, my dataframe consists of data with daily frequency. I need to map the column of dates to a column of binary vectors. This is simple to implement using StringIndexer & OneHotEncoder:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val dateIndexer = new StringIndexer()
  .setInputCol("date")
  .setOutputCol("dateIndex")
  .fit(df)
val indexed = dateIndexer.transform(df)
val encoder = new OneHotEncoder()
  .setInputCol("dateIndex")
  .setOutputCol("date_codeVec")
val encoded = encoder.transform(indexed)
My problem is that OneHotEncoder drops the last category by default. However, I need to drop the category which relates to the first date in my dataframe (20160510 in the above example) because I need to compute a time trend relative to the first date.
How can I achieve this for the above example (note that I have more than 100 dates in my dataframe)?

You can try setting dropLast to false:
val encoder = new OneHotEncoder()
  .setInputCol("dateIndex")
  .setOutputCol("date_codeVec")
  .setDropLast(false)
val encoded = encoder.transform(indexed)
and then drop the chosen level manually, using VectorSlicer:
import org.apache.spark.ml.feature.VectorSlicer
val slicer = new VectorSlicer()
  .setInputCol("date_codeVec")
  .setOutputCol("data_codeVec_selected")
  // keep every label except the earliest date (the lexicographically smallest yyyyMMdd string)
  .setNames(dateIndexer.labels.diff(Seq(dateIndexer.labels.min)))
slicer.transform(encoded).show(false)
+---------+-----+--------+---------+-------------+---------------------+
|predictor|label| date|dateIndex| date_codeVec|data_codeVec_selected|
+---------+-----+--------+---------+-------------+---------------------+
| 4.23| 6.33|20160510| 0.0|(3,[0],[1.0])| (2,[],[])|
| 4.77| 7.18|20160510| 0.0|(3,[0],[1.0])| (2,[],[])|
| 4.09| 5.94|20160511| 2.0|(3,[2],[1.0])| (2,[1],[1.0])|
| 4.23| 6.33|20160511| 2.0|(3,[2],[1.0])| (2,[1],[1.0])|
| 4.77| 7.18|20160512| 1.0|(3,[1],[1.0])| (2,[0],[1.0])|
| 4.09| 5.94|20160512| 1.0|(3,[1],[1.0])| (2,[0],[1.0])|
+---------+-----+--------+---------+-------------+---------------------+
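As a side note, StringIndexer orders labels by frequency rather than by date (you can see in the dateIndex column above that the dates are not indexed chronologically), so the earliest date is not guaranteed to get index 0.0; dropping by label name through VectorSlicer, as above, is therefore safer than dropping by position. A quick sketch to inspect the mapping, assuming the fitted dateIndexer from the question:
// labels(i) is the original date string that was mapped to index i
dateIndexer.labels.zipWithIndex.foreach { case (label, index) =>
  println(s"index $index -> date $label")
}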

Related

Pyspark date_trunc without modifying actual value

Consider the below dataframe
df:
time
2022-02-21T11:23:54
I have to convert it to
time
2022-02-21T11:23:00
After using the below code
df.withColumn("time_updated", date_trunc("minute", col("time"))).show(truncate = False)
My output
time
2022-02-21 11:23:00
My desired output is
time
2022-02-21T11:23:00
Is there any way I can keep the data the same and just update/truncate the seconds?
You simply have a format issue: the output you see is the string representation of a timestamp. Check your output format:
from pyspark.sql import functions as F
df = df.withColumn(
    "time_updated",
    F.date_format(F.col("time").cast("timestamp"), "yyyy-MM-dd'T'HH:mm:00"),
)
df.show(truncate=False)
+-------------------+-------------------+
|time |time_updated |
+-------------------+-------------------+
|2022-02-21T11:23:54|2022-02-21T11:23:00|
+-------------------+-------------------+
df.printSchema()
root
|-- time: string (nullable = true)
|-- time_updated: string (nullable = true)

String Functions for Nested Schema in Spark Scala

I am learning Spark in Scala programming language.
Input file ->
"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}
Schema ->
root
|-- Personal: struct (nullable = true)
| |-- ID: integer (nullable = true)
| |-- Name: array (nullable = true)
| | |-- element: string (containsNull = true)
Operation for output ->
I want to concatenate the strings in the "Name" array,
e.g. abcs|dakjdb
I am reading the file using the DataFrame API.
Please help me with this.
This should be pretty straightforward: if you are working with Spark >= 1.6.0 you can use get_json_object and concat_ws:
import org.apache.spark.sql.functions.{concat_ws, get_json_object}
import spark.implicits._ // for toDF and the $-notation (already in scope in the spark-shell)
val df = Seq(
  ("""{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}"""),
  ("""{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}""")
).toDF("data")
df.select(
  concat_ws(
    "-",
    get_json_object($"data", "$.Personal.Name[0]"),
    get_json_object($"data", "$.Personal.Name[1]")
  ).as("FullName")
).show(false)
// +-----------+
// |FullName |
// +-----------+
// |abcs-dakjdb|
// |cfg-woooww |
// +-----------+
With get_json_object we go through the JSON data and extract the two elements of the Name array, which we then concatenate.
There is a built-in function, concat_ws, which should be useful here.
To extend @Alexandros Biratsis' answer: you can first convert Name into an array[String] before concatenating, to avoid hard-coding every name position. Querying by position would also fail when a value is null or when only one value exists instead of two.
import org.apache.spark.sql.functions.{concat_ws, from_json, get_json_object}
import org.apache.spark.sql.types.{ArrayType, StringType}
val arraySchema = ArrayType(StringType)
val df = Seq(
  ("""{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}"""),
  ("""{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}""")
).toDF("data")
df.select(get_json_object($"data", "$.Personal.Name") as "name")
  .select(from_json($"name", arraySchema) as "name")
  .select(concat_ws("|", $"name") as "FullName")
  .show(false)
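As an alternative sketch (my own addition, not part of the original answers): if the file is loaded directly with spark.read.json, Spark infers the nested schema shown in the question and concat_ws can be applied to the array column without any JSON-path extraction. The path below is hypothetical.
import org.apache.spark.sql.functions.{col, concat_ws}
// Load the JSON file (one object per line) and let Spark infer the nested schema.
val nested = spark.read.json("/path/to/input.json")
// concat_ws also accepts an array<string> column and joins its elements with the separator.
nested.select(concat_ws("|", col("Personal.Name")).as("FullName")).show(false)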

Spark Scala built-in function to_timestamp() ignores the millisecond part of the timestamp value

Sample Code:
val sparkSession = SparkUtil.getSparkSession("timestamp_format_test")
import sparkSession.implicits._
import org.apache.spark.sql.functions.to_timestamp
val format = "yyyy/MM/dd HH:mm:ss.SSS"
val time = "2018/12/21 08:07:36.927"
val df = sparkSession.sparkContext.parallelize(Seq(time)).toDF("in_timestamp")
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp"), format))
Output:
df2.show(false)
Please notice: out_timestamp loses the millisecond part of the original value.
+-----------------------+-------------------+
|in_timestamp |out_timestamp |
+-----------------------+-------------------+
|2018/12/21 08:07:36.927|2018-12-21 08:07:36|
+-----------------------+-------------------+
df2.printSchema()
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
In the above result, in_timestamp is of string type and I would like to convert it to the timestamp data type. It does get converted, but the millisecond part gets lost. Any ideas? Thanks!
Sample code for preserving the milliseconds during the conversion from string to timestamp:
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp")))
df2.show(false)
+-----------------------+-----------------------+
|in_timestamp |out_timestamp |
+-----------------------+-----------------------+
|2018-12-21 08:07:36.927|2018-12-21 08:07:36.927|
+-----------------------+-----------------------+
scala> df2.printSchema
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
You just need to remove the format parameter from to_timestamp; without it the conversion falls back to a plain cast, which keeps the fractional seconds. Note that this works here because the sample input uses the default yyyy-MM-dd HH:mm:ss.SSS layout rather than the slash-separated format from the question. The result is stored with the timestamp data type while matching the original string value.
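If the input really does use the slash-separated format from the question, one possible workaround (a sketch of my own, assuming Spark 2.x where to_timestamp with an explicit format drops fractional seconds) is to normalize the separators first and then rely on the same bare conversion:
import org.apache.spark.sql.functions.{regexp_replace, to_timestamp}
// Turn "2018/12/21 08:07:36.927" into "2018-12-21 08:07:36.927" before converting,
// so the default cast-based conversion can keep the milliseconds.
val df3 = df.withColumn(
  "out_timestamp",
  to_timestamp(regexp_replace(df.col("in_timestamp"), "/", "-"))
)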

Timestamp formats and time zones in Spark (scala API)

******* UPDATE ********
As suggested in the comments I eliminated the irrelevant part of the code:
My requirements:
Unify number of milliseconds to 3
Transform string to timestamp and keep the value in UTC
Create dataframe:
val df = Seq("2018-09-02T05:05:03.456Z","2018-09-02T04:08:32.1Z","2018-09-02T05:05:45.65Z").toDF("Timestamp")
Here are the results using the spark-shell:
************ END UPDATE *********************************
I am having a nice headache trying to deal with time zones and timestamp formats in Spark using scala.
This is a simplification of my script to explain my problem:
import org.apache.spark.sql.functions._
val jsonRDD = sc.wholeTextFiles("file:///data/home2/phernandez/vpp/Test_Message.json")
val jsonDF = spark.read.json(jsonRDD.map(f => f._2))
This is the resulting schema:
root
|-- MeasuredValues: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- MeasuredValue: double (nullable = true)
| | |-- Status: long (nullable = true)
| | |-- Timestamp: string (nullable = true)
Then I just select the Timestamp field as follows
jsonDF.select(explode($"MeasuredValues").as("Values")).select($"Values.Timestamp").show(5,false)
First thing I want to fix is the number of milliseconds of every timestamp and unify it to three.
I applied the date_format as follows
jsonDF.select(explode($"MeasuredValues").as("Values")).select(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).show(5,false)
Milliseconds format was fixed but timestamp is converted from UTC to local time.
To tackle this issue, I applied the to_utc_timestamp together with my local time zone.
jsonDF.select(explode($"MeasuredValues").as("Values")).select(to_utc_timestamp(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),"Europe/Berlin").as("Timestamp")).show(5,false)
Even worse, the UTC value is not returned, and the millisecond formatting is lost.
Any ideas on how to deal with this? I would appreciate it 😊
BR. Paul
The cause of the problem is the time format string used for conversion:
yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
As you may see, Z is inside single quotes, which means that it is not interpreted as the zone offset marker, but only as a character like T in the middle.
So, the format string should be changed to
yyyy-MM-dd'T'HH:mm:ss.SSSX
where X is the Java standard date/time formatter pattern letter for an ISO-8601 zone offset (with Z printed for an offset of 0).
Now, the source data can be converted to UTC timestamps:
val srcDF = Seq(
  ("2018-04-10T13:30:34.45Z"),
  ("2018-04-10T13:45:55.4Z"),
  ("2018-04-10T14:00:00.234Z"),
  ("2018-04-10T14:15:04.34Z"),
  ("2018-04-10T14:30:23.45Z")
).toDF("Timestamp")
val convertedDF = srcDF.select(
  to_utc_timestamp(date_format($"Timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSX"), "Europe/Berlin").as("converted")
)
convertedDF.printSchema()
convertedDF.show(false)
/**
root
|-- converted: timestamp (nullable = true)
+-----------------------+
|converted |
+-----------------------+
|2018-04-10 13:30:34.45 |
|2018-04-10 13:45:55.4 |
|2018-04-10 14:00:00.234|
|2018-04-10 14:15:04.34 |
|2018-04-10 14:30:23.45 |
+-----------------------+
*/
If you need to convert the timestamps back to strings and normalize the values to always show three fractional digits, you can apply another date_format call, similar to the one you already used in the question.
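For example, a short sketch of that final step (my own illustration, reusing the converted column from above):
import org.apache.spark.sql.functions.date_format
// SSS always prints three fractional digits, so .45 becomes .450 and .4 becomes .400.
// Note: date_format renders in the session time zone; set spark.sql.session.timeZone to UTC
// if the literal 'Z' suffix should correspond to an actual UTC wall-clock value.
val normalizedDF = convertedDF.select(
  date_format($"converted", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").as("normalized")
)
normalizedDF.show(false)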

Spark Scala -difference between current date and max(day)

Need to calculate the difference between two dates. The question is
Currentdate - max(day_id)
"Currentdate" is of simple date format - yyyyMMdd
"day_id" is of string format and its value is yyyy-mm-dd.
I have a dataframe which converted the date(string format) to date format (yyyy-mm-dd)
df1 = to_date(from_unixtime(unix_timestamp(day_id, 'yyyy-MM-dd')))
Normally, for finding the max(day_id), I would do
def daySince(columnName: String): Column = {
  max(col(columnName))
}
I cannot figure out how to do the difference between
Currentdate - max(day_id)
Given input dataframe with schema as
+---+----------+
|id |day_id |
+---+----------+
|id1|2017-11-21|
|id1|2018-01-21|
|id2|2017-12-21|
+---+----------+
root
|-- id: string (nullable = true)
|-- day_id: string (nullable = true)
you can use the current_date() and datediff() built-in functions to meet your requirement as follows:
import org.apache.spark.sql.functions._
df.withColumn("diff", datediff(current_date(), to_date(col("day_id"), "yyyy-MM-dd")))
which should give you
+---+----------+----+
|id |day_id |diff|
+---+----------+----+
|id1|2017-11-21|167 |
|id1|2018-01-21|106 |
|id2|2017-12-21|137 |
+---+----------+----+
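If the requirement is literally current date minus max(day_id) rather than a per-row difference, a minimal sketch under the same schema (my own extension of the answer) aggregates first:
import org.apache.spark.sql.functions.{col, current_date, datediff, max, to_date}
df.withColumn("day", to_date(col("day_id"), "yyyy-MM-dd"))
  .groupBy("id")
  .agg(max(col("day")).as("max_day"))
  .withColumn("diff", datediff(current_date(), col("max_day")))
  .show(false)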