How to update all the values of a column in a DataFrame - Scala

I have a DataFrame which has an unformatted Date column:
+--------+-----------+--------+
|CDOPEINT| bbbbbbbbbb|    Date|
+--------+-----------+--------+
|     AAA|bbbbbbbbbbb|13190326|
|     AAA|bbbbbbbbbbb|10190309|
|     AAA|bbbbbbbbbbb|36190908|
|     AAA|bbbbbbbbbbb|07190214|
|     AAA|bbbbbbbbbbb|13190328|
|     AAA|bbbbbbbbbbb|23190608|
|     AAA|bbbbbbbbbbb|13190330|
|     AAA|bbbbbbbbbbb|26190630|
+--------+-----------+--------+
The Date column is formatted as wwyyMMdd (week, year, month, day), which I want to convert to yyyy/MM/dd (as shown in the desired output below); for that I have a method, format, that does it.
So my question is: how can I map all the values of the Date column to the needed format? Here is the output that I want:
+--------+-----------+----------+
|CDOPEINT| bbbbbbbbbb|      Date|
+--------+-----------+----------+
|     AAA|bbbbbbbbbbb|2019/03/26|
|     AAA|bbbbbbbbbbb|2019/03/09|
|     AAA|bbbbbbbbbbb|2019/09/08|
|     AAA|bbbbbbbbbbb|2019/02/14|
|     AAA|bbbbbbbbbbb|2019/03/28|
|     AAA|bbbbbbbbbbb|2019/06/08|
|     AAA|bbbbbbbbbbb|2019/03/30|
|     AAA|bbbbbbbbbbb|2019/06/30|
+--------+-----------+----------+

On Spark 2.4.3, using unix_timestamp you can convert the data to the expected output.
scala> val df2 = spark.createDataFrame(Seq(("AAA","bbbbbbbbbbb","13190326"),("AAA","bbbbbbbbbbb","10190309"),("AAA","bbbbbbbbbbb","36190908"),("AAA","bbbbbbbbbbb","07190214"),("AAA","bbbbbbbbbbb","13190328"),("AAA","bbbbbbbbbbb","23190608"),("AAA","bbbbbbbbbbb","13190330"),("AAA","bbbbbbbbbbb","26190630"))).toDF("CDOPEINT","bbbbbbbbbb","Date")
scala> df2.withColumn("Date",from_unixtime(unix_timestamp(substring(col("Date"),3,7),"yyMMdd"),"yyyy/MM/dd")).show
+--------+-----------+----------+
|CDOPEINT| bbbbbbbbbb|      Date|
+--------+-----------+----------+
|     AAA|bbbbbbbbbbb|2019/03/26|
|     AAA|bbbbbbbbbbb|2019/03/09|
|     AAA|bbbbbbbbbbb|2019/09/08|
|     AAA|bbbbbbbbbbb|2019/02/14|
|     AAA|bbbbbbbbbbb|2019/03/28|
|     AAA|bbbbbbbbbbb|2019/06/08|
|     AAA|bbbbbbbbbbb|2019/03/30|
|     AAA|bbbbbbbbbbb|2019/06/30|
+--------+-----------+----------+
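On Spark 2.2+ the same conversion can also be written with to_date and an explicit pattern instead of unix_timestamp. A minimal sketch against the df2 defined above (drop the two week digits, parse the remaining yyMMdd, then re-format as yyyy/MM/dd):
import org.apache.spark.sql.functions.{col, date_format, substring, to_date}

// "13190326" -> "190326" -> 2019-03-26 -> "2019/03/26"
val reformatted = df2.withColumn(
  "Date",
  date_format(to_date(substring(col("Date"), 3, 6), "yyMMdd"), "yyyy/MM/dd")
)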
Let me know if you have any questions related to this.

If the dates are all from the year 2000 onwards and the Date column in your original DataFrame is of Integer type, you could try something like this:
def getDate = (date: Int) => {
  // e.g. 13190326 -> drop the week "13" -> "19","03","26" -> "2019/03/26"
  val dateString = date.toString.drop(2).sliding(2, 2)
  dateString.zipWithIndex.map {
    case (value, index) => if (index == 0) "20" + value else value
  }.mkString("/")
}
Then create a UDF which calls this function (udf and col come from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.{col, udf}
val updateDateUdf = udf(getDate)
If originalDF is the original DataFrame that you have, you can then transform it like this:
val updatedDF = originalDF.withColumn("Date",updateDateUdf(col("Date")))
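More generally, if you already have a plain Scala method like the format one mentioned in the question (its exact signature isn't shown, so a String => String signature is assumed here as a stand-in), the same pattern applies: wrap it in a udf and apply it with withColumn. A minimal sketch:
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for the format method from the question:
// drops the two leading week digits and rewrites yyMMdd as yyyy/MM/dd (assumes years 2000+).
def format(raw: String): String = {
  val yyMMdd = raw.drop(2)
  s"20${yyMMdd.take(2)}/${yyMMdd.slice(2, 4)}/${yyMMdd.slice(4, 6)}"
}

val formatUdf = udf(format _)
val formattedDF = originalDF.withColumn("Date", formatUdf(col("Date")))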

Related

How to find similar rows by matching column values in Spark?

So I have a data set like this:
{"customer":"customer-1","attributes":{"att-a":"att-a-7","att-b":"att-b-3","att-c":"att-c-10","att-d":"att-d-10","att-e":"att-e-15","att-f":"att-f-11","att-g":"att-g-2","att-h":"att-h-7","att-i":"att-i-5","att-j":"att-j-14"}}
{"customer":"customer-2","attributes":{"att-a":"att-a-9","att-b":"att-b-7","att-c":"att-c-12","att-d":"att-d-4","att-e":"att-e-10","att-f":"att-f-4","att-g":"att-g-13","att-h":"att-h-4","att-i":"att-i-1","att-j":"att-j-13"}}
{"customer":"customer-3","attributes":{"att-a":"att-a-10","att-b":"att-b-6","att-c":"att-c-1","att-d":"att-d-1","att-e":"att-e-13","att-f":"att-f-12","att-g":"att-g-9","att-h":"att-h-6","att-i":"att-i-7","att-j":"att-j-4"}}
{"customer":"customer-4","attributes":{"att-a":"att-a-9","att-b":"att-b-14","att-c":"att-c-7","att-d":"att-d-4","att-e":"att-e-8","att-f":"att-f-7","att-g":"att-g-14","att-h":"att-h-9","att-i":"att-i-13","att-j":"att-j-3"}}
I have flattened the data in the DF like this
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
|   att-a|   att-b|   att-c|   att-d|   att-e|   att-f|   att-g|   att-h|   att-i|   att-j|   customer|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a-7| att-b-3|att-c-10|att-d-10|att-e-15|att-f-11| att-g-2| att-h-7| att-i-5|att-j-14| customer-1|
| att-a-9| att-b-7|att-c-12| att-d-4|att-e-10| att-f-4|att-g-13| att-h-4| att-i-1|att-j-13| customer-2|
I want to complete the compareColumns function, which compares the columns of the two DataFrames (userDF and flattenedDF) and returns a new DF like the sample output below.
How can I do that? For example, compare each row and column in flattenedDF with userDF and increment a counter whenever they match, e.g. att-a with att-a, att-b with att-b (one possible approach is sketched at the end of this question).
def getCustomer(customerID: String)(dataFrame: DataFrame): DataFrame = {
  dataFrame.filter($"customer" === customerID).toDF()
}

def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
  val userDF = dataFrame.transform(getCustomer(customerID))
  userDF.printSchema()
  userDF
}
Sample Output:
+-----------+----------------+
|   customer|similarity_score|
+-----------+----------------+
| customer-1|              -1|   <- the same as the reference customer, so the -1 is ignored
|customer-12|               2|
| customer-3|               2|
|customer-44|               5|
| customer-5|               1|
| customer-6|              10|
+-----------+----------------+
Thanks
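One way to implement the counting described above, as a rough sketch: assume the attribute columns are every column except customer, the attribute values are strings, and the customerID filter matches exactly one row.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
  // Single reference row for the given customer.
  val userRow = dataFrame.filter(col("customer") === customerID).head()
  val attributeCols = dataFrame.columns.filter(_ != "customer")

  // One point for every attribute that equals the reference customer's value.
  val score = attributeCols
    .map(c => when(col(c) === lit(userRow.getAs[String](c)), 1).otherwise(0))
    .reduce(_ + _)

  dataFrame
    .withColumn("similarity_score",
      when(col("customer") === customerID, lit(-1)).otherwise(score))
    .select("customer", "similarity_score")
}

// Usage, following the transform style from the question:
val scored = flattenedDF.transform(compareColumns("customer-1"))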

Extract only Hour from Epochtime in scala

I have a dataframe with one of its columns holding epoch time.
I want to extract only the hour from it and display it as a separate column.
Below is the sample dataframe:
+----------+-------------+
|    NUM_ID|        STIME|
+----------+-------------+
|xxxxxxxx01|1571634285000|
|xxxxxxxx01|1571634299000|
|xxxxxxxx01|1571634311000|
|xxxxxxxx01|1571634316000|
|xxxxxxxx02|1571634318000|
|xxxxxxxx02|1571398176000|
|xxxxxxxx02|1571627596000|
Below is the expected output.
+----------+-------------+-----+
|    NUM_ID|        STIME| HOUR|
+----------+-------------+-----+
|xxxxxxxx01|1571634285000| 10 |
|xxxxxxxx01|1571634299000| 10 |
|xxxxxxxx01|1571634311000| 10 |
|xxxxxxxx01|1571634316000| 10 |
|xxxxxxxx02|1571634318000| 10 |
|xxxxxxxx02|1571398176000| 16 |
|xxxxxxxx02|1571627596000| 08 |
I have tried
val test = test1DF.withColumn("TIME", extract HOUR(from_unixtime($"STIME"/1000)))
which throws exception at
<console>:46: error: not found: value extract
I tried the following to obtain a date format, and even that is not working:
val test = test1DF.withColumn("TIME", to_timestamp(from_unixtime(col("STIME"))))
The datatype of STIME in the dataframe is Long.
Any leads on extracting the hour from an epoch time of Long datatype?
Extracting the hours from a timestamp is as simple as using the hour() function:
import org.apache.spark.sql.functions._
val df_with_hour = df.withColumn("TIME", hour(from_unixtime($"STIME" / 1000)))
df_with_hour.show()
// +-------------+----+
// | STIME|TIME|
// +-------------+----+
// |1571634285000| 5|
// |1571398176000| 11|
// |1571627596000| 3|
// +-------------+----+
(Note: I'm in a different timezone)
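The hour values therefore depend on the session timezone. If you need them in a specific zone, you can pin it before extracting the hour; a sketch (UTC is only an example, and spark.sql.session.timeZone is available on Spark 2.2+):
import org.apache.spark.sql.functions.{col, from_unixtime, hour}

// Pin the session timezone so hour() is deterministic across environments.
spark.conf.set("spark.sql.session.timeZone", "UTC")

val withHour = test1DF.withColumn("HOUR", hour(from_unixtime(col("STIME") / 1000)))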

How to filter Date Columns and store them as numbers in Data Frames using Scala

I have a dataframe (dateds1) which looks like the one below:
+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate|      Contract Date|        ReleaseDate|
+-----------+-----------+-------------------+-------------------+
| 1995/09/16| 2008/09/09|2009-02-09 00:00:00|2017-09-09 00:00:00|
| 1994/09/20| 2008/09/10|1999-05-05 00:00:00|2016-09-30 00:00:00|
| 1993/09/24| 2016/06/29|2003-12-07 00:00:00|2028-02-13 00:00:00|
| 1992/09/28| 2007/06/24|2004-06-05 00:00:00|2019-09-24 00:00:00|
| 1991/10/03| 2011/07/07|2011-07-07 00:00:00|2020-03-30 00:00:00|
| 1990/10/07| 2009/02/09|2009-02-09 00:00:00|2011-03-13 00:00:00|
| 1989/10/11| 1999/05/05|1999-05-05 00:00:00|2021-03-13 00:00:00|
I need help in filtering it out; my output should look like the one below:
+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate|      Contract Date|        ReleaseDate|
+-----------+-----------+-------------------+-------------------+
| 19950916 | 20080909 |20090209 |20170909 |
| 19940920 | 20080910 |19990505 |20160930 |
| 19930924 | 20160629 |20031207 |20280213 |
| 19920928 | 20070624 |20040605 |20190924 |
| 19911003 | 20110707 |20110707 |20200330 |
| 19901007 | 20090209 |20090209 |20110313 |
| 19891011 | 19990505 |19990505 |20210313 |
I tried using filter, but I was only able to handle one of the cases at a time: when the dates are in yyyy/MM/dd or yyyy-MM-dd 00:00:00 format and the number of columns is fixed. Can someone please help me figure it out for both formats and when the number of columns is dynamic (they might increase or decrease)?
They should be converted from the Date datatype to Integer or Long, in the format yyyyMMdd.
Note: the records in this DataFrame are either in yyyy/MM/dd or yyyy-MM-dd 00:00:00 format.
Any help is appreciated. Thanks
To do that conversion dynamically you'll have to iterate through all columns and perform different operations depending on the column type.
Here's an example:
import java.sql.Date
import java.sql.Timestamp
import org.apache.spark.sql.types._

val originalDf = Seq(
  (Timestamp.valueOf("2016-09-30 03:04:00"), Date.valueOf("2016-09-30")),
  (Timestamp.valueOf("2016-07-30 00:00:00"), Date.valueOf("2016-10-30"))
).toDF("ts_value", "date_value")
Original table details:
> originalDf.show
+-------------------+----------+
|           ts_value|date_value|
+-------------------+----------+
|2016-09-30 03:04:00|2016-09-30|
|2016-07-30 00:00:00|2016-10-30|
+-------------------+----------+
> originalDf.printSchema
root
|-- ts_value: timestamp (nullable = true)
|-- date_value: date (nullable = true)
Example of conversion operation:
val newDf = originalDf.columns.foldLeft(originalDf)((df, name) => {
  val data_type = df.schema(name).dataType
  if (data_type == DateType)
    df.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
  else if (data_type == TimestampType)
    df.withColumn(name, year(col(name)) * 10000 + month(col(name)) * 100 + dayofmonth(col(name)))
  else
    df
})
New table details:
newDf.show
+--------+----------+
|ts_value|date_value|
+--------+----------+
|20160930| 20160930|
|20160730| 20161030|
+--------+----------+
newDf.printSchema
root
|-- ts_value: integer (nullable = true)
|-- date_value: integer (nullable = true)
If you don't want to perform this operation on all columns, you can manually specify the columns by changing
val newDf = originalDf.columns.foldLeft ...
to
val newDf = Seq("col1_name","col2_name", ... ).foldLeft ...
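If some of those yyyy/MM/dd columns actually arrive as plain strings rather than as DateType (the question doesn't show the schema, so that is an assumption), a variant with a StringType branch can parse them first. A sketch, assuming Spark 2.2+ for to_date with an explicit pattern:
import org.apache.spark.sql.functions.{col, date_format, to_date}
import org.apache.spark.sql.types._

val convertedDf = originalDf.columns.foldLeft(originalDf)((df, name) => {
  df.schema(name).dataType match {
    case DateType | TimestampType =>
      df.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
    case StringType =>
      // Assumes string date columns are formatted exactly as yyyy/MM/dd.
      df.withColumn(name, date_format(to_date(col(name), "yyyy/MM/dd"), "yyyyMMdd").cast(IntegerType))
    case _ =>
      df
  }
})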
Hope this helps!

Hourly Aggregation in Scala Spark

I'm looking for a way to aggregate my data by hour. Firstly, I want to keep only the hours in my evtTime column. My DataFrame looks like this:
+-------+-----------------------+-----------+
|reqUser|evtTime |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:23:06.426|1 |
|X166815|2018-01-01 02:20:06.426|2 |
|X166816|2018-01-01 11:25:06.429|5 |
|X166817|2018-02-01 10:23:06.429|1 |
|X166818|2018-01-01 09:23:06.430|3 |
|X166819|2018-01-01 10:15:06.430|8 |
|X166820|2018-08-01 11:00:06.431|20 |
|X166821|2018-03-01 06:23:06.431|7 |
|X166822|2018-01-01 07:23:06.434|2 |
|X166823|2018-01-01 11:23:06.434|1 |
+-------+-----------------------+-----------+
My objective is to get something like this:
+-------+-----------------------+-----------+
|reqUser|evtTime |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:00:00.000|1 |
|X166815|2018-01-01 02:00:00.000|2 |
|X166816|2018-01-01 11:00:00.000|5 |
|X166817|2018-02-01 10:00:00.000|1 |
|X166818|2018-01-01 09:00:00.000|3 |
|X166819|2018-01-01 10:00:00.000|8 |
|X166820|2018-08-01 11:00:00.000|20 |
|X166821|2018-03-01 06:00:00.000|7 |
|X166822|2018-01-01 07:00:00.000|2 |
|X166823|2018-01-01 11:00:00.000|1 |
+-------+-----------------------+-----------+
I'm using Scala 2.10.5 and Spark 1.6.3. My objective, subsequently, is to group by reqUser and calculate the sum of event_count. I tried this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{round, sum}

val new_df = df
  .groupBy($"reqUser", Window(col("evtTime"), "1 hour"))
  .agg(sum("event_count") as "aggregate_sum")
This is my error message :
Error:(81, 15) org.apache.spark.sql.expressions.Window.type does not take parameters
Window(col("time"), "1 hour"))
Help ? Thx !
In Spark 1.x you can use the date formatting functions:
import org.apache.spark.sql.functions.date_format
val df = Seq("2018-01-01 10:15:06.430").toDF("evtTime")
df.select(date_format($"evtTime".cast("timestamp"), "yyyy-MM-dd HH:00:00")).show
+------------------------------------------------------------+
|date_format(CAST(evtTime AS TIMESTAMP), yyyy-MM-dd HH:00:00)|
+------------------------------------------------------------+
| 2018-01-01 10:00:00|
+------------------------------------------------------------+
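Building on that, the grouping you described can use the truncated value directly. A sketch against the df from the question (evtHour is just an illustrative column name; date_format has been available since Spark 1.5, so this should also work on 1.6):
import org.apache.spark.sql.functions.{col, date_format, sum}

// Truncate evtTime to the hour, then sum event_count per user and hour.
val hourlySums = df
  .withColumn("evtHour", date_format(col("evtTime").cast("timestamp"), "yyyy-MM-dd HH:00:00"))
  .groupBy(col("reqUser"), col("evtHour"))
  .agg(sum("event_count").as("aggregate_sum"))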

pyspark - Can I use substring of value as a key of groupBy() function?

I have a dataframe that looks like this:
datetime | ID |
======================
20180201000000 | 275 |
20171231113024 | 534 |
20180201220000 | 275 |
20170205000000 | 28 |
What I want to do is to count by ID, monthly.
This approach worked perfectly:
Add a month column by extracting it from the datetime column:
new_df = df.withColumn('month', df.datetime.substr(0,6))
Count by ID & month:
count_df = new_df.groupBy('ID','month').count()
But is there a way to use a substring of certain column values as an argument of the groupBy() function? Like:
`count_df = df.groupBy('ID', df.datetime.substr(0,6)).count()`
At least, this code didn't work.
If there is a way to use a substring of the values directly, I wouldn't need to add a new column, which would save a lot of resources (in the case of big data).
But even if this approach is wrong, do you have a better idea to get the same result?
Try this
>>> df.show()
+--------------+---+
|      datetime| id|
+--------------+---+
|20180201000000|275|
|20171231113024|534|
|20180201220000|275|
|20170205000000| 28|
+--------------+---+
>>> df.groupBy('id',df.datetime.substr(0,6)).agg(count('id')).show()
+---+-----------------------+---------+
| id|substring(datetime,0,6)|count(id)|
+---+-----------------------+---------+
|275|                 201802|        2|
|534|                 201712|        1|
| 28|                 201702|        1|
+---+-----------------------+---------+
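The same grouping can also be expressed with the Scala DataFrame API (a sketch, assuming a DataFrame df with string columns datetime and id; alias just names the derived columns):
import org.apache.spark.sql.functions.{col, count, substring}

// Group by id and the yyyyMM prefix of datetime without adding a column first.
val monthlyCounts = df
  .groupBy(col("id"), substring(col("datetime"), 1, 6).alias("month"))
  .agg(count("id").alias("count"))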