Split a dataframe by column values in Scala

I need to split a dataframe into multiple dataframes by the timestamp column. I would provide the number of hours each resulting dataframe should cover and get back a set of dataframes, each spanning that many hours.
The signature of the method would look like this:
def splitDataframes(df: DataFrame, hoursNumber: Int): Seq[DataFrame]
How can I achieve that?
The schema of the dataframe looks like this:
root
|-- date_time: integer (nullable = true)
|-- user_id: long (nullable = true)
|-- order_id: string (nullable = true)
|-- description: string (nullable = true)
|-- event_date: date (nullable = true)
|-- event_ts: timestamp (nullable = true)
|-- event_hour: long (nullable = true)
Some of the input df fields:
event_ts, user_id
2020-12-13 08:22:00, 1
2020-12-13 08:51:00, 2
2020-12-13 09:28:00, 1
2020-12-13 10:53:00, 3
2020-12-13 11:05:00, 1
2020-12-13 12:19:00, 2
Some of the output df fields with hoursNumber=2:
df1:
event_ts, user_id
2020-12-13 08:22:00, 1
2020-12-13 08:51:00, 2
2020-12-13 09:28:00, 1
df2:
2020-12-13 10:53:00, 3
2020-12-13 11:05:00, 1
df3:
2020-12-13 12:19:00, 2

Convert the timestamp to a Unix timestamp, and then work out a group id for each row from the time difference to the earliest timestamp.
EDIT: the solution is even simpler if you count the starting time from 00:00:00 instead.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{floor, unix_timestamp}
import spark.implicits._ // for the $"..." column syntax (assumes the SparkSession is named spark)

def splitDataframes(df: DataFrame, hoursNumber: Int): Seq[DataFrame] = {
  val df2 = df
    .withColumn("event_unix_ts", unix_timestamp($"event_ts"))
    .withColumn("grouping", floor($"event_unix_ts" / (3600 * hoursNumber)))
    .drop("event_unix_ts")
  df2.select("grouping").distinct().collect().toSeq
    .map(x => df2.filter($"grouping" === x(0)).drop("grouping"))
}
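For example, a minimal usage sketch (illustrative only; eventsDF is a made-up name for a dataframe holding the sample rows above):
// Hypothetical usage: eventsDF holds the sample rows shown above
val buckets: Seq[DataFrame] = splitDataframes(eventsDF, hoursNumber = 2)
buckets.foreach(_.show(false)) // one dataframe per 2-hour window counted from 00:00:00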

Related

Spark MergeSchema on parquet columns

For schema evolution, mergeSchema can be used in Spark for the Parquet file format, and I have the following clarifications on this:
Does this support only the Parquet file format, or other file formats like csv and txt files as well?
If new additional columns are added in between, I understand mergeSchema will move those columns to the end.
And if the column order is disturbed, will mergeSchema align the columns back to the order in which the table was created, or do we need to do this manually by selecting all the columns?
Update from Comment :
For example, if I have a schema as below and create the table with spark.sql("CREATE TABLE emp USING DELTA LOCATION '****'"): empid,empname,salary ====> 001,ABC,10000. The next day I get the format empid,empage,empdept,empname,salary ====> 001,30,XYZ,ABC,10000.
Will the new columns empage and empdept be added after the empid, empname, salary columns?
Q:
1. Does this support only the Parquet file format, or other file formats like csv and txt files as well?
2. If the column order is disturbed, will mergeSchema align the columns back to the order in which the table was created, or do we need to do this manually by selecting all the columns?
AFAIK, merge schema is supported only by Parquet, not by other formats like csv or txt.
mergeSchema (spark.sql.parquet.mergeSchema) will align the columns in the correct order even if their order is disturbed.
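As a quick sketch of how it is switched on (both are standard Spark settings, shown here only for illustration):
// Enable Parquet schema merging globally for the session
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
// ...or only for a single read
val merged = spark.read.option("mergeSchema", "true").parquet("data/test_table")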
Example from the Spark documentation on Parquet schema merging:
import spark.implicits._
// Create a simple DataFrame, store into a partition directory
val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
squaresDF.write.parquet("data/test_table/key=1")
// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
cubesDF.write.parquet("data/test_table/key=2")
// Read the partitioned table
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
mergedDF.printSchema()
// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column appeared in the partition directory paths
// root
// |-- value: int (nullable = true)
// |-- square: int (nullable = true)
// |-- cube: int (nullable = true)
// |-- key: int (nullable = true)
UPDATE: Real example given by you in the comment box...
Q: Will the new columns empage and empdept be added after the empid, empname, salary columns?
Answer: Yes. EMPAGE and EMPDEPT were added after EMPID, EMPNAME, SALARY, followed by your day column.
See the full example:
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SaveMode

object CSVDataSourceParquetSchemaMerge extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder().appName("CSVParquetSchemaMerge")
    .master("local")
    .getOrCreate()
  import spark.implicits._

  val csvDataday1 = spark.sparkContext.parallelize(
    """
      |empid,empname,salary
      |001,ABC,10000
    """.stripMargin.lines.toList).toDS()
  val csvDataday2 = spark.sparkContext.parallelize(
    """
      |empid,empage,empdept,empname,salary
      |001,30,XYZ,ABC,10000
    """.stripMargin.lines.toList).toDS()

  val frame = spark.read.option("header", true).option("inferSchema", true).csv(csvDataday1)
  println("first day data ")
  frame.show
  frame.write.mode(SaveMode.Overwrite).parquet("data/test_table/day=1")
  frame.printSchema

  val frame1 = spark.read.option("header", true).option("inferSchema", true).csv(csvDataday2)
  frame1.write.mode(SaveMode.Overwrite).parquet("data/test_table/day=2")
  println("Second day data ")
  frame1.show(false)
  frame1.printSchema

  // Read the partitioned table
  val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
  println("Merged Schema")
  mergedDF.printSchema
  println("Merged DataFrame where EMPAGE, EMPDEPT were added after EMPID, EMPNAME, SALARY followed by your day column")
  mergedDF.show(false)
}
Result :
first day data
+-----+-------+------+
|empid|empname|salary|
+-----+-------+------+
| 1| ABC| 10000|
+-----+-------+------+
root
|-- empid: integer (nullable = true)
|-- empname: string (nullable = true)
|-- salary: integer (nullable = true)
Second day data
+-----+------+-------+-------+------+
|empid|empage|empdept|empname|salary|
+-----+------+-------+-------+------+
|1 |30 |XYZ |ABC |10000 |
+-----+------+-------+-------+------+
root
|-- empid: integer (nullable = true)
|-- empage: integer (nullable = true)
|-- empdept: string (nullable = true)
|-- empname: string (nullable = true)
|-- salary: integer (nullable = true)
Merged Schema
root
|-- empid: integer (nullable = true)
|-- empname: string (nullable = true)
|-- salary: integer (nullable = true)
|-- empage: integer (nullable = true)
|-- empdept: string (nullable = true)
|-- day: integer (nullable = true)
Merged DataFrame where EMPAGE, EMPDEPT were added after EMPID, EMPNAME, SALARY followed by your day column
+-----+-------+------+------+-------+---+
|empid|empname|salary|empage|empdept|day|
+-----+-------+------+------+-------+---+
|1 |ABC |10000 |30 |XYZ |2 |
|1 |ABC |10000 |null |null |1 |
+-----+-------+------+------+-------+---+
Directory tree: (screenshot of data/test_table showing the day=1 and day=2 partition directories)

Scala recursively overwrite column type in DF

I want to transform the year, month and day columns in my dataframe schema from Integer to string.
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
It should return the same DF but with the column type changed to String for those specific columns.
The error I am getting is:
This is my function:
def transformDate(df: Dataset[Row], spark: SparkSession): Dataset[Row] = {
  val cols = Seq("month", "day", "year")
  for (col <- cols) {
    val df = df.withColumn(col, $"col".cast(org.apache.spark.sql.types.StringType))
  }
  df
}
What is wrong with my function?
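For what it's worth, a minimal sketch of one way to fix it (not necessarily the only one): the loop shadows df with a new local val on every iteration, so each result is thrown away, and $"col" refers to a literal column named "col" rather than the loop variable. Folding over the column names threads the updated dataframe through each cast:
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.col

def transformDate(df: Dataset[Row]): Dataset[Row] = {
  val cols = Seq("month", "day", "year")
  // fold so each withColumn builds on the previous result instead of a shadowed df
  cols.foldLeft(df)((acc, c) => acc.withColumn(c, col(c).cast("string")))
}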

Change schema of existing dataframe

I want to change the schema of an existing dataframe, but while changing the schema I'm getting an error. Is it possible to change the existing schema of a dataframe?
val customSchema = StructType(
  Array(
    StructField("data_typ", StringType, nullable = false),
    StructField("data_typ", IntegerType, nullable = false),
    StructField("proc_date", IntegerType, nullable = false),
    StructField("cyc_dt", DateType, nullable = false)
  ))
val readDF=
+------------+--------------------+-----------+--------------------+
|DatatypeCode| Description|monthColNam| timeStampColNam|
+------------+--------------------+-----------+--------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:...|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+
val rows= readDF.rdd
val readDF1 = sparkSession.createDataFrame(rows,customSchema)
expected result
val newdf=
+------------+--------------------+-----------+--------------------+
|data_typ_cd | data_typ_desc|proc_dt | cyc_dt |
+------------+--------------------+-----------+--------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:...|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+
Any help will be appreciated.
You can do something like this to change the datatype from one type to another.
I have created a dataframe similar to yours, as below:
import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.types._
var df = Seq(("03099","Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),("03307","Elapsed Day Factor", "201867", "2018-05-31 18:25:00"))
.toDF("DatatypeCode","data_typ", "proc_date", "cyc_dt")
df.printSchema()
df.show()
This gives me the following output:
root
|-- DatatypeCode: string (nullable = true)
|-- data_typ: string (nullable = true)
|-- proc_date: string (nullable = true)
|-- cyc_dt: string (nullable = true)
+------------+--------------------+---------+-------------------+
|DatatypeCode| data_typ|proc_date| cyc_dt|
+------------+--------------------+---------+-------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:00|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:00|
+------------+--------------------+---------+-------------------+
If you look at the schema above, all the columns are of type String. Now I want to change the column proc_date to Integer type and cyc_dt to Date type, so I will do the following:
df = df.withColumnRenamed("DatatypeCode", "data_type_code")
df = df.withColumn("proc_date_new", df("proc_date").cast(IntegerType)).drop("proc_date")
df = df.withColumn("cyc_dt_new", df("cyc_dt").cast(DateType)).drop("cyc_dt")
and when you check the schema of this dataframe
df.printSchema()
then it gives the output as following with the new column names:
root
|-- data_type_code: string (nullable = true)
|-- data_typ: string (nullable = true)
|-- proc_date_new: integer (nullable = true)
|-- cyc_dt_new: date (nullable = true)
You cannot change the schema like this. The schema object passed to createDataFrame has to match the data, not the other way around:
To parse timestamp data, use the corresponding functions, for example as in Better way to convert a string field into timestamp in Spark.
To change other types, use the cast method, for example as in how to change a Dataframe column from String type to Double type in pyspark.
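A minimal sketch of that advice applied to the dataframe in the question (the target column names are taken from the expected output shown above, so treat them as illustrative):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, IntegerType}

// Rename and cast in one pass instead of re-applying a schema to the RDD
val newdf = readDF.select(
  col("DatatypeCode").as("data_typ_cd"),
  col("Description").as("data_typ_desc"),
  col("monthColNam").cast(IntegerType).as("proc_dt"),
  col("timeStampColNam").cast(DateType).as("cyc_dt"))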

How to compute statistics on a streaming dataframe for different type of columns in a single query?

I have a streaming dataframe with three columns: time, col1, col2.
+-----------------------+-------------------+--------------------+
|time |col1 |col2 |
+-----------------------+-------------------+--------------------+
|2018-01-10 15:27:21.289|0.4988615628926717 |0.1926744113882285 |
|2018-01-10 15:27:22.289|0.5430687338123434 |0.17084552928040175 |
|2018-01-10 15:27:23.289|0.20527770821641478|0.2221980020202523 |
|2018-01-10 15:27:24.289|0.130852802747647 |0.5213147910202641 |
+-----------------------+-------------------+--------------------+
The datatype of col1 and col2 is variable. It could be a string or numeric datatype.
So I have to calculate statistics for each column.
For string columns, calculate only the valid count and invalid count.
For timestamp columns, calculate only min and max.
For numeric columns, calculate min, max, average, and standard deviation.
I have to compute all statistics in a single query.
Right now, I compute them with three separate queries, one for each type of column.
Enumerate the cases you want and select. For example, if the stream is defined as:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
val schema = StructType(Seq(
StructField("v", TimestampType),
StructField("x", IntegerType),
StructField("y", StringType),
StructField("z", DecimalType(10, 2))
))
val df = spark.readStream.schema(schema).format("csv").load("/tmp/foo")
The result would be
val stats = df.select(df.dtypes.flatMap {
  case (c, "StringType") =>
    Seq(count(c) as s"valid_${c}", count("*") - count(c) as s"invalid_${c}")
  case (c, t) if Seq("TimestampType", "DateType") contains t =>
    Seq(min(c), max(c))
  case (c, t) if (Seq("FloatType", "DoubleType", "IntegerType") contains t) || t.startsWith("DecimalType") =>
    Seq(min(c), max(c), avg(c), stddev(c))
  case _ => Seq.empty[Column]
}: _*)
// root
// |-- min(v): timestamp (nullable = true)
// |-- max(v): timestamp (nullable = true)
// |-- min(x): integer (nullable = true)
// |-- max(x): integer (nullable = true)
// |-- avg(x): double (nullable = true)
// |-- stddev_samp(x): double (nullable = true)
// |-- valid_y: long (nullable = false)
// |-- invalid_y: long (nullable = false)
// |-- min(z): decimal(10,2) (nullable = true)
// |-- max(z): decimal(10,2) (nullable = true)
// |-- avg(z): decimal(14,6) (nullable = true)
// |-- stddev_samp(z): double (nullable = true)
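To actually run this as one streaming query, something like the following should work (the console sink is used only for illustration; since stats is an aggregation, it needs complete or update output mode):
// Start a single streaming query over the aggregated stats dataframe
val query = stats.writeStream
  .outputMode("complete") // aggregation without watermark => complete mode
  .format("console")
  .start()

query.awaitTermination()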

How to merge two fields (string type) of a DataFrame column to generate a Date

I have a DataFrame whose simplified schema has two columns with 3 fields each:
root
|-- npaDetails: struct (nullable = true)
| |-- additionalInformation: struct (nullable = true)
| |-- npaStatus: struct (nullable = true)
| |-- npaDetails: struct (nullable = true)
|-- npaHeaderData: struct (nullable = true)
| |-- npaNumber: string (nullable = true)
| |-- npaDownloadDate: string (nullable = true)
| |-- npaDownloadTime: string (nullable = true)
Possible values:
npaDownloadDate - "30JAN17"
npaDownloadTime - "19.50.00"
I need to compare two rows in a DataFrame with this schema, to determine which one is "fresher". To do so I need to merge the fields npaDownloadDate and npaDownloadTime to generate a Date that I can compare easily.
Below is the code I have written so far. It works, but I think it takes more steps than necessary, and I'm sure that Scala offers better solutions than my approach.
val parquetFileDF = sqlContext.read.parquet("MyParquet.parquet")
val relevantRows = parquetFileDF.filter($"npaHeaderData.npaNumber" === "123456")
val date = relevantRows .select($"npaHeaderData.npaDownloadDate").head().get(0)
val time = relevantRows .select($"npaHeaderData.npaDownloadTime").head().get(0)
val dateTime = new SimpleDateFormat("ddMMMyykk.mm.ss").parse(s"$date$time")
//I would replicate the previous steps to get dateTime2
if(dateTime.before(dateTime2))
println("dateTime is before dateTime2")
So the output of "30JAN17" and "19.50.00" would be Mon Jan 30 19:50:00 GST 2017
Is there another way to generate a Date from two fields of a column, without extracting and merging them as strings? Or even better, is it possible to directly compare both values (date and time) between two different rows in a dataframe to know which one has the older date?
In Spark 2.2,
df.filter(
  to_date(
    concat(
      $"npaHeaderData.npaDownloadDate",
      $"npaHeaderData.npaDownloadTime"),
    fmt = "[your format here]"
  ) < lit(someDate))
I'd use
import org.apache.spark.sql.functions._
df.withColumn("some_name", date_format(unix_timestamp(
concat($"npaHeaderData.npaDownloadDate", $"npaHeaderData.npaDownloadTime"),
"ddMMMyykk.mm.ss").cast("timestamp"),
"EEE MMM d HH:mm:ss z yyyy"))