I have readStream data coming from a Kafka topic, which I converted to a DataFrame (I am using Structured Streaming in PySpark).
This is my schema:
root
|-- DriverId: string (nullable = true)
|-- time: timestamp (nullable = true)
|-- Longitude: float (nullable = true)
|-- Latitude: float (nullable = true)
|-- SPEED: integer (nullable = true)
|-- EngineSpeed: string (nullable = true)
|-- MAF: string (nullable = true)
|-- FuelType: string (nullable = true)
Harsh acceleration is an increase in speed of 6.14 miles per hour in one second.
So I calculate this in batch (I read a CSV file and apply the following):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

acc_max = 9.87926
dec_max = 10.9412

w = Window.partitionBy('DriverId').orderBy('time')
df = df.withColumn("prev_value", F.lag(df.SPEED).over(w))
df = df.withColumn("diff_speed", F.when(F.isnull(df.SPEED - df.prev_value), 0)
                                  .otherwise(df.SPEED - df.prev_value))
df = df.withColumn('HarshAcceleration',
                   F.when(F.col("diff_speed") > acc_max, 1).otherwise(0))
df = df.withColumn('HarshBraking',
                   F.when(F.col("diff_speed") < -dec_max, 1).otherwise(0))
df.groupBy("DriverId").sum("harshAcceleration").show()
and it shows this result:
+--------+----------------------+
|DriverId|sum(harshAcceleration)|
+--------+----------------------+
| 3| 4|
| 0| 1|
| 1| 1|
| 2| 0|
+--------+----------------------+
Now I want to do the same on the streaming DataFrame, but I can't use the lag function; I run into multiple errors!
windowSpec = Window.partitionBy("DriverId").orderBy(df.time)
acc_max=9.87926
df=df.withColumn("prev_value",F.lag(df.SPEED).over(windowSpec))
df=df.withColumn("diff_speed", F.when(F.isnull(df.SPEED - df.prev_value), 0)\
.otherwise(df.SPEED - df.prev_value))
df=df.withColumn('HarshAcceleration',F.when((F.col("diff_speed") > acc_max),1) \
.otherwise(0))
df=df.groupBy(window(df.time,"3 seconds"),df.DriverId).sum("HarshAcceleration").orderBy('window')
# Start running the query that prints the windowed word counts to the console
query=df\
.select("DriverId","sum(HarshAcceleration)")\
.writeStream\
.format("console") \
.outputMode("complete")\
.start()
but I get this error:
AnalysisException: 'Non-time-based windows are not supported on streaming DataFrames/Datasets;;
Window [lag(SPEED#41, 1, null) windowspecdefinition(DriverId#25, time#68 ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS prev_value#78], [DriverId#25], [time#68 ASC NULLS FIRST]
+- Project [DriverId#25, time#68, Longitude#59, Latitude#50, SPEED#41, EngineSpeed#30, MAF#31, FuelType#32]
   +- Project [DriverId#25, cast(time#26 as timestamp) AS time#68, Longitude#59, Latitude#50, SPEED#41, EngineSpeed#30, MAF#31, FuelType#32]
Can you please suggest a solution, for example with the PySpark Streaming RDD API and updateStateByKey?
My requirement is as below.
I joined two data frames as below:
var c = a.join(b,keys,"fullouter")
c.printSchema() gives the following:
|-- add: string (nullable = true)
|-- sub: string (nullable = true)
|-- delete: string (nullable = true)
|-- mul: long (nullable = true)
|-- ADD: string (nullable = true)
|-- SUB: string (nullable = true)
|-- DELETE: string (nullable = true)
|-- MUL: long (nullable = true)
It works fine up to this point.
Now I am adding a column with a when condition as below:
val d = c.withColumn("column", when(c("a.add") === c("b.ADD"),
"Neardata"))
and I get the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Cannot resolve column name "a.add"
I also tried the following:
val d = c.withColumn("column", when(col("a.add") === col("b.ADD"), "Neardata"))
Again an error.
Please suggest.
You have to define aliases with dataframe.as("a") and dataframe1.as("b").
Example:
import spark.implicits._
import org.apache.spark.sql.functions.{col, when}

val data = List(("James", "", "Smith", "36636", "M", 60000),
  ("Michael", "Rose", "", "40288", "M", 70000),
  ("Robert", "", "Williams", "42114", "", 400000),
  ("Maria", "Anne", "Jones", "39192", "F", 500000),
  ("Jen", "Mary", "Brown", "", "F", 0))
val cols = Seq("first_name", "middle_name", "last_name", "dob", "gender", "salary")
val df = spark.createDataFrame(data).toDF(cols: _*).as("a")

val df2 = df.withColumn("a.new_gender", when(col("a.gender") === "M", "Male")
  .when(col("a.gender") === "F", "Female")
  .otherwise("Unknown"))
df2.show()
Output :
+----------+-----------+---------+-----+------+------+------------+
|first_name|middle_name|last_name| dob|gender|salary|a.new_gender|
+----------+-----------+---------+-----+------+------+------------+
| James| | Smith|36636| M| 60000| Male|
| Michael| Rose| |40288| M| 70000| Male|
| Robert| | Williams|42114| |400000| Unknown|
| Maria| Anne| Jones|39192| F|500000| Female|
| Jen| Mary| Brown| | F| 0| Female|
+----------+-----------+---------+-----+------+------+------------+
I think that without the alias you are trying to access the columns like this, which might be the cause:
val df2 = df.withColumn("df.new_gender", when(col("df.gender") === "M","Male")
.when(col("df.gender") === "F","Female")
.otherwise("Unknown")).show
For schema evolution, mergeSchema can be used in Spark for the Parquet file format, and I have the clarifications below on this:
Does this support only the Parquet file format, or other file formats like CSV and text files as well?
If new columns are added in between, I understand mergeSchema will move those columns to the end.
And if the column order is disturbed, will mergeSchema align the columns to the correct order in which they were created, or do we need to do this manually by selecting all the columns?
Update from comment:
For example, if I have a schema as below and create the table with spark.sql("CREATE TABLE emp USING DELTA LOCATION '****'"): empid,empname,salary ====> 001,ABC,10000, and the next day I get the format below: empid,empage,empdept,empname,salary ====> 001,30,XYZ,ABC,10000.
Will the new columns empage and empdept be added after the empid, empname, salary columns?
Q:
1. Does this support only the Parquet file format, or other file formats like CSV and text files as well?
2. If the column order is disturbed, will mergeSchema align the columns to the correct order in which they were created, or do we need to do this manually by selecting all the columns?
AFAIK, merge schema is supported only by Parquet, not by other formats like CSV or text.
mergeSchema (spark.sql.parquet.mergeSchema) will align the columns in the correct order even if their order is disturbed.
Example from the Spark documentation on Parquet schema merging:
import spark.implicits._
// Create a simple DataFrame, store into a partition directory
val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
squaresDF.write.parquet("data/test_table/key=1")
// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
cubesDF.write.parquet("data/test_table/key=2")
// Read the partitioned table
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
mergedDF.printSchema()
// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column appeared in the partition directory paths
// root
// |-- value: int (nullable = true)
// |-- square: int (nullable = true)
// |-- cube: int (nullable = true)
// |-- key: int (nullable = true)
UPDATE: For the real example you gave in the comment box...
Q: Will the new columns empage and empdept be added after the empid, empname, salary columns?
Answer: Yes.
empage and empdept are added after empid, empname, salary, followed by your day partition column.
See the full example:
package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SaveMode
object CSVDataSourceParquetSchemaMerge extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("CSVParquetSchemaMerge")
.master("local")
.getOrCreate()
import spark.implicits._
val csvDataday1 = spark.sparkContext.parallelize(
"""
|empid,empname,salary
|001,ABC,10000
""".stripMargin.lines.toList).toDS()
val csvDataday2 = spark.sparkContext.parallelize(
"""
|empid,empage,empdept,empname,salary
|001,30,XYZ,ABC,10000
""".stripMargin.lines.toList).toDS()
val frame = spark.read.option("header", true).option("inferSchema", true).csv(csvDataday1)
println("first day data ")
frame.show
frame.write.mode(SaveMode.Overwrite).parquet("data/test_table/day=1")
frame.printSchema
val frame1 = spark.read.option("header", true).option("inferSchema", true).csv(csvDataday2)
frame1.write.mode(SaveMode.Overwrite).parquet("data/test_table/day=2")
println("Second day data ")
frame1.show(false)
frame1.printSchema
// Read the partitioned table
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
println("Merged Schema")
mergedDF.printSchema
println("Merged Datarame where EMPAGE,EMPDEPT WERE ADDED AFER EMPID,EMPNAME,SALARY followed by your day column")
mergedDF.show(false)
}
Result :
first day data
+-----+-------+------+
|empid|empname|salary|
+-----+-------+------+
| 1| ABC| 10000|
+-----+-------+------+
root
|-- empid: integer (nullable = true)
|-- empname: string (nullable = true)
|-- salary: integer (nullable = true)
Second day data
+-----+------+-------+-------+------+
|empid|empage|empdept|empname|salary|
+-----+------+-------+-------+------+
|1 |30 |XYZ |ABC |10000 |
+-----+------+-------+-------+------+
root
|-- empid: integer (nullable = true)
|-- empage: integer (nullable = true)
|-- empdept: string (nullable = true)
|-- empname: string (nullable = true)
|-- salary: integer (nullable = true)
Merged Schema
root
|-- empid: integer (nullable = true)
|-- empname: string (nullable = true)
|-- salary: integer (nullable = true)
|-- empage: integer (nullable = true)
|-- empdept: string (nullable = true)
|-- day: integer (nullable = true)
Merged DataFrame where empage and empdept were added after empid, empname, salary, followed by your day column
+-----+-------+------+------+-------+---+
|empid|empname|salary|empage|empdept|day|
+-----+-------+------+------+-------+---+
|1 |ABC |10000 |30 |XYZ |2 |
|1 |ABC |10000 |null |null |1 |
+-----+-------+------+------+-------+---+
Directory tree: data/test_table/ with the partition directories day=1/ and day=2/ (from the writes above).
I want to filter a Spark sql.DataFrame, keeping only the wanted array elements, without knowing the whole schema beforehand (I don't want to hardcode it).
Schema:
root
|-- callstartcelllabel: string (nullable = true)
|-- calltargetcelllabel: string (nullable = true)
|-- measurements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- enodeb: string (nullable = true)
| | |-- label: string (nullable = true)
| | |-- ltecelloid: long (nullable = true)
|-- networkcode: long (nullable = true)
|-- ocode: long (nullable = true)
|-- startcelllabel: string (nullable = true)
|-- startcelloid: long (nullable = true)
|-- targetcelllabel: string (nullable = true)
|-- targetcelloid: long (nullable = true)
|-- timestamp: long (nullable = true)
I want the whole root row, but only with the particular measurements that pass the filter, and each row must contain at least one measurement after filtering.
I have a dataframe with this root schema, and I have a dataframe of filtering values (one column).
So, for example: I would only know that my root contains a measurements array and that this array contains labels. I want the whole root with only those measurements whose labels are in ("label1", "label2").
My last attempt with explode and collect_list leads to: grouping expressions sequence is empty, and 'callstartcelllabel' is not an aggregate function... Is it even possible to generalize such a filtering case? I don't know yet what such a generic UDAF should look like.
I am new to Spark.
EDIT:
The current solution I've come to is:
explode the array -> filter out rows with unwanted array members -> groupBy everything except the array members -> agg(collect_list(col("measurements"))), as sketched below.
Would it be faster to do this with a udf? I can't figure out how to write a generic udf that filters a generic array, knowing only the filtering values...
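In code, the pipeline would be roughly the following (a sketch, assuming the filtering values are collected into a local wantedLabels sequence and that the non-array column names come from df.columns; untested):
import org.apache.spark.sql.functions._

val wantedLabels = Seq("label1", "label2")                   // assumed filter values
val otherCols = df.columns.filter(_ != "measurements")       // every column except the array

val filtered = df
  .withColumn("measurements", explode(col("measurements")))  // one row per array element
  .filter(col("measurements.label").isin(wantedLabels: _*))  // keep only wanted labels
  .groupBy(otherCols.map(col): _*)                           // group back on the remaining columns
  .agg(collect_list(col("measurements")).as("measurements")) // rebuild the filtered array
Rows whose measurements contain none of the wanted labels are dropped by the filter, which matches the "at least one measurement after filtering" requirement.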
import spark.implicits._

case class Test(a: Int, b: Int) // case class used to illustrate the scenario
val df = List((1, 2, Test(1, 2)), (2, 3, Test(3, 4)), (4, 2, Test(5, 6))).toDF("name", "rank", "array")
df.show
+----+----+------+
|name|rank| array|
+----+----+------+
| 1| 2|[1, 2]|
| 2| 3|[3, 4]|
| 4| 2|[5, 6]|
+----+----+------+
df.printSchema
// the DataFrame structure looks like this
root
|-- name: integer (nullable = false)
|-- rank: integer (nullable = false)
|-- array: struct (nullable = true)
| |-- a: integer (nullable = false)
| |-- b: integer (nullable = false)
df.filter(df("array")("a")>1).show
// after filtering the DataFrame on the specified condition
+----+----+------+
|name|rank| array|
+----+----+------+
| 2| 3|[3, 4]|
| 4| 2|[5, 6]|
+----+----+------+
// The above code helps you to understand the scenario
// Use this piece of code:
df.filter(df("measurements")("label") === "label1" || df("measurements")("label") === "label2").show
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
val df = Seq((1, 2, Array(Test(1, 2), Test(5, 6))), (1, 3, Array(Test(1, 2), Test(5, 3))), (10, 11, Array(Test(1, 6)))).toDF("name", "rank", "array")
df.show
+----+----+----------------+
|name|rank| array|
+----+----+----------------+
| 1| 2|[[1, 2], [5, 6]]|
| 1| 3|[[1, 2], [5, 3]]|
| 10| 11| [[1, 6]]|
+----+----+----------------+
def test = {
  udf((a: scala.collection.mutable.WrappedArray[Row]) => {
    val b = a.toArray.map(x => (x.getInt(0), x.getInt(1)))
    b.filter(y => y._1 > 1) // keep only elements whose first field is greater than 1
  })
}
df.withColumn("array", test(df("array"))).show
+----+----+--------+
|name|rank| array|
+----+----+--------+
| 1| 2|[[5, 6]]|
| 1| 3|[[5, 3]]|
| 10| 11| []|
+----+----+--------+
I'd like to explode an array of structs to columns (as defined by the struct fields). E.g.
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = false)
| | |-- name: string (nullable = true)
Should be transformed to
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
I can achieve this with
df
.select(explode($"arr").as("tmp"))
.select($"tmp.*")
How can I do that in a single select statement?
I thought this could work, unfortunately it does not:
df.select(explode($"arr")(".*"))
Exception in thread "main" org.apache.spark.sql.AnalysisException: No
such struct field .* in col;
A single-step solution is available only for MapType columns:
val df = Seq(Tuple1(Map((1L, "bar"), (2L, "foo")))).toDF
df.select(explode($"_1") as Seq("foo", "bar")).show
+---+---+
|foo|bar|
+---+---+
| 1|bar|
| 2|foo|
+---+---+
With arrays you can use flatMap:
val df = Seq(Tuple1(Array((1L, "bar"), (2L, "foo")))).toDF
df.as[Seq[(Long, String)]].flatMap(identity)
A single SELECT statement can be written in SQL:
df.createOrReplaceTempView("df")
spark.sql("SELECT x._1, x._2 FROM df LATERAL VIEW explode(_1) t AS x")
I have the following DataFrame:
df.show()
+---------------+----+
| x| num|
+---------------+----+
|[0.1, 0.2, 0.3]| 0|
|[0.3, 0.1, 0.1]| 1|
|[0.2, 0.1, 0.2]| 2|
+---------------+----+
This DataFrame has the following column data types:
df.printSchema
root
|-- x: array (nullable = true)
| |-- element: double (containsNull = true)
|-- num: long (nullable = true)
I am currently trying to convert the array of doubles inside the DataFrame to an array of floats. I do it with the following udf:
val toFloat = udf[(val line: Seq[Double]) => line.map(_.toFloat)]
val test = df.withColumn("testX", toFloat(df("x")))
This code is currently not working. Can anybody share with me how to change the array type inside the DataFrame?
What I want is:
df.printSchema
root
|-- x: array (nullable = true)
| |-- element: float (containsNull = true)
|-- num: long (nullable = true)
This question is based on the question How to change the simple DataType in Spark SQL's DataFrame.
Your udf is declared incorrectly. You should write it as follows:
val toFloat = udf((line: Seq[Double]) => line.map(_.toFloat))
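Applied to the DataFrame from the question, usage could look like this (a minimal sketch; overwriting the x column instead of adding a testX column is just a choice for illustration):
import org.apache.spark.sql.functions.udf

val toFloat = udf((line: Seq[Double]) => line.map(_.toFloat))

// replace the double array with the converted float array and check the schema
val test = df.withColumn("x", toFloat(df("x")))
test.printSchema() // the x element type should now be float instead of double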