I want to calculate a cumulative count of values in a data frame column over the past 1 hour using a moving window. I can get the expected output with a PySpark (non-streaming) window function using rangeBetween, but I want real-time data processing, so I am trying Spark Structured Streaming so that whenever a new record/transaction comes into the system, I get the desired output.
The data looks like this:
time,col
2019-04-27 01:00:00,A
2019-04-27 00:01:00,A
2019-04-27 00:05:00,B
2019-04-27 01:01:00,A
2019-04-27 00:08:00,B
2019-04-27 00:03:00,A
2019-04-27 03:03:00,A
Using PySpark (non-streaming):
from pyspark.sql.window import Window
from pyspark.sql.functions import unix_timestamp, count

df = sqlContext.read.format("csv") \
    .options(header='true', inferschema='false', delimiter=',') \
    .load(r'/datalocation')

df = df.withColumn("numddate", unix_timestamp(df.time, "yyyy-MM-dd HH:mm:ss"))
w1 = Window.partitionBy("col").orderBy("numddate").rangeBetween(-3600, -1)
df = df.withColumn("B_cumulative_count", count("col").over(w1))
+-------------------+---+----------+------------------+
| time|col| numddate|B_cumulative_count|
+-------------------+---+----------+------------------+
|2019-04-27 00:05:00| B|1556348700| 0|
|2019-04-27 00:08:00| B|1556348880| 1|
|2019-04-27 00:01:00| A|1556348460| 0|
|2019-04-27 00:03:00| A|1556348580| 1|
|2019-04-27 01:00:00| A|1556352000| 2|
|2019-04-27 01:01:00| A|1556352060| 3|
|2019-04-27 03:03:00| A|1556359380| 0|
+-------------------+---+----------+------------------+
(This is the output I need, and the code above produces it.)
With Structured Streaming, this is what I am trying:
from pyspark.sql.types import StructType, StructField, TimestampType, StringType
from pyspark.sql.functions import window

userSchema = StructType([
    StructField("time", TimestampType()),
    StructField("col", StringType())
])

lines2 = spark \
    .readStream \
    .format('csv') \
    .schema(userSchema) \
    .csv("/datalocation")

windowedCounts = lines2.groupBy(
    window(lines2.time, "1 hour"),
    lines2.col
).count()

windowedCounts.writeStream \
    .format("memory") \
    .outputMode("complete") \
    .queryName("test2") \
    .option("truncate", "false") \
    .start()

spark.table("test2").show(truncate=False)
Streaming output:
+------------------------------------------+---+-----+
|window |col|count|
+------------------------------------------+---+-----+
|[2019-04-27 03:00:00, 2019-04-27 04:00:00]|A |1 |
|[2019-04-27 00:00:00, 2019-04-27 01:00:00]|A |2 |
|[2019-04-27 01:00:00, 2019-04-27 02:00:00]|A |2 |
|[2019-04-27 00:00:00, 2019-04-27 01:00:00]|B |2 |
+------------------------------------------+---+-----+
How can I replicate the same result using Spark Structured Streaming?
You can group by a sliding window and count. Here is an example of a windowed word count in Structured Streaming:
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", host)
  .option("port", port)
  .option("includeTimestamp", true)
  .load()

// Split the lines into words, retaining timestamps
val words = lines.as[(String, Timestamp)].flatMap(line =>
  line._1.split(" ").map(word => (word, line._2))
).toDF("word", "timestamp")

val windowDuration = "10 seconds"
val slideDuration = "5 seconds"

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", windowDuration, slideDuration), $"word"
).count().orderBy("window")

// Start running the query that prints the windowed word counts to the console
val query = windowedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .start()

query.awaitTermination()
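Applied to your time/col CSV data, here is a minimal sketch (untested; it assumes the same /datalocation source as your code and uses a 1-hour window sliding every minute, so it produces counts per window bucket rather than an exact trailing count per record):

import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types.{StructType, TimestampType, StringType}
import spark.implicits._

// schema of the incoming CSV files, taken from the question
val userSchema = new StructType()
  .add("time", TimestampType)
  .add("col", StringType)

val lines2 = spark.readStream
  .format("csv")
  .schema(userSchema)
  .csv("/datalocation")

// 1-hour windows sliding every minute: each event falls into every window
// that covers it, which approximates a trailing 1-hour count per value of col
val slidingCounts = lines2.groupBy(
  window($"time", "1 hour", "1 minute"),
  $"col"
).count()

val query = slidingCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .start()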
Related
The data frame I get after reading a text file with the Spark context is:
+----+---+------+
| _1| _2| _3|
+----+---+------+
|name|age|salary|
| sai| 25| 1000|
| bum| 30| 1500|
| che| 40| null|
+----+---+------+
The dataframe I need is:
+----+---+------+
|name|age|salary|
+----+---+------+
| sai| 25| 1000|
| bum| 30| 1500|
| che| 40| null|
+----+---+------+
Here is the code:
# from the Spark context
df_txt = spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
df_txt1 = df_txt.map(lambda x: x.split(" "))
ddf = df_txt1.toDF().show()
You can use the Spark CSV reader to read your comma-separated file directly, as shown in the sketch below.
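For example, a minimal sketch (assuming, as the rest of this answer does, that the file is comma-separated and its first row is a header):

val csvDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/FileStore/tables/simple-2.txt")
csvDF.show()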
Alternatively, to read it as a plain text file, take the first row as the header, build a Seq of column names from it, and pass that to the toDF function. Also remove the header row from the RDD.
Note: the code below is written in Scala; you can convert it to lambda functions to make it work in PySpark.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
// first row is the header; build the column names from it
val header = df.first()
val headerCol: Seq[String] = header.split(",").toList
// drop the header row from the RDD
val filteredRDD = df.filter(x => x != header)
// split each line and name the columns from the header
val finaldf = filteredRDD.map(_.split(",")).map(w => (w(0), w(1), w(2))).toDF(headerCol: _*)
finaldf.show()
w(0), w(1), w(2) - you have to reference a fixed number of columns from your file; see the sketch below for a variant that avoids this.
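If you would rather not hard-code the number of columns, here is a sketch of an alternative (my addition, not part of the code above) that builds an explicit schema from the header and uses Row.fromSeq; it assumes every line has the same number of comma-separated fields:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StringType, StructType}

val rdd = spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
val headerLine = rdd.first()

// one string column per header field
val schema = StructType(headerLine.split(",").map(name => StructField(name.trim, StringType, nullable = true)))

// every non-header line becomes a Row with the same number of fields
val rows = rdd.filter(_ != headerLine).map(line => Row.fromSeq(line.split(",", -1).toSeq))

val finaldf2 = spark.createDataFrame(rows, schema)
finaldf2.show()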
The first RDD, user_person, is a Hive table which records every person's information:
+---------+---+----+
|person_id|age| bmi|
+---------+---+----+
| -100| 1|null|
| 3| 4|null|
...
Below is my second RDD, a Hive table that has only 40 rows and includes only basic information:
| id|startage|endage|energy|
| 1| 0| 0.2| 1|
| 1| 2| 10| 3|
| 1| 10| 20| 5|
I want to compute every person's energy requirement by age range for each row.
For example, a person whose age is 4 requires 3 energy. I want to add that information to the user_person RDD.
How can I do this?
First, initialize the Spark session with enableHiveSupport() and copy the Hive config files (hive-site.xml, core-site.xml, and hdfs-site.xml) to the Spark conf/ directory, so that Spark can read from Hive.
val sparkSession = SparkSession.builder()
.appName("spark-scala-read-and-write-from-hive")
.config("hive.metastore.warehouse.dir", params.hiveHost + "user/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
Read the Hive tables as DataFrames:
val personDF = sparkSession.sql("SELECT * FROM user_person")
val infoDF = sparkSession.sql("SELECT * FROM person_info")
Join the two dataframes using the expression below:
import sparkSession.implicits._
val outputDF = personDF.join(infoDF, $"age" >= $"startage" && $"age" < $"endage")
The outputDF dataframe contains all the columns of both input dataframes.
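If you only need the person's columns plus the matched energy value, here is a short sketch (column names taken from the question; the person_info table name is the same assumption as above):

// keep the original person columns and the energy value whose age range matched
val result = outputDF.select($"person_id", $"age", $"bmi", $"energy")
result.show()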
I have a requirement, but I am confused about how to do it.
I have two dataframes. The first time, I receive the data below:
file1
prodid, lastupdatedate, indicator
00001,,A
00002,01-25-1981,A
00003,01-26-1982,A
00004,12-20-1985,A
The output should be:
0001,1900-01-01, 2400-01-01, A
0002,1981-01-25, 2400-01-01, A
0003,1982-01-26, 2400-01-01, A
0004,1985-12-20, 2400-01-01, A
The second time, I get another file (file2):
prodid, lastupdatedate, indicator
00002,01-25-2018,U
00004,01-25-2018,U
00006,01-25-2018,A
00008,01-25-2018,A
I want the end result to look like:
00001,1900-01-01,2400-01-01,A
00002,1981-01-25,2018-01-25,I
00002,2018-01-25,2400-01-01,A
00003,1982-01-26,2400-01-01,A
00004,1985-12-20,2018-01-25,I
00004,2018-01-25,2400-01-01,A
00006,2018-01-25,2400-01-01,A
00008,2018-01-25,2400-01-01,A
So whatever updates are in the second file, that date should go in the second column, the default date (2400-01-01) should go in the third column, along with the relevant indicator. The default indicator is A.
I have started like this:
val spark = SparkSession.builder()
  .master("local")
  .appName("creating data frame for csv")
  .getOrCreate()

import spark.implicits._

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("d:/prod.txt")

val df1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("d:/prod1.txt")

val newdf = df.na.fill("01-01-1900", Seq("lastupdatedate"))

if ((df1("indicator") == 'U') && (df1("prodid") == newdf("prodid"))) {
  val df3 = df1.except(newdf)
}
You should join them on prodid and use when functions to shape the dataframes into the expected output. Then filter the updated dataframe for the second rows and merge them back (I have included comments explaining each part of the code).
import org.apache.spark.sql.functions._
//filling empty lastupdatedate and changing the date to the expected format
val newdf = df.na.fill("01-01-1900",Seq("lastupdatedate"))
.withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))
//changing the date to the expected format of the second dataframe
val newdf1 = df1.withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))
//joining both dataframes and updating columns according to your needs
val tempdf = newdf.as("table1").join(newdf1.as("table2"),Seq("prodid"), "outer")
.select(col("prodid"),
when(col("table1.lastupdatedate").isNotNull, col("table1.lastupdatedate")).otherwise(col("table2.lastupdatedate")).as("lastupdatedate"),
when(col("table1.indicator").isNotNull, when(col("table2.lastupdatedate").isNotNull, col("table2.lastupdatedate")).otherwise(lit("2400-01-01"))).otherwise(lit("2400-01-01")).as("defaultdate"),
when(col("table2.indicator").isNull, col("table1.indicator")).otherwise(when(col("table2.indicator") === "U", lit("I")).otherwise(col("table2.indicator"))).as("indicator"))
//filtering tempdf for duplication
val filtereddf = tempdf.filter(col("indicator") === "I")
.withColumn("lastupdatedate", col("defaultdate"))
.withColumn("defaultdate", lit("2400-01-01"))
.withColumn("indicator", lit("A"))
//finally merging both dataframes
tempdf.union(filtereddf).sort("prodid", "lastupdatedate").show(false)
which should give you
+------+--------------+-----------+---------+
|prodid|lastupdatedate|defaultdate|indicator|
+------+--------------+-----------+---------+
|1 |1900-01-01 |2400-01-01 |A |
|2 |1981-01-25 |2018-01-25 |I |
|2 |2018-01-25 |2400-01-01 |A |
|3 |1982-01-26 |2400-01-01 |A |
|4 |1985-12-20 |2018-01-25 |I |
|4 |2018-01-25 |2400-01-01 |A |
|6 |2018-01-25 |2400-01-01 |A |
|8 |2018-01-25 |2400-01-01 |A |
+------+--------------+-----------+---------+
I am reading a JSON file into a Spark DataFrame and it creates an extra column at the end.
var df: DataFrame = Seq(
  (1.0, "a"),
  (0.0, "b"),
  (0.0, "c"),
  (1.0, "d")
).toDF("col1", "col2")
df.write.mode(SaveMode.Overwrite).format("json").save("/home/neelesh/year=2018/")
val newDF = sqlContext.read.json("/home/neelesh/year=2018/*")
newDF.show
The output of newDF.show is:
+----+----+----+
|col1|col2|year|
+----+----+----+
| 1.0| a|2018|
| 0.0| b|2018|
| 0.0| c|2018|
| 1.0| d|2018|
+----+----+----+
However, the JSON file is stored as:
{"col1":1.0,"col2":"a"}
{"col1":0.0,"col2":"b"}
{"col1":0.0,"col2":"c"}
{"col1":1.0,"col2":"d"}
The extra column is not added if year=2018 is removed from the path. What can be the issue here?
I am running Spark 1.6.2 with Scala 2.10.5
Could you try:
val newDF = sqlContext.read.json("/home/neelesh/year=2018")
newDF.show
+----+----+
|col1|col2|
+----+----+
| 1.0| a|
| 0.0| b|
| 0.0| c|
| 1.0| d|
+----+----+
Quoting from the Spark 1.6 documentation:
"Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths by default. For the above example, if users pass path/to/table/gender=male to either SQLContext.read.parquet or SQLContext.read.load, gender will not be considered as a partitioning column."
Spark uses the directory structure field=value as partition information; see https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#partition-discovery
So in your case year=2018 is treated as a year partition and thus adds an additional column.
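Conversely, if you do want year to appear as a column while reading only that subdirectory, here is a sketch using the basePath data source option (paths taken from your question):

val newDF = sqlContext.read
  .option("basePath", "/home/neelesh/")
  .json("/home/neelesh/year=2018/")
newDF.show()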
I use Spark 1.6.2
I have epochs like this:
+--------------+-------------------+-------------------+
|unix_timestamp|UTC |Europe/Helsinki |
+--------------+-------------------+-------------------+
|1491771599 |2017-04-09 20:59:59|2017-04-09 23:59:59|
|1491771600 |2017-04-09 21:00:00|2017-04-10 00:00:00|
|1491771601 |2017-04-09 21:00:01|2017-04-10 00:00:01|
+--------------+-------------------+-------------------+
The default timezone is the following on the Spark machines:
#timezone = DefaultTz: Europe/Prague, SparkUtilTz: Europe/Prague
which is the output of:
logger.info("#timezone = DefaultTz: {}, SparkUtilTz: {}", TimeZone.getDefault.getID, org.apache.spark.sql.catalyst.util.DateTimeUtils.defaultTimeZone.getID)
I want to count the timestamps grouped by date and hour in the given timezone (currently Europe/Helsinki, which is UTC+3).
What I expect:
+----------+---------+-----+
|date |hour |count|
+----------+---------+-----+
|2017-04-09|23 |1 |
|2017-04-10|0 |2 |
+----------+---------+-----+
Code (using from_utc_timestamp):
def getCountsPerTime(sqlContext: SQLContext, inputDF: DataFrame, timeZone: String, aggr: String): DataFrame = {
  import sqlContext.implicits._

  val onlyTime = inputDF.select(
    from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), timeZone).alias("time")
  )

  val visitsPerTime =
    if (aggr.equalsIgnoreCase("hourly")) {
      onlyTime.groupBy(
        date_format($"time", "yyyy-MM-dd").alias("date"),
        date_format($"time", "H").cast(DataTypes.IntegerType).alias("hour")
      ).count()
    } else if (aggr.equalsIgnoreCase("daily")) {
      onlyTime.groupBy(
        date_format($"time", "yyyy-MM-dd").alias("date")
      ).count()
    }

  visitsPerTime.show(false)
  visitsPerTime
}
What I get:
+----------+---------+-----+
|date |hour |count|
+----------+---------+-----+
|2017-04-09|22 |1 |
|2017-04-09|23 |2 |
+----------+---------+-----+
Trying to wrap it with to_utc_timestamp:
def getCountsPerTime(sqlContext: SQLContext, inputDF: DataFrame, timeZone: String, aggr: String): DataFrame = {
  import sqlContext.implicits._

  val onlyTime = inputDF.select(
    to_utc_timestamp(from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), timeZone), DateTimeUtils.defaultTimeZone.getID).alias("time")
  )

  val visitsPerTime = ... // same as above

  visitsPerTime.show(false)
  visitsPerTime
}
What I get:
+----------+---------+-----+
|tradedate |tradehour|count|
+----------+---------+-----+
|2017-04-09|20 |1 |
|2017-04-09|21 |2 |
+----------+---------+-----+
How to get the expected result?
Your code does not work for me, so I couldn't replicate the last two outputs you got.
But I will give you some hints on how you can achieve the output you expect.
I am assuming you already have a dataframe like this:
+--------------+---------------------+---------------------+
|unix_timestamp|UTC |Europe/Helsinki |
+--------------+---------------------+---------------------+
|1491750899 |2017-04-09 20:59:59.0|2017-04-09 23:59:59.0|
|1491750900 |2017-04-09 21:00:00.0|2017-04-10 00:00:00.0|
|1491750901 |2017-04-09 21:00:01.0|2017-04-10 00:00:01.0|
+--------------+---------------------+---------------------+
I got this dataframe using the following code:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DataTypes

val inputDF = Seq(
  "2017-04-09 20:59:59",
  "2017-04-09 21:00:00",
  "2017-04-09 21:00:01"
).toDF("unix_timestamp")

val onlyTime = inputDF.select(
  unix_timestamp($"unix_timestamp").alias("unix_timestamp"),
  from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), "UTC").alias("UTC"),
  from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), "Europe/Helsinki").alias("Europe/Helsinki")
)

onlyTime.show(false)
Once you have the above dataframe, getting the output dataframe you want requires splitting the date, grouping, and counting, as below:
onlyTime.select(split($"Europe/Helsinki", " ")(0).as("date"), split(split($"Europe/Helsinki", " ")(1).as("time"), ":")(0).as("hour"))
.groupBy("date", "hour").agg(count("date").as("count"))
.show(false)
The resulting dataframe is
+----------+----+-----+
|date |hour|count|
+----------+----+-----+
|2017-04-09|23 |1 |
|2017-04-10|00 |2 |
+----------+----+-----+
Setting "spark.sql.session.timeZone" before the action seems to be reliable. Using this setting we can be sure that the timestamps that we use afterwards- does actually represent the time in the specified time zone. Without it (if we use from_unixtime or timestamp_seconds) we can't be sure which time zone is represented. Both those functions represent the current system time zone. And if afterwards we used to_utc_timestamp or from_utc_timestamp, we would only get a shift from the current system time zone. UTC does not necessarily come into play with the latter functions. This is why explicitly setting a time zone can be reliable. One thing to keep in mind is that the action(s) must be performed before spark.conf.unset("spark.sql.session.timeZone").
Scala
Input df:
import spark.implicits._
import org.apache.spark.sql.functions._
val inputDF = Seq(1491771599L,1491771600L,1491771601L).toDF("unix_timestamp")
inputDF.show()
// +--------------+
// |unix_timestamp|
// +--------------+
// | 1491771599|
// | 1491771600|
// | 1491771601|
// +--------------+
Result:
spark.conf.set("spark.sql.session.timeZone", "Europe/Helsinki")
val ts = from_unixtime($"unix_timestamp")
val DF = inputDF.groupBy(to_date(ts).alias("date"), hour(ts).alias("hour")).count()
DF.show()
// +----------+----+-----+
// | date|hour|count|
// +----------+----+-----+
// |2017-04-09| 23| 1|
// |2017-04-10| 0| 2|
// +----------+----+-----+
spark.conf.unset("spark.sql.session.timeZone")
PySpark
Input df:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1491771599,),(1491771600,),(1491771601,)], ['unix_timestamp'])
df.show()
# +--------------+
# |unix_timestamp|
# +--------------+
# | 1491771599|
# | 1491771600|
# | 1491771601|
# +--------------+
Result:
spark.conf.set("spark.sql.session.timeZone", "Europe/Helsinki")
ts = F.from_unixtime('unix_timestamp')
df_agg = df.groupBy(F.to_date(ts).alias('date'), F.hour(ts).alias('hour')).count()
df_agg.show()
# +----------+----+-----+
# | date|hour|count|
# +----------+----+-----+
# |2017-04-09| 23| 1|
# |2017-04-10| 0| 2|
# +----------+----+-----+
spark.conf.unset("spark.sql.session.timeZone")