I have a data frame that takes ~11GB when saved in Parquet format.
Reading it into a dataframe and writing it back out as JSON takes 5 minutes.
When I add partitionBy("day"), it takes hours to finish.
I understand that the distribution of data across partitions is the costly step.
Is there a way to make it faster? Would sorting the files make it better?
Example:
Runs in 5 minutes:
df = spark.read.parquet(source_path)
df.write.json(output_path)
Runs for hours:
spark.read.parquet(source_path).createOrReplaceTempView("source_table")
sql="""
select cast(trunc(date,'yyyymmdd') as int) as day, a.*
from source_table a"""
spark.sql(sql).write.partitionBy("day").json(output_path)
Try adding a repartition("day") before the write, like this:
spark
.sql(sql)
.repartition("day")
.write
.partitionBy("day")
.json(output_path)
It should speed up the write: after the repartition, all rows for a given day land in the same task, so each task writes one file per day instead of every task writing a small file into every day's directory.
Try adding repartition(n) with some starting number n, then increase or decrease it depending on how long the write takes:
spark
.sql(sql)
.repartition(n)
.write
.partitionBy("day")
.json(output_path)
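The two suggestions can also be combined, controlling the number of shuffle partitions while still keeping each day's rows together; a minimal sketch (the 200 is an arbitrary starting value to tune, not something prescribed by either answer):
(spark
    .sql(sql)
    .repartition(200, "day")  # hash-partition by "day" into 200 tasks; all rows for a given day stay in one task
    .write
    .partitionBy("day")
    .json(output_path))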
Related question:
This is my code:
spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")
spark_df1.count()  # This command took around 1.40 min to execute
spark_df1 = spark.read.option('header','True').csv("/mnt/gmclmprod/dsshare/cp106_rf_dev_final_apr20.csv.gz")
test_data = spark_df1.sample(fraction=0.001)
spark_df2 = spark_df1.subtract(test_data)
spark_df2.count()  # This command takes more than 20 min to execute. Can anyone help explain
# why the same count command takes so long here?
Why does count() take so much longer after using the subtract command?
The gist is that subtract is an expensive operation involving a join and a distinct, both of which incur shuffles, so it takes much longer than the plain spark_df1.count(). How much longer depends on the Spark executor configuration and the partitioning scheme. Please update the question as requested in the comments for a more in-depth analysis.
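If the goal is simply a train/test split, a cheaper pattern worth considering is randomSplit, which avoids subtract's join-and-distinct shuffle entirely (a sketch, not row-for-row equivalent to sample() followed by subtract()):
# randomSplit produces both splits in one pass; the weights are approximate fractions
train_df, test_df = spark_df1.randomSplit([0.999, 0.001], seed=42)
train_df.count()  # no expensive subtract against test_data is needed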
I have a streaming data frame in Spark reading from a Kafka topic, and I want to drop duplicates for the past 5 minutes every time a new record is parsed.
I am aware of the dropDuplicates(["uid"]) function, I am just not sure how to check for duplicates over a specific historic time interval.
My understanding is that the following:
df = df.dropDuplicates(["uid"])
either works on the data read in the current (micro)batch, or else on "anything" that is currently held in memory.
Is there a way to set the time for this de-duplication, using a "timestamp" column within the data?
Thanks in advance.
You can use withWatermark() so that Spark only keeps deduplication state for a bounded amount of event time, for example:
df\
    .withWatermark("event_time", "5 seconds")\
    .dropDuplicates(["User", "uid"])\
    .groupBy("User")\
    .count()\
    .writeStream\
    .queryName("pydeduplicated")\
    .format("memory")\
    .outputMode("complete")\
    .start()
For more info you can refer to:
https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
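Adapted to the question's 5-minute requirement, a minimal sketch (assuming the stream has the question's uid column plus an event-time column named timestamp; the console sink is just a placeholder):
deduped = (df
    .withWatermark("timestamp", "5 minutes")    # keep deduplication state for 5 minutes of event time
    .dropDuplicates(["uid", "timestamp"]))      # include the event-time column so expired state can be dropped

(deduped
    .writeStream
    .format("console")                          # placeholder sink
    .outputMode("append")
    .start())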
I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]).
I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24 cores, 126 GB RAM).
I have tried several memory setups and tuning options to make it work faster, but none of them made a huge impact.
I am sure there is something I am missing. Below is my final try, which took about 11 minutes to get this simple count, whereas it took only 40 seconds using a JDBC connection through R.
bin/pyspark --driver-memory 40g --executor-memory 40g
df = sqlContext.read.jdbc("jdbc:teradata://......)
df.count()
When I tried with a BIG table (5B records), no results were returned upon completion of the query.
All of the aggregation operations are performed after the whole dataset has been retrieved into memory as a DataFrame, so doing the count in Spark will never be as efficient as doing it directly in Teradata. Sometimes it's worth pushing some computation into the database by creating views and then mapping those views through the JDBC API.
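As an illustration, a hedged sketch (PySpark; the URL, credentials and table name are placeholders) of mapping a pushed-down subquery instead of the whole table, so the count is executed inside the database and only a single row crosses the JDBC connection:
row_count = sqlContext.read.jdbc(
    url="jdbc:teradata://<host>/<database>",
    table="(select count(*) as cnt from <TABLE>) t",  # the database performs the aggregation
    properties={"user": "<user>", "password": "<password>"})
row_count.show()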
Every time you use the JDBC driver to access a large table you should specify the partitioning strategy otherwise you will create a DataFrame/RDD with a single partition and you will overload the single JDBC connection.
Instead you want to try the following API (available since Spark 1.4.0):
sqlctx.read.jdbc(
url = "<URL>",
table = "<TABLE>",
columnName = "<INTEGRAL_COLUMN_TO_PARTITION>",
lowerBound = minValue,
upperBound = maxValue,
numPartitions = 20,
connectionProperties = new java.util.Properties()
)
There is also an option to push down some filtering.
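For instance (a sketch; the DAT_TME column name is borrowed from the example below, and whether a given filter actually gets pushed down depends on the Spark version and the filter type), a simple comparison on the resulting DataFrame is generally translated into a WHERE clause and executed by the database:
df = sqlContext.read.jdbc(
    url="<URL>",
    table="<TABLE>",
    properties={"user": "<user>", "password": "<password>"})
# A comparison like this can be pushed down as a WHERE clause,
# so far fewer rows have to travel over the JDBC connection
recent = df.filter(df["DAT_TME"] >= "2015-07-01")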
If you don't have a uniformly distributed integral column, you can create custom partitions by specifying custom predicates (WHERE statements). For example, let's suppose you have a timestamp column and want to partition by date ranges:
val predicates =
Array(
"2015-06-20" -> "2015-06-30",
"2015-07-01" -> "2015-07-10",
"2015-07-11" -> "2015-07-20",
"2015-07-21" -> "2015-07-31"
)
.map {
case (start, end) =>
s"cast(DAT_TME as date) >= date '$start' AND cast(DAT_TME as date) <= date '$end'"
}
predicates.foreach(println)
// Below is the result of how predicates were formed
//cast(DAT_TME as date) >= date '2015-06-20' AND cast(DAT_TME as date) <= date '2015-06-30'
//cast(DAT_TME as date) >= date '2015-07-01' AND cast(DAT_TME as date) <= date '2015-07-10'
//cast(DAT_TME as date) >= date '2015-07-11' AND cast(DAT_TME as date) <= date '2015-07-20'
//cast(DAT_TME as date) >= date '2015-07-21' AND cast(DAT_TME as date) <= date '2015-07-31'
sqlctx.read.jdbc(
url = "<URL>",
table = "<TABLE>",
predicates = predicates,
connectionProperties = new java.util.Properties()
)
It will generate a DataFrame where each partition contains the records of the subquery associated with the corresponding predicate.
Check the source code at DataFrameReader.scala
Does the unserialized table fit into 40 GB? If it starts swapping to disk, performance will decrease dramatically.
Anyway, when you use standard JDBC with ANSI SQL syntax you leverage the DB engine, so if Teradata (I don't know Teradata) keeps statistics about your table, a classic "select count(*) from table" will be very fast.
Spark, instead, loads your 100 million rows into memory with something like "select * from table" and then performs a count over the RDD rows. It's a pretty different workload.
One solution that differs from the others is to save the data from the database table into Avro files (partitioned into many files) stored on Hadoop.
This way, reading those Avro files with Spark is a piece of cake, since you no longer hit the database at all.
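A minimal sketch of that approach (assuming a spark-avro package is available; depending on the Spark version the format name may be "avro" or "com.databricks.spark.avro", and the HDFS path is a placeholder):
# One-time export of the JDBC DataFrame into many Avro part files on HDFS
df.write.format("avro").save("hdfs:///data/table_extract")

# Later reads skip the database entirely
fast_df = sqlContext.read.format("avro").load("hdfs:///data/table_extract")
fast_df.count()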
I have a working PySpark windowing function (Spark 2.0) that looks back over the last 30 days (86400*30 seconds) and counts the number of times each action in column 'a' happens per id. The dataset I am applying this function to has multiple records for every day between '2018-01-01' and '2018-04-01'. Because this is a 30-day look-back, I don't want to apply the function to rows that don't have a full 30 days of history, so for convenience I want to start my counts on Feb 1st. I can't filter out January, because it is needed for February's counts. I know I can just put a filter on the new dataframe and drop the rows before February, but is there a way to do it without that extra step? It would be nice not to have to perform those calculations, which could save time.
Here's the code:
from pyspark.sql import Window
from pyspark.sql import functions as F
windowsess = Window.partitionBy("id", "a").orderBy("ts").rangeBetween(-86400*30, Window.currentRow)
df4 = df3.withColumn("2h4_ct", F.count("a").over(windowsess))
Mockup of the current dataset. I didn't want to convert the ts column by hand, so I wrote in a substitute for it.
id,a,timestamp,ts
1,soccer,2018-01-01 10:41:00, <unix_timestamp>
1,soccer,2018-01-13 10:40:00, <unix_timestamp>
1,soccer,2018-01-23 10:39:00, <unix_timestamp>
1,soccer,2018-02-01 10:38:00, <unix_timestamp>
1,soccer,2018-02-03 10:37:00, <unix_timestamp>
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>
With my made-up sample data, I want to return the following rows:
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1
instead I get this:
1,soccer,2018-01-01 10:41:00, <unix_timestamp>,1
1,soccer,2018-01-13 10:40:00, <unix_timestamp>,2
1,soccer,2018-01-23 10:39:00, <unix_timestamp>,3
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1
What if you use:
df4 = df3.groupby(['id', 'a', 'timestamp']).count()
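Alternatively, sticking with the windowing approach from the question, the straightforward route remains to compute the window over all the data (January included) and then filter the result down to February onward; a sketch reusing df3, windowsess and the column names from the question:
df4 = (df3
    .withColumn("2h4_ct", F.count("a").over(windowsess))  # the window still sees the January rows
    .filter(F.col("timestamp") >= "2018-02-01"))           # then drop rows without a full 30-day look-back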
I'm using Spark to read streaming data from Kinesis via the Structured Streaming framework; my connection is as follows:
val kinesis = spark
.readStream
.format("kinesis")
.option("streams", streamName)
.option("endpointUrl", endpointUrl)
.option("initialPositionInStream", "earliest")
.option("format", "json")
.schema(<my-schema>)
.load
The data comes from several IoT devices, each with a unique id. I need to aggregate the data by this id and by a tumbling window over the timestamp field, as follows:
val aggregateData = kinesis
.groupBy($"uid", window($"timestamp", "15 minute", "15 minute"))
.agg(...)
The problem I'm encountering is that I need to guarantee that every window starts at round times (such as 00:00:00, 00:15:00 and so on). I also need a guarantee that only rows corresponding to full 15-minute windows are output to my sink. What I'm currently doing is:
val query = aggregateData
.writeStream
.foreach(postgreSQLWriter)
.outputMode("update")
.start()
.awaitTermination()
where postgreSQLWriter is a ForeachWriter I created for inserting each row into a PostgreSQL DBMS. How can I force my windows to be exactly 15 minutes long and to start at round 15-minute timestamps for each device's unique id?
Question 1:
To make windows start at specific times, Spark's window grouping function takes one more parameter, an "offset" (called startTime in the API).
By specifying it, the window boundaries are shifted by the given offset from the top of the hour.
Example:
dataframe.groupBy($"Column1",window($"TimeStamp","22 minute","1 minute","15 minute"))
So the above syntax will group by Column1 and create windows of 22-minute duration, with a sliding interval of 1 minute and an offset of 15 minutes.
For example, it starts from:
window1: 8:15 (8:00 plus the 15-minute offset) to 8:37 (8:15 plus 22 minutes)
window2: 8:16 (previous window start + 1 minute) to 8:38 (22-minute size again)
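Applied to the question's 15-minute case, a sketch in PySpark (assuming the question's kinesis stream is loaded as a DataFrame with uid and timestamp columns): a plain 15-minute tumbling window already starts at round clock times, because windows are aligned to the Unix epoch, and the optional startTime argument shifts that alignment if it is ever needed:
from pyspark.sql import functions as F

aggregateData = (kinesis
    .groupBy(F.col("uid"),
             F.window(F.col("timestamp"), "15 minutes"))  # tumbling windows at :00, :15, :30, :45
    .agg(F.count(F.lit(1)).alias("events")))
# F.window(F.col("timestamp"), "15 minutes", "15 minutes", "5 minutes") would instead
# align the windows to :05, :20, :35, :50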
Question 2:
To push only those windows that cover a full 15-minute span, create a count column that counts the number of events in each window. Once it reaches 15, push the window wherever you want using a filter.
Calculating the count:
dataframe.groupBy($"Column1",window($"TimeStamp","22 minute","1 minute","15 minute")).agg(count*$"Column2").as("count"))
writeStream, filtering for windows whose count is exactly 15:
aggregateddata.filter($"count"===15).writeStream.format(....).outputMode("complete").start()