Technology: Spark 3.0.3 with Scala 2.12.10
I'm trying to pivot a Spark DataFrame with 238 million records (Parquet files totalling 1.1 GB), registered as a Spark tempView with 4 columns: (timestamp, asset, tag, value).
I had to pivot the DataFrame on the tag column, so I fetched the distinct values of tag and passed them into the IN clause.
SQL Query Used:
SELECT *
FROM (
    SELECT timestamp, tag, value
    FROM temp_table
    GROUP BY timestamp, tag, value
)
PIVOT (
    AVG(value) FOR tag IN (
        'tagval_1',
        'tagval_2',
        ...,
        'tagval_4000'
    )
)
I have close to 4,000 distinct values in the tag column.
Configuration used to run the Query:
Spark Driver: 5GB
Spark Executors: 14 executors with 7GB memory and 2 cores each.
I always end up getting a JVM memory overhead exception, so I kept increasing the driver's memory in 5GB increments; eventually, with 30GB of driver memory, the pivot query ran, taking a whopping 25 minutes to finish.
Am I doing something wrong? Is this resource usage justified? How can a total of 1.1GB of raw files take so many resources to pivot?
Any help would be appreciated.
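One workaround that is sometimes suggested for very wide pivots is to pivot in chunks: split the ~4,000 tag values into smaller groups, pivot each group separately, and join the narrower results back on timestamp. The sketch below only builds the per-chunk SQL strings; the chunk size of 500, the helper name, and the assumption that tag values are strings are all mine, not from the question.

```python
# Sketch: build one PIVOT query per chunk of tag values instead of a single
# 4,000-column pivot. Each chunk filters the source rows to its own tags,
# so every per-chunk job is much smaller. Results would then be joined back
# on `timestamp` (join step not shown).
def chunked_pivot_queries(tags, chunk_size=500, table="temp_table"):
    """Build one Spark SQL PIVOT query per chunk of tag values."""
    queries = []
    for start in range(0, len(tags), chunk_size):
        chunk = tags[start:start + chunk_size]
        in_list = ", ".join(f"'{t}'" for t in chunk)
        queries.append(
            "SELECT * FROM ("
            f"SELECT timestamp, tag, value FROM {table} "
            f"WHERE tag IN ({in_list}) "
            "GROUP BY timestamp, tag, value"
            f") PIVOT (AVG(value) FOR tag IN ({in_list}))"
        )
    return queries

tags = [f"tagval_{i}" for i in range(1, 4001)]
queries = chunked_pivot_queries(tags)
print(len(queries))  # 8
```

Each of the 8 queries produces ~500 pivoted columns, which keeps the per-query planning and execution memory far below that of one 4,000-column pivot.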
Related
I'm trying to check the size of the different tables we're generating in our data warehouse, so we can have an automatic way to calculate partition size in next runs.
In order to get the table size I'm getting the stats from dataframes in the following way:
val db = "database"
val table_name = "table_name"
val table_size_bytes = spark.read.table(s"$db.$table_name").queryExecution.analyzed.stats.sizeInBytes
This was working fine until I started running the same code on partitioned tables. Each time I ran it on a partitioned table I got the same value for sizeInBytes: 9223372036854775807, which is Long.MaxValue, the default Spark falls back to when it has no statistics for the relation.
Is this a bug in Spark or should I be running this in a different way for partitioned tables?
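One fallback, when the statistics come back as Long.MaxValue, is to measure the table's storage directly by summing the sizes of the data files under its warehouse location. The sketch below walks a local filesystem path; for tables on HDFS you would do the equivalent listing with the Hadoop FileSystem API. The function name and path layout are illustrative assumptions.

```python
import os

# Sketch of a size fallback for partitioned tables: recursively sum the
# on-disk sizes of all files under the table's storage location (which
# includes every partition subdirectory such as part=.../).
def dir_size_bytes(path):
    """Total size in bytes of all files under `path`, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

This gives the physical bytes on disk, which is also closer to what you want for sizing partitions in later runs than the planner's logical estimate.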
I had created a bucketed table using below command in Spark:
df.write.bucketBy(200, "UserID").sortBy("UserID").saveAsTable("topn_bucket_test")
Size of Table : 50 GB
Then I joined another table (say t2, size 70 GB, bucketed the same way) with the table above on the UserID column. In the execution plan I found that topn_bucket_test was being sorted (but not shuffled) before the join, whereas I expected it to be neither shuffled nor sorted, since it was bucketed. What can be the reason, and how can I remove the sort phase for topn_bucket_test?
As far as I know, it is not possible to avoid the sort phase here. Even when both tables use the same bucketBy call, the physical layout inside the buckets is unlikely to be identical. Imagine the first table having UserID values ranging from 1 to 1000 and the second from 1 to 2000: different UserIDs land in the same 200 buckets, and a bucket can consist of multiple files, so the bucket as a whole is not guaranteed to be sorted and holds many different (and unsorted!) UserIDs. The sort-merge join therefore still has to sort each side.
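The point about many UserIDs sharing a bucket can be made concrete with a toy model. Spark actually buckets by a Murmur3 hash of the column; plain modulo is used below only as a stand-in to show the shape of the problem.

```python
# Toy illustration: both tables map UserID into the same 200 buckets, but
# each bucket contains many distinct UserIDs, and nothing guarantees the
# rows of a bucket (possibly spread over several files) arrive globally
# sorted, so the sort-merge join still needs its sort step.
def bucket_of(user_id, num_buckets=200):
    return user_id % num_buckets  # stand-in for Spark's Murmur3 bucketing

# Table 1: UserIDs 1..1000; table 2: UserIDs 1..2000. Look at bucket 0:
bucket0_t1 = [u for u in range(1, 1001) if bucket_of(u) == 0]
bucket0_t2 = [u for u in range(1, 2001) if bucket_of(u) == 0]
print(bucket0_t1)  # [200, 400, 600, 800, 1000]
```

Bucket 0 of the first table holds 5 distinct UserIDs and bucket 0 of the second holds 10, so matching them requires each side to be sorted on UserID first.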
Spark DF: jrny_df1.createOrReplaceTempView("journeymap_drvs1")
approx. 10 million records
Creating a SQL table from this view takes a long time:
create table temp.ms_journey_drvsv1 as select * from journeymap_drvs1;
Is there any process I can follow to optimize the speed of the table creation? We use Spark 2.4, 88 cores, 671 GB memory.
Check the cluster configuration; after that, partition the DataFrame accordingly so that parallelism can be achieved, which will eventually reduce the time.
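One way to make that advice concrete is to derive a partition count from the data size and the available cores. The 128 MB per-partition target and the "at least one partition per core" rule below are common rules of thumb, not Spark defaults you must use, and the helper name is mine.

```python
import math

# Rough sizing sketch: choose enough partitions that each is ~128 MB,
# but never fewer than the number of cores, so every core gets work.
def target_partitions(size_bytes, total_cores, target_mb=128):
    by_size = math.ceil(size_bytes / (target_mb * 1024 * 1024))
    return max(by_size, total_cores)

# e.g. ~10 GB of data on the 88-core cluster from the question:
print(target_partitions(10 * 1024**3, 88))  # 88
```

You would then call df.repartition(n) (or set spark.sql.shuffle.partitions) with the resulting number before writing the table.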
I have created a table1 (partitioned by batchDate and batchID) comprising 11,000 columns. table1 is loaded with part files by some other process, usually 30 part files of average size 8 MB each. I am writing a process to insert data from table1 into another table2 after repartitioning it. I pass a dynamic value into the repartition method: if I expect all partitions to be 100 MB each, I calculate the size of the data in the source folder and divide it by 100 MB.
So the number of partitions would be: size of source folder / 100 MB, for any given batch.
But the job can't even complete this write operation for 226 MB of data in the source folder; it runs forever.
val df = spark.sql("select * from table1 where batchdt='20190716' and batchid = '20190716073'")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
df.repartition(2).write.insertInto("table2")
Running the above code using config:
spark-shell --master yarn --driver-memory 3g --num-executors 3 --executor-cores 3 --executor-memory 2g
Not sure what I am missing here. Would really appreciate any help.
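The sizing rule described above (partitions = source folder size / 100 MB) can be sketched as plain arithmetic; the function name and the floor of one partition are my additions.

```python
import math

# Sketch of the dynamic repartition count from the question: one partition
# per ~100 MB of source data, never fewer than one.
def partitions_for(source_size_mb, target_mb=100):
    return max(1, math.ceil(source_size_mb / target_mb))

print(partitions_for(226))  # 3
```

For the 226 MB batch this gives 3 partitions rather than the 2 used in the snippet, though with 11,000 columns the per-row overhead, not the partition count, is the more likely culprit for the slow write.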
This question is a spin-off from this one (saving a list of rows to a Hive table in pyspark).
EDIT: please see my update edits at the bottom of this post.
I have used both Scala and now PySpark to do the same task, but I am having problems with VERY slow saves of a DataFrame to Parquet or CSV, and with converting a DataFrame to a list or array type data structure. Below is the relevant Python/PySpark code and info:
# Table is a list of Rows from a small Hive table I loaded using:
# query = "SELECT * FROM Table"
# Table = sqlContext.sql(query).collect()
for i in range(len(Table)):
    val1 = Table[i][0]
    val2 = Table[i][1]
    count = Table[i][2]
    x = 100 - count
    # hivetemp is a table that I copied from Hive to my HDFS using:
    # CREATE EXTERNAL TABLE IF NOT EXISTS hivetemp LIKE hivetableIwant2copy LOCATION "/user/name/hiveBackup";
    # INSERT OVERWRITE TABLE hivetemp SELECT * FROM hivetableIwant2copy;
    query = "SELECT * FROM hivetemp WHERE col1<>\"" + val1 + "\" AND col2 ==\"" + val2 + "\" ORDER BY RAND() LIMIT " + str(x)
    rows = sqlContext.sql(query)
    rows = rows.withColumn("col4", lit(10))
    rows = rows.withColumn("col5", lit(some_string))
    # writing to parquet is heck slow AND I can't work with pandas due to the library not being installed on the server
    rows.write.parquet("rows" + str(i) + ".parquet")
    # tried this before and it was heck slow also
    # rows_list = rows.collect()
    # shuffle(rows_list)
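For context, the expensive part of each iteration is the ORDER BY RAND() LIMIT x query, which sorts all candidate rows just to pick a random subset. The same selection can be expressed in plain Python as below; doing the sampling per group in a single Spark job (for example with row_number() over a random ordering, partitioned by the group key) would avoid launching thousands of separate queries. The helper name and seed are illustrative only.

```python
import random

# Plain-Python sketch of one loop iteration's sampling: pick `100 - count`
# random rows from the candidates for a single (val1, val2) pair.
def sample_rows(candidate_rows, count, seed=None):
    x = max(0, 100 - count)
    rng = random.Random(seed)
    return rng.sample(candidate_rows, min(x, len(candidate_rows)))

picked = sample_rows(list(range(1000)), count=40, seed=7)
print(len(picked))  # 60
```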
I have tried to do the above in Scala and had similar problems. I could easily load the Hive table or a query against it, but doing a random shuffle or storing a large DataFrame runs into memory issues. There were also some challenges with adding the 2 extra columns.
The Hive table (hiveTemp) that I want to add rows to has 5,570,000 (~5.5 million) rows and 120 columns.
The Hive table that I am iterating through in the for loop has 5,000 rows and 3 columns. There are 25 unique val1 values (a column in hiveTemp), and 3,000 combinations of val1 and val2. Val2 could be one of 5 columns and its specific cell value. This means that if I tweaked the code, I could reduce the lookups from 5,000 down to 26, but the number of rows I would have to retrieve, store, and randomly shuffle would be pretty large, and hence a memory issue (unless anyone has suggestions on this).
In total, I need to add about 100,000 rows to the table.
The ultimate goal is to have the original table of 5.5 million rows appended with the 100k+ rows, written as a Hive or Parquet table. If it's easier, I am fine with writing the 100k rows into their own table, to be merged with the 5.5 million row table later.
Scala or Python is fine, though Scala is preferred.
Any advice on this and the options that would be best would be great.
Thanks a lot!
EDIT
Some additional thoughts I had on this problem:
I used the hash partitioner to partition the Hive table into 26 partitions, based on a column that has 26 distinct values. The operations I want to perform in the for loop could be generalized so that each only needs to happen on one of these partitions.
That being said, how could I write the Scala code (or what guide can I look at online) so that a separate executor runs each of these loops on its own partition? I am thinking this would make things much faster.
I know how to do something like this using multithreading, but not how to do it in the Scala/Spark paradigm.
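The fan-out described above can be driven from the driver with an ordinary thread pool, since Spark's scheduler accepts jobs from multiple threads concurrently. The sketch below uses a toy work function in place of the per-partition loop body; the worker count, function, and key names are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: run one task per partition key concurrently. In a real Spark
# program, process_partition would filter the DataFrame to its key and
# run that iteration's query/sample/write; here a stub stands in for it.
def process_partition(key):
    return key, f"processed_{key}"

keys = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")  # the 26 distinct partition values
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(process_partition, keys))

print(len(results))  # 26
```

Each thread submits its own Spark job, so up to 8 of the 26 per-partition loops run at once instead of sequentially.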