Spark Standalone java.lang.OutOfMemoryError - scala

I have a Spark Standalone application on a server with 64 GB of RAM. The amount of data I want to process is much larger than the amount of RAM I can afford.
I'm reading a huge amount of data into one big table and trying to do a self-join on it. The pseudo-code looks like this:
val df = spark.read.parquet("huge_table.parquet")
val df2 = df.select(...).withColumn(...) // some data manipulations
df.as("df1").join(df2.as("df2"), $"df1.store_name" == $"df2.store_name" && $"df1.city_id" === $"df2.city_id")
My executor settings look like this: --driver-memory 8g --executor-memory 32g.
spark-defaults.conf:
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
The problem is that no matter what I do, I get
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:53)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:472)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:142)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.<init>(WindowExec.scala:310)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14.apply(WindowExec.scala:290)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14.apply(WindowExec.scala:289)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
I've tried persisting the tables before the data manipulations, e.g. df.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER).
I know I don't have enough resources to process the data efficiently, but what can I do to make the process finish, no matter how long it takes?
UPDATE
df
+----+------------+---------+
| ID | store_name | city_id |
+----+------------+---------+
|  1 | Apple ...  |      22 |
|  2 | Apple ...  |      33 |
|  3 | Apple ...  |      44 |
+----+------------+---------+
df2
+----+------------+---------+---------+-------------+
| ID | store_name | city_id | sale_id | sale_amount |
+----+------------+---------+---------+-------------+
|  1 | Apple ...  |      33 |       1 |         $30 |
|  2 | Apple ...  |      44 |       2 |         $50 |
|  3 | Apple ...  |      44 |       3 |         $50 |
|  4 | Apple ...  |      44 |       4 |         $50 |
|  5 | Apple ...  |      44 |       5 |         $40 |
+----+------------+---------+---------+-------------+
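One direction that is commonly suggested for this kind of memory pressure is to raise shuffle parallelism so each sorted join partition fits in the executor heap, and to keep intermediate results on disk instead of in memory. A minimal sketch, reusing the column names from the question; the partition count, the DISK_ONLY storage level, and the output path are illustrative assumptions, not tuned or given values:

// Sketch only: the numbers and paths below are assumptions to be tuned, not values from the question.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("huge-self-join")
  .config("spark.sql.shuffle.partitions", "2000") // many small shuffle partitions instead of a few huge ones
  .getOrCreate()
import spark.implicits._

val df  = spark.read.parquet("huge_table.parquet")
val df2 = df.select($"store_name", $"city_id")    // keep only the columns the join actually needs

val joined = df.as("df1")
  .join(df2.as("df2"),
        $"df1.store_name" === $"df2.store_name" && $"df1.city_id" === $"df2.city_id")
  .persist(StorageLevel.DISK_ONLY)                // stage the result on disk rather than caching in memory

joined.write.parquet("joined_output.parquet")     // hypothetical output path

Writing the result straight back to Parquet (rather than collecting it) keeps the 8g driver heap out of the picture; whether the job finishes in reasonable time still depends on how skewed the store_name/city_id keys are.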

Related

Transform structure of Spark DF. Create one column or row for each value in a column. Impute values [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I have a Spark DF with the following structure:
+------+-----------------------+--------+
| user | time                  | counts |
+------+-----------------------+--------+
|    1 | 2018-06-04 16:00:00.0 |      5 |
|    1 | 2018-06-04 17:00:00.0 |      7 |
|    1 | 2018-06-04 17:30:00.0 |      7 |
|    1 | 2018-06-04 18:00:00.0 |      8 |
|    1 | 2018-06-04 18:30:00.0 |     10 |
|    1 | 2018-06-04 19:00:00.0 |      9 |
|    1 | 2018-06-04 20:00:00.0 |      7 |
|    2 | 2018-06-04 17:00:00.0 |      4 |
|    2 | 2018-06-04 18:00:00.0 |      4 |
|    2 | 2018-06-04 18:30:00.0 |      5 |
|    2 | 2018-06-04 19:30:00.0 |      7 |
|    3 | 2018-06-04 16:00:00.0 |      6 |
+------+-----------------------+--------+
It was obtained from an event-log table using the following code:
ranked.groupBy($"user", sql.functions.window($"timestamp", "30 minutes"))
  .agg(sum("id").as("counts"))
  .withColumn("time", $"window.start")
As can be seen from the time column, not all 30-minute intervals registered events for each user, i.e. the per-user groups of rows are not of equal length. I'd like to impute the missing time values (possibly with NAs or 0s) and create a table (or RDD) like the following:
+-----------------------------------------------------------------------------+
| user| 2018-06-04 16:00:00 | 2018-06-04 16:30:00 | 2018-06-04 17:00:00 | ... |
+-----------------------------------------------------------------------------+
| 1 | 5 | NA (or 0) | 7 | ... |
| 2 | NA (or 0) | NA (or 0) | 4 | ... |
| 3 | 6 | NA (or 0) | NA (or 0) | ... |
+-----------------------------------------------------------------------------+
The transpose of the table above (with a time column and a column for the counts of each user) would theoretically work too, but I am not sure it would be optimal Spark-wise, as I have almost a million different users.
How can I perform a table re-structuring like described?
If each time window appears for at least one user, a simple pivot would do the trick (and put null for missing values). With millions of rows, that should be the case.
val reshaped_df = df.groupBy("user").pivot("time").agg(sum('counts))
In case a column is still missing, you could access the list of columns with reshaped_df.columns and then add the missing ones. You would need to generate the list of columns you expect (expected_columns) and then create the missing ones as follows:
val expected_columns = ??? // the full list of time-window column names you expect
var result = reshaped_df
expected_columns.foreach { c =>
  if (!result.columns.contains(c))
    result = result.withColumn(c, lit(null))
}
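For completeness, one way to build expected_columns is to enumerate every 30-minute boundary in the period the data covers. A sketch, assuming the pivoted column names are window-start timestamps rendered as "yyyy-MM-dd HH:mm:ss" strings and that the start/end bounds below are placeholders (check reshaped_df.columns for the exact naming your pivot produces):

// Sketch only: the date range and the column-name format are assumptions.
import java.sql.Timestamp
import java.time.LocalDateTime

val start = LocalDateTime.parse("2018-06-04T16:00:00") // placeholder bounds
val end   = LocalDateTime.parse("2018-06-04T20:00:00")

// every 30-minute boundary between start and end, formatted like the pivot headers
val expected_columns: Seq[String] =
  Iterator.iterate(start)(_.plusMinutes(30))
    .takeWhile(!_.isAfter(end))
    .map(t => Timestamp.valueOf(t).toString.stripSuffix(".0"))
    .toSeq

The foreach loop above can then add an all-null column for any of these windows that the pivot did not produce.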

How to Join data from a dataframe

I have one table with many types of data, and some rows carry a piece of information that is really important for analysing the rest of the data.
This is the table that I have
name     | player_id | data_ms | coins | progress
---------+-----------+---------+-------+---------
progress |      1223 |      10 |       |      128
complete |      1223 |      11 |   154 |
win      |      1223 |       9 |   111 |
progress |      1223 |      11 |       |      129
played   |      1111 |      19 |   141 |
progress |      1111 |      25 |       |      225
This is the table that I want
name     | player_id | data_ms | coins | progress
---------+-----------+---------+-------+---------
progress |      1223 |      10 |       |      128
complete |      1223 |      11 |   154 |      128
win      |      1223 |       9 |   111 |      129
progress |      1223 |      11 |       |      129
played   |      1111 |      19 |   141 |      225
progress |      1111 |      25 |       |      225
I need to find the progress of the player, using the condition that it has to be the first progress value emitted after the data_ms (epoch Unix timestamp) of this event.
My table has 4 billion rows of data, and it's partitioned by date.
I tried to create a UDF that would read and filter the table, but that's not an option, since you can't serialize the Spark session into a UDF.
Any idea of how should I do this?
It seems like you want to fill gaps in the progress column. I didn't really understand the condition, but if it's based on data_ms, then your Hive query should look like this:
dataFrame.createOrReplaceTempView("your_table")
val progressDf = sparkSession.sql(
  """
    |SELECT name, player_id, data_ms, coins,
    |       COALESCE(progress,
    |                LAST_VALUE(progress, TRUE) OVER (PARTITION BY player_id
    |                                                 ORDER BY data_ms
    |                                                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) AS progress
    |FROM your_table
  """.stripMargin
)
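If the condition really is the first progress value emitted after the current row (rather than the last one before it), the window can look forward instead of backward. A sketch under that assumption, reusing the table and column names from the question:

// Sketch only: assumes the requirement is "next non-null progress at or after this row's data_ms".
val progressForwardDf = sparkSession.sql(
  """
    |SELECT name, player_id, data_ms, coins,
    |       COALESCE(progress,
    |                FIRST_VALUE(progress, TRUE) OVER (PARTITION BY player_id
    |                                                  ORDER BY data_ms
    |                                                  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS progress
    |FROM your_table
  """.stripMargin
)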

Redshift Distribution By Child Columns

My Situation
I have some tables in my Redshift cluster that all break down into either an order_id, shipment_id, or shipment_item_id, depending on how granular the table is. order_id is a one-to-many relationship on shipment_id, and shipment_id is a one-to-many relationship on shipment_item_id.
My Question
I distribute on order_id, so all shipment_id and shipment_item_id records should be on the same nodes across the tables, since they are grouped by order_id. My question is: when I have to join on shipment_id or shipment_item_id, will Redshift know that the records are on the same nodes, or will it still broadcast the tables since they aren't joined on order_id?
Example Tables
unified_order
+----------+-------------+------------------+
| order_id | shipment_id | shipment_item_id |
+----------+-------------+------------------+
|        1 |           1 |                1 |
|        1 |           1 |                2 |
|        1 |           1 |                3 |
|        1 |           2 |                4 |
|        1 |           2 |                5 |
|        1 |           3 |                6 |
|        2 |           4 |                7 |
|        2 |           4 |                8 |
|        3 |           5 |                9 |
|        3 |           5 |               10 |
|        4 |           6 |               11 |
|        5 |           7 |               12 |
|        5 |           7 |               13 |
+----------+-------------+------------------+

shipment_details
+-------------+-----------+--------------+
| shipment_id | ship_day  | ship_details |
+-------------+-----------+--------------+
|           1 | 1/1/2017  | stuff        |
|           2 | 5/1/2017  | other stuff  |
|           3 | 6/14/2017 | more stuff   |
|           4 | 5/13/2017 | less stuff   |
|           5 | 6/19/2017 | that stuff   |
|           6 | 7/31/2017 | what stuff   |
|           7 | 2/5/2017  | things       |
+-------------+-----------+--------------+
Distribution
distribution_by_node
+------+----------+-------------+------------------+
| node | order_id | shipment_id | shipment_item_id |
+------+----------+-------------+------------------+
|    1 |        1 |           1 |                1 |
|    1 |        1 |           1 |                2 |
|    1 |        1 |           1 |                3 |
|    1 |        1 |           2 |                4 |
|    1 |        1 |           2 |                5 |
|    1 |        1 |           3 |                6 |
|    1 |        5 |           7 |               12 |
|    1 |        5 |           7 |               13 |
|    2 |        2 |           4 |                7 |
|    2 |        2 |           4 |                8 |
|    3 |        3 |           5 |                9 |
|    3 |        3 |           5 |               10 |
|    4 |        4 |           6 |               11 |
+------+----------+-------------+------------------+
The Amazon Redshift documentation does not go into detail about how information is shared between nodes, but it is doubtful that it "broadcasts the tables".
Rather, information is probably sent between nodes based on need -- only the relevant columns would be shared, and possibly only sub-ranges of the data.
Rather than worrying too much about the internal implementation, you should test various DISTKEY and SORTKEY strategies against real queries to determine performance.
Follow the recommendations from Choose the Best Distribution Style to minimize the amount of data that needs to be sent between nodes and consult Amazon Redshift Best Practices for Designing Queries to improve queries.
You can EXPLAIN your query to see how data will be distributed (or not) during the execution. In this doc you'll see how to read the query plan:
Evaluating the Query Plan

how to flatten rows to columns in postgreSQL

Using PostgreSQL 9.3, I have a table that shows individual permits issued across a single year, shown below:
permit_typ      | zipcode | address          | name
----------------+---------+------------------+--------------
CONSTRUCTION    | 20004   | 124 fake streeet | billy joe
SUPPLEMENTAL    | 20005   | 124 fake streeet | james oswald
POST CARD       | 20005   | 124 fake streeet | who cares
HOME OCCUPATION | 20007   | 124 fake streeet | who cares
SHOP DRAWING    | 20009   | 124 fake streeet | who cares
I am trying to flatten this so it looks like
CONSTRUCTION | SUPPLEMENTAL | POST CARD | HOME OCCUPATION | SHOP DRAWING | zipcode
-------------+--------------+-----------+-----------------+--------------+--------
           1 |            2 |         3 |               5 |            6 | 20004
           1 |            2 |         3 |               5 |            6 | 20005
           1 |            2 |         3 |               5 |            6 | 20006
           1 |            2 |         3 |               5 |            6 | 20007
           1 |            2 |         3 |               5 |            6 | 20008
I have been trying to use crosstab, but it's a bit above my rusty SQL experience. Does anybody have any ideas?
I usually approach this type of query using conditional aggregation. In Postgres, you can do:
select zipcode,
       sum( (permit_typ = 'CONSTRUCTION')::int ) as construction,
       sum( (permit_typ = 'SUPPLEMENTAL')::int ) as supplemental,
       . . .
from t
group by zipcode;

In postgresql, how do you find aggregate base on time range

For example, if I have a database table of transactions done over the counter, and I would like to find out whether there was any period that could be defined as extremely busy (more than 10 transactions processed in a span of 10 minutes), how would I go about querying it? Could I aggregate based on time ranges and count the number of transaction IDs within those ranges?
Adding an example to clarify my input and desired output:
+----+--------------------+
| Id | register_timestamp |
+----+--------------------+
| 25 | 08:10:50 |
| 26 | 09:07:36 |
| 27 | 09:08:06 |
| 28 | 09:08:35 |
| 29 | 09:12:08 |
| 30 | 09:12:18 |
| 31 | 09:12:44 |
| 32 | 09:15:29 |
| 33 | 09:15:47 |
| 34 | 09:18:13 |
| 35 | 09:18:42 |
| 36 | 09:20:33 |
| 37 | 09:20:36 |
| 38 | 09:21:04 |
| 39 | 09:21:53 |
| 40 | 09:22:23 |
| 41 | 09:22:42 |
| 42 | 09:22:51 |
| 43 | 09:28:14 |
+----+--------------------+
Desired output would be something like:
+-------+----------+
| Count | Min |
+-------+----------+
| 1 | 08:10:50 |
| 3 | 09:07:36 |
| 7 | 09:12:08 |
| 8 | 09:20:33 |
+-------+----------+
How about this:
SELECT time
FROM (
  SELECT count(*) AS c, min(time) AS time
  FROM transactions
  GROUP BY floor(extract(epoch from time)/600)
) AS t
WHERE c > 10;
This will find all ten minute intervals for which more than ten transactions occurred within that interval. It assumes that the table is called transactions and that it has a column called time where the timestamp is stored.
Thanks to redneb, I ended up with the following query:
SELECT count(*) AS c, min(register_timestamp) AS register_timestamp
FROM trak_participants_data
GROUP BY floor(extract(epoch from register_timestamp)/600)
ORDER BY register_timestamp;
It works closely enough for me to be able to tell which time chunks are the busiest for the counter.