What is the column limit for Spark DataFrames? - scala

Our team is having a lot of issues with the Spark API, particularly with large-schema tables. We currently have a program written in Scala that uses the Apache Spark API to create two Hive tables from raw files. One particularly large raw data file is giving us trouble: it contains around 4,700 columns and 200,000 rows.
Every week we get a new file that shows the updates, inserts and deletes that happened in the last week. Our program creates two tables: a master table and a history table. The master table is the most up-to-date version of the data, while the history table records all inserts and updates that happened to the table and what changed. For example, if we have the following schema, where A and B are the primary keys:
Week 1:
|-----|-----|-----|
|  A  |  B  |  C  |
|-----|-----|-----|
|  1  |  2  |  3  |
|-----|-----|-----|

Week 2:
|-----|-----|-----|
|  A  |  B  |  C  |
|-----|-----|-----|
|  1  |  2  |  4  |
|-----|-----|-----|
Then the master table will now be
|-----|-----|-----|
|  A  |  B  |  C  |
|-----|-----|-----|
|  1  |  2  |  4  |
|-----|-----|-----|
And the history table will be
|-----|-----|-------------------|----------------|-------------|-------------|
| A | B | changed_column | change_type | old_value | new_value |
|-----|-----|-------------------|----------------|-------------|-------------|
| 1 | 2 | C | Update | 3 | 4 |
|-----|-----|-------------------|----------------|-------------|-------------|
This process works flawlessly for tables with smaller schemas. We have a table with 300 columns but over 100,000,000 rows, and the code still runs as expected. For the wide table above, the process runs for around 15 hours and then crashes with the following error:
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.generic.Growable$class.loop$1(Growable.scala:52)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
Here is a code example that takes around 4 hours to run for this larger table, but runs in 20 seconds for other tables:
import org.apache.spark.sql.functions.broadcast

var dataframe_result = dataframe1
  .join(broadcast(dataframe2), Seq(listOfUniqueIds: _*))
  .repartition(100)
  .cache()
We have tried all of the following with no success:
Using broadcast hash joins (dataframe2 is smaller, dataframe1 is huge)
Repartitioning on different numbers, as well as not repartitioning at all
Caching the result of the dataframe (we originally did not do this).
What is causing this error and how can we fix it? The only difference with this problem table is that it has so many columns. Is there an upper limit to how many columns Spark can handle?
Note: We are running this code on a very large MapR cluster, and we have tried giving the job 500 GB of RAM and it is still failing.

Related

Redshift latency, i.e. discrepancy between "execution time" and "total runtime"

I'm currently experimenting with Redshift and I've noticed that for a simple query like:
SELECT COUNT(*) FROM table WHERE column = 'value';
The execution time reported by Redshift is only 84ms, which is expected and pretty good with the table at ~33M rows. However, the total runtime, as observed both in my local psql client and in Redshift's console UI, is 5 seconds. I've tried the query on both a single-node cluster and multi-node (2-node and 4-node) clusters.
In addition, when I try with more realistic, complicated queries, I can see similarly that the query execution itself is only ~500ms in a lot of cases, but the total runtime is ~7 seconds.
What causes this discrepancy? Is there any way to reduce this latency? Is there an internal table I can use to dive deeper into the time distribution covering the entire end-to-end runtime?
I read about the cold query performance improvements that Amazon recently introduced, but this latency seems to be there even on queries past the first cold one, as long as I alter the value in my WHERE clause. The latency is somewhat inconsistent, but it definitely still goes all the way up to 5 seconds.
-- Edited to give more details based on Bill Weiner's answer below --
There is no difference between doing SELECT COUNT(*) vs SELECT COUNT(column) (where column is a dist key to avoid skew).
There are absolutely zero other activities happening on the cluster because this is for exploration only. I'm the only one issuing queries and making connections to the DB, so there should be no queueing or locking delays.
The data resides in the Redshift database, with a normal schema and common-sense dist key and sort key. I have not added explicit compression to any columns, so everything is just AUTO right now.
Looks like compile time is the culprit!
STL_WLM_QUERY shows that for query 12599, this is the exec_start_time/exec_end_time:
-[ RECORD 1 ]------------+-----------------------------------------------------------------
userid | 100
xid | 14812605
task | 7289
query | 12599
service_class | 100
slot_count | 1
service_class_start_time | 2021-04-22 21:46:49.217
queue_start_time | 2021-04-22 21:46:49.21707
queue_end_time | 2021-04-22 21:46:49.21707
total_queue_time | 0
exec_start_time | 2021-04-22 21:46:49.217077
exec_end_time | 2021-04-22 21:46:53.762903
total_exec_time | 4545826
service_class_end_time | 2021-04-22 21:46:53.762903
final_state | Completed
est_peak_mem | 2097152
query_priority | Normal
service_class_name | Default queue
And from SVL_COMPILE, we have:
userid | xid | pid | query | segment | locus | starttime | endtime | compile
--------+----------+-------+-------+---------+-------+----------------------------+----------------------------+---------
100 | 14812605 | 30442 | 12599 | 0 | 1 | 2021-04-22 21:46:49.218872 | 2021-04-22 21:46:53.744529 | 1
100 | 14812605 | 30442 | 12599 | 2 | 2 | 2021-04-22 21:46:53.745711 | 2021-04-22 21:46:53.745728 | 0
100 | 14812605 | 30442 | 12599 | 3 | 2 | 2021-04-22 21:46:53.761989 | 2021-04-22 21:46:53.762015 | 0
100 | 14812605 | 30442 | 12599 | 1 | 1 | 2021-04-22 21:46:53.745476 | 2021-04-22 21:46:53.745503 | 0
(4 rows)
It shows that compile took from 21:46:49.218872 to 2021-04-22 21:46:53.744529, i.e. the overwhelming majority of the 4545ms total exec time.
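A combined query along these lines (a rough sketch; 12599 is the query id from the output above, and compile time is summed over the segments that were actually compiled) puts the compile time next to the reported queue and exec times:
SELECT w.query,
       w.total_queue_time / 1000000.0 AS queue_seconds,
       w.total_exec_time  / 1000000.0 AS exec_seconds,
       SUM(DATEDIFF(ms, c.starttime, c.endtime)) / 1000.0 AS compile_seconds
FROM stl_wlm_query w
JOIN svl_compile c ON c.query = w.query AND c.compile = 1
WHERE w.query = 12599
GROUP BY w.query, w.total_queue_time, w.total_exec_time;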
There's a lot that could be taking up this time. Looking at more of the query and queuing statistics will help track down what is happening. Here are a few possibilities that I've seen be significant in the past:
Data return time. Since your query is an open select, it could be returning a meaningful amount of data, and moving this over the network to the requesting computer takes time.
Queuing delays. What else is happening on your cluster? Does your query start right away or does it need to wait for a slot?
Locking delays. What else is happening on your cluster? Are data/tables changing? Is the data your query needs being committed elsewhere?
Compile time. Is this the first time this query is run?
Is the table external, i.e. in S3 as an external table? Or are you using the new RA3 instance type, where all the source data is in S3? (I'm guessing you are not on RA3 nodes, but it doesn't hurt to ask.)
A place to start is STL_WLM_QUERY to see where the query is spending this extra time.
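For example, a starting-point query against STL_WLM_QUERY could look like this (a rough sketch; the time columns in that table are in microseconds):
SELECT query,
       service_class_start_time,
       total_queue_time / 1000000.0 AS queue_seconds,
       total_exec_time  / 1000000.0 AS exec_seconds,
       final_state
FROM stl_wlm_query
ORDER BY service_class_start_time DESC
LIMIT 20;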

Postgres DB Schema - multiple columns vs one json column

I have a DB that contains a username with 3 different phone numbers and 3 different IDs. We will also have 3 different types of notes for each username.
I am using Postgres, and the data is expected to grow to millions of rows. Fast querying and inserting of new data is really important.
Which schema would be better for that:
username | no1(string) | no2(string) | no3(string) | id1(string) | id2(string) | id3(string) | note1(string) | note2(string) | note3(string)
OR
username | no(JSON) | id(JSON) | note(JSON)
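For concreteness, the two options would look roughly like this in DDL (the table names are placeholders, and I assume jsonb for the JSON variant):
-- Option 1: one column per value
CREATE TABLE users_wide (
    username text PRIMARY KEY,
    no1 text, no2 text, no3 text,
    id1 text, id2 text, id3 text,
    note1 text, note2 text, note3 text
);

-- Option 2: one JSON column per kind of value
CREATE TABLE users_json (
    username text PRIMARY KEY,
    no   jsonb,
    id   jsonb,
    note jsonb
);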

Know which tables are affected by a connection

I want to know if there is a way to retrieve which tables are affected by the requests made from a connection in PostgreSQL 9.5 or higher.
The purpose is to have the information in a way that lets me know which tables were affected, in which order, and in what way.
More precisely, something like this will suffice me :
id | datetime | id_conn | id_query | table | action
---+----------+---------+----------+---------+-------
1 | ... | 2256 | 125 | user | select
2 | ... | 2256 | 125 | order | select
3 | ... | 2256 | 125 | product | select
(this will be the result of a select query from user join order join product).
I know I can retrieve id_conn through "pg_stat_activity", and I can see if there is a running query, but I can't find a "history" of the queries.
The final purpose is to debug the database when incoherent data is inserted into a table (due to a lack of constraints). Knowing which connection did the insert will lead me to the faulty script (as I already have the script name and the connection id linked).
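For reference, the kind of pg_stat_activity lookup I mean (a minimal sketch) only shows the statement currently running on each connection, not a history of past queries:
SELECT pid AS id_conn, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active';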

Is it possible to use different forms and create one row of information in a table?

I have been searching for a way to combine two or more rows of one table in a database into one row.
I am currently creating multiple web-based forms that connect to one table in my database. Is there any way to write some MySQL and PHP code that will take separate form submissions and put them into one row of the database instead of multiple rows?
Here is an example of what is going into the database:
This is all in one table with three rows.
Form_ID represents the three different forms that I used to insert the data into the table.
Form_ID | Lot_ID | F_Name | L_Name | Date       | Age
--------+--------+--------+--------+------------+--------
1       | 1      | John   | Evans  | *NULL*     | *NULL*
2       | *NULL* | *NULL* | *NULL* | 2017-07-06 | *NULL*
3       | *NULL* | *NULL* | *NULL* | *NULL*     | 22
This is an example of three separate form submissions going into one table. Every time the submit button is hit, the data is simply inserted as a new row.
I need some sort of join or update once the submit button is hit to replace the preceding NULL values.
Here is what I want to do after the submit button is hit:
I want it all combined into one row, but still in one table.
Form_ID still refers to the three separate forms, but everything is in one row now.
Form_ID | Lot_ID | F_Name | L_Name | Date       | Age
--------+--------+--------+--------+------------+-----
1       | 1      | John   | Evans  | 2017-07-06 | 22
My goal is that once one form has been submitted, the next, different form submission replaces the NULL values in the row above it, and so on, to create a single row of information.
I found a way to solve this issue. I used UPDATE tablename SET columnname = newValue WHERE Form_ID = newID.
This way, when I want to update rows that have blank values, it finds the matching IDs.
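As a rough sketch of that approach (the table name submissions is made up, and I key the updates on Lot_ID here; swap in whatever ID your forms share):
-- First form creates the row
INSERT INTO submissions (Form_ID, Lot_ID, F_Name, L_Name)
VALUES (1, 1, 'John', 'Evans');

-- Later form submissions fill in the remaining NULL columns of that same row
UPDATE submissions SET `Date` = '2017-07-06' WHERE Lot_ID = 1;
UPDATE submissions SET Age = 22 WHERE Lot_ID = 1;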

How to create a PostgreSQL partitioned sequence?

Is there a simple (i.e. non-hacky) and race-condition-free way to create a partitioned sequence in PostgreSQL? Example:
Using a normal sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
Using a partitioned sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
I do not believe there is a simple way that is as easy as regular sequences, because:
A sequence stores only one number stream (next value, etc.). You want one for each partition.
Sequences have special handling that bypasses the current transaction (to avoid the race condition). It is hard to replicate this at the SQL or PL/pgSQL level without using tricks like dblink.
The DEFAULT column property can use a simple expression or a function call like nextval('myseq'); but it cannot refer to other columns to inform the function which stream the value should come from.
You can make something that works, but you probably won't think it simple. Addressing the above problems in turn:
Use a table to store the next value for all partitions, with a schema like multiseq (partition_id, next_val).
Write a multinextval(seq_table, partition_id) function that does something like the following:
Create a new transaction independent of the current transaction (one way of doing this is through dblink; I believe some other server languages can do it more easily).
Lock the table mentioned in seq_table.
Update the row where the partition id is partition_id, with an incremented value. (Or insert a new row with value 2 if there is no existing one.)
Commit that transaction and return the previous stored id (or 1).
Create an insert trigger on your projects table that uses a call to multinextval('projects_table', NEW.Project_ID) for insertions.
I have not used this entire plan myself, but I have tried something similar to each step individually. Examples of the multinextval function and the trigger can be provided if you want to attempt this...
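As a rough, untested sketch of steps 1 and 3 (simplified from the plan above: the seq_table argument is dropped because the counter table name is hard-coded, the dblink/independent-transaction step is skipped so concurrent inserts for the same partition serialize on the counter row until commit, PostgreSQL 9.5+ is assumed for INSERT ... ON CONFLICT, and the issues table and trigger names are illustrative):
-- Counter table: one row per partition
CREATE TABLE multiseq (
    partition_id integer PRIMARY KEY,
    next_val     bigint NOT NULL
);

-- Returns the next value for the given partition
-- (inserts a row with next_val = 2 and returns 1 if the partition is new)
CREATE FUNCTION multinextval(p_partition_id integer) RETURNS bigint AS $$
DECLARE
    v bigint;
BEGIN
    INSERT INTO multiseq (partition_id, next_val)
    VALUES (p_partition_id, 2)
    ON CONFLICT (partition_id)
    DO UPDATE SET next_val = multiseq.next_val + 1
    RETURNING next_val - 1 INTO v;   -- the value handed out to the caller
    RETURN v;
END;
$$ LANGUAGE plpgsql;

-- Trigger function that assigns the per-project issue number on insert
CREATE FUNCTION set_issue_number() RETURNS trigger AS $$
BEGIN
    NEW.Issue := multinextval(NEW.Project_ID);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER issue_number_trigger
    BEFORE INSERT ON issues
    FOR EACH ROW EXECUTE PROCEDURE set_issue_number();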