Dropping partitions in Spark - Scala

I'm trying to drop partitions of a table A that has 3 partition columns: X, Y, Z.
When I insert data into A and execute
val partitions = sparkSession.sql("SHOW PARTITIONS A")
partitions.show()
I see:
+--------------+
|partition     |
+--------------+
|X=x1/Y=y2/Z=zn|
|X=x1/Y=y1/Z=zn|
|X=x2/Y=y2/Z=zn|
|X=x2/Y=y1/Z=zn|
+--------------+
Note that there can be more than one Z value (hence the n there).
Then some processing happens, and I need to drop the partitions where X is x1 or x2, Y=y1, and Z is anything.
I'm trying to:
sparkSession.sql("alter table A drop partition(X='x1', Y='y1'), partition(X=x2, Y=y1))
But I'm getting an error that the partitions don't exist. I'm assuming it's because I'm not providing values for partition column Z. Is there a way to do this? Or should I explicitly provide all possible values of Z?
EDIT: Ok, so I removed partition column Z just to make sure that was the issue... And it turns out, it's not.
Partitions:
+---------+
|partition|
+---------+
|X=x1/Y=y2|
|X=x1/Y=y1|
|X=x2/Y=y2|
|X=x2/Y=y1|
+---------+
The following partitions not found in table 'A' database 'DB':
Map(X -> x1, Y -> y1)
===
Map(X -> x2, Y -> y1);
So... Any idea? Is my syntax wrong?
Edit 2: So I'm back to my first assumption. I managed to drop the partitions in the 2-partition-column case with appropriate quotes. But the 3-partition-column case still doesn't work unless I explicitly add the value of the Z column.
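For reference, a sketch of the properly quoted statement that worked in the two-partition-column case (X, Y, x1, x2, y1 are the placeholders from above):
sparkSession.sql("ALTER TABLE A DROP PARTITION (X='x1', Y='y1'), PARTITION (X='x2', Y='y1')")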
Thanks.

Spark Dataframe changing through execution

I'm fairly new to Spark, so most likely I have a huge gap in my understanding. Apologies in advance if what you see here is very silly. So, what I'm trying to achieve is:
Get a set of rows from a table in Hive (let's call it T_A) and save them in a DataFrame (let's call it DF_A). This is done.
Get extra information from another Hive table (T_B) and join it with DF_A to get a new Dataframe (DF_B). And then cache it. This is also done.
val DF_A = sparkSession.sql("select * from T_A where whatever=something").toDF()
val extraData = sparkSession.sql("select * from T_B where whatever=something").toDF()
val DF_B = DF_A.join(extraData,
  col("something_else") === col("other_thing"), "left"
).toDF().cache()
Now this is me assuming Spark + Hive works similarly to a regular Java app + SQL, which is where I might need a hard course correction.
Here I attempt to store into one of the Hive tables I used before (T_B), partitioned by column X, the N rows I transformed from DF_B (Tr1(DF_B)). I use:
val DF_C = DF_B.map(row => {
  Tr1(row)
}).toDF()
DF_C.write.mode(SaveMode.Overwrite).insertInto("T_B")
After saving it to this table, I want to reuse the information from DF_B (not the transformed data reinserted into T_B, but the joined data in DF_B based on the previous state of T_B) to make a second transformation over it (Tr2(DF_B)).
I want to update the same N rows written in T_B with the data transformed by previous step, using an "INSERT OVERWRITE" operation and the same partition column X.
val DF_D = DF_B.map(row => {
  Tr2(row)
}).toDF()
DF_D.write.mode(SaveMode.Overwrite).insertInto("T_B")
What I expect:
T_B having N rows.
DF_B unchanged, with N rows.
What is happening:
DF_B having 3*N rows.
T_B having 3*N rows.
Now, after some debugging, I found that DF_B has 3*N rows after the DF_C write finishes. So DF_D will have 3*N rows too, and that will cause T_B to have 3*N rows as well.
So, my question is... Is there a way to retain the original DF_B data and use it for the second map transformation, since it relies on the original DF_B state for the transformation process? Is there a reference somewhere I can read to know why this happens?
EDIT: I don't know if this is useful information, but I log the count of records before and after doing the first write, and I get the following:
val DF_C = DF_B.map(row => {
  Tr1(row)
}).toDF()
Logger.info("DF_C.count {} - DF_B.count {}"...
DF_C.write.mode(SaveMode.Overwrite).insertInto("T_B")
Logger.info("DF_C.count {} - DF_B.count {}"...
With persist(MEMORY_AND_DISK) or no persist at all (instead of cache), and 3 test rows, I get:
DF_C.count 3 - DF_B.count 3
write
DF_C.count 3 - DF_B.count 9
With cache, I get:
DF_C.count 3 - DF_B.count 3
write
DF_C.count 9 - DF_B.count 9
Any idea?
Thank you so much.
In Spark, execution is lazy; work only happens when an action is called.
So when you call an action twice on the same dataframe (in your case DF_B), that dataframe (DF_B) is built and transformed twice from the start, each time it is executed.
So try to persist your dataframe DF_B before calling the first action; then you can use the same DF for both Tr1 and Tr2.
After persisting, the dataframe is stored in memory/disk and can be reused multiple times.
You can learn more about persistence here.
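A minimal sketch of that idea, reusing the question's placeholder names (DF_A, extraData, Tr1, Tr2) and assuming the question's existing imports plus StorageLevel:
import org.apache.spark.storage.StorageLevel

val DF_B = DF_A
  .join(extraData, col("something_else") === col("other_thing"), "left")
  .persist(StorageLevel.MEMORY_AND_DISK)
DF_B.count() // action: runs the join once and materializes the persisted data

// Both transformations now read the persisted DF_B instead of re-running
// the join against T_B (which the first write has already overwritten).
val DF_C = DF_B.map(row => Tr1(row)).toDF()
DF_C.write.mode(SaveMode.Overwrite).insertInto("T_B")
val DF_D = DF_B.map(row => Tr2(row)).toDF()
DF_D.write.mode(SaveMode.Overwrite).insertInto("T_B")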

PostgreSQL 10 - is it possible to do PARTITION BY LIST (col1, col2, .., colN)?

I am looking at the PostgreSQL official documentation page on Table Partitioning for my version of postgres.
I would like to create table partitions over three columns, and I wish to use declarative partition with BY LIST method to do that.
However, I cannot seem to find a good example on how to deal with more columns, and BY LIST specifically.
In the aforementioned docs I only read:
You may decide to use multiple columns in the partition key for range
partitioning, if desired. (...) For example, consider a table range
partitioned using columns lastname and firstname (in that order) as
the partition key.
It seems that declarative partitioning on multiple columns is only for BY RANGE, or is that not right?
However, if it is not, I found an answer on SO that shows how to deal with BY LIST and one column. But in my case I have three columns.
My idea would be to do something like the following (I am pretty sure it's wrong):
CREATE TABLE my_partitioned_table (
    col1 type CONSTRAINT col1_constraint CHECK (col1 = 1 or col1 = 0),
    col2 type CONSTRAINT col2_constraint CHECK (col2 = 'A' or col2 = 'B'),
    col3 type,
    col4 type
) PARTITION BY LIST (col1, col2);
CREATE TABLE part_1a PARTITION OF my_partitioned_table
    FOR VALUES IN (1, 'A');
CREATE TABLE part_1b PARTITION OF my_partitioned_table
    FOR VALUES IN (1, 'B');
...
I would need a correct implementation, as there are quite a lot of possible partition combinations in my case.
That is true, you cannot use list partitioning with more than one partitioning key. You also cannot bend range partitioning to do what you want.
But you could use a composite type to get what you want:
CREATE TYPE part_type AS (a integer, b text);
CREATE TABLE partme (p part_type, val text) PARTITION BY LIST (p);
CREATE TABLE partme_1_B PARTITION OF partme FOR VALUES IN (ROW(1, 'B'));
INSERT INTO partme VALUES (ROW(1, 'B'), 'x');
INSERT INTO partme VALUES (ROW(1, 'C'), 'x');
ERROR: no partition of relation "partme" found for row
DETAIL: Partition key of the failing row contains (p) = ((1,C)).
SELECT (p).a, (p).b, val FROM partme;
a | b | val
---+---+-----
1 | B | x
(1 row)
But perhaps the best way to go is to use subpartitioning: partition the original table by the first column and the partitions by the second column.
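A minimal sketch of that subpartitioning approach, reusing the example columns from above (table and partition names are just illustrative):
CREATE TABLE partme2 (a integer, b text, val text) PARTITION BY LIST (a);

-- first level: partition on the first column
CREATE TABLE partme2_1 PARTITION OF partme2
    FOR VALUES IN (1) PARTITION BY LIST (b);

-- second level: partition each first-level partition on the second column
CREATE TABLE partme2_1_b PARTITION OF partme2_1
    FOR VALUES IN ('B');

INSERT INTO partme2 VALUES (1, 'B', 'x');  -- routed through both levels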

Can (aggregate) functions be used to define a column?

Assume a table like this one:
a | b | total
--|---|------
1 | 2 | 3
4 | 7 | 11
…
CREATE TEMPORARY TABLE summedup (
a double precision DEFAULT 0
, b double precision DEFAULT 0
--, total double precision
);
INSERT INTO summedup (a, b) VALUES (1, 2);
INSERT INTO summedup (a, b) VALUES (4, 7);
SELECT a, b, a + b as total FROM summedup;
It's easy to sum up the first two columns on SELECT.
But does Postgres (9.6) also support the ability to define total as the sum of the other two columns? If so:
What is the syntax?
What is this type of operation called? (Aggregates typically sum up cells over multiple rows, not columns.)
What you are looking for is typically called a "computed column".
Postgres 9.6 does not support that (Postgres 12 - to be released in Q4 2019 - will).
But for such a simple sum, I wouldn't bother storing redundant information.
If you don't want to repeat the expression, create a view.
I think what you want is a View.
CREATE VIEW table_with_sum AS
SELECT id, a, b, a + b as total FROM summedup;
then you can query the view for the sum.
SELECT total FROM table_with_sum where id=5;
The view does not store the sum for each row; the total column is computed every time you query the view. If your goal is to make your query more efficient, this will not help.
There is another way: add the column to the table and create triggers for insert and update that set the total column every time a row is modified.
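A minimal sketch of that trigger approach (assuming a total column is added to summedup; the function and trigger names are just illustrative):
ALTER TABLE summedup ADD COLUMN total double precision;

CREATE FUNCTION set_total() RETURNS trigger AS $$
BEGIN
    NEW.total := NEW.a + NEW.b;  -- keep total in sync with a and b
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER summedup_set_total
    BEFORE INSERT OR UPDATE ON summedup
    FOR EACH ROW EXECUTE PROCEDURE set_total();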

TopologyTestDriver sending incorrect message on KTable aggregations

I have a topology that aggregates on a KTable.
This is a generic method I created to build this topology on different topics I have.
public static <A, B, C> KTable<C, Set<B>> groupTable(KTable<A, B> table, Function<B, C> getKeyFunction,
        Serde<C> keySerde, Serde<B> valueSerde, Serde<Set<B>> aggregatedSerde) {
    return table
            .groupBy((key, value) -> KeyValue.pair(getKeyFunction.apply(value), value),
                    Serialized.with(keySerde, valueSerde))
            .aggregate(() -> new HashSet<>(), (key, newValue, agg) -> {
                agg.remove(newValue);
                agg.add(newValue);
                return agg;
            }, (key, oldValue, agg) -> {
                agg.remove(oldValue);
                return agg;
            }, Materialized.with(keySerde, aggregatedSerde));
}
This works pretty well when running against Kafka, but not when testing via TopologyTestDriver.
In both scenarios, when I get an update, the subtractor is called first and then the adder. The problem is that when using the TopologyTestDriver, two messages are sent out for updates: one after the subtractor call, and another one after the adder call. Not to mention that the message sent after the subtractor and before the adder is in an incorrect state.
Can anyone else confirm this is a bug? I've tested this with both Kafka versions 2.0.1 and 2.1.0.
EDIT:
I created a testcase in github to illustrate the issue: https://github.com/mulho/topology-testcase
It is expected behavior that there are two output records (one "minus" record, and one "plus" record). It's a little tricky to understand how it works, so let me try to explain.
Assume you have the following input table:
key | value
-----+---------
A | <10,2>
B | <10,3>
C | <11,4>
On KTable#groupBy() you extract the first part of the value as the new key (i.e., 10 or 11) and later sum up the second part (i.e., 2, 3, 4) in the aggregation. Because the A and B records both have 10 as the new key, you would sum 2+3 for key 10, and you would sum 4 for the new key 11. The result table would be:
key | value
-----+---------
10 | 5
11 | 4
Now assume that an update record <B,<11,5>> changes the original input KTable to:
key | value
-----+---------
A | <10,2>
B | <11,5>
C | <11,4>
Thus, the new result table should sum up 5+4 for 11 and 2 for 10:
key | value
-----+---------
10 | 2
11 | 9
If you compare the first result table with the second, you might notice that both rows got updated. The old B|<10,3> record is subtracted from 10|5, resulting in 10|2, and the new B|<11,5> record is added to 11|4, resulting in 11|9.
Those are exactly the two output records you see. The first output record (emitted after the subtract is executed) updates the first row (it subtracts the old value that is no longer part of the aggregation result), while the second record adds the new value to the aggregation result. In our example, the subtract record would be <10,<null,<10,3>>> and the add record would be <11,<<11,5>,null>> (the format of those records is <key, <plus,minus>>; note that the subtract record only sets the minus part, while the add record only sets the plus part).
Final remark: it is not possible to put the plus and minus records together, because the keys of the plus and minus records can be different (in our example 11 and 10) and thus might go into different partitions. This implies that the plus and minus operations might be executed by different machines, so it's not possible to emit only one record that contains both the plus and minus parts.

Removing Duplicates From Multiple Unique Columns

I am accessing a table that takes in every encounter between two vehicles (I do not have permission to change this table). When an encounter occurs, it takes in one row for each perspective of the encounter: Vehicle X encountered Vehicle Y, and another row for Vehicle Y encountered Vehicle X. Here's some sample data:
Location  Vehicle1  Vehicle2
103923    5594800   54114
105938    40547     1855442
103923    2588603   5659158
103923    54114     5594800
103923    5659158   2588603
105938    1855442   40547
There are no duplicate rows and the values are all unique, but every value in Vehicle1 also exists in Vehicle2. How would I get it so only one row of each pair exists?
The GREATEST and LEAST functions might help, combined with the DELETE ... USING syntax:
DELETE
FROM t a
USING ( SELECT location,
               greatest(Vehicle1, Vehicle2) AS vehicle1,
               least(Vehicle1, Vehicle2)    AS vehicle2
        FROM t
        GROUP BY 1, 2, 3
        HAVING COUNT(*) > 1 ) b
WHERE a.location = b.location
  AND a.Vehicle1 = b.Vehicle1
  AND a.Vehicle2 = b.Vehicle2;
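If you only need to read one row per pair rather than delete the extras, a non-destructive sketch of the same idea:
SELECT DISTINCT location,
       greatest(Vehicle1, Vehicle2) AS vehicle_a,
       least(Vehicle1, Vehicle2)    AS vehicle_b
FROM t;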