Applying a UDF to a DataFrame gives wrong results because of partitioning - PySpark

I know that a DataFrame in PySpark has partitions, and when I apply a function (UDF) to one column, each partition applies the same function in parallel.
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)

# collect the labels on the driver
data = df.rdd.map(lambda x: x['label']).collect()

# pop the next collected label and lowercase it for every row
def ad(x):
    return data.pop(0).lower()

AD = F.udf(ad, StringType())

df.withColumn('station', AD('label')).select('station').rdd.flatMap(lambda x: x).collect()
Here is the output:
['a', 'a', 'a', 'a']
which should be:
['a', 'b', 'a', 'b']
And the strangest thing is that data didn't even change after we called data.pop(0) inside the function.

Well, it turns out that when the number of partitions increases, the function is applied to each partition with its own copy of data. In other words, the closure (including data) is serialized and shipped to every executor, so the copies the workers mutate are independent of the driver's list, which is why it never changes. Every time we use F.udf, every variable referenced inside the function is copied to the workers in this way rather than shared.
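For a transformation like this, the stateful UDF can be avoided entirely: a plain column expression runs per row on the executors and does not depend on any driver-side list, so the copied-closure problem never arises. A minimal sketch, assuming the same df as above:

import pyspark.sql.functions as F

# Lowercase each row's label directly instead of popping from a shared list.
df.withColumn('station', F.lower('label')).select('station') \
  .rdd.flatMap(lambda x: x).collect()
# ['a', 'b', 'a', 'b']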

Related

Scala to PySpark - join method

I have to translate the Scala code below into PySpark.
val tempFactDF = unionTempDF.join(fact.select("x", "y", "d", "f", "s"),
                                  Seq("x", "y", "d", "f")).dropDuplicates
Is my attempt below a correct approach? Does it account for the Seq of join columns?
What if I want to do a left join?
unionTempDF.join(joiningTable, ['x', 'y', 'd', 'f']).dropDuplicates()
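Passing a Python list of column names to join plays the role of Scala's Seq(...), and the join type goes in as the third argument. A minimal sketch, using joiningTable as a stand-in for fact.select("x", "y", "d", "f", "s"):

# Inner join on the shared columns (the default), matching the Scala snippet
tempFactDF = unionTempDF.join(joiningTable, ['x', 'y', 'd', 'f']).dropDuplicates()

# Left (outer) join: pass the join type as the third argument
leftJoinedDF = unionTempDF.join(joiningTable, ['x', 'y', 'd', 'f'], 'left').dropDuplicates()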

Updating multiple values in a single Postgres jsonb column

I have a Postgres jsonb column. It has multiple top-level attributes, including sub-hashes and sub-arrays. An example:
{
  first: { a: 'a', b: 'b' },
  second: ['a', 'b'],
  third: { something: 'else' },
  ...
}
I want to merge multiple attributes into this value, ideally in a single operation. For example, I'd like to be able to merge new keys into first (leaving everything else untouched), and new values into second, resulting in e.g.
{
  first: { a: 'a', b: 'b', c: 'c' },
  second: ['a', 'b', 'c'],
  third: { something: 'else' },
  ...
}
I've tried using both the || operator and jsonb_set, but have yet to get this working in a single operation.
For context, I'm doing this in Ruby on Rails using ActiveRecord, but am assuming I'll need to perform this action using raw SQL with the ActiveRecord::Base.connection.execute command. Also, in practice, the values are more complex – 'second' might actually be an array of hashes, but I don't see that impacting how the operation would work.
If necessary, I could perform this in multiple operations (one for 'first', one for 'second' etc), but a single operation would be preferred.
Any help is very much appreciated.

How can I refer to a column by its index?

I can use the col("mycolumnname") function to get the column object.
Based on the documentation, the only possible parameter is the name of the column.
Is there any way to get the column object by its index?
Try this:
Let n be the index variable (integer).
df.select(df.columns[n]).show()
Is the expected result like this?
import pyspark.sql.functions as F
...
data = [
    (1, 'AC Milan'),
    (2, 'Real Madrid'),
    (3, 'Bayern Munich')
]
df = spark.createDataFrame(data, ['id', 'club'])
df.select(F.col('club')).show()
df.select(df['club']).show()
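If an actual Column object is needed rather than a select by name, the two approaches can be combined: look the name up by position in df.columns and pass it to col. A minimal sketch, assuming the df defined above:

import pyspark.sql.functions as F

n = 1                             # positional index of the wanted column
club_col = F.col(df.columns[n])   # df.columns is an ordered list of column names
df.select(club_col).show()        # same result as selecting 'club' by name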

How can I remove rows that are 100% duplicates in a PostgreSQL table without a primary key? [duplicate]

This question already has answers here:
Delete duplicate rows from small table
(15 answers)
Closed 3 years ago.
I have a PostgreSQL table with a very large number of columns. The table does not have a primary key and now contains several rows that are 100% duplicates of another row.
How can I remove those duplicates without deleting the original along with them?
I found this answer on a related question, but I'd have to spell out each and every column name, which is error-prone. How can I avoid having to know anything about the table structure?
Example:
Given
create table duplicated (
  id          int,
  name        text,
  description text
);

insert into duplicated
values (1, 'A', null),
       (2, 'B', null),
       (2, 'B', null),
       (3, 'C', null),
       (3, 'C', null),
       (3, 'C', 'not a DUPE!');
after deletion, the following rows should remain:
(1, 'A', null)
(2, 'B', null)
(3, 'C', null)
(3, 'C', 'not a DUPE!')
As proposed in this answer, use the system column ctid to distinguish the physical copies of otherwise identical rows.
To avoid having to spell out a non-existing 'key' for the rows, simply use the row constructor row(table), which returns a row value containing the entire row as returned by select * from table:
DELETE FROM duplicated
USING (
  SELECT MIN(ctid) AS ctid, row(duplicated) AS row
  FROM   duplicated
  GROUP  BY row(duplicated)
  HAVING COUNT(*) > 1
) uniqued
WHERE row(duplicated) = uniqued.row
AND   duplicated.ctid <> uniqued.ctid;
You can try it in this DbFiddle.

How can I conditionally add elements to a jsonb array? [duplicate]

This question already has answers here:
PostgreSQL JSON building an array without null values
(4 answers)
Closed 8 months ago.
Is there a way I can conditionally add elements to a Postgres jsonb array? I'm trying to construct an array to be added to a larger object, where most of the elements are always required but I'd like some of them to be optional.
As a simplified example:
select jsonb_build_array(
  jsonb_build_object('a', a),
  jsonb_build_object('b', b),
  jsonb_build_object('c', c),
  case when a + b <> c then
    jsonb_build_object('error', c - (a + b))
  end
) from ( values (2, 2, 5) ) as things (a, b, c);
This works fine when a + b <> c, but when a + b = c I get a null in the array, e.g.
sophia=> \i ~/cc/dpdb/migration/foo.sql
jsonb_build_array
----------------------------------------------
[{"a": 2}, {"b": 2}, {"c": 5}, {"error": 1}]
(1 row)
sophia=> \i ~/cc/dpdb/migration/foo.sql
jsonb_build_array
--------------------------------------
[{"a": 2}, {"b": 2}, {"c": 4}, null]
(1 row)
sophia=>
Is there a way to add the element without the null, or, if it is added, to remove the null afterwards? Obviously, I could put the whole block in a case and duplicate the first few lines, but that would be rather ugly and verbose. There's jsonb_strip_nulls, but that only works on objects, not arrays.
You have to use a second step because there is no way to produce "no element" inside jsonb_build_array itself. Either you really separate the two cases into two different array constructions, or you build the base array first and conditionally append to it afterwards:
demo:db<>fiddle
SELECT
  CASE WHEN a + b <> c THEN
    my_array || jsonb_build_object('error', c - (a + b))
  ELSE
    my_array
  END
FROM (
  SELECT
    a, b, c,
    jsonb_build_array(
      jsonb_build_object('a', a),
      jsonb_build_object('b', b),
      jsonb_build_object('c', c)
    ) AS my_array
  FROM ( VALUES (2, 2, 5), (2, 2, 4) ) AS things (a, b, c)
) s