Replace/Convert null value to empty array in pyspark

I have a PySpark dataframe:
id | column_1 | column_2 | column_3
--------------------------------------------
1 | ["12"] | null | ["67"]
--------------------------------------------
2 | null | ["78"] | ["90"]
--------------------------------------------
3 | ["""] | ["93"] | ["56"]
--------------------------------------------
4 | ["100"] | ["78"] | ["90"]
--------------------------------------------
And I need to convert all null values in column_1 to an empty array []:
id | column_1 | column_2 | column_3
--------------------------------------------
1 | ["12"] | null | ["67"]
--------------------------------------------
2 | [] | ["78"] | ["90"]
--------------------------------------------
3 | ["""] | ["93"] | ["56"]
--------------------------------------------
4 | ["100"] | ["78"] | ["90"]
--------------------------------------------
I used this code, but it's not working for me:
df.withColumn("column_1", coalesce(column_1, array().cast("array<string>")))
Appreciate your help!

The code works just fine for me, except that you need to wrap column_1 in quotes: "column_1". Also, you don't need the cast; array() alone is enough.
df.withColumn("column_1", coalesce('column_1', array()))

Use fillna() with the subset parameter.
Reference: https://stackoverflow.com/a/45070181

Related

convert empty array to null pyspark

I have a pyspark Dataframe:
Dataframe example:
id | column_1 | column_2 | column_3
--------------------------------------------
1 | ["12"] | ["""] | ["67"]
--------------------------------------------
2 | ["""] | ["78"] | ["90"]
--------------------------------------------
3 | ["""] | ["93"] | ["56"]
--------------------------------------------
4 | ["100"] | ["78"] | ["90"]
--------------------------------------------
I want to convert all the ["""] values in the columns column_1, column_2, column_3 to null. The type of these 3 columns is Array.
Expected result:
id | column_1 | column_2 | column_3
--------------------------------------------
1 | ["12"] | null | ["67"]
--------------------------------------------
2 | null | ["78"] | ["90"]
--------------------------------------------
3 | null | ["93"] | ["56"]
--------------------------------------------
4 | ["100"] | ["78"] | ["90"]
--------------------------------------------
I tried the solution below:
df = df.withColumn(
    "column_1",
    F.when((F.size(F.col("column_1")) == ""),
           F.lit(None)).otherwise(F.col("column_1"))
).withColumn(
    "column_2",
    F.when((F.size(F.col("column_2")) == ""),
           F.lit(None)).otherwise(F.col("column_2"))
).withColumn(
    "column_3",
    F.when((F.size(F.col("column_3")) == ""),
           F.lit(None)).otherwise(F.col("column_3"))
)
But it converts everything to null.
How can I test for an array that contains an empty string, i.e. [""], not []?
Thank you
You can test with a when and replace the values:
df.withColumn(
    "column_1",
    F.when(F.col("column_1") != F.array(F.lit('"')),  # or '"""' ?
           F.col("column_1")
    ))
Do that for each of your columns.
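For reference, a sketch of that approach applied to all three columns in a loop, assuming the cells really hold [""], i.e. an array with a single empty string (adjust the literal inside F.array(F.lit("")) if your data differs); it is the same when() test, just written with an explicit otherwise():
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Re-create the example data from the question
df = spark.createDataFrame(
    [(1, ["12"], [""], ["67"]),
     (2, [""], ["78"], ["90"]),
     (3, [""], ["93"], ["56"]),
     (4, ["100"], ["78"], ["90"])],
    "id int, column_1 array<string>, column_2 array<string>, column_3 array<string>",
)

# Replace any array equal to [""] with null, one column at a time
for c in ["column_1", "column_2", "column_3"]:
    df = df.withColumn(
        c,
        F.when(F.col(c) == F.array(F.lit("")), F.lit(None)).otherwise(F.col(c)),
    )
df.show()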

sort by column from pivot table

I have 3 tables,
auctions
+----+----------+
| ID | NAME |
+----+----------+
| 1 | Lelang 1 |
+----+----------+
| 2 | Lelang 2 |
+----+----------+
| 3 | Lelang 3 |
+----+----------+
levels
+----+---------+
| ID | NAME |
+----+---------+
| 1 | Level 1 |
+----+---------+
| 2 | Level 2 |
+----+---------+
| 3 | Level 3 |
+----+---------+
auction_level
+----+---------+----------+------------+------------+
| ID | AUCT_ID | LEVEL_ID | DATE_START | DATE_END |
+----+---------+----------+------------+------------+
| 1 | 1 | 1 | 2017-01-01 | 2017-01-10 |
+----+---------+----------+------------+------------+
| 2 | 1 | 2 | 2017-01-11 | 2017-01-31 |
+----+---------+----------+------------+------------+
| 3 | 2 | 1 | 2017-01-05 | 2017-01-15 |
+----+---------+----------+------------+------------+
I want to select all entries of the auctions table and sort them by the DATE_START of Level 1 in the auction_level table. Honestly, I have no idea how to build the desired query, so I haven't created any queries yet.

Create a Common Identifier Across Columns with the Same Data

I have a dictionary containing the original word, and altforms of the word. What I have currently is something like this:
|---------------------|---------------------|------------------|
| WordID | Word | OrigWord |
|---------------------|---------------------|------------------|
| 1 | aah | null |
|---------------------|---------------------|------------------|
| 2 | aahs | aah |
|---------------------|---------------------|------------------|
| 3 | aahed | aah |
|---------------------|---------------------|------------------|
| 4 | aahing | aah |
|---------------------|---------------------|------------------|
I have around 270,000 words in the dictionary with a similar layout to this.
Is there a fairly simple way to create an ID for each inflected form that links back to the original word similar to below?
|---------------------|---------------------|---------------------|------------------|
| WordID | LinkID | Word | OrigWord |
|---------------------|---------------------|---------------------|------------------|
| 1 | null | aah | null |
|---------------------|---------------------|---------------------|------------------|
| 2 | 1 | aahs | aah |
|---------------------|---------------------|---------------------|------------------|
| 3 | 1 | aahed | aah |
|---------------------|---------------------|---------------------|------------------|
| 4 | 1 | aahing | aah |
|---------------------|---------------------|---------------------|------------------|
| 5 | null | aardvark | null |
|---------------------|---------------------|---------------------|------------------|
| 6 | 5 | aardvarks | aardvark |
|---------------------|---------------------|---------------------|------------------|
EDIT:
Added second example word to further explain LinkID functionality
This can be achieved by a self join:
SELECT t1.WordID,
       t2.WordID AS LinkID,
       t1.Word,
       t1.OrigWord
FROM Words t1
LEFT JOIN Words t2 ON t1.OrigWord = t2.Word

I want to exchange value of 2 fields in crystal report

I want to exchange the values of 2 fields in Crystal Reports
(if Columnd5 is null, I want to put the value of Columnd6 into Columnd5).
I used this formula:
if isnull ({DataTable1.Columnd5}) then
    tonumber ({DataTable1.Columnd6})
else if isnull({DataTable1.Columnd6}) then
    0.00
else
    tonumber ({DataTable1.Columnd6})
but this one isn't working.
I understand that the truth table of your formula's result is as follows:
| Column5 value | Column6 value | result |
| 5* | 6* | 6 |
| 5* | null | 0 |
| null | 6* | 6 |
| null | null | error |
*means a supposed value as an example, could be any numeric value
But I understand your desired result would be:
| Column5 value | Column6 value | result |
| 5* | 6* | 5 |
| 5* | null | 5 |
| null | 6* | 6 |
| null | null | 0 |
*means a supposed value as an example, could be any numeric value
Did I really understand the problem? If so, I would suggest the following formula:
if not isnull({DataTable1.Columnd5}) then
    {DataTable1.Columnd5}
else if not isnull({DataTable1.Columnd6}) then
    {DataTable1.Columnd6}
else
    0
Check whether you still need to call the ToNumber function; I don't believe so, but it depends on your schema.

Postgres values as columns

I am working with PostgreSQL 9.3, and I have this:
PARENT_TABLE
ID | NAME
1 | N_A
2 | N_B
3 | N_C
CHILD_TABLE
ID | PARENT_TABLE_ID | KEY | VALUE
1 | 1 | K_A | V_A
2 | 1 | K_B | V_B
3 | 1 | K_C | V_C
5 | 2 | K_A | V_D
6 | 2 | K_C | V_E
7 | 3 | K_A | V_F
8 | 3 | K_B | V_G
9 | 3 | K_C | V_H
Note that I might add K_D to the KEYs; it's completely dynamic.
What I want is a query that returns me the following:
QUERY_TABLE
ID | NAME | K_A | K_B | K_C | others K_...
1 | N_A | V_A | V_B | V_C | ...
2 | N_B | V_D | | V_E | ...
3 | N_C | V_F | V_G | V_H | ...
Is this possible to do? If so, how?
Since there can be values missing, you need the "safe" form of crosstab() with the column names as second parameter:
SELECT * FROM crosstab(
   'SELECT p.id, p.name, c.key, c."value"
    FROM parent_table p
    LEFT JOIN child_table c ON c.parent_table_id = p.id
    ORDER BY 1'
  , $$VALUES ('K_A'::text), ('K_B'), ('K_C')$$)
AS t (id int, name text, k_a text, k_b text, k_c text);  -- use actual data types
Details in this related answer:
PostgreSQL Crosstab Query
About adding "extra" columns:
Pivot on Multiple Columns using Tablefunc