I have a data frame like below.
Col_A | Col_B | Col_C
------+-------+------------
ABC1  | XYZ1  | {json data}
ABC2  | XYZ2  | {json data}
ABC3  | XYZ3  | {json data}
I need to transform the Col_C JSON data for each row, add the extracted columns and values back to that row, and produce output like below.
Col_A | Col_B | Col_C       | New_Col_1       | ... | New_Col_N
------+-------+-------------+-----------------+-----+----------------
ABC1  | XYZ1  | {json data} | extracted value | ... | extracted value
ABC2  | XYZ2  | {json data} | extracted value | ... | extracted value
ABC3  | XYZ3  | {json data} | extracted value | ... | extracted value
I have the transformation logic which gives me the key-value pairs from the Col_C JSON data, but how can I add these new columns back to the row? Since the column names are dynamic (not a fixed schema), I can't use the withColumn function to add the new columns.
One thing I can do is add a unique column to both the actual data frame and the transformed data frame, and later join on that unique column (sketched below).
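For reference, a minimal PySpark sketch of that join-based workaround (transform_json_col is a hypothetical stand-in for the transformation logic, assumed to return a data frame keyed by the same id):
from pyspark.sql import functions as F

# Tag each row with a unique id so the extracted columns can be joined back.
df_with_id = df.withColumn("row_id", F.monotonically_increasing_id())

# Hypothetical helper: whatever logic turns the Col_C JSON into a data frame
# of (row_id, New_Col_1, ..., New_Col_N) with a dynamic set of columns.
transformed = transform_json_col(df_with_id.select("row_id", "Col_C"))

# Join the extracted columns back onto the original rows and drop the helper id.
result = df_with_id.join(transformed, on="row_id", how="left").drop("row_id")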
Is there any other way to achieve this?
Since the schema is not known, we can take one JSON value and extract its schema.
Essentially this would work:
Consider this input:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [["ABC1", "XYZ1", "{'name':'Sam', 'Age':26}"],
     ["ABC2", "XYZ2", "{'name':'Raj', 'Age':26}"]]
).toDF("Col_A", "Col_B", "Col_C")
df.show(truncate=False)
This should do the job:
# Infer the schema from one JSON value (the first row's Col_C)
schema = spark.read.json(sc.parallelize([df.select("Col_C").first()["Col_C"]])).schema

# Parse Col_C with that schema and expand the resulting struct into top-level columns
df.withColumn("asJson", F.from_json("Col_C", schema)) \
  .select(F.col("Col_A"), F.col("Col_B"), F.col("asJson.*")) \
  .show()
Output: Col_A and Col_B plus the columns from the inferred JSON schema (Age and name), roughly:
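+-----+-----+---+----+
|Col_A|Col_B|Age|name|
+-----+-----+---+----+
| ABC1| XYZ1| 26| Sam|
| ABC2| XYZ2| 26| Raj|
+-----+-----+---+----+
Note that this infers the schema from the first row only, so it assumes every row's JSON has the same set of keys.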
I have a column like this:
+----------+------------------------+
|Race_Track|EngineType              |
+----------+------------------------+
|800-RDUO  |881,652,EWQ,300x,652,PXZ|
+----------+------------------------+
I need to remove one specific value, say EWQ, and also remove any duplicates, like below:
+----------+----------------+
|Race_Track|EngineType      |
+----------+----------------+
|800-RDUO  |881,300x,652,PXZ|
+----------+----------------+
How to achieve this in Scala?
You can achieve your desired output by combining split, filter, array_distinct and concat_ws as below (assuming data is your dataset):
import org.apache.spark.sql.functions.{array_distinct, col, concat_ws, filter, split}

// split the string into an array, drop "EWQ", remove duplicates, then re-join
// (array_distinct requires Spark 2.4+, filter with a lambda requires Spark 3.0+)
data = data
  .withColumn("EngineType", array_distinct(
    filter(split(col("EngineType"), ","), x => x.notEqual("EWQ"))))
  .withColumn("EngineType", concat_ws(",", col("EngineType")))
Final output:
+----------+----------------+
|Race_Track|EngineType |
+----------+----------------+
|800-RDUO |881,652,300x,PXZ|
+----------+----------------+
Good luck!
In a Postgres 9.3 database I have a table in which one column contains JSON, as in the test table shown in the example below.
test=# create table things (id serial PRIMARY KEY, details json, other_field text);
CREATE TABLE
test=# \d things
Table "public.things"
   Column    |  Type   |                      Modifiers
-------------+---------+------------------------------------------------------
 id          | integer | not null default nextval('things_id_seq'::regclass)
 details     | json    |
 other_field | text    |
Indexes:
"things_pkey" PRIMARY KEY, btree (id)
test=# insert into things (details, other_field)
values ('[{"json1": 123, "json2": 456},{"json1": 124, "json2": 457}]', 'nonsense');
INSERT 0 1
test=# insert into things (details, other_field)
values ('[{"json1": 234, "json2": 567}]', 'piffle');
INSERT 0 1
test=# select * from things;
 id |                           details                            | other_field
----+--------------------------------------------------------------+-------------
  1 | [{"json1": 123, "json2": 456},{"json1": 124, "json2": 457}]  | nonsense
  2 | [{"json1": 234, "json2": 567}]                               | piffle
(2 rows)
The JSON is always an array containing a variable number of hashes. Each hash always has the same set of keys. I am trying to write a query which returns a row for each entry in the JSON array, with columns for each hash key and the id from the things table. I'm hoping for output like the following:
 thing_id | json1 | json2
----------+-------+-------
        1 |   123 |   456
        1 |   124 |   457
        2 |   234 |   567
i.e. two rows for entries with two items in the JSON array. Is it possible to get Postgres to do this?
json_populate_recordset feels like an essential part of the answer, but I can't get it to work with more than one row at once.
select id,
       (details ->> 'json1')::int as json1,
       (details ->> 'json2')::int as json2
from (
  select id, json_array_elements(details) as details
  from things
) s;
 id | json1 | json2
----+-------+-------
  1 |   123 |   456
  1 |   124 |   457
  2 |   234 |   567
I have a table A with columns: id, title, condition
And I have another table B with information about the position of some rows from table A. Table B has the columns id, next_id, prev_id.
How do I sort the rows from A based on the information in table B?
For example,
Table A
id | title
---+--------
 1 | title1
 2 | title2
 3 | title3
 4 | title4
 5 | title5
Table B
id | next_id | prev_id
---+---------+---------
 2 | 1       | null
 5 | 4       | 3
I want to get this result:
id | title
---+--------
 2 | title2
 1 | title1
 3 | title3
 5 | title5
 4 | title4
And after applying this sort, I also want to sort by the condition column.
I've already spent a lot of time looking for a solution, and hope for your help.
You have to add weights to your data so you can order accordingly. This example uses next_id; I'm not sure whether you also need prev_id, since you don't explain its use.
Anyway, here's a code example:
-- Temporary data for the test:
CREATE TEMP TABLE table_a(id integer, title text);
CREATE TEMP TABLE table_b(id integer, next_id integer, prev_id integer);

INSERT INTO table_a VALUES
    (1,'title1'),
    (2,'title2'),
    (3,'title3'),
    (4,'title4'),
    (5,'title5');

INSERT INTO table_b VALUES
    (2,1,null),
    (5,4,3);

-- QUERY:
SELECT
    id, title,
    CASE -- Adding weight
        WHEN next_id IS NULL THEN (id + 0.1)
        ELSE next_id
    END AS orden
FROM -- Joining tables
    (SELECT ta.*, tb.next_id
     FROM table_a ta
     LEFT JOIN table_b tb
       ON ta.id = tb.id) join_a_b
ORDER BY orden;
And here's the result:
 id | title  | orden
----+--------+------
  2 | title2 |   1
  1 | title1 |   1.1
  3 | title3 |   3.1
  5 | title5 |   4
  4 | title4 |   4.1
I'm using pyspark and hivecontext.sql and I want to filter out all null and empty values from my data.
So I used simple SQL commands to first filter out the null values, but it doesn't work.
My code:
hiveContext.sql("select column1 from table where column2 is not null")
but it works without the expression "where column2 is not null".
Error:
Py4JJavaError: An error occurred while calling o577.showString
I think it is because my select is wrong.
Data example:
column 1 | column 2
null     | 1
null     | 2
1        | 3
2        | 4
null     | 2
3        | 8
Objective:
column 1 | column 2
1        | 3
2        | 4
3        | 8
Thanks
We cannot pass the Hive table name directly to the Hive context's sql method, since it doesn't understand the Hive table name. One of the ways to read a Hive table is to use the pyspark shell.
We need to register the data frame we get from reading the Hive table; then we can run the SQL query, as sketched below.
You have to give database_name.table and run the same query; it will work. Please let me know if that helps.
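For reference, a minimal sketch of that register-then-query flow (assuming the pyspark shell's built-in sqlContext; the table and view names simply mirror the question and are illustrative):
# Read the Hive table into a data frame (give database_name.table).
df = sqlContext.table("database_name.table")

# Register the data frame as a temporary table so it can be queried with SQL.
df.registerTempTable("my_table")

# Run the same query against the registered table to filter out the nulls.
sqlContext.sql("select column1 from my_table where column2 is not null").show()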
It works for me:
df.na.drop(subset=["column1"])
Have you entered the null values manually?
If yes, then those will be treated as normal strings.
I tried the following two use cases on a dbname.person table in Hive:
name   age
aaa    null   // this null was entered manually - case 1
Andy   30
Justin 19
okay   NULL   // this null appeared because the field was left blank - case 2
---------------------------------
hiveContext.sql("select * from dbname.person").show();
+------+----+
| name| age|
+------+----+
| aaa |null|
| Andy| 30|
|Justin| 19|
| okay|null|
+------+----+
-----------------------------
case 2
hiveContext.sql("select * from dbname.person where age is not null").show();
+------+----+
| name|age |
+------+----+
| aaa |null|
| Andy| 30 |
|Justin| 19 |
+------+----+
------------------------------------
case 1
hiveContext.sql("select * from dbname.person where age!= 'null'").show();
+------+----+
| name| age|
+------+----+
| Andy| 30|
|Justin| 19|
| okay|null|
+------+----+
------------------------------------
I hope the above use cases clear up your doubts about filtering null values out.
And if you are querying a table registered in Spark, then use sqlContext.
I'd like to compare each consecutive pair of rows, i and i-1, on col2 (sorted by col1).
If the item in row i and the item in row i-1 are different, I'd like to increment the count of the item in row i-1 by 1.
+--------------+
| col1 col2 |
+--------------+
| row_1 item_1 |
| row_2 item_1 |
| row_3 item_2 |
| row_4 item_1 |
| row_5 item_2 |
| row_6 item_1 |
+--------------+
In the above example, if we scan two rows at a time downwards, we see that row_2 and row_3 are different, so we add one to item_1. Next, we see that row_3 is different from row_4, so we add one to item_2. Continue until we end up with:
+-------------+
| col2 col3 |
+-------------+
| item_1 2 |
| item_2 2 |
+-------------+
You can use a combination of a window function and an aggregate to do this. The window function is used to get the next value of col2 (using col1 for ordering). The aggregate then counts the times we encounter a difference. This is implemented in the code below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(
  ("row_1", "item_1"),
  ("row_2", "item_1"),
  ("row_3", "item_2"),
  ("row_4", "item_1"),
  ("row_5", "item_2"),
  ("row_6", "item_1")).toDF("col1", "col2")

val q = data.
  withColumn("col2_next",  // next value of col2 in col1 order (the last row falls back to its own value)
    coalesce(lead($"col2", 1) over Window.orderBy($"col1"), $"col2")).
  groupBy($"col2").
  agg(sum(($"col2" =!= $"col2_next") cast "int") as "col3")  // count rows where the next value differs
scala> q.show
17/08/22 10:15:53 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+------+----+
| col2|col3|
+------+----+
|item_1| 2|
|item_2| 2|
+------+----+