Spark Scala - concatenate column values across rows in a certain order

Okay, I'm building a metadata-driven ETL framework using Spark (Scala), and I have a table with column definitions and their corresponding ordinal positions. The table contains the following info:
tablename
columnname
datatype
ordinalposition
I have to build a CREATE TABLE statement from that data. Not a big deal, right? I tried what appears to be the standard answer:
var metadatadef = spark.sql("SELECT tablename, columnname, datatype, ordinalposition FROM metadata")
  .withColumn("columndef", concat($"columnname", lit(" "), $"datatype"))
  .sort($"tablename", $"ordinalposition")
  .groupBy("tablename")
  .agg(concat_ws(", ", collect_list($"columndef")).as("columndefs"))
But the sort() call seems to be ignored here, or the rows get reshuffled somewhere between collect_list() and concat_ws(). Given source data like this:
+-----------+--------------+---------------+-----------------+
| tablename | columnname   | datatype      | ordinalposition |
+-----------+--------------+---------------+-----------------+
| table1    | IntColumn    | int           | 0               |
| table2    | StringColumn | string        | 2               |
| table1    | StringColumn | string        | 2               |
| table2    | IntColumn    | int           | 0               |
| table1    | DecColumn    | decimal(15,2) | 1               |
| table2    | DecColumn    | decimal(15,2) | 1               |
+-----------+--------------+---------------+-----------------+
I am trying to get output like this:
+-----------+--------------------------------------------------------------+
| tablename | columndefs                                                   |
+-----------+--------------------------------------------------------------+
| table1    | IntColumn int, DecColumn decimal(15,2), StringColumn string  |
| table2    | IntColumn int, DecColumn decimal(15,2), StringColumn string  |
+-----------+--------------------------------------------------------------+
Instead, I wind up with something like this:
+-----------+--------------------------------------------------------------+
| tablename | columndefs                                                   |
+-----------+--------------------------------------------------------------+
| table1    | IntColumn int, StringColumn string, DecColumn decimal(15,2)  |
| table2    | StringColumn string, IntColumn int, DecColumn decimal(15,2)  |
+-----------+--------------------------------------------------------------+
Do I need to build a UDF to ensure the proper order? I need the output to end up in a DataFrame for comparison purposes, not just to build the CREATE TABLE statement.

You can create a struct column of (ordinalposition, columndef) and apply sort_array to sort the aggregated columndef values into the desired order during the groupBy transformation, as follows:
import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and the $ syntax; already in scope in spark-shell

val df = Seq(
  ("table1", "IntColumn", "int", "0"),
  ("table2", "StringColumn", "string", "2"),
  ("table1", "StringColumn", "string", "2"),
  ("table2", "IntColumn", "int", "0"),
  ("table1", "DecColumn", "decimal(15,2)", "1"),
  ("table2", "DecColumn", "decimal(15,2)", "1")
).toDF("tablename", "columnname", "datatype", "ordinalposition")

df.
  withColumn("columndef",
    struct($"ordinalposition", concat($"columnname", lit(" "), $"datatype").as("cdef"))
  ).
  groupBy("tablename").agg(sort_array(collect_list($"columndef")).as("sortedlist")).
  withColumn("columndefs", concat_ws(", ", $"sortedlist.cdef")).
  drop("sortedlist").
  show(false)
// +---------+-----------------------------------------------------------+
// |tablename|columndefs |
// +---------+-----------------------------------------------------------+
// |table2 |IntColumn int, DecColumn decimal(15,2), StringColumn string|
// |table1 |IntColumn int, DecColumn decimal(15,2), StringColumn string|
// +---------+-----------------------------------------------------------+
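One caveat worth noting: ordinalposition is a string in the sample data above, so the structs are compared lexicographically ("10" would sort before "2"). Casting the position to an int inside the struct avoids that. The snippet below is a sketch of that variant, and it also shows how the aggregated columndefs could feed the CREATE TABLE statement the question mentions; the names pos, createstmt and createStatements are illustrative choices, not from the original answer.

// cast the position so that "10" sorts after "2", then build the DDL string
val createStatements = df.
  withColumn("columndef",
    struct($"ordinalposition".cast("int").as("pos"),
      concat($"columnname", lit(" "), $"datatype").as("cdef"))
  ).
  groupBy("tablename").
  agg(sort_array(collect_list($"columndef")).as("sortedlist")).
  withColumn("columndefs", concat_ws(", ", $"sortedlist.cdef")).
  withColumn("createstmt",
    concat(lit("CREATE TABLE "), $"tablename", lit(" ("), $"columndefs", lit(")"))).
  drop("sortedlist")

createStatements.show(false)

The result stays in a DataFrame, which matches the requirement of using it for comparisons rather than only generating DDL.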

Related

How to expand columns into individual timesteps in PostgreSQL

I have a table whose columns represent a time series. The data types are not important, but anything after timestep2 could potentially be NULL.
| id | timestep1 | timestep2 | timestep3 | timestep4 |
|----|-----------|-----------|-----------|-----------|
| a  | foo1      | bar1      | baz1      | qux1      |
| b  | foo2      | bar2      | baz2      | NULL      |
I am attempting to retrieve a view of the data that is more suitable for modeling. My modeling use case requires that I break each time series (row) into rows representing its individual "states" at each step. That is:
| id | timestep1 | timestep2 | timestep3 | timestep4 |
|----|-----------|-----------|-----------|-----------|
| a  | foo1      | NULL      | NULL      | NULL      |
| a  | foo1      | bar1      | NULL      | NULL      |
| a  | foo1      | bar1      | baz1      | NULL      |
| a  | foo1      | bar1      | baz1      | qux1      |
| b  | foo2      | NULL      | NULL      | NULL      |
| b  | foo2      | bar2      | NULL      | NULL      |
| b  | foo2      | bar2      | baz2      | NULL      |
How can I accomplish this in PostgreSQL?
Use UNION.
select id, timestep1, timestep2, timestep3, timestep4
from my_table
union
select id, timestep1, timestep2, timestep3, null
from my_table
union
select id, timestep1, timestep2, null, null
from my_table
union
select id, timestep1, null, null, null
from my_table
order by
id,
timestep2 nulls first,
timestep3 nulls first,
timestep4 nulls first
There is a more compact solution, maybe more convenient when dealing with a greater number of timesteps:
select distinct
id,
timestep1,
case when i > 1 then timestep2 end as timestep2,
case when i > 2 then timestep3 end as timestep3,
case when i > 3 then timestep4 end as timestep4
from my_table
cross join generate_series(1, 4) as i
order by
id,
timestep2 nulls first,
timestep3 nulls first,
timestep4 nulls first
Test it in Db<>fiddle.

Scala Spark: Convert struct column type to decimal type

I have a CSV stored in an S3 location which has data like this:
| column1 | column2 |
|---------+---------|
| adsf    | 2000.0  |
| fff     | 232.34  |
I have an AWS Glue job in Scala which reads this file into a DataFrame:
var srcDF = glueContext.getCatalogSource(database = "",
  tableName = "",
  redshiftTmpDir = "",
  transformationContext = "").getDynamicFrame().toDF()
When I print the schema, it is inferred like this:
srcDF.printSchema()
 |-- column1: string
 |-- column2: struct (double, string)
And the DataFrame looks like this:
| column1 | column2   |
|---------+-----------|
| adsf    | [2000.0,] |
| fff     | [232.34,] |
When I try to save the DataFrame to CSV, it complains:
org.apache.spark.sql.AnalysisException CSV data source does not support struct<double:double,string:string> data type.
How do I convert the DataFrame so that only the struct-type columns (if any exist) are converted to decimal type? The output should look like this:
| column1 | column2 |
|---------+---------|
| adsf    | 2000.0  |
| fff     | 232.34  |
Edit:
Thanks for the response. I have tried the following code:
df.select($"column2._1".alias("column2")).show()
But I got the same error for both approaches:
org.apache.spark.sql.AnalysisException No such struct field _1 in double, string;
Edit 2:
It seems that in Spark the struct fields were flattened and named "double" and "string".
So this solution worked for me:
df.select($"column2.double").show()
You can extract fields from a struct using getItem. The code can be something like this:
import spark.implicits._
import org.apache.spark.sql.functions.col  // getItem is a Column method, not a function import

val df = Seq(
  ("adsf", (2000.0, "")),
  ("fff", (232.34, ""))
).toDF("A", "B")

df.show()
df.select(col("A"), col("B").getItem("_1").as("B")).show()
it will print:
before select:
+----+----------+
| A| B|
+----+----------+
|adsf|[2000.0, ]|
| fff|[232.34, ]|
+----+----------+
after select:
+----+------+
| A| B|
+----+------+
|adsf|2000.0|
| fff|232.34|
+----+------+
You can also use the dot notation column2._1 to get the struct field by name:
val df = Seq(
  ("adsf", (2000.0, "")),
  ("fff", (232.34, ""))
).toDF("column1", "column2")
df.show
+-------+----------+
|column1| column2|
+-------+----------+
| adsf|[2000.0, ]|
| fff|[232.34, ]|
+-------+----------+
val df2 = df.select($"column1", $"column2._1".alias("column2"))
df2.show
+-------+-------+
|column1|column2|
+-------+-------+
| adsf| 2000.0|
| fff| 232.34|
+-------+-------+
df2.coalesce(1).write.option("header", "true").csv("output")
and your csv file will be in the output/ folder:
column1,column2
adsf,2000.0
fff,232.34
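If you need this to be generic, i.e. convert only the columns that actually are structs (when any exist) and cast them to a decimal type as the question asks, a schema-driven select applied to the df above (or to the original srcDF) might work. This is a sketch, not part of the answer above: it assumes the numeric value sits in the first field of each struct and picks decimal(15,2) arbitrarily.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// For every column: if it is a struct, pull out its first field and cast it to
// decimal; otherwise keep the column unchanged.
val flattened = df.select(df.schema.fields.map { f =>
  f.dataType match {
    case s: StructType => col(s"${f.name}.${s.fieldNames.head}").cast("decimal(15,2)").as(f.name)
    case _             => col(f.name)
  }
}: _*)

flattened.printSchema()
flattened.coalesce(1).write.option("header", "true").csv("output")

Since no struct columns remain after the select, the CSV writer no longer raises the AnalysisException.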

Join and combine tables to get common rows in a specific column together in Postgres

I have a couple of tables in a Postgres database. I have joined and merged the tables. However, I would like rows with common values in a specific column to appear together in the final table (in the end, I would like to perform a group-by and a maximum-value calculation on it).
The schema of the test tables looks like this:
Schema (PostgreSQL v11)
CREATE TABLE table1 (
  id CHARACTER VARYING NOT NULL,
  seq CHARACTER VARYING NOT NULL
);
INSERT INTO table1 (id, seq) VALUES
  ('UA502', 'abcdef'), ('UA503', 'ghijk'), ('UA504', 'lmnop');

CREATE TABLE table2 (
  id CHARACTER VARYING NOT NULL,
  score FLOAT
);
INSERT INTO table2 (id, score) VALUES
  ('UA502', 2.2), ('UA503', 2.6), ('UA504', 2.8);

CREATE TABLE table3 (
  id CHARACTER VARYING NOT NULL,
  seq CHARACTER VARYING NOT NULL
);
INSERT INTO table3 (id, seq) VALUES
  ('UA502', 'qrst'), ('UA503', 'uvwx'), ('UA504', 'yzab');

CREATE TABLE table4 (
  id CHARACTER VARYING NOT NULL,
  score FLOAT
);
INSERT INTO table4 (id, score) VALUES
  ('UA502', 8.2), ('UA503', 8.6), ('UA504', 8.8);
I performed join and union operations on the tables to get the desired columns.
Query #1
SELECT table1.id, table1.seq, table2.score
FROM table1 INNER JOIN table2 ON table1.id = table2.id
UNION
SELECT table3.id, table3.seq, table4.score
FROM table3 INNER JOIN table4 ON table3.id = table4.id
;
The output looks like this:
| id    | seq    | score |
| ----- | ------ | ----- |
| UA502 | qrst   | 8.2   |
| UA502 | abcdef | 2.2   |
| UA504 | yzab   | 8.8   |
| UA503 | uvwx   | 8.6   |
| UA504 | lmnop  | 2.8   |
| UA503 | ghijk  | 2.6   |
However, the desired output should be:
| id    | seq    | score |
| ----- | ------ | ----- |
| UA502 | qrst   | 8.2   |
| UA502 | abcdef | 2.2   |
| UA504 | yzab   | 8.8   |
| UA504 | lmnop  | 2.8   |
| UA503 | uvwx   | 8.6   |
| UA503 | ghijk  | 2.6   |
How should I modify my query to get the desired output?

Transpose rows to columns where transposed column changes based on another column

I want to transpose the rows to columns using the PIVOT function in Oracle and/or SQL Server. My use case is very similar to this: Efficiently convert rows to columns in sql server
However, I am organizing the data by specific data type (StringValue and NumericValue are shown below).
This is my example:
---------------------------------------------------------------
| Id | Person_ID | ColumnName    | StringValue | NumericValue |
---------------------------------------------------------------
| 1  | 1         | FirstName     | John        | (null)       |
| 2  | 1         | Amount        | (null)      | 100          |
| 3  | 1         | PostalCode    | (null)      | 112334       |
| 4  | 1         | LastName      | Smith       | (null)       |
| 5  | 1         | AccountNumber | (null)      | 123456       |
---------------------------------------------------------------
This is my desired result:
--------------------------------------------------------------
| FirstName | Amount | PostalCode | LastName | AccountNumber |
--------------------------------------------------------------
| John      | 100    | 112334     | Smith    | 123456        |
--------------------------------------------------------------
How can I build the SQL Query?
I have already tried using MAX(DECODE()) and CASE statements in Oracle, but the performance is very poor. I am looking to see whether the PIVOT function in Oracle and/or SQL Server can do this faster. Or should I go with a single value column?
The code below will satisfy your requirement:
Create table #test
(
    id int,
    person_id int,
    ColumnName varchar(50),
    StringValue varchar(50),
    numericValue varchar(50)
)

insert into #test values (1,1,'FirstName','John',null)
insert into #test values (2,1,'Amount',null,'100')
insert into #test values (3,1,'PostalCode',null,'112334')
insert into #test values (4,1,'LastName','Smith',null)
insert into #test values (5,1,'AccountNumber',null,'123456')
--select * from #test

Declare @Para varchar(max)='',
        @Para1 varchar(max)='',
        @main varchar(max)=''

select @Para += ','+QUOTENAME(ColumnName)
from (select distinct ColumnName from #test) as P

set @Para1 = stuff(@Para,1,1,'')
print @Para1

set @main = 'select * from (
select coalesce(StringValue,numericValue) as Val,ColumnName from #test) as Main
pivot
(
min(val) for ColumnName in ('+@Para1+')
) as pvt'

Exec(@main)

Spark Scala: Joining 2 tables and extracting the data with max date (please see description)

I want to join two tables A and B and pick the records having max date from table B for each value.
Consider the following tables:
Table A:
+---+-----+----------+
| id|Value|start_date|
+---+-----+----------+
|  1|    a|  1/1/2018|
|  2|    a|  4/1/2018|
|  3|    a|  8/1/2018|
|  4|    c|  1/1/2018|
|  5|    d|  1/1/2018|
|  6|    e|  1/1/2018|
+---+-----+----------+
Table B:
+---+-----+----------+
|Key|Value|sent_date |
+---+---- +----------+
| x | a | 2/1/2018 |
| y | a | 7/1/2018 |
| z | a | 11/1/2018|
| p | c | 5/1/2018 |
| q | d | 5/1/2018 |
| r | e | 5/1/2018 |
+---+-----+----------+
The aim is to bring in column id from Table A to Table B for each value in Table B.
For this, tables A and B need to be joined on the column Value, and for each record in B, the row in A with max(A.start_date) subject to the condition A.start_date < B.sent_date is found.
Let's consider value=a here.
In Table A, we can see 3 records for Value=a with 3 different start_dates.
So when joining with Table B, for value=a with sent_date=2/1/2018, the record with max(start_date) among the start_dates that are less than the sent_date is taken (in this case 1/1/2018), and the corresponding value in column A.id is pulled into Table B.
Similarly, for the record with value=a and sent_date=11/1/2018 in Table B, id=3 from Table A needs to be pulled into Table B.
The result must be as follows:
+---+-----+----------+---+
|Key|Value| sent_date| id|
+---+-----+----------+---+
|  x|    a|  2/1/2018|  1|
|  y|    a|  7/1/2018|  2|
|  z|    a| 11/1/2018|  3|
|  p|    c|  5/1/2018|  4|
|  q|    d|  5/1/2018|  5|
|  r|    e|  5/1/2018|  6|
+---+-----+----------+---+
I am using Spark 2.3.
I have joined the two tables (using DataFrames) and found the max(start_date) based on the condition.
But I am unable to figure out how to pull the records from there.
Can anyone help me out?
Thanks in advance!!
I just changed the date "11/1/2018" to "9/1/2018", as string sorting gives incorrect results; when the column is converted to a proper date, the logic still works. See below:
scala> val df_a = Seq((1,"a","1/1/2018"),
| (2,"a","4/1/2018"),
| (3,"a","8/1/2018"),
| (4,"c","1/1/2018"),
| (5,"d","1/1/2018"),
| (6,"e","1/1/2018")).toDF("id","value","start_date")
df_a: org.apache.spark.sql.DataFrame = [id: int, value: string ... 1 more field]
scala> val df_b = Seq(("x","a","2/1/2018"),
| ("y","a","7/1/2018"),
| ("z","a","9/1/2018"),
| ("p","c","5/1/2018"),
| ("q","d","5/1/2018"),
| ("r","e","5/1/2018")).toDF("key","valueb","sent_date")
df_b: org.apache.spark.sql.DataFrame = [key: string, valueb: string ... 1 more field]
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val df_join = df_b.join(df_a,'valueb==='valuea,"inner")
df_join: org.apache.spark.sql.DataFrame = [key: string, valueb: string ... 4 more fields]
scala> df_join.filter('sent_date >= 'start_date).withColumn("rank", rank().over(Window.partitionBy('key,'valueb,'sent_date).orderBy('start_date.desc))).filter('rank===1).drop("valuea","start_date","rank").show()
+---+------+---------+---+
|key|valueb|sent_date| id|
+---+------+---------+---+
| q| d| 5/1/2018| 5|
| p| c| 5/1/2018| 4|
| r| e| 5/1/2018| 6|
| x| a| 2/1/2018| 1|
| y| a| 7/1/2018| 2|
| z| a| 9/1/2018| 3|
+---+------+---------+---+
scala>
UPDATE
Below is a UDF to handle date strings in MM/dd/yyyy-style formats:
scala> def dateConv(x:String):String=
| {
| val y = x.split("/").map(_.toInt).map("%02d".format(_))
| y(2)+"-"+y(0)+"-"+y(1)
| }
dateConv: (x: String)String
scala> val udfdateconv = udf( dateConv(_:String):String )
udfdateconv: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df_a_dt = df_a.withColumn("start_date",date_format(udfdateconv('start_date),"yyyy-MM-dd").cast("date"))
df_a_dt: org.apache.spark.sql.DataFrame = [id: int, valuea: string ... 1 more field]
scala> df_a_dt.printSchema
root
|-- id: integer (nullable = false)
|-- valuea: string (nullable = true)
|-- start_date: date (nullable = true)
scala> df_a_dt.show()
+---+------+----------+
| id|valuea|start_date|
+---+------+----------+
| 1| a|2018-01-01|
| 2| a|2018-04-01|
| 3| a|2018-08-01|
| 4| c|2018-01-01|
| 5| d|2018-01-01|
| 6| e|2018-01-01|
+---+------+----------+
scala>
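As a side note: since the question mentions Spark 2.3, the built-in to_date function with a format pattern can replace the UDF for this conversion. A minimal sketch, reusing df_a from above (the name df_a_dt2 is just illustrative, and the snippet assumes spark-shell or an imported spark.implicits._):

import org.apache.spark.sql.functions.to_date

// "M/d/yyyy" accepts one- or two-digit month/day, e.g. "1/1/2018" and "11/1/2018"
val df_a_dt2 = df_a.withColumn("start_date", to_date($"start_date", "M/d/yyyy"))
df_a_dt2.printSchema()  // start_date is now a proper date column, so comparisons order correctly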