How can I reverse the content of a table? - tsql

I need a general procedure that can reverse the content of a table.
Basically I have a ragged table (a dimension table for an OLAP DB) built from the top level down, and I need to convert it into a bottom-level perspective.
How can I do it?
+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | Col5 |
+------+------+------+------+------+
| 1    | 9765 | 1234 | A    |      |
| 2    | 9765 | 1235 | A    |      |
| 3    | 9765 | 1235 |      |      |
| 4    | 9764 | 4567 | 789  | A1   |
| 5    | 9764 |      |      |      |
| 6    | 9764 | 4568 | 3453 | A2   |
+------+------+------+------+------+
+------+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 | Col5 |
+------+------+------+------+------+
| A    | 1234 | 9765 | 1    |      |
| A    | 1235 | 9765 | 2    |      |
| 1235 | 9765 | 3    |      |      |
| A1   | 789  | 4567 | 9764 | 4    |
| 9764 | 5    |      |      |      |
| A2   | 3453 | 4568 | 9764 | 6    |
+------+------+------+------+------+

If I understand your example tables correctly, you can insert the columns in reverse order if you are creating a new table with the data, or you can simply give the fields new names in the query results by providing an alias with the AS keyword.
Both are shown together below, to be concise and because they do not conflict with one another. If you only need to select the data with different column names, remove the INSERT clause; if you are creating a new table, you may omit the aliases.
INSERT INTO table2
    ( Col5, Col4, Col3, Col2, Col1 )
SELECT
      Col1 AS Col5
    , Col2 AS Col4
    , Col3
    , Col4 AS Col2
    , Col5 AS Col1
FROM table1

Related

How to use an UPDATE query in psql

I have a table (table name: test_table):
+------+
| Col1 |
+------+
| a1   |
| b1   |
| b1   |
| c1   |
| c1   |
| c1   |
+------+
I want to delete the duplicate rows, keeping only one row from each group of duplicates, like this:
+------+
| Col1 |
+------+
| a1   |
| b1   |
| c1   |
+------+
So my plan was to:
1. add a row number, then
2. delete the duplicates by row number.
But I failed to generate the row numbers with this query:
ALTER TABLE test_table ADD COLUMN row_num INTEGER;

UPDATE test_table SET row_num = subquery.row_num
FROM (SELECT ROW_NUMBER() OVER () AS row_num
      FROM test_table) AS subquery;
The result is below:
+------+---------+
| Col1 | row_num |
+------+---------+
| a1   | 1       |
| b1   | 1       |
| b1   | 1       |
| c1   | 1       |
| c1   | 1       |
| c1   | 1       |
+------+---------+
What part needs to change to get a result like this?
+------+---------+
| Col1 | row_num |
+------+---------+
| a1   | 1       |
| b1   | 2       |
| b1   | 3       |
| c1   | 4       |
| c1   | 5       |
| c1   | 6       |
+------+---------+
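One way this could be fixed is to correlate the subquery with the rows being updated, for example via PostgreSQL's ctid system column; here is a minimal sketch (the ORDER BY used for numbering is only illustrative):

-- Match each physical row to its own ROW_NUMBER() via ctid,
-- instead of letting every row pick up the first value.
UPDATE test_table AS t
SET row_num = s.rn
FROM (SELECT ctid, ROW_NUMBER() OVER (ORDER BY Col1) AS rn
      FROM test_table) AS s
WHERE t.ctid = s.ctid;

-- With distinct row numbers in place, step 2 of the plan can keep the
-- lowest row_num per Col1 value and delete the rest:
DELETE FROM test_table a
USING test_table b
WHERE a.Col1 = b.Col1
  AND a.row_num > b.row_num;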

How to combine pyspark dataframes with different shapes and different columns

I have two dataframes in Pyspark. One has more than 1000 rows and the other has only 4 rows. The columns do not match either.
df1 with more than 1000 rows:
+----+--------+-------------+-------------+
| ID | col1   | col2        | col 3       |
+----+--------+-------------+-------------+
| 1  | time1  | value_col2  | value_col3  |
| 2  | time 2 | value2_col2 | value2_col3 |
+----+--------+-------------+-------------+
...
df2 with only 4 rows:
+-----+-------------+-------------+
| key | col_c       | col_d       |
+-----+-------------+-------------+
| a   | valuea_colc | valuea_cold |
| b   | valueb_colc | valueb_cold |
+-----+-------------+-------------+
I want to create a dataframe looking like this:
+----+--------+-------------+-------------+-------------+-------------+-------------+-------------+
| ID | col1   | col2        | col 3       | a_col_c     | a_col_d     | b_col_c     | b_col_d     |
+----+--------+-------------+-------------+-------------+-------------+-------------+-------------+
| 1  | time1  | value_col2  | value_col3  | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
| 2  | time 2 | value2_col2 | value2_col3 | valuea_colc | valuea_cold | valueb_colc | valueb_cold |
+----+--------+-------------+-------------+-------------+-------------+-------------+-------------+
Can you please help with this? I prefer not to use Pandas.
Thank you!
I actually figured this out using crossJoin.
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html explains how to use crossJoin with Pyspark DataFrames.
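A minimal sketch of that idea, assuming df2 is first pivoted so each key contributes its own <key>_col_c / <key>_col_d columns before the cross join (variable names are illustrative, and "col 3" is written col3 here):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(1, "time1", "value_col2", "value_col3"),
     (2, "time 2", "value2_col2", "value2_col3")],
    ["ID", "col1", "col2", "col3"])

df2 = spark.createDataFrame(
    [("a", "valuea_colc", "valuea_cold"),
     ("b", "valueb_colc", "valueb_cold")],
    ["key", "col_c", "col_d"])

# Pivot df2 into a single wide row: one <key>_col_c / <key>_col_d pair per key.
wide = (df2.groupBy()
           .pivot("key")
           .agg(F.first("col_c").alias("col_c"),
                F.first("col_d").alias("col_d")))

# crossJoin attaches that single wide row to every row of df1.
result = df1.crossJoin(wide)
result.show(truncate=False)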

Get the row corresponding to the latest timestamp in a Spark Dataset using Scala

I am relatively new to Spark and Scala. I have a dataframe which has the following format:
| Col1 | Col2 | Col3 | Col_4 | Col_5 | Col_TS                  | Col_7 |
| 1234 | AAAA | 1111 | afsdf | ewqre | 1970-01-01 00:00:00.0   | false |
| 1234 | AAAA | 1111 | ewqrw | dafda | 2017-01-17 07:09:32.748 | true  |
| 1234 | AAAA | 1111 | dafsd | afwew | 2015-01-17 07:09:32.748 | false |
| 5678 | BBBB | 2222 | afsdf | qwerq | 1970-01-01 00:00:00.0   | true  |
| 5678 | BBBB | 2222 | bafva | qweqe | 2016-12-08 07:58:43.04  | false |
| 9101 | CCCC | 3333 | caxad | fsdaa | 1970-01-01 00:00:00.0   | false |
What I need to do is to get the row that corresponds to the latest timestamp.
In the example above, the keys are Col1, Col2 and Col3. Col_TS represents the timestamp and Col_7 is a boolean that determines the validity of the record.
What I want to do is to find a way to group these records based on the keys and retain the one that has the latest timestamp.
So the output of the operation in the dataframe above should be:
| Col1 | Col2 | Col3 | Col_4 | Col_5 | Col_TS                  | Col_7 |
| 1234 | AAAA | 1111 | ewqrw | dafda | 2017-01-17 07:09:32.748 | true  |
| 5678 | BBBB | 2222 | bafva | qweqe | 2016-12-08 07:58:43.04  | false |
| 9101 | CCCC | 3333 | caxad | fsdaa | 1970-01-01 00:00:00.0   | false |
I came up with a partial solution, but it only returns the key columns the records are grouped on, not the other columns.
df = df.groupBy("Col1","Col2","Col3").agg(max("Col_TS"))
| Col1 | Col2 | Col3 | max(Col_TS)             |
| 1234 | AAAA | 1111 | 2017-01-17 07:09:32.748 |
| 5678 | BBBB | 2222 | 2016-12-08 07:58:43.04  |
| 9101 | CCCC | 3333 | 1970-01-01 00:00:00.0   |
Can someone help me come up with Scala code to perform this operation?
You can use a window function, as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val windowSpec = Window.partitionBy("Col1", "Col2", "Col3").orderBy(col("Col_TS").desc)

df.withColumn("maxTS", first("Col_TS").over(windowSpec))
  .select("*").where(col("maxTS") === col("Col_TS"))
  .drop("maxTS")
  .show(false)
You should get output like the following:
+----+----+----+-----+-----+-----------------------+-----+
|Col1|Col2|Col3|Col_4|Col_5|Col_TS                 |Col_7|
+----+----+----+-----+-----+-----------------------+-----+
|5678|BBBB|2222|bafva|qweqe|2016-12-08 07:58:43.04 |false|
|1234|AAAA|1111|ewqrw|dafda|2017-01-17 07:09:32.748|true |
|9101|CCCC|3333|caxad|fsdaa|1970-01-01 00:00:00.0  |false|
+----+----+----+-----+-----+-----------------------+-----+
One option is to first order the data frame by Col_TS, then group by Col1, Col2 and Col3 and take the last item of each remaining column:
import org.apache.spark.sql.functions.{col, last}

val val_columns = Seq("Col_4", "Col_5", "Col_TS", "Col_7").map(x => last(col(x)).alias(x))

df.orderBy("Col_TS")
  .groupBy("Col1", "Col2", "Col3")
  .agg(val_columns.head, val_columns.tail: _*)
  .show()
+----+----+----+-----+-----+--------------------+-----+
|Col1|Col2|Col3|Col_4|Col_5|              Col_TS|Col_7|
+----+----+----+-----+-----+--------------------+-----+
|1234|AAAA|1111|ewqrw|dafda|2017-01-17 07:09:...| true|
|9101|CCCC|3333|caxad|fsdaa|1970-01-01 00:00:...|false|
|5678|BBBB|2222|bafva|qweqe|2016-12-08 07:58:...|false|
+----+----+----+-----+-----+--------------------+-----+

Assign a constant integer to existing newly inserted values

I'm looking for an intelligent sequence generator that can assign a constant int value to a column when a given string value already exists in the table. The scenario is as below:
_______________________
| Col1 | Col2 | Col 3 |
|---------------------|
| a    | a    | 1     |
| b    | a    | 1     |
| c    | a    | 1     |
| u    | b    | 2     |
| v    | b    | 2     |
| w    | b    | 2     |
-----------------------
Let's say I insert another row with the values ('d', 'a') for Col1 and Col2. I want Col3 to become 1 automatically, because the Col3 value corresponding to 'a' already exists as 1, so the table becomes as seen below:
_______________________
| Col1 | Col2 | Col 3 |
|---------------------|
| a    | a    | 1     |
| b    | a    | 1     |
| c    | a    | 1     |
| u    | b    | 2     |
| v    | b    | 2     |
| w    | b    | 2     |
| d    | a    | 1     |
-----------------------
Is there a way I can define this in CREATE TABLE so that Col3 is populated automatically when values are inserted into Col1 and Col2?
Edit:
The scenario is something like this:
__________________________________________________
| Col1                        | Col2      | Col 3 |
|------------------------------------------------|
| Adobe                       | Adobe     | 1     |
| Adobe Systems               | Adobe     | 1     |
| Adobe Systems Inc           | Adobe     | 1     |
| Honeywell                   | Honeywell | 2     |
| Honeywell Inc               | Honeywell | 2     |
| Honeywell Inc.              | Honeywell | 2     |
--------------------------------------------------
And when I add new data, I would like it to be:
__________________________________________________
| Col1                        | Col2      | Col 3 |
|------------------------------------------------|
| Adobe                       | Adobe     | 1     |
| Adobe Systems               | Adobe     | 1     |
| Adobe Systems Inc           | Adobe     | 1     |
| Honeywell                   | Honeywell | 2     |
| Honeywell Inc               | Honeywell | 2     |
| Honeywell Inc.              | Honeywell | 2     |
| Adobe Systems Incorporated  | Adobe     | 1     |
--------------------------------------------------
The Col3 value has to be an integer for faster joins with other tables. I will insert values for Col1 & Col2, and the corresponding value should be available in Col3 upon insert.
Just normalize it:
create table corporation (
    corporation_id serial,
    short_name text
);

insert into corporation (short_name) values
    ('Adobe'), ('Honeywell');

select * from corporation;

 corporation_id | short_name
----------------+------------
              1 | Adobe
              2 | Honeywell
Now your table is:
| Adobe                       | 1
| Adobe Systems               | 1
| Adobe Systems Inc           | 1
| Honeywell                   | 2
| Honeywell Inc               | 2
| Honeywell Inc.              | 2
| Adobe Systems Incorporated  | 1
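To make that concrete, here is a sketch of how a new long name could be inserted while reusing the id already assigned to its short name (the company_alias table and its column names are assumptions for illustration, not part of the schema above):

-- Illustrative alias table; corporation_id refers to corporation.corporation_id
-- (declare corporation_id as the primary key of corporation if you want an
-- actual foreign key constraint here).
create table company_alias (
    long_name      text primary key,
    corporation_id integer
);

-- Insert a new alias by looking up the id already assigned to 'Adobe'.
insert into company_alias (long_name, corporation_id)
select 'Adobe Systems Incorporated', corporation_id
from corporation
where short_name = 'Adobe';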
The Col3 value has to be an integer for faster joins with other tables.
So you need unique, but not necessarily sequential, values. You can use a hash function to map strings into unique integers, e.g.:
postgres=# select hashtext('Adobe');
 hashtext
-----------
 173079840
(1 row)

postgres=# select hashtext('Honeywell');
  hashtext
------------
 -453308048
(1 row)
With such a function, you avoid the need to look up existing values in the table.
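For instance (a sketch; the table holding Col1/Col2/Col3 is called my_table here purely for illustration):

-- Col3 is derived directly from Col2, with no lookup against existing rows.
insert into my_table (Col1, Col2, Col3)
values ('Adobe Systems Incorporated', 'Adobe', hashtext('Adobe'));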
But are you really sure you will have performance problems using strings for foreign keys? You should test it on your data, I think. (Also test on upcoming 9.5, with its abbreviated keys feature.)

Merge multiple tables with a common column name

I am trying to merge multiple tables that have a common column name, where the values need not be the same across the tables. For example:
-tmp1-
id dat
1 234
2 432
3 412
-tmp2-
id nom
1 jim
2
3 ryan
4 jack
-tmp3-
id pin
1 gi23
2 x4ed
3 yit42
8 hiu11
If the above are the inputs, the output needs to be:
id  dat  nom   pin
1   234  jim   gi23
2   432        x4ed
3   412  ryan  yit42
4        jack
8              hiu11
Thanks in advance.
PostgreSQL 8.2.15 on Greenplum, accessed from R (pass-through queries).
Use the FULL JOIN ... USING (id) syntax.
Please see this example: http://sqlfiddle.com/#!12/3aff2/1
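For the tables in the question, the query could look like this sketch (the ORDER BY is only there to make the output easier to read):

SELECT id, dat, nom, pin
FROM tmp1
FULL JOIN tmp2 USING (id)
FULL JOIN tmp3 USING (id)
ORDER BY id;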
This is how the different join types work (provided that tab1.row3 meets the joining condition with tab2.row1, and tab1.row4 meets it with tab2.row2):
| tab1 |   | tab2 |   | JOIN                  |   | LEFT JOIN             |   | RIGHT JOIN            |   | FULL JOIN             |
--------   --------   -------------------------   -------------------------   -------------------------   -------------------------
| row1 |                                           | tab1.row1 |                                           | tab1.row1 |
| row2 |                                           | tab1.row2 |                                           | tab1.row2 |
| row3 |   | row1 |   | tab1.row3 | tab2.row1 |   | tab1.row3 | tab2.row1 |   | tab1.row3 | tab2.row1 |   | tab1.row3 | tab2.row1 |
| row4 |   | row2 |   | tab1.row4 | tab2.row2 |   | tab1.row4 | tab2.row2 |   | tab1.row4 | tab2.row2 |   | tab1.row4 | tab2.row2 |
           | row3 |                                                           | tab2.row3 |               | tab2.row3 |
           | row4 |                                                           | tab2.row4 |               | tab2.row4 |