How to use an UPDATE query in psql - PostgreSQL

I have a table (table name: test_table):
+------+
| Col1 |
+------+
| a1   |
| b1   |
| b1   |
| c1   |
| c1   |
| c1   |
+------+
I want to delete the duplicate rows, keeping one row from each group of duplicates, like this:
+------+
| Col1 |
+------+
| a1   |
| b1   |
| c1   |
+------+
So my plan was to:
1. assign a row number to each row, then
2. delete the duplicates by row number.
However, this query fails to assign distinct row numbers:
ALTER TABLE test_table ADD COLUMN row_num INTEGER;
UPDATE test_table SET row_num = subquery.row_num
FROM (SELECT ROW_NUMBER() OVER () AS row_num
FROM test_table) AS subquery;
The result is below:
+------+---------+
| Col1 | row_num |
+------+---------+
| a1   | 1       |
| b1   | 1       |
| b1   | 1       |
| c1   | 1       |
| c1   | 1       |
| c1   | 1       |
+------+---------+
What do I need to change to get this instead?
+------+---------+
| Col1 | row_num |
+------+---------+
| a1   | 1       |
| b1   | 2       |
| b1   | 3       |
| c1   | 4       |
| c1   | 5       |
| c1   | 6       |
+------+---------+
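For what it's worth, the UPDATE above writes 1 everywhere because the subquery is not correlated with the row being updated, so each target row matches an arbitrary subquery row; in PostgreSQL the subquery can expose the hidden ctid column and be joined back on it. The end goal itself (keep one row per group of duplicates) can be sketched with Python's sqlite3, where rowid plays the per-row-identifier role that ctid plays in Postgres; table and column names follow the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE test_table (Col1 TEXT)")
cur.executemany("INSERT INTO test_table VALUES (?)",
                [("a1",), ("b1",), ("b1",), ("c1",), ("c1",), ("c1",)])

# Keep only the row with the smallest rowid in each group of duplicates.
cur.execute("""
    DELETE FROM test_table
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM test_table GROUP BY Col1)
""")

print([r[0] for r in cur.execute("SELECT Col1 FROM test_table ORDER BY Col1")])
# ['a1', 'b1', 'c1']
```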

Related

How to replace null values in a dataframe based on values in other dataframe?

Here's a dataframe, df1, that I have:
+------+------+------+
| C1   | C2   | C3   |
+------+------+------+
| xr   | 1    | ixfg |
| we   | 5    | jsfd |
| r5   | 7    | hfga |
| by   | 8    | srjs |
| v4   | 4    | qwks |
| c0   | 0    | khfd |
| ba   | 2    | gdbu |
| mi   | 1    | pdlo |
| lp   | 7    | ztpq |
+------+------+------+
Here's another, df2, that I have:
+------+------+------+
| V1   | V2   | V3   |
+------+------+------+
| Null | 6    | ixfg |
| Null | 2    | jsfd |
| Null | 2    | hfga |
| Null | 7    | qwks |
| Null | 1    | khfd |
| Null | 9    | gdbu |
+------+------+------+
What I would like to have is another dataframe that:
1. ignores the values in V2 and takes the values in C2 wherever V3 and C3 match, and
2. replaces V1 with the values in C1 wherever V3 and C3 match.
The result should look like the following:
+------+------+------+
| M1   | M2   | M3   |
+------+------+------+
| xr   | 1    | ixfg |
| we   | 5    | jsfd |
| r5   | 7    | hfga |
| v4   | 4    | qwks |
| c0   | 0    | khfd |
| ba   | 2    | gdbu |
+------+------+------+
You can join and use coalesce to take the value with the higher priority.
Note: coalesce takes any number of columns (highest priority to lowest, in argument order) and returns the first non-null value. So if you actually want a null in the higher-priority column to be preserved as null, rather than replaced by the lower-priority column, you cannot use this function.
from pyspark.sql import functions as F

df = (df1.join(df2, on=(df1.C3 == df2.V3))
         .select(F.coalesce(df1.C1, df2.V1).alias('M1'),
                 F.coalesce(df1.C2, df2.V2).alias('M2'),
                 df1.C3.alias('M3')))
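Since a Spark cluster is not always at hand, here is a plain-Python sketch of the same join-plus-coalesce logic, matching the expected output in the question (C2 takes priority for M2). Dicts stand in for dataframe rows, a three-row slice of df1 keeps it short, and the coalesce helper is a local stand-in mirroring SQL semantics:

```python
df1 = [{"C1": "xr", "C2": 1, "C3": "ixfg"},
       {"C1": "we", "C2": 5, "C3": "jsfd"},
       {"C1": "by", "C2": 8, "C3": "srjs"}]  # srjs has no match in df2
df2 = [{"V1": None, "V2": 6, "V3": "ixfg"},
       {"V1": None, "V2": 2, "V3": "jsfd"}]

def coalesce(*vals):
    """Return the first non-None value, like SQL/Spark COALESCE."""
    return next((v for v in vals if v is not None), None)

by_v3 = {row["V3"]: row for row in df2}  # index df2 by the join key
result = [{"M1": coalesce(r["C1"], by_v3[r["C3"]]["V1"]),
           "M2": coalesce(r["C2"], by_v3[r["C3"]]["V2"]),
           "M3": r["C3"]}
          for r in df1 if r["C3"] in by_v3]  # inner join drops srjs
print(result)
# [{'M1': 'xr', 'M2': 1, 'M3': 'ixfg'}, {'M1': 'we', 'M2': 5, 'M3': 'jsfd'}]
```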

Pyspark left joins dataframes using incorrect joining key values

I have 2 Spark dataframes:
df1 with columns customerid, salary
df2 with columns customerid2, education
Example:
df1
| customerid | salary |
|------------|--------|
| C1         | 120    |
| C2         | 90     |
| C3         | 90     |
| C4         | 100    |
df2
| customerid2 | education  |
|-------------|------------|
| C1          | BA         |
| C2          | BS         |
| C5          | PhD        |
| C4          | BS Physics |
I want a new dataframe named df_new that left-joins df1 with df2 on the joining keys customerid and customerid2, using the following code:
df_new = df1.join(df2, on=(df1.customerid == df2.customerid2), how='left')
Expected output:
df_new
| customerid | salary | customerid2 | education  |
|------------|--------|-------------|------------|
| C1         | 120    | C1          | BA         |
| C2         | 90     | C2          | BS         |
| C3         | 90     | NULL        | NULL       |
| C4         | 100    | C4          | BS Physics |
Current output:
df_new
| customerid | salary | customerid2 | education  |
|------------|--------|-------------|------------|
| C1         | 120    | C1          | BA         |
| C2         | 90     | C5          | PhD        | <-- issue in this line
| C3         | 90     | NULL        | NULL       |
| C4         | 100    | C4          | BS Physics |
The issue is that for some of the records, the join matches the two tables even though the customer ID values are different.
I'd appreciate a response from this great community on this very rare issue.
Taking your data as an example, it generates the expected output you posted:
>>> columns2 = ["customerid2","education"]
>>> data2=[("c1","BA"),("c2","BS"),("c5","phD"),("c4","BS Physics")]
>>> rdd2=sc.parallelize(data2)
>>> df2=rdd2.toDF(columns2)
>>> columns = ["customerid","salary"]
>>> data=[("c1","120"),("c2","90"),("c3","90"),("c4","100")]
>>> rdd=sc.parallelize(data)
>>> df1=rdd.toDF(columns)
>>> df_new = df1.join(df2,df1.customerid == df2.customerid2,"leftouter")
>>> df_new.show()
+----------+------+-----------+----------+
|customerid|salary|customerid2| education|
+----------+------+-----------+----------+
| c1| 120| c1| BA|
| c4| 100| c4|BS Physics|
| c3| 90| null| null|
| c2| 90| c2| BS|
+----------+------+-----------+----------+
Can you check whether any of the data contains leading or trailing spaces?
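To make the whitespace suggestion concrete: keys that print identically can still fail an equality join. A tiny plain-Python illustration (in PySpark you would trim the key column, e.g. with pyspark.sql.functions.trim, before joining):

```python
# Join keys that print identically can differ by invisible whitespace.
left_key = "C2"
right_key = "C2 "  # e.g. a trailing space picked up during ingestion

print(left_key == right_key)          # False: an equality join misses this pair
print(left_key == right_key.strip())  # True once the whitespace is stripped
```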

Interpretation of rows in Phoenix SYSTEM.CATALOG

When I create a Phoenix table, there are two extra rows in SYSTEM.CATALOG. These are the first and second rows in the output of SELECT * FROM SYSTEM.CATALOG ...... below. Can someone please help me understand what these two rows signify?
The third and fourth rows in that output are easily relatable to the CREATE TABLE statement, so they look fine.
0: jdbc:phoenix:t40aw2.gaq> CREATE TABLE C5 (company_id INTEGER PRIMARY KEY, name VARCHAR(225));
No rows affected (4.618 seconds)
0: jdbc:phoenix:t40aw2.gaq> select * from C5;
+-------------+-------+
| COMPANY_ID | NAME |
+-------------+-------+
+-------------+-------+
No rows selected (0.085 seconds)
0: jdbc:phoenix:t40aw2.gaq> SELECT * FROM SYSTEM.CATALOG WHERE TABLE_NAME='C5';
+------------+--------------+-------------+--------------+----------------+----------------+-------------+----------+---------------+---------------+------------------+--------------+----+
| TENANT_ID | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME | COLUMN_FAMILY | TABLE_SEQ_NUM | TABLE_TYPE | PK_NAME | COLUMN_COUNT | SALT_BUCKETS | DATA_TABLE_NAME | INDEX_STATE | IM |
+------------+--------------+-------------+--------------+----------------+----------------+-------------+----------+---------------+---------------+------------------+--------------+----+
| | | C5 | | | 0 | u | | 2 | null | | | fa |
| | | C5 | | 0 | null | | | null | null | | | |
| | | C5 | COMPANY_ID | | null | | | null | null | | | |
| | | C5 | NAME | 0 | null | | | null | null | | | |
+------------+--------------+-------------+--------------+----------------+----------------+-------------+----------+---------------+---------------+------------------+--------------+----+
4 rows selected (0.557 seconds)
0: jdbc:phoenix:t40aw2.gaq>
The Phoenix version I am using is: 4.1.8.29
Kindly note that no other operations were done on the table other than the 3 listed above, namely: create table, select * from the table, and select * from SYSTEM.CATALOG where TABLE_NAME = the concerned table name.

Replace null by negative id number in not consecutive rows in hive

I have this table in my database:
| id   | desc |
|------|------|
| 1    | A    |
| 2    | B    |
| NULL | C    |
| 3    | D    |
| NULL | D    |
| NULL | E    |
| 4    | F    |
And I want to transform it into a table that replaces the nulls with consecutive negative ids:
| id   | desc |
|------|------|
| 1    | A    |
| 2    | B    |
| -1   | C    |
| 3    | D    |
| -2   | D    |
| -3   | E    |
| 4    | F    |
Does anyone know how I can do this in Hive?
The approach below works. PARTITION BY id puts all the NULL ids into one partition, so ROW_NUMBER counts 1, 2, 3 over them, while non-null ids are kept by coalesce. Note that desc is a reserved word in Hive and needs backticks, and the concatenated value should be cast back to int:
select coalesce(id, cast(concat('-', row_number() over (partition by id)) as int)) as id, `desc`
from database_name.table_name;
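The same idea can be exercised locally with Python's sqlite3 (SQLite 3.25+ for window functions). In this sketch the column is renamed descr because desc is reserved in SQLite too, and rowid supplies the original row order; both are conveniences of the sketch, not part of the Hive answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER, descr TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [(1, "A"), (2, "B"), (None, "C"), (3, "D"),
                 (None, "D"), (None, "E"), (4, "F")])

# PARTITION BY id groups all NULL ids together, so ROW_NUMBER counts
# 1, 2, 3 over them; negation turns that into -1, -2, -3.
rows = cur.execute("""
    SELECT COALESCE(id, -(ROW_NUMBER() OVER (PARTITION BY id ORDER BY rowid))) AS id,
           descr
    FROM t
    ORDER BY rowid
""").fetchall()
print(rows)
# [(1, 'A'), (2, 'B'), (-1, 'C'), (3, 'D'), (-2, 'D'), (-3, 'E'), (4, 'F')]
```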

calculate sum of sal based on two tables

I have two tables:
table1
+----+-------+-----+
| id | sname | sal |
+----+-------+-----+
| 1  | X     | 100 |
| 2  | Y     | 200 |
| 3  | Z     | 400 |
+----+-------+-----+
table2
+----+-------+-----+
| id | sname | sal |
+----+-------+-----+
| 1  | A     | 500 |
| 2  | B     | 200 |
| 3  | C     | 400 |
| 4  | A     | 100 |
+----+-------+-----+
Both tables are related through the id column. I need to calculate the sum of sal grouped by table1.sname, matched against the corresponding rows in table2. The output should look like:
+--------------+--------------+-----+
| Table1.sname | Table2.sname | sum |
+--------------+--------------+-----+
| A            | W            | 600 |
| B            | Y            | 200 |
| B            | F            | 300 |
| C            | Z            | 400 |
+--------------+--------------+-----+
I tried this:
select sum(sal), a.sname, b.sname
from table1 a,
     (select id, sname from table2 group by sname, id) as b
where a.id = b.id
group by a.sname, b.sname;
but it's not giving the proper output.
Your question is a little ambiguous, but maybe you want this.
query
select Table1.id, Table1.sname, Table2.sname, (Table1.sal + Table2.sal) as sum
from Table1, Table2
where Table1.id = Table2.id;
result
 Table1.id | Table1.sname | Table2.sname | sum
-----------+--------------+--------------+-----
         1 | a            | d            | 500
         2 | b            | e            | 700
         3 | c            | f            | 900
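The expected output in the question does not quite line up with its sample data, but the inner-join-and-add pattern from the answer can be checked with Python's sqlite3 using the question's table1/table2 rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (id INTEGER, sname TEXT, sal INTEGER)")
cur.execute("CREATE TABLE table2 (id INTEGER, sname TEXT, sal INTEGER)")
cur.executemany("INSERT INTO table1 VALUES (?, ?, ?)",
                [(1, "X", 100), (2, "Y", 200), (3, "Z", 400)])
cur.executemany("INSERT INTO table2 VALUES (?, ?, ?)",
                [(1, "A", 500), (2, "B", 200), (3, "C", 400), (4, "A", 100)])

# Inner join on id; per-pair sum of the two sal columns.
# id 4 is dropped because table1 has no matching row.
rows = cur.execute("""
    SELECT a.sname, b.sname, a.sal + b.sal AS total
    FROM table1 a JOIN table2 b ON a.id = b.id
    ORDER BY a.id
""").fetchall()
print(rows)
# [('X', 'A', 600), ('Y', 'B', 400), ('Z', 'C', 800)]
```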