I want to transpose columns into rows (without using UNION):
+------------+----------+----------+
| Dimension1 | Measure1 | Measure2 |
+------------+----------+----------+
| 1          | x1       | y1       |
| 0          | x2       | y2       |
+------------+----------+----------+
Into:
+------------+----------+--------+
| Dimension1 | Measures | Values |
+------------+----------+--------+
| 1          | Measure1 | x1     |
| 1          | Measure2 | y1     |
| 0          | Measure1 | x2     |
| 0          | Measure2 | y2     |
+------------+----------+--------+
The number of measures is fixed.
I'm using Amazon Redshift.
You need to use UNION for that. Why don't you want to use it? There is no other way.
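For reference, a minimal sketch of the UNION ALL approach (assuming the source table is named t; the name is just a placeholder):

select Dimension1, 'Measure1' as Measures, Measure1 as "Values" from t
union all
select Dimension1, 'Measure2' as Measures, Measure2 as "Values" from t;

Each branch turns one measure column into rows, and UNION ALL stacks them into the Measures/Values layout shown above ("Values" is quoted because VALUES is a reserved word).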
Here's a dataframe, df1, that I have:
+---------+-------+---------+
| C1 | C2 | C3 |
+---------+-------+---------+
| xr | 1 | ixfg |
| we | 5 | jsfd |
| r5 | 7 | hfga |
| by | 8 | srjs |
| v4 | 4 | qwks |
| c0 | 0 | khfd |
| ba | 2 | gdbu |
| mi | 1 | pdlo |
| lp | 7 | ztpq |
+---------+-------+---------+
Here's another, df2, that I have
+----------+-------+---------+
| V1 | V2 | V3 |
+----------+-------+---------+
| Null | 6 | ixfg |
| Null | 2 | jsfd |
| Null | 2 | hfga |
| Null | 7 | qwks |
| Null | 1 | khfd |
| Null | 9 | gdbu |
+----------+-------+---------+
What I would like to have is another dataframe that
Ignores values in V2 and takes values in C2 wherever V3 and C3 match, and
Replaces V1 with values in C1 wherever V3 and C3 match.
The result should look like the following:
+----------+-------+---------+
| M1 | M2 | M3 |
+----------+-------+---------+
| xr | 1 | ixfg |
| we | 5 | jsfd |
| r5 | 7 | hfga |
| v4 | 4 | qwks |
| c0 | 0 | khfd |
| ba | 2 | gdbu |
+----------+-------+---------+
You can join the two dataframes and use coalesce to take the value with the higher priority.
Note: coalesce accepts any number of columns (ordered from highest to lowest priority) and returns the first non-null value, so if you want the result to be null whenever the lower-priority column is null, you cannot use this function.
from pyspark.sql import functions as F

# Join on the matching key, then coalesce: C1 takes priority over V1, C2 over V2.
# The default inner join also drops the df1 rows with no match in df2
# (by, mi, lp), which matches the expected output.
df = (df1.join(df2, on=(df1.C3 == df2.V3))
      .select(F.coalesce(df1.C1, df2.V1).alias('M1'),
              F.coalesce(df1.C2, df2.V2).alias('M2'),
              df1.C3.alias('M3')))
I have two datasets that I can represent as follows:
The first dataframe is my raw data. It contains millions of rows and around 6000 areas.
+--------+------+------+-----+-----+
| user | area | time | foo | bar |
+--------+------+------+-----+-----+
| Alice | A | 5 | ... | ... |
| Alice | B | 12 | ... | ... |
| Bob | A | 2 | ... | ... |
| Charly | C | 8 | ... | ... |
+--------+------+------+-----+-----+
This second dataframe is a mapping table. It has around 200 areas (not 6000, as in the raw data) for 150 places. Each area can have 1-N places (and a place can have 1-N areas too). It can be represented unpivoted this way:
+------+--------+-------+
| area | place | value |
+------+--------+-------+
| A | placeZ | 0.1 |
| B | placeB | 0.6 |
| B | placeC | 0.4 |
| C | placeA | 0.1 |
| C | placeB | 0.04 |
| D | placeA | 0.4 |
| D | placeC | 0.6 |
| ... | ... | ... |
+------+--------+-------+
or pivoted
+------+--------+--------+--------+-----+
| area | placeA | placeB | placeC | ... |
+------+--------+--------+--------+-----+
| A | 0 | 0 | 0 | ... |
| B | 0 | 0.6 | 0.4 | ... |
| C | 0.1 | 0.04 | 0 | ... |
| D | 0.4 | 0 | 0.6 | ... |
+------+--------+--------+--------+-----+
I would like to create a kind of product-join to have something like:
+--------+--------+--------+--------+-----+--------+
| user | placeA | placeB | placeC | ... | placeZ |
+--------+--------+--------+--------+-----+--------+
| Alice  | 0      | 7.2    | 4.8    | 0   | 0.5    | <- 7.2 and 4.8 come from area B, and 0.5 from area A
| Bob | 0 | 0 | 0 | 0 | 0.2 |
| Charly | 0.8 | 0.32 | 0 | 0 | 0 |
+--------+--------+--------+--------+-----+--------+
I see 2 options so far:
Option 1:
- Perform a left join between the main table and the pivoted one
- Multiply each column by the time (around 150 columns)
- Group by user with a sum
Option 2:
- Perform an outer join between the main table and the unpivoted one
- Multiply the time by value
- Pivot place
- Group by user with a sum
I don't like the first option because of the number of multiplications involved (the mapping dataframe is quite sparse).
I prefer the second option (sketched below), but I see two problems:
- If someday a place is not represented in the data, its column will not exist and the output will have a different shape (hence failing downstream).
- Some other features like foo and bar will be duplicated by the outer join, and I'll have to handle them case by case at the grouping stage (sum or average).
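For concreteness, a rough sketch of option 2 in PySpark (df being the raw data and mapping the unpivoted mapping table; both names are placeholders):

from pyspark.sql import functions as F

result = (df.join(mapping, on='area', how='outer')
            .withColumn('weighted', F.col('time') * F.col('value'))
            .groupBy('user')
            .pivot('place')   # pivot() can also take an explicit list of values
            .sum('weighted')
            .na.fill(0))

Note that if pivot() is given the explicit list of 150 places, the output schema stays fixed regardless of which places appear in the data, which may mitigate the first problem.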
I would like to know if there is something more ready-to-use for this kind of product-join in Spark. I have seen OneHotEncoder, but it only puts a "1" in each column (so it is even worse than the first solution).
Thanks in advance,
Nicolas
I have a table like the following one:
+---------+-------+-------+-------------+
| Section | Group | Level | Fulfillment |
+---------+-------+-------+-------------+
| A       | Y     | 1     | 82.2        |
| A       | Y     | 2     | 23.2        |
| A       | M     | 1     | 81.1        |
| A       | M     | 2     | 28.2        |
| B       | Y     | 1     | 89.1        |
| B       | Y     | 2     | 58.2        |
| B       | M     | 1     | 32.5        |
| B       | M     | 2     | 21.4        |
+---------+-------+-------+-------------+
And this would be my desired output:
+---------+-------+--------------------+--------------------+
| Section | Group | Level1_Fulfillment | Level2_Fulfillment |
+---------+-------+--------------------+--------------------+
| A | Y | 82.2 | 23.2 |
| A | M | 81.1 | 28.2 |
| B | Y | 89.1 | 58.2 |
| B | M | 32.5 | 21.4 |
+---------+-------+--------------------+--------------------+
Thus, for each section and group I'd like to obtain the percentage of fulfillment for level 1 and level 2. To achieve this I've tried crosstab(), but the function returns an error ("The provided SQL must return 3 columns: rowid, category, and values.") because I'm using more than three columns (I need to keep both section and group as identifiers for each row). Is it possible to use crosstab in this case?
Regards.
I find crosstab() unnecessarily complicated to use and prefer conditional aggregation:
select section,
"group",
max(fulfillment) filter (where level = 1) as level_1,
max(fulfillment) filter (where level = 2) as level_2
from the_table
group by section, "group"
order by section;
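The FILTER clause requires PostgreSQL 9.4 or later; on older versions the same conditional aggregation can be written with CASE, as in this equivalent sketch:

select section,
       "group",
       max(case when level = 1 then fulfillment end) as level_1,
       max(case when level = 2 then fulfillment end) as level_2
from the_table
group by section, "group"
order by section;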
I have two tables as the result of the following queries:
select customer,date,product,orderId,version,size from tableA where date=2020.04.08,product in (`Derivative)
+----------+----------+------------+---------+---------+------+
| customer | date | product | orderId | version | size |
+----------+----------+------------+---------+---------+------+
| XYZ fund | 4/8/2020 | Derivative | 1 | 6 | |
| XYZ fund | 4/8/2020 | Derivative | 2 | 6 | 1000 |
| XYZ fund | 4/8/2020 | Derivative | 3 | 4 | |
+----------+----------+------------+---------+---------+------+
select sum size by date,product,parent_orderId,parent_version from tableB where date=2020.04.08,product in (`Derivative)
+----------+------------+----------------+----------------+------+
| date | product | parent_orderId | parent_version | size |
+----------+------------+----------------+----------------+------+
| 4/8/2020 | Derivative | 1 | 1 | 10 |
| 4/8/2020 | Derivative | 1 | 2 | 10 |
| 4/8/2020 | Derivative | 1 | 3 | 10 |
| 4/8/2020 | Derivative | 1 | 4 | 10 |
| 4/8/2020 | Derivative | 1 | 5 | 10 |
| 4/8/2020 | Derivative | 1 | 6 | 10 |
| 4/8/2020 | Derivative | 3 | 1 | 20 |
| 4/8/2020 | Derivative | 3 | 2 | 20 |
| 4/8/2020 | Derivative | 3 | 3 | 20 |
| 4/8/2020 | Derivative | 3 | 4 | 20 |
+----------+------------+----------------+----------------+------+
So basically I want that, if Result 1 has a missing size, it should be populated from Result 2 based on the matching columns, i.e. date=date, product=product, orderId=parent_orderId, version=parent_version. Is there any way to do this with a query in kdb+?
The expected output follows:
+----------+----------+------------+---------+---------+------+
| customer | date | product | orderId | version | size |
+----------+----------+------------+---------+---------+------+
| XYZ fund | 4/8/2020 | Derivative | 1 | 6 | 10 |
| XYZ fund | 4/8/2020 | Derivative | 2 | 6 | 1000 |
| XYZ fund | 4/8/2020 | Derivative | 3 | 4 | 20 |
+----------+----------+------------+---------+---------+------+
You can use the left join operator to achieve this:
q)res1:select customer,date,product,orderId,version,size from tableA where date=2020.04.08,product in (`Derivative);
q)res2:select sum size by date,product,orderId:parent_orderId,version:parent_version from tableB where date=2020.04.08,product in (`Derivative);
q)res1 lj res2
customer date product orderId version size
-------------------------------------------------
XYZ fund 4/8/2020 Derivative 1 6 10
XYZ fund 4/8/2020 Derivative 2 6 1000
XYZ fund 4/8/2020 Derivative 3 4 20
Note that we had to ensure that the column names in the second table matched those we wanted to join on in the first table.
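One caveat: lj overwrites the size column for every matching key, which works here because the matched rows of tableA have a null size. If tableA could already contain sizes that must be kept, a variant (size2 is just a made-up column name) is to join the aggregate under a different name and use fill (^) so that only nulls are replaced:

q)res2:select size2:sum size by date,product,orderId:parent_orderId,version:parent_version from tableB where date=2020.04.08,product in (`Derivative);
q)delete size2 from update size:size2^size from res1 lj res2

Here size2^size keeps size wherever it is non-null and falls back to the joined value otherwise.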
I have this table in my database:
+------+------+
| id   | desc |
+------+------+
| 1    | A    |
| 2    | B    |
| NULL | C    |
| 3    | D    |
| NULL | D    |
| NULL | E    |
| 4    | F    |
+------+------+
And I want to transform this table into a table that replaces the nulls with consecutive negative ids:
+------+------+
| id   | desc |
+------+------+
| 1    | A    |
| 2    | B    |
| -1   | C    |
| 3    | D    |
| -2   | D    |
| -3   | E    |
| 4    | F    |
+------+------+
Does anyone know how I can do this in Hive?
The approach below works. All the NULL ids fall into a single row_number() partition, so they are numbered 1, 2, 3, and negating those numbers yields consecutive negative ids (note that desc is a reserved word in Hive and must be backquoted):

select coalesce(id, -row_number() over (partition by id)) as id, `desc`
from database_name.table_name;
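Note that without an ORDER BY inside the window, the numbering of the NULL rows is arbitrary. If the negative ids must follow a deterministic order, a possible variant (assuming ordering by desc is acceptable) is:

select coalesce(id, -row_number() over (partition by id order by `desc`)) as id, `desc`
from database_name.table_name;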