I am using this postgresql code,
SELECT id as DT_RowId , title
FROM table_name
ORDER BY title asc
LIMIT 25 OFFSET 0
The results returned like this.
+--------+-----+
|dt_rowid|title|
+--------------+
| 1 |A |
| 2 |B |
| 3 |C |
| 4 |D |
| 5 |E |
| 6 |F |
+--------+-----+
But i want results should return like this.
+--------+-----+
|DT_RowId|title|
+--------------+
| 1 |A |
| 2 |B |
| 3 |C |
| 4 |D |
| 5 |E |
| 6 |F |
+--------+-----+
Note - DT_RowId field i want same like this(upper and lower case mixed).
As explained in the manual unquoted identifier are folded to lowercase (which violates the SQL standard where unquoted identifiers should be folded to uppercase).
You need to use a quoted identifier in order to preserve the case:
SELECT id as "DT_RowId",
title
FROM table_name
ORDER BY title asc
LIMIT 25 OFFSET 0
Related
I have a task with working with hierarchical data, but the source data contains errors in the hierarchy, namely: some parent-child links are broken. I have an algorithm for reestablishing such connections, but I have not yet been able to implement it on my own.
Example:
Initial data is
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
Schematically it looks like:
As you can see, connections with C1 and D3 are lost here.
In order to restore connections, I need to apply the following algorithm for this table:
if for some NAME the ID is not in the PARENTID column (like ID = 18, 10), then create a row with a 'parent' with LEVEL = (current LEVEL - 1) and PARENTID = (current ID), and take ID and NAME such that the current ID < ID of the node from the LEVEL above.
Result must be like:
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| B1 | 2 | 18 | 2 |#
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| C2 | 3 | 10 | 3 |#
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
Where rows with # - new rows created.And new schema looks like:
Are there any ideas on how to do this algorithm in spark/scala? Thanks!
You can build a createdRows dataframe from your current dataframe that you union with your current dataframe to obtain your final dataframe.
You can build this createdRows dataframe in several step:
The first step is to get the IDs (and LEVEL) that are not in PARENTID column. You can use a self left anti join to do that.
Then, you renameID column to PARENTID and updating LEVEL column, decreasing it by 1.
Then, you take ID and NAME columns of new rows by joining it with your input dataframe on the LEVEL column
Finally, you apply your condition ID < PARENTID
You end up with the following code, dataframe is the dataframe with your initial data:
import org.apache.spark.sql.functions.col
val createdRows = dataframe
// if for some NAME the ID is not in the PARENTID column (like ID = 18, 10)
.select("LEVEL", "ID")
.filter(col("LEVEL") > 1) // Remove root node from created rows
.join(dataframe.select("PARENTID"), col("PARENTID") === col("ID"), "left_anti")
// then create a row with a 'parent' with LEVEL = (current LEVEL - 1) and PARENTID = (current ID)
.withColumnRenamed("ID", "PARENTID")
.withColumn("LEVEL", col("LEVEL") - 1)
// and take ID and NAME
.join(dataframe.select("NAME", "ID", "LEVEL"), Seq("LEVEL"))
// such that the current ID < ID of the node from the LEVEL above.
.filter(col("ID") < col("PARENTID"))
val result = dataframe
.unionByName(createdRows)
.orderBy("NAME", "PARENTID") // Optional, if you want an ordered result
And in result dataframe you get:
+----+---+--------+-----+
|NAME|ID |PARENTID|LEVEL|
+----+---+--------+-----+
|A1 |1 |2 |1 |
|B1 |2 |3 |2 |
|B1 |2 |18 |2 |
|C1 |18 |4 |3 |
|C2 |3 |5 |3 |
|C2 |3 |10 |3 |
|D1 |4 |null |4 |
|D2 |5 |null |4 |
|D3 |10 |11 |4 |
|E1 |11 |null |5 |
+----+---+--------+-----+
I have a dataframe in which I need to add another col based on the grouping logic.
Dataframe
id|x_id|y_id|val_id|
1| 2 | 3 | 4 |
10| 2 | 3 | 40 |
1| 12 | 13 | 14 |
I need to add other col parent_id which will be based on this rule:
over x_id and y_id select the max value in col val_id and use its corresponding id value
Final frame will look like this
id|x_id|y_id|val_id| parent_id
91| 2 | 3 | 4 | 10 (coming from row 2)
10| 2 | 3 | 40 | 10 (coming from row 2)
1| 12 | 13 | 14 | 14
I have tried using withColumn, but I can only set the row over that group that its value will be parent.
Explanation: Here parent_id is 10 because its coming from col id. Row 2 was chosen because it has max value of val_id over group x_id and y_id
I am using scala
Use Window to split the ids and calculate the maximum over the window by sorting for each partition with respect to the val_id.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy('x_id, 'y_id).orderBy('val_id.desc)
df.withColumn("parent_id", first('id).over(w))
.show(false)
The result is:
+---+----+----+------+---------+
|id |x_id|y_id|val_id|parent_id|
+---+----+----+------+---------+
|10 |2 |3 |40 |10 |
|1 |2 |3 |4 |10 |
|1 |12 |13 |14 |1 |
+---+----+----+------+---------+
Let's say we have
|bin | min | end | start |
|1 | 5 | 10 |
|2 | 12 | 24 |
|3 | 28 | 36 |
|4 | 40 | 50 |
|5 | null| null |
I would want to populate start as the previous column's end to make continuous bin values. For the missing I would like to fill in with the current min instead. For null row I consider treating it separately.
What lag gives us would be
df.withColumn("start", F.lag(col("end"), 1, ***default_value***).over(orderBy(col("bin"))
|bin | min | end | start |
|1 | 5 | 10 | (5 wanted)
|2 | 12 | 24 | 10
|3 | 28 | 36 | 24
|4 | 40 | 50 | 36
|5 | null| null | null
My questions :
1/ What do we put in default_value for lag to take another column of current row, in this case min
2/ Is there a way to treat null row at the same time without separating ? I intend to filter non-null , perform lag, then union back with the null rows. How will the answer differ if Null is the first(bin 1) or last (bin 5) ?
Use coalesce to get a column value for the first row in a group.
from pyspark.sql import functions as F
df.withColumn("start", F.coalesce(F.lag(col("end"), 1).over(orderBy(col("bin")),col("min")))
lag currently doesn't support ignorenulls option, so you might have to separate out the null rows, compute the start column for non-null rows and union the data frames.
I have a 2 tables. First table contains information of the object, second table contains related objects. Second table objects have 4 types( lets call em A,B,C,D).
I need a query that does something like this
|table1 object id | A |value for A|B | value for B| C | value for C|D | vlaue for D|
| 1 | 12| cat | 13| dog | 2 | house | 43| car |
| 1 | 5 | lion | | | | | | |
The column "table1 object id" in real table is multiple columns of data from table 1(for single object its all the same, just repeated on multiple rows because of table 2).
Where 2nd table is in form
|type|value|table 1 object id| id |
|A |cat | 1 | 12|
|B |dog | 1 | 13|
|C |house| 1 | 2 |
|D |car | 1 | 43 |
|A |lion | 1 | 5 |
I hope this is clear enough of the thing i want.
I have tryed using AND and OR and JOIN. This does not seem like something that can be done with crosstab.
EDIT
Table 2
|type|value|table 1 object id| id |
|A |cat | 1 | 12|
|B |dog | 1 | 13|
|C |house| 1 | 2 |
|D |car | 1 | 43 |
|A |lion | 1 | 5 |
|C |wolf | 2 | 6 |
Table 1
| id | value1 | value 2|value 3|
| 1 | hello | test | hmmm |
| 2 | bye | test2 | hmm2 |
Result
|value1| value2| value3| A| value| B |value| C|value | D | value|
|hello | test | hmmm |12| cat | 13| dog |2 | house | 23| car |
|hello | test | hmmm |5 | lion | | | | | | |
|bye | test2 | hmm2 | | | | |6 | wolf | | |
I hope this explains bit bettter of what I want to achieve.
I have two tables, as such (where value in all cases is a character varying):
table_one:
|id|value|
|--+-----|
|1 |A |
|2 |B |
|3 |C |
|4 |D |
table_two:
|id|value|
|--+-----|
|11|0 |
|12|1 |
|13|2 |
|14|3 |
|15|4 |
|16|5 |
I want to do a query that results in the following:
result:
|value|
|-----|
|A_0 |
|A_1 |
|A_2 |
|A_3 |
|A_4 |
|A_5 |
|B_0 |
|B_1 |
|B_2 |
|B_3 |
|B_4 |
|B_5 |
|C_0 |
|C_1 |
|C_2 |
|C_3 |
|C_4 |
|C_5 |
|D_0 |
|D_1 |
|D_2 |
|D_3 |
|D_4 |
|D_5 |
This is for a seldom run report on PostgreSQL 8.4.9, so performance isn't critical. If the answer is 9.x specific, upgrading certainly isn't out of the question, so feel free to give 9.x specific answers. Any ideas?
select
table_one.val || '_' || table_two.val
from
table_one
cross join table_two
Documentation for cross join.
SQLFiddle