|id |profile |data
|1 |name1 |2.1
|1 |name2 |400
|1 |name3 |200
|2 |name1 |3.4
|2 |name2 |350
|2 |name3 |500
This is what I have at the moment. When I ran this query:
SELECT id, ARRAY_AGG(profile), ARRAY_AGG(data) FROM "schema"."table" GROUP BY id;
I got:
|id |array(profile) |array(data)
|1 |{name1, name2, name3} |{2.1, 400, 200}
|2 |{name2, name3, name1} |{350, 500, 3.4}
I also tried to pre-sort:
SELECT id, ARRAY_AGG(profile), ARRAY_AGG(data) FROM (SELECT * FROM "schema"."table" ORDER BY id) A GROUP BY id;
The data positions match in both arrays, but the ordering is not consistent between groups. I wanted this result:
|id |array(profile) |array(data)
|1 |{name1, name2, name3} |{2.1, 400, 200}
|2 |{name1, name2, name3} |{3.4, 350, 500}
I am using PostgreSQL 8.4 so I cannot use array_agg(profile ORDER BY data).
As far as I remember, in Postgres 8.4 you can execute aggregates over a sorted derived table (unfortunately I cannot run 8.4 to verify this), e.g.:
select id, array_agg(profile), array_agg(data)
from (
    select *
    from my_table
    order by id, profile
) s
group by id
order by id;
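For reference, if you ever move past 8.4: from PostgreSQL 9.0 on you can order inside the aggregate call itself, which makes the intent explicit. A sketch against the same my_table:
select id,
       array_agg(profile order by profile),
       array_agg(data order by profile)
from my_table
group by id
order by id;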
I have a data frame that looks something along the lines of:
+-----+-----+------+-----+
|col1 |col2 |col3 |col4 |
+-----+-----+------+-----+
|1.1 |2.3 |10.0 |1 |
|2.2 |1.5 |5.0 |1 |
|3.3 |1.3 |1.5 |1 |
|4.4 |0.5 |7.0 |1 |
|5.5 |1.2 |8.1 |2 |
|6.6 |2.3 |8.2 |2 |
|7.7 |4.5 |10.3 |2 |
+-----+-----+------+-----+
I would like to subtract the row above from each row, but only if they have the same entry in col4, so row 2 minus row 1 and row 3 minus row 2, but not row 5 minus row 4. Also, col4 should not be changed, so the result would be:
+-----+-----+------+------+
|col1 |col2 |col3 |col4 |
+-----+-----+------+------+
|1.1 |-0.8 |-5.0 |1 |
|1.1 |-0.2 |-3.5 |1 |
|1.1 |-0.8 |5.5 |1 |
|1.1 |1.1 |0.1 |2 |
|1.1 |2.2 |2.1 |2 |
+-----+-----+------+------+
This sounds like it should be simple, but I can't seem to figure it out.
You could accomplish this using Spark SQL, i.e. creating a temporary view from your DataFrame and applying the following SQL. It uses the window function LAG to subtract the previous row's value, ordered by col1 and partitioned by col4. The first row in each col4 partition is identified using ROW_NUMBER and filtered out.
df.createOrReplaceTempView('my_temp_view')
results = sparkSession.sql('<insert sql below here>')
SELECT
    col1,
    col2,
    col3,
    col4
FROM (
    SELECT
        (col1 - LAG(col1, 1, 0) OVER (PARTITION BY col4 ORDER BY col1)) AS col1,
        (col2 - LAG(col2, 1, 0) OVER (PARTITION BY col4 ORDER BY col1)) AS col2,
        (col3 - LAG(col3, 1, 0) OVER (PARTITION BY col4 ORDER BY col1)) AS col3,
        col4,
        ROW_NUMBER() OVER (PARTITION BY col4 ORDER BY col1) AS rn
    FROM my_temp_view
) t
WHERE rn <> 1
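An equivalent formulation, if you prefer to skip the ROW_NUMBER step: let LAG default to NULL, so the first row of each partition produces NULL differences, then filter those out. A sketch against the same my_temp_view:
SELECT d1 AS col1, d2 AS col2, d3 AS col3, col4
FROM (
    SELECT
        (col1 - LAG(col1) OVER (PARTITION BY col4 ORDER BY col1)) AS d1,
        (col2 - LAG(col2) OVER (PARTITION BY col4 ORDER BY col1)) AS d2,
        (col3 - LAG(col3) OVER (PARTITION BY col4 ORDER BY col1)) AS d3,
        col4
    FROM my_temp_view
) t
WHERE d1 IS NOT NULL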
Here is just the idea: a self-join based on an RDD with zipWithIndex, then back to a DataFrame. There is some overhead, which you can tailor; z plays the role of your col4.
At scale I am not sure what performance the Catalyst optimizer will deliver here. I looked at .explain(true) and am not entirely convinced, but I find the output hard to interpret sometimes. The ordering of the data is guaranteed.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}

// sample data; z plays the role of col4
val df = sc.parallelize(Seq((1.0, 2.0, 1), (0.0, -1.0, 1), (3.0, 4.0, 1), (6.0, -2.3, 4))).toDF("x", "y", "z")

// extend the schema with a rowid column to hold the zipWithIndex value
val newSchema = StructType(df.schema.fields ++ Array(StructField("rowid", LongType, false)))
val rddWithId = df.rdd.zipWithIndex
val dfZippedWithId = spark.createDataFrame(rddWithId.map { case (row, index) => Row.fromSeq(row.toSeq ++ Array(index)) }, newSchema)
dfZippedWithId.show(false)
dfZippedWithId.printSchema()

// self-join each row to its successor (rowid + 1) within the same z group
val res = dfZippedWithId.as("dfZ1").join(dfZippedWithId.as("dfZ2"),
    $"dfZ1.z" === $"dfZ2.z" && $"dfZ1.rowid" === ($"dfZ2.rowid" - 1),
    "inner")
  .withColumn("newx", $"dfZ2.x" - $"dfZ1.x") //.explain(true)
res.show(false)
This returns the zipped input (with the added rowid):
+---+----+---+-----+
|x |y |z |rowid|
+---+----+---+-----+
|1.0|2.0 |1 |0 |
|0.0|-1.0|1 |1 |
|3.0|4.0 |1 |2 |
|6.0|-2.3|4 |3 |
+---+----+---+-----+
and the result, which you can tailor by selecting columns and adding extra calculations (note that the last row of each z group has no successor, so a single-row group like z = 4 drops out):
+---+----+---+-----+---+----+---+-----+----+
|x |y |z |rowid|x |y |z |rowid|newx|
+---+----+---+-----+---+----+---+-----+----+
|1.0|2.0 |1 |0 |0.0|-1.0|1 |1 |-1.0|
|0.0|-1.0|1 |1 |3.0|4.0 |1 |2 |3.0 |
+---+----+---+-----+---+----+---+-----+----+
I have a table with an id column and two other columns. I want a kind of numbering using the row_number function, and I want the result to look like this:
id |col1 |col2 |what I want
------------------------------
1 |x |a |1
2 |x |b |2
3 |x |a |3
4 |x |a |3
5 |x |c |4
6 |x |c |4
7 |x |c |4
Please consider that there's only one x, so "partition by col1" is OK. Other than that, there are two separate runs of a's, and they must be counted separately (not 1,2,1,1,3,3,3), and sorting must be by id, not by col2 (so "order by col2" is NOT OK).
I want the number to increase by one any time col2 changes compared to the previous row.
row_number() over (partition by col1 order by col2) DOESN'T WORK, because I want it ordered by id.
Using LAG and a windowed COUNT appears to get you what you are after:
WITH Previous AS(
SELECT V.id,
V.col1,
V.col2,
V.[What I want],
LAG(V.Col2,1,V.Col2) OVER (ORDER BY ID ASC) AS PrevCol2
FROM (VALUES(1,'x','a',1),
(2,'x','b',2),
(3,'x','a',3),
(4,'x','a',3),
(5,'x','c',4),
(6,'x','c',4),
(7,'x','c',4))V(id, col1, col2, [What I want]))
SELECT P.id,
P.col1,
P.col2,
P.[What I want],
COUNT(CASE P.Col2 WHEN P.PrevCol2 THEN NULL ELSE 1 END) OVER (ORDER BY P.ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) +1 AS [What you get]
FROM Previous P;
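For what it's worth, the same running tally can be computed with a windowed SUM over a change flag instead of COUNT; a sketch that would replace the final SELECT above:
SELECT P.id,
       P.col1,
       P.col2,
       SUM(CASE WHEN P.Col2 = P.PrevCol2 THEN 0 ELSE 1 END)
           OVER (ORDER BY P.ID ROWS UNBOUNDED PRECEDING) + 1 AS [What you get]
FROM Previous P;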
I am working with the following two tables:
Table 1
Key |Clicks |Impressions
-------------+-------+-----------
USA-SIM-CARDS|55667 |544343
DE-SIM-CARDS |4563 |234829
AU-SIM-CARDS |3213 |232242
UK-SIM-CARDS |3213 |1333223
CA-SIM-CARDS |4321 |8883111
MX-SIM-CARDS |3193 |3291023
Table 2
Key |Conversions |Final Conversions|Active Sims
-----------------+------------+-----------------+-----------
USA-SIM-CARDS |456 |43 |4
USA-SIM-CARDS |65 |2 |1
UK-SIM-CARDS |123 |4 |3
UK-SIM-CARDS |145 |34 |5
The goal is to get the following output:
Key |Clicks |Impressions|Conversions|Final Conversions|Active Sims
-------------+-------+-----------+-----------+-----------------+-----------
USA-SIM-CARDS|55667 |544343 |521 |45 |5
DE-SIM-CARDS |4563 |234829 | | |
AU-SIM-CARDS |3213 |232242 | | |
UK-SIM-CARDS |3213 |1333223 |268 |38 |8
CA-SIM-CARDS |4321 |8883111 | | |
MX-SIM-CARDS |3193 |3291023 | | |
The most crucial part of this involves aggregating the second table's conversion columns by key. I would then, I imagine, execute this with an inner join.
Thank you.
Take this in two steps then:
1) Aggregate the second table:
SELECT Key,
       sum(Conversions) as Conversions,
       sum("Final Conversions") as FinalConversions,
       sum("Active Sims") as ActiveSims
FROM Table2
GROUP BY Key
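Against the sample data above, this yields:
Key           |Conversions |Final Conversions|Active Sims
--------------+------------+-----------------+-----------
USA-SIM-CARDS |521         |45               |5
UK-SIM-CARDS  |268         |38               |8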
2) Use that as a subquery/derived table joining to your first table:
SELECT
t1.key,
t1.clicks,
t1.impressions,
t2.conversions,
t2.finalConversions,
t2.ActiveSims
From Table1 t1
LEFT OUTER JOIN (SELECT Key,
                        sum(Conversions) as Conversions,
                        sum("Final Conversions") as FinalConversions,
                        sum("Active Sims") as ActiveSims
                 FROM Table2
                 GROUP BY Key) t2
ON t1.key = t2.key;
As an alternative, you could join first and then group, since there is no need to aggregate twice:
SELECT
t1.key,
t1.clicks,
t1.impressions,
sum(Conversions) as Conversions,
sum("Final Conversions") as FinalConversions,
Sum("Active Sims") as ActiveSims
From Table1 t1
LEFT OUTER JOIN table2 t2
ON t1.key = t2.key
GROUP BY t1.key, t1.clicks, t1.impressions
The only other important thing here is that we are using a LEFT OUTER JOIN, since we want all records from Table1 and any records from Table2 that match on the key.
Assume I have the following table, plus some data.
create table "common"."log"("id" bigserial primary key,
"level" int not null default 0);
Now I have this select query, which returns something like this:
select * from common.log where id=147;
+------+--------+
|id |level |
+------+--------+
|147 |1 |
|147 |2 |
|147 |2 |
|147 |6 |
|147 |90 |
+------+--------+
Now I would like to have something like the following instead:
+------+---------------+
|id |arr_level |
+------+---------------+
|147 |{1,2,2,6,90} |
+------+---------------+
So is there any built-in select clause/way of doing this? Thanks.
pgsql v9.3
You can use the ARRAY constructor with a subquery, like this:
select '147' as id, array(select level from common.log where id = 147) as arr_level;
Another way, probably more useful if you have more than one id to query:
SELECT id, array_agg(level) FROM common.log GROUP BY id;
See: aggregate functions.
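Since you are on 9.3, you can also fix the element order inside the aggregate call (available from PostgreSQL 9.0 on), e.g.:
SELECT id, array_agg(level ORDER BY level) AS arr_level
FROM common.log
GROUP BY id;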
I am struggling with maybe the simplest problem ever; my SQL knowledge pretty much limits me from achieving this. I am trying to build an SQL query that should show JobTitle, Note and NoteType. Here is the thing: the first job doesn't have any notes, but we should still see it in the results. System notes should never, ever be displayed. The expected result should look like this:
Result:
--------------------------------------------
|ID |Title |Note |NoteType |
--------------------------------------------
|1 |FirstJob |NULL |NULL |
|2 |SecondJob |CustomNot1|1 |
|2 |SecondJob |CustomNot2|1 |
|3 |ThirdJob |NULL |NULL |
--------------------------------------------
My query (doesn't work; it doesn't display the third job):
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN NOTE N ON N.JobId = J.ID
WHERE N.NoteType IS NULL OR N.NoteType = 1
My Tables:
My JOB Table
----------------------
|ID |Title |
----------------------
|1 |FirstJob |
|2 |SecondJob |
|3 |ThirdJob |
----------------------
My NOTE Table
--------------------------------------------
|ID |JobId |Note |NoteType |
--------------------------------------------
|1 |2 |CustomNot1|1 |
|2 |2 |CustomNot2|1 |
|3 |2 |SystemNot1|2 |
|4 |2 |SystemNot3|2 |
|5 |3 |SystemNot1|2 |
--------------------------------------------
These can't both be true at the same time (NoteType can't be NULL as well as 1), so a query written with AND returns nothing:
WHERE N.NoteType IS NULL AND N.NoteType = 1
You may want to use OR instead to check if NoteType is either NULL or 1.
WHERE N.NoteType IS NULL OR N.NoteType = 1
EDIT: Even with the corrected query, your third job will not be retrieved: the JobId matches, but the row gets filtered out by the WHERE condition.
Try the following as a workaround to get the third job with NULL values.
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN
    (SELECT JobId, Note, NoteType
     FROM NOTE
     WHERE NoteType IS NULL OR NoteType = 1) N
ON N.JobId = J.ID
Just exclude the system notes and use a sub-select:
select *
from job j
left outer join (
    select * from note where notetype != 2
) n on j.id = n.jobid;
If you filter the joined table in the WHERE clause, the LEFT OUTER JOIN effectively works as an INNER JOIN.
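An alternative sketch (same JOB and NOTE tables) that moves the filter into the join condition instead of a sub-select, which likewise keeps the outer-join semantics intact:
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN NOTE N
    ON N.JobId = J.ID AND N.NoteType = 1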