Joining data without creating duplicate metric rows from the first table (the second table contains more rows but no metrics) - postgresql

I have the following two tables that I would like to join for a comprehensive digital marketing report without creating duplicate metrics. The idea is to take competitor adverts and join them with my existing marketing data, which is as follows:
Campaign | Impressions | Clicks | Conversions | CPC | Key
---------+-------------+--------+-------------+-----+-------
USA-SIM  | 53432       | 5001   | 5           | 2$  | Hgdy24
DE-SIM   | 5389        | 4672   | 3           | 4$  | dhfg12
The competitor data is as follows:
Key    | Ad Copie
-------+-------------
Hgdy24 | Click here!
Hgdy24 | Free Trial!
Hgdy24 | Sign Up now
dhfg12 | Check it out
dhfg12 | World known
dhfg12 | Sign up
Using conventional join queries produces the following unusable result:
Campaign | Impressions | Clicks | Conversions | CPC | Key    | Ad Copie
---------+-------------+--------+-------------+-----+--------+-------------
USA-SIM  | 53432       | 5001   | 5           | 2$  | Hgdy24 | Click here!
USA-SIM  | 53432       | 5001   | 5           | 2$  | Hgdy24 | Free Trial!
USA-SIM  | 53432       | 5001   | 5           | 2$  | Hgdy24 | Sign Up now
DE-SIM   | 5389        | 4672   | 3           | 4$  | dhfg12 | Check it out
DE-SIM   | 5389        | 4672   | 3           | 4$  | dhfg12 | World known
DE-SIM   | 5389        | 4672   | 3           | 4$  | dhfg12 | Sign up
Here is the desired output:
Campaign | Impressions | Clicks | Conversions | CPC | Key    | Ad Copie
---------+-------------+--------+-------------+-----+--------+-------------
USA-SIM  | 53432       | 5001   | 5           | 2$  | Hgdy24 | Click here!
USA-SIM  |             |        |             |     | Hgdy24 | Free Trial!
USA-SIM  |             |        |             |     | Hgdy24 | Sign Up now
DE-SIM   | 5389        | 4672   | 3           | 4$  | dhfg12 | Check it out
DE-SIM   |             |        |             |     | dhfg12 | World known
DE-SIM   |             |        |             |     | dhfg12 | Sign up
Alternatively, this would also work:
Campaign | Impressions | Clicks | Conversions | CPC | Key    | Ad Copie
---------+-------------+--------+-------------+-----+--------+-------------
USA-SIM  | 53432       | 5001   | 5           | 2$  | Hgdy24 |
USA-SIM  |             |        |             |     | Hgdy24 | Click here!
USA-SIM  |             |        |             |     | Hgdy24 | Free Trial!
USA-SIM  |             |        |             |     | Hgdy24 | Sign Up now
DE-SIM   | 5389        | 4672   | 3           | 4$  | dhfg12 |
DE-SIM   |             |        |             |     | dhfg12 | Check it out
DE-SIM   |             |        |             |     | dhfg12 | World known
DE-SIM   |             |        |             |     | dhfg12 | Sign up
I have yet to find a workaround that does not produce the duplicated metrics.
MOST RECENT RESULT
campaing | impressions | clicks | conversions | cpc | key | ad_copie
----------+-------------+--------+-------------+-----+--------+------------
USA-SIM | 53432 | 5001 | 5 | 2$ | |
USA-SIM | | | | | Hgdy24 | Click here!
USA-SIM | | | | | Hgdy24 | Free Trial!
USA-SIM | | | | | Hgdy24 | Sign Up now
DE-SIM | 5389 | 4672 | 3 | 4$ | |
DE-SIM | | | | | dhfg12 | Check it out
DE-SIM | | | | | dhfg12 | World known
DE-SIM | | | | | dhfg12 | Sign up

You can use the window function lag() to check what key was in the previous row and either display the metrics or null them.
select campaing,
       case when prev_key is null or prev_key != key then impressions end as impressions,
       case when prev_key is null or prev_key != key then clicks end as clicks,
       case when prev_key is null or prev_key != key then conversions end as conversions,
       case when prev_key is null or prev_key != key then cpc end as cpc,
       key, ad_copie
from (
    -- Order the window itself; lag(key) over () would depend on an
    -- undefined row order coming out of the join.
    select campaing, lag(key) over (order by campaing desc, key) as prev_key,
           impressions, clicks, conversions, cpc, key, ad_copie
    from ad1
    join comp1 using (key)
) sub
order by campaing desc, key;
result:
campaing | impressions | clicks | conversions | cpc | key | ad_copie
----------+-------------+--------+-------------+-----+--------+--------------
USA-SIM | 53432 | 5001 | 5 | 2$ | Hgdy24 | Click here!
USA-SIM | | | | | Hgdy24 | Free Trial!
USA-SIM | | | | | Hgdy24 | Sign Up now
DE-SIM | 5389 | 4672 | 3 | 4$ | dhfg12 | Check it out
DE-SIM | | | | | dhfg12 | World known
DE-SIM | | | | | dhfg12 | Sign up
(6 rows)
EDIT: You might need to tinker with which columns you compare before you NULL the metrics, and possibly with which columns you order the data by. If key is unique per campaing, then I suppose this will suffice.
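If you would rather not depend on row order at all, here is a minimal sketch (assuming the same ad1 and comp1 tables as above) that numbers each ad copy within its key with row_number() and blanks the metrics on every row but the first:
-- Number each ad copy within its key, then blank the metrics
-- on every row but the first one per key.
select campaing,
       case when rn = 1 then impressions end as impressions,
       case when rn = 1 then clicks end as clicks,
       case when rn = 1 then conversions end as conversions,
       case when rn = 1 then cpc end as cpc,
       key, ad_copie
from (
    select campaing, impressions, clicks, conversions, cpc, key, ad_copie,
           row_number() over (partition by key order by ad_copie) as rn
    from ad1
    join comp1 using (key)
) sub
order by campaing desc, key, rn;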


How to create a new column based on whether certain strings exist in another column?

I have a table that looks like this:
+--------+-------------+
| Time | Locations |
+--------+-------------+
| 1/1/22 | A300-abc |
+--------+-------------+
| 1/2/22 | A300-FFF |
+--------+-------------+
| 1/3/22 | A300-ABC123 |
+--------+-------------+
| 1/4/22 | B700-abc |
+--------+-------------+
| 1/5/22 | B750-EEE |
+--------+-------------+
| 1/6/22 | M-200-68 |
+--------+-------------+
| 1/7/22 | ABC-abc |
+--------+-------------+
I would like to derive a table that looks like this:
+--------+-------------+-----------------+
| Time | Locations | Locations_Clean |
+--------+-------------+-----------------+
| 1/1/22 | A300-abc | A300 |
+--------+-------------+-----------------+
| 1/2/22 | A300-FFF    | A300            |
+--------+-------------+-----------------+
| 1/3/22 | A300-ABC123 | A300 |
+--------+-------------+-----------------+
| 1/4/22 | B700-abc | B700 |
+--------+-------------+-----------------+
| 1/5/22 | B750-EEE | B750 |
+--------+-------------+-----------------+
| 1/6/22 | M-200-68 | M-200 |
+--------+-------------+-----------------+
| 1/7/22 | ABC-abc | "not_listed" |
+--------+-------------+-----------------+
Essentially I have a list of what the location code should be e.g. ["A300","B700","B750","M-200"], but currently the location column is very messy with other random strings. I want to create a new column that shows the "cleaned" version of the location code, and anything that is not in that list should be marked as "not_listed".
Use a regex and a when condition. In this case I check whether the string begins with a digit (^[0-9]) and, if so, extract the leading digits from the string. If it doesn't, mark it as not_listed. Code below:
from pyspark.sql.functions import col, when, regexp_extract, lit

df = df.withColumn('Locations_Clean', when(col("Locations").rlike("^[0-9]"), regexp_extract('Locations', '^[0-9]+', 0)).otherwise(lit('not_listed')))
df.show()
+--------------------+---------+---------------+
| Time|Locations|Locations_Clean|
+--------------------+---------+---------------+
|0.045454545454545456| 300abc| 300|
|0.022727272727272728| 300FFF| 300|
| 0.01515151515151515| 300ABC| 300|
|0.011363636363636364| 700abc| 700|
|0.009090909090909092| 750EEE| 750|
|0.007575757575757575| ABCabc| not_listed|
+--------------------+---------+---------------+
With your updated question, use regexp_replace:
from pyspark.sql.functions import col, when, regexp_replace, lit

df = df.withColumn('Locations_Clean', when(col("Locations").rlike(r"\d"), regexp_replace('Locations', r'-\w+$', '')).otherwise(lit('not_listed')))
df.show()
+------+-----------+---------------+
| Time| Locations|Locations_Clean|
+------+-----------+---------------+
|1/1/22| A300-abc| A300|
|1/2/22| A300-FFF| A300|
|1/3/22|A300-ABC123| A300|
|1/4/22| B700-abc| B700|
|1/5/22| B750-EEE| B750|
|1/6/22| M-200-68| M-200|
|1/7/22| ABC-abc| not_listed|
+------+-----------+---------------+
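Note that rlike(r"\d") only checks for the presence of a digit, so an unlisted code such as X999-foo would still slip through as X999. A stricter sketch (assuming the approved list from the question; valid_codes is a name introduced here) compares the cleaned code against the list with isin:
from pyspark.sql.functions import regexp_replace, when, lit

# Assumed approved codes, taken from the question's list.
valid_codes = ["A300", "B700", "B750", "M-200"]

# Strip the trailing "-suffix", then keep the result only if it is approved.
cleaned = regexp_replace('Locations', r'-\w+$', '')
df = df.withColumn('Locations_Clean', when(cleaned.isin(valid_codes), cleaned).otherwise(lit('not_listed')))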

Check if a value is between two columns, Spark Scala

I have two dataframes, one with my data and another one to compare against. What I want to do is check whether a value falls within the range given by two different columns, for example:
Df_player
+--------+-------+
| Baller | Power |
+--------+-------+
| John | 1.5 |
| Bilbo | 3.7 |
| Frodo | 6 |
+--------+-------+
Df_Check
+--------+--------+--------+
| First | Second | Value |
+--------+--------+--------+
| 1 | 1.5 | Bad- |
| 1.5 | 3 | Bad |
| 3 | 4.2 | Good |
| 4.2 | 6 | Good+ |
+--------+--------+--------+
The result would be:
Df_out
+--------+-------+--------+
| Baller | Power | Value |
+--------+-------+--------+
| John | 1.5 | Bad- |
| Bilbo | 3.7 | Good |
| Frodo | 6 | Good+ |
+--------+-------+--------+
You can do a join based on a between condition, but note that .between is not appropriate here because you want strict inequality in one of the comparisons:
val result = df_player.join(
  df_check,
  df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
  "left"
).select("Baller", "Power", "Value")
result.show
+------+-----+-----+
|Baller|Power|Value|
+------+-----+-----+
| John| 1.5| Bad-|
| Bilbo| 3.7| Good|
| Frodo| 6.0|Good+|
+------+-----+-----+
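As a side note, a range join like this cannot use a hash join, so it can be slow on large data. If df_check is small (as lookup tables usually are), a common sketch is to broadcast it explicitly:
import org.apache.spark.sql.functions.broadcast

// Broadcasting the small lookup table turns the range join into a
// broadcast nested-loop join instead of a full shuffle.
val result = df_player.join(
  broadcast(df_check),
  df_player("Power") > df_check("First") && df_player("Power") <= df_check("Second"),
  "left"
).select("Baller", "Power", "Value")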

How can I sort the months in calendar order, January to December, in a Scala DataFrame?

+---------+------------------+
| Month|sum(buss_days)|
+---------+------------------+
| April| 83.93|
| August| 94.895|
| December| 53.47|
| February| 22.90|
| January| 97.45|
| July| 95.681|
| June| 23.371|
| March| 35.957|
| May| 4.24|
| November| 1.56|
| October| 1.00|
|September| 93.51|
+---------+------------------+
And I want output like this:
+---------+------------------+
|    Month|sum(avg_buss_days)|
+---------+------------------+
|  January|             97.45|
| February|             22.90|
|    March|            35.957|
|    April|             83.93|
|      May|              4.24|
|     June|            23.371|
|     July|            95.681|
|   August|            94.895|
|September|             93.51|
|  October|              1.00|
| November|              1.56|
| December|             53.47|
+---------+------------------+
This is what I did:
df.groupBy("Month[order(match(month$month, month.abb)), ]")
And I got this:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "Month[order(match(month$month, month.abb)), ]"
(Here Month is the column name in the dataframe.)
Converting the Month into date form and sorting by it should do. Please find the snippet below; unix_timestamp(col("Month"), "MMMM") parses the full month name ("MMMM" is the full-month-name pattern):
import org.apache.spark.sql.functions.{col, unix_timestamp}

df.sort(unix_timestamp(col("Month"), "MMMM")).show
+---------+-------------+
| Month|avg_buss_days|
+---------+-------------+
| January| 97.45|
| February| 22.90|
| March| 35.957|
| April| 83.93|
| May| 4.24|
| June| 23.371|
| July| 95.681|
| August| 94.895|
|September| 93.51|
| October| 1.00|
| November| 1.56|
| December| 53.47|
+---------+-------------+
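If you would rather not rely on date-pattern parsing, a hedged alternative sketch (monthOrder and month_idx are names introduced here) joins against a tiny lookup of calendar positions:
// Assumes the spark-shell's SparkSession named `spark`.
import spark.implicits._

// Two-column lookup: month name -> position in the calendar.
val monthOrder = Seq(
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December"
).zipWithIndex.toDF("Month", "month_idx")

// Join on the name, sort by position, then drop the helper column.
df.join(monthOrder, Seq("Month")).sort("month_idx").drop("month_idx").show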

Postgres select from table and spread evenly

I have 2 tables. The first table contains information about the object; the second table contains related objects. Second-table objects have 4 types (let's call them A, B, C, D).
I need a query that does something like this:
| table1 object id | A  | value for A | B  | value for B | C  | value for C | D  | value for D |
| 1                | 12 | cat         | 13 | dog         | 2  | house       | 43 | car         |
| 1                | 5  | lion        |    |             |    |             |    |             |
The column "table1 object id" in real table is multiple columns of data from table 1(for single object its all the same, just repeated on multiple rows because of table 2).
Where the 2nd table is in the form:
| type | value | table 1 object id | id |
| A    | cat   | 1                 | 12 |
| B    | dog   | 1                 | 13 |
| C    | house | 1                 | 2  |
| D    | car   | 1                 | 43 |
| A    | lion  | 1                 | 5  |
I hope this makes clear what I want.
I have tried using AND, OR, and JOIN. This does not seem like something that can be done with crosstab.
EDIT
Table 2
| type | value | table 1 object id | id |
| A    | cat   | 1                 | 12 |
| B    | dog   | 1                 | 13 |
| C    | house | 1                 | 2  |
| D    | car   | 1                 | 43 |
| A    | lion  | 1                 | 5  |
| C    | wolf  | 2                 | 6  |
Table 1
| id | value1 | value2 | value3 |
| 1  | hello  | test   | hmmm   |
| 2  | bye    | test2  | hmm2   |
Result
| value1 | value2 | value3 | A  | value | B  | value | C | value | D  | value |
| hello  | test   | hmmm   | 12 | cat   | 13 | dog   | 2 | house | 43 | car   |
| hello  | test   | hmmm   | 5  | lion  |    |       |   |       |    |       |
| bye    | test2  | hmm2   |    |       |    |       | 6 | wolf  |    |       |
I hope this explains a bit better what I want to achieve.
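One possible approach, shown as a sketch (the table names t1 and t2 and the column name object_id stand in for the real schema): rank the second-table rows per type within each object with row_number(), then pivot each rank onto its own output row with conditional aggregation:
-- Rank repeats of each type per object, then pivot one rank per output row.
select t1.value1, t1.value2, t1.value3,
       max(t2.id)    filter (where t2.type = 'A') as a,
       max(t2.value) filter (where t2.type = 'A') as a_value,
       max(t2.id)    filter (where t2.type = 'B') as b,
       max(t2.value) filter (where t2.type = 'B') as b_value,
       max(t2.id)    filter (where t2.type = 'C') as c,
       max(t2.value) filter (where t2.type = 'C') as c_value,
       max(t2.id)    filter (where t2.type = 'D') as d,
       max(t2.value) filter (where t2.type = 'D') as d_value
from t1
join (
    select *, row_number() over (partition by object_id, type order by id) as rn
    from t2
) t2 on t2.object_id = t1.id
group by t1.id, t1.value1, t1.value2, t1.value3, t2.rn
order by t1.id, t2.rn;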

Select single value from sphinx MVA

I'm currently using Sphinx MVAs (multi-value attributes) for indexer performance reasons; each MVA only has a single value. I'm basically using the MVAs in the same way as a sql_joined_field (I can't use sql_joined_field since you cannot filter by joined values).
I want to be able to sort by the value of the MVA. According to the Sphinx docs, you cannot actually do this; however, you can sort by selected derived values (e.g. MAX(price) AS sort_field or GROUP_CONCAT(tag) AS sort_field).
Is there a way to select a single value from the MVA (or possibly concatenate all values in the MVA)?
OK, so while it appears you can sort by an MVA:
sphinxQL>select id,bucket_id from gi_stemmed where match('bridge') order by bucket_id desc;
+---------+-----------+
| id | bucket_id |
+---------+-----------+
| 4135611 | 492 |
| 4135609 | 492 |
| 4132078 | 492 |
| 4130626 | 492 |
| 4117904 | 492 |
| 4114632 | 490 |
| 4087884 | 490 |
| 4087786 | 490 |
| 4087767 | 490 |
| 4087010 | 490 |
| 4086927 | 490 |
| 4086920 | 490 |
| 4086125 | 490 |
| 4083465 | 761 |
| 4081812 | 491 |
| 4081713 | 490 |
| 4065533 | 490 |
| 4065427 | 490 |
| 4065338 | 490 |
| 4065321 | 490 |
+---------+-----------+
Server version: 2.2.1-dev (r4133)
i.e. no error. But it doesn't work completely: there are a few results out of order (see two-thirds of the way down the example above).
But there is a GREATEST() function, which works like MAX in your question.
sphinxQL>select id,bucket_id,greatest(bucket_id) as two from gi_stemmed where match('bridge road') order by two desc;
You can sort by MVAs...
sphinxQL>select id,bucket_id from gi_stemmed order by bucket_id desc;
+---------+-----------+
| id | bucket_id |
+---------+-----------+
| 4138739 | 492 |
| 4138708 | 492 |
| 4138671 | 492 |
| 4138663 | 492 |
| 4138661 | 492 |
| 4138615 | 492 |
bucket_id is a MVA (for a similar reason to you)
sphinxQL>describe gi_stemmed like 'bucket_id';
+-----------+------+
| Field | Type |
+-----------+------+
| bucket_id | mva |
+-----------+------+
Server version: 2.2.1-dev (r4133)