Task: I need to sum up relevant values from a JSON for a specific id. How can I accomplish this in PostgreSQL?
I receive post insights from Facebook's Graph API, and each row contains a JSON cell listing countries (with their two-letter abbreviations) and the corresponding watchtime in ms from that country.
post_id      | date       | watchtime_per_country
-------------|------------|----------------------
107_[pageID] | 2022-09-01 | ** see json below **
The second part is a table that contains the relevant countries for each page_id:
page_id | target_country
--------|----------------
P01     | Germany (DE)
P01     | Italy (IT)
P02     | Mozambique (MZ)
P02     | Colombia (CO)
Now I would like to get these sums:
Germany (DE): 162 and Japan (JP): 24 --> 186 for P01
Mozambique (MZ): 3 and Colombia (CO): 6 --> 9 for P02
So far I have unnested the JSON and unpacked all of the roughly 250 possible country values into their own columns, but I am not sure whether this is a good approach. After that, I am not sure how to build those sums in a flexible, efficient way, or whether it is possible at all in PostgreSQL.
Does anyone have an idea?
**** json ****
{"Brazil (BR)": 9210, "Germany (DE)": 162, "Portugal (PT)": 68, "Japan (JP)": 24, "United States (US)": 17, "Italy (IT)": 13, "France (FR)": 9, "United Kingdom (GB)": 8, "Netherlands (NL)": 6, "Belgium (BE)": 6, "Colombia (CO)": 6, "Austria (AT)": 5, "Sweden (SE)": 4, "Canada (CA)": 4, "Argentina (AR)": 3, "Mozambique (MZ)": 3, "Angola (AO)": 3, "Switzerland (CH)": 2, "Saudi Arabia (SA)": 2, "New Zealand (NZ)": 2, "Norway (NO)": 2, "Indonesia (ID)": 2, "Denmark (DK)": 2, "United Arab Emirates (AE)": 2, "Russia (RU)": 2, "Spain (ES)": 1, "China (CN)": 1, "Israel (IL)": 1, "Chile (CL)": 0, "Bulgaria (BG)": 0, "Australia (AU)": 0, "Cape Verde (CV)": 0, "Ireland (IE)": 0, "Egypt (EG)": 0, "Luxembourg (LU)": 0, "Bolivia (BO)": 0, "Paraguay (PY)": 0, "Uruguay (UY)": 0, "Czech Republic (CZ)": 0, "Hungary (HU)": 0, "Finland (FI)": 0, "Algeria (DZ)": 0, "Peru (PE)": 0, "Mexico (MX)": 0, "Guinea-Bissau (GW)": 0}
You have a couple of ways you can go. If you will do little else with the post insights, then you can get the page sums directly by processing the JSON.
Your later comment indicates there may be more, though. In that case, unpacking the JSON into a single table is the way to go; that is data normalization.
One very slight correction: the two-character code is not an MS coding for the country. It is the ISO 3166 alpha-2 code (defined in ISO 3166-1); yes, MS uses it.
Either way, the first step is to extract the keys from the JSON, then use those keys to extract the values. Then JOIN the relevant_countries table on the alpha-2 code.
with work_insights (jkey,country_watchtime) as
( select json_object_keys(country_watchtime), country_watchtime
from insights_data_stage
)
, watch_insights(cntry, alpha2, watchtime) as
( select trim(replace(substring(jkey, '^.*\('),'(',''))
, upper(trim(replace(replace(substring(jkey, '\(.*\)'),'(',''),')','')) )
, (country_watchtime->> jkey)::numeric
from work_insights
)
, relevant_codes (page_id, alpha2) as
( select page_id, substring(substring(target_country, '\(..\)'),2,2) alpha2
from relevant_countries
)
select rc.page_id, sum(watchtime) watchtime
from relevant_codes rc
join watch_insights wi
on (wi.alpha2 = rc.alpha2)
where rc.page_id in ('P01','P02')
group by rc.page_id
order by rc.page_id;
For the normalization process you need a country table (which you already said you have) and another table for the normalized insights data. Populating it begins with the same parsing as above, but splits each key/value pair out into its own columns (alpha-2 code and watchtime). Once created, you JOIN this table with relevant_countries. (See the demo containing both.) Note: I normalized the relevant_countries table.
select rc.page_id, sum(pi.watchtime) watchtime
from post_insights pi
join relevant_countries_rev rc on (rc.alpha2 = pi.alpha2)
group by rc.page_id
order by rc.page_id;
Update: The results for P01 do not match your expected results. Your expectations indicate summing Germany and Japan, but your relevant_countries table indicates Germany and Italy.
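As a cross-check, the same extract-keys / map-to-alpha-2 / sum logic can be sketched outside the database. This is a minimal Python sketch with a trimmed, hypothetical version of the watchtime JSON and the relevant_countries table; it follows the relevant_countries table, so P01 sums Germany and Italy:

```python
import json
import re

# trimmed, hypothetical version of the watchtime_per_country JSON
watchtime_json = ('{"Germany (DE)": 162, "Japan (JP)": 24, "Italy (IT)": 13, '
                  '"Mozambique (MZ)": 3, "Colombia (CO)": 6}')

# relevant_countries, as page_id -> list of "Name (XX)" labels
relevant = {
    "P01": ["Germany (DE)", "Italy (IT)"],
    "P02": ["Mozambique (MZ)", "Colombia (CO)"],
}

def alpha2(label):
    """Pull the ISO 3166-1 alpha-2 code out of a 'Name (XX)' label."""
    m = re.search(r"\((..)\)", label)
    return m.group(1).upper() if m else None

# re-key the watchtimes by alpha-2 code
watch = {alpha2(k): v for k, v in json.loads(watchtime_json).items()}

# sum the watchtime of each page's relevant countries
sums = {page: sum(watch.get(alpha2(c), 0) for c in countries)
        for page, countries in relevant.items()}
# P01 -> 162 + 13 = 175 (DE + IT, per the relevant_countries table)
# P02 -> 3 + 6 = 9
```

The regex mirrors the substring('\(..\)') extraction used in the SQL above.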
I am a beginner in R and am trying to create some tables using this great gtsummary package. My question: is it possible to use add_difference to add two separate difference statistics at the same time (or to combine them somehow)? I am able to create a perfect table with p-values or effect sizes, but not with both. Also, is it possible to use Bonferroni-adjusted p-values?
My simple code (with t.test) looks like this:
table1 <- tbl_summary(df, by = gr,
                      statistic = list(all_continuous() ~ "{mean} ({sd})")) %>%
  add_difference(test = list(all_continuous() ~ "t.test"),
                 group = gr,
                 conf.level = 0.95,
                 pvalue_fun = function(x) style_pvalue(x, digits = 2))
Thanks for the help.
Yes, the table you're after is possible in gtsummary. Use the tbl_merge() function to merge the table with standardized differences and the one with p-values; there is an example of this below. An alternative is to use add_stat() to construct customized columns.
library(gtsummary)
#> #BlackLivesMatter
packageVersion("gtsummary")
#> [1] '1.4.1'
# standardized difference
tbl1 <-
trial %>%
select(trt, age, marker) %>%
tbl_summary(by = trt, missing = "no",
statistic = all_continuous() ~ "{mean} ({sd})") %>%
add_difference(all_continuous() ~ "cohens_d")
# table with p-value and corrected p-values
tbl2 <-
trial %>%
select(trt, age, marker) %>%
tbl_summary(by = trt, missing = "no") %>%
add_p(all_continuous() ~ "t.test") %>%
add_q(method = "bonferroni") %>%
# hide the trt summaries, because we don't need them repeated
modify_column_hide(all_stat_cols())
#> add_q: Adjusting p-values with
#> `stats::p.adjust(x$table_body$p.value, method = "bonferroni")`
# merge tbls together
tbl_final <-
tbl_merge(list(tbl1, tbl2)) %>%
# remove spanning headers
modify_spanning_header(everything() ~ NA)
Created on 2021-07-20 by the reprex package (v2.0.0)
I am using spark-sql 2.4.1 with Java 1.8.
I have source data as below:
val df_data = Seq(
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2020-03-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-06-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",21789933,"2020-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",300789933,"2018-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",27989978,"2019-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-06-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2020-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2019-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2017-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-06-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2018-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-12-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2019-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-09-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2016-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2020-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2019-09-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2016-03-01")
).toDF("industry_id","industry_name","country","state","revenue","generated_date");
Query :
val distinct_gen_date = df_data.select("generated_date").distinct.orderBy(desc("generated_date"));
For each "generated_date" in distinct_gen_date, I need to get all unique industry_ids from the previous 6 months of data.
val cols = {col("industry_id")}
val ws = Window.partitionBy(cols).orderBy(desc("generated_date"));
val newDf = df_data
.withColumn("rank",rank().over(ws))
.where(col("rank").equalTo(lit(1)))
//.drop(col("rank"))
.select("*");
How do I get a moving aggregate (unique industry_ids over the prior 6 months of data) for each distinct generated_date? How can I achieve this moving aggregation?
More details:
For example, in the given sample data, assume the dates run from "2020-03-01" back to "2016-03-01". If some industry_x is not present in "2020-03-01", I need to check "2020-02-01", "2020-01-01", "2019-12-01", "2019-11-01", "2019-10-01", "2019-09-01" sequentially; as soon as it is found, its rank-1 row is taken into consideration when calculating the "2020-03-01" data set. Next we move on to "2020-02-01", i.e. for each distinct "generated_date" we go back 6 months, get the unique industries, and pick the rank-1 data; then we pick another distinct "generated_date" and do the same, and so on. The data set keeps changing each time. Using a for loop I can do this, but it gives no parallelism. How can I process each distinct "generated_date" in parallel?
I don't know how to do this with window functions, but a self join can solve your problem.
First, you need a DataFrame with distinct dates:
val df_dates = df_data
.select("generated_date")
.withColumnRenamed("generated_date", "distinct_date")
.distinct()
Next, for each row in your industries data you need to calculate up to which date that industry will be included, i.e., add 6 months to generated_date. I think of these as active dates. I've used add_months() to do this, but you can substitute different logic.
import org.apache.spark.sql.functions.add_months
val df_active = df_data.withColumn("active_date", add_months(col("generated_date"), 6))
If we start with this data (separated by date just for our eyes):
industry_id generated_date
(("Indus_1", ..., "2020-03-01"),
("Indus_1", ..., "2019-12-01"),
("Indus_2", ..., "2019-12-01"),
("Indus_3", ..., "2018-06-01"))
It now has:
industry_id generated_date active_date
(("Indus_1", ..., "2020-03-01", "2020-09-01"),
("Indus_1", ..., "2019-12-01", "2020-06-01"),
("Indus_2", ..., "2019-12-01", "2020-06-01")
("Indus_3", ..., "2018-06-01", "2018-12-01"))
Now proceed with self join based on dates, using the join condition that will match your 6 month period:
val condition: Column = (
col("distinct_date") >= col("generated_date")).and(
col("distinct_date") <= col("active_date"))
val df_joined = df_dates.join(df_active, condition, "inner")
df_joined now contains:
distinct_date industry_id generated_date active_date
(("2020-03-01", "Indus_1", ..., "2020-03-01", "2020-09-01"),
("2020-03-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2020-03-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2018-06-01", "Indus_3", ..., "2018-06-01", "2018-12-01"))
Drop the auxiliary column active_date, or even better, drop duplicates based on your needs:
val df_result = df_joined.dropDuplicates(Seq("distinct_date", "industry_id"))
This drops the duplicated "Indus_1" in "2020-03-01" (it appeared twice because it was retrieved from two different generated_dates):
distinct_date industry_id
(("2020-03-01", "Indus_1"),
("2020-03-01", "Indus_2"),
("2019-12-01", "Indus_1"),
("2019-12-01", "Indus_2"),
("2018-06-01", "Indus_3"))
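To see that the join condition behaves as intended, the same date-window logic can be mimicked outside Spark. This is a minimal Python sketch over the small example above (plain tuples and hand-rolled month arithmetic standing in for add_months):

```python
from datetime import date

def add_months(d, n):
    # shift a first-of-month date forward by n calendar months
    y, m = divmod(d.month - 1 + n, 12)
    return date(d.year + y, m + 1, d.day)

# (industry_id, generated_date), matching the small example above
rows = [
    ("Indus_1", date(2020, 3, 1)),
    ("Indus_1", date(2019, 12, 1)),
    ("Indus_2", date(2019, 12, 1)),
    ("Indus_3", date(2018, 6, 1)),
]

distinct_dates = sorted({d for _, d in rows}, reverse=True)

# "self join": keep each (distinct_date, industry) pair where
# generated_date <= distinct_date <= generated_date + 6 months;
# building a set also plays the role of dropDuplicates
joined = {
    (dd, ind)
    for dd in distinct_dates
    for ind, gd in rows
    if gd <= dd <= add_months(gd, 6)
}
# five pairs survive, matching the df_result table above
```

The inner join plus dedup in the Spark version produces the same five (distinct_date, industry_id) pairs.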
I have two dataframes. [AllAccounts] contains an audit trail for all accounts of all users:
UserId, AccountId, Balance, CreatedOn
1, acc1, 200.01, 2016-12-06T17:09:36.123-05:00
1, acc2, 189.00, 2016-12-06T17:09:38.123-05:00
1, acc1, 700.01, 2016-12-07T17:09:36.123-05:00
1, acc2, 189.00, 2016-12-07T17:09:38.123-05:00
1, acc3, 010.01, 2016-12-07T17:09:39.123-05:00
1, acc1, 900.01, 2016-12-08T17:09:36.123-05:00
[ActiveAccounts] contains an audit trail for only the active account (could be zero or one) per user:
UserId, AccountId, Balance, CreatedOn
1, acc2, 189.00, 2016-12-06T17:09:38.123-05:00
1, acc3, 010.01, 2016-12-07T17:09:39.123-05:00
I want to transform these into a single DF which is of the format
UserId, AccountId, Balance, CreatedOn, IsActive
1, acc1, 200.01, 2016-12-06T17:09:36.123-05:00, false
1, acc2, 189.00, 2016-12-06T17:09:38.123-05:00, true
1, acc1, 700.01, 2016-12-07T17:09:36.123-05:00, false
1, acc2, 189.00, 2016-12-07T17:09:38.123-05:00, true
1, acc3, 010.01, 2016-12-07T17:09:39.123-05:00, true
1, acc1, 900.01, 2016-12-08T17:09:36.123-05:00, false
So based on the accounts in ActiveAccounts, I need to flag the rows in the first df appropriately. In the example, acc2 for userId 1 was marked active at 2016-12-06T17:09:38.123-05:00 and acc3 at 2016-12-07T17:09:39.123-05:00. So between those two times acc2 is marked true, and from 2016-12-07T17:09:39 onwards acc3 is marked true.
What would be an efficient way to do this?
If I understand properly, an account like (1, acc2) is active between its activation time and that of the next activated account, (1, acc3).
We can do this in a few steps:
create a data frame with the start/end times for each account
join with AllAccounts
flag the rows of the resulting dataframe
I haven't tested this, so there may be syntax mistakes.
To accomplish the first task, we need to partition the dataframe by user and then look at the next creation time. This calls for a window function:
val window = Window.partitionBy("UserId").orderBy("StartTime")
val activeTimes = ActiveAccounts.withColumnRenamed("CreatedOn", "StartTime")
.withColumn("EndTime", lead("StartTime") over window)
Note that the last EndTime for each user will be null. Now join:
val withActive = AllAccounts.join(activeTimes, Seq("UserId", "AccountId"))
(This should be a left join if you might be missing active times for some accounts.)
Then you have to go through and flag the accounts as active:
val withFlags = withActive.withColumn("isActive",
$"CreatedOn" >= $"StartTime" &&
($"EndTime".isNull || ($"CreatedOn" < $"EndTime")))
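The build-intervals / join / flag steps can be sanity-checked outside Spark as well. A minimal Python sketch of the same logic, using the sample rows from the question (timestamps kept as ISO strings, which compare correctly lexicographically):

```python
# active-account events per user: (user_id, account_id, created_on)
active = [
    (1, "acc2", "2016-12-06T17:09:38"),
    (1, "acc3", "2016-12-07T17:09:39"),
]
active.sort(key=lambda r: (r[0], r[2]))

# build (start, end) per (user, account); end is the next activation
# for the same user, i.e. what lead("StartTime") computes per window
periods = {}
for i, (user, acct, start) in enumerate(active):
    nxt = active[i + 1] if i + 1 < len(active) and active[i + 1][0] == user else None
    periods[(user, acct)] = (start, nxt[2] if nxt else None)

def is_active(user, acct, created_on):
    # same predicate as the isActive column: within [start, end)
    p = periods.get((user, acct))
    if p is None:
        return False  # the "left join produced nulls" case
    start, end = p
    return created_on >= start and (end is None or created_on < end)
```

On the sample rows, acc1 is always flagged False (it never appears in ActiveAccounts), acc2 is True until acc3's activation, and acc3 is True from then on, matching the expected output above.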
Could anyone explain what the exclude_nodata_value argument to ST_DumpValues does?
For example, given the following:
WITH
-- Create a 4x4 raster, with each value set to 8 and NODATA set to -99.
tbl_1 AS (
SELECT
ST_AddBand(
ST_MakeEmptyRaster(4, 4, 0, 0, 1, -1, 0, 0, 4326),
1, '32BF', 8, -99
) AS rast
),
-- Set the values in rows 1 and 2 to -99.
tbl_2 AS (
SELECT
ST_SetValues(
rast, 1, 1, 1, 4, 2, -99, FALSE
) AS rast FROM tbl_1)
Why does the following select statement return NULLs in the first two rows:
SELECT ST_DumpValues(rast, 1, TRUE) AS cell_values FROM tbl_2;
Like this:
{{NULL,NULL,NULL,NULL},{NULL,NULL,NULL,NULL},{8,8,8,8},{8,8,8,8}}
But the following select statement returns -99s?
SELECT ST_DumpValues(rast, 1, FALSE) AS cell_values FROM tbl_2;
Like this:
{{-99,-99,-99,-99},{-99,-99,-99,-99},{8,8,8,8},{8,8,8,8}}
Clearly, with both statements the first two rows really contain -99s. However, in the first case (exclude_nodata_value = TRUE) these values have been masked (but not replaced) by NULLs.
Thanks for any help. The subtle differences between NULL and NODATA within PostGIS have been driving me crazy for several days.
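The masking behaviour described above can be modelled in a few lines. This is a minimal Python sketch (a hypothetical dump_values stand-in for ST_DumpValues, not PostGIS itself) showing that exclude_nodata_value only masks NODATA cells as NULL on output, without touching the stored band:

```python
NODATA = -99

# the 4x4 band after ST_SetValues: rows 1-2 hold the NODATA value
band = [
    [-99, -99, -99, -99],
    [-99, -99, -99, -99],
    [8, 8, 8, 8],
    [8, 8, 8, 8],
]

def dump_values(band, exclude_nodata_value):
    """Stand-in for ST_DumpValues: mask NODATA as None (NULL) when asked."""
    if not exclude_nodata_value:
        return [row[:] for row in band]
    return [[None if v == NODATA else v for v in row] for row in band]
```

dump_values(band, True) yields None in the first two rows, dump_values(band, False) yields the raw -99s, and band itself is unchanged either way, mirroring the two select statements in the question.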