Using theta sketches to count ad impressions and unique users - Druid

We're currently serving 250 billion ad impressions per day across our 6 data centers, about 180 billion of which are served in the US alone.
Each ad impression can have hundreds of attributes (dimensions), e.g. Country, City, Browser, OS, custom parameters from the web page, ad-size, ad-id, site-id, etc.
Currently, we don't have a data warehouse, and ad-hoc OLAP support is pretty much non-existent in our organization. This severely limits our ability to run ad-hoc queries and get a quick grasp of the data.
We want to answer the following 2 queries to begin with:
Q1) Find the total count of ad impressions served from "beginDate" to "endDate" where Dimension1 = d_1 and Dimension2 = d_2 ... and Dimension_k = d_k.
Q2) Find the total count of unique users who saw our ads from "beginDate" to "endDate" where Dimension1 = d_1 and/or Dimension2 = d_2 ... and/or Dimension_k = d_k.
As I said, each impression can have hundreds of dimensions (listed above), and the cardinality of each dimension can range from a few hundred (e.g. Country) to billions (e.g. User-Id).
We want approximate answers, the lowest possible infrastructure cost, and query response times under 5 minutes. I am thinking about using Druid and Apache DataSketches (theta sketches, to be precise) for answering Q2, using the following data model:
Date       | Dimension Name | Dimension Value | Unique-User-ID (theta sketch)
2021/09/12 | "Country"      | "US"            | 37873-3udif-83748-2973483
2021/09/12 | "Browser"      | "Chrome"        | 37873-3aeuf-83748-2973483
...
<Other records>
So after roll-up, I would end up with one theta sketch per dimension value per day (assuming day-level granularity), and I can do unions and intersections on these sketches to answer Q2.
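To make that concrete, here is a rough sketch of what Q2 could look like in Druid SQL, assuming the druid-datasketches extension is loaded and using placeholder names (a datasource impressions_rollup with columns dim_name and dim_value and a theta sketch metric user_sketch):
-- Q2 sketch: approximate unique users for Country = US AND Browser = Chrome in a date range
SELECT
  THETA_SKETCH_ESTIMATE(
    THETA_SKETCH_INTERSECT(
      DS_THETA(user_sketch) FILTER (WHERE dim_name = 'Country' AND dim_value = 'US'),
      DS_THETA(user_sketch) FILTER (WHERE dim_name = 'Browser' AND dim_value = 'Chrome')
    )
  ) AS unique_users
FROM impressions_rollup
WHERE __time >= TIMESTAMP '2021-09-01' AND __time < TIMESTAMP '2021-09-13'
Each filtered DS_THETA aggregation unions the daily sketches for one dimension value over the date range, and THETA_SKETCH_INTERSECT / THETA_SKETCH_UNION combine them for the AND / OR cases.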
I am planning to set k (nominal entries) to 10^5 (please comment on what a suitable k would be for this use case and the expected storage amount required).
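As a point of reference (a general rule of thumb for theta sketches, not something specific to this setup): the relative standard error is roughly 1/sqrt(k), and nominal entries must be a power of two, so a target of 10^5 would in practice round to 2^16 or 2^17:
RSE ≈ 1 / sqrt(k)
k = 65,536 (2^16)  -> RSE ≈ 1/256 ≈ 0.39%
k = 131,072 (2^17) -> RSE ≈ 0.28%
A full compact sketch stores roughly 8 bytes per retained hash, so a saturated sketch tops out around 8 * k bytes (about 0.5-1 MB at these sizes), and total storage is bounded by the number of (day, dimension name, dimension value) rows times that figure.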
I've also read about theta sketch set operations accuracy here.
I would like to know if there is a better approach to solving Q2 (with or without Druid).
Also, I would like to know how I can solve Q1.
If I replace Unique-User-Id with "Impression-Id", can I use the same data model to answer Q1? I believe that if I do this, the accuracy of counting total impressions would be far worse than that of Q2, because each ad impression is assigned a unique id and we are currently serving 250 billion per day.
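To illustrate what that swap would mean, under the same placeholder schema as above Q1 would be the same shape of query over a hypothetical impression_sketch column instead of user_sketch (the distinct count of impression ids equals the impression count, since every impression id is unique):
-- Q1 sketch-based variant: approximate impression count for Country = US AND Browser = Chrome
SELECT
  THETA_SKETCH_ESTIMATE(
    THETA_SKETCH_INTERSECT(
      DS_THETA(impression_sketch) FILTER (WHERE dim_name = 'Country' AND dim_value = 'US'),
      DS_THETA(impression_sketch) FILTER (WHERE dim_name = 'Browser' AND dim_value = 'Chrome')
    )
  ) AS total_impressions
FROM impressions_rollup
WHERE __time >= TIMESTAMP '2021-09-01' AND __time < TIMESTAMP '2021-09-13'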
Please share your thoughts about solving Q1 and Q2.
Regards
kartik

Related

I need help with a data sanitization problem in Tableau

I tried doing the sanitization manually, but I am getting a type mismatch error when performing the calculations.
I also need help sanitizing the data and getting the insights per the instructions below:
The column sellerproductcount gives you the count of products in the form '1-16 of over 100,000 results', and you can parse out the product count 100,000.
sellerratings - this column gives you the % and count of positive ratings (e.g. 88% positive in the last 12 months (118 ratings)) if parsed correctly.
sellerdetails - you can use this text to parse out phone numbers and email IDs of merchants, where available, so our team can reach out to them.
businessaddress - this will give you the business locations of the sellers. You can parse them to identify if a seller is registered in the US, Germany (DE), or China (CN).
Hero Product 1 #ratings and Hero Product 2 #ratings - these 2 columns give you the number of ratings of the 2 'hero products', or bestselling products, of this seller.
I have attached the dataset for the same.
https://docs.google.com/spreadsheets/d/1PSqRCnmFgq7v7RzZaCXXoV0Edp_vM7QO/edit?usp=sharing&ouid=115547990006782902200&rtpof=true&sd=true
Most of this type of data prep can be done with string & RegEx functions like REGEXP_MATCH(). Here are a few examples based on the data you shared:
Seller Product Count
INT(REGEXP_EXTRACT([Sellerproductcount], '(\d*,?\d*) results'))
1-16 of over 6,000 results >> 6000
Seller Rating (Percentage)
INT(REGEXP_EXTRACT([Sellerratings], '(\d*)% positive'))
92% positive in the last 12 months (181 ratings) >> 92
Seller Rating (Count)
INT(REGEXP_EXTRACT([Sellerratings], '(\d*) (?:total )?ratings'))
92% positive in the last 12 months (181 ratings) >> 181
Business Country Code
RIGHT([Businessaddress],2)
AM Treptower Park28-30Berlin12435DE >> DE
These examples all have very straightforward patterns that are present in all rows so they can be done pretty easily with one simple calculation. However, something like sellerdetails which is unstructured, inconsistent, and sometimes incomplete will be a bit more of a challenge. You will need to use a couple of different calculations and techniques combined together to find what you are looking for, as well as some manual data prep. Here's an example of how you can pull out email but it won't work for everything:
Email
REGEXP_EXTRACT([Sellerdetails], '([a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)')
Good luck with your data cleaning, I suggest using sites like https://regex101.com/ and https://regexr.com/ to learn more about and help test regular expressions.

Post-Aggregation Join of two tables in Tableau

I'm new to Tableau and need to do some kind of post-aggregation join, I think. My goal is to match some data from Google Search Console to some other regional data concerning hotels. This way, I hope to see whether hotels in a certain region perform better or worse than their popularity in Google searches would suggest.
I have one table with the hotel-data which looks like this:
Table 1
Here we have three hierarchical region levels: country, state, and region (and a KPI that is aggregated according to the drill-down level).
Table 2
Table 2 does not follow the same hierarchical structure as Table 1, but it contains the same regions.
What I want Tableau to do:
I want Tableau to join the regions on the lowest region level, but NOT to aggregate the KPI "impressions". So, when I drill up to the country level, I want the "random KPI" to be summed to 389, but the impressions should be 40,000 only. You might ask yourself why - it's a different thing if somebody only searches for "country 1" or if they search for a state or region of that country. For this analysis, the goal is to not aggregate the impressions across regions.
I would be glad for any hints on how to do this. I thought about doing a blend - which I thought is a kind of post-aggregation join - but I found out that if I join the lowest region level of table 1 with the region variable of table 2, the impressions always get aggregated.
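Expressed in rough SQL terms, just to pin down the semantics I am after (table and column names are made up for illustration), the country-level result should come out like this:
-- country level: the KPI still rolls up, but impressions come only from the row whose search term is the country itself
SELECT t1.country,
       SUM(t1.random_kpi)  AS random_kpi,   -- summed to 389
       MAX(t2.impressions) AS impressions   -- 40,000 from the 'country 1' term only
FROM table1 t1
LEFT JOIN table2 t2 ON t2.region = t1.country
GROUP BY t1.country;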
Thanks everyone!

Tableau - Calculated field of different columns based on different partition of the same table

Sorry for the stupid question.
Situation: I have a partitioned table (the partition is the week of the year) with some metrics (e.g. the frequency of some keywords); I need to run an analysis of metrics belonging to different partitions (e.g. the trend between the frequency of a keyword in week 32 compared to week 3). The ultimate purpose is to create a dashboard where the user can choose the weeks of the year and is presented with the calculated analysis on the fly.
So far I have used a live query with two parameters (week_1 and week_2) that joins data from the same table based on the two different parameters. As you can imagine, the dashboard recomputes everything whenever one of the parameters is changed by the user. To avoid long waiting times, I have set the two parameters to a non-existent default value (0, zero), so that the dashboard opens very quickly. Then I prompt the user to stop the dashboard, insert the new parameters of choice, and restart the dashboard to load the new computations.
My question is: is it possible to achieve the same thing by using an extract of the table? The table itself should not be excessively big (about 15 million records spanning 3 years), and as far as I know extracts are performant with those numbers.
I am quite new to Tableau, so I would like to know from more experienced people whether there is a better way to do this without using live queries.
Please feel free to ask for more information if I was not clear! However, I cannot share my workbook, as it contains sensitive information.
Edit:
partition | keyword | frequency
----------+---------+----------
202032    | hello   | 5000
202032    | ciao    | 567
...
202031    | hello   | 2323
202031    | ciao    | 34567
...
20203     | hello   | 2
20203     | ciao    | 1000
With the live query, I can join the table where partition = 202032 with the same table where partition = 20203 and make a new table with a column where I compute e.g. a trend between the two frequencies:
keyword | partitions_compared | trend
--------+---------------------+-------------
hello   | 202032 - 20203      | +1billion %
ciao    | 202032 - 20203      | +1K %
With the live query I join on the keywords.
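For clarity, the self-join behind the live query looks roughly like this (keyword_freq is a placeholder name for the table, and the trend formula is simplified):
SELECT a.keyword,
       CONCAT(a.`partition`, ' - ', b.`partition`) AS partitions_compared,
       (a.frequency - b.frequency) / b.frequency * 100 AS trend
FROM keyword_freq a
JOIN keyword_freq b ON a.keyword = b.keyword
WHERE a.`partition` = 202032   -- week_1
  AND b.`partition` = 20203;   -- week_2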
Thanks a lot in advance and have a great day!
Cheers

MySQL Workbench - script storing return in array and performing calculations?

Firstly, this is part of my college homework.
Now that's out of the way: I need to write a query that will get the number of free apps in a DB as a percentage of the total number of apps, sorted by what category the app is in.
I can get the number of free apps and also the total number of apps by category. Now I need to find the percentage, and this is where it goes a bit pear-shaped.
Here is what I have so far:
-- find total number of apps per category
select @totalAppsPerCategory := count(*), category_primary
from apps
group by category_primary;
-- find number of free apps per category
select @freeAppsPerCategory := count(*), category_primary
from apps
where (price = 0.0)
group by category_primary;
-- find percentage of free apps per category
set @totals = @freeAppsPerCategory / @totalAppsPerCategory * 100;
select @totals, category_primary
from apps
group by category_primary;
It then lists the categories but the percentage listed in each category is the exact same value.
I had initially thought to use an array, but from what I have read, MySQL does not seem to support arrays.
I'm a bit lost as to how to proceed from here.
Finally figured it out. Since I had been saving the previous results in variables, the calculation could not be done on a row-by-row basis, which is why all the percentages were identical - the variables only held a single overall value. So the calculation needed to be part of the query.
Here's what I came up with:
SELECT DISTINCT
    category_primary,
    CONCAT(
        FORMAT(COUNT(CASE WHEN price = 0 THEN 1 END) / COUNT(*) * 100, 1),
        '%'
    ) AS FreeAppSharePercent
FROM apps
GROUP BY category_primary
ORDER BY FreeAppSharePercent DESC;
The query result then lists each category with its own free-app percentage.

SQLite vs Memory

I have a situation with my app.
Suppose I have 6 users, and each user can have up to 9 score entries (e.g. scored 1000 points at 8:00 pm with 3 gold collected, 4 silver, etc.), say one score per stage across 9 stages.
All these scores are taken from an API call, so they can update at an interval of 3+ minutes.
The operations I need to do on this data are:
find the nearest min and max records for stage 4,
and some more operations like adding or subtracting two scores, etc.
All 6 users and their score records are already in the database, and are updated as needed after the API call.
Now my question is:
Is it better, for this kind of data (the scores), to keep all the data for all 6 users in memory in an NSArray or NSDictionary and find the min and max in that array with a min-max algorithm,
OR
should it be taken from the database with queries like "WHERE score <= 200" and "WHERE score >= 200" - in short, two database queries that return the nearest min and max records respectively - without keeping all the data in memory?
What we are focusing on is both speed and memory usage. The point is: would a DB call be fast and efficient for finding the min and max, or would searching for the min/max in an array of all the records from the DB be better?
All records together are 6 users * 9 scores each = 54.
The update interval for the records can be 3+ minutes.
The frequency of finding the min/max for certain values is high.
Please ask, if any more details are required.
Thanks in advance.
You're working with such a small amount of data that I wouldn't imagine it would be worth worrying about. Do whichever method makes your development process easiest!
Edit:
If I had a lot of data (hundreds of competitors) I'd use SQLite. You can do queries like the following:
SELECT MIN(`score`) FROM `T_SCORE` WHERE `stage` = '4';
That way you can let the database handle doing the calculation for you, so you never have to fetch all the results.
My SQL-fu isn't the most awesome, but I think you can also do this:
SELECT `stage`, MIN(`score`) AS min, MAX(`score`) AS max FROM `T_SCORE` GROUP BY `stage`
That would do all the calculations in one single query.
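For the "nearest record to a given score" lookups mentioned in the question, a similar pattern should work (again assuming the same T_SCORE table; ORDER BY plus LIMIT picks the single closest row on each side):
-- nearest record at or below a score of 200 in stage 4
SELECT * FROM `T_SCORE` WHERE `stage` = '4' AND `score` <= 200 ORDER BY `score` DESC LIMIT 1;
-- nearest record at or above a score of 200 in stage 4
SELECT * FROM `T_SCORE` WHERE `stage` = '4' AND `score` >= 200 ORDER BY `score` ASC LIMIT 1;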