MySQL Workbench - script storing return in array and performing calculations? - mysql-workbench

Firstly, this is part of my college homework.
Now that's out of the way: I need to write a query that will get the number of free apps in a DB as a percentage of the total number of apps, sorted by what category the app is in.
I can get the number of free apps and also the total number of apps by category. Now I need to find the percentage, and this is where it goes a bit pear-shaped.
Here is what I have so far:
-- find total number of apps per category
select @totalAppsPerCategory := count(*), category_primary
from apps
group by category_primary;
-- find number of free apps per category
select @freeAppsPerCategory := count(*), category_primary
from apps
where (price = 0.0)
group by category_primary;
-- find percentage of free apps per category
set @totals = @freeAppsPerCategory / @totalAppsPerCategory * 100;
select @totals, category_primary
from apps
group by category_primary;
It then lists the categories, but the percentage shown for each category is exactly the same value.
I had initially thought to use an array, but from what I have read, MySQL does not seem to support arrays.
I'm a bit lost as to how to proceed from here.

Finally figured it out. Since I had been saving the previous results in variables, the calculation could not happen on a row-by-row basis, which is why all the percentages were identical (it was an average). So the calculation needed to be part of the query.
Here's what I came up with:
SELECT DISTINCT
    category_primary,
    CONCAT(FORMAT(COUNT(CASE WHEN price = 0 THEN 1 END) / COUNT(*) * 100, 1),
           '%') AS FreeAppSharePercent
FROM apps
GROUP BY category_primary
ORDER BY FreeAppSharePercent DESC;
The query then returns one row per category with its free-app percentage.
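Note that FreeAppSharePercent is a string (because of the CONCAT), so the ORDER BY above compares text rather than numbers. A variant that keeps the sort numeric and only formats the value for display, assuming the same apps table, could look like this:
-- Compute the free-app ratio numerically, sort by it, and format only for display.
SELECT
    category_primary,
    CONCAT(FORMAT(free_ratio * 100, 1), '%') AS FreeAppSharePercent
FROM (
    SELECT
        category_primary,
        AVG(price = 0) AS free_ratio  -- (price = 0) is 1 for free apps, 0 otherwise; assumes price is never NULL
    FROM apps
    GROUP BY category_primary
) AS per_category
ORDER BY free_ratio DESC;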

Related

Using theta sketch to count ad impressions and unique users

We're currently serving 250 Billion ad impressions per day across our 6 data centers. Out of these, we are serving about 180 Billion ad impressions in the US alone.
Each ad impression can have hundreds of attributes (dimensions), e.g. Country, City, Browser, OS, custom parameters from the web page, ad-size, ad-id, site-id, etc.
Currently, we don't have a data warehouse, and ad-hoc OLAP support is pretty much non-existent in our organization. This severely limits our ability to run ad-hoc queries and get a quick grasp of the data.
We want to answer the following two queries to begin with:
Q1) Find the total count of ad impressions which were served from "beginDate" to "endDate" where Dimension1 = d1 and Dimension2 = d2 ... Dimensionk = d_k
Q2) Find the total count of unique users who saw our ads from "beginDate" to "endDate" where Dimension1 = d1 and/or Dimension2 = d2 ... Dimensionk = d_k
As I said, each impression can have hundreds of dimensions (listed above), and the cardinality of each dimension could range from a few hundred (say, for the Country dimension) to billions (e.g. User-ID).
We want approximate answers, the least infrastructure cost, and query response times under 5 minutes. I am thinking about using Druid and Apache DataSketches (Theta sketch, to be precise) for answering Q2, with the following data model:
Date        Dimension Name   Dimension Value   Unique-User-ID (Theta sketch)
2021/09/12  "Country"        "US"              37873-3udif-83748-2973483
2021/09/12  "Browser"        "Chrome"          37873-3aeuf-83748-2973483
...
<Other records>
So after roll-up, I would end up with one theta sketch per dimension value per day (assuming day-level granularity), and I can do unions and intersections on these sketches to answer Q2.
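For concreteness, the kind of Druid SQL query I have in mind for Q2 would look roughly like this (the datasource, column, and sketch-metric names are illustrative, and it assumes the DataSketches extension is loaded and a theta-sketch metric user_sketch was built from Unique-User-ID at ingestion):
-- Illustrative only: estimated unique users where Country = 'US' AND Browser = 'Chrome' on one day.
SELECT
  THETA_SKETCH_ESTIMATE(
    THETA_SKETCH_INTERSECT(
      DS_THETA(user_sketch) FILTER (WHERE dimension_name = 'Country' AND dimension_value = 'US'),
      DS_THETA(user_sketch) FILTER (WHERE dimension_name = 'Browser' AND dimension_value = 'Chrome')
    )
  ) AS approx_unique_users
FROM ad_impressions
WHERE __time >= TIMESTAMP '2021-09-12' AND __time < TIMESTAMP '2021-09-13';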
I am planning to set k (nominal entries) to 10^5 (please comment on what k would be suitable for this use case and how much storage to expect).
I've also read about theta sketch set-operation accuracy here.
I would like to know if there is a better approach to solving Q2 (with or without Druid).
I would also like to know how I can solve Q1.
If I replace Unique-User-Id with "Impression-Id", can I use the same data model to answer Q1? I believe the accuracy of counting total impressions would then be far worse than for Q2, because each ad impression is assigned a unique ID and we are currently serving 250 billion per day.
Please share your thoughts about solving Q1 and Q2.
Regards
kartik

Calculate sum of unique tag values

Sometimes it appears that one has to calculate the SUM of unique TAG values in InfluxDB. How do you do it?
For example, I have multiple users who download software, and I want to extract how many unique users downloaded it.
The following query was tested in Grafana to calculate unique users; it also takes the applied time filter into account.
To do this, we first apply a subquery that calculates mean values; this effectively produces a table with the value 1 associated with each user:
SELECT mean("count") FROM "autogen"."downloads" WHERE $timeFilter GROUP BY "username"
Here count is an integer field that equals 1 each time a user downloads the software.
Afterwards we can calculate the sum of these mean values. Yes, this is not cheap if you have a huge database, but it is still a working solution:
SELECT SUM(mean) FROM (
SELECT mean("count") FROM "autogen"."downloads" WHERE $timeFilter GROUP BY "username"
)
Please go ahead and propose a better-performing or more native solution; it would be nice to have one that also works for larger DBs.
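One alternative worth considering, assuming the user name can also be written as a field value (InfluxQL's DISTINCT() only operates on fields, not tags; the field name username_f below is hypothetical), is to nest DISTINCT() inside COUNT():
-- Counts distinct values of the hypothetical field "username_f" in the selected time range.
SELECT COUNT(DISTINCT("username_f")) FROM "autogen"."downloads" WHERE $timeFilter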

How to count the maximum value and make sql display one single row

I am learning SQL and want to make the following:
I need to get the highest value from 2 different tables. The output displays all rows; however, I need a single row with the maximum value.
P.S. LIMIT 1 does not work in SQL Server Management Studio.
SELECT Players.PlayersID, MAX(Participants.EventsID) AS Maximum
FROM Players
LEFT JOIN Participants ON Players.PlayersID = Participants.PlayersID
GROUP BY Players.PlayersID
I clearly understand that this can be a dumb question for pros, however Google did not help. Thanks for understanding and your help.
Try using TOP:
SELECT TOP 1
pl.PlayersID,
MAX(pa.EventsID) AS Maximum
FROM Players pl
LEFT JOIN Participants pa
ON pl.PlayersID = pa.PlayersID
GROUP BY
pl.PlayersID
ORDER BY
MAX(pa.EventsID) DESC;
If you want to cater for the possibility of two players being tied for the same maximum, then use TOP 1 WITH TIES instead of just TOP 1.
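For example, the tie-friendly version of the same query:
SELECT TOP 1 WITH TIES
    pl.PlayersID,
    MAX(pa.EventsID) AS Maximum
FROM Players pl
LEFT JOIN Participants pa
    ON pl.PlayersID = pa.PlayersID
GROUP BY
    pl.PlayersID
ORDER BY
    MAX(pa.EventsID) DESC;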

How to sort by a calculated column in PostgreSQL?

We have a number of fields in our offers table that determine an offer's price, however one crucial component for the price is an exchange rate fetched from an external API. We still need to sort offers by actual current price.
For example, let's say we have two columns in the offers table: exchange and premium_percentage. The exchange column names the source of the exchange rate, to which an external request will be made; premium_percentage is set by the user. In this situation, it is impossible to sort offers by current price without knowing the exchange rate, and that may be different depending on what's in the exchange column.
How would one go about this? Is there a way to make Postgres calculate current price and then sort offers by it?
SELECT
product_id,
get_current_price(exchange) * (premium_percentage::float/100 + 1) AS price
FROM offers
ORDER BY 2;
Note the ORDER BY 2 to sort by the second ordinal column.
You can instead repeat the expression you want to sort by in the ORDER BY clause. But that can result in the expression being evaluated more than once.
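For example, using the same hypothetical get_current_price() function as above:
SELECT
    product_id,
    get_current_price(exchange) * (premium_percentage::float/100 + 1) AS price
FROM offers
ORDER BY get_current_price(exchange) * (premium_percentage::float/100 + 1);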
Or you can wrap it all in a subquery so you can name the output columns and refer to them in other clauses.
SELECT product_id, price
FROM
(
SELECT
product_id,
get_current_price(exchange) * (premium_percentage::float/100 + 1)
FROM offers
) product_prices(product_id, price)
ORDER BY price;

Pagination on large data sets? – Abort count(*) after a certain time

We use the following pagination technique here:
get count(*) of given filter
get first 25 records of given filter
-> render some pagination links on the page
This works pretty well as long as count(*) is reasonably fast. In our case the data size has grown to a point where a non-indexed query (although most of it is covered by indices) takes more than a minute. So at this point the user waits for a mostly unimportant number (total records matching the filter, number of pages). The first N records are often ready pretty fast.
Therefore I have two questions:
can I limit the count(*) to a certain number
or would it be possible to limit it by time? (no count() known after 20ms)
Or just in general: are there some easy ways to avoid that problem? We would like to keep the system as untouched as possible.
Database: Oracle 10g
Update
There are several scenarios
a) there's an index -> neither count(*) nor the actual select should be a problem
b) there's no index
count(*) is HUGE, and it takes ages to determine it -> rownum would help
count(*) is zero or very low; here a time limit would help. Or I could just skip the count(*) if the result set is already below the page limit.
You could use 'where rownum < x' to limit the number of rows counted. And if you need to show the user that there are more records, you could count up to x+1 just to check whether there are more than x records.
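A sketch of that idea (the table name and filter below are placeholders; page size 25, so we count at most 26 rows):
-- ROWNUM <= 26 caps the inner result at 26 rows, so Oracle can stop scanning early.
SELECT COUNT(*) AS capped_count
FROM (SELECT 1
        FROM orders            -- hypothetical table
       WHERE status = 'OPEN'   -- hypothetical filter
         AND ROWNUM <= 26);
If capped_count comes back as 26, there is at least one more page; otherwise it is the exact total.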