Good Morning/Afternoon,
Fairly new to SQL in general, but I've been able to put together a few queries that give me the output I need for a certain business case. However, I have a new requirement that I am struggling with.
In a nutshell, I am extracting a list of "sites" (physical locations) for a given customer along with the lat/lon, and then feeding each location to Google Maps to plot a point for each site. What I am trying to do is query for unique sites. If there are multiple sites that are within 1/2 mile of each other, they are lumped as a single site. For example, if a customer has 10 sites within 1/2 mile of each other, they technically have 1 site, not 10.
Here is an example of what I am doing:
select c.id, i.site_id, s.name, max(i.captured_at) AS last_captured, s.center_lat, s.center_lng, CONCAT(s.center_lat, ',', s.center_lng) AS LOCATION
from images i
inner join sites s on s.id = i.site_id
inner join customers c on c.id = s.customer_id
where i.hidden = 'false' and i.copied_from_id is null and i.status = 'complete' and c.id = '353'
group by c.id, i.site_id, s.name, s.center_lat, s.center_lng
order by last_captured DESC
Here is an example of the output:
As it stands now, it returns a count of 4 sites (I am rendering the results in Google Data Studio, displaying the count of records returned), which works fine for another scenario. However, since these sites are within 1/2 mile of each other, they are technically 1 site, not 4. I am trying to determine how to come up with a count of 1 site instead of 4 in this scenario. If there were another entry whose lat/lon was more than 1/2 mile away, I would expect a count of 2 sites. I hope this all makes sense.
I'm currently trying to research where to start, so if there are any references, or a push in the right direction, that would be awesome. Thanks very much.
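One possible direction (a sketch only, and it assumes the database is PostgreSQL with the PostGIS extension available): PostGIS's ST_ClusterDBSCAN window function groups points that lie within a given distance of each other, so counting distinct cluster IDs yields the number of "merged" sites. Here the lat/lng points are transformed to a meter-based projection so the radius can be written as roughly half a mile (~805 m); note that Web Mercator meters stretch with latitude, so a locally appropriate projection would be more accurate.
-- Sketch: count sites after lumping together any that fall within ~0.5 mile.
SELECT COUNT(DISTINCT cluster_id) AS unique_site_count
FROM (
    SELECT ST_ClusterDBSCAN(
               ST_Transform(
                   ST_SetSRID(ST_MakePoint(s.center_lng, s.center_lat), 4326),
                   3857),   -- project lat/lng to (approximate) meters
               805,         -- eps: cluster radius in projection units
               1            -- minpoints: every site joins some cluster
           ) OVER () AS cluster_id
    FROM sites s
    WHERE s.customer_id = 353
) AS clustered;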
Context: I'm fairly new to coding as a whole and am learning SQL. This is one of my practice/training sessions.
I'm trying to create a dimension table called "Employee Info" using the AdventureWorks2019 public database. Below is my attempt at a query to fetch all the data needed for this table.
SELECT
    e.BusinessEntityID AS EmployeeID,
    EEKey = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
    p.FirstName,
    p.MiddleName,
    p.LastName,
    p.PersonType,
    e.Gender,
    e.JobTitle,
    ep.Rate,
    ep.PayFrequency,
    e.BirthDate,
    e.HireDate,
    ep.RateChangeDate AS PayFrom,
    e.MaritalStatus
FROM HumanResources.Employee AS e
FULL JOIN Person.Person AS p ON p.BusinessEntityID = e.BusinessEntityID
FULL JOIN Person.BusinessEntityAddress AS bea ON bea.BusinessEntityID = e.BusinessEntityID
FULL JOIN HumanResources.EmployeePayHistory AS ep ON ep.BusinessEntityID = e.BusinessEntityID
WHERE PersonType = 'SP'
   OR PersonType = 'EM'
ORDER BY EmployeeID;
Query result
Each employee (EE for short) has a unique [EmployeeID]. The [EEKey] is simply used to mark the ordinal number of each record.
EEs are paid different rates, shown in the [Rate] column. An EE will have multiple records if his/her pay rate was changed at some point.
There is currently a [PayFrom] column indicating the first date each record's pay rate applies.
Current requirements: Create a [PayTo] column to the right of [PayFrom] that returns the last date each EE is paid the corresponding pay rate. There are 2 scenarios:
If the EE being checked has multiple records, meaning his/her pay rate was adjusted at some point, [PayTo] should return the [PayFrom] date of the next record minus 1 day.
If the EE being checked does not have any additional records indicating pay rate changes, [PayTo] should return a fixed date that was specified (say, 31/12/2070).
Example:
[EmployeeID] no. 4, Rob Walters, has 3 consecutive records in Lines 4, 5, and 6. In Line 4, the [PayTo] column is expected to return the [PayFrom] date of Line 5 minus 1 day (2010-05-30). The same rule applies to Line 5, returning (2011-12-14).
As for Line 6, since there is no further record to fetch data from, it should return the specified date (2070-12-31), following the same rule as every single-record EE.
As I have mentioned, I am a fresher and completely new to coding, so my interpretation and method might be off. If you can kindly point out what I'm doing wrong or show me what I should do to solve this issue, it would be much appreciated.
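For reference, a minimal sketch of one way to derive [PayTo] in SQL Server (table names follow AdventureWorks2019; this covers only the pay-history part, not the full dimension query) uses LEAD() to peek at the next pay record for the same employee:
SELECT
    e.BusinessEntityID AS EmployeeID,
    ep.Rate,
    ep.RateChangeDate AS PayFrom,
    -- LEAD() returns the next RateChangeDate for the same employee;
    -- when there is no next record, fall back to the fixed 2070-12-31.
    ISNULL(
        DATEADD(DAY, -1,
            LEAD(ep.RateChangeDate) OVER (
                PARTITION BY e.BusinessEntityID
                ORDER BY ep.RateChangeDate)),
        '2070-12-31') AS PayTo
FROM HumanResources.Employee AS e
JOIN HumanResources.EmployeePayHistory AS ep
    ON ep.BusinessEntityID = e.BusinessEntityID
ORDER BY EmployeeID, PayFrom;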
We're currently serving 250 Billion ad impressions per day across our 6 data centers. Out of these, we are serving about 180 Billion ad impressions in the US alone.
Each ad impression can have hundreds of attributes (dimensions), e.g. Country, City, Browser, OS, custom parameters from the web page, ad-size, ad-id, site-id, etc.
Currently, we don't have a data warehouse, and ad-hoc OLAP support is pretty much non-existent in our organization. This severely limits our ability to run ad-hoc queries and get a quick grasp of the data.
We want to answer the following 2 queries to begin with:
Q1) Find the total count of ad impressions served from "beginDate" to "endDate" where Dimension1 = d1 and Dimension2 = d2 ... and Dimensionk = d_k.
Q2) Find the total count of unique users who saw our ads from "beginDate" to "endDate" where Dimension1 = d1 and/or Dimension2 = d2 ... and/or Dimensionk = d_k.
As I said, each impression can have hundreds of dimensions (listed above), and the cardinality of each dimension could range from a few hundred (say, for Country) to billions (e.g., for User-ID).
We want approximate answers, the least infrastructure cost, and query response times under 5 minutes. I am thinking about using Druid and Apache DataSketches (Theta Sketch, to be precise) for answering Q2, using the following data model:
Date       | Dimension Name | Dimension Value | Unique-User-ID (Theta sketch)
2021/09/12 | "Country"      | "US"            | 37873-3udif-83748-2973483
2021/09/12 | "Browser"      | "Chrome"        | 37873-3aeuf-83748-2973483
<other records>
So after roll-up, I would end up with 1 theta sketch per dimension value per day (assuming day-level granularity), and I can do unions and intersections on these sketches to answer Q2.
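For illustration, here is a rough sketch of what a Q2 query could look like in Druid SQL with this model (it assumes the druid-datasketches extension is loaded; the datasource name impressions_rollup and the column names are hypothetical):
-- Approximate unique users where Country = US AND Browser = Chrome,
-- computed by intersecting the two per-dimension theta sketches.
SELECT THETA_SKETCH_ESTIMATE(
         THETA_SKETCH_INTERSECT(
           DS_THETA(user_sketch) FILTER (WHERE dim_name = 'Country' AND dim_value = 'US'),
           DS_THETA(user_sketch) FILTER (WHERE dim_name = 'Browser' AND dim_value = 'Chrome')
         )) AS approx_unique_users
FROM impressions_rollup
WHERE __time >= TIMESTAMP '2021-09-12 00:00:00'
  AND __time <  TIMESTAMP '2021-09-13 00:00:00';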
I am planning to set k (nominal entries) to 10^5. (Please comment on what a suitable k would be for this use case and how much storage it would be expected to require.)
I've also read about theta sketch set-operation accuracy here.
I would like to know if there is a better approach to solving Q2 (with or without Druid).
I would also like to know how I can solve Q1.
If I replace Unique-User-ID with "Impression-ID", can I use the same data model to answer Q1? I believe the accuracy of counting total impressions would then be far worse than for Q2, because each ad impression is assigned a unique ID and we are currently serving 250 Billion per day.
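(As a point of comparison for Q1, and only a hypothetical sketch rather than this data model: if the datasource instead kept dimensions as regular columns with an additive "count" metric produced by ingestion-time roll-up, Q1 becomes an exact, cheap SUM with no sketches involved.)
-- Hypothetical wide datasource with a roll-up "count" metric.
SELECT SUM("count") AS total_impressions
FROM impressions_wide
WHERE __time >= TIMESTAMP '2021-09-01 00:00:00'
  AND __time <  TIMESTAMP '2021-09-13 00:00:00'
  AND country = 'US'
  AND browser = 'Chrome';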
Please share your thoughts about solving Q1 and Q2.
Regards
kartik
Firstly, this is part of my college homework.
Now that's out of the way: I need to write a query that will get the number of free apps in a DB as a percentage of the total number of apps, sorted by what category the app is in.
I can get the number of free apps and also the number of total apps by category. Now I need to find the percentage; this is where it goes a bit pear-shaped.
Here is what I have so far:
-- find total number of apps per category
select @totalAppsPerCategory := count(*), category_primary
from apps
group by category_primary;
-- find number of free apps per category
select @freeAppsPerCategory := count(*), category_primary
from apps
where price = 0.0
group by category_primary;
-- find percentage of free apps per category
set @totals = @freeAppsPerCategory / @totalAppsPerCategory * 100;
select @totals, category_primary
from apps
group by category_primary;
It then lists the categories, but the percentage listed for each category is the exact same value.
I had initially thought to use an array, but from what I have read, MySQL does not seem to support arrays.
I'm a bit lost as to how to proceed from here.
Finally figured it out. Since I had been saving the previous results in variables, the calculation was not happening on a row-by-row basis, which is why all the percentages were identical. The calculation needed to be part of the query itself.
Here's what I came up with:
SELECT
    category_primary,
    CONCAT(FORMAT(COUNT(CASE
                WHEN price = 0 THEN 1
            END) / COUNT(*) * 100,
            1),
        '%') AS FreeAppSharePercent
FROM
    apps
GROUP BY category_primary
-- order by the numeric ratio rather than the formatted string,
-- so that e.g. 9.5% does not sort above 10.2%
ORDER BY COUNT(CASE WHEN price = 0 THEN 1 END) / COUNT(*) DESC;
Then the query result is:
I am learning SQL and want to make the following:
I need to get the highest value from 2 different tables. The output displays all rows; however, I need a single row with the maximum value.
P.S. LIMIT 1 does not work in SQL Server Management Studio.
SELECT Players.PlayersID, MAX(Participants.EventsID) AS Maximum
FROM Players
LEFT JOIN Participants ON Players.PlayersID = Participants.PlayersID
GROUP BY Players.PlayersID
I understand this may be a dumb question for the pros, but Google did not help. Thanks for your understanding and your help.
Try using TOP:
SELECT TOP 1
pl.PlayersID,
MAX(pa.EventsID) AS Maximum
FROM Players pl
LEFT JOIN Participants pa
ON pl.PlayersID = pa.PlayersID
GROUP BY
pl.PlayersID
ORDER BY
MAX(pa.EventsID) DESC;
If you want to cater for the possibility of two players being tied for the same maximum, then use TOP 1 WITH TIES instead of just TOP 1.
This one is kind of weird, and my lack of experience has me asking it.
I have an update to do, and because of how badly the tables I'm grabbing this data from are put together, it is a bit difficult.
The scenario:
There can be anywhere from 1 to x visits per patient, and I want to grab the last visit. Here is where the problem is: one patient can have two or three IDs. These IDs are linked to ONE ID to help migrate them over to a new database under one ID.
Now I've tried TOP 1 in a CROSS APPLY and joining on a max ID. I can get some of it to work, but not all. So I used a row-number ranking to get how many times a particular person visited. However, I have to run a pass of the update for each visit to get to the last one, as each pass overwrites the previous.
Is there a way to use row_number() over (partition by B.Uid order by B.Uid) PID?
So I would run a pass where PID = 1, then another pass where PID = 2, and so on.
I am thinking there must be a way to have it do one pass, either by setting up some WHILE loop or by checking for the highest PID and then updating.
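One single-pass pattern worth trying (a sketch only: it assumes SQL Server, and Visits, Patients, and the column names are hypothetical stand-ins for the real schema) is to rank each linked patient's visits newest-first in a CTE and update from only the rank-1 rows:
-- Rank visits per linked patient ID, newest first, then update the
-- target table from only the most recent visit (rn = 1) in one pass.
WITH RankedVisits AS (
    SELECT
        v.LinkedPatientID,
        v.VisitDate,
        ROW_NUMBER() OVER (
            PARTITION BY v.LinkedPatientID
            ORDER BY v.VisitDate DESC) AS rn
    FROM Visits AS v
)
UPDATE p
SET p.LastVisitDate = rv.VisitDate
FROM Patients AS p
JOIN RankedVisits AS rv
    ON rv.LinkedPatientID = p.PatientID
WHERE rv.rn = 1;
Note that ordering the ROW_NUMBER() by the visit date (rather than by the partition key itself) is what makes rn = 1 pick the last visit.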