How can I disaggregate rows of a data frame in Spark? - scala

I have a Spark dataframe containing data similar to the following:
+----+---------------------+-------+----------+-------------+
| ID | Timestamp | Value | Interval | Consumption |
+----+---------------------+-------+----------+-------------+
| 1 | 2012-05-02 12:30:00 | 550 | 1 | 5 |
| 1 | 2012-05-02 12:45:00 | 551 | 1 | 1 |
| 1 | 2012-05-02 13:00:00 | 554 | 1 | 3 |
| 1 | 2012-05-02 14:00:00 | 578 | 4 | 24 |
| 1 | 2012-05-02 14:15:00 | 578 | 1 | 0 |
| 1 | 2012-05-02 14:30:00 | 584 | 1 | 6 |
+----+---------------------+-------+----------+-------------+
I'm looking to turn this into something like the following:
+----+---------------------+-------+----------+-------------+------------+
| ID | Timestamp | Value | Interval | Consumption | Estimation |
+----+---------------------+-------+----------+-------------+------------+
| 1 | 2012-05-02 12:30:00 | 550 | 1 | 5 | ? |
| 1 | 2012-05-02 12:45:00 | 551 | 1 | 1 | ? |
| 1 | 2012-05-02 13:00:00 | 554 | 1 | 3 | ? |
| 1 | 2012-05-02 13:15:00 | 560 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:30:00 | 566 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:45:00 | 572 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:00:00 | 578 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:15:00 | 578 | 1 | 0 | ? |
| 1 | 2012-05-02 14:30:00 | 584 | 1 | 6 | ? |
+----+---------------------+-------+----------+-------------+------------+
More specifically I want to turn this:
+----+---------------------+-------+----------+-------------+
| ID | Timestamp | Value | Interval | Consumption |
+----+---------------------+-------+----------+-------------+
| 1 | 2012-05-02 14:00:00 | 578 | 4 | 24 |
+----+---------------------+-------+----------+-------------+
Into this:
+----+---------------------+-------+----------+-------------+------------+
| ID | Timestamp | Value | Interval | Consumption | Estimation |
+----+---------------------+-------+----------+-------------+------------+
| 1 | 2012-05-02 13:15:00 | 560 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:30:00 | 566 | 1 | 6 | 4 |
| 1 | 2012-05-02 13:45:00 | 572 | 1 | 6 | 4 |
| 1 | 2012-05-02 14:00:00 | 578 | 1 | 6 | 4 |
+----+---------------------+-------+----------+-------------+------------+
I want to take the rows with more than 1 interval out of the original table, interpolate Values for the missing intervals, and reinsert the newly created rows into the initial table in place of the original rows. I have ideas about how to achieve this (in PostgreSQL, for example, I would simply use the generate_series() function to create the required Timestamps and calculate new Values), but implementing them in Spark/Scala is proving troublesome.
Assuming I've created a new dataframe containing only rows with Interval > 1, how could I replicate those rows 'n' times with 'n' being the value of Interval? I believe that would give me enough to get going using a Counter function partitioned by some row reference I can create.
If there's a way to replicate the behavior of generate_series() that I've missed, even better.
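To make the target transformation concrete, here is a minimal sketch of the replicate-and-interpolate step in plain Python (not Spark). It assumes that one Interval unit is 15 minutes, as in the sample data, and that Consumption is spread evenly over the missing intervals; the function name disaggregate and the dict layout are made up for illustration. In Spark the inner loop would have to become some flatMap-style expansion, which is not shown here.

```python
from datetime import datetime, timedelta

# Assumption: one Interval unit corresponds to 15 minutes, as in the sample data.
STEP = timedelta(minutes=15)


def disaggregate(rows):
    """Expand every row with Interval n > 1 into n single-interval rows."""
    out = []
    for row in rows:
        n = row["Interval"]
        if n <= 1:
            # Measured rows pass through unchanged, with no estimation marker.
            out.append({**row, "Estimation": None})
            continue
        per_step = row["Consumption"] / n  # spread consumption evenly
        for k in range(1, n + 1):
            back = n - k  # number of steps before the original timestamp
            out.append({
                "ID": row["ID"],
                "Timestamp": row["Timestamp"] - back * STEP,
                "Value": row["Value"] - back * per_step,
                "Interval": 1,
                "Consumption": per_step,
                "Estimation": n,  # original interval, marks the row as estimated
            })
    return out


rows = [
    {"ID": 1, "Timestamp": datetime(2012, 5, 2, 13, 0), "Value": 554,
     "Interval": 1, "Consumption": 3},
    {"ID": 1, "Timestamp": datetime(2012, 5, 2, 14, 0), "Value": 578,
     "Interval": 4, "Consumption": 24},
]
expanded = disaggregate(rows)
for r in expanded:
    print(r["Timestamp"], r["Value"], r["Consumption"], r["Estimation"])
```

For the 14:00 row this yields four rows at 13:15, 13:30, 13:45 and 14:00 with Values 560, 566, 572 and 578, which matches the desired output above.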

Related

postgreSQL question: get data by last date of each record and subtract from last date number of days

Please help me write a query. I'm at a dead end.
There are 2 tables:
“Trains”:
+----+---------+
| id | numbers |
+----+---------+
| 1 | 101 |
| 2 | 102 |
| 3 | 103 |
| 4 | 104 |
| 5 | 105 |
+----+---------+
“Passages”:
+----+--------------+-------+---------------------+
| id | train_number | speed | date_time |
+----+--------------+-------+---------------------+
| 1 | 101 | 26 | 2021-11-10 16:26:30 |
| 2 | 101 | 28 | 2021-11-12 16:26:30 |
| 3 | 102 | 24 | 2021-11-14 16:26:30 |
| 4 | 103 | 27 | 2021-11-15 16:26:30 |
| 5 | 101 | 29 | 2021-11-16 16:26:30 |
+----+--------------+-------+---------------------+
The goal is to go through the train numbers in the Trains table and, for each one that exists in the Passages table, get the latest passage date (date_time) and the number of passages in the period from “the last date for that train” minus N days. As I understand it, that is date_time - interval 'N days'. The result should look something like:
+----+--------+---------------------+----------------+
| id | train | last_passage | count_passages |
+----+--------+---------------------+----------------+
| 1 | 101 | 2021-11-10 16:26:30 | 2 |
| 2 | 102 | 2021-11-14 16:26:30 | 1 |
| 3 | 103 | 2021-11-15 16:26:30 | 1 |
| 4 | 104 | null | 0 |
| 5 | 105 | null | 0 |
+----+--------+---------------------+----------------+
PS: "count_passages" is counted over, for example, the last passage date minus 4 days.
I tried using WHERE IN, but I can't come up with a correct query.
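The logic being asked for can be sketched in plain Python as follows. This is only an illustration of the intended result, not a PostgreSQL query: the function name last_passage_summary is made up, N is taken as 4 from the PS, and treating both ends of the window as inclusive is an assumption.

```python
from datetime import datetime, timedelta

trains = [101, 102, 103, 104, 105]
passages = [  # (train_number, date_time) taken from the Passages table
    (101, datetime(2021, 11, 10, 16, 26, 30)),
    (101, datetime(2021, 11, 12, 16, 26, 30)),
    (102, datetime(2021, 11, 14, 16, 26, 30)),
    (103, datetime(2021, 11, 15, 16, 26, 30)),
    (101, datetime(2021, 11, 16, 16, 26, 30)),
]


def last_passage_summary(trains, passages, n_days):
    """Return (train, last_passage, count_passages) for every train."""
    window = timedelta(days=n_days)
    summary = []
    for train in trains:
        times = [ts for number, ts in passages if number == train]
        if not times:
            # Trains without passages still appear, like a LEFT JOIN would give.
            summary.append((train, None, 0))
            continue
        last = max(times)
        # Assumption: the window [last - N days, last] is inclusive on both ends.
        count = sum(1 for ts in times if last - window <= ts <= last)
        summary.append((train, last, count))
    return summary


result = last_passage_summary(trains, passages, n_days=4)
for train, last, count in result:
    print(train, last, count)
```

With N = 4, train 101 gets its latest passage (2021-11-16 16:26:30) and a count of 2, because the passage on 2021-11-10 falls outside the window.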

How to group MVA field for faceted in sphinx

I have an index in which some of the data is duplicated: all fields are the same except for latitude, longitude and id (the id field is not a real ID, it is just generated with row_number() OVER () AS id).
Here is an example:
mysql> select id,vacancy_id,prof_area_ids,latitude,longitude from jobVacancy;
+------+------------+---------------+----------+-----------+
| id | vacancy_id | prof_area_ids | latitude | longitude |
+------+------------+---------------+----------+-----------+
| 1 | 917 | 11,199,202 | 0.973178 | 0.743566 |
| 2 | 916 | 17,283,288 | 0.973178 | 0.743566 |
| 3 | 915 | 17,288 | 0.973178 | 0.743566 |
| 4 | 914 | 30,482 | 0.973178 | 0.743566 |
| 5 | 919 | 15,243 | 0.825153 | 0.692837 |
| 6 | 919 | 15,243 | 0.825162 | 0.692828 |
| 7 | 918 | 8,154 | 0.825153 | 0.692837 |
| 8 | 918 | 8,154 | 0.825162 | 0.692828 |
| 9 | 920 | 17,283,288 | 0.958914 | 1.282161 |
| 10 | 920 | 17,283,288 | 0.958915 | 1.282215 |
| 11 | 924 | 12,208 | 0.97333 | 0.658246 |
| 12 | 924 | 12,208 | 0.973336 | 0.658237 |
| 13 | 923 | 21,365 | 0.97333 | 0.658246 |
| 14 | 923 | 21,365 | 0.973336 | 0.658237 |
| 15 | 922 | 20,359 | 0.97333 | 0.658246 |
| 16 | 922 | 20,359 | 0.973336 | 0.658237 |
| 17 | 921 | 19,346 | 0.97333 | 0.658246 |
| 18 | 921 | 19,346 | 0.973336 | 0.658237 |
| 19 | 926 | 12,17,208,292 | 0.88396 | 2.389868 |
| 20 | 925 | 12,208 | 0.88396 | 2.389868 |
+------+------------+---------------+----------+-----------+
20 rows in set (0.00 sec)
Now I want to group the data by vacancy_id:
mysql> select id,vacancy_id,prof_area_ids,latitude,longitude from jobVacancy group by vacancy_id;
+------+------------+---------------+----------+-----------+
| id | vacancy_id | prof_area_ids | latitude | longitude |
+------+------------+---------------+----------+-----------+
| 1 | 917 | 11,199,202 | 0.973178 | 0.743566 |
| 2 | 916 | 17,283,288 | 0.973178 | 0.743566 |
| 3 | 915 | 17,288 | 0.973178 | 0.743566 |
| 4 | 914 | 30,482 | 0.973178 | 0.743566 |
| 5 | 919 | 15,243 | 0.825153 | 0.692837 |
| 7 | 918 | 8,154 | 0.825153 | 0.692837 |
| 9 | 920 | 17,283,288 | 0.958914 | 1.282161 |
| 11 | 924 | 12,208 | 0.97333 | 0.658246 |
| 13 | 923 | 21,365 | 0.97333 | 0.658246 |
| 15 | 922 | 20,359 | 0.97333 | 0.658246 |
| 17 | 921 | 19,346 | 0.97333 | 0.658246 |
| 19 | 926 | 12,17,208,292 | 0.88396 | 2.389868 |
| 20 | 925 | 12,208 | 0.88396 | 2.389868 |
| 21 | 961 | 4,105 | 0.959217 | 1.280721 |
| 23 | 960 | 8,155 | 0.959217 | 1.280721 |
| 25 | 959 | 12,208 | 0.959217 | 1.280721 |
| 27 | 928 | 1,60 | 0.963734 | 1.070297 |
| 29 | 927 | 32,513 | 0.963734 | 1.070297 |
| 31 | 929 | 6,140 | 0.786553 | 0.678649 |
| 33 | 932 | 1,40,46 | 0.824627 | 0.694182 |
+------+------------+---------------+----------+-----------+
20 rows in set (0.00 sec)
The result is great! But the problem begins when I want to get all the grouped data together with a facet:
mysql> select id,vacancy_id,prof_area_ids,latitude,longitude from jobVacancy where prof_area_ids=199 group by vacancy_id facet prof_area_ids;
+------+------------+-----------------+----------+-----------+
| id | vacancy_id | prof_area_ids | latitude | longitude |
+------+------------+-----------------+----------+-----------+
| 1 | 917 | 11,199,202 | 0.973178 | 0.743566 |
| 191 | 1004 | 11,196,199 | 0.925335 | 2.768874 |
| 313 | 1072 | 1,11,60,197,199 | 0.963968 | 1.070624 |
| 318 | 1136 | 11,196,199 | 0.96071 | 1.448998 |
| 374 | 1097 | 11,199 | 0.785255 | 0.678504 |
+------+------------+-----------------+----------+-----------+
5 rows in set (0.00 sec)
+---------------+----------+
| prof_area_ids | count(*) |
+---------------+----------+
| 202 | 1 |
| 199 | 12 |
| 11 | 12 |
| 196 | 5 |
| 197 | 3 |
| 60 | 3 |
| 1 | 3 |
+---------------+----------+
7 rows in set (0.02 sec)
The faceted result is incorrect: the count for prof_area_ids=199 should be 5, not 12. So how can I group the field for faceting?
Additionally:
I found this post, http://sphinxsearch.com/blog/2013/06/21/faceted-search-with-sphinx/, but it only says "If you have a MVA facet, you need to use the GROUPBY() function which returns the actual value on which the grouping was made." without giving an example.
mysql> select id,vacancy_id,prof_area_ids,latitude,longitude,GROUPBY() as selected,COUNT(*) from jobVacancy where prof_area_ids=199 group by vacancy_id facet prof_area_ids;
+------+------------+-----------------+----------+-----------+----------+----------+
| id | vacancy_id | prof_area_ids | latitude | longitude | selected | count(*) |
+------+------------+-----------------+----------+-----------+----------+----------+
| 1 | 917 | 11,199,202 | 0.973178 | 0.743566 | 917 | 1 |
| 191 | 1004 | 11,196,199 | 0.925335 | 2.768874 | 1004 | 2 |
| 313 | 1072 | 1,11,60,197,199 | 0.963968 | 1.070624 | 1072 | 3 |
| 318 | 1136 | 11,196,199 | 0.96071 | 1.448998 | 1136 | 3 |
| 374 | 1097 | 11,199 | 0.785255 | 0.678504 | 1097 | 3 |
+------+------------+-----------------+----------+-----------+----------+----------+
5 rows in set (0.00 sec)
+---------------+----------+
| prof_area_ids | count(*) |
+---------------+----------+
| 202 | 1 |
| 199 | 12 |
| 11 | 12 |
| 196 | 5 |
| 197 | 3 |
| 60 | 3 |
| 1 | 3 |
+---------------+----------+
7 rows in set (0.02 sec)
The faceted result is still wrong.
It seems you effectively want COUNT(DISTINCT vacancy_id) on the FACET rather than the default COUNT(*), but alas it turns out that
... FACET prof_area_ids,COUNT(DISTINCT vacancy_id) AS vacancies BY prof_area_ids
doesn't work. The bit before BY only supports attributes, not custom functions.
So you will just have to write it out the long way, with full queries:
select id,vacancy_id,prof_area_ids,latitude,longitude from jobVacancy
where prof_area_ids=199 group by vacancy_id;
SELECT GROUPBY() AS prof_area_id, COUNT(DISTINCT vacancy_id) FROM jobVacancy
WHERE prof_area_ids=199 GROUP BY prof_area_id;
Same results, just slightly more verbose; i.e. rather than using the FACET shorthand, write it
out in full, as multiple separate queries.
"The faceted result is incorrect: the count for prof_area_ids=199 should be 5, not 12. So how can I group the field for faceting?"
It looks like you misunderstand how FACET works. It seems to me that you think it takes the main query's result as its base, but it actually just does another grouping over the matching documents. E.g. here:
mysql> select g, t from idx_mva where t = 11 group by g facet t;
+------+----------+
| g | t |
+------+----------+
| 1 | 11,12 |
| 2 | 11,13,15 |
| 3 | 9,11 |
| 5 | 11,12,15 |
+------+----------+
4 rows in set (0.00 sec)
+------+----------+
| t | count(*) |
+------+----------+
| 12 | 2 |
| 11 | 6 |
| 15 | 4 |
| 13 | 1 |
| 9 | 1 |
| 3 | 1 |
+------+----------+
6 rows in set (0.00 sec)
For t=11 you can see that, as in your case, it is found 4 times in the first query's result (once per group), but its count in the FACET result is 6. This is because it actually occurs 6 times in the index:
mysql> select * from idx_mva where t = 11;
+------+------+----------+
| id | g | t |
+------+------+----------+
| 2 | 1 | 11,12 |
| 3 | 1 | 11,15 |
| 3 | 2 | 11,13,15 |
| 6 | 3 | 9,11 |
| 8 | 5 | 11,12,15 |
| 11 | 2 | 3,11,15 |
+------+------+----------+
6 rows in set, 1 warning (0.00 sec)
It shows up only 4 times in the first case because t's value is returned only once for each of the groups. You can use group_concat() to see more values from the same group:
mysql> select g, group_concat(to_string(t)) from idx_mva where t = 11 group by g facet t;
+------+----------------------------+
| g | group_concat(to_string(t)) |
+------+----------------------------+
| 1 | 11,12,11,15 |
| 2 | 11,13,15,3,11,15 |
| 3 | 9,11 |
| 5 | 11,12,15 |
+------+----------------------------+
4 rows in set (0.00 sec)
+------+----------+
| t | count(*) |
+------+----------+
| 12 | 2 |
| 11 | 6 |
| 15 | 4 |
| 13 | 1 |
| 9 | 1 |
| 3 | 1 |
+------+----------+
6 rows in set (0.00 sec)
If you want to learn more about faceting here's an interactive course about that - https://play.manticoresearch.com/faceting/
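The difference between the two counts can be reproduced outside Sphinx. The sketch below is plain Python over the six idx_mva rows listed above; it only imitates the counting, it does not call any Sphinx or Manticore API. It compares what FACET t reports (one count per matching row and MVA value) with a count of distinct groups per value, which is what COUNT(DISTINCT ...) gives.

```python
from collections import Counter, defaultdict

# (id, g, t) rows as listed in the idx_mva example; t is the multi-value attribute.
rows = [
    (2, 1, [11, 12]),
    (3, 1, [11, 15]),
    (3, 2, [11, 13, 15]),
    (6, 3, [9, 11]),
    (8, 5, [11, 12, 15]),
    (11, 2, [3, 11, 15]),
]

matching = [r for r in rows if 11 in r[2]]  # WHERE t = 11

# What FACET t reports: every matching row counts once for each of its values.
facet_counts = Counter(value for _, _, t in matching for value in t)

# What the asker expected: the number of distinct groups g per value.
groups_per_value = defaultdict(set)
for _, g, t in matching:
    for value in t:
        groups_per_value[value].add(g)
distinct_counts = {value: len(gs) for value, gs in groups_per_value.items()}

print(facet_counts[11], distinct_counts[11])  # prints "6 4"
```

Value 11 is counted 6 times by the facet but belongs to only 4 distinct groups, which is the same kind of gap as 12 versus 5 in the question.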

Return unique grouped rows with the latest timestamp [duplicate]

This question already has answers here:
Select first row in each GROUP BY group?
(20 answers)
Closed 3 years ago.
At the moment I'm struggling with a problem that looks very easy.
Table content:
Primary keys: Timestamp, COL_A, COL_B, COL_C, COL_D
+------------------+-------+-------+-------+-------+--------+--------+
| Timestamp | COL_A | COL_B | COL_C | COL_D | Data_A | Data_B |
+------------------+-------+-------+-------+-------+--------+--------+
| 31.07.2019 15:12 | - | - | - | - | 1 | 2 |
| 31.07.2019 15:32 | 1 | 1 | 100 | 1 | 5000 | 20 |
| 10.08.2019 09:33 | - | - | - | - | 1000 | 7 |
| 31.07.2019 15:38 | 1 | 1 | 100 | 1 | 33 | 5 |
| 06.08.2019 08:53 | - | - | - | - | 0 | 7 |
| 06.08.2019 09:08 | - | - | - | - | 0 | 7 |
| 06.08.2019 16:06 | 3 | 3 | 3 | 3 | 0 | 23 |
| 07.08.2019 10:43 | - | - | - | - | 0 | 42 |
| 07.08.2019 13:10 | - | - | - | - | 0 | 24 |
| 08.08.2019 07:19 | 11 | 111 | 111 | 12 | 0 | 2 |
| 08.08.2019 10:54 | 2334 | 65464 | 565 | 76 | 1000 | 19 |
| 08.08.2019 11:15 | 232 | 343 | 343 | 43 | 0 | 2 |
| 08.08.2019 11:30 | 2323 | rtttt | 3434 | 34 | 0 | 2 |
| 10.08.2019 14:47 | - | - | - | - | 123 | 23 |
+------------------+-------+-------+-------+-------+--------+--------+
Needed query output:
+------------------+-------+-------+-------+-------+--------+--------+
| Timestamp | COL_A | COL_B | COL_C | COL_D | Data_A | Data_B |
+------------------+-------+-------+-------+-------+--------+--------+
| 31.07.2019 15:38 | 1 | 1 | 100 | 1 | 33 | 5 |
| 06.08.2019 16:06 | 3 | 3 | 3 | 3 | 0 | 23 |
| 08.08.2019 07:19 | 11 | 111 | 111 | 12 | 0 | 2 |
| 08.08.2019 10:54 | 2334 | 65464 | 565 | 76 | 1000 | 19 |
| 08.08.2019 11:15 | 232 | 343 | 343 | 43 | 0 | 2 |
| 08.08.2019 11:30 | 2323 | rtttt | 3434 | 34 | 0 | 2 |
| 10.08.2019 14:47 | - | - | - | - | 123 | 23 |
+------------------+-------+-------+-------+-------+--------+--------+
As you can see, I'm trying to get single rows for my primary keys, using the latest timestamp, which is also a primary key.
Currently, I tried a query like:
SELECT Timestamp, COL_A, COL_B, COL_C, COL_D, Data_A, Data_B FROM XY op
WHERE Timestamp = (
SELECT MAX(Timestamp) FROM XY AS tsRow
WHERE op.COL_A = tsRow.COL_A
AND op.COL_B = tsRow.COL_B
AND op.COL_C = tsRow.COL_C
AND op.COL_D = tsRow.COL_D
);
which gives me a result that looks fine at first glance.
Is there a better or safer way to get my preferred result?
You can use the DISTINCT ON clause, which gives you the first record of an ordered group. Here your group is your (A, B, C, D). This is ordered by the Timestamp column, in descending order, to get the most recent record to be the first.
SELECT DISTINCT ON ("COL_A", "COL_B", "COL_C", "COL_D")
*
FROM
mytable
ORDER BY "COL_A", "COL_B", "COL_C", "COL_D", "Timestamp" DESC
If you want to get your expected order, you need a second ORDER BY after this operation:
SELECT
*
FROM (
SELECT DISTINCT ON ("COL_A", "COL_B", "COL_C", "COL_D")
*
FROM
mytable
ORDER BY "COL_A", "COL_B", "COL_C", "COL_D", "Timestamp" DESC
) s
ORDER BY "Timestamp"
Note: If the Timestamp column is part of the PK, are you sure you really need the four other columns in the PK as well? It seems that the TS column is already unique.
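What DISTINCT ON does here (keep the first row of each ordered group) can be imitated in a few lines of plain Python. This is only a sketch of the logic on a subset of the question's rows; the helper parse and the tuple layout are made up for illustration.

```python
from datetime import datetime

rows = [  # (Timestamp, COL_A, COL_B, COL_C, COL_D, Data_A, Data_B), subset of the table
    ("31.07.2019 15:12", "-", "-", "-", "-", 1, 2),
    ("31.07.2019 15:32", "1", "1", "100", "1", 5000, 20),
    ("10.08.2019 09:33", "-", "-", "-", "-", 1000, 7),
    ("31.07.2019 15:38", "1", "1", "100", "1", 33, 5),
    ("06.08.2019 16:06", "3", "3", "3", "3", 0, 23),
    ("10.08.2019 14:47", "-", "-", "-", "-", 123, 23),
]


def parse(ts):
    """Parse the dd.mm.yyyy hh:mm strings used in the question."""
    return datetime.strptime(ts, "%d.%m.%Y %H:%M")


# Keep, per (COL_A, COL_B, COL_C, COL_D), the row with the latest timestamp.
latest = {}
for row in rows:
    key = row[1:5]
    if key not in latest or parse(row[0]) > parse(latest[key][0]):
        latest[key] = row

# Second ordering step: sort the surviving rows by timestamp.
result = sorted(latest.values(), key=lambda r: parse(r[0]))
for r in result:
    print(r)
```

The three surviving rows are the 15:38 row for (1, 1, 100, 1), the 16:06 row for (3, 3, 3, 3) and the 10.08.2019 14:47 row for the "-" group, in that order.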

Check previous and next record

I'm trying to compare costs from different periods, but I don't know how to compare a single record with the records before and after it. What I need is a Yes or No in my dataset indicating whether the costs of a record are the same as those of both the record before and the record after.
My dataset looks like this:
+--------+-----------+----------+------------+-------+-----------+
| Client | Provision | CAK Year | CAK Period | Costs | Serial Nr |
+--------+-----------+----------+------------+-------+-----------+
| 1 | 210 | 2017 | 13 | 150 | 1 |
| 1 | 210 | 2018 | 1 | 200 | 2 |
| 1 | 210 | 2018 | 2 | 170 | 3 |
| 1 | 210 | 2018 | 3 | 150 | 4 |
| 1 | 210 | 2018 | 4 | 150 | 5 |
| 1 | 210 | 2018 | 5 | 150 | 6 |
| 1 | 689 | 2018 | 1 | 345 | 1 |
| 1 | 689 | 2018 | 2 | 345 | 1 |
| 1 | 689 | 2018 | 3 | 345 | 1 |
+--------+-----------+----------+------------+-------+-----------+
What I've tried so far:
CASE
WHEN Provision = Provision
AND Costs = LEAD(Costs, 1, 0) OVER(ORDER BY "CAK Year", "CAK Period")
AND Costs = LAG(Costs, 1, 0) OVER(ORDER BY "CAK Year", "CAK Period")
THEN 'Yes'
ELSE 'No'
END
My expected result:
+--------+-----------+----------+------------+-------+-----------+--------+
| Client | Provision | CAK Year | CAK Period | Costs | Serial Nr | Result |
+--------+-----------+----------+------------+-------+-----------+--------+
| 1 | 210 | 2017 | 13 | 150 | 1 | No |
| 1 | 210 | 2018 | 1 | 200 | 2 | No |
| 1 | 210 | 2018 | 2 | 170 | 3 | No |
| 1 | 210 | 2018 | 3 | 150 | 4 | No |
| 1 | 210 | 2018 | 4 | 150 | 5 | Yes |
| 1 | 210 | 2018 | 5 | 150 | 6 | No |
| 1 | 689 | 2018 | 1 | 345 | 1 | No |
| 1 | 689 | 2018 | 2 | 345 | 1 | Yes |
| 1 | 689 | 2018 | 3 | 345 | 1 | No |
+--------+-----------+----------+------------+-------+-----------+--------+
Can you help me further? I don't get the expected result.
You need to add PARTITION BY Provision; otherwise your LAG and LEAD ordering will run across all Provision values:
-- Table variable holding the sample data from the question
declare @d table(Client int,Provision int,CAKYear int, CAKPeriod int, Costs int, SerialNr int);
insert into @d values
(1,210,2017,13,150,1)
,(1,210,2018,1,200,2)
,(1,210,2018,2,170,3)
,(1,210,2018,3,150,4)
,(1,210,2018,4,150,5)
,(1,210,2018,5,150,6)
,(1,689,2018,1,345,1)
,(1,689,2018,2,345,1)
,(1,689,2018,3,345,1);
-- Provision = Provision is always true for non-null values;
-- the PARTITION BY is what keeps the provisions apart.
select *
,case when Provision = Provision
and Costs = lead(Costs, 1, 0) over(partition by Provision order by CAKYear, CAKPeriod)
and Costs = lag(Costs, 1, 0) over(partition by Provision order by CAKYear, CAKPeriod)
then 'Yes'
else 'No'
end as Result
from @d
order by Provision
,CAKYear
,CAKPeriod;
Output
+--------+-----------+---------+-----------+-------+----------+--------+
| Client | Provision | CAKYear | CAKPeriod | Costs | SerialNr | Result |
+--------+-----------+---------+-----------+-------+----------+--------+
| 1 | 210 | 2017 | 13 | 150 | 1 | No |
| 1 | 210 | 2018 | 1 | 200 | 2 | No |
| 1 | 210 | 2018 | 2 | 170 | 3 | No |
| 1 | 210 | 2018 | 3 | 150 | 4 | No |
| 1 | 210 | 2018 | 4 | 150 | 5 | Yes |
| 1 | 210 | 2018 | 5 | 150 | 6 | No |
| 1 | 689 | 2018 | 1 | 345 | 1 | No |
| 1 | 689 | 2018 | 2 | 345 | 1 | Yes |
| 1 | 689 | 2018 | 3 | 345 | 1 | No |
+--------+-----------+---------+-----------+-------+----------+--------+
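To see why the partition matters, the same LAG/LEAD comparison can be written out by hand. The sketch below is plain Python over the rows from the answer; the function name flag_stable_costs is made up, and the default of 0 for a missing neighbour mirrors the third argument of lead() and lag() in the query.

```python
from itertools import groupby

rows = [  # (Client, Provision, CAKYear, CAKPeriod, Costs, SerialNr)
    (1, 210, 2017, 13, 150, 1),
    (1, 210, 2018, 1, 200, 2),
    (1, 210, 2018, 2, 170, 3),
    (1, 210, 2018, 3, 150, 4),
    (1, 210, 2018, 4, 150, 5),
    (1, 210, 2018, 5, 150, 6),
    (1, 689, 2018, 1, 345, 1),
    (1, 689, 2018, 2, 345, 1),
    (1, 689, 2018, 3, 345, 1),
]


def flag_stable_costs(rows, default=0):
    """Append 'Yes' when Costs equals both neighbours within the same Provision."""
    ordered = sorted(rows, key=lambda r: (r[1], r[2], r[3]))
    flagged = []
    # groupby on Provision plays the role of PARTITION BY Provision.
    for _, group in groupby(ordered, key=lambda r: r[1]):
        part = list(group)
        for i, row in enumerate(part):
            prev_cost = part[i - 1][4] if i > 0 else default              # LAG
            next_cost = part[i + 1][4] if i + 1 < len(part) else default  # LEAD
            result = "Yes" if row[4] == prev_cost == next_cost else "No"
            flagged.append(row + (result,))
    return flagged


flagged = flag_stable_costs(rows)
for row in flagged:
    print(row)
```

Because neighbours are only looked up inside each Provision group, the first and last row of every group always get 'No', which is what the expected result shows.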

Division with more than one result from postgresql query

I am using postgresql and I have a table called accidents (state, total accidents) and another table called population. I want to get the names of the 3 states with the highest total accidents and then, for each of those 3 states, the population divided by the total accidents. How can I write such a query in postgresql?
Explanation:
Population Table
rank| state | population
---+-----------------------------+------------
1 | Uttar Pradesh | 199581477
2 | Maharashtra | 112372972
3 | Bihar | 103804630
4 | West Bengal | 91347736
5 | Madhya Pradesh | 72597565
6 | Tamil Nadu | 72138958
7 | Rajasthan | 68621012
8 | Karnataka | 61130704
9 | Gujarat | 60383628
10 | Andhra Pradesh | 49665533
11 | Odisha | 41947358
12 | Telangana | 35193978
13 | Kerala | 33387677
14 | Jharkhand | 32966238
15 | Assam | 31169272
16 | Punjab | 27704236
17 | Haryana | 25753081
18 | Chhattisgarh | 25540196
19 | Jammu and Kashmir | 12548926
20 | Uttarakhand | 10116752
21 | Himachal Pradesh | 6856509
22 | Tripura | 3671032
23 | Meghalaya | 2964007
24 | Manipur | 2721756
25 | Nagaland | 1980602
26 | Goa | 1457723
27 | Arunachal Pradesh | 1382611
28 | Mizoram | 1091014
29 | Sikkim | 607688
30 | Delhi | 16753235
31 | Puducherry | 1244464
32 | Chandigarh | 1054686
33 | Andaman and Nicobar Islands | 379944
34 | Dadra and Nagar Haveli | 342853
35 | Daman and Diu | 242911
36 | Lakshadweep | 64429
accident table:
state | eqto8 | eqto10 | mrthn10 | ntknwn | total
-----------------------------+-------+--------+---------+--------+--------
Andhra Pradesh | 6425 | 8657 | 8144 | 19298 | 42524
Arunachal Pradesh | 88 | 76 | 87 | 0 | 251
Assam | 0 | 0 | 0 | 6535 | 6535
Bihar | 2660 | 3938 | 3722 | 0 | 10320
Chhattisgarh | 2888 | 7052 | 3571 | 0 | 13511
Goa | 616 | 1512 | 2184 | 0 | 4312
Gujarat | 4864 | 7864 | 7132 | 8089 | 27949
Haryana | 3365 | 2588 | 4112 | 0 | 10065
Himachal Pradesh | 276 | 626 | 977 | 1020 | 2899
Jammu and Kashmir | 1557 | 618 | 434 | 4100 | 6709
Jharkhand | 1128 | 701 | 1037 | 2845 | 5711
Karnataka | 11167 | 14715 | 18566 | 0 | 44448
Kerala | 5580 | 13271 | 17323 | 0 | 36174
Madhya Pradesh | 15630 | 16226 | 19354 | 0 | 51210
Maharashtra | 4117 | 5350 | 10538 | 46311 | 66316
Manipur | 147 | 453 | 171 | 0 | 771
Meghalaya | 210 | 154 | 119 | 0 | 483
Mizoram | 27 | 58 | 25 | 0 | 110
Nagaland | 11 | 13 | 18 | 0 | 42
Odisha | 1881 | 3120 | 4284 | 0 | 9285
Punjab | 1378 | 2231 | 1825 | 907 | 6341
Rajasthan | 5534 | 5895 | 5475 | 6065 | 22969
Sikkim | 6 | 144 | 8 | 0 | 158
Tamil Nadu | 8424 | 18826 | 29871 | 10636 | 67757
Tripura | 290 | 376 | 222 | 0 | 888
Uttarakhand | 318 | 305 | 456 | 393 | 1472
Uttar Pradesh | 8520 | 10457 | 10995 | 0 | 29972
West Bengal | 1494 | 1311 | 974 | 8511 | 12290
Andaman and Nicobar Islands | 18 | 104 | 114 | 0 | 236
Chandigarh | 112 | 39 | 210 | 58 | 419
Dadra and Nagar Haveli | 40 | 20 | 17 | 8 | 85
Daman and Diu | 11 | 6 | 8 | 25 | 50
Delhi | 0 | 0 | 0 | 6937 | 6937
Lakshadweep | 0 | 0 | 0 | 3 | 3
Puducherry | 154 | 668 | 359 | 0 | 1181
All India | 88936 | 127374 | 152332 | 121741 | 490383
So that result should be
21.57
81.03
107.44
Explanation:
The states with the most accidents are Tamil Nadu, Maharashtra and Madhya Pradesh.
Tamil Nadu population/accidents = 21213/983 = 21.57 (Assumed values)
Maharashtra population/accidents = 10000/123 = 81.03
Madhya Pradesh population/accidents = 34812/324 = 107.44
My query is:
SELECT POPULATION/
(SELECT TOTAL
FROM accidents
WHERE STATE NOT LIKE 'All %'
ORDER BY TOTAL DESC
LIMIT 3)
aVG FROM population
WHERE STATE IN
(SELECT STATE
FROM accidents
WHERE STATE NOT LIKE 'All %'
ORDER BY TOTAL DESC
LIMIT 3);
This throws ERROR: more than one row returned by a subquery used as an expression.
How can I modify the query to get the required result, or is there any other way to get the result in postgresql?
This ought to do it.
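SELECT a.state, population.population::numeric / a.total AS population_per_accident
FROM (SELECT total, state FROM accidents WHERE state <> 'All India' ORDER BY total DESC LIMIT 3) AS a
INNER JOIN population ON a.state = population.state;
-- ::numeric avoids integer division in case both columns are integer types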
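The shape of that query (pick the top 3 first, then join, then divide) can be checked with a small plain-Python sketch. It uses only a handful of rows copied from the two tables above, and the variable names top3 and ratios are made up for illustration.

```python
# A few rows copied from the population and accident tables in the question.
population = {
    "Uttar Pradesh": 199581477,
    "Maharashtra": 112372972,
    "Madhya Pradesh": 72597565,
    "Tamil Nadu": 72138958,
    "Karnataka": 61130704,
}
accidents = {  # state -> total
    "Karnataka": 44448,
    "Madhya Pradesh": 51210,
    "Maharashtra": 66316,
    "Tamil Nadu": 67757,
    "Uttar Pradesh": 29972,
    "All India": 490383,
}

# Derived table "a": the three states with the highest totals, excluding the summary row.
top3 = sorted(
    (state for state in accidents if state != "All India"),
    key=accidents.get,
    reverse=True,
)[:3]

# Join to population on state and divide row by row.
ratios = [(state, population[state] / accidents[state]) for state in top3]
for state, ratio in ratios:
    print(state, round(ratio, 2))
```

Each of the three states gets its own ratio, so no single-value subquery is needed, which is what caused the original error.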