Get average using partitionBy - PySpark

I have the following data.
Here one id can have multiple sub_segment_1 values (id = 2036013106), or one id can have a single sub_segment_1 and multiple sub_segment_2 values (id = 2035867480). Irrespective of how many instances of sub_segment_1 or sub_segment_2 an id has, I want to derive the average per id so that the explosion in sales is addressed.
What I would like to have is something like sales / count(id), which will give the average.
I tried the following approach:
df_u = df_u.withColumn("sales", F.avg("sales").over(Window.partitionBy("id")))
But this gives me wrong results.
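A minimal sketch of the sales / count(id) idea, using SQLite through Python's standard library as a stand-in for Spark (this assumes an SQLite build with window function support). The sub_segment_1 labels and sales figures are made up for illustration, and it assumes the sales value is repeated on every row of an id. Under that assumption AVG over the partition simply returns the repeated value again, which may be why the attempt above looks wrong, while dividing by the row count spreads the value across the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df_u (id INTEGER, sub_segment_1 TEXT, sales REAL)")

# Hypothetical rows: the same sales value is repeated for every row of an id.
conn.executemany(
    "INSERT INTO df_u VALUES (?, ?, ?)",
    [
        (2036013106, "A", 90.0),
        (2036013106, "B", 90.0),
        (2036013106, "C", 90.0),
        (2035867480, "A", 50.0),
        (2035867480, "A", 50.0),
    ],
)

rows = conn.execute(
    """
    SELECT id,
           sub_segment_1,
           sales / COUNT(*) OVER (PARTITION BY id) AS sales_share,  -- sales / count(id)
           AVG(sales) OVER (PARTITION BY id)       AS sales_avg     -- the attempted approach
    FROM df_u
    ORDER BY id, sub_segment_1
    """
).fetchall()

for row in rows:
    print(row)
```

In PySpark the same expression could presumably be written as F.col("sales") / F.count("id").over(Window.partitionBy("id")), with F and Window imported as in the question.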

Related

Calculating correlation coefficient using PostgreSQL with data from same table?

I would like to expand on an existing question.
I have 3 tables:
Rivers' names and ids - let's name it river
Rivers' ids and Hydrols' ids - let's name it id
Hydrols' ids, volume and date - let's name it data
I know how to select data from these 3 tables:
SELECT DISTINCT river.name, AVG(data.volume)
FROM data
INNER JOIN id
ON id.id_hydrol = data.id_hydrol
INNER JOIN river
ON river.id_river = id.id_river
AND river.name = 'NAME_1'
GROUP BY river.name
ORDER BY AVG(data.volume) DESC
What do I have to write instead of the ? mark?
How can I compare the volume of 2 rivers with different names where the date has to be the same?
How do I put 2 different queries into the corr() function?
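One possible way to line up two rivers by date is to build a per-river, per-date volume and join it to itself on the date. The sketch below does that in SQLite via Python; SQLite has no corr() aggregate, so the coefficient is computed in Python instead. The sample rows, the second river name 'NAME_2' and the column name date are assumptions. In PostgreSQL the paired columns from the same self-join could presumably be passed straight to corr(a.volume, b.volume).

```python
import sqlite3
from math import sqrt

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE river (id_river INTEGER, name TEXT);
    CREATE TABLE id (id_river INTEGER, id_hydrol INTEGER);
    CREATE TABLE data (id_hydrol INTEGER, volume REAL, date TEXT);

    -- Hypothetical sample data
    INSERT INTO river VALUES (1, 'NAME_1'), (2, 'NAME_2');
    INSERT INTO id VALUES (1, 10), (2, 20);
    INSERT INTO data VALUES
        (10, 1.0, '2020-01-01'), (10, 2.0, '2020-01-02'), (10, 3.0, '2020-01-03'),
        (20, 2.0, '2020-01-01'), (20, 4.0, '2020-01-02'), (20, 6.0, '2020-01-03'),
        (20, 9.0, '2020-01-04');
    """
)

# One volume per river and date, then pair the two rivers on equal dates.
pairs = conn.execute(
    """
    WITH v AS (
        SELECT river.name AS name, data.date AS date, AVG(data.volume) AS volume
        FROM data
        INNER JOIN id ON id.id_hydrol = data.id_hydrol
        INNER JOIN river ON river.id_river = id.id_river
        GROUP BY river.name, data.date
    )
    SELECT a.volume, b.volume
    FROM v AS a
    INNER JOIN v AS b ON a.date = b.date
    WHERE a.name = 'NAME_1' AND b.name = 'NAME_2'
    ORDER BY a.date
    """
).fetchall()


def pearson(xy):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(xy)
    mx = sum(x for x, _ in xy) / n
    my = sum(y for _, y in xy) / n
    cov = sum((x - mx) * (y - my) for x, y in xy)
    sx = sum((x - mx) ** 2 for x, _ in xy)
    sy = sum((y - my) ** 2 for _, y in xy)
    return cov / sqrt(sx * sy)


r = pearson(pairs)
print(pairs, r)
```

Dates that exist for only one river (2020-01-04 in the sample) drop out of the join, so only same-date volumes are compared.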

Calculating sum with no direct join column

I have a table (ShipJourneys) where I need to calculate the Total Fuel Consumed which is a float value. See the image below.
This value is obtained by summing all the individual consumers of fuel for a given vessel over the timeframe specified. This data is contained in a second table.
The area boxed in red shows that there were 5 fuel consumers (specified by the FK_RmaDataSumsystemConfigID), that 3 of the consumers had burned 0 units of fuel, and that 2 had each burned 29.
To calculate the totalFuelConsumed for that range of time frames, for a given vessel (stipulated by the FK_RmaID), the following query could be used:
Select sum(FuelCalc)
from FuelCalc
where Timestamp >= '2019-07-24 00:00:00'
and Timestamp <= '2019-07-24 00:02:00'
and FK_RmaID = 660
Using something like the query below does not work and results in bogus values:
UPDATE ShipJourneys
SET TotalFuelConsumed =
(Select sum(FuelCalc) from FuelCalc as f
WHERE f.timestamp >= StartTimeUTC
and f.timestamp <= EndTimeUTC
and f.FK_RmaID = FK_RmaID)
Any suggestions on how I could join them?
You could try something like this:
UPDATE myTable -- put the correct table name here
SET TotalFuelConsumed = (
Select sum(FuelUsed) from FuelTimeTbl as fuelTbl
WHERE fuelTbl.timestamp >= '2019-10-21 22:13:55.000'
and fuelTbl.timestamp <= '2019-11-27 17:10:58.000'
and fuelTbl.FK_RmaID = myTable.RmaID -- put the correct attribute name
)
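A sketch of the correlated-subquery form with every outer column qualified, run against SQLite from Python. One likely cause of the bogus values in the question is that the unqualified FK_RmaID inside the subquery resolves to the subquery's own table, so the vessel filter compares a column with itself. The table layouts and rows below are assumptions based on the names in the question, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ShipJourneys (
        FK_RmaID INTEGER, StartTimeUTC TEXT, EndTimeUTC TEXT, TotalFuelConsumed REAL
    );
    CREATE TABLE FuelCalc (FK_RmaID INTEGER, Timestamp TEXT, FuelCalc REAL);

    -- Hypothetical sample data
    INSERT INTO ShipJourneys VALUES
        (660, '2019-07-24 00:00:00', '2019-07-24 00:02:00', NULL);
    INSERT INTO FuelCalc VALUES
        (660, '2019-07-24 00:01:00', 29),
        (660, '2019-07-24 00:01:00', 29),
        (660, '2019-07-24 00:01:00', 0),
        (661, '2019-07-24 00:01:00', 100),  -- other vessel, must be ignored
        (660, '2019-07-25 00:00:00', 5);    -- outside the journey, must be ignored
    """
)

# Qualify the outer columns so they refer to the row being updated.
conn.execute(
    """
    UPDATE ShipJourneys
    SET TotalFuelConsumed = (
        SELECT SUM(f.FuelCalc)
        FROM FuelCalc AS f
        WHERE f.Timestamp >= ShipJourneys.StartTimeUTC
          AND f.Timestamp <= ShipJourneys.EndTimeUTC
          AND f.FK_RmaID = ShipJourneys.FK_RmaID
    )
    """
)

total = conn.execute("SELECT TotalFuelConsumed FROM ShipJourneys").fetchone()[0]
print(total)
```

Unlike the hard-coded timestamps above, this version takes the time range from each journey row, so a single statement updates every journey.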

PySpark - SQL query returns wrong data

I'm working on implementation of collaborative filtering (using Movielens 20m dataset).
The ratings data looks like this:
| userId | movieId | rating | timestamp |
Ratings are between 1 and 5 (if a user didn't rate a movie, it does not appear in the table).
The following is part of the code:
ratings = spark.read.option("inferSchema","true").option("header","true").csv("ratings.csv")
ratings.createOrReplaceTempView("ratings")
i_ratings = spark.sql("select distinct userId, case when movieId == 1 then rating else 0 end as rating from ratings order by userId asc ")
The SQL query is meant to return, for movieId == 1, all the ratings it got from users, and 0 for users that didn't rate it.
I'm getting the following dataframe:
As you can see, if a user didn't rate the movie I'm getting rating = 0 as desired; however, for users that did rate the movie I'm getting two rows, one with the actual rating and another with rating = 0.
I checked the ratings.csv dataset and there are no duplicates, that is, every user rated every movie at most once.
Not sure what I'm missing here.
Try the following SQL:
i_ratings = spark.sql("""
select
distinct userId,
case when rating is not null then rating else 0 end as rating
from ratings
where movieId = 1
order by userId asc
""")
Not sure if this is what you want, but your screenshot only shows two columns. I'm guessing you want the following: for a given movieId, if a user has not provided a rating then put 0, else take the rating. If this is the case, you should filter on movieId using a WHERE clause.
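To illustrate where the extra rows may come from: the CASE expression yields 0 for every row of a user that is not movie 1, and DISTINCT then keeps both the real rating and the 0. The sketch below reproduces that with a few made-up ratings in SQLite (Spark SQL should behave the same way here) and shows one possible alternative that keeps one row per user, including users who never rated movie 1, by aggregating with MAX. The column alias movie1_rating is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL, timestamp INTEGER);

    -- Hypothetical sample data: users 1 and 3 rated movie 1, user 2 did not.
    INSERT INTO ratings VALUES
        (1, 1, 4.0, 0), (1, 2, 3.5, 0),
        (2, 2, 5.0, 0),
        (3, 1, 2.0, 0), (3, 3, 1.0, 0);
    """
)

# The query from the question: users 1 and 3 each get a real rating and a 0.
original = conn.execute(
    """
    SELECT DISTINCT userId,
           CASE WHEN movieId = 1 THEN rating ELSE 0 END AS movie1_rating
    FROM ratings
    ORDER BY userId, movie1_rating
    """
).fetchall()

# One row per user: the rating for movie 1 if present, otherwise 0.
per_user = conn.execute(
    """
    SELECT userId,
           MAX(CASE WHEN movieId = 1 THEN rating ELSE 0 END) AS movie1_rating
    FROM ratings
    GROUP BY userId
    ORDER BY userId
    """
).fetchall()

print(original)
print(per_user)
```

MAX works here only because ratings are between 1 and 5, so a real rating always beats the placeholder 0.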

Filter portal for most recently created record by group

I have a portal on my "Clients" table. The related table contains the results of surveys that are updated over time. For each combination of client and category (a field in the related table), I only want the portal to display the most recently collected row.
Here is a link to a trivial example that illustrates the issue I'm trying to address. I have two tables in this example (Related on ClientID):
Clients
Table 1 Get Summary Method
The Table 1 Get Summary Method table looks like this:
Where:
MaxDate is a summary field = Maximum of Date
MaxDateGroup is a calculated field = GetSummary ( MaxDate ; ClientIDCategory )
ShowInPortal = If ( Date = MaxDateGroup ; 1 ; 0 )
The table is sorted on ClientIDCategory
Issue 1 that I'm stumped on:
ShowInPortal should equal 1 in row 3 (PKTable01 = 5), row 4 (PKTable01 = 6), and row 6 (PKTable01 = 4) in the table above. I'm not sure why FM is interpreting 1Red and 1Blue as the same category, or perhaps I'm just misunderstanding what the GetSummary function does.
The Clients table looks like this:
Where:
The portal records are sorted on ClientIDCategory
Issue 2 that I'm stumped on:
I only want rows with a ShowInPortal value equal to 1 to appear in the portal. I tried creating a portal filter with the following formula: Table 1 Get Summary Method::ShowInPortal = 1. However, using that filter removes all rows from the portal.
Any help is greatly appreciated.
One solution is to use ExecuteSQL to grab the max date. This removes the need for summary functions and sorts, and works as expected. I suggest returning it as a number to avoid any issues with date formats.
GetAsTimestamp (
ExecuteSQL (
"SELECT DISTINCT COALESCE(MaxDate,'')
FROM Survey
WHERE ClientIDCategory = ? "
; "" ; "";ClientIDCategory )
)
Also, you need to change the ShowInPortal field to an unstored calc field with:
If ( GetAsNumber(Date) = MaxDateGroupSQL ; 1 ; 0 )
Then filter the portal on this field.
I can send you the sample file if you want.
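For reference, the intended ShowInPortal logic (flag the row with the latest Date within each ClientIDCategory) can be sketched outside FileMaker. The rows below are hypothetical and only borrow the field names from the question; this is plain Python, not FileMaker calculation syntax.

```python
from datetime import date

# Hypothetical survey rows using the field names from the question.
rows = [
    {"PKTable01": 1, "ClientIDCategory": "1Blue", "Date": date(2020, 1, 1)},
    {"PKTable01": 2, "ClientIDCategory": "1Red", "Date": date(2020, 1, 5)},
    {"PKTable01": 3, "ClientIDCategory": "1Blue", "Date": date(2020, 2, 1)},
    {"PKTable01": 4, "ClientIDCategory": "1Red", "Date": date(2020, 3, 1)},
]

# Latest date per ClientIDCategory (the role MaxDateGroup plays).
max_by_group = {}
for r in rows:
    key = r["ClientIDCategory"]
    if key not in max_by_group or r["Date"] > max_by_group[key]:
        max_by_group[key] = r["Date"]

# Flag only the most recent row of each group.
for r in rows:
    r["ShowInPortal"] = 1 if r["Date"] == max_by_group[r["ClientIDCategory"]] else 0

shown = [r["PKTable01"] for r in rows if r["ShowInPortal"] == 1]
print(shown)
```

Because the grouping key is the full ClientIDCategory string, 1Red and 1Blue stay separate groups.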

how to get grouped query data from the resultset?

I want to get grouped data from a table in sqlite. For example, the table is like below:
Name Group Price
a 1 10
b 1 9
c 1 10
d 2 11
e 2 10
f 3 12
g 3 10
h 1 11
Now I want to get all the data grouped by the Group column, each group in one array, namely
array1 = {{a,1,10},{b,1,9},{c,1,10},{h,1,11}};
array2 = {{d,2,11},{e,2,10}};
array3 = {{f,3,12},{g,3,10}}.
Because I need these 2-dimensional arrays to populate the grouped table view. The SQL statement may be NSString *sql = @"SELECT * FROM table GROUP BY Group"; but I wonder how to get the data from the result set. I am using FMDB.
Any help is appreciated.
Get the data from sql with a normal SELECT statement, ordered by group and name:
SELECT * FROM table ORDER BY group, name;
Then in code, build your arrays, switching to fill the next array when the group id changes.
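The order-then-split approach can be sketched as follows, in Python with SQLite rather than Objective-C with FMDB, so treat it as an illustration of the loop rather than FMDB code. The table name items is an assumption, and Group is quoted because it is a reserved word in SQL.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE items (Name TEXT, "Group" INTEGER, Price INTEGER)')
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("a", 1, 10), ("b", 1, 9), ("c", 1, 10), ("d", 2, 11),
        ("e", 2, 10), ("f", 3, 12), ("g", 3, 10), ("h", 1, 11),
    ],
)

# A plain SELECT ordered by group, so rows of one group are adjacent.
rows = conn.execute(
    'SELECT Name, "Group", Price FROM items ORDER BY "Group", Name'
).fetchall()

# Start a new array each time the group value changes.
arrays = [list(group_rows) for _, group_rows in groupby(rows, key=lambda r: r[1])]

for array in arrays:
    print(array)
```

With FMDB the same loop would run over the result set, appending each row to the current array and starting a new one when the group value differs from the previous row.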
Let me clarify GROUP BY. You can group data, but then an aggregate function is required on the other columns.
E.g. a table holds a list of students with a gender column (Male and Female), so we can group the table by gender, which returns two sets. Then we need to perform some operation on the result column,
e.g. maximum marks or average marks of each group.
In your case you want to group, but what kind of operation do you require on the price column?
E.g. the query below will return each group with its max price:
SELECT Group, MAX(Price) AS MaxPriceByEachGroup FROM table GROUP BY Group
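A small runnable version of that aggregate query, again using SQLite from Python with the sample data from the question. The table name items is an assumption and Group is quoted because it is a reserved word. Note that it returns one row per group, not the full rows the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE items (Name TEXT, "Group" INTEGER, Price INTEGER)')
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("a", 1, 10), ("b", 1, 9), ("c", 1, 10), ("d", 2, 11),
        ("e", 2, 10), ("f", 3, 12), ("g", 3, 10), ("h", 1, 11),
    ],
)

# GROUP BY collapses each group to a single row holding the aggregate.
max_prices = conn.execute(
    'SELECT "Group", MAX(Price) AS MaxPriceByEachGroup '
    'FROM items GROUP BY "Group" ORDER BY "Group"'
).fetchall()

print(max_prices)
```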