PySpark - SQL query returns wrong data

I'm working on an implementation of collaborative filtering (using the MovieLens 20M dataset).
The ratings data looks like this:
| userId | movieId | rating | timestamp |
Ratings are between 1 and 5 (if a user didn't rate a movie, it doesn't appear in the table).
The following is part of the code:
ratings = spark.read.option("inferSchema","true").option("header","true").csv("ratings.csv")
ratings.createOrReplaceTempView("ratings")
i_ratings = spark.sql("select distinct userId, case when movieId == 1 then rating else 0 end as rating from ratings order by userId asc ")
The SQL query is meant to return, for movieId == 1, the rating each user gave it, and 0 for users that didn't rate it.
I'm getting the following (see the dataframe screenshot):
As you can see, if a user didn't rate the movie I get rating = 0 as desired; however, for users that did rate the movie I get two rows, one with the actual rating and another with rating = 0.
I checked the ratings.csv dataset and there are no duplicates, that is, every user rated every movie at most once.
Not sure what I'm missing here.

Try the following SQL:
i_ratings = spark.sql("""
select
distinct userId,
case when rating is not null then rating else 0 end as rating
from ratings
where movieId = 1
order by userId asc
""")
Not sure if this is what you want, but your screenshot only shows two columns. I'm guessing you want the following: for a movieId, if a user has not provided a rating then put 0, else take the rating. If this is the case, you should filter movieId using a where clause.
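(For why the original query returns two rows per rater: for a user who rated movie 1 and other movies too, the case expression yields the actual rating on the movieId = 1 row and 0 on every other row, and distinct keeps both.) If the goal is one row for every user, including users who never rated movie 1, note that the where clause above drops the non-raters entirely; a conditional-aggregation sketch that keeps them (same ratings view, untested):
i_ratings = spark.sql("""
select
userId,
max(case when movieId = 1 then rating else 0 end) as rating
from ratings
group by userId
order by userId asc
""")
Grouping by userId collapses each user's many rows into one, which distinct alone cannot do because the case expression yields a different value on each row.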

Related

get average using partitionby

I have the following data:
Here one id can have multiple sub_segment_1 values (id = 2036013106), or one id can have one sub_segment_1 and multiple sub_segment_2 values (id = 2035867480). Irrespective of how many instances of sub_segment_1 or sub_segment_2 an id has, I want to derive the average per id so that the explosion in sales is addressed.
What I would like to have is something like this: sales / count(id), which will give the average.
I tried the following approach:
df_u = df_u.withColumn("sales", F.avg("sales").over(Window.partitionBy("id")))
but this gives me wrong results
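For reference, a literal sketch of the sales / count(id) idea (hypothetical column name sales_per_id; assumes df_u as described):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id")
df_u = df_u.withColumn("sales_per_id", F.col("sales") / F.count("id").over(w))
This divides each row's sales by the number of rows that id has, rather than averaging the already-duplicated sales values.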

DISTINCT ON still gives me an error that select item should be in GROUP BY

I have a table with a list of customer IDs and a list of dates as follows:
id | take_list_date | customer_id
1  | 2016-02-17     | X00001
2  | 2016-02-20     | X00002
3  | 2016-02-20     | X00003
I am trying to return a count of all the IDs in the table on a specific day in the following format:
label: 2016-02-20 value: 2
The following query produces the required results within the specified date range:
select
count(customer_id)::int as value,
take_list_date::varchar as label
FROM
customer_take_list
where
take_list_date >= '10-12-2017'
and
take_list_date <= '20-12-2017'
GROUP BY
take_list_date
ORDER BY
take_list_date
The problem is I have to include an ID field to make it compatible with Ember Data. When I include an ID field I need to add it to the Group By clause which produces incorrect results.
After looking at some suggestions on other SO questions I tried to resolve this using DISTINCT ON:
select distinct on (take_list_date)
take_list_date::varchar as label,
count(customer_id)::int as value
FROM
customer_take_list
where
take_list_date >= '10-12-2017'
and
take_list_date <= '20-12-2017'
order by
take_list_date
Bizarrely this still gives me the same Group By error. What have I done wrong?
I'm not an expert in the technologies involved, but I think you need to create an arbitrary ID rather than use one of the IDs in the table. An example is here: Add Postgres incremental ID. I think your final query should look something like this:
SELECT
COUNT(customer_id)::int as value,
take_list_date::varchar as label,
ROW_NUMBER() OVER (ORDER BY take_list_date) AS id
FROM
customer_take_list
where
take_list_date >= '10-12-2017'
and
take_list_date <= '20-12-2017'
GROUP BY
take_list_date
ORDER BY
take_list_date
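If it helps to sanity-check the shape Ember Data will receive, here is a quick sketch using psycopg2 (hypothetical connection settings; date literals kept as written in the question; note that window functions run after GROUP BY, which is why ROW_NUMBER() sees one row per date):
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()
cur.execute("""
    SELECT count(customer_id)::int AS value,
           take_list_date::varchar AS label,
           ROW_NUMBER() OVER (ORDER BY take_list_date) AS id
    FROM customer_take_list
    WHERE take_list_date >= '10-12-2017'
      AND take_list_date <= '20-12-2017'
    GROUP BY take_list_date
    ORDER BY take_list_date
""")
for value, label, row_id in cur.fetchall():
    print({"id": row_id, "label": label, "value": value})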

sub query in select clause with hive

I am unable to figure out a way to achieve the functionality below through a valid query in Hive. The intention is to get the top rated movies in a given year, based on a weighted average.
To be clearer, this is what I should be able to do in Hive in a single query:
var allMoviesRated = select count(movieid) from u_data where year(from_unixtime(unixtime)) = 1997;
select movieid, avg(rating), count(movieid), avg(rating)/allMoviesRated as weighted from
(select movieid, rating, year(from_unixtime(unixtime)) as year from u_data) u_data_new
where u_data_new.year = 1997
group by movieid order by weighted desc limit 10;
Sadly, I don't think there is a way to do this in a single query using a subquery to get the count of all movies rated.
You may write a script which executes two queries:
The first query fetches allMoviesRated, which is stored in a script variable.
The second query is your ranking query, to which this value is passed using hiveconf.
Thus your script (bash or Python) can look like this, sketched here in Python:
import subprocess

# first query: fetch the count into a variable; second: pass it via hiveconf
all_movies = subprocess.check_output(
    ["hive", "-S", "-e", "use db; select count(distinct movieid) from u_data;"]
).decode().strip()
subprocess.run(["hive", "-S", "-hiveconf", "NUM_MOVIES=" + all_movies, "-f", "ranking_query.hql"])
ranking_query.hql:
select movieid, avg(rating), count(movieid), avg(rating)/${hiveconf:NUM_MOVIES} as weighted
from (
select movieid, rating, year(from_unixtime(unixtime)) as year
from u_data) u_data_new
where u_data_new.year = 1997
group by movieid order by weighted desc limit 10;

how to get grouped query data from the resultset?

I want to get grouped data from a table in sqlite. For example, the table is like below:
Name Group Price
a 1 10
b 1 9
c 1 10
d 2 11
e 2 10
f 3 12
g 3 10
h 1 11
Now I want to get all the data grouped by the Group column, each group in one array, namely:
array1 = {{a,1,10},{b,1,9},{c,1,10},{h,1,11}};
array2 = {{d,2,11},{e,2,10}};
array3 = {{f,3,12},{g,3,10}}.
I need these 2-dimensional arrays to populate the grouped table view. The SQL statement might be NSString *sql = @"SELECT * FROM table GROUP BY Group"; but I wonder how to get the data from the result set. I am using FMDB.
Any help is appreciated.
Get the data from SQL with a normal SELECT statement, ordered by group and name (note that Group and table are SQLite keywords, so they need double quotes):
SELECT * FROM "table" ORDER BY "Group", Name;
Then in code, build your arrays, switching to fill the next array when the group id changes.
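With FMDB this is a while([rs next]) loop over the FMResultSet; the group-switching idea itself is shown below as a minimal Python sqlite3 sketch (hypothetical file name movies.db), since the pattern is the same:
import sqlite3

conn = sqlite3.connect("movies.db")  # hypothetical database file
groups = []            # one sub-array per distinct Group value
current_group = None
rows = conn.execute('SELECT Name, "Group", Price FROM "table" ORDER BY "Group", Name')
for name, group, price in rows:
    if group != current_group:    # group id changed: start a new array
        groups.append([])
        current_group = group
    groups[-1].append((name, group, price))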
To be clear about GROUP BY: you can group data, but you then need an aggregate function on the other columns.
E.g., a table has a list of students with a gender column, i.e. Male and Female groups, so we can group this table by Gender, which will return two sets. Then we need to perform some operation on the result columns, e.g. the maximum or average marks of each group.
In your case you want to group, but what kind of operation do you require on the Price column?
E.g., the query below will return each group with its max price.
SELECT "Group", MAX(Price) AS MaxPriceByEachGroup FROM "table" GROUP BY "Group"

PostgreSQL and pl/pgsql SYNTAX to update fields based on SELECT and FUNCTION (while loop, DISTINCT COUNT)

I have a large database for which I want to write some logic to update new fields.
The primary key is id for the table harvard_assignees
The LOGIC GOES LIKE THIS
Select all of the records based on id
For each record (WHILE), if (state is NOT NULL && country is NULL), update country_out = "US" ELSE update country_out=country
I see step 1 as a PostgreSQL query and step 2 as a function. Just trying to figure out the easiest way to implement natively with the exact syntax.
====
The second function is a little more interesting, requiring (I believe) DISTINCT:
Find all DISTINCT foreign_keys (a bivariate key of pat_type,patent)
Count Records that contain that value (e.g., n=3 records have fkey "D","388585")
Update those 3 records to identify percent as 1/n (e.g., UPDATE 3 records, set percent = 1/3)
For the first one:
UPDATE
harvard_assignees
SET
country_out = (CASE
WHEN (state is NOT NULL AND country is NULL) THEN 'US'
ELSE country
END);
At first it had condition "id = ..." but I removed that because I believe you actually want to update all records.
And for the second one:
UPDATE
example_table
SET
percent = (SELECT 1.0 / cnt
           FROM (SELECT count(*) AS cnt
                 FROM example_table AS x
                 WHERE x.fn_key_1 = example_table.fn_key_1
                   AND x.fn_key_2 = example_table.fn_key_2) AS tmp
           WHERE cnt > 0)
That one will be kind of slow though.
I'm thinking of a solution based on window functions; you may want to explore those too.
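A minimal sketch of that window-function idea (assumes example_table has an id primary key, mirroring the question's harvard_assignees; psycopg2 used only for concreteness):
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection settings
with conn, conn.cursor() as cur:
    # count rows per (fn_key_1, fn_key_2) once with a window function,
    # then join back on the primary key to set percent = 1/n
    cur.execute("""
        UPDATE example_table AS e
        SET percent = 1.0 / t.cnt
        FROM (SELECT id,
                     count(*) OVER (PARTITION BY fn_key_1, fn_key_2) AS cnt
              FROM example_table) AS t
        WHERE e.id = t.id
    """)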