PySpark giving incorrect result on rank for big data

I have a data set with 10,000,000 rows and a column ROW_NUM that holds a 0 or 1. Sometimes the window gives the correct answer on the order by; other times it mixes them up and takes the 0 first.
[PK0, PK1, PK2, PK3, PK4, PK5, PK6, PK7, A, B, C, D, E, ROW_NUM]
window = Window.partitionBy(primary_keys).orderBy(desc(job_constants.ROW_NUM))
return (df_uniondata.withColumn(job_constants.RANK, rank().over(window))
        .where(col(job_constants.RANK) == 1)
        .coalesce(coalesce_count)
        .select(df_all_changes.columns))
Is window() idempotent?
How can I enforce the orderBy on rank?
Thanks for the help.
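One note on the question as posed: a window's orderBy is only deterministic when it is a total order, and ordering by ROW_NUM alone leaves ties among rows sharing the same value, so different runs can break those ties differently. Whether or not that is the cause here, the usual fix is extra tiebreaker keys in the ordering. A sketch in Spark SQL, with hypothetical table and column names, using row_number() in place of rank():

SELECT *
FROM (
  SELECT *,
         row_number() OVER (
           PARTITION BY pk0, pk1, pk2, pk3, pk4, pk5, pk6, pk7
           -- the extra keys below are assumed tiebreakers; they make the order total
           ORDER BY row_num DESC, a, b, c
         ) AS rnk
  FROM union_data
) t
WHERE rnk = 1;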

Related

groupBy Id and get multiple records for multiple columns in scala

I have a Spark DataFrame as below.
val df = Seq(
  ("a",1,1400), ("a",1,1250), ("a",2,1200), ("a",4,1250), ("a",4,1200), ("a",4,1100),
  ("b",2,2500), ("b",2,1250), ("b",2,500), ("b",4,250), ("b",4,200), ("b",4,100), ("b",4,100), ("b",5,800)
).toDF("id", "hierarchy", "amount")
I am working in Scala to make use of this DataFrame and am trying to get the result shown below.
val df = Seq(
  ("a",1,1400), ("a",4,1250), ("a",4,1200), ("a",4,1100),
  ("b",2,2500), ("b",2,1250), ("b",4,250), ("b",4,200), ("b",4,100), ("b",5,800)
).toDF("id", "hierarchy", "amount")
Rules: grouped by id, if min(hierarchy) == 1 then I take the row with the highest amount, and then I go on to analyze hierarchy >= 4 and take 3 of each of them in descending order of amount. On the other hand, if min(hierarchy) == 2 then I take the two rows with the highest amount, and then I go on to analyze hierarchy >= 4 and take 3 of each of them in descending order of amount. And so on for all the ids in the data.
Thanks for the suggestions.
You may use window functions to generate the criteria on which you will filter, e.g.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val results = df
  .withColumn("minh", min("hierarchy").over(Window.partitionBy("id")))
  .withColumn("rnk", rank().over(Window.partitionBy("id").orderBy(col("amount").desc())))
  .withColumn(
    "rn4",
    when(col("hierarchy") >= 4, row_number().over(
      Window.partitionBy(col("id"), when(col("hierarchy") >= 4, 1).otherwise(0))
            .orderBy(col("amount").desc())
    )).otherwise(5)
  )
  .filter("rnk <= minh or rn4 <= 3")
  .select("id", "hierarchy", "amount")
NB: a more verbose filter would be .filter("(rnk <= minh or rn4 <= 3) and (minh in (1,2))").
The temporary columns generated by the window functions to assist in the filtering criteria are:
minh: the minimum hierarchy for a group id, used to select the top minh rows from the group.
rnk: used to determine the rows with the highest amount in each group.
rn4: used to determine the rows with the highest amount in each group with hierarchy >= 4.
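For readers more comfortable in SQL, roughly the same criteria can be expressed as a Spark SQL query, assuming the DataFrame has first been registered as a temporary view, e.g. df.createOrReplaceTempView("changes"). A sketch, not the answerer's code:

SELECT id, hierarchy, amount
FROM (
  SELECT id, hierarchy, amount,
         min(hierarchy) OVER (PARTITION BY id) AS minh,
         rank() OVER (PARTITION BY id ORDER BY amount DESC) AS rnk,
         CASE WHEN hierarchy >= 4
              THEN row_number() OVER (
                     PARTITION BY id, CASE WHEN hierarchy >= 4 THEN 1 ELSE 0 END
                     ORDER BY amount DESC)
              ELSE 5
         END AS rn4
  FROM changes
) t
WHERE rnk <= minh OR rn4 <= 3;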

select only one record from a query which returns several rows

How do I retrieve only one row from a query which returns several?
Let's say I want only the 3rd row. This is the query, but I want only the 3rd result:
SELECT (journeys.id, j_starting_channel) AS JER
FROM journeys
WHERE j_starting_channel = 'channel_name'
ORDER BY journeys.id;
The following should get you there:
SELECT (journeys.id, j_starting_channel) AS JER
FROM journeys
WHERE j_starting_channel = 'channel_name'
ORDER BY journeys.id
LIMIT 1
OFFSET 2;
LIMIT n will return the first n results. OFFSET m skips the first m rows and returns everything thereafter.
LIMIT n OFFSET m thus returns rows m+1 to m+n.
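For example, with m = 2 and n = 3 the same ordered query would return rows 3 through 5:

SELECT *
FROM journeys
ORDER BY journeys.id
LIMIT 3    -- n = 3 rows...
OFFSET 2;  -- ...starting after the first m = 2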
See the PostgreSQL documentation for more details:
https://www.postgresql.org/docs/9.5/sql-select.html
If you just need to skip some rows, you can use OFFSET to skip them at the top and then use LIMIT to return just one row.
Like this:
SELECT (journeys.id, j_starting_channel) AS JER
FROM journeys
WHERE j_starting_channel = 'channel_name'
ORDER BY journeys.id
LIMIT 1 OFFSET 2;
Here is a step-by-step tutorial on those clauses:
https://www.postgresqltutorial.com/postgresql-limit/
And you can always refer to the documentation too
By using OFFSET and LIMIT you can get the needed portion of rows from the result set:
SELECT (journeys.id, j_starting_channel) AS JER
FROM journeys
WHERE j_starting_channel = 'channel_name'
ORDER BY journeys.id
OFFSET 2 LIMIT 1;

Limiting number of rows in contentResolver.query to get any group of rows

How can I pick any group of rows if I know the first and last ID?
Suppose the table has IDs from 1 to 10 and I want contentResolver.query to return rows 4 to 8. How can I do that?
I searched for that question, but all I found are solutions to return the first n consecutive rows:
Answer1 Answer2
Use the LIMIT and OFFSET clauses:
SELECT * FROM TABLE LIMIT 5 OFFSET 3;
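Since the question already knows the first and last IDs, a selection on the ID column is an alternative that does not depend on row position. A sketch with assumed table and column names; in contentResolver.query this condition would go in the selection argument rather than in raw SQL:

SELECT * FROM my_table
WHERE _id BETWEEN 4 AND 8   -- rows with IDs 4..8, regardless of position
ORDER BY _id;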

BigQuery - Partitioning data according to some hash criteria

I have a table in BigQuery. I have a certain string column which represents a unique id (uid). I want to filter only a sample of this table, by taking only a portion of the uids (let's say 1/100).
So my idea is to sample the data by doing something like this:
if(ABS(HASH(uid)) % 100 == 0) ...
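As an aside, HASH() belongs to BigQuery's legacy SQL; in standard SQL the closest widely used equivalent is FARM_FINGERPRINT. This rewrite is an assumption about the intended semantics, not the author's code:

SELECT *
FROM mytable
-- FARM_FINGERPRINT returns an INT64, so ABS + MOD gives buckets 0..99
WHERE MOD(ABS(FARM_FINGERPRINT(uid)), 100) = 0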
The problem is this will actually filter in 1/100 ratio only if the distribution of the hash values is uniform. So, in order to check that, I would like to generate the following table:
(n goes from 0 to 99)
0 <number of rows in which uid % 100 == 0>
1 <number of rows in which uid % 100 == 1>
2 <number of rows in which uid % 100 == 2>
3 <number of rows in which uid % 100 == 3>
.. etc.
If I see the numbers in each row are of the same magnitude, then my assumption is correct.
Any idea how to create such a query, or alternatively do the sampling another way?
Something like
SELECT ABS(HASH(uid)) % 100 AS cluster, COUNT(*) AS cnt
FROM yourtable
GROUP EACH BY cluster
If the UID is of different cases (upper, lower) and types, you can use some string manipulation within the hash, something like:
SELECT ABS(HASH(UPPER(STRING(uid)))) % 100 AS cluster, COUNT(*) AS cnt
FROM yourtable
GROUP EACH BY cluster
As an alternative to HASH(), you can try RAND() - it doesn't depend on the ids being uniformly distributed.
For example, this would give you 10 roughly equally sized partitions:
SELECT word, INTEGER(10*RAND()) part
FROM [publicdata:samples.shakespeare]
Verification:
SELECT part, COUNT(*) FROM (
SELECT word, INTEGER(10*RAND()) part
FROM [publicdata:samples.shakespeare]
)
GROUP BY part
ORDER BY part
Each group ends up with about 16465 elements.
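The queries above use legacy BigQuery syntax (INTEGER(), the [dataset] table form). The same verification translated to standard SQL might look like this (a sketch; the backticked path is the public dataset's standard-SQL name):

SELECT part, COUNT(*) AS cnt
FROM (
  SELECT word, CAST(FLOOR(10 * RAND()) AS INT64) AS part
  FROM `bigquery-public-data.samples.shakespeare`
)
GROUP BY part
ORDER BY part;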

Postgres bitmask group by

I have the following flags declared:
0 - None
1 - Read
2 - Write
4 - View
I want to write a query that will group on this bitmask and get the count of each flag used.
person mask
a 0
b 3
c 7
d 6
The result should be:
flag count
none 1
read 2
write 3
view 2
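(Working the masks out by hand: a = 0 sets no bits, b = 3 = 1|2 sets read and write, c = 7 = 1|2|4 sets all three, and d = 6 = 2|4 sets write and view; hence the counts 1, 2, 3 and 2.)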
Any tips would be appreciated.
For Craig:
SELECT lea.mask AS trackerStatusMask,
       count(*) AS count
FROM Live le
INNER JOIN (
    ... -- some guff
) lea ON le.xId = lea.xId
WHERE le.xId = p_xId
GROUP BY lea.mask;
SQL Fiddle
select count(mask = 0 or null) as "None",
       count(mask & 1 > 0 or null) as "Read",
       count(mask & 2 > 0 or null) as "Write",
       count(mask & 4 > 0 or null) as "View"
from t;
Each count(expr or null) counts only the rows where expr is true: false or null evaluates to null, and count() ignores nulls.
Simplest - pivoted result
Here's how I'd approach it:
SELECT count(nullif(mask <> 0, true)) AS "none",
       count(nullif(mask & 2, 0)) AS "write",
       count(nullif(mask & 1, 0)) AS "read",
       count(nullif(mask & 4, 0)) AS "view"
FROM my_table;
This doesn't do a GROUP BY as such; instead it scans the table and collects the data in a single pass, producing column-oriented results.
If you need it in row form you can pivot the result, either using the crosstab function from the tablefunc module or by hand.
If you really must GROUP BY, explode the bitmask
You cannot use GROUP BY for this in a simple way, because it expects rows to fall into exactly one group. Your rows appear in multiple groups. If you must use GROUP BY you will have to do so by generating an "exploded" bitmask where one input row gets copied to produce multiple output rows. This can be done with a LATERAL function invocation in 9.3, or with a SRF-in-SELECT in 9.2, or by simply doing a join on a VALUES clause:
SELECT
  CASE
    WHEN mask_bit = 1 THEN 'read'
    WHEN mask_bit = 2 THEN 'write'
    WHEN mask_bit = 4 THEN 'view'
    WHEN mask_bit IS NULL THEN 'none'
  END AS "flag",
  count(person) AS "count"
FROM t
LEFT OUTER JOIN (
  VALUES (4), (2), (1)
) mask_bits(mask_bit)
ON (mask & mask_bit = mask_bit)
GROUP BY mask_bit;
I don't think you'll have much luck making this as efficient as a single table scan, though.
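Run against the four sample rows, this variant should reproduce the expected table above: read 2, write 3, view 2 and none 1, with the mask = 0 row matching no mask_bit and landing in the NULL branch of the CASE.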