Grouping values that are sequential across two columns in PySpark - pyspark

I have the following df:
index  initial_range  final_range
1      1000000        5999999
2      6000000        6299999
3      6300000        6399999
4      6400000        6499999
5      6600000        6699999
6      6700000        6749999
7      6750000        6799999
8      7000000        7399999
9      7600000        7699999
10     7700000        7749999
11     7750000        7799999
12     6500000        6549999
Note that the 'initial_range' and 'final_range' fields define coverage ranges.
When we compare row index 1 and row index 2, we see that where the 'final_range' of index 1 ends, the 'initial_range' of index 2 continues the sequence (+1): in the example, index 1 ends at 5999999 and index 2 starts at 6000000. I need to group these cases and return the following df:
index  initial_range  final_range  grouping
1      1000000        5999999      1000000-6549999
2      6000000        6299999      1000000-6549999
3      6300000        6399999      1000000-6549999
4      6400000        6499999      1000000-6549999
5      6600000        6699999      6600000-6799999
6      6700000        6749999      6600000-6799999
7      6750000        6799999      6600000-6799999
8      7000000        7399999      7000000-7399999
9      7600000        7699999      7600000-7799999
10     7700000        7749999      7600000-7799999
11     7750000        7799999      7600000-7799999
12     6500000        6549999      1000000-6549999
Note that the 'grouping' field contains new ranges, built from the min(initial_range) and max(final_range) of each run of rows, until the sequence is broken.
Some details:
Between index 4 and index 5 the +1 sequence is broken, so the 'grouping' value changes. In other words, every time the sequence is broken a new group must be started.
At index 12 the grouping 1000000-6549999 appears again, because 6500000 is the number immediately following 6499999 from index 4.
I tried this code:
comparison = df == df.shift()+1
df['grouping'] = comparison['initial_range'] & comparison['final_range']
But the sequence logic didn't work.
Can anyone help me?
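(For reference, the shift-based idea can be completed in pandas roughly as below. This is only a sketch: it assumes df is a pandas DataFrame with numeric initial_range and final_range columns, and it sorts by initial_range first, since index 12 belongs to the first group.)
# Sketch of the attempted shift() logic in pandas.
df = df.sort_values("initial_range")
# A new group starts whenever initial_range is not the previous final_range + 1.
df["group_id"] = (df["initial_range"] != df["final_range"].shift() + 1).cumsum()
bounds = df.groupby("group_id").agg(lo=("initial_range", "min"),
                                    hi=("final_range", "max"))
df["grouping"] = df["group_id"].map(bounds["lo"].astype(str) + "-" + bounds["hi"].astype(str))
df = df.sort_index()  # back to the original row order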

Well, this was a tough one; here is my answer.
First of all, I am using a UDF, so expect the performance to be a little worse than a pure built-in solution.
import copy
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

rn = 0

def check_vals(x, y):
    global rn
    if (y is not None) and (int(x) + 1) == int(y):
        # Still in sequence: keep the current group number.
        return rn + 1
    else:
        # Copy the current group number before incrementing it
        # (copy.copy is enough here, the value is just an int).
        res = copy.copy(rn)
        # Increment so that the next group starts from +1.
        rn += 1
        # Return the old value, as we want to group on it.
        return res + 1

rn_udf = F.udf(lambda x, y: check_vals(x, y), IntegerType())
Next,
from pyspark.sql.window import Window

# We want to check the final_range values against the next row's initial_range.
w = Window().orderBy(F.col('initial_range'))

# 1. Take the next row's initial_range into a column called nextRange so that we can compare.
# 2. Check whether final_range + 1 == nextRange; if yes use the current rn value,
#    otherwise use rn and increment it for the next iteration.
# 3. Find the min and max values in the partition created by the check_1 column.
# 4. Concatenate the min and max values.
# 5. Order by ID to restore the initial ordering (I had to cast it to integer, but you might not need to).
# 6. Drop all intermediate columns.
df.withColumn('nextRange', F.lead('initial_range').over(w)) \
  .withColumn('check_1', rn_udf("final_range", "nextRange")) \
  .withColumn('min_val', F.min("initial_range").over(Window.partitionBy("check_1"))) \
  .withColumn('max_val', F.max("final_range").over(Window.partitionBy("check_1"))) \
  .withColumn('range', F.concat("min_val", F.lit("-"), "max_val")) \
  .orderBy(F.col("ID").cast(IntegerType())) \
  .drop("nextRange", "check_1", "min_val", "max_val") \
  .show(truncate=False)
Output:
+---+-------------+-----------+---------------+
|ID |initial_range|final_range|range |
+---+-------------+-----------+---------------+
|1 |1000000 |5999999 |1000000-6549999|
|2 |6000000 |6299999 |1000000-6549999|
|3 |6300000 |6399999 |1000000-6549999|
|4 |6400000 |6499999 |1000000-6549999|
|5 |6600000 |6699999 |6600000-6799999|
|6 |6700000 |6749999 |6600000-6799999|
|7 |6750000 |6799999 |6600000-6799999|
|8 |7000000 |7399999 |7000000-7399999|
|9 |7600000 |7699999 |7600000-7799999|
|10 |7700000 |7749999 |7600000-7799999|
|11 |7750000 |7799999 |7600000-7799999|
|12 |6500000 |6549999 |1000000-6549999|
+---+-------------+-----------+---------------+
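As a side note (not part of the original answer), the same grouping can be sketched without the UDF using only window functions, the classic "gaps and islands" trick. This is only a sketch: it assumes the columns are named as in the question (index, initial_range, final_range) and casts the ranges to numbers in case they are stored as strings.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Single ordered window; everything lands in one partition, which is fine for
# data of this size (Spark will warn about it).
w = Window.orderBy(F.col("initial_range").cast("long"))

result = (
    df.withColumn("prev_final", F.lag(F.col("final_range").cast("long")).over(w))
      # A new group starts on the first row or whenever the +1 sequence breaks.
      .withColumn("is_break",
                  F.when(F.col("prev_final").isNull() |
                         (F.col("initial_range").cast("long") != F.col("prev_final") + 1), 1)
                   .otherwise(0))
      # Running sum of the break flags gives a group id.
      .withColumn("grp", F.sum("is_break").over(w))
      # Group bounds: min(initial_range) and max(final_range) per group.
      .withColumn("grouping",
                  F.concat_ws("-",
                              F.min(F.col("initial_range").cast("long")).over(Window.partitionBy("grp")),
                              F.max(F.col("final_range").cast("long")).over(Window.partitionBy("grp"))))
      .drop("prev_final", "is_break", "grp")
)
result.orderBy(F.col("index").cast("int")).show(truncate=False)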

Related

Spark PIVOT performance is very slow on High volume Data

I have one dataframe with 3 columns and 20,000 rows. I need to convert all 20,000 transid values into columns.
Input table:
|prodid|transid|flag|
|------|-------|----|
|A     |1      |1   |
|B     |2      |1   |
|C     |3      |1   |
...and so on.
Expected output, with up to 20,000 columns:
|prodid|1|2|3|
|------|-|-|-|
|A     |1|1|1|
|B     |1|1|1|
|C     |1|1|1|
I have tried the PIVOT/transpose function, but it takes too long for high-volume data: processing 20,000 rows into columns takes around 10 hours.
e.g.
val array = a1.select("trans_id").distinct.collect.map(x => x.getString(0)).toSeq
val a2 = a1.groupBy("prodid").pivot("trans_id", array).sum("flag")
When I used pivot on 200-300 rows it worked fast, but as the number of rows increases PIVOT does not hold up.
Can anyone please help me find a solution? Is there any method to avoid the PIVOT function, since PIVOT seems suitable only for low-volume conversion? How should I deal with high-volume data?
I need this type of conversion for matrix multiplication.
For matrix multiplication my input looks like the table below, and the final result will be the matrix product (see the sketch after the table).
|col1|col2|col3|col4|
|----|----|----|----|
|1 | 0 | 1 | 0 |
|0 | 1 | 0 | 0 |
|1 | 1 | 1 | 1 |
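A pivot-free sketch for the matrix-multiplication use case: build a distributed sparse matrix directly from the long-format rows instead of pivoting to 20,000 columns. This is PySpark rather than the Scala above, and it assumes a DataFrame df_long with a 0-based integer column prodid_idx (prodid mapped once through a small lookup), a numeric transid and a flag column; all of these names are assumptions, not from the original post.
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# One MatrixEntry(row, col, value) per input row; transid is shifted to a
# 0-based column index.
entries = df_long.rdd.map(
    lambda r: MatrixEntry(int(r["prodid_idx"]), int(r["transid"]) - 1, float(r["flag"]))
)
mat = CoordinateMatrix(entries).toBlockMatrix()

# Example: multiply the matrix by its own transpose without ever materialising
# the pivoted 20,000-column frame.
product = mat.multiply(mat.transpose())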

Parameterize select query in unary kdb function

I'd like to be able to select rows in batches from a very large keyed table being stored remotely on disk. As a toy example to test my function I set up the following tables t and nt...
t:([sym:110?`A`aa`Abc`B`bb`Bac];px:110?10f;id:1+til 110)
nt:0#t
I select from the table only the records whose sym begins with the character "A", count the number of records returned, divide that count by the number of rows I would like to fetch per function call (10), and round it up to the nearest whole number...
aRec:select from t where sym like "A*"
counter:count aRec
divy:counter%10
divyUP:ceiling divy
Next I set an idx variable to 0 and write an if statement as the parameterized function. This checks if idx equals divyUP. If not, then it should select the first 10 rows of aRec, upsert those to the nt table, increment the function argument, x, by 10, and increment the idx variable by 1. Once the idx variable and divyUP are equal it should exit the function...
idx:0
batches:{[x]if[not idx=divyUP;batch::select[x 10]from aRec;`nt upsert batch;x+:10;idx+::1]}
However when I call the function it returns a type error...
q)batches 0
'type
[1] batches:{[x]if[not idx=divyUP;batch::select[x 10]from aRec;`nt upsert batch;x+:10;idx+::1]}
^
I've tried using it with sublist too, though I get the same result...
batches:{[x]if[not idx=divyUP;batch::x 10 sublist aRec;`nt upsert batch;x+:10;idx+::1]}
q)batches 0
'type
[1] batches:{[x]if[not idx=divyUP;batch::x 10 sublist aRec;`nt upsert batch;x+:10;idx+::1]}
^
However, issuing either of those commands outside of the function returns the expected results...
q)select[0 10] from aRec
sym| px id
---| ------------
A | 4.236121 1
A | 5.932252 3
Abc| 5.473628 5
A | 0.7014928 7
Abc| 3.503483 8
A | 8.254616 9
Abc| 4.328712 10
A | 5.435053 19
A | 1.014108 22
A | 1.492811 25
q)0 10 sublist aRec
sym| px id
---| ------------
A | 4.236121 1
A | 5.932252 3
Abc| 5.473628 5
A | 0.7014928 7
Abc| 3.503483 8
A | 8.254616 9
Abc| 4.328712 10
A | 5.435053 19
A | 1.014108 22
A | 1.492811 25
The issue is that in your example select[] and sublist require a list as input, but your input is not a list. The reason is that when a variable appears among the items (which would form a list), it is no longer a simple list literal, meaning a blank (space) cannot be used to separate the values. In this case, a semicolon is required.
q) x:2
q) (1;x) / (1 2)
Select command: Change input to (x;10) to make it work.
q) t:([]id:1 2 3; v: 3 4 5)
q) {select[(x;2)] from t} 1
id v
----
2  4
3  5
Another alternative is to use 'i'(index) column:
q) {select from t where i within x + 0 2} 1
Sublist Command: Convert left input of the sublist function to a list (x;10).
q) {(x;2) sublist t}1
You can't use the select[] form with variable input like that; instead you can use a functional select, shown in https://code.kx.com/q4m3/9_Queries_q-sql/#912-functional-forms, where you pass the rows you want as the 5th argument.
Hope this helps!

How do I add a new column to a Spark dataframe for every row that exists?

I'm trying to create a comparison matrix using a Spark dataframe, and am starting by creating a single column dataframe with one row per value:
val df = List(1, 2, 3, 4, 5).toDF
From here, what I need to do is create a new column for each row and insert (for now) a random number in each cell, like this:
Item 1 2 3 4 5
------ --- --- --- --- ---
1 0 7 3 6 2
2 1 0 4 3 1
3 8 6 0 4 4
4 8 8 1 0 9
5 9 5 3 6 0
Any assistance would be appreciated!
Consider transposing the input DataFrame called df using the .pivot() function, like the following:
val output = df.groupBy("item").pivot("item").agg((rand()*100).cast(DataTypes.IntegerType))
This will generate a new DataFrame with a random integer value corresponding to each row value (and null otherwise).
+----+----+----+----+----+----+
|item|1 |2 |3 |4 |5 |
+----+----+----+----+----+----+
|1 |9 |null|null|null|null|
|3 |null|null|2 |null|null|
|5 |null|null|null|null|6 |
|4 |null|null|null|26 |null|
|2 |null|33 |null|null|null|
+----+----+----+----+----+----+
If you don't want the null values, you can consider applying a UDF later, or simply filling them in, as sketched below.
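If the goal is just to get rid of the nulls, a UDF may not even be needed. A short PySpark sketch (the answer above is Scala), assuming the pivoted DataFrame is bound to a variable called output:
# Replace the nulls left by the pivot with 0.
filled = output.na.fill(0)
filled.show()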

How do I select all rows whose values exceed a given sum?

Let's say I have a table
name | val
------------
a |10
b |10
c |20
d |30
and I also have a user input of 25. How do I select rows so that the sum of val ends up one row over 25? So for a user input of 25 I would get back the first three rows, which give me a total of 40. The equivalent code for what I'm trying to do is:
total = 0
user_input = 25
while total < user_input and rows_by_val_asc_iterator.has_next():
    row = rows_by_val_asc_iterator.next()
    total = total + row.val
You could do this using something called a "window function", which lets you compute a running total; you then keep rows until that running total passes your desired sum. In the query below, the condition total - val < 25 keeps every row whose running total before adding that row is still under 25, so the row that pushes the sum past 25 is the last one returned.
For example, try something like this:
select name, val
from (
    select name, val, (sum(val) over (order by val, name)) as total
    from vals
) as t
where total - val < 25

Checking datetime format of a column in dataframe

I have an input dataframe, which has the below data:
id date_column
1 2011-07-09 11:29:31+0000
2 2011-07-09T11:29:31+0000
3 2011-07-09T11:29:31
4 2011-07-09T11:29:31+0000
I want to check whether the format of date_column matches the format "%Y-%m-%dT%H:%M:%S+0000". If the format matches, I want to add a column with value 1, otherwise 0.
Currently, I have defined a UDF to do this operation:
from datetime import datetime

def date_pattern_matching(value, pattern):
    try:
        datetime.strptime(str(value), pattern)
        return "1"
    except Exception:
        return "0"
It generates the below output dataframe:
id date_column output
1 2011-07-09 11:29:31+0000 0
2 2011-07-09T11:29:31+0000 1
3 2011-07-09T11:29:31 0
4 2011-07-09T11:29:31+0000 1
Execution through the UDF takes a lot of time; is there an alternate way to achieve this?
Try the regex pyspark.sql.Column.rlike operator with a when/otherwise block:
from pyspark.sql import functions as F

data = [[1, "2011-07-09 11:29:31+0000"],
        [1, "2011-07-09 11:29:31+0000"],
        [2, "2011-07-09T11:29:31+0000"],
        [3, "2011-07-09T11:29:31"],
        [4, "2011-07-09T11:29:31+0000"]]
df = spark.createDataFrame(data, ["id", "date_column"])

regex = r"([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+?\-?[0-9]{4})"
df_w_output = df.select("*", F.when(F.col("date_column").rlike(regex), 1).otherwise(0).alias("output"))
df_w_output.show()
df_w_output.show()
Output
+---+------------------------+------+
|id |date_column |output|
+---+------------------------+------+
|1 |2011-07-09 11:29:31+0000|0 |
|1 |2011-07-09 11:29:31+0000|0 |
|2 |2011-07-09T11:29:31+0000|1 |
|3 |2011-07-09T11:29:31 |0 |
|4 |2011-07-09T11:29:31+0000|1 |
+---+------------------------+------+