Add a column with values from 1 to n to a dataframe - pyspark

I am creating a dataframe with pyspark, like this:
+----+------+
| k| v|
+----+------+
|key1|value1|
|key1|value1|
|key1|value1|
|key2|value1|
|key2|value1|
|key2|value1|
+----+------+
I want to add a 'rowNum' column using the 'withColumn' method, so that the dataframe becomes:
+----+------+------+
| k| v|rowNum|
+----+------+------+
|key1|value1| 1|
|key1|value1| 2|
|key1|value1| 3|
|key2|value1| 4|
|key2|value1| 5|
|key2|value1| 6|
+----+------+------+
The range of rowNum is from 1 to n, where n is the number of rows. I modified my code like this:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window().partitionBy("v").orderBy('k')
my_df= my_df.withColumn("rowNum", F.rowNumber().over(w))
But I got an error message:
'module' object has no attribute 'rowNumber'
I replaced the rowNumber() method with row_number, and the above code can run. But when I run this code:
my_df.show()
I got an error message again:
Py4JJavaError: An error occurred while calling o898.showString.
: java.lang.UnsupportedOperationException: Cannot evaluate expression: row_number()
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:224)
at org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate.doGenCode(interfaces.scala:342)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
at scala.Option.getOrElse(Option.scala:121)

Solution in Spark 2.2:
from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window
w = Window().orderBy(lit('A'))
df = df.withColumn("rowNum", row_number().over(w))
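As a minimal end-to-end sketch (assuming a SparkSession named spark and recreating the example data from the question), the Spark 2.2 approach fits together like this; note that ordering by a constant literal makes Spark move all rows into a single partition to compute the row numbers:
from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Recreate the example dataframe from the question
my_df = spark.createDataFrame(
    [('key1', 'value1'), ('key1', 'value1'), ('key1', 'value1'),
     ('key2', 'value1'), ('key2', 'value1'), ('key2', 'value1')],
    ['k', 'v'])

# Ordering by a constant gives a global 1..n numbering,
# at the cost of shuffling all rows into one partition
w = Window().orderBy(lit('A'))
my_df = my_df.withColumn("rowNum", row_number().over(w))
my_df.show()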

If you require a sequential rowNum value from 1 to n, rather than a monotonically_increasing_id, you can use zipWithIndex().
Recreating your example data as follows:
rdd = sc.parallelize([('key1','value1'),
('key1','value1'),
('key1','value1'),
('key1','value1'),
('key1','value1'),
('key1','value1')])
You can then use zipWithIndex() to add an index to each row. The map is used to reformat the data and to add 1 to the index so it starts at 1.
rdd_indexed = rdd.zipWithIndex().map(lambda x: (x[0][0],x[0][1],x[1]+1))
df = rdd_indexed.toDF(['id','score','rowNum'])
df.show()
+----+------+------+
| id| score|rowNum|
+----+------+------+
|key1|value1| 1|
|key1|value1| 2|
|key1|value1| 3|
|key1|value1| 4|
|key1|value1| 5|
|key1|value1| 6|
+----+------+------+
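If you are starting from a DataFrame rather than an RDD, a similar sketch (column names 'k' and 'v' assumed from the question, SparkSession active) is to go through df.rdd and back, the same idea as the Scala zipWithIndex answer further down:
rdd_indexed = my_df.rdd.zipWithIndex().map(
    lambda pair: (pair[0]['k'], pair[0]['v'], pair[1] + 1))
df_with_rownum = rdd_indexed.toDF(['k', 'v', 'rowNum'])
df_with_rownum.show()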

You can do this with window functions. Note that row_number requires an ordered window, so if there is no natural ordering column you can order by a constant literal:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, lit
w = Window().orderBy(lit('A'))
your_df = your_df.withColumn("rowNum", row_number().over(w))
Here your_df is the data frame in which you need this column.

I have used Spark 2.2 and found "row_number()" working.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
win_row_number = Window.orderBy("col_name")
df_row_number = df.select("col_name", F.row_number().over(win_row_number))
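By default the window expression gets an auto-generated column name; if you prefer a readable one, a small variation is to alias it:
df_row_number = df.select("col_name", F.row_number().over(win_row_number).alias("row_num"))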

Related

How to select elements in a Scala dataframe?

Reference to How do I select item with most count in a dataframe and define it as a variable in scala?
Given the table below, how can I select the nth src_ip and put it in a variable?
+--------------+------------+
| src_ip|src_ip_count|
+--------------+------------+
| 58.242.83.11| 52|
|58.218.198.160| 33|
|58.218.198.175| 22|
|221.194.47.221| 6|
+--------------+------------+
You can create another column with the row number as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val tempdf = df.withColumn("row_number", monotonically_increasing_id())
tempdf.withColumn("row_number", row_number().over(Window.orderBy("row_number")))
which should give you tempdf as
+--------------+------------+----------+
| src_ip|src_ip_count|row_number|
+--------------+------------+----------+
| 58.242.83.11| 52| 1|
|58.218.198.160| 33| 2|
|58.218.198.175| 22| 3|
|221.194.47.221| 6| 4|
+--------------+------------+----------+
Now you can use filter to select the nth row:
.filter($"row_number" === n)
That should be it.
For extracting the IP, let's say your n is 2:
val n = 2
Then the above process would give you
+--------------+------------+----------+
| src_ip|src_ip_count|row_number|
+--------------+------------+----------+
|58.218.198.160| 33| 2|
+--------------+------------+----------+
Getting the IP address is explained in the link you provided in the question, by doing
.head.get(0)
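For reference, a rough PySpark equivalent of the same approach (assuming a DataFrame df with columns src_ip and src_ip_count) might look like this:
from pyspark.sql.functions import monotonically_increasing_id, row_number, col
from pyspark.sql.window import Window

n = 2
tempdf = df.withColumn("row_number", monotonically_increasing_id())
tempdf = tempdf.withColumn("row_number", row_number().over(Window.orderBy("row_number")))
nth_ip = tempdf.filter(col("row_number") == n).head()[0]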
The safest way is to use zipWithIndex on the dataframe converted into an rdd and then convert back to a dataframe, so that we have an unambiguous row_number column.
val finalDF = df.rdd.zipWithIndex().map(row => (row._1(0).toString, row._1(1).toString, (row._2+1).toInt)).toDF("src_ip", "src_ip_count", "row_number")
The rest of the steps are already explained above.

In Spark windowing, how do you fill in null when the number of rows selected is less than the window size?

assume there is a dataframe as follows:
machine_id | value
1| 5
1| 3
1| 4
I want to produce a final dataframe like this
machine_id | value | sum
1| 5|null
1| 3| 8
1| 4| 7
Basically I have to use a window of size two, but for the first row we don't want to sum it with zero; it should just be filled with null. My attempt:
var winSpec = Window.orderBy("machine_id").partitionBy("machine_id").rangeBetween(-1, 0)
df.withColumn("sum", sum("value").over(winSpec))
You can use the lag function, adding the value column to lag(value, 1):
val df = Seq((1,5),(1,3),(1,4)).toDF("machine_id", "value")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("machine_id").orderBy("id")
(df.withColumn("id", monotonically_increasing_id)
   .withColumn("sum", $"value" + lag($"value", 1).over(window))
   .drop("id").show())
+----------+-----+----+
|machine_id|value| sum|
+----------+-----+----+
| 1| 5|null|
| 1| 3| 8|
| 1| 4| 7|
+----------+-----+----+
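For completeness, a rough PySpark version of the same lag-based approach (assuming a SparkSession named spark) could look like this:
from pyspark.sql.functions import lag, monotonically_increasing_id, col
from pyspark.sql.window import Window

df = spark.createDataFrame([(1, 5), (1, 3), (1, 4)], ["machine_id", "value"])
window = Window.partitionBy("machine_id").orderBy("id")
(df.withColumn("id", monotonically_increasing_id())
   .withColumn("sum", col("value") + lag("value", 1).over(window))
   .drop("id")
   .show())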
You should be using the rowsBetween api rather than rangeBetween, as below:
import org.apache.spark.sql.functions._
var winSpec = Window.orderBy("machine_id").partitionBy("machine_id").rowsBetween(-1, 0)
df.withColumn("sum", sum("value").over(winSpec))
.withColumn("sum", when($"sum" === $"value", null).otherwise($"sum"))
.show(false)
which should give you your expected result
+----------+-----+----+
|machine_id|value|sum |
+----------+-----+----+
|1 |5 |null|
|1 |3 |8 |
|1 |4 |7 |
+----------+-----+----+
I hope the answer is helpful
If you want a general solution where n is the size of the window:
Spark - Scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val winSpec = Window.partitionBy("machine_id").orderBy("machine_id").rowsBetween(-(n - 1), 0)
val winSpec2 = Window.partitionBy("machine_id").orderBy("machine_id")
df.withColumn("sum", when(row_number().over(winSpec2) < n, lit(null)).otherwise(sum("value").over(winSpec)))
  .show(false)
PySpark
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, sum, when, lit
winSpec = Window.partitionBy("machine_id").orderBy("machine_id").rowsBetween(-(n - 1), 0)
winSpec2 = Window.partitionBy("machine_id").orderBy("machine_id")
(df.withColumn("sum", when(row_number().over(winSpec2) < n, lit(None))
                      .otherwise(sum("value").over(winSpec)))
   .show(truncate=False))
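For example, with n = 2 this should reproduce the table shown above (null for the first row of each machine_id, then the sum of each row with its predecessor), assuming the rows keep their original order, since machine_id alone does not define a unique ordering.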

How to get the length of the lists in a column in a Spark dataframe?

I have a df whose 'products' column contains lists, like below:
+----------+---------+--------------------+
|member_srl|click_day| products|
+----------+---------+--------------------+
| 12| 20161223| [2407, 5400021771]|
| 12| 20161226| [7320, 2407]|
| 12| 20170104| [2407]|
| 12| 20170106| [2407]|
| 27| 20170104| [2405, 2407]|
| 28| 20161212| [2407]|
| 28| 20161213| [2407, 100093]|
| 28| 20161215| [1956119]|
| 28| 20161219| [2407, 100093]|
| 28| 20161229| [7905970]|
| 124| 20161011| [5400021771]|
| 6963| 20160101| [103825645]|
| 6963| 20160104|[3000014912, 6626...|
| 6963| 20160111|[99643224, 106032...|
How can I add a new column product_cnt which is the length of the products list? And how can I filter df to get rows where the products list has a given length?
Thanks.
PySpark has a built-in function called size that achieves exactly what you want: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size
To add it as a column, you can simply call it in your select statement.
from pyspark.sql.functions import size
countdf = df.select('*',size('products').alias('product_cnt'))
Filtering works exactly as #titiro89 described. Furthermore, you can use the size function in the filter. This will allow you to bypass adding the extra column (if you wish to do so) in the following way.
filterdf = df.filter(size('products')==given_products_length)
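If you want to keep the count column and filter in the same expression, a small variation on the two snippets above would be:
from pyspark.sql.functions import size, col

result = (df.withColumn('product_cnt', size('products'))
            .filter(col('product_cnt') == given_products_length))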
First question:
How to add a new column product_cnt which are the length of products list?
>>> a = [(12,20161223, [2407,5400021771]),(12,20161226,[7320,2407,4344])]
>>> df = spark.createDataFrame(a,
["member_srl","click_day","products"])
>>> df.show()
+----------+---------+------------------+
|member_srl|click_day| products|
+----------+---------+------------------+
| 12| 20161223|[2407, 5400021771]|
| 12| 20161226|[7320, 2407, 4344]|
+----------+---------+------------------+
You can find a similar example here
>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf
>>> slen = udf(lambda s: len(s), IntegerType())
>>> df2 = df.withColumn("product_cnt", slen(df.products))
>>> df2.show()
+----------+---------+------------------+-----------+
|member_srl|click_day| products|product_cnt|
+----------+---------+------------------+-----------+
| 12| 20161223|[2407, 5400021771]| 2|
| 12| 20161226|[7320, 2407, 4344]| 3|
+----------+---------+------------------+-----------+
Second question:
And how can I filter df to get rows where the products list has a given length?
You can use the filter function, docs here
>>> givenLength = 2
>>> df3 = df2.filter(df2.product_cnt==givenLength)
>>> df3.show()
+----------+---------+------------------+-----------+
|member_srl|click_day| products|product_cnt|
+----------+---------+------------------+-----------+
| 12| 20161223|[2407, 5400021771]| 2|
+----------+---------+------------------+-----------+

How to combine where and groupBy in Spark's DataFrame?

How can I use aggregate functions in a where clause in Apache Spark 1.6?
Consider the following DataFrame
+---+------+
| id|letter|
+---+------+
| 1| a|
| 2| b|
| 3| b|
+---+------+
How can I select all rows where letter occurs more than once, i.e. the expected output would be
+---+------+
| id|letter|
+---+------+
| 2| b|
| 3| b|
+---+------+
This obviously does not work:
df.where(
df.groupBy($"letter").count()>1
)
My example is about count, but I'd like to be able to use other aggregate functions (and their results) as well.
EDIT:
Just for counting, I came up with the following solution:
df.groupBy($"letter").agg(
collect_list($"id").as("ids")
)
.where(size($"ids") > 1)
.withColumn("id", explode($"ids"))
.drop($"ids")
You can use left semi join:
df.join(
  broadcast(df.groupBy($"letter").count.where($"count" > 1)),
  Seq("letter"),
  "leftsemi"
)
or window functions:
import org.apache.spark.sql.expressions.Window
df
  .withColumn("count", count($"*").over(Window.partitionBy("letter")))
  .where($"count" > 1)
In Spark 2.0 or later you could also use a Bloom filter, but it is not available in 1.x.

Pyspark Dataframe Apply function to two columns

Say I have two PySpark DataFrames df1 and df2.
df1 =
'a'
 1
 2
 5

df2 =
'b'
 3
 6
And I want to find the closest df2['b'] value for each df1['a'], and add the closest values as a new column in df1.
In other words, for each value x in df1['a'], I want to find a y that achieves min(abs(x-y)) over all y in df2['b'] (note: we can assume that there is only one y that achieves the minimum distance), and the result would be
'a' 'b'
1 3
2 3
5 6
I tried the following code to create a distance matrix first (before finding the values achieving the minimum distance):
from pyspark.sql import SQLContext
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def dist(x, y):
    return abs(x - y)

udf_dict = udf(dist, IntegerType())
sql_sc = SQLContext(sc)
udf_dict(df1.a, df2.b)
which gives
Column<PythonUDF#dist(a,b)>
Then I tried
sql_sc.createDataFrame(udf_dict(df1.a, df2.b))
which runs forever without giving error/output.
My questions are:
As I'm new to Spark, is my way of constructing the output DataFrame efficient? (My way would be creating a distance matrix for all the a and b values first, and then finding the min one.)
What's wrong with the last line of my code and how to fix it?
Starting with your second question: you can only apply a udf to an existing dataframe's columns; I think you were looking for something like this:
>>> df1.join(df2).withColumn('distance', udf_dict(df1.a, df2.b)).show()
+---+---+--------+
| a| b|distance|
+---+---+--------+
| 1| 3| 2|
| 1| 6| 5|
| 2| 3| 1|
| 2| 6| 4|
| 5| 3| 2|
| 5| 6| 1|
+---+---+--------+
But there is a more efficient way to compute this distance, using the built-in abs:
>>> from pyspark.sql.functions import abs
>>> df1.join(df2).withColumn('distance', abs(df1.a -df2.b))
Then you can find matching numbers by calculating:
>>> from pyspark.sql.functions import min
>>> distances = df1.join(df2).withColumn('distance', abs(df1.a - df2.b))
>>> min_distances = distances.groupBy('a').agg(min('distance').alias('distance'))
>>> distances.join(min_distances, ['a', 'distance']).select('a', 'b').show()
+---+---+
| a| b|
+---+---+
| 5| 6|
| 1| 3|
| 2| 3|
+---+---+
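Putting the pieces together, a self-contained sketch of the whole approach (recreating df1 and df2 from the question, SparkSession assumed as spark, and using an explicit crossJoin) might look like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import abs, min

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1,), (2,), (5,)], ['a'])
df2 = spark.createDataFrame([(3,), (6,)], ['b'])

# Cross join, compute |a - b|, then keep the b that minimises the distance for each a
distances = df1.crossJoin(df2).withColumn('distance', abs(df1.a - df2.b))
min_distances = distances.groupBy('a').agg(min('distance').alias('distance'))
closest = distances.join(min_distances, ['a', 'distance']).select('a', 'b')
closest.show()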