Checking whether a column has proper decimal number for special case - pyspark

I have a dataframe (input_dataframe) that looks like below:
id test_column
1 0.25
2 1.1
3 12
4 test
5 1.3334
6 12.0
I want to add a column result that contains 1 if test_column has a decimal value and 0 if test_column has any other value. The data type of test_column is string. Below is the expected output:
id test_column result
1 0.25 1
2 1.1 1
3 12 0
4 test 0
5 1.3334 1
6 12.0 1
I have the below code for this operation:
import decimal
from pyspark.sql.types import IntegerType

def is_valid_decimal(s):
    try:
        # 0 if the value parses as an integer-valued decimal, 1 otherwise
        return 0 if decimal.Decimal(s) == decimal.Decimal(s).to_integral_value() else 1
    except decimal.InvalidOperation:
        return 0

# register the UDF for usage (register also returns a UDF usable in the DataFrame API)
is_valid_decimal_udf = sqlContext.udf.register("is_valid_decimal", is_valid_decimal, IntegerType())

# Using the UDF
df = df.withColumn("result", is_valid_decimal_udf("test_column"))
However, this code does not work when the decimal values are like 12.0, 12.00, or 12.000.
Is there a way this can be achieved in pyspark?

You mentioned it's a string column, so I tried using a regular expression. Hope it helps.
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import IntegerType
>>> import re
>>> df = spark.createDataFrame([(1,'0.25'),(2,'1.1'),(3,'12'),(4,'test'),(5,'1.3334'),(6,'12.0')],['id','test_col'])
>>> df.show()
+---+--------+
| id|test_col|
+---+--------+
| 1| 0.25|
| 2| 1.1|
| 3| 12|
| 4| test|
| 5| 1.3334|
| 6| 12.0|
+---+--------+
>>> udf1 = F.udf(lambda x: 1 if re.match(r'^\d*[.]\d*$', x) else 0, IntegerType())
>>> df = df.withColumn('result',udf1(df.test_col))
>>> df.show()
+---+--------+------+
| id|test_col|result|
+---+--------+------+
| 1| 0.25| 1|
| 2| 1.1| 1|
| 3| 12| 0|
| 4| test| 0|
| 5| 1.3334| 1|
| 6| 12.0| 1|
+---+--------+------+
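As a side note, the same check can be done without a Python UDF by using Column.rlike and casting the boolean result to an int (a minimal sketch, reusing the df and the functions import from above):
>>> df = df.withColumn('result', F.col('test_col').rlike(r'^\d*[.]\d*$').cast('int'))
>>> df.show()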

Related

How to write a function on a Spark column so that each field in the column increments the value?

It's not about a unique id, so I don't mean to use the increasing-unique-number API; I'd rather solve it with a customized query.
Consider a given value like 30. The current dataframe df needs to add a new column called hop_number, so that each field in the column, from top to bottom, increments by 2 starting from 30, so that
with 2 parameters:
x -> the start number, here 30
y -> the step or offset, here 2
hop_number
---------------
30
32
34
36
38
40
......
I know that with an RDD we can use a map to handle the job, but how can I do the same with a dataframe at minimal cost?
df.column("hop_number", 30 + map(x => x + 2)) // pseudo code
Check below code.
scala> import org.apache.spark.sql.expressions._
scala> import org.apache.spark.sql.functions._
scala> val x = lit(30)
x: org.apache.spark.sql.Column = 30
scala> val y = lit(2)
y: org.apache.spark.sql.Column = 2
scala> df.withColumn("hop_number",(x + (row_number().over(Window.orderBy(lit(1)))-1) * y)).show(false)
+----------+
|hop_number|
+----------+
|30 |
|32 |
|34 |
|36 |
|38 |
+----------+
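For reference, roughly the same idea in PySpark would look like the sketch below (it assumes a dataframe df already exists; like the Scala version it orders by a constant, so Spark may warn that the whole dataset moves to a single partition):
from pyspark.sql import functions as F, Window

x, y = 30, 2  # start number and step
df = df.withColumn("hop_number", F.lit(x) + (F.row_number().over(Window.orderBy(F.lit(1))) - 1) * F.lit(y))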
Assuming you have a grouping and ordering column, you can use the window function.
import pyspark.sql.functions as F
from pyspark.sql import Window

tst = sqlContext.createDataFrame([(1,1,14),(1,2,4),(1,3,10),(2,1,90),(7,2,30),(2,3,11)], schema=['group','order','value'])
w = Window.partitionBy('group').orderBy('order')
# running sum of the step (2) within each group, then shift so the first row starts at 30
tst_hop = tst.withColumn("temp", F.sum(F.lit(2)).over(w)).withColumn("hop_number", F.col('temp') + 28)
The results:
tst_hop.show()
+-----+-----+-----+----+----------+
|group|order|value|temp|hop_number|
+-----+-----+-----+----+----------+
| 1| 1| 14| 2| 30|
| 1| 2| 4| 4| 32|
| 1| 3| 10| 6| 34|
| 2| 1| 90| 2| 30|
| 2| 3| 11| 4| 32|
| 7| 2| 30| 2| 30|
+-----+-----+-----+----+----------+
If you need a different approach, please provide a sample data of the dataframe.

Create another column for checking different values in pyspark

I wish to have the expected output shown below.
My code:
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

pd_dataframe = pd.DataFrame({'id': [i for i in range(10)],
                             'values': [10, 5, 3, -1, 0, -10, -4, 10, 0, 10]})
sp_dataframe = spark.createDataFrame(pd_dataframe)
sign_acc_row = F.udf(lambda x: int(np.sign(x)), IntegerType())
sp_dataframe = sp_dataframe.withColumn('sign', sign_acc_row('values'))
sp_dataframe.show()
I want to create another column, numbering, that increments by 1 whenever the value of sign differs from the previous row.
Expected output:
id values sign numbering
0 0 10 1 1
1 1 5 1 1
2 2 3 1 1
3 3 -1 -1 2
4 4 0 0 3
5 5 -10 -1 4
6 6 -4 -1 4
7 7 10 1 5
8 8 0 0 6
9 9 10 1 7
Here's a way you can do it using a custom function:
import pyspark.sql.functions as F
# compare the next value with previous
def f(x):
    c = 1
    l = [c]
    last_value = [x[0]]
    for i in x[1:]:
        if i == last_value[-1]:
            l.append(c)
        else:
            c += 1
            l.append(c)
            last_value.append(i)
    return l
# take sign column as a list
sign_list = sp_dataframe.select('sign').rdd.map(lambda x: x.sign).collect()
# create a new dataframe using the output
sp = spark.createDataFrame(pd.DataFrame(f(sign_list), columns=['numbering']))
Appending a list as a column to a dataframe is a bit tricky in pyspark. For this, we'll need to create a dummy row_idx to join the dataframes.
# create dummy indexes
sp_dataframe = sp_dataframe.withColumn("row_idx", F.monotonically_increasing_id())
sp = sp.withColumn("row_idx", F.monotonically_increasing_id())
# join the dataframes
final_df = (sp_dataframe
            .join(sp, sp_dataframe.row_idx == sp.row_idx)
            .orderBy('id')
            .drop("row_idx"))
final_df.show()
+---+------+----+---------+
| id|values|sign|numbering|
+---+------+----+---------+
| 0| 10| 1| 1|
| 1| 5| 1| 1|
| 2| 3| 1| 1|
| 3| -1| -1| 2|
| 4| 0| 0| 3|
| 5| -10| -1| 4|
| 6| -4| -1| 4|
| 7| 10| 1| 5|
| 8| 0| 0| 6|
| 9| 10| 1| 7|
+---+------+----+---------+
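As a design note, the same numbering can be computed without collecting the column to the driver by combining lag with a running sum over a window. This is only a sketch; it assumes the id column defines the row order, and the un-partitioned window pulls all rows into one partition, which is fine for small data:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.orderBy('id')
# 1 where the sign differs from the previous row (the first row counts as a change), else 0
sp_dataframe = sp_dataframe.withColumn(
    'change',
    F.when(F.lag('sign').over(w).isNull() | (F.col('sign') != F.lag('sign').over(w)), 1).otherwise(0))
# a running sum of the change flags yields the numbering column
sp_dataframe = sp_dataframe.withColumn('numbering', F.sum('change').over(w)).drop('change')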

PySpark DataFrame multiply columns based on values in other columns

Pyspark newbie here. I have a dataframe, say,
+------+----+-----+
|    id|mode|count|
+------+----+-----+
|146360| DOS|   30|
|423541| UNO|    3|
+------+----+-----+
I want a dataframe with a new column aggregate that contains count * 2 when mode is 'DOS' and count * 1 when mode is 'UNO':
+------+----+-----+---------+
|    id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS|   30|       60|
|423541| UNO|    3|        3|
+------+----+-----+---------+
Appreciate your inputs and also some pointers to best practices :)
Method 1: using pyspark.sql.functions with when:
from pyspark.sql.functions import when, col

df = df.withColumn('aggregate',
                   when(col('mode') == 'DOS', col('count') * 2)
                   .when(col('mode') == 'UNO', col('count') * 1)
                   .otherwise(col('count')))
Method 2: using SQL CASE expression with selectExpr:
df = df.selectExpr("*","CASE WHEN mode == 'DOS' THEN count*2 WHEN mode == 'UNO' THEN count*1 ELSE count END AS aggregate")
The result:
+------+----+-----+---------+
| id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS| 30| 60|
|423541| UNO| 3| 3|
+------+----+-----+---------+
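As a best-practice pointer, once the mode-to-multiplier rules grow beyond two cases, it can be cleaner to keep them in a plain Python dict and build the column from a map expression instead of chaining when calls. A sketch (the multipliers dict is an assumed example, and unknown modes fall back to a multiplier of 1):
from itertools import chain
from pyspark.sql import functions as F

multipliers = {'DOS': 2, 'UNO': 1}  # assumed mapping, extend as needed
mapping = F.create_map([F.lit(x) for x in chain(*multipliers.items())])
df = df.withColumn('aggregate', F.col('count') * F.coalesce(mapping[F.col('mode')], F.lit(1)))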

pyspark: counting number of occurrences of each distinct values

I think the question is related to: Spark DataFrame: count distinct values of every column
So basically I have a spark dataframe whose column A has the values 1, 1, 2, 2, 1.
I want to count how many times each distinct value (in this case, 1 and 2) appears in column A, and print something like
distinct_values | number_of_appearance
1 | 3
2 | 2
I am just posting this because I think the other answer with the alias could be confusing. What you need are the groupBy and count methods:
from pyspark.sql.types import IntegerType

l = [1, 1, 2, 2, 1]
df = spark.createDataFrame(l, IntegerType())
df.groupBy('value').count().show()
+-----+-----+
|value|count|
+-----+-----+
| 1| 3|
| 2| 2|
+-----+-----+
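If you prefer Spark SQL, the same count can be expressed with a GROUP BY, using the column names from the question (a sketch on the df created above):
df.createOrReplaceTempView("t")
spark.sql("SELECT value AS distinct_values, COUNT(*) AS number_of_appearance FROM t GROUP BY value").show()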
I am not sure if you are looking for the below solution.
Here are my thoughts on this. Suppose you have a dataframe like this:
>>> listA = [(1,'AAA','USA'),(2,'XXX','CHN'),(3,'KKK','USA'),(4,'PPP','USA'),(5,'EEE','USA'),(5,'HHH','THA')]
>>> df = spark.createDataFrame(listA, ['id', 'name','country'])
>>> df.show()
+---+----+-------+
| id|name|country|
+---+----+-------+
| 1| AAA| USA|
| 2| XXX| CHN|
| 3| KKK| USA|
| 4| PPP| USA|
| 5| EEE| USA|
| 5| HHH| THA|
+---+----+-------+
I want to know which distinct country codes appear in this particular dataframe, and the result should be printed with alias column names.
import pyspark.sql.functions as func
df.groupBy('country').count() \
  .select(func.col("country").alias("distinct_country"), func.col("count").alias("country_count")) \
  .show()
+----------------+-------------+
|distinct_country|country_count|
+----------------+-------------+
| THA| 1|
| USA| 4|
| CHN| 1|
+----------------+-------------+
Were you looking for something similar to this?

Exploding pipe separated data in spark

I have a spark dataframe (input_dataframe); the data in this dataframe looks like below:
id value
1 a
2 x|y|z
3 t|u
I want an output_dataframe with the pipe-separated fields exploded; it should look like below:
id value
1 a
2 x
2 y
2 z
3 t
3 u
Please help me achieve the desired solution using PySpark. Any help will be appreciated.
We can first split and then explode the value column using functions, as below:
>>> l=[(1,'a'),(2,'x|y|z'),(3,'t|u')]
>>> df = spark.createDataFrame(l,['id','val'])
>>> df.show()
+---+-----+
| id| val|
+---+-----+
| 1| a|
| 2|x|y|z|
| 3| t|u|
+---+-----+
>>> from pyspark.sql import functions as F
>>> df.select('id',F.explode(F.split(df.val,'[|]')).alias('value')).show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 2| x|
| 2| y|
| 2| z|
| 3| t|
| 3| u|
+---+-----+
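The same can also be written with selectExpr, keeping the split and explode in SQL form (a sketch on the same df; the pipe still needs the character class or an escape because split takes a regular expression):
>>> df.selectExpr("id", "explode(split(val, '[|]')) AS value").show()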