I have a pyspark dataframe with a column label:
label
0
1
2
3
0
And I want to create a new column new_label, changing all values that are not 3 to 0, so that there are only 2 classes: 0 and 3.
I am pretty new to pyspark. How can I do this?
Assuming df is your dataframe:
from pyspark.sql import functions as F
df = df.withColumn("new_label", F.when(F.col("label") == 3, 3).otherwise(0))
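A minimal, self-contained way to try this out (the sample rows below are just the label values listed in the question):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# sample data matching the question's label column
df = spark.createDataFrame([(0,), (1,), (2,), (3,), (0,)], ["label"])

# keep 3 as 3, map every other value to 0
df = df.withColumn("new_label", F.when(F.col("label") == 3, 3).otherwise(0))
df.show()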
I have a dataframe where one column contains several pieces of information in a 'key=value' format.
There are almost 30 different 'key=value' pairs that can appear in that column; I will use 4 keys here
for illustration (_age, _city, _sal, tag).
id name properties
0 A {_age=10, _city=A, _sal=1000}
1 B {_age=20, _city=B, _sal=3000, tag=XYZ}
2 C {_city=BC, tag=ABC}
How can I convert this string column into multiple columns?
I need to use a Spark Scala DataFrame for this.
The expected output is:
id name _age _city _sal tag
0 A 10 A 1000
1 B 20 B 3000 XYZ
2 C BC ABC
Short answer (this works when properties is already a struct column):
df
  .select(
    col("id"),
    col("name"),
    col("properties.*"),
    ..
  )
Try this (adding trim to drop the space left after splitting on ","):
import org.apache.spark.sql.functions._
// one row per "key=value" pair, with the surrounding braces removed
val s = df.withColumn("dummy", explode(split(regexp_replace($"properties", "\\{|\\}", ""), ",")))
// split each pair into key and value, then pivot the keys back into columns per (id, name)
val result = s.drop("properties").withColumn("col1", split(trim($"dummy"), "=")(0)).withColumn("col1-value", split(trim($"dummy"), "=")(1)).drop("dummy")
result.groupBy("id", "name").pivot("col1").agg(first($"col1-value")).orderBy($"id").show
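For readers working in PySpark rather than Scala, here is a sketch of the same explode/split/pivot approach; the DataFrame below just reproduces the sample rows from the question, with properties kept as a plain string.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# sample data from the question, properties stored as a plain string
df = spark.createDataFrame(
    [
        (0, "A", "{_age=10, _city=A, _sal=1000}"),
        (1, "B", "{_age=20, _city=B, _sal=3000, tag=XYZ}"),
        (2, "C", "{_city=BC, tag=ABC}"),
    ],
    ["id", "name", "properties"],
)

# one row per "key=value" pair, braces removed
pairs = df.withColumn(
    "pair", F.explode(F.split(F.regexp_replace("properties", r"[{}]", ""), ","))
)

# split each pair into key and value, then pivot the keys back into columns
result = (
    pairs.withColumn("key", F.split(F.trim("pair"), "=").getItem(0))
         .withColumn("value", F.split(F.trim("pair"), "=").getItem(1))
         .groupBy("id", "name")
         .pivot("key")
         .agg(F.first("value"))
         .orderBy("id")
)
result.show()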
I am currently learning pyspark and working on adding columns to pyspark dataframes using multiple conditions.
I have tried working with UDFs but I am getting errors like:
TypeError: 'object' object has no attribute '__getitem__'
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType, StringType, FloatType
from pyspark.sql.functions import pandas_udf, PandasUDFType
#first dataframe (superset)
df1 = simple_example1
#second dataframe
df = diff_cols.dropna()
def func(x, y):
    z = df1[(df1['a'] == x) & (df1['b'] <= (y + 10000000000)) & (df1['b'] >= (y - 10000000000))]
    z = z[(z["c"] == 1) | (z["d"] == 1)]
    z = z[(z["e"] != 0) | (z["f"] != 0) | (z["g"] != 0) | (z["h"] != 0)]
    return 1 if z.count() > 3 else 0
udf_func = udf(func, IntegerType())
df = df.withColumn('status', udf_func(df['a'],df['b']))
What I am trying to do is as follows:
1. For each row of df, filter data from df1 where parameter a is equal to the parameter in df and parameter b is between b-10 and b+10.
2. Then filter that data further to rows where either c or d is 1.
3. Then filter that data further to rows where any of the parameters e, f, g, h is non-zero.
4. Then count the number of rows in the subset and assign 0/1.
5. Return this 0/1 in the status column of df.
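The error most likely comes from func closing over the Spark DataFrame df1: a DataFrame cannot be used inside a UDF, which only receives plain Python values. One common alternative is to express steps 1-4 as a join plus an aggregation. Below is a rough, hedged sketch of that idea, reusing df, df1 and the column names a through h from the snippet above; treating (a, b) as identifying a row of df is an assumption.

from pyspark.sql import functions as F

# hypothetical rewrite, not the original UDF: express steps 1-3 as a join condition,
# then count the matching df1 rows per row of df
W = 10000000000  # the window used in the code above (the text says +/- 10)

cond = (
    (F.col("l.a") == F.col("r.a"))
    & F.col("r.b").between(F.col("l.b") - W, F.col("l.b") + W)
    & ((F.col("r.c") == 1) | (F.col("r.d") == 1))
    & ((F.col("r.e") != 0) | (F.col("r.f") != 0) | (F.col("r.g") != 0) | (F.col("r.h") != 0))
)

result = (
    df.alias("l")
    .join(df1.alias("r"), cond, "left")
    .groupBy("l.a", "l.b")  # assumes (a, b) identifies a row of df
    .agg(F.count("r.a").alias("n_matches"))
    .withColumn("status", (F.col("n_matches") > 3).cast("int"))
)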
I have a dataframe like below:
group value
B 2
B 3
A 5
A 6
Now I need to subtract rows based on group, i.e. 2-3 and 5-6. After the transformation it should look like this:
group value
B -1
A -1
I tried the code below but it couldn't solve my case:
val df2 = df1.groupBy("Group").agg(first("Value")-second(col("Value")))
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("group").orderBy("value")
val df2 = df1.select($"group", $"value", $"value" - lead("value", 1).over(w))
Note that the last row of each group gets a null here, since it has no following value to subtract.
I guess you're trying to subtract two neighbouring values in order. This works for me:
val df2 = df1.groupBy("Group").agg(first("Value").minus(last(col("Value"))))
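For PySpark users, the same first-minus-last idea looks like the sketch below; note that first and last depend on the incoming row order, so this assumes the rows arrive in the order shown in the question.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# sample data from the question
df1 = spark.createDataFrame([("B", 2), ("B", 3), ("A", 5), ("A", 6)], ["group", "value"])

# first("value") - last("value") per group, mirroring the Scala answer above
df2 = df1.groupBy("group").agg((F.first("value") - F.last("value")).alias("value"))
df2.show()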
input DF:
A B
1 1
2 1
2 2
3 3
3 1
3 2
3 3
3 4
I am trying to filter the rows based on the combination of
(A, Max(B))
Output Df:
A B
1 1
2 3
3 4
I am able to do this with df.groupBy(), but there are also other columns in the DF which I want to be selected, yet do not want to include in the groupBy. So the condition for filtering the rows should only be with respect to these columns and not the other columns in the DF. Any suggestions please?
As suggested in "How to get other columns when using Spark DataFrame groupby?" you can use window functions:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
df.withColumn("maxB", max(col("B")).over(Window.partitionBy("A"))).where(...)
where ... is replaced by a predicate based on A and maxB.
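A self-contained PySpark version of the same window approach, with one possible predicate filled in (keep the rows where B equals the per-A maximum, then drop the helper column):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# sample data from the question
df = spark.createDataFrame(
    [(1, 1), (2, 1), (2, 2), (3, 3), (3, 1), (3, 2), (3, 3), (3, 4)], ["A", "B"]
)

# per-A maximum of B as an extra column, then keep only the rows that reach it
w = Window.partitionBy("A")
result = (
    df.withColumn("maxB", F.max("B").over(w))
      .where(F.col("B") == F.col("maxB"))
      .drop("maxB")
)
result.show()

If exactly one row per value of A is needed even when the maximum is tied, the usual alternative is row_number() over the same window ordered by B descending, keeping row_number == 1.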
Say I have a dataframe
product_id customers
1 [1,2,4]
2 [1,2]
I want to create a new column, say nb_customer, by applying the function len to the column customers.
I tried
df = df.select('*', (map(len, df.customers)).alias('nb_customer'))
but it does not work.
What is the correct way to do that?
import pyspark.sql.functions as f
df = sc.parallelize([
[1,[1,2,4]],
[2,[1,2]]
]).toDF(('product_id', 'customers'))
df.withColumn('nb_customer',f.size(df.customers)).show()
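If the goal were to use Python's len directly (as in the original attempt), it can also be wrapped in a UDF; the built-in f.size above is normally the better choice, since Python UDFs add serialization overhead. A minimal sketch:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# wrap Python's len in a UDF (slower than f.size, shown only as an alternative)
len_udf = udf(len, IntegerType())
df.withColumn('nb_customer', len_udf(df.customers)).show()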