I have the following code in Scala:
val FilteredPSPDF = PSPDF.select("accountname","amount", "currency", "datestamp","orderid","transactiontype")
However, I have some values in column "amount" which are negative and I need to change them to positive values. Is it possible to do this arithmetic function within the Select statement? How do I go about this?
There's an abs function available in Spark SQL.
You can either use selectExpr instead of select:
PSPDF.selectExpr("accountname","abs(amount) as amount", "currency", "datestamp","orderid","transactiontype")
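or use select's overloaded version that takes Column arguments (with abs imported from org.apache.spark.sql.functions and the session's implicits in scope for the $ syntax):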
PSPDF.select($"accountname", abs($"amount").as("amount"), $"currency", $"datestamp", $"orderid", $"transactiontype")
You can also use the built-in when and negate functions:
import org.apache.spark.sql.functions._
val FilteredPSPDF = PSPDF.select(col("accountname"), when(col("amount") < 0, negate(col("amount"))).otherwise(col("amount")).as("amount"), col("currency"), col("datestamp"), col("orderid"), col("transactiontype"))
I was wondering whether it is possible to create a UDF that receives two arguments, a Column and another variable (Object, Dictionary, or any other type), then does some operations and returns the result.
Actually, I attempted to do this but I got an exception. Therefore, I was wondering if there was any way to avoid this problem.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00),
                                 ("Hayek", 60, 3000.00),
                                 ("Mises", 60, 1000.0)],
                                ["name", "age", "balance"])
comparatorUDF = udf(lambda c, n: c == n, BooleanType())
df.where(comparatorUDF(col("name"), "Bonsanto")).show()
And I get the following error:
AnalysisException: u"cannot resolve 'Bonsanto' given input columns
name, age, balance;"
So it's obvious that the UDF "sees" the string "Bonsanto" as a column name, and actually I'm trying to compare a record value with the second argument.
On the other hand, I know that it's possible to use some operators inside a where clause (but actually I want to know if it is achievable using an UDF), as follows:
df.where(col("name") == "Bonsanto").show()
#+--------+---+-------+
#| name|age|balance|
#+--------+---+-------+
#|Bonsanto| 20| 2000.0|
#+--------+---+-------+
Everything that is passed to an UDF is interpreted as a column / column name. If you want to pass a literal you have two options:
Pass argument using currying:
def comparatorUDF(n):
    return udf(lambda c: c == n, BooleanType())
df.where(comparatorUDF("Bonsanto")(col("name")))
This can be used with an argument of any type as long as it is serializable.
Use a SQL literal with the original two-argument UDF from the question:
from pyspark.sql.functions import lit
df.where(comparatorUDF(col("name"), lit("Bonsanto")))
This works only with supported types (strings, numerics, booleans). For non-atomic types see How to add a constant column in a Spark DataFrame?
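To see why the curried version needs no lit(), it can help to strip Spark away. The sketch below uses a hypothetical plain-Python stand-in, comparator_factory, which is comparatorUDF with the udf(...) wrapper left out. It only illustrates the closure mechanism (the literal is bound once, the column value is supplied later) and says nothing about how Spark serializes the function.

```python
# Plain-Python sketch of the currying idea; no Spark required.
# comparator_factory is a hypothetical name standing in for comparatorUDF
# with the udf(...) wrapper removed.

def comparator_factory(n):
    """Return a one-argument function with the literal n bound in a closure."""
    return lambda c: c == n

# Bind the literal once ...
is_bonsanto = comparator_factory("Bonsanto")

# ... then apply the resulting function to each value of the "name" column.
names = ["Bonsanto", "Hayek", "Mises"]
matches = [is_bonsanto(name) for name in names]
print(matches)
```

Because the second argument never travels through the UDF call itself, Spark has no chance to interpret it as a column name.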
I am trying to learn PySpark, and have tried to learn how to use SQL when() clauses to better categorize my data. (See here: https://sparkbyexamples.com/spark/spark-case-when-otherwise-example/) What I can't seem to get addressed is how to insert actual scalar values into the when() conditions for comparison's sake explicitly. It seems the aggregate functions return more tabular values than actual float() types.
I keep getting this error message: unsupported operand type(s) for -: 'method' and 'method'. When I tried running functions to aggregate another column in the original data frame, I noticed the result was not a flat scalar so much as a table (agg(select(f.stddev("Col")) gives a result like "DataFrame[stddev_samp(TAXI_OUT): double]"). Here is a sample of what I am trying to accomplish if you want to replicate it. I was wondering how you might get aggregate values like the standard deviation and mean within the when() clause, so you can use them to categorize your new column:
from pyspark.sql import functions as f

samp = spark.createDataFrame(
    [("A","A1",4,1.25),("B","B3",3,2.14),("C","C2",7,4.24),("A","A3",4,1.25),("B","B1",3,2.14),("C","C1",7,4.24)],
    ["Category","Sub-cat","quantity","cost"])
psMean = samp.agg({'quantity':'mean'})
psStDev = samp.agg({'quantity':'stddev'})
psCatVect = samp.withColumn('quant_category', f.when(samp['quantity']<=(psMean-psStDev),'small').otherwise('not small'))
psMean and psStDev in your example are DataFrames, not scalars. You need to use the collect() method to extract the scalar values:
psMean = samp.agg({'quantity':'mean'}).collect()[0][0]
psStDev = samp.agg({'quantity':'stddev'}).collect()[0][0]
You could also collect all the stats into a single pandas DataFrame and refer to it later in your PySpark code:
from pyspark.sql import functions as F
stats = (
    samp.select(
        F.mean("quantity").alias("mean"),
        F.stddev("quantity").alias("std")
    ).toPandas()
)
(
    samp.withColumn('quant_category',
        F.when(
            samp['quantity'] <= stats["mean"].item() - stats["std"].item(),
            'small')
        .otherwise('not small')
    )
    .toPandas()
)
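As a sanity check on the threshold, the arithmetic can be reproduced without Spark. The sketch below assumes the six quantity values from the question's sample and uses Python's statistics.stdev, which is the sample standard deviation and so matches the stddev_samp shown in the question's DataFrame output. The variable names are illustrative only.

```python
import statistics

# The "quantity" column of the sample data from the question.
quantities = [4, 3, 7, 4, 3, 7]

mean = statistics.mean(quantities)
std = statistics.stdev(quantities)  # sample standard deviation (n - 1 denominator)
threshold = mean - std

# Same rule as the when(...).otherwise(...) expression, applied row by row.
labels = ["small" if q <= threshold else "not small" for q in quantities]
print(round(threshold, 4), labels)
```

With this data the threshold comes out at roughly 2.80, below the smallest quantity (3), so every row ends up labelled 'not small'.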
I am using Databricks with Spark 2.4, and I am coding in Python.
I have created this function to convert null to an empty string:
def xstr(s):
    if s is None:
        return ""
    return str(s)
Then I have the code below:
from pyspark.sql.functions import *
lv_query = """
SELECT
SK_ID_Site, Designation_Site
FROM db_xxx.t_xxx
ORDER BY SK_ID_Site
limit 2"""
lvResult1 = spark.sql(lv_query)
a = lvResult1.select(map(xstr, col("Designation_Site")))
display(a)
I get this error: TypeError: Column is not iterable
What I need to do here is call a function for each row in my DataFrame. I would like to pass columns as parameters and get a result back.
That's not how Spark works. You cannot apply plain Python code directly to the contents of a Spark DataFrame.
There are already builtin functions that do the job for you.
from pyspark.sql import functions as F
a = lvResult1.select(
    F.when(F.col("Designation_Site").isNull(), "").otherwise(
        F.col("Designation_Site").cast("string")
    )
)
If you need more complex logic that the built-in functions cannot express, you can use a UDF, but it may hurt performance considerably, so check for an existing built-in function before writing your own UDF.
I am trying to define a new column in a Spark DataFrame using a constant defined as a var. I'm using Zeppelin - in the initial cell, it starts with
%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
spark.read.parquet("<path/to/file>")
The file contains a column named birth_year; I want to create a new column named age defined as $year - birth_year, where birth_year is a string column. I'm not quite clear on how to do this when the input argument to a UDF is a parameter. I've done a couple hours of searching and created a UDF, but I got an error message whose principal part is
<console>:71: error: type mismatch;
found : Int
required: org.apache.spark.sql.Column
spark.read.parquet("path/to/file").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge(year, col("birth_year"))).createOrReplaceTempView("tmp")
and a caret directly under 'year'.
I suspect that $year does not map into a variable of the same length as birth_year; I've seen the lit() function that appears to work for strings - does it work with integer values as well, or is there another function for this purpose?
I tried the following:
%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
def createAge = udf((yr : Int, dob : Int) => {yr - dob})
spark.read.parquet("<path/to/file>").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge($"year", col("birth_year"))).createOrReplaceTempView("tmp")
Any suggestions welcome - thanks in advance for any help.
You can't use year directly as an input to the UDF, since it expects columns to operate on. To create a column with a constant value, use lit(). You can call the UDF as follows:
df.withColumn("age", createAge(lit(year), $"birth_year".cast("int")))
However, it is preferable to use Spark's built-in functions whenever possible. In this case you do not need a UDF. Simply do:
df.withColumn("age", lit(year) - $"birth_year".cast("int"))
This should be much faster.
I have a data frame with n number of columns and I want to replace empty strings in all these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from being able to replace values with nulls, see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression would be evaluated once on the driver, not per record. You'd want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, and not Scala's ==, which just compares the driver-side Column object:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))