Pyspark dataframe LIKE operator

What is the PySpark equivalent of the SQL LIKE operator?
For example, I would like to do:
SELECT * FROM table WHERE column LIKE "*somestring*";
I am looking for something simple like this (but it is not working):
df.select('column').where(col('column').like("*s*")).show()

You can use the where and col functions to do this. where filters rows based on a condition (here, whether the column matches '%string%'), col('col_name') refers to the column, and like is the SQL LIKE operator; note that the wildcard is %, not *:
from pyspark.sql.functions import col
df.where(col('col1').like("%string%")).show()

From Spark 2.0.0 onwards, the following also works fine:
df.select('column').where("column like '%s%'").show()

Use the like operator.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
df.filter(df.column.like('%s%')).show()

To replicate the case-insensitive ILIKE, you can use lower in conjunction with like.
from pyspark.sql.functions import col, lower
df.where(lower(col('col1')).like("%string%")).show()

The pattern follows SQL LIKE syntax, so use % as the wildcard:
df.select('column').where(col('column').like("%s%")).show()
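If you need a real regular expression rather than SQL LIKE wildcards, Column.rlike takes one (a minimal sketch, reusing the column name from the question):
from pyspark.sql.functions import col
df.select('column').where(col('column').rlike('.*s.*')).show()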

This worked for me:
import pyspark.sql.functions as f
df.where(f.col('column').like("%x%")).show()

In PySpark you can always register the dataframe as a temporary table and query it with SQL. Note that SQL LIKE uses % as the wildcard:
df.registerTempTable('my_table')
query = """SELECT * FROM my_table WHERE column LIKE '%somestring%'"""
sqlContext.sql(query).show()
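On Spark 2.x and later, registerTempTable is deprecated in favour of createOrReplaceTempView, and the query goes through the SparkSession (a sketch, assuming a SparkSession named spark):
df.createOrReplaceTempView('my_table')
spark.sql("SELECT * FROM my_table WHERE column LIKE '%somestring%'").show()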

Using Spark 2.4, to negate the condition you can simply do:
df = df.filter("column not like '%bla%'")

Also, contains can be used; it matches a literal substring, with no wildcards:
from pyspark.sql.functions import col
df = df.where(col("columnname").contains("somestring"))

I always use a UDF to implement such functionality:
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType
# UDF that returns True when the value contains the letter 's'
like_f = F.udf(lambda col: 's' in col, BooleanType())
df.filter(like_f('column')).select('column')
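Putting the like-based approach into a complete, runnable form (a minimal sketch with a made-up two-row DataFrame):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("somestring here",), ("other",)], ["column"])
# keep only rows whose 'column' contains the substring 'somestring'
df.where(col("column").like("%somestring%")).show()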

Related

How do you use aggregated values within PySpark SQL when() clause?

I am trying to learn PySpark, and have tried to learn how to use SQL when() clauses to better categorize my data. (See here: https://sparkbyexamples.com/spark/spark-case-when-otherwise-example/) What I can't seem to get addressed is how to insert actual scalar values into the when() conditions for comparison. It seems the aggregate functions return tabular values rather than actual float() types.
I keep getting this error message: unsupported operand type(s) for -: 'method' and 'method'. When I tried running functions to aggregate another column in the original data frame, I noticed the result didn't seem to be a flat scalar so much as a table (agg(select(f.stddev("Col"))) gives a result like "DataFrame[stddev_samp(TAXI_OUT): double]"). Here is a sample of what I am trying to accomplish if you want to replicate it; I was wondering how you might get aggregate values like the standard deviation and mean within the when() clause so you can use them to categorize the new column:
from pyspark.sql import functions as F

samp = spark.createDataFrame(
    [("A","A1",4,1.25),("B","B3",3,2.14),("C","C2",7,4.24),("A","A3",4,1.25),("B","B1",3,2.14),("C","C1",7,4.24)],
    ["Category","Sub-cat","quantity","cost"])
psMean = samp.agg({'quantity':'mean'})
psStDev = samp.agg({'quantity':'stddev'})
psCatVect = samp.withColumn('quant_category', F.when(samp['quantity'] <= (psMean - psStDev), 'small').otherwise('not small'))
psMean and psStDev in your example are DataFrames; you need to call the collect() method to extract the scalar values:
psMean = samp.agg({'quantity':'mean'}).collect()[0][0]
psStDev = samp.agg({'quantity':'stddev'}).collect()[0][0]
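With those scalars in hand, the withColumn from the question works as intended (a sketch, reusing the names above):
from pyspark.sql import functions as F
psCatVect = samp.withColumn(
    'quant_category',
    F.when(samp['quantity'] <= (psMean - psStDev), 'small').otherwise('not small'))
psCatVect.show()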
You could also collect all the stats into a single pandas DataFrame and refer to it later in the PySpark code:
from pyspark.sql import functions as F

stats = (
    samp.select(
        F.mean("quantity").alias("mean"),
        F.stddev("quantity").alias("std")
    ).toPandas()
)

(
    samp.withColumn(
        'quant_category',
        F.when(
            samp['quantity'] <= stats["mean"].item() - stats["std"].item(),
            'small'
        ).otherwise('not small')
    )
    .toPandas()
)

From pyspark agg function to int

I am counting rows that match a condition in PySpark:
df.agg(count(when((col("my_value")==0),True))).show()
It works as I expected. How can I then extract the value shown in the table and store it in a Python variable?
If you just want to count the zeros, it is simpler to do this:
from pyspark.sql import functions as F
pythonVariable = df.where(F.col('my_value') == 0).count()
As you can see, there is no need to map the zeros to True in order to count them.
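If you prefer to keep the agg/count/when expression from the question, the scalar can be pulled out of the aggregated DataFrame with collect (a minimal sketch):
from pyspark.sql.functions import col, count, when
python_variable = df.agg(count(when(col("my_value") == 0, True))).collect()[0][0]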

How do I select columns from a Spark dataframe when I also need to use withColumnRenamed?

I have a dataframe:
df = df.select("employee_id", "employee_name", "employee_address")
I need to rename the first two fields, but also still select the third field. So I thought this would work, but this appears to only select employee_address.
df = (df.withColumnRenamed("employee_id", "empId")
.withColumnRenamed("employee_name", "empName")
.select("employee_address")
)
How do I properly rename the first two fields while also selecting the third field?
I tried a mix of withColumn usages, but that doesn't work. Do I have to use a select on all three fields?
You can use the alias method:
import pyspark.sql.functions as func
df = df.select(
    func.col("employee_id").alias("empId"),
    func.col("employee_name").alias("empName"),
    func.col("employee_address")
)
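If you prefer to keep withColumnRenamed, the renames preserve every column, so the final select just needs to list all three fields (so yes, you do select all three). A minimal sketch:
df = (df.withColumnRenamed("employee_id", "empId")
        .withColumnRenamed("employee_name", "empName")
        .select("empId", "empName", "employee_address"))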

PySpark - iterate rows of a Data Frame

I need to iterate over the rows of a pyspark.sql.dataframe.DataFrame.
I have done this in pandas in the past with the iterrows() function, but I need to find something similar for PySpark without using pandas.
If I do for row in myDF: it iterates over the columns.
Thanks
You can use the select method to operate on your dataframe with a user-defined function, something like this:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
columns = myDF.columns
my_udf = F.udf(lambda data: "do whatever you want here", StringType())
myDF.select(*[my_udf(col(c)) for c in columns])
Then inside the select you can choose what you want to do with each column.
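If you genuinely need to loop over rows on the driver, collect() or toLocalIterator() returns Row objects (a minimal sketch, assuming myDF has a column named 'column'; fine for small results, costly for large ones):
# toLocalIterator() fetches the rows to the driver one partition at a time
for row in myDF.toLocalIterator():
    print(row['column'])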

Math.abs function in Select statement in Scala

I have the following code in Scala:
val FilteredPSPDF = PSPDF.select("accountname","amount", "currency", "datestamp","orderid","transactiontype")
However, I have some values in column "amount" which are negative and I need to change them to positive values. Is it possible to do this arithmetic function within the Select statement? How do I go about this?
There's an abs function available in Spark SQL.
You can either use selectExpr instead of select:
PSPDF.selectExpr("accountname", "abs(amount) as amount", "currency", "datestamp", "orderid", "transactiontype")
or use select's overloaded version that takes Column objects:
import org.apache.spark.sql.functions.abs
import spark.implicits._ // for the $"col" syntax
PSPDF.select($"accountname", abs($"amount").as("amount"), $"currency", $"datestamp", $"orderid", $"transactiontype")
You can also use the built-in when and negate functions:
import org.apache.spark.sql.functions._
val FilteredPSPDF = PSPDF.select(
  col("accountname"),
  when(col("amount") < 0, negate(col("amount"))).otherwise(col("amount")).as("amount"),
  col("currency"),
  col("datestamp"),
  col("orderid"),
  col("transactiontype")
)
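Since the rest of this page is PySpark, here is the equivalent in Python (a minimal sketch, assuming the same DataFrame and column names):
from pyspark.sql.functions import abs as abs_, col
FilteredPSPDF = PSPDF.select(
    col("accountname"),
    abs_(col("amount")).alias("amount"),
    col("currency"),
    col("datestamp"),
    col("orderid"),
    col("transactiontype")
)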