I have some Postgres code like this and I want to convert it to PySpark, but I am having trouble with what to put inside my LIKE operator.
My PySpark code is something like this:
(
    customer.select('first_name', 'last_name')
    .where(F.col('email').like(F.concat(F.col('first_name'), F.lit('.'), F.col('last_name'), F.lit('#sakilacustomer.org'))))
    .show()
)
The below code, with just where and concat inside a SQL expression, worked fine for me:
from pyspark.sql.functions import *
tdf=spark.createDataFrame([("abc","def", "abc.def#test.com"),
("abc","defd", "abc.def#test.com"),
("abc","defd", "abc.deqf#test.com"),
("aabc","def", "aabc.def123#test.com")]).toDF("na","la","em")
tdf.where("em like concat(na,'.',la,'%#test.com')").show()
#output
+----+---+--------------------+
|  na| la|                  em|
+----+---+--------------------+
| abc|def| abc.def#test.com|
|aabc|def|aabc.def123#test.com|
+----+---+--------------------+
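Applying the same idea back to the customer DataFrame from the question, a minimal sketch (the customer DataFrame and its first_name, last_name and email columns are assumed from the question) is to build the pattern inside a SQL expression rather than passing a Column to like(), since Column.like() expects a plain string pattern:

from pyspark.sql import functions as F

# Sketch only: customer and its columns are assumed from the question.
# The pattern is built in SQL via F.expr() because Column.like() takes a string, not a Column.
(
    customer
    .where(F.expr("email like concat(first_name, '.', last_name, '#sakilacustomer.org')"))
    .select('first_name', 'last_name')
    .show()
)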
I have a pyspark dataframe like this:
+-------+-------+--------------------+
|s_field|s_check|            t_filter|
+-------+-------+--------------------+
|  MANDT|   true|                 !=E|
|  WERKS|   true|0010_0020_0021_00...|
+-------+-------+--------------------+
And as a first step, I split t_filter based on _ with f.split(f.col("t_filter"), "_")
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))
filters.show(truncate=False)
+-------+-------+--------------------+-------------------------+
|s_field|s_check|t_filter            |t_filter_1               |
+-------+-------+--------------------+-------------------------+
|MANDT  |true   |070_70              |[!= E]                   |
|WERKS  |true   |0010_0020_0021_00...|[0010, 0020, 0021, 00...]|
+-------+-------+--------------------+-------------------------+
What I want to achieve is to create a new column, using s_field and t_filter as the input while doing a logic check for !=.
Ultimate aim:
+------------------------------+
|t_filter_2                    |
+------------------------------+
|MANDT != 'E'                  |
|WERKS in ('0010', '0020', ...)|
+------------------------------+
I have tried using withColumn but I keep getting an error saying col must be Column.
I am also not sure what the proper approach should be to achieve this.
Note: there is a large number of rows, around 10k. I understand that using a UDF would be quite expensive, so I'm interested to know if there are other ways this can be done.
You can achieve this using withColumn with conditional evaluation via the when and otherwise functions. Following your example, the logic is: if t_filter contains !=, concatenate s_field and t_filter; otherwise, first convert the t_filter_1 array to a string with ',' as the separator, then concatenate it with s_field along with literals for in and the parentheses.
from pyspark.sql import functions as f
filters.withColumn(
    "t_filter_2",
    f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
        f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
    ),
)
Output
+-------+-------+--------------------+-------------------------+---------------------------------------+
|s_check|s_field|t_filter            |t_filter_1               |t_filter_2                             |
+-------+-------+--------------------+-------------------------+---------------------------------------+
|true   |MANDT  |!='E'               |[!='E']                  |MANDT!='E'                             |
|true   |WERKS  |0010_0020_0021_00...|[0010, 0020, 0021, 00...]|WERKS in ('0010','0020','0021','00...')|
+-------+-------+--------------------+-------------------------+---------------------------------------+
Complete Working Example
from pyspark.sql import functions as f
filters_data = [
    {"s_field": "MANDT", "s_check": True, "t_filter": "!='E'"},
    {"s_field": "WERKS", "s_check": True, "t_filter": "0010_0020_0021_00..."},
]
filters = spark.createDataFrame(filters_data)
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))
filters.withColumn(
    "t_filter_2",
    f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
        f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
    ),
).show(200, False)
Running a simple example -
from pyspark.sql import functions as F

dept = [("Finance",10),("Marketing",None),("Sales",30),("IT",40)]
deptColumns = ["dept_name","dept_id"]
rdd = sc.parallelize(dept)
df = rdd.toDF(deptColumns)
df.show(truncate=False)
print('count the dept_id, should be 3')
print('count: ' + str(df.select(F.col("dept_id")).count()))
We get the following output -
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|null   |
|Sales    |30     |
|IT       |40     |
+---------+-------+
count the dept_id, should be 3
count: 4
I'm running on Databricks and this is my stack:
Spark 3.0.1, Scala 2.12, DBR 7.3 LTS
Thanks for any help!!
There is a subtle difference between the count function of the Dataframe API and the count function of Spark SQL. The first one simply counts the rows while the second one can ignore null values.
You are using Dataframe.count(). According to the documentation, this function
returns the number of rows in this DataFrame
So the result 4 is correct as there are 4 rows in the dataframe.
If null values should be ignored, you can use the Spark SQL function count which can ignore null values:
count(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null.
For example
df.selectExpr("count(dept_id)").show()
returns 3.
An alternative to @werner's solution is to use pyspark.sql.functions:
from pyspark.sql import functions as F
print('count: ' + str(df.select(F.count(F.col("dept_id"))).collect()))
Your code is not complete. Maybe you can just add isNotNull() before count().
My code is like this:
from pyspark.sql.functions import col, count
print('count the dept_id, should be 3')
print('count: ' + str(df.filter(col("dept_id").isNotNull()).count()))
I have 2 statements which are, to my knowledge, exactly alike, but select() works fine while selectExpr() generates the following results.
+-----------------------+----------------------+
|first(StockCode, false)|last(StockCode, false)|
+-----------------------+----------------------+
|                 85123A|                 22138|
+-----------------------+----------------------+
+-----------+----------+
|first_value|last_value|
+-----------+----------+
|  StockCode| StockCode|
+-----------+----------+
Following is the implementation:
df.select(first(col("StockCode")), last(col("StockCode"))).show()
df.selectExpr("""first('StockCode') as first_value""", """last('StockCode') as last_value""").show()
Can anyone explain the behaviour?
selectExpr takes everything as a select clause in SQL.
Hence if you write anything in single quotes ('), it will act as a string in SQL. If you want to pass the column to selectExpr, use a backtick (`) as below:
df.selectExpr("""first(`StockCode`) as first_value""", """last(`StockCode`) as last_value""").show()
The backtick will also help you escape spaces in the column name.
You can use it without backticks too, as long as your column name doesn't start with a number (like 12col) and doesn't have spaces in it (like column name):
df.selectExpr("""first(StockCode) as first_value""", """last(StockCode) as last_value""").show()
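A quick way to see this (assuming the same df as above): a single-quoted value inside selectExpr is parsed as a SQL string literal, so every row simply returns the text itself instead of the column value.

# Sketch only: the quoted text is a literal, not a column reference.
df.selectExpr("'StockCode' as literal_value").show(2)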
You should pass it like below:
df_b = df_b.selectExpr('first(count) as first', 'last(count) as last')
df_b.show(truncate = False)
+-----+----+
|first|last|
+-----+----+
|2527 |13  |
+-----+----+
I have the following string, 103400, and I need to write it as 10:34:00 using PySpark. Let's take the following column as an example:
time
130045
230022
And I want it to become like this:
time
13:00:45
23:00:22
You can try with regexp_replace:
from pyspark.sql.functions import col, regexp_replace

df.withColumn("time", regexp_replace(col("time"), "(\\d{2})(\\d{2})(\\d{2})", "$1:$2:$3")).show()
+--------+
|    time|
+--------+
|13:00:45|
|23:00:22|
+--------+
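If you prefer to avoid regular expressions, a minimal alternative sketch (assuming time is always a fixed-width 6-digit string, HHmmss) is to slice it with substring and rejoin the pieces with concat_ws:

from pyspark.sql import functions as F

# Sketch only: assumes the column always holds exactly 6 digits (HHmmss).
# substring uses 1-based positions; concat_ws joins the parts with ':'.
df.withColumn(
    "time",
    F.concat_ws(":", F.substring("time", 1, 2), F.substring("time", 3, 2), F.substring("time", 5, 2)),
).show()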
I'm running a Spark application, using EMR through the pyspark interactive shell.
I'm trying to connect to a Hive table named content_publisher_events_log, which I know isn't empty (I ran exactly the same query through my Hue console), but when I try to read it through pyspark I get count=0, as follows:
from pyspark.sql import HiveContext
Query=""" select dt
from default.content_publisher_events_log
where dt between '20170415' and '20170419'
"""
hive_context = HiveContext(sc)
user_data = hive_context.sql(Query)
user_data.count()
0 #that's the result
Also, from the console I can see that this table exists:
>>> sqlContext.sql("show tables").show()
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|content_publisher...|      false|
| default|  feed_installer_log|      false|
| default|keyword_based_ads...|      false|
| default|search_providers_log|      false|
+--------+--------------------+-----------+
>>> user_data.printSchema()
root
|-- dt: string (nullable = true)
Also checked on the Spark history server - it seems like the job that ran the count completed without any errors. Any idea what could be going wrong?
Thanks in advance!
The dt column isn't in a datetime format. Either properly change the column itself to have a datetime format, or change the query itself to cast the string as a timestamp:
Query=""" select dt
from default.content_publisher_events_log
where dt between
unix_timestamp('20170415','yyyyMMdd') and
unix_timestamp('20170419','yyyyMMdd')
"""
It seems like our data team moved the parquet file of each partition into a subfolder; they fixed it, and starting from April 25th it works perfectly.
As far as I know, if anyone else is facing this issue, try something like this:
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
or this one:
sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive","true")