LEFT and In Function - PySpark SQL - pyspark

I am trying to convert the SQL query below to PySpark, but somehow it is not working.
SELECT DISTINCT *
FROM Dataset
WHERE LEFT(PAT, 3) IN ('123', '203')
I have tried converting the query to PySpark as indicated below:
df_data=PAT_Data
df_data.where(df_data.PAT.substr(1,3)='123').show
OR
df_data.filter(col("PAT").like("123%")).show()
Any thoughts?
Thanks.

You can use the isin operator after taking a substring of the PAT column:
df_data = spark.createDataFrame([['123221'], ['2321'], ['123221'], ['20322']], ['PAT'])
df_data.show()
+------+
| PAT|
+------+
|123221|
| 2321|
|123221|
| 20322|
+------+
df_data.where(df_data.PAT.substr(1,3).isin(['123', '203'])).show()
+------+
| PAT|
+------+
|123221|
|123221|
| 20322|
+------+
To drop duplicates:
df_data.where(df_data.PAT.substr(1,3).isin(['123', '203'])).dropDuplicates().show()
+------+
| PAT|
+------+
| 20322|
|123221|
+------+

Check if the following works for you:
df_data.where('PAT like "123%"').show()
df_data.where('PAT rlike "^(123|203)"').distinct().show()
df_data.where('substr(PAT,1,3) in (123,203)').distinct().show()
BTW, tested on spark.sparkContext.version = '2.2.1'.
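For completeness, the original SQL can also be run almost verbatim by registering the DataFrame as a temporary view. A minimal sketch using the example df_data above (substr is used instead of left so it also works on Spark versions before 2.3):
df_data.createOrReplaceTempView('Dataset')
spark.sql("""
    SELECT DISTINCT *
    FROM Dataset
    WHERE substr(PAT, 1, 3) IN ('123', '203')
""").show()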

Related

np.where logic in pyspark dataframe

I'm looking for a way to get the characters after the 2nd position from a string in a dataframe column, but only if the length of the string is greater than 2, and place them into another column, otherwise null. I have several other columns in the Spark dataframe.
I have a Spark dataframe that looks like this:
animal
======
mo
cat
mouse
snake
reptiles
I want something like this:
remainder
========
null
t
use
ake
ptiles
I can do it using np.where on a pandas dataframe like below:
import numpy as np
df['remainder'] = np.where(df['animal'].str.len() > 2, df['animal'].str[2:], None)
How do I do the same in a pyspark dataframe?
You can easily do this with a combination of when-otherwise and substring.
Data Preparation
from io import StringIO
import pandas as pd
from pyspark.sql import functions as F

s = StringIO("""
animal
mo
cat
mouse
snake
reptiles
""")
df = pd.read_csv(s, delimiter=',')
# 'sql' here is the SparkSession / SQLContext from the environment
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+--------+
| animal|
+--------+
| mo|
| cat|
| mouse|
| snake|
|reptiles|
+--------+
When-Otherwise - Substring
sparkDF = (sparkDF.withColumn('animal_length', F.length(F.col('animal')))
                  .withColumn('remainder', F.when(F.col('animal_length') > 2,
                                                  F.substring(F.col('animal'), 2, 1000))
                                            .otherwise(None))
                  .drop('animal_length'))
sparkDF.show()
+--------+---------+
| animal|remainder|
+--------+---------+
| mo| null|
| cat| at|
| mouse| ouse|
| snake| nake|
|reptiles| eptiles|
+--------+---------+
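Note that the output above keeps everything from the 2nd character onward, while the question's expected output starts after the 2nd character. If the latter is what you want, start the substring at position 3 instead; a minimal sketch on the same sparkDF:
sparkDF = sparkDF.withColumn('remainder',
                             F.when(F.length(F.col('animal')) > 2,
                                    F.substring(F.col('animal'), 3, 1000))
                              .otherwise(None))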

transform distinct row values to different columns with corresponding rows using Pyspark

I'm new to PySpark and trying to transform data.
Given the dataframe:
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:
A B C
id1a id1b id1c
id2a id2b id2c
id3a id3b id3c
id4a null null
I have tried pivot, but that only gives the first value.
There might be a better way, but one approach is to split the column on whitespace to create an array of the entries, and then use higher-order functions (Spark 2.4+) to split each entry of that array on '='. Then explode and create 2 columns, one with the id and one with the value. Then we can assign a row number within each partition, group by, and pivot:
import pyspark.sql.functions as F
df1 = (df.withColumn("Col1", F.split(F.col("Col1"), r"\s+"))
         .withColumn("Col1", F.explode(F.expr("transform(Col1, x -> split(x, '='))")))
         .select(F.col("Col1")[0].alias("cols"), F.col("Col1")[1].alias("vals")))
from pyspark.sql import Window
w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum",F.row_number().over(w)).groupBy("Rnum")
.pivot("cols").agg(F.first("vals")).orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum| A| B| C| D|
+----+----+----+----+----+
| 1|id1a|id1b|id1c|id1d|
| 2|id2a|id2b|id2c|null|
| 3|id3a|id3b|id3c|null|
| 4|id4a|null|null|null|
+----+----+----+----+----+
This is how df1 looks after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
| A|id1a|
| A|id2a|
| B|id1b|
| C|id1c|
| B|id2b|
| D|id1d|
| A|id3a|
| B|id3b|
| C|id2c|
| A|id4a|
| C|id3c|
+----+----+
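One caveat with the answer above: the window orders by the partition key itself, so the row numbers depend on the incoming row order. A hedged tweak (a sketch, not part of the original answer) is to carry an explicit sequence column in df1 and order the window by it before computing Rnum:
df1 = df1.withColumn("seq", F.monotonically_increasing_id())
w = Window.partitionBy("cols").orderBy("seq")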
Maybe I don't know the full picture, but the data format seems strange. If nothing can be done at the data source, then some collects, pivots and joins will be needed. Try this:
import pyspark.sql.functions as F
from functools import reduce

test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b', 1),
                                   ('D=id1d A=id3a B=id3b C=id2c', 2),
                                   ('A=id4a C=id3c', 3)], schema=['col1', 'id'])
tst_spl = test.withColumn("item", F.split('col1', " "))
tst_xpl = tst_spl.select(F.explode("item"))
tst_map = (tst_xpl.withColumn("key", F.split('col', '=')[0])
                  .withColumn("value", F.split('col', '=')[1])
                  .drop('col'))
#%%
tst_pivot = tst_map.groupby(F.lit(1)).pivot('key').agg(F.collect_list('value')).drop('1')
#%%
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col', coln)
           for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1, df2: df1.join(df2, on='pos', how='full'), tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos| A| B| C| D|
+---+----+----+----+----+
| 0|id3a|id3b|id1c|id1d|
| 1|id4a|id1b|id2c|null|
| 2|id1a|id2b|id3c|null|
| 3|id2a|null|null|null|
+---+----+----+----+----+
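If the ids happen to sort lexically into the desired row order, as they do in this example, sorting each collected list makes the alignment deterministic. A sketch of the pivot step above with array_sort (Spark 2.4+):
tst_pivot = (tst_map.groupby(F.lit(1)).pivot('key')
                    .agg(F.array_sort(F.collect_list('value'))).drop('1'))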

How to check whether the whole column in a pyspark dataframe contains a value using Expr

In PySpark, how can I use expr to check whether a whole column contains the value in columnA of that row?
Pseudo code below:
df = df.withColumn("Result", expr("if any of the rows in column1 contain the value of colA (for this row) then 1 else 0"))
Take an arbitrary example:
valuesCol = [('rose','rose is red'),('jasmine','I never saw Jasmine'),('lily','Lili dont be silly'),('daffodil','what a flower')]
df = sqlContext.createDataFrame(valuesCol,['columnA','columnB'])
df.show()
+--------+-------------------+
| columnA| columnB|
+--------+-------------------+
| rose| rose is red|
| jasmine|I never saw Jasmine|
| lily| Lili dont be silly|
|daffodil| what a flower|
+--------+-------------------+
Here is an application of expr(). As for how one can use expr(): just look up the corresponding SQL syntax, and it will mostly work with expr() as-is.
df = df.withColumn('columnA_exists',expr("(case when instr(lower(columnB), lower(columnA))>=1 then 1 else 0 end)"))
df.show()
+--------+-------------------+--------------+
| columnA| columnB|columnA_exists|
+--------+-------------------+--------------+
| rose| rose is red| 1|
| jasmine|I never saw Jasmine| 1|
| lily| Lili dont be silly| 0|
|daffodil| what a flower| 0|
+--------+-------------------+--------------+
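The same check can also be written with the DataFrame API instead of expr(); a minimal sketch on the example df above:
from pyspark.sql import functions as F
df = df.withColumn('columnA_exists',
                   F.when(F.lower(F.col('columnB')).contains(F.lower(F.col('columnA'))), 1)
                    .otherwise(0))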

Use PySpark Dataframe column in another spark sql query

I have a situation where I'm trying to query a table and use the result (dataframe) from that query in the IN clause of another query.
From the first query I have the dataframe below:
+-----------------+
|key |
+-----------------+
| 10000000000004|
| 10000000000003|
| 10000000000008|
| 10000000000009|
| 10000000000007|
| 10000000000006|
| 10000000000010|
| 10000000000002|
+-----------------+
And now I want to run a query like the one below, using the values of that dataframe dynamically instead of hard-coding the values:
spark.sql("""select country from table1 where key in (10000000000004, 10000000000003, 10000000000008, 10000000000009, 10000000000007, 10000000000006, 10000000000010, 10000000000002)""").show()
I tried the following, however it didn't work:
df = spark.sql("""select key from table0 """)
a = df.select("key").collect()
spark.sql("""select country from table1 where key in ({0})""".format(a)).show()
Can somebody help me?
You should use an (inner) join between two data frames to get the countries you would like. See my example:
# Create a list of countries with Id's
countries = [('Netherlands', 1), ('France', 2), ('Germany', 3), ('Belgium', 4)]
# Create a list of Ids
numbers = [(1,), (2,)]
# Create two data frames
df_countries = spark.createDataFrame(countries, ['CountryName', 'Id'])
df_numbers = spark.createDataFrame(numbers, ['Id'])
The data frames look like the following:
df_countries:
+-----------+---+
|CountryName| Id|
+-----------+---+
|Netherlands| 1|
| France| 2|
| Germany| 3|
| Belgium| 4|
+-----------+---+
df_numbers:
+---+
| Id|
+---+
| 1|
| 2|
+---+
You can join them as follows:
df_countries.join(df_numbers, on='Id', how='inner')
Resulting in:
+---+-----------+
| Id|CountryName|
+---+-----------+
| 1|Netherlands|
| 2| France|
+---+-----------+
Hope that clears things up!
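If the dynamically built IN clause is really needed, note that the attempt in the question fails because collect() returns Row objects, so format(a) produces text like Row(key=...). A sketch that extracts the plain values first:
keys = [row['key'] for row in spark.sql("select key from table0").collect()]
in_list = ", ".join(str(k) for k in keys)
spark.sql("select country from table1 where key in ({0})".format(in_list)).show()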

How to extract the numeric part from a string column in spark?

I am new to Spark and trying to play with data to get practice. I am using Databricks in Scala, and for the dataset I am using the FIFA 19 complete player dataset from Kaggle. One of the columns, named "Weight", contains data that looks like:
+------+
|Weight|
+------+
|136lbs|
|156lbs|
|136lbs|
|... |
|... |
+------+
I want to change the column so that it looks like this:
+------+
|Weight|
+------+
|136 |
|156 |
|136 |
|... |
|... |
+------+
Can anyone help with how I can change the column values in Spark SQL?
Here is another way, using a regex and the regexp_extract built-in function:
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._ // for toDF and the $"..." column syntax

val df = Seq("136lbs", "150lbs", "12lbs", "30kg", "500kg").toDF("weight")

df.withColumn("weight_num", regexp_extract($"weight", "\\d+", 0))
  .withColumn("weight_unit", regexp_extract($"weight", "[a-z]+", 0))
  .show
//Output
+------+----------+-----------+
|weight|weight_num|weight_unit|
+------+----------+-----------+
|136lbs| 136| lbs|
|150lbs| 150| lbs|
| 12lbs| 12| lbs|
| 30kg| 30| kg|
| 500kg| 500| kg|
+------+----------+-----------+
You can create a new column and use regexp_replace:
dataFrame.withColumn("Weight2", regexp_replace($"Weight" , lit("lbs"), lit("")))
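Since the rest of this page is PySpark, here is a hedged PySpark equivalent of the same idea, extracting the digits and casting them to an integer (a sketch against a dataframe with a Weight column):
from pyspark.sql import functions as F
df = df.withColumn("Weight", F.regexp_extract(F.col("Weight"), r"\d+", 0).cast("int"))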