How to replace leading 0 with 91 using regex in pyspark dataframe - pyspark

In python I am doing this to replace leading 0 in column phone with 91.
But how to do it in pyspark.
con dataframe is :
id phone1
1 088976854667
2 089706790002
Outptut i want is
1 9188976854667
2 9189706790002
# Replace leading Zeros in a phone number with 91
con.filter(regex='[_]').replace('^0','385',regex=True)

You are looking for the regexp_replace function. This function takes 3 parameter:
column name
pattern
repleacement
from pyspark.sql import functions as F
columns = ['id', 'phone1']
vals = [(1, '088976854667'),(2, '089706790002' )]
df = spark.createDataFrame(vals, columns)
df = df.withColumn('phone1', F.regexp_replace('phone1',"^0", "91"))
df.show()
Output:
+---+-------------+
| id| phone1|
+---+-------------+
| 1|9188976854667|
| 2|9189706790002|
+---+-------------+

Related

Joining dataframe and replacing column value in scala

I am trying to join two apache spark sql DataFrame and replace column value of first dataframe with another.
Eg:
Df1:
col1 | col2 | other columns .... say (col-x, col-y, col-z)
------------ |--------------------------------
x | a |random values
y | b |random values
z | c |random values
Df2:
col1 | col3 | other columns .. say (col-a, col-b, col-c)
-------------|--------------------------------
x | a1 |different random values
y | b1 |different random values
w | w1 |different random values
resultant dataframe should be
DF:
col1 | col2 | other columns of DF1 (col-x. col-y, col-z)
-------------|-------------------------------
a1 | a |random values
b1 | b |random values
z | c |random values
I need to perform left join and replace values of DF1.col1 with DF2.col3 wherever DF1.col1 = DF2.col1.
I am not sure how to do that. Furthermore, as it can be seen in above example, DF1 has a lot more columns other than "col1" and "col2" and I cannot apply select on all of them.
I was trying something like,
val df = df1.join(df2, Seq("col1"), "left").select(
coalesce(df2("col2"), df1("col1")).as("col1")
)
but this doesn't seem to work. Also, I think it will filter out other columns of DF1. I want to keep all columns of DF1.
How can I do this in Scala?
You can construct the required 3 columns as follows.
val df = df1.join(df2, Seq("col1"), "left").select(coalesce(df2("col3"), df1("col1")).as("col1"),col("col2"), col("colx"))
For get all columns from "df1" after join, alias can be used for Dataframe:
val updatedCol1 = coalesce(df2("col3"), df1("col1")).alias("col1")
val columns = updatedCol1 :: df1.columns
.filterNot(_ == "col1")
.map(cname => col("df1." + cname))
.toList
df1.alias("df1")
.join(df2, Seq("col1"), "left")
.select(columns: _*)

PySpark, create line graph from a dataframe without a "category" on databricks

I'm running the following code on databricks:
dataToShow = jDataJoined.\
withColumn('id', monotonically_increasing_id()).\
filter(
(jDataJoined.containerNumber == 'SUDU8108536')).\
select(col('id'), col('returnTemperature'), col('supplyTemperature'))
This will give me tabular data like
Now I would like to display a line graph with this returnTemperature and supplyTemperature as categories.
As far as I understood, the method display in databricks wants as second argument the category, so basically what I should have is something like
id - temperatureCategory - value
1 - returnTemperature - 25.0
1 - supplyTemperature - 27.0
2 - returnTemperature - 24.0
2 - supplyTemperature - 28.0
How can I transform the dataframe in this way?
I don't know if your format is what the display method is expecting, but you can do this transformation with the sql functions create_map and explode:
#creates a example df
from pyspark.sql import functions as F
l1 = [(1,25.0,27.0),(2,24.0,28.0)]
df = spark.createDataFrame(l1,['id','returnTemperature','supplyTemperature'])
#creates a map column which contains the values of the returnTemperature and supplyTemperature
df = df.withColumn('mapCol', F.create_map(
F.lit('returnTemperature'),df.returnTemperature
,F.lit('supplyTemperature'),df.supplyTemperature
)
)
#The explode function creates a new row for each element of the map
df = df.select('id',F.explode(df.mapCol).alias('temperatureCategory','value'))
df.show()
Output:
+---+-------------------+-----+
| id|temperatureCategory|value|
+---+-------------------+-----+
| 1 | returnTemperature| 25.0|
| 1 | supplyTemperature| 27.0|
| 2 | returnTemperature| 24.0|
| 2 | supplyTemperature| 28.0|
+---+-------------------+-----+

How to merge two or more columns into one?

I have a streaming Dataframe that I want to calculate min and avg over some columns.
Instead of getting separate resulting columns of min and avg after applying the operations, I want to merge the min and average output into a single column.
The dataframe look like this:
+-----+-----+
| 1 | 2 |
+-----+-----+-
|24 | 55 |
+-----+-----+
|20 | 51 |
+-----+-----+
I thought I'd use a Scala tuple for it, but that does not seem to work:
val res = List("1","2").map(name => (min(col(name)), avg(col(name))).as(s"result($name)"))
All code used:
val res = List("1","2").map(name => (min(col(name)),avg(col(name))).as(s"result($name)"))
val groupedByTimeWindowDF1 = processedDf.groupBy($"xyz", window($"timestamp", "60 seconds"))
.agg(res.head, res.tail: _*)
I'm expecting the output after applying the min and avg mathematical opearations to be:
+-----------+-----------+
| result(1)| result(2)|
+-----------+-----------+
|20 ,22 | 51,53 |
+-----------+-----------+
How I should write the expression?
Use struct standard function:
struct(colName: String, colNames: String*): Column
struct(cols: Column*): Column
Creates a new struct column that composes multiple input columns.
That gives you the values as well as the names (of the columns).
val res = List("1","2").map(name =>
struct(min(col(name)), avg(col(name))) as s"result($name)")
^^^^^^ HERE
The power of struct can be seen when you want to reference one field in the struct and you can use the name (not index).
q.select("structCol.name")
What you want to do is to merge the values of multiple columns together in a single column. For this you can use the array function. In this case it would be:
val res = List("1","2").map(name => array(min(col(name)),avg(col(name))).as(s"result($name)"))
Which will give you :
+------------+------------+
| result(1)| result(2)|
+------------+------------+
|[20.0, 22.0]|[51.0, 53.0]|
+------------+------------+

pyspark generate row hash of specific columns and add it as a new column

I am working with spark 2.2.0 and pyspark2.
I have created a DataFrame df and now trying to add a new column "rowhash" that is the sha2 hash of specific columns in the DataFrame.
For example, say that df has the columns: (column1, column2, ..., column10)
I require sha2((column2||column3||column4||...... column8), 256) in a new column "rowhash".
For now, I tried using below methods:
1) Used hash() function but since it gives an integer output it is of not much use
2) Tried using sha2() function but it is failing.
Say columnarray has array of columns I need.
def concat(columnarray):
concat_str = ''
for val in columnarray:
concat_str = concat_str + '||' + str(val)
concat_str = concat_str[2:]
return concat_str
and then
df1 = df1.withColumn("row_sha2", sha2(concat(columnarray),256))
This is failing with "cannot resolve" error.
Thanks gaw for your answer. Since I have to hash only specific columns, I created a list of those column names (in hash_col) and changed your function as :
def sha_concat(row, columnarray):
row_dict = row.asDict() #transform row to a dict
concat_str = ''
for v in columnarray:
concat_str = concat_str + '||' + str(row_dict.get(v))
concat_str = concat_str[2:]
#preserve concatenated value for testing (this can be removed later)
row_dict["sha_values"] = concat_str
row_dict["sha_hash"] = hashlib.sha256(concat_str).hexdigest()
return Row(**row_dict)
Then passed as :
df1.rdd.map(lambda row: sha_concat(row,hash_col)).toDF().show(truncate=False)
It is now however failing with error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 8: ordinal not in range(128)
I can see value of \ufffd in one of the column so I am unsure if there is a way to handle this ?
You can use pyspark.sql.functions.concat_ws() to concatenate your columns and pyspark.sql.functions.sha2() to get the SHA256 hash.
Using the data from #gaw:
from pyspark.sql.functions import sha2, concat_ws
df = spark.createDataFrame(
[(1,"2",5,1),(3,"4",7,8)],
("col1","col2","col3","col4")
)
df.withColumn("row_sha2", sha2(concat_ws("||", *df.columns), 256)).show(truncate=False)
#+----+----+----+----+----------------------------------------------------------------+
#|col1|col2|col3|col4|row_sha2 |
#+----+----+----+----+----------------------------------------------------------------+
#|1 |2 |5 |1 |1b0ae4beb8ce031cf585e9bb79df7d32c3b93c8c73c27d8f2c2ddc2de9c8edcd|
#|3 |4 |7 |8 |57f057bdc4178b69b1b6ab9d78eabee47133790cba8cf503ac1658fa7a496db1|
#+----+----+----+----+----------------------------------------------------------------+
You can pass in either 0 or 256 as the second argument to sha2(), as per the docs:
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). The numBits indicates the desired bit length of the result, which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 256).
The function concat_ws takes in a separator, and a list of columns to join. I am passing in || as the separator and df.columns as the list of columns.
I am using all of the columns here, but you can specify whatever subset of columns you'd like- in your case that would be columnarray. (You need to use the * to unpack the list.)
If you want to have the hash for each value in the different columns of your dataset you can apply a self-designed function via map to the rdd of your dataframe.
import hashlib
test_df = spark.createDataFrame([
(1,"2",5,1),(3,"4",7,8),
], ("col1","col2","col3","col4"))
def sha_concat(row):
row_dict = row.asDict() #transform row to a dict
columnarray = row_dict.keys() #get the column names
concat_str = ''
for v in row_dict.values():
concat_str = concat_str + '||' + str(v) #concatenate values
concat_str = concat_str[2:]
row_dict["sha_values"] = concat_str #preserve concatenated value for testing (this can be removed later)
row_dict["sha_hash"] = hashlib.sha256(concat_str).hexdigest() #calculate sha256
return Row(**row_dict)
test_df.rdd.map(sha_concat).toDF().show(truncate=False)
The Results would look like:
+----+----+----+----+----------------------------------------------------------------+----------+
|col1|col2|col3|col4|sha_hash |sha_values|
+----+----+----+----+----------------------------------------------------------------+----------+
|1 |2 |5 |1 |1b0ae4beb8ce031cf585e9bb79df7d32c3b93c8c73c27d8f2c2ddc2de9c8edcd|1||2||5||1|
|3 |4 |7 |8 |cb8f8c5d9fd7165cf3c0f019e0fb10fa0e8f147960c715b7f6a60e149d3923a5|8||4||7||3|
+----+----+----+----+----------------------------------------------------------------+----------+
New in version 2.0 is the hash function.
from pyspark.sql.functions import hash
(
spark
.createDataFrame([(1,'Abe'),(2,'Ben'),(3,'Cas')], ('id','name'))
.withColumn('hashed_name', hash('name'))
).show()
wich results in:
+---+----+-----------+
| id|name|hashed_name|
+---+----+-----------+
| 1| Abe| 1567000248|
| 2| Ben| 1604243918|
| 3| Cas| -586163893|
+---+----+-----------+
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html#hash
if you want to control how the IDs should look like then we can use this code below.
import pyspark.sql.functions as F
from pyspark.sql import Window
SRIDAbbrev = "SOD" # could be any abbreviation that identifys the table or object on the table name
max_ID = 00000000 # control how long you want your numbering to be, i chose 8.
if max_ID == None:
max_ID = 0 # helps identify where you start numbering from.
dataframe_new = dataframe.orderBy(
F.lit('name')
).withColumn(
'hashed_name',
F.concat(
F.lit(SRIDAbbrev),
F.lpad(
(
F.dense_rank().over(
Window.orderBy(name)
)
+ F.lit(max_ID)
),
8,
"0"
)
)
)
which results to
+---+----+-----------+
| id|name|hashed_name|
+---+----+-----------+
| 1| Abe| SOD0000001|
| 2| Ben| SOD0000002|
| 3| Cas| SOD0000003|
| 3| Cas| SOD0000003|
+---+----+-----------+
Let me know if this helps :)

Padding in a Pyspark Dataframe

I have a Pyspark dataframe(Original Dataframe) having below data(all columns have string datatype):
id Value
1 103
2 1504
3 1
I need to create a new modified dataframe with padding in value column, so that length of this column should be 4 characters. If length is less than 4 characters, then add 0's in data as shown below:
id Value
1 0103
2 1504
3 0001
Can someone help me out? How can i achieve it using Pyspark dataframe? Any help will be appreciated.
You can use lpad from functions module,
from pyspark.sql.functions import lpad
>>> df.select('id',lpad(df['value'],4,'0').alias('value')).show()
+---+-----+
| id|value|
+---+-----+
| 1| 0103|
| 2| 1504|
| 3| 0001|
+---+-----+
Using PySpark lpad function in conjunction with withColumn:
import pyspark.sql.functions as F
dfNew = dfOrigin.withColumn('Value', F.lpad(dfOrigin['Value'], 4, '0'))