Change prefix in an integer column in PySpark

I want to change the leading 222... prefix to 999... in PySpark.
The expected new column new_id should hold the value with the prefix changed to 999...
I will be using this column for an inner merge between two PySpark dataframes.
+-------------+-------------+
|id           |new_id       |
+-------------+-------------+
|2222238308750|9999938308750|
|222222579844 |999999579844 |
|222225701296 |999995701296 |
|2222250087899|9999950087899|
|2222250087899|9999950087899|
|2222237274658|9999937274658|
|22222955099  |99999955099  |
|22222955099  |99999955099  |
|22222955099  |99999955099  |
|222285678    |999985678    |
+-------------+-------------+

You can achieve it with something like this:
# First calculate the number of "2"s from the start until some other character is found, e.g. '2223' gives a length of 3
# Use that calculated length to repeat "9" that many times
# Replace the starting "2"s with the calculated "9" string
# Finally drop all the intermediate columns
from pyspark.sql import functions as F

df.withColumn("len_2", F.length(F.regexp_extract(F.col("value"), r"^2*(?!2)", 0)).cast('int'))\
    .withColumn("to_replace_with", F.expr("repeat('9', len_2)"))\
    .withColumn("new_value", F.expr("regexp_replace(value, '^2*(?!2)', to_replace_with)"))\
    .drop("len_2", "to_replace_with")\
    .show(truncate=False)
Output:
+-------------+-------------+
|value |new_value |
+-------------+-------------+
|2222238308750|9999938308750|
|222222579844 |999999579844 |
|222225701296 |999995701296 |
|2222250087899|9999950087899|
|2222250087899|9999950087899|
|2222237274658|9999937274658|
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|222285678 |999985678 |
+-------------+-------------+
I have used value as the column name; you would have to substitute it with id.
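A more compact variant of the same idea, as a sketch (assuming the column is named id and is readable as a string, and pyspark.sql.functions is imported as F; cast id to string first if it is numeric):
from pyspark.sql import functions as F

df.withColumn(
    "new_id",
    F.concat(
        # one '9' per leading '2'
        F.expr("repeat('9', length(regexp_extract(id, '^2*', 0)))"),
        # everything after the leading run of 2s
        F.regexp_replace("id", "^2*", "")
    )
).show(truncate=False)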

You can try the following:
from pyspark.sql.functions import *
df = df.withColumn("tempcol1", regexp_extract("id", "^2*", 0)) \
    .withColumn("tempcol2", split(regexp_replace("id", "^2*", "_"), "_")[1]) \
    .withColumn("new_id", concat(regexp_replace("tempcol1", "2", "9"), "tempcol2")) \
    .drop("tempcol1", "tempcol2")
The id column is split into two temp columns, one holding the prefix of 2s and the other the rest of the string. The 2s in the prefix column are replaced with 9s, and the result is concatenated back with the second temp column.
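Either way, once new_id exists you can use it for the inner merge, e.g. (a sketch assuming the second DataFrame is called df2 and also has a new_id column):
joined = df.join(df2, on="new_id", how="inner")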

Related

In Pyspark: Get specific values of list-like Columns with a condition and extract to new Columns

My dataframe looks like this:
The values that belong to one entity sit at the same list index, consistently across all of the shown columns.
column_1 | [2022-08-05 03:38...
column_2 | [inside, inside, ...
column_3 | [269344c6-c01c-45...
column_4 | [ff870660-57ce-11...
column_5 | [Mannheim, Mannhe...
column_6 | [26, 21, 2, 8]
column_7 | [fa8103a0-57ce-11...
column_8 | [ATG1, ATG3, Variable1...
My Approach:
# Get columns
df_colum_names = list(df.schema.names)
# Set condition with an expression
filter_func = ("filter(geofenceeventtype,spatial_wi_df -> df.column_8 == 'Variable1')")
geofence_expr = f"transform(sort_array({filter_func}), x -> x."
geofence_prefix = "geofence_sorted"
# extract to new columns
for col in df_colum_names:
    df = df.withColumn(
        geofence_prefix + col,
        F.element_at(
            F.expr(geofence_expr + col.replace("_", ".") + ")"), 1),
    )
In this way I want to create columns containing only the specific values of the entity 'Variable1' and then drop all rows without data from this entity.
The error message:
Can't extract value from lambda df#2345: need struct type but got string
So there are rows where the column holds just a single value as a String and not a StructType. How do I deal with this problem?

Pyspark dataframe split and pad delimited column value into Array of N index

There is a pyspark source dataframe having a column named X. The column X consists of '-' delimited values. There can be any number of delimited values in that particular column.
Example of source dataframe given below:
X
A123-B345-C44656-D4423-E3445-F5667
X123-Y345
Z123-N345-T44656-M4423
X123
Now I need to split this column on the delimiter and pull out exactly N=4 separate delimited values. If there are more than 4 delimited values, we need the first 4 and discard the rest. If there are fewer than 4, we need to pick the existing ones and pad the rest with the empty string "".
Resulting output should be like below:
+----------------------------------+----+----+------+-----+
|X                                 |Col1|Col2|Col3  |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345                         |X123|Y345|      |     |
|Z123-N345-T44656-M4423            |Z123|N345|T44656|M4423|
|X123                              |X123|    |      |     |
+----------------------------------+----+----+------+-----+
I have easily accomplished this in plain Python with the code below, but I am looking for a PySpark approach:
from itertools import chain, repeat, islice

def pad_infinite(iterable, padding=None):
    return chain(iterable, repeat(padding))

def pad(iterable, size, padding=None):
    return islice(pad_infinite(iterable, padding), size)

colA, colB, colC, colD = list(pad(X.split('-'), 4, ''))
You can split the string into an array, separate the elements of the array into columns and then fill the null values with an empty string:
df = ...
df.withColumn("arr", F.split("X", "-")) \
.selectExpr("X", "arr[0] as Col1", "arr[1] as Col2", "arr[2] as Col3", "arr[3] as Col4") \
.na.fill("") \
.show(truncate=False)
Output:
+----------------------------------+----+----+------+-----+
|X |Col1|Col2|Col3 |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345 |X123|Y345| | |
|Z123-N345-T44656-M4423 |Z123|N345|T44656|M4423|
|X123 |X123| | | |
+----------------------------------+----+----+------+-----+
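If you want N to be a parameter rather than hard-coding arr[0]..arr[3], a minimal sketch of the same approach (assuming pyspark.sql.functions is imported as F):
from pyspark.sql import functions as F

N = 4
arr = F.split("X", "-")
df.select(
    "X",
    # getItem returns null past the end of the array; coalesce turns that into ""
    *[F.coalesce(arr.getItem(i), F.lit("")).alias(f"Col{i + 1}") for i in range(N)]
).show(truncate=False)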

add new column in a dataframe depending on another dataframe's row values

I need to add a new column to dataframe DF1, and the new column's value should be calculated using the values of other columns present in that DF. Which of the other columns are to be used is given in another dataframe, DF2.
e.g. DF1:
+----------+---------+-----------+------------+
|protocolNo|serialNum|testMethod |testProperty|
+----------+---------+-----------+------------+
|Product1  |AB       |testMethod1|TP1         |
|Product2  |CD       |testMethod2|TP2         |
+----------+---------+-----------+------------+
DF2:
+------+----+------------------------+------------+
|action|type|value                   |exploded    |
+------+----+------------------------+------------+
|append|hash|[protocolNo]            |protocolNo  |
|append|text|_                       |_           |
|append|hash|[serialNum,testProperty]|serialNum   |
|append|hash|[serialNum,testProperty]|testProperty|
+------+----+------------------------+------------+
The values of the exploded column in DF2 are column names of DF1 whenever the value of the type column is hash.
Required -
A new column should be created in DF1, and its value should be calculated as below:
hash[protocolNo]_hash[serialNumTestProperty], where in place of each column name the corresponding row value should be substituted.
e.g. for row 1 of DF1, the value should be
hash[Product1]_hash[ABTP1]
which will result in something like abc-df_egh-45e after hashing.
The above procedure should be followed for each and every row of DF1.
I have tried using map and withColumn with a UDF on DF1, but inside the UDF the outer dataframe's values are not accessible (it gives a NullPointerException), and I am also not able to pass a DataFrame as input to a UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF-
+----------+---------+-----------+------------+--------------+
|protocolNo|serialNum|testMethod |testProperty|newColumn     |
+----------+---------+-----------+------------+--------------+
|Product1  |AB       |testMethod1|TP1         |abc-df_egh-4je|
|Product2  |CD       |testMethod2|TP2         |dfg-df_ijk-r56|
+----------+---------+-----------+------------+--------------+
The newColumn values are shown after hashing.
Instead of working with DF2 directly, you can translate DF2 into case class specifications, e.g.
case class Spec(columnName: String, inputColumns: Seq[String], actions: String*)
Create instances of the above class:
val specifications = Seq(
  Spec("new_col_name", Seq("serialNum", "testProperty"), "hash", "append")
)
Then you can process the columns like this:
val transformed = specifications
  .foldLeft(DF1)((df: DataFrame, spec: Spec) => df.transform(transformColumn(spec)))

def transformColumn(spec: Spec)(df: DataFrame): DataFrame = {
  spec.actions.foldLeft(df)((df: DataFrame, action: String) => {
    action match {
      // match on the action here and append the result with df.withColumn
      case "append" => df
      case _        => df
    }
  })
}
The syntax may not be exactly correct.
Since DF2 has the column names that will be used to calculate the new column from DF1, I have made the assumption that DF2 will not be a huge DataFrame.
The first step is to filter DF2 and get the column names that we want to pick from DF1.
val hashColumns = DF2.filter(col("type") === "hash").select("exploded").collect
Now hashColumns holds the columns that we want to use to calculate the hash in newColumn. hashColumns is an Array[Row]. We need this to be a Column that can be applied while creating newColumn in DF1.
val newColumnHash = hashColumns.map(f => hash(col(f.getString(0)))).reduce(concat_ws("_", _, _))
The above line converts each Row to a Column with the hash function applied to it, and we reduce them while concatenating with "_". Now the task becomes simple: we just need to apply this to DF1.
DF1.withColumn("newColumn",newColumnHash).show(false)
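For reference, a roughly equivalent PySpark sketch (assuming DF2 is small enough to collect and pyspark.sql.functions is imported as F):
from functools import reduce
from pyspark.sql import functions as F

# collect the DF1 column names flagged as "hash" in DF2
hash_cols = [r["exploded"] for r in DF2.filter(F.col("type") == "hash").select("exploded").collect()]
# hash each column, cast to string, and join the pieces with "_"
new_column_hash = reduce(
    lambda acc, c: F.concat_ws("_", acc, c),
    [F.hash(F.col(c)).cast("string") for c in hash_cols]
)
DF1.withColumn("newColumn", new_column_hash).show(truncate=False)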
Hope this helps!

How to associate dates with a count or an integer [duplicate]

This question already has answers here:
Spark SQL window function with complex condition
(2 answers)
Closed 4 years ago.
I have a DataFrame with columns "id" and "date". The date is in the format yyyy-mm-dd. Here is an example:
+---------+----------+
| item_id| ds|
+---------+----------+
| 25867869|2018-05-01|
| 17190474|2018-01-02|
| 19870756|2018-01-02|
|172248680|2018-07-29|
| 41148162|2018-03-01|
+---------+----------+
I want to create a new column in which each date is associated with an integer starting from 1, such that the smallest (earliest) date gets the integer 1, the next (second earliest) date gets 2, and so on.
I want my DataFrame to look like this:
+---------+----------+---------+
| item_id| ds| number|
+---------+----------+---------+
| 25867869|2018-05-01| 3|
| 17190474|2018-01-02| 1|
| 19870756|2018-01-02| 1|
|172248680|2018-07-29| 4|
| 41148162|2018-03-01| 2|
+---------+----------+---------+
Explanation:
2018-01-02 comes earliest, hence its number is 1. Since there are 2 rows with the same date, 1 appears twice. After 2018-01-02 the next date is 2018-03-01, hence its number is 2, and so on. How can I create such a column?
This can be achieved by dense_rank in Window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val win = Window.orderBy(to_date(col("ds"),"yyyy-MM-dd").asc)
val df1 = df.withColumn("number", dense_rank() over win)
df1 will have the column number as you required.
Note: to_date(col("ds"), "yyyy-MM-dd") is mandatory, otherwise the values will be treated as strings and will not serve the purpose.
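The same thing in PySpark, as a sketch (assuming the DataFrame is named df):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

win = Window.orderBy(F.to_date(F.col("ds"), "yyyy-MM-dd").asc())
df1 = df.withColumn("number", F.dense_rank().over(win))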
You could write a query to get the oldest row without a number, something like:
SELECT * FROM tablename WHERE number IS NULL ORDER BY ds ASC
Then make another query to get the greatest number:
SELECT * FROM tablename ORDER BY number DESC
Then, if both queries return the same date, update the table with the same number:
UPDATE tablename SET number = 'greatest number from the second query' WHERE ds = 'the date from the first query'
Or, if the dates are different, do the same but add 1 to the number:
UPDATE tablename SET number = 'greatest number from the second query' + 1 WHERE ds = 'the date from the first query'
To make this work you should first assign the number 1 to the oldest entry.
You should run this in a loop until the first query (which checks whether any number is still unset) returns nothing.
The first query supposes that the unfilled column is all NULL; if that is not the case, you should change the WHERE condition to check for whatever counts as empty in your table.

Reading a CSV file into PySpark that contains key:value pairs, such that the key becomes the column name and the value becomes that column's data

I am a Spark beginner. Please help me out with a solution.
The CSV file contains text in the form of key:value pairs delimited by commas, and in some lines the keys (or columns) may be missing.
I have loaded this file into a single column of a dataframe. I want to turn the keys into columns and their associated values into the data of those columns. When some columns are missing, I want to add the column anyway and fill it with dummy data.
Dataframe
+----------------------------------------------------------------+
| _c0 |
+----------------------------------------------------------------+
|name:Pradnya,IP:100.0.0.4, college: SDM, year:2018 |
|name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018 |
+----------------------------------------------------------------+
I want the output in this form
+-------+-----------+-------+--------+----+
|name   |IP         |College|Semester|year|
+-------+-----------+-------+--------+----+
|Pradnya|100.0.0.4  |SDM    |null    |2018|
|Ram    |100.10.10.5|BVB    |IV      |2018|
+-------+-----------+-------+--------+----+
Thanks.
PySpark won't recognize the key:value pairing natively. One workaround is to convert the file into JSON format and then read the JSON file.
content of raw.txt:
name:Pradnya,IP:100.0.0.4, college: SDM, year:2018
name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018
The following code will create the JSON file:
import json

with open('raw.json', 'w') as outfile:
    json.dump([dict([p.split(':') for p in l.split(',')]) for l in open('raw.txt')], outfile)
Now you can create the PySpark dataframe using the following code:
df = spark.read.format('json').load('raw.json')
If you know all field names and the keys/values do not contain embedded delimiters, then you can probably convert the key/value lines into Row objects through the RDD's map function.
from pyspark.sql import Row

# assumed you have already defined a SparkSession named `spark`
sc = spark.sparkContext

# initialize the RDD
rdd = sc.textFile("key-value-file")

# define a list of all field names
columns = ['name', 'IP', 'College', 'Semester', 'year']

# convert one text line into a Row object
def setRow(x):
    # convert the line into key/value tuples; strip spaces and lowercase the key
    z = dict((k.strip().lower(), v.strip()) for e in x.split(',') for k, v in [e.split(':')])
    # make sure all columns show up in the Row object
    return Row(**dict((c, z[c] if c in z else None) for c in (n.lower() for n in columns)))

# map lines to Row objects and then convert the result to a dataframe
rdd.map(setRow).toDF().show()
#+-------+-----------+-------+--------+----+
#|college| ip| name|semester|year|
#+-------+-----------+-------+--------+----+
#| SDM| 100.0.0.4|Pradnya| null|2018|
#| BVB|100.10.10.5| Ram| IV|2018|
#+-------+-----------+-------+--------+----+
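A DataFrame-only alternative, as a rough sketch: Spark SQL's str_to_map can parse each line into a map, and transform_keys (Spark 3.0+) can normalise the keys; this assumes the keys and values never contain an embedded ':' or ','.
from pyspark.sql import functions as F

columns = ['name', 'IP', 'College', 'Semester', 'year']

# parse each line into a map and normalise the keys to lowercase without padding
with_map = df.withColumn(
    "m", F.expr("transform_keys(str_to_map(_c0, ',', ':'), (k, v) -> lower(trim(k)))")
)

# pull each expected key out of the map; missing keys come out as null
with_map.select(
    *[F.trim(F.col("m").getItem(c.lower())).alias(c) for c in columns]
).show()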