Slice Alphanumeric word from column sentence using pyspark - pyspark

I want to extract only the alphanumeric word (the word containing both letters and digits) from a sentence column using PySpark.
For example,
Original text: ABCD AB12C BCDEF
Expected result: AB12C

You can extract the text between the whitespace:
df.withColumn('newtext', F.regexp_extract('text', r'\s(.*?)\s', 0)).show()
+---+----------------+-------+
| id| text|newtext|
+---+----------------+-------+
| 1|ABCD AB12C BCDEF| AB12C |
+---+----------------+-------+
Following your revised question, extract as required:
df.withColumn('newtext', F.regexp_extract('text', r'([A-Za-z]+\d+[A-Za-z]+|[A-Za-z]+\d+|\d+[A-Za-z]+)', 0)).show()
+---+------------------+-------+
| id| text|newtext|
+---+------------------+-------+
| 1| ABCD AB12C BCDEF| AB12C|
| 2|SE2DC WERDF EWSQSA| SE2DC|
| 3| REDC SEDX WSDR12 | WSDR12|
+---+------------------+-------+
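For reference, a minimal self-contained sketch of the example above (the data is taken from the tables shown; the SparkSession setup is assumed):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 'ABCD AB12C BCDEF'),
     (2, 'SE2DC WERDF EWSQSA'),
     (3, 'REDC SEDX WSDR12')],
    ['id', 'text'])

# group 0 returns the whole match: the first word that mixes letters and digits
df.withColumn(
    'newtext',
    F.regexp_extract('text', r'([A-Za-z]+\d+[A-Za-z]+|[A-Za-z]+\d+|\d+[A-Za-z]+)', 0)
).show()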

Related

PySpark: Remove leading numbers and full stop from dataframe column

I'm trying to remove numbers and full stops that lead the names of horses in a betting dataframe.
The format is like this (a number and a full stop lead the name):
123.Horse Name
I would like the resulting df column to just have the horse's name.
I've tried splitting the column at the full stop but am not getting the required result.
import pyspark.sql.functions as F
runners_returns = runners_returns.withColumn('runner_name', F.split(F.col('runner_name'), '.'))
Any help is greatly appreciated
With a Dataframe like the following.
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| 123.John|
| 2| 5.42Anna|
| 3| .203Josh|
| 4| 102Paul|
+---+-----------+
You can remove the leading numbers and periods like this:
import pyspark.sql.functions as F

df = df.withColumn("runner_name",
                   F.regexp_replace('runner_name', r'(^[\d\.]+)', ''))
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| John|
| 2| Anna|
| 3| Josh|
| 4| Paul|
+---+-----------+
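If you prefer to pull out the name rather than strip the prefix, a similar sketch with regexp_extract (assuming the name is the trailing run of letters) would be:
import pyspark.sql.functions as F

# keep only the trailing letters, e.g. '123.John' -> 'John'
df = df.withColumn('runner_name',
                   F.regexp_extract('runner_name', r'([A-Za-z]+)$', 1))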

How to replace PySpark dataframe column values with a dict

I have a dataframe as shown below
+++++++++++++++++++++
colA | colB | colC |
+++++++++++++++++++++
123 | 3 | 0|
222 | 0 | 1|
200 | 0 | 2|
I want to replace the values in colB with a dict d to get the result below.
d = {3:'a', 0:'b'}
+++++++++++++++++++++
colA | colB | colC |
+++++++++++++++++++++
123 | a | 0|
222 | b | 1|
200 | b | 2|
You can simply use the DataFrame method replace, whose documentation does not clearly cover this use case.
To use a dictionary, pass the dict as the first argument, an arbitrary value as the second argument, and the column name as the third argument.
At least in Spark 2.2, a warning is raised explaining that, since the first argument is a dictionary, the second argument is ignored.
data = [
    (123, 3, 0),
    (222, 0, 1),
    (200, 0, 2)]
df = spark.createDataFrame(data, ['colA', 'colB', 'colC'])
d = {3: 'a', 0: 'b'}
df_renamed = df.replace(d, 1, 'colB')
df_renamed.show()
# +++++++++++++++++++++
# colA | colB | colC |
# +++++++++++++++++++++
# 123 | a | 0|
# 222 | b | 1|
# 200 | b | 2|
Please also note that, as reported in the docs, "when replacing, the new value will be cast to the type of the existing column". Consequently, your column will be cast to string.
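If you want to avoid relying on replace's casting behaviour, an alternative sketch builds an explicit mapping expression with create_map (a common pattern, not the method used above):
import pyspark.sql.functions as F
from itertools import chain

d = {3: 'a', 0: 'b'}

# build a literal map {3 -> 'a', 0 -> 'b'} and look each colB value up in it
# (values of colB not present in the dict would become null)
mapping = F.create_map([F.lit(x) for x in chain(*d.items())])
df_mapped = df.withColumn('colB', mapping[F.col('colB')])
df_mapped.show()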

How to truncate the values of a column of a spark dataframe? [duplicate]

This question already has answers here:
remove last few characters in PySpark dataframe column
(5 answers)
Closed 3 years ago.
I would like to remove the last two characters of each string in a single column of a Spark dataframe. I would like to do this in the Spark dataframe, not by moving it to pandas and then back.
An example dataframe is below,
# +----+-------+
# | age| name|
# +----+-------+
# | 350|Michael|
# | 290| Andy|
# | 123| Justin|
# +----+-------+
where the dtype of the age column is a string.
# +----+-------+
# | age| name|
# +----+-------+
# | 3|Michael|
# | 2| Andy|
# | 1| Justin|
# +----+-------+
This is the expected output. The last two characters of the string have been removed.
The Scala/Spark SQL way of doing this is very simple:
val result = originalDF.withColumn("age", substring(col("age"), 0, 1))
result.show
You can probably adapt the syntax for PySpark.
substring, length, col, and expr from pyspark.sql.functions can be used for this purpose:
from pyspark.sql.functions import substring, length, col, expr

df = ...  # your DataFrame here
# substring(age, 1, length(age)-2) drops the last two characters; it assumes age has 3 digits,
# since logically a person won't live more than 100 years :-) Adapt the substring call as needed.
df = df.withColumn("age", expr("substring(age, 1, length(age)-2)"))
df.show()
Result :
+----+-------+
| age| name|
+----+-------+
| 3|Michael|
| 2| Andy|
| 1| Justin|
+----+-------+
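For completeness, here is a self-contained PySpark sketch of the same approach (the example data is recreated as strings, since the age column is a string in the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('350', 'Michael'), ('290', 'Andy'), ('123', 'Justin')],
    ['age', 'name'])

# drop the last two characters of the string column
df = df.withColumn('age', expr('substring(age, 1, length(age)-2)'))
df.show()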
Scala answer :
val originalDF = Seq(
  (350, "Michael"),
  (290, "Andy"),
  (123, "Justin")
).toDF("age", "name")

println("originalDF")
originalDF.show

println("modified")
originalDF.selectExpr("substring(age, 0, 1) as age", "name").show
Result :
originalDF
+---+-------+
|age| name|
+---+-------+
|350|Michael|
|290| Andy|
|123| Justin|
+---+-------+
modified
+---+-------+
|age| name|
+---+-------+
| 3|Michael|
| 2| Andy|
| 1| Justin|
+---+-------+

Spark dataframe: Pivot and Group based on columns

I have an input dataframe as below, with id, app, and customer columns
Input dataframe
+--------------------+-----+---------+
| id|app |customer |
+--------------------+-----+---------+
|id1 | fw| WM |
|id1 | fw| CS |
|id2 | fw| CS |
|id1 | fe| WM |
|id3 | bc| TR |
|id3 | bc| WM |
+--------------------+-----+---------+
Expected output
Using pivot and aggregation - make the app values column names and put the aggregated customer names as a list in the dataframe
Expected dataframe
+--------------------+----------+-------+----------+
| id| bc | fe| fw |
+--------------------+----------+-------+----------+
|id1 | 0 | WM| [WM,CS]|
|id2 | 0 | 0| [CS] |
|id3 | [TR,WM] | 0| 0 |
+--------------------+----------+-------+----------+
What have I tried?
val newDF = df.groupBy("id").pivot("app")
  .agg(expr("coalesce(first(customer),0)")).drop("app").show()
+--------------------+-----+-------+------+
| id|bc | fe| fw|
+--------------------+-----+-------+------+
|id1 | 0 | WM| WM|
|id2 | 0 | 0| CS|
|id3 | TR | 0| 0|
+--------------------+-----+-------+------+
Issue: In my query, I am not able to get the list of customers like [WM,CS] for "id1" under "fw" (as shown in the expected output); only "WM" is coming. Similarly, for "id3" only "TR" appears, whereas a list with the value [TR,WM] should appear under "bc".
I need your suggestions on how to get the list of customers under each app.
You can use collect_list if you can live with an empty list in the cells where the expected output shows zero:
df.groupBy("id").pivot("app").agg(collect_list("customer")).show
+---+--------+----+--------+
| id| bc| fe| fw|
+---+--------+----+--------+
|id3|[TR, WM]| []| []|
|id1| []|[WM]|[CS, WM]|
|id2| []| []| [CS]|
+---+--------+----+--------+
Using concat_ws we can join the array elements into a comma-separated string and remove the square brackets:
df.groupBy("id").pivot("app").agg(concat_ws(",",collect_list("customer")))
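If you also need the 0 placeholders from the expected output instead of empty cells, one possible PySpark sketch (assuming a literal string '0' is acceptable) replaces the empty strings after the pivot:
import pyspark.sql.functions as F

pivoted = (df.groupBy('id').pivot('app')
             .agg(F.concat_ws(',', F.collect_list('customer'))))

# cells with no customers come back as empty strings; replace them with '0'
result = pivoted.select(
    'id',
    *[F.when(F.col(c) == '', '0').otherwise(F.col(c)).alias(c)
      for c in pivoted.columns if c != 'id'])
result.show()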

How to ignore a special row in the collect_list

I have a table like below.
| COLUMN A| COLUMN b|
| Case| 1111111111|
| Rectype| ABCD|
| Key| UMUM_REF_ID=A1234|
| UMSV ERROR| UNITS_ALLOW must|
| NTNB ERROR| GGGGGGG Value|
| Case| 2222222222|
| Rectype| ABCD|
| Key| UMUM_REF_ID=B8765|
| UMSV ERROR| UNITS_ALLOW must|
| NTNB ERROR| Invalid Value|
I want to add new column "C".
C is the collect_list of the values for "Case", "Rectype", "Key", "UMSV ERROR" and "NTNB ERROR" in column A.
My code is
val window = Window.rowsBetween(0, 4)
val begin = rddDF.withColumn("C", collect_list($"value").over(window))
  .where($"A" like "Case")
begin.show()
It works well.
Now, I want to get the collect_list again, but ignore the "NTNB ERROR" rows whose value in column b is "Invalid Value".
What should I do please?
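One possible way (a PySpark sketch mirroring the Scala snippet above, not tested against the original data): collect_list skips nulls, so you can null out the unwanted rows before collecting.
import pyspark.sql.functions as F
from pyspark.sql import Window

window = Window.rowsBetween(0, 4)

# null out NTNB ERROR rows whose value is 'Invalid Value'; collect_list ignores nulls
masked = (F.when((F.col('A') == 'NTNB ERROR') & (F.col('value') == 'Invalid Value'), None)
           .otherwise(F.col('value')))

begin = (rddDF  # same DataFrame and column names as in the snippet above
         .withColumn('C', F.collect_list(masked).over(window))
         .where(F.col('A').like('Case')))
begin.show()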