PySpark: Regex Replace Group - pyspark

I'm trying to join two tables based on a common ID, but there's a mismatch in dates across these files which I'm trying to normalise.
Given this data:
+-------+-------------------+----------------------------+
|dataset|id |topic |
+-------+-------------------+----------------------------+
|2020A |1128290566331031552|papuaNewguineaEarthquake2019|
|2020A |1128293303659716608|papuaNewguineaEarthquake2019|
|2020A |1152200235847966726|athensEarthquake2019 |
|2020A |1152204892083281920|athensEarthquake2019 |
|2020A |1152220394008522753|athensEarthquake2019 |
+-------+-------------------+----------------------------+
How would I, for example, replace the 2019 in papuaNewguineaEarthquake2019 with the first four numbers of the value in the dataset column so that:
papuaNewguineaEarthquake2019 becomes papuaNewguineaEarthquake2020?
In other words, how do I use regex to replace a subgroup in one column with a subgroup in another column?

You can use the expr function.
I'm using regexp_extract to extract the first 4 digits from the dataset column and regexp_replace to replace the last 4 digits of the topic column with the output of regexp_extract.
Regex for first 4 digits: (^[0-9]{4})
Regex for last 4 digits: ([0-9]{4}$)
from pyspark.sql.functions import expr
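# Extract the leading 4-digit year from dataset (group 1), then use it
# to replace the trailing 4 digits of topic.
(df.withColumn("dataset_year", expr("regexp_extract(dataset, '(^[0-9]{4})', 1)"))
   .withColumn("topic", expr("regexp_replace(topic, '([0-9]{4}$)', dataset_year)"))
   .drop("dataset_year")
   .show(truncate=False))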
+-------+-------------------+----------------------------+
|dataset|id |topic |
+-------+-------------------+----------------------------+
|2020A |1128290566331031552|papuaNewguineaEarthquake2020|
|2020A |1128293303659716608|papuaNewguineaEarthquake2020|
|2020A |1152200235847966726|athensEarthquake2020 |
|2020A |1152204892083281920|athensEarthquake2020 |
|2020A |1152220394008522753|athensEarthquake2020 |
+-------+-------------------+----------------------------+
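To sanity-check the two patterns outside Spark, here is a minimal sketch in plain Python using the standard `re` module. The helper name `normalise_topic` is hypothetical and not part of the answer above. One assumed difference: when `dataset` has no leading four digits this helper leaves `topic` unchanged, whereas Spark's `regexp_extract` would return an empty string and the trailing year would be removed.

```python
import re

def normalise_topic(dataset: str, topic: str) -> str:
    """Replace the trailing 4 digits of topic with the leading 4 digits of dataset."""
    year = re.search(r"^[0-9]{4}", dataset)
    if year is None:
        # Assumption: keep topic as-is when dataset has no leading year.
        return topic
    return re.sub(r"[0-9]{4}$", year.group(0), topic)

print(normalise_topic("2020A", "papuaNewguineaEarthquake2019"))
print(normalise_topic("2020A", "athensEarthquake2019"))
```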

Related

Change prefix in a integer column in pyspark

I want to convert the prefix from 222.. to 999.. in PySpark.
The expected result is a new column, new_id, with the prefix changed to 999..
I will be using this column for an inner merge between two PySpark dataframes.
+-------------+-------------+
|id           |new_id       |
+-------------+-------------+
|2222238308750|9999938308750|
|222222579844 |999999579844 |
|222225701296 |999995701296 |
|2222250087899|9999950087899|
|2222250087899|9999950087899|
|2222237274658|9999937274658|
|22222955099  |99999955099  |
|22222955099  |99999955099  |
|22222955099  |99999955099  |
|222285678    |999985678    |
+-------------+-------------+
You can achieve it with something like this:
import pyspark.sql.functions as F
# First count the leading "2"s, i.e. up to the first character that is not "2"; e.g. '2223' gives a length of 3
# Use that count to repeat "9" the same number of times
# Replace the leading "2"s with the calculated "9" string
# Finally drop all the calculated columns
df.withColumn("len_2", F.length(F.regexp_extract(F.col("value"), r"^2*(?!2)", 0)).cast('int'))\
.withColumn("to_replace_with", F.expr("repeat('9', len_2)"))\
.withColumn("new_value", F.expr("regexp_replace(value, '^2*(?!2)', to_replace_with)")) \
.drop("len_2", "to_replace_with")\
.show(truncate=False)
Output:
+-------------+-------------+
|value |new_value |
+-------------+-------------+
|2222238308750|9999938308750|
|222222579844 |999999579844 |
|222225701296 |999995701296 |
|2222250087899|9999950087899|
|2222250087899|9999950087899|
|2222237274658|9999937274658|
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|222285678 |999985678 |
+-------------+-------------+
I used value as the column name; substitute id in your case.
You can try the following:
from pyspark.sql.functions import concat, regexp_extract, regexp_replace, split
df = (df
    .withColumn("tempcol1", regexp_extract("id", "^2*", 0))  # leading run of 2s
    .withColumn("tempcol2", split(regexp_replace("id", "^2*", "_"), "_")[1])  # rest of the id
    .withColumn("new_id", concat(regexp_replace("tempcol1", "2", "9"), "tempcol2"))
    .drop("tempcol1", "tempcol2"))
The id column is split into two temp columns, one having the prefix and the other the rest of the string. The prefix column values are replaced and concatenated back with the second temp column.
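Both answers boil down to the same string operation: find the leading run of 2s and swap it for a run of 9s of equal length. A minimal sketch of that logic in plain Python, handy for checking expected values before running the Spark job; the function name `swap_prefix` is hypothetical and ids are assumed to be handled as strings.

```python
import re

def swap_prefix(value: str) -> str:
    """Replace the leading run of '2' characters with the same number of '9's."""
    return re.sub(r"^2+", lambda m: "9" * len(m.group(0)), value)

for sample in ["2222238308750", "22222955099", "222285678"]:
    print(sample, swap_prefix(sample))
```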

pyspark column type casting in pivot

I have a dataframe from which I want to create a pivot table using two columns. The values of the questionHeader column are pivoted into new columns, for example Age and Age_numeric, and the answerHeader column supplies the values.
I collect the answerHeader values into a list with collect_list. Depending on the questionType column, I want a new column such as Age_numeric to be a list of ints and a column such as Age to be a list of strings, but my code always gives me a list of strings. Any idea how to solve this?
This is the code:
y = (output.groupby("sessionId").pivot("questionHeader")
     .agg(collect_list(
         when(col("questionType") == "numericAnswer", col("answerHeader").cast("float"))
         .when(col("questionType") != "numericAnswer", col("answerHeader")))))
this is what i get
| session id | Age | Age_numeric
| 1 | ["20-25 years"] | ["20"]
| 3 | ["20-25 years"] | ["20"]
This is what i want
| session id | Age | Age_numeric
| 1 | ["20-25 years"] | [20]
| 3 | ["20-25 years"] | [20]
If you want the output as in the last two rows, you do not need a pivot; just use groupBy and collect_list on each of the two columns. To get a list of integers for Age_numeric, apply .cast("array<int>") to the collected list, or change the type of the Age_numeric column before calling collect_list().
Replicate the data
import pyspark.sql.functions as F
data = [(1, "20-25 years", "20"), (3, "20-25 years", "20")]
df = spark.createDataFrame(data, schema=["session_id", "Age", "Age_numeric"])
Replicate the output
df_out = (df.groupBy("session_id")
    .agg(F.collect_list("Age").alias("Age"),
         F.collect_list("Age_numeric")
          .cast("array<int>")  # cast the collected strings to integers
          .alias("Age_numeric")))

How to parse month-year string using Presto

I have a column that contains a Month-Year string that I would like to convert to an actual date representing the first day of the Month and Year combination. For example
+----------+------------+
| Original | Desired |
+----------+------------+
| Aug-19 | 08/01/2019 |
+----------+------------+
| Sep-20 | 09/01/2020 |
+----------+------------+
| May-22 | 05/01/2022 |
+----------+------------+
I have tried breaking apart the Month-Year string using split_part but when I try and pass Month as a parameter into date_parse it throws an error with the input (INVALID_FUNCTION_ARGUMENT). I could break apart the Month-Year into strings and then recombine, hard-coding the 01 however the problem seems that three letter month cannot be parsed into an actual month by Presto. I also want to avoid a 12 line CASE WHEN statement to parse the month if possible.
The two-digit suffix is the year, so parse it with %y (%d would read it as the day of the month). The day is not in the input and defaults to the first of the month:
select date_format(date_parse('May-22', '%b-%y'), '%m/%d/%Y')
https://trino.io/docs/current/functions/datetime.html?mysql-date-functions
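Assuming the two-digit suffix is the year, as the desired output suggests, the format string can be checked in plain Python, whose `strptime` uses the same `%b` (abbreviated month) and `%y` (two-digit year) directives. This is only an analogue of the Presto query, not Presto code; the helper name `month_year_to_date` is hypothetical, and English month abbreviations (the default C locale) are assumed.

```python
from datetime import datetime

def month_year_to_date(text: str) -> str:
    """Parse 'Mon-YY' and format the first day of that month as MM/DD/YYYY."""
    parsed = datetime.strptime(text, "%b-%y")  # missing day defaults to 1
    return parsed.strftime("%m/%d/%Y")

for sample in ["Aug-19", "Sep-20", "May-22"]:
    print(sample, month_year_to_date(sample))
```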

Hbase shell Filter with Prefix

I have to get all the entries from an HBase table whose values start with a given input.
For example if my table is like below:
Table | Family | ColumnQualifier | Value
exp | family | column | 1000xyz
exp | family | column1 | 1000abc
exp | family | column2 | 1001abc
I need to get the entries 1000xyz and 1000abc by value filter with input - 1000
I tried the value filter :
scan 'exp', { FILTER => "ValueFilter( =, 'binary:1000')" }
but it only matches a value that is exactly 1000.
Thanks in advance!!!!
Use binaryprefix instead of binary as the value comparator:
scan 'exp', { FILTER => "ValueFilter( =, 'binaryprefix:1000' )" }
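The difference between the two comparators can be modelled in a few lines of Python: with the = operator, binary keeps a cell only when its whole value equals the operand, while binaryprefix keeps it when the value starts with the operand. This is a sketch of the comparison semantics only, not of the HBase API; the function `value_filter` and the dictionary layout are hypothetical.

```python
def value_filter(cells: dict, comparator: str, operand: bytes) -> dict:
    """Keep cells whose value equals ('binary') or starts with ('binaryprefix') the operand."""
    if comparator == "binary":
        matches = lambda value: value == operand
    elif comparator == "binaryprefix":
        matches = lambda value: value.startswith(operand)
    else:
        raise ValueError(f"unsupported comparator: {comparator}")
    return {qualifier: value for qualifier, value in cells.items() if matches(value)}

cells = {"column": b"1000xyz", "column1": b"1000abc", "column2": b"1001abc"}
print(value_filter(cells, "binary", b"1000"))
print(value_filter(cells, "binaryprefix", b"1000"))
```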

Spark explode multiple columns of row in multiple rows

I have a problem converting one row with three columns into three rows.
For example:
ID | String   | colA | colB | colC
1  | sometext | 1    | 2    | 3
I need to convert it into:
ID | String   | resultColumn
1  | sometext | 1
1  | sometext | 2
1  | sometext | 3
I only have a DataFrame with the first schema (table):
val df: DataFrame
Note: I can do this with an RDD, but is there another way? Thanks
Assuming that df has the schema of your first snippet, I would try:
df.select($"ID", $"String", explode(array($"colA", $"colB",$"colC")).as("resultColumn"))
If you also want to keep the column names, you can use a trick: build a column of arrays in which each inner array holds a value and the name of the column it came from. First create your expression
val expr = explode(array(array($"colA", lit("colA")), array($"colB", lit("colB")), array($"colC", lit("colC"))))
then use getItem (since you cannot use a generator inside nested expressions, you need two selects here)
df.select($"ID", $"String", expr.as("tmp")).select($"ID", $"String", $"tmp".getItem(0).as("resultColumn"), $"tmp".getItem(1).as("columnName"))
It is a bit verbose though; there might be a more elegant way to do this.
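To make the row multiplication concrete, here is a minimal plain-Python sketch of what explode over an array of columns does to a single row, including the variant that keeps the source column name. It models the semantics only and is not Spark code; the function `explode_columns` and the dictionary representation of a row are hypothetical.

```python
def explode_columns(row: dict, id_cols: list, value_cols: list):
    """Yield one output row per value column, carrying the id columns along."""
    base = {name: row[name] for name in id_cols}
    for name in value_cols:
        yield {**base, "resultColumn": row[name], "columnName": name}

row = {"ID": 1, "String": "sometext", "colA": 1, "colB": 2, "colC": 3}
result = list(explode_columns(row, ["ID", "String"], ["colA", "colB", "colC"]))
for out in result:
    print(out)
```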