I get a DataFrame as input to a function in Scala; it has a column named vin.
The column has values in the formats below:
1. UJ123QR8467
2. 0UJ123QR846
3. /UJ123QR8467
4. -UJ123QR8467
and so on.
The requirement is to clean the column vin based on the following rules.
1. Replace the characters /, _ and - with "".
2. Replace a leading 0 with "".
3. If the value is longer than 10 characters, set it to NULL.
Is there a simpler way to achieve this?
All I can think of is chaining multiple .withColumn calls, doing a regexp_replace each time.
I would combine all the regex-related changes into a single transformation and apply the length condition in a second one, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and the $-notation

val df = Seq(
  "UJ123QR8467", "00UJ123QR84", "/UJ123QR8467",
  "-UJ123QR8467", "UJ0123QR84", "UJ123-QR_846"
).toDF("vin")
df.
  withColumn("vin2", regexp_replace($"vin", "^[0]|[/_-]", "")).
  withColumn("vin2", when(length($"vin2") <= 10, $"vin2")).
  show
// +------------+----------+
// |         vin|      vin2|
// +------------+----------+
// | UJ123QR8467|      null|
// | 00UJ123QR84|0UJ123QR84|
// |/UJ123QR8467|      null|
// |-UJ123QR8467|      null|
// |  UJ0123QR84|UJ0123QR84|
// |UJ123-QR_846|UJ123QR846|
// +------------+----------+
Note that I've slightly expanded the sample dataset to cover cases such as multiple leading 0s, a non-leading 0, and embedded [/_-] characters.
I have a DataFrame with a column containing text, and a list of keywords. My goal is to build a new column showing whether the text column contains at least one of the keywords. Let's look at some mock data:
test_data = [('1', 'i like stackoverflow'),
             ('2', 'tomorrow the sun will shine')]
test_df = spark.sparkContext.parallelize(test_data).toDF(['id', 'text'])
With a single keyword ("sun") the solution would be:
from pyspark.sql import functions as F

test_df.withColumn(
    'text_contains_keyword', F.array_contains(F.split(test_df.text, ' '), 'sun')
).show()
The word "sun" is included in the second row of the text column, but not in the first. Now, let's say I have a list of keywords:
test_keywords = ['sun', 'foo', 'bar']
How can I check, for each of the words in test_keywords, whether it is included in the text column? Unfortunately, if I simply replace "sun" with the list, it leads to this error:
Unsupported literal type class java.util.ArrayList [sun, foo, bar]
You can do that using the built-in rlike function with the following code.
from pyspark.sql import functions
test_df = test_df.withColumn(
    "text_contains_word",
    functions.col('text').rlike(r'(^|\s)(' + '|'.join(test_keywords) + r')(\s|$)')
)
test_df.show()
+---+--------------------+------------------+
| id|                text|text_contains_word|
+---+--------------------+------------------+
|  1|i like stackoverflow|             false|
|  2|tomorrow the sun ...|              true|
+---+--------------------+------------------+
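If you would rather keep the array-based approach from the question, a minimal alternative sketch (assuming Spark 2.4+ for arrays_overlap) compares the split words against an array of keyword literals. Note that this does exact word matching after splitting on single spaces, unlike the regex above:
from pyspark.sql import functions as F

# Build an array column of keyword literals and check whether it shares
# at least one element with the words of the text column.
test_df.withColumn(
    'text_contains_keyword',
    F.arrays_overlap(
        F.split(F.col('text'), ' '),
        F.array(*[F.lit(w) for w in test_keywords]),
    ),
).show()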
I have a pyspark dataframe like this:
+-------+-------+--------------------+
|s_field|s_check|            t_filter|
+-------+-------+--------------------+
|  MANDT|   true|                 !=E|
|  WERKS|   true|0010_0020_0021_00...|
+-------+-------+--------------------+
And as a first step, I split t_filter on _ with f.split(f.col("t_filter"), "_"):
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))
filters.show(truncate=False)
+-------+-------+--------------------+-------------------------+
|s_field|s_check|            t_filter|               t_filter_1|
+-------+-------+--------------------+-------------------------+
|  MANDT|   true|                 !=E|                    [!=E]|
|  WERKS|   true|0010_0020_0021_00...|[0010, 0020, 0021, 00...]|
+-------+-------+--------------------+-------------------------+
What I want to achieve is to create a new column, using s_field and t_filter as the input, with a logic check for !=.
Ultimate aim:
+------------------------------+
|                    t_filter_2|
+------------------------------+
|                  MANDT != 'E'|
|WERKS in ('0010', '0020', ...)|
+------------------------------+
I have tried using withColumn but I keep getting an error saying col must be Column.
I am also not sure what the proper approach should be to achieve this.
Note: there are a large number of rows, around 10k. I understand that using a UDF would be quite expensive, so I'm interested to know whether there are other ways this can be done.
You can achieve this using withColumn with conditional evaluation via the when and otherwise functions. Following your example, the logic is: if t_filter contains !=, concatenate s_field and t_filter; otherwise, first convert the t_filter_1 array to a string with ',' as the separator, then concatenate it with s_field along with literals for in and the surrounding parentheses and quotes.
from pyspark.sql import functions as f
filters.withColumn(
"t_filter_2",
f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
),
)
Output
+-------+-------+--------------------+-------------------------+---------------------------------------+
|s_check|s_field|t_filter            |t_filter_1               |t_filter_2                             |
+-------+-------+--------------------+-------------------------+---------------------------------------+
|true   |MANDT  |!='E'               |[!='E']                  |MANDT!='E'                             |
|true   |WERKS  |0010_0020_0021_00...|[0010, 0020, 0021, 00...]|WERKS in ('0010','0020','0021','00...')|
+-------+-------+--------------------+-------------------------+---------------------------------------+
Complete Working Example
from pyspark.sql import functions as f
filters_data = [
{"s_field": "MANDT", "s_check": True, "t_filter": "!='E'"},
{"s_field": "WERKS", "s_check": True, "t_filter": "0010_0020_0021_00..."},
]
filters = spark.createDataFrame(filters_data)
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))
filters.withColumn(
"t_filter_2",
f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
),
).show(200, False)
I have a dataframe where some column special_column contains values like one, two. My dataframe also has columns one_processed and two_processed.
I would like to add a new column my_new_column whose values are taken from other columns of my dataframe, based on the value of special_column. For example, if special_column == one I would like my_new_column to be set to one_processed.
I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that I cannot parametrize F.col with a column.
How could I get the string value of the concatenation, so that I can select the desired column?
from pyspark.sql.functions import when, col, lit, concat_ws
sdf.withColumn("my_new_column",
    when(col("special_column") == "one", col("one_processed"))
    .otherwise(concat_ws("_", col("special_column"), lit("processed"))))
The easiest way in your case would be just a simple when/otherwise, like:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1, 2, "one"), (1, 2, "two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
|            1|            2|           one|            1|
|            1|            2|           two|            2|
+-------------+-------------+--------------+-------------+
As far as I know there is no way to select a column by a name computed from the data, as the execution plan would then depend on the data.
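If there are many such columns, you can build the chained when in a loop instead of writing each branch by hand. A minimal sketch, assuming every value of special_column has a matching column named <value>_processed:
from pyspark.sql import functions as F

# Collect the candidate columns and derive the special_column value each one maps to.
processed_cols = [c for c in df.columns if c.endswith("_processed")]
suffix = len("_processed")

# Start the CASE expression with the first column, then chain one branch per remaining column.
expr = F.when(F.col("special_column") == processed_cols[0][:-suffix], F.col(processed_cols[0]))
for c in processed_cols[1:]:
    expr = expr.when(F.col("special_column") == c[:-suffix], F.col(c))

df.withColumn("my_new_column", expr).show()
Rows whose special_column value has no matching _processed column come out as null, since no otherwise is supplied.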
I have a dataframe -
df
+----------+----+----+-------+-------+
|      WEEK|DIM1|DIM2|T1_diff|T2_diff|
+----------+----+----+-------+-------+
|2016-04-02|  14|NULL|     -5|     60|
|2016-04-30|  14|  FR|     90|      4|
+----------+----+----+-------+-------+
I have defined a list as targetList:
List("T1_diff", "T2_diff")
I want to keep only the rows where every column in targetList (T1_diff and T2_diff) is greater than 3. In this scenario the output should contain only the second row, since the first row has -5 for T1_diff. targetList can contain more columns; currently it has T1_diff and T2_diff, and if another column such as T3_diff is added, it should be handled automatically.
What is the best way to achieve this ?
Suppose you have the following List of columns which you want to filter on for values greater than 3.
val lst = List("T1_diff", "T2_diff")
Then you can build a condition String from these column names and pass that String to the where function.
val condition = lst.map(c => s"$c>3").mkString(" AND ")
df.where(condition).show(false)
For the above DataFrame it will output only the second row.
+----------+----+----+-------+-------+
|WEEK      |DIM1|DIM2|T1_diff|T2_diff|
+----------+----+----+-------+-------+
|2016-04-30|14  |FR  |90     |4      |
+----------+----+----+-------+-------+
If you have another column, say T3_diff, you can add it to the List and it will be included in the filter condition.
I am using CassandraSQLContext from spark-shell to query data from Cassandra. I want to know two things: first, how to fetch more than 20 rows using CassandraSQLContext, and second, how to display the full value of a column. As you can see below, by default it appends dots to truncated string values.
Code :
val csc = new CassandraSQLContext(sc)
csc.setKeyspace("KeySpace")
val maxDF = csc.sql("SQL_QUERY" )
maxDF.show
Output:
+--------------------+--------------------+-----------------+--------------------+
|                  id|                Col2|             Col3|                Col4|
+--------------------+--------------------+-----------------+--------------------+
|8wzloRMrGpf8Q3bbk...|              Value1|                X|                  K1|
|AxRfoHDjV1Fk18OqS...|              Value2|                Y|                  K2|
|FpMVRlaHsEOcHyDgy...|              Value3|                Z|                  K3|
|HERt8eFLRtKkiZndy...|              Value4|                U|                  K4|
|nWOcbbbm8ZOjUSNfY...|              Value5|                V|                  K5|
+--------------------+--------------------+-----------------+--------------------+
If you want to print the whole value of a column in Scala, you just need to set the truncate argument of the show method to false:
maxDF.show(false)
and if you wish to show more than 20 rows:
// example: show 30 rows of maxDF, untruncated
maxDF.show(30, false)
For PySpark, you'll need to specify the argument by name:
maxDF.show(truncate = False)
With take you won't get a nice tabular form; instead the rows are returned as a plain Scala object (an Array[Row]):
maxDF.take(50)