How to use a column value as delimiter in spark sql substring? - scala

I am trying to do a substring operation on a column using another column as the delimiter, but methods like substring_index() expect a literal string value for the delimiter. Could somebody suggest an approach?

substring_index is defined as substring_index(Column str, String delim, int count), so the delimiter has to be a literal String.
If you have a common delimiter in all the strings of that column, as in
+-------------+----+
|col1         |col2|
+-------------+----+
|a,b,c        |,   |
|d,e,f        |,   |
|Jonh,is,going|,   |
+-------------+----+
You can use the function as
import org.apache.spark.sql.functions._
df.withColumn("splitted", substring_index(col("col1"), ",", 1))
which should give the result as
+-------------+----+--------+
|col1         |col2|splitted|
+-------------+----+--------+
|a,b,c        |,   |a       |
|d,e,f        |,   |d       |
|Jonh,is,going|,   |Jonh    |
+-------------+----+--------+
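For reference, the count argument of substring_index can also be negative, in which case the substring is taken from the right of the delimiter. A minimal sketch on the same df (the column name last is just illustrative):
df.withColumn("last", substring_index(col("col1"), ",", -1))
// a,b,c -> c, d,e,f -> f, Jonh,is,going -> going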
Different splitting delimiters on different rows
If you have a different splitting delimiter on different rows, as in
+-------------+----+
|col1         |col2|
+-------------+----+
|a,b,c        |,   |
|d$e$f        |$   |
|jonh|is|going||   |
+-------------+----+
You can define a udf function as
import org.apache.spark.sql.functions._
// take everything before the first occurrence of the row's delimiter; if it is absent, return the string unchanged
def subStringIndex = udf((string: String, delimiter: String) =>
  if (string.contains(delimiter)) string.substring(0, string.indexOf(delimiter)) else string)
And call it using the .withColumn api as
df.withColumn("splitted", subStringIndex(col("col1"), col("col2")))
the final output is
+-------------+----+--------+
|col1         |col2|splitted|
+-------------+----+--------+
|a,b,c        |,   |a       |
|d$e$f        |$   |d       |
|jonh|is|going||   |jonh    |
+-------------+----+--------+
I hope the answer is helpful

You can try to invoke the related Hive UDF with two different columns as parameters.
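For example, since substring_index is also exposed as a SQL function, a minimal sketch along those lines (assuming the same df with columns col1 and col2 as above) is to call it through expr, which lets both the string and the delimiter be columns:
import org.apache.spark.sql.functions.expr
df.withColumn("splitted", expr("substring_index(col1, col2, 1)"))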

Related

Spark/Scala: Finding count of delimited values in a column eliminating duplicates

I have a column like
+-----------------+----------------------------+
|Race_Track       |EngineType                  |
+-----------------+----------------------------+
|800-RDUO         |881,652,EWQ,300x,652,PXZ    |
+-----------------+----------------------------+
I should remove one specific value, say EWQ, and all duplicates, like below
+-----------------+----------------------------+
|Race_Track       |EngineType                  |
+-----------------+----------------------------+
|800-RDUO         |881,300x,652,PXZ            |
+-----------------+----------------------------+
How to achieve this in Scala?
You can achieve your desired output by combining split, filter, array_distinct and concat_ws as below (assuming data is your dataset):
import org.apache.spark.sql.functions.{array_distinct, col, concat_ws, filter, split}

data = data
  .withColumn("EngineType", array_distinct(
    filter(split(col("EngineType"), ","), x => x.notEqual("EWQ"))))
  .withColumn("EngineType", concat_ws(",", col("EngineType")))
Final output:
+----------+----------------+
|Race_Track|EngineType      |
+----------+----------------+
|800-RDUO  |881,652,300x,PXZ|
+----------+----------------+
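Note that the Scala-side filter(Column, Column => Column) helper used above is a fairly recent addition (the column-lambda variant only ships with newer Spark releases); if it is not available in your version, a sketch of the same logic using the SQL higher-order function through expr (assuming the same data dataset) would be:
import org.apache.spark.sql.functions.expr

data = data.withColumn("EngineType",
  expr("concat_ws(',', array_distinct(filter(split(EngineType, ','), x -> x != 'EWQ')))"))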
Good luck!

How to format CSV data by removing quotes and double-quotes around fields

I'm using a dataset and apparently it has "double quotes" wrapped around each row. I can't see this directly because the file opens in Excel by default when I use my browser.
The dataset looks like this (raw):
"age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""----header 58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"--row
I use the following code:
val bank = spark.read.format("com.databricks.spark.csv").
  option("header", true).
  option("ignoreLeadingWhiteSpace", true).
  option("inferSchema", true).
  option("quote", "").
  option("delimiter", ";").
  load("bank_dataset.csv")
But what I get is data with quotes on either end and string values wrapped in double double-quotes.
What I want instead is age as an int and single quotes wrapped around string values.
If you still have this raw data and want to clean it, then you can use regexp_replace to replace all the double quotes:
import org.apache.spark.sql.functions.{col, regexp_replace}

// strip the double quotes from every value and from every column name
val expr = df.columns
  .map(c => regexp_replace(col(c), "\"", "").as(c.replaceAll("\"", "")))
df.select(expr: _*).show(false)
Output:
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|age|job       |marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|y  |
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|58 |management|married|tertiary |no     |2143   |yes    |no  |unknown|5  |may  |261     |1       |-1   |0       |unknown |no |
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
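The question also asks for age as an int; as a small follow-up sketch (cleaned and typed are just illustrative names, reusing the expr columns built above), you can cast the column explicitly:
val cleaned = df.select(expr: _*)
val typed = cleaned.withColumn("age", col("age").cast("int"))
typed.printSchema()  // age should now be reported as an integer column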

How do I string concat two columns in Scala but order the resulting column alphabetically?

I have a dataframe like this...
val new_df =Seq(("a","b"),("b","a"),("a","c")).toDF("col1","col2")
and I want to create "col3" which is a string concatenation of "col1" and "col2". However, I want the concatenation of "ab" and "ba" to be treated the same, sorted alphabetically so that it's only "ab".
The resulting dataframe I would like to look like this:
val new_df =Seq(("a","b","ab"),("b","a","ab"),("a","c","ac")).toDF("col1","col2","col3")
thanks and have a great day!
With built-in Spark SQL functions, to take advantage of the Spark SQL optimizations:
import org.apache.spark.sql.functions.{array, col, concat_ws, sort_array}

new_df.withColumn("col3",
  concat_ws("", sort_array(array(col("col1"), col("col2")))))
You can also just create a udf that builds the sorted String
import org.apache.spark.sql.functions.udf
import spark.implicits._   // for the $"colName" syntax below

val concatColumns = udf((c1: String, c2: String) => {
  List(c1, c2).sorted.mkString
})
And then use it in a withColumn statement, passing the desired columns to concatenate
new_df.withColumn("col3", concatColumns($"col1", $"col2")).show(false)
Result
+----+----+----+
|col1|col2|col3|
+----+----+----+
|a   |b   |ab  |
|b   |a   |ab  |
|a   |c   |ac  |
+----+----+----+
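If the point of the canonical ordering is to deduplicate the pairs afterwards, a possible follow-up sketch on top of either answer above (deduped is just an illustrative name) is to drop duplicates on the sorted key:
import org.apache.spark.sql.functions.{array, col, concat_ws, sort_array}

val deduped = new_df
  .withColumn("col3", concat_ws("", sort_array(array(col("col1"), col("col2")))))
  .dropDuplicates("col3")
deduped.show(false)  // keeps one of the ("a","b")/("b","a") rows plus ("a","c")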

replacing strings inside df using dictionary scala

I'm new to Scala. I'm trying to replace parts of strings using a dictionary.
my dictionary would be:
val dict = Seq(("fruits", "apples"),("color", "red"), ("city", "paris")).
toDF(List("old", "new").toSeq:_*)
+------+------+
|   old|   new|
+------+------+
|fruits|apples|
| color|   red|
|  city| paris|
+------+------+
I would then translate fields from a column in another df which is:
+--------------------------+
|oldCol                    |
+--------------------------+
|I really like fruits      |
|they are colored brightly |
|i live in city!!          |
+--------------------------+
the desired output:
+------------------------+
|newCol                  |
+------------------------+
|I really like apples    |
|they are reded brightly |
|i live in paris!!       |
+------------------------+
please help! I've tried to convert dict to a map and then use the replaceAllIn() function, but I really can't solve this one.
I've also tried foldLeft, following this answer: Scala replace an String with a List of Key/Values.
Thanks
Create a Map from the dict dataframe and then you can easily do this using a udf, like below
import org.apache.spark.sql.functions._

// Create a Map from the dict dataframe
val oldNewMap = dict.collect.map(row => row.getString(0) -> row.getString(1)).toMap

// Create a udf that applies every old -> new replacement to the input string
val replaceUdf = udf((str: String) =>
  oldNewMap.foldLeft(str) { case (acc, (key, value)) => acc.replaceAll(key, value) })

// Select the old column from oldDf and apply the udf
oldDf.withColumn("newCol", replaceUdf(oldDf.col("oldCol"))).drop("oldCol").show
//Output:
+--------------------+
|              newCol|
+--------------------+
|I really like apples|
|they are reded br...|
|   i live in paris!!|
+--------------------+
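One caveat: replaceAll treats each dictionary key as a regular expression, so keys containing regex metacharacters would need to be quoted first. A hedged variant of the same udf (safeReplaceUdf is just an illustrative name, reusing oldNewMap from above):
import java.util.regex.Pattern

val safeReplaceUdf = udf((str: String) =>
  oldNewMap.foldLeft(str) { case (acc, (key, value)) =>
    acc.replaceAll(Pattern.quote(key), value)
  })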
I hope this will help you

Pyspark - Fill empty strings with a value

Using Pyspark I found how to replace empty strings (' ') with a string, but it fills all the cells of the dataframe with this string between the letters. Maybe the system sees empty strings (' ') between the letters of the non-empty cells.
These are the values of the initial dataframe:
+-----------------+-----+
|CustomerRelStatus|count|
+-----------------+-----+
|         Ανοιχτος|  477|
|         Κλειστος|   68|
|          'γνωστο|  291|
|                 | 1165|
+-----------------+-----+
After using this:
newDf = df.withColumn('CustomerStatus', regexp_replace('CustomerRelStatus', '', '-1000'))
it returns:
+--------------------+-----+
|   CustomerRelStatus|count|
+--------------------+-----+
|-1000Α-1000ν-1000...|  477|
|-1000Κ-1000λ-1000...|   68|
|-1000ʼ-1000γ-1000...|  291|
|               -1000| 1165|
+--------------------+-----+
Is there any other way?
Hope this helps!
from pyspark.sql.functions import col, when

# sample data
df = sc.parallelize([['abc', '123'],
                     ['efg', '456'],
                     ['', '789']]).toDF(('CustomerRelStatus', 'count'))

# replace empty strings with null and then impute the missing value,
# OR directly impute it with '-1000' in the 'otherwise' condition
df = df.withColumn("CustomerStatus",
                   when(col('CustomerRelStatus') != '', col('CustomerRelStatus')).otherwise(None)) \
       .drop('CustomerRelStatus')
df = df.na.fill({'CustomerStatus': '-1000'})
df.show()
Output is
+-----+--------------+
|count|CustomerStatus|
+-----+--------------+
|  123|           abc|
|  456|           efg|
|  789|         -1000|
+-----+--------------+
Don't forget to let us know if it solved your problem :)
I think you are missing a space in the second argument of regexp_replace, so maybe try this:
newDf = df.withColumn('CustomerStatus', regexp_replace('CustomerRelStatus', ' ', '-1000'))