Substring with delimiters with Spark Scala

I am new to Spark and Scala and I have a question:
I have a city field in my database (already loaded into a DataFrame) with this pattern: "someLetters" + " - " + id + ')'.
Example :
ABDCJ - 123456)
AGDFHBAZPF - 1234567890)
The size of the field is not fixed, and the id can be an integer of 6 or 10 digits. So, what I want to do is extract that id into a new column called city_id.
Concretely, I want to start from the last character, which is ')', ignore it, and extract the integer until I reach a space, then stop.
I already tried withColumn, a regex, and even substring_index, but I got confused since they are index-based and I can't rely on a fixed index here.
How can I fix this?

start from the last character, which is ')', ignore it, and
extract the integer until I reach a space
This can be done with the regex pattern .*?(\\d+)\\)$: \\)$ matches the ) at the end of the string, (\\d+) captures the digits just before it, and regexp_extract pulls the captured group out into a new column. Note that .*? matches lazily (due to the ?) until the pattern (\\d+)\\)$ is found:
df.withColumn("id", regexp_extract($"city", ".*?(\\d+)\\)$", 1)).show
+--------------------+----------+
| city| id|
+--------------------+----------+
| ABDCJ - 123456)| 123456|
|AGDFHBAZPF - 1234...|1234567890|
+--------------------+----------+

import org.apache.spark.sql.functions._
val df = tempDF.withColumn("city_id", rtrim(element_at(split($"city", " - "), 2), ")"))

Assuming the input follows the format in your example: split on " - ", take the second element with element_at, and strip the trailing ) with rtrim. Note that element_at is available from Spark 2.4 onwards.
In order to get the number after the - without the trailing ), you can execute the following:
split(" - ")(1).dropRight(1)
The above splits by the - sign, takes the second element (i.e. the number), and removes the last character (the )).
You can create a UDF that executes the above command and create a new column using withColumn, as sketched below.
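A minimal sketch of that UDF approach (assuming a DataFrame df with the city column from the question):
import org.apache.spark.sql.functions.{col, udf}
// plain Scala string logic wrapped in a UDF
val extractId = udf { (city: String) => city.split(" - ")(1).dropRight(1) }
val withId = df.withColumn("city_id", extractId(col("city")))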

I would go for regexp_extract, but there are many alternatives. You can also do this using two splits:
df
  .withColumn("id",
    split(
      split($"city", " - ")(1), "\\)"
    )(0)
  )
First, split by - and take the second element, then split by ) and take the first element.
Another alternative: split by - and then drop the trailing ):
df
  .withColumn("id",
    reverse(
      substring(
        reverse(split($"city", " - ")(1)),
        2,
        Int.MaxValue
      )
    )
  )

You can also use two regexp_replace calls.
scala> val df = Seq(("ABDCJ - 123456)"),("AGDFHBAZPF - 1234567890)")).toDF("cityid")
df: org.apache.spark.sql.DataFrame = [cityid: string]
scala> df.withColumn("id",regexp_replace(regexp_replace('cityid,""".*- """,""),"""\)""","")).show(false)
+------------------------+----------+
|cityid |id |
+------------------------+----------+
|ABDCJ - 123456) |123456 |
|AGDFHBAZPF - 1234567890)|1234567890|
+------------------------+----------+
Since the id seems to be an integer, you can cast it to long like this:
scala> val df2 = df.withColumn("id",regexp_replace(regexp_replace('cityid,""".*- """,""),"""\)""","").cast("long"))
df2: org.apache.spark.sql.DataFrame = [cityid: string, id: bigint]
scala> df2.show(false)
+------------------------+----------+
|cityid |id |
+------------------------+----------+
|ABDCJ - 123456) |123456 |
|AGDFHBAZPF - 1234567890)|1234567890|
+------------------------+----------+
scala> df2.printSchema
root
|-- cityid: string (nullable = true)
|-- id: long (nullable = true)

Related

How to find position of substring in another column of dataframe using spark scala

I have a Spark scala DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position of subtext in text column?
Input data:
+---------------------------+---------+
| text | subtext |
+---------------------------+---------+
| Where is my string? | is |
| Hm, this one is different | on |
+---------------------------+---------+
Expected output:
+---------------------------+---------+----------+
| text | subtext | position |
+---------------------------+---------+----------+
| Where is my string? | is | 6 |
| Hm, this one is different | on | 9 |
+---------------------------+---------+----------+
Note: I can do this using a static text/regex without issue, but I have not been able to find any resources on doing this with a row-specific text/regex. I found an answer that works with PySpark; I am looking for a similar solution in Scala:
How to find position of substring column in another column using PySpark?
This works:
import org.apache.spark.sql.functions._
val df = Seq(
  ("beatles", "hey jude"),
  ("romeo", "eres mia")
).toDF("name", "hit_songs")
val df2 = df.withColumn("answer", locate("ju", col("hit_songs"), pos=1))
df2.show(false)
locate returns the position of the first match at or after the pos you specify, not all occurrences.
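For reference, this should give something like the following (positions are 1-based; 0 means the substring was not found):
+-------+---------+------+
|name   |hit_songs|answer|
+-------+---------+------+
|beatles|hey jude |5     |
|romeo  |eres mia |0     |
+-------+---------+------+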
UPDATE
You can adapt this to use a col instead of a literal, that goes without saying. However, passing a Column as the search string to locate does not work. E.g.
val df2 = df.withColumn("answer", locate(col("search_string"), col("hit_songs"), pos=1))
this does not compile; locate expects the substring argument as a String, and you cannot simply cast the search_string Column to a String.
So, this is what you need: a UDF reverting to Scala functions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import spark.implicits._
val df = Seq(("beatles", "hey jude", "ju"), ("romeo", "eres mia", "es") ).toDF("name", "hit_songs", "search_string")
def ggg = udf((c1: String, c2: String) => {
c2.indexOf(c1)
} )
df.withColumn("value",ggg(df("search_string") ,df("hit_songs"))).show(false)
Note that you may want to add 1 to the result, since indexOf is 0-based while locate is 1-based.
Interesting to note the contrast with this question and answer: Pass dataframe column name as parameter to the function using scala?
In any event:
df.withColumn("answer", locate(df("search_string"), col("hit_songs"), pos=1))
does not work.
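As an alternative to the UDF, it may also be possible to drop down to a SQL expression, since the built-in locate SQL function accepts column arguments (a sketch, with functions._ already imported; not verified on all Spark versions):
// locate(substr, str, pos) in SQL form, referencing both columns by name
df.withColumn("answer", expr("locate(search_string, hit_songs, 1)")).show(false)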

Spark - Replace first occurrence in a string

I want to use the replaceFirst() function in spark scala sql.
or
Is it possible to use the replaceFirst() function in spark scala dataframe?
Is this possible without using a UDF?
The behaviour I want to replicate is:
println("abcdefgbchijkl".replaceFirst("bc","**BC**"))
// a**BC**defgbchijkl
However, this String method cannot be applied to a DataFrame Column:
var test0 = Seq("abcdefgbchijkl").toDF("col0")
test0
  .select(col("col0").replaceFirst("bc","**BC**"))
  .show(false)
/*
<console>:230: error: value replaceFirst is not a member of org.apache.spark.sql.Column
.select(col("col0").replaceFirst("bc","**BC**"))
*/
Also, I don't know how to use it in SQL form:
%sql
-- How to use replaceFirst()
select replaceFirst()
Replacing the first occurrence isn't something I can see supported out of the box by Spark, but it is possible by combining a few functions:
Spark >= 3.0.0
import org.apache.spark.sql.functions.{array_join, col, split}
val test0 = Seq("abcdefgbchijkl").toDF("col0") // replaced `var` with `val`
val stringToReplace = "bc"
val replacement = "**BC**"
test0
  // create a temporary column, splitting the string by the first occurrence of `bc`
  .withColumn("temp", split(col("col0"), stringToReplace, 2))
  // recombine the strings before and after `bc` with the desired replacement
  .withColumn("col0", array_join(col("temp"), replacement))
  // we no longer need this `temp` column
  .drop(col("temp"))
  .show(false)
gives:
+------------------+
|col0 |
+------------------+
|a**BC**defgbchijkl|
+------------------+
For (spark) SQL:
-- recombine the strings before and after `bc` with the desired replacement
SELECT tempr[0] || "**BC**" || tempr[1] AS col0
FROM (
  -- create a temporary column, splitting the string by the first occurrence of `bc`
  SELECT split(col0, "bc", 2) AS tempr
  FROM (
    SELECT 'abcdefgbchijkl' AS col0
  )
)
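As a side note, the temporary column isn't strictly necessary; the same idea can be written in one step (a compact variant of the code above):
test0
  // split on the first occurrence and rejoin with the replacement in one expression
  .withColumn("col0", array_join(split(col("col0"), stringToReplace, 2), replacement))
  .show(false)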
Spark < 3.0.0 (pre 2020, tested using Spark 2.4.5)
val test0 = Seq("abcdefgbchijkl").toDF("col0")
val stringToReplace = "bc"
val replacement = "**BC**"
val splitFirst = udf { (s: String) => s.split(stringToReplace, 2) }
spark.udf.register("splitFirst", splitFirst) // if you're using Spark SQL
test0
  // create a temporary column, splitting the string by the first occurrence of `bc`
  .withColumn("temp", splitFirst(col("col0")))
  // recombine the strings before and after `bc` with the desired replacement
  .withColumn("col0", array_join(col("temp"), replacement))
  // we no longer need this `temp` column
  .drop(col("temp"))
  .show(false)
gives:
+------------------+
|col0 |
+------------------+
|a**BC**defgbchijkl|
+------------------+
For (spark) SQL:
-- recombine the strings before and after `bc` with the desired replacement
SELECT tempr[0] || "**BC**" || tempr[1] AS col0
FROM (
  -- create a temporary column, splitting the string by the first occurrence of `bc`
  SELECT splitFirst(col0) AS tempr -- `splitFirst` was registered above
  FROM (
    SELECT 'abcdefgbchijkl' AS col0
  )
)

scala - how to substring column names after the last dot?

After exploding a nested structure I have a DataFrame with column names like this:
sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3
When performing a select I'm getting the error:
cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]
How should I select from the DataFrame so the column names are parsed correctly?
I've tried the following: the substrings after the dots are extracted successfully, but since I also have columns without dots, like date, their names get removed completely.
var salesDf_new = salesDf
for (col <- salesDf.columns) {
  salesDf_new = salesDf_new.withColumnRenamed(col, StringUtils.substringAfterLast(col, "."))
}
I want to end up with just metric1, metric2, metric3.
You can use backticks to select columns whose names include periods.
val df = (1 to 1000).toDF("column.a.b")
df.printSchema
// root
// |-- column.a.b: integer (nullable = false)
df.select("`column.a.b`")
Also, you can rename them easily like this. Basically starting with your current DataFrame, keep updating it with a new column name for each field and return the final result.
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replace(".", "_"))
)
EDIT: Get the last component
To rename with just the last name component, this regex will work:
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(".+\\.([^.]+)$", "$1"))
)
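For reference, this pattern leaves dot-free names like date untouched, because it only matches names containing at least one dot:
"date".replaceAll(".+\\.([^.]+)$", "$1")                     // "date"
"sales_data.type.metric2".replaceAll(".+\\.([^.]+)$", "$1")  // "metric2"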
EDIT 2: Get the last two components
This is a little more complicated, and there might be a cleaner way to write this, but here is a way that works:
val pattern = (
  ".*?" +          // lazily match leading chars so we ignore the bits we don't want
  "([^.]+\\.)?" +  // optional second-to-last component
  "([^.]+)$"       // last component
)
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(pattern, "$1$2"))
)
df2.printSchema
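For the single-column example above, this should print something like:
root
 |-- a.b: integer (nullable = false)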

how to clean output from symbols: plus minus pipe ( | - + )

Using this statement:
scala> val intento2 = sql("SELECT _CreationDate FROM tablaTemporal" )
intento2: org.apache.spark.sql.DataFrame = [_CreationDate: string]
scala> intento2.show(5, false)
I receive this output:
+-----------------------+
|_CreationDate |
+-----------------------+
|2008-07-31T00:00:00.000|
|2008-07-31T14:22:31.287|
|2008-07-31T14:22:31.287|
|2008-07-31T14:22:31.287|
|2008-07-31T14:22:31.317|
+-----------------------+
only showing top 5 rows
but the result I need is the same data without the table-border symbols added by Scala/Spark:
2005-07-31T14:20:19.239
2007-07-31T14:20:31.287
2009-07-31T14:21:33.287
2005-07-31T14:23:36.287
2009-07-31T14:20:38.317
How can I print clean output like the above?
Here, you're printing the dataframe.
What you want to do is print each record of the dataframe:
intento2.collect().map(_.getString(0)).foreach(println)
collect transforms the dataframe into an array of Row objects.
Then we map each Row to its first element with getString(0); in fact the Row contains only one element, the date.
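One caveat: collect() pulls every row to the driver, so for a large result you may prefer to print only a sample, for example:
// print just the first 5 dates instead of collecting everything
intento2.take(5).map(_.getString(0)).foreach(println)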

How to join datasets with same columns and select one?

I have two Spark DataFrames which I am joining and selecting from afterwards. I want to select a specific column of one of the DataFrames, but the same column name exists in the other one, so I am getting an exception about an ambiguous column.
I have tried this:
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left").select($"d1.columnName")
and this:
d1.join(d2, d1("id") === d2("id"), "left").select($"d1.columnName")
but it does not work.
Which Spark version are you using? Can you share a sample of your DataFrames?
try this:
val d2prim = d2.withColumnRenamed("columnName", "d2_columnName")
d1.join(d2prim, Seq("id"), "left_outer").select("columnName")
I have two dataframes
val d1 = spark.range(3).withColumn("columnName", lit("d1"))
scala> d1.printSchema
root
|-- id: long (nullable = false)
|-- columnName: string (nullable = false)
val d2 = spark.range(3).withColumn("columnName", lit("d2"))
scala> d2.printSchema
root
|-- id: long (nullable = false)
|-- columnName: string (nullable = false)
which I am joining and selecting afterwards.
I want to select a specific column of one of the Dataframes. But the same column name exists in the other one.
val q1 = d1.as("d1")
  .join(d2.as("d2"), Seq("id"), "left")
  .select("d1.columnName")
scala> q1.show
+----------+
|columnName|
+----------+
| d1|
| d1|
| d1|
+----------+
As you can see it just works.
So, why did it not work for you? Let's analyze each.
// you started very well
d1.as("d1")
  // but here you used $ to reference a column to join on
  // with column references by their aliases
  // that won't work
  .join(d2.as("d2"), $"d1.id" === $"d2.id", "left")
  // same here
  // $ + aliased columns won't work
  .select($"d1.columnName")
PROTIP: Use d1("columnName") to reference a specific column in a dataframe.
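In other words, something along these lines should also select the column unambiguously (a sketch reusing the DataFrames defined above):
d1.join(d2, d1("id") === d2("id"), "left")
  .select(d1("columnName"))   // reference the column through its parent DataFrame
  .show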
The other query was very close to being fine, but...
d1.join(d2, d1("id") === d2("id"), "left") // <-- so far so good!
  .select($"d1.columnName")                // <-- that's the issue, i.e. $ + aliased column
This happens because when Spark combines the columns from the two DataFrames it doesn't do any automatic renaming for you. You just need to rename one of the columns before joining; Spark provides withColumnRenamed for this. After the join you can drop the renamed column.
val df2join = df2.withColumnRenamed("id", "join_id")
val joined = df1.join(df2join, $"id" === $"join_id", "left").drop("join_id")