Removing trailing tabs from a string column in a Spark DataFrame - Scala

I need to clean a column of a DataFrame which contains trailing whitespace. Something like this:
'17063256 '
'17403492 '
'17390052 '
First, I tried to remove the whitespace using trim:
df.withColumn("col1_cleansed", trim(col("col1")))
Then I thought it may be trailing "tabs", so I tried as well with:
df.withColumn("col1_cleansed", regexp_replace(col("col1"), "\t", ""))
However, neither of these two solutions seems to work.
What is the correct way to remove "tab" characters from a string column in Spark?

Methods trim and rtrim seem to have trouble handling general whitespace such as tabs. To remove trailing whitespace, consider using regexp_replace with the regex pattern \\s+$ (where '$' represents the end of the string), as shown below:
val df = Seq(
  "17063256 ",   // space
  "17403492\t",  // tab
  "17390052 \t"  // space + tab
).toDF("c1")

df.withColumn("c1_trimmed", regexp_replace($"c1", "\\s+$", "")).show
// Output (prettified; the trailing whitespace in c1 is not visible here)
// +----------+----------+
// |        c1|c1_trimmed|
// +----------+----------+
// | 17063256 |  17063256|
// | 17403492 |  17403492|
// |17390052  |  17390052|
// +----------+----------+

Try the UDF below and adapt it to your needs.
val normalize = udf((in: String) => {
  import java.text.Normalizer.{normalize => jnormalize, _}
  val cleaned = in.trim.toLowerCase
  val normalized = jnormalize(cleaned, Form.NFD)
    .replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{IsM}\\p{IsLm}\\p{IsSk}]+", "")
  normalized.replaceAll("'s", "")
    .replaceAll("ß", "ss")
    .replaceAll("ø", "o")
    .replaceAll("[^a-zA-Z0-9-]+", " ")
})
df.withColumn("col1_cleansed", normalize(col("col1")))

You can use regexp_replace and replace the whitespace with "":
df.withColumn("new", regexp_replace($"id", " ",""))
.show(false)
Output:
+------------+----------+
|id          |new       |
+------------+----------+
|'17063256 ' |'17063256'|
|'17403492 ' |'17403492'|
|'17390052  '|'17390052'|
+------------+----------+
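Note that replacing " " removes every space, including any in the middle of the value, and does not touch tabs. A hedged sketch in the same style that strips only trailing spaces and tabs:

import org.apache.spark.sql.functions.regexp_replace

// [ \t]+$ matches any run of spaces/tabs at the end of the string only.
df.withColumn("new", regexp_replace($"id", "[ \\t]+$", ""))
  .show(false)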

Another way to look at the problem: extract only the required portion from the column. This will work if you are expecting only alphanumeric values and nothing else.
Feel free to modify it to accept numbers only if required (see the sketch after the snippet below).
df.withColumn("cleansed_col",regexp_extract(col("input"),"[a-z0-9]+",0))

Related

pyspark regexp_replace replacing multiple values in a column

I have the URL https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r in a dataset. I want to remove https:// at the start of the string and \r at the end of the string.
Creating a dataframe to replicate the issue:
c = spark.createDataFrame([('https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r',)], ['str'])
I tried the regexp_replace below with a pipe in the pattern, but it is not working as expected.
c.select(F.regexp_replace('str', 'https:// | \\r', '')).first()
Actual output:
www.youcuomizei.comEquaion-Kid-Backack-Peronalized301793
Expected output:
www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793
the "backslash"r (\r) is not showing in your original spark.createDataFrame object because you have to escape it. so your spark.createDataFrame should be. please note the double backslashes
c = spark.createDataFrame([("https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\\r",)], ['str'])
which will give this output:
+------------------------------------------------------------------------------+
|str                                                                           |
+------------------------------------------------------------------------------+
|https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r|
+------------------------------------------------------------------------------+
Your regex https://|[\\r] will not remove the \r. The regex should be:
c = (c
    .withColumn("str", F.regexp_replace("str", "https://|[\\\\]r", ""))
)
which will give this output:
+--------------------------------------------------------------------+
|str                                                                 |
+--------------------------------------------------------------------+
|www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793|
+--------------------------------------------------------------------+
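If the trailing \r in your data is an actual carriage-return control character (as in your original, un-escaped createDataFrame) rather than the two literal characters backslash and r, a hedged sketch of a pattern that strips either form, plus the leading https://, could look like this:

from pyspark.sql import functions as F

# Strip a leading "https://" and either a literal "\r" or a real carriage return at the end.
c = c.withColumn("str", F.regexp_replace("str", r"^https://|(\\r|\r)$", ""))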

Substring with delimiters with Spark Scala

I am new to Spark and Scala and I want to ask a question:
I have a city field in my database (which I have already loaded into a DataFrame) with this pattern: "someLetters" + " - " + id + ')'.
Example :
ABDCJ - 123456)
AGDFHBAZPF - 1234567890)
The size of the field is not fixed and the id here can be an integer of 6 or 10 digits. So, what I want to do is extract that id into a new column called city_id.
Concretely, I want to start from the last character, which is ')', ignore it, and extract the integer until I find a space, then stop.
I already tried to do this using withColumn, a regex, or even substring_index, but I got confused since they are based on an index, which I can't use here.
How can I fix this?
start from the last character, which is ')', ignore it, and
extract the integer until I find a space
This can be done with the regex pattern .*?(\\d+)\\)$, where \\)$ matches the ) at the end of the string and (\\d+) captures the digits, which are then extracted as a new column. Note that .*? matches lazily (due to the ?) until the pattern (\\d+)\\)$ is found:
df.withColumn("id", regexp_extract($"city", ".*?(\\d+)\\)$", 1)).show
+--------------------+----------+
|                city|        id|
+--------------------+----------+
|     ABDCJ - 123456)|    123456|
|AGDFHBAZPF - 1234...|1234567890|
+--------------------+----------+
import org.apache.spark.sql.functions._
val df = tempDF.withColumn("city_id", rtrim(element_at(split($"city", " - "), 2), ")"))
Assuming that the input is in the format in your example.
In order to get the number after the - without the trailing ) you can execute the following command:
split(" - ")(1).dropRight(1)
The above splits by the - sign and takes the second element (i.e. the number), then removes the last char (the )).
You can create a UDF which executes the above command and create a new column using the withColumn command, as sketched below.
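A minimal sketch of such a UDF, assuming the city values are non-null and always contain " - " followed by the id and a closing ):

import org.apache.spark.sql.functions.{col, udf}

// Take the part after " - " and drop the trailing ")".
val extractId = udf((city: String) => city.split(" - ")(1).dropRight(1))

df.withColumn("city_id", extractId(col("city")))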
I would go for regexp_extract, but there are many alternatives. You can also do this using 2 splits:
df
  .withColumn("id",
    split(
      split($"city", " - ")(1), "\\)"
    )(0)
  )
First, you split by - and take the second element, then split by ) and take the first element.
Another alternative: split by - and then drop the trailing ):
df
  .withColumn("id",
    reverse(
      substring(
        reverse(split($"city", " - ")(1)),
        2,
        Int.MaxValue
      )
    )
  )
You can also use 2 regexp_replace functions.
scala> val df = Seq(("ABDCJ - 123456)"),("AGDFHBAZPF - 1234567890)")).toDF("cityid")
df: org.apache.spark.sql.DataFrame = [cityid: string]
scala> df.withColumn("id",regexp_replace(regexp_replace('cityid,""".*- """,""),"""\)""","")).show(false)
+------------------------+----------+
|cityid                  |id        |
+------------------------+----------+
|ABDCJ - 123456)         |123456    |
|AGDFHBAZPF - 1234567890)|1234567890|
+------------------------+----------+
scala>
Since the id seems to be an integer, you can cast it to long:
scala> val df2 = df.withColumn("id",regexp_replace(regexp_replace('cityid,""".*- """,""),"""\)""","").cast("long"))
df2: org.apache.spark.sql.DataFrame = [cityid: string, id: bigint]
scala> df2.show(false)
+------------------------+----------+
|cityid                  |id        |
+------------------------+----------+
|ABDCJ - 123456)         |123456    |
|AGDFHBAZPF - 1234567890)|1234567890|
+------------------------+----------+
scala> df2.printSchema
root
|-- cityid: string (nullable = true)
|-- id: long (nullable = true)
scala>

scala - how to substring column names after the last dot?

After exploding a nested structure I have a DataFrame with column names like this:
sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3
When performing a select I'm getting the error:
cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]
How should I select from the DataFrame so the column names are parsed correctly?
I've tried the following: the substrings after the dots are extracted successfully, but since I also have columns without dots, like date, their names get removed completely.
var salesDf_new = salesDf
for (col <- salesDf.columns) {
  salesDf_new = salesDf_new.withColumnRenamed(col, StringUtils.substringAfterLast(col, "."))
}
I want to leave just metric1, metric2, metric3
You can use backticks to select columns whose names include periods.
val df = (1 to 1000).toDF("column.a.b")
df.printSchema
// root
// |-- column.a.b: integer (nullable = false)
df.select("`column.a.b`")
Also, you can rename them easily like this. Basically starting with your current DataFrame, keep updating it with a new column name for each field and return the final result.
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replace(".", "_"))
)
EDIT: Get the last component
To rename with just the last name component, this regex will work:
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(".+\\.([^.]+)$", "$1"))
)
EDIT 2: Get the last two components
This is a little more complicated, and there might be a cleaner way to write this, but here is a way that works:
val pattern = (
  ".*?" +         // Lazily match leading chars so we ignore the bits we don't want
  "([^.]+\\.)?" + // Optional second-to-last group
  "([^.]+)$"      // Last group
)
val df2 = df.columns.foldLeft(df)(
  (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(pattern, "$1$2"))
)
df2.printSchema

How to split using multi-char separator with pipe?

I am trying to split a string column of a dataframe in Spark based on the delimiter ":|:|:"
Input:
TEST:|:|:51:|:|:PHT054008056
Test code:
dataframe1
.withColumn("splitColumn", split(col("testcolumn"), ":|:|:"))
Result:
+------------------------------+
|splitColumn                   |
+------------------------------+
|[TEST, |, |, 51, |, |, P]     |
+------------------------------+
Test code:
dataframe1
.withColumn("part1", split(col("testcolumn"), ":|:|:").getItem(0))
.withColumn("part2", split(col("testcolumn"), ":|:|:").getItem(3))
.withColumn("part3", split(col("testcolumn"), ":|:|:").getItem(6))
part1 and part2 work correctly.
part3 only has 2 characters and the rest of the string is truncated.
part3:
P
I want to get the entire part3 string.
Any help is appreciated.
You're almost there – just need to escape | within your delimiter, as follows:
val df = Seq(
  (1, "TEST:|:|:51:|:|:PHT054008056"),
  (2, "TEST:|:|:52:|:|:PHT053007057")
).toDF("id", "testcolumn")
df.withColumn("part3", split($"testcolumn", ":\\|:\\|:").getItem(2)).show
// +---+--------------------+------------+
// | id|          testcolumn|       part3|
// +---+--------------------+------------+
// |  1|TEST:|:|:51:|:|:P...|PHT054008056|
// |  2|TEST:|:|:52:|:|:P...|PHT053007057|
// +---+--------------------+------------+
[UPDATE]
You could also use triple quotes for the delimiter, in which case you still have to escape | to indicate it's a literal pipe (not alternation in regex):
df.withColumn("part3", split($"testcolumn", """:\|:\|:""").getItem(2)).show
Note that with triple quotes, you need only a single escape character \, whereas without the triple quotes the escape character itself needs to be escaped (hence \\).

Making an SQL request on columns containing a dot

I have a dataframe whose column names contain ".".
I would like to filter the columns to get the column names containing "." and then run a select on them. Any help will be appreciated.
Here is the dataset:
//dataset
time.1,col.1,col.2,col.3
2015-12-06 12:40:00,2,2,3
2015-12-07 12:41:35,3,3,4
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df1 = spark.read.option("inferSchema", "true").option("header", "true").csv("C:/Users/mhattabi/Desktop/dataTestCsvFile/dataTest2.txt")
val columnContainingdots=df1.schema.fieldNames.filter(p=>p.contains('.'))
df1.select(columnContainingdots)
Having dots in column names requires you to enclose the names in "`" characters. See the code below; this should serve your purpose.
val columnContainingDots = df1.schema.fieldNames.collect({
  // Since the column name contains a "." character, we must enclose it in backticks;
  // otherwise the DataFrame select will throw an exception.
  case column if column.contains('.') => s"`${column}`"
})
df1.select(columnContainingDots.head, columnContainingDots.tail:_*)