How to get a substring using patterns and replace quotes in a JSON value field using Scala?

I have a few JSON messages like:
{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}
{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}
I need to add double quotes on both sides of the column3 value and replace the double quotes inside the column3 value with single quotes, using Scala.

You have mentioned in a comment above:
I have a huge dataset in Kafka. I am trying to read from Kafka and write to HDFS through Spark using Scala. I am using a JSON parser but am unable to parse because of the column3 issue, so I need to manipulate the message to turn it into valid JSON.
So you must have a collection of malformed JSONs as in the question. I have created a list as:
val kafkaMsg = List("""{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}""", """{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}""")
and you are reading it through Spark, so you must have an RDD like:
val rdd = sc.parallelize(kafkaMsg)
All you need is some parsing of the malformed JSON text to turn it into a valid JSON string:
val validJson = rdd.map(msg =>
  msg.replaceAll("[}\"{]", "")                         // strip braces and all double quotes
    .split(",")                                        // one element per key:value pair
    .map(_.split(":").mkString("\"", "\":\"", "\""))   // re-quote each key and value
    .mkString("{", ",", "}"))                          // rebuild the json object
validJson should be
{"column1":"abc","column2":"123","column3":"qwerty","column4":"abc123"}
{"column1":"defhj","column2":"45","column3":"asdfgh","column4":"def12d"}
You can create a dataframe from the validJson RDD as:
sqlContext.read.json(validJson).show(false)
which should give you
+-------+-------+-------+-------+
|column1|column2|column3|column4|
+-------+-------+-------+-------+
|abc |123 |qwerty |abc123 |
|defhj |45 |asdfgh |def12d |
+-------+-------+-------+-------+
Or you can adapt this as per your requirement.

Goal
Add double quotes on both sides of the column3 value and replace the double quotes inside the column3 value with single quotes, using Scala.
I would recommend using a regex because it gives you more flexibility.
Here is the solution:
val kafkaMsg = List("""{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}""", """{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}""", """{"column1":"defhj","column2":"45","column3":without-quotes,"column4":"def12d"}""")
val rdd = sc.parallelize(kafkaMsg)
val rePattern = """(^\{.*)("column3":)(.*)(,"column4":.*)""".r
val newRdd = rdd.map(r =>
  r match {
    case rePattern(start, col3, col3Value, end) =>
      start + col3 + '"' + col3Value.replaceAll("\"", "'") + '"' + end
    case _ => r
  }
)
newRdd.foreach(println)
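For the three sample messages above, the printed output should look like this (note that foreach runs on the executors, so on a cluster the println output ends up in the executor logs rather than on the driver console):
{"column1":"abc","column2":"123","column3":"qwe'r'ty","column4":"abc123"}
{"column1":"defhj","column2":"45","column3":"asd'f'gh","column4":"def12d"}
{"column1":"defhj","column2":"45","column3":"without-quotes","column4":"def12d"}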
Explanation:
The first and second statements initialize the RDD.
The third line defines the regex pattern. You may need to adjust it to your situation.
The regex produces 4 groups of values (whatever is inside () is a group):
the string starting with "{" and everything after it until we meet "column3":
"column3": itself
whatever comes after "column3": but before ,"column4":
everything from ,"column4": onward
I use these 4 groups in the next statement.
Iterate over your RDD, run each record against the regex, and change it: replace double quotes with single quotes and add opening/closing quotes. If there is no match, the original string is returned.
Because the regex was defined with 4 groups, I use 4 variables to map the matches:
case rePattern(start, col3, col3Value, end) =>
Note: the code doesn't check whether the value actually contains double quotes; it just applies the update. You can add validation on your own if you need it.
Show the results.
Important notes:
The regex I used is strictly tied to your source string format. Keep in mind that you have JSON, so the order of the keys is not guaranteed. As a result, "column4" (which is used to mark the end of the column3 value) may come before "column3".
If you use a comma as the key/value delimiter, make sure it doesn't appear inside the column3 value.
Bottom line: you need to adjust my regex to properly identify the end of the column3 value.
Hope it helps.

Related

In Spark Scala, how to create a column with substring() using locate() as a parameter?

I have a dataset that is like the following:
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
I want to create a column that will be a string containing only the values after id. The result would be something like:
val df_result = Seq(("samb id 12",12), ("car id 13",13), ("lxu id 88",88)).toDF("list", "id_value")
For that, I am trying to use substring. For the parameter of the starting position to extract the substring, I am trying to use locate. But it gives me an error saying that it should be an Int and not a Column type.
What I am trying is like:
df
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
The error I get is:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
^
How can I fix this and continue using locate() as a parameter?
UPDATE
Updating to give an example in which @wBob's answer doesn't work for my real-world data: my data is indeed a bit more complicated than the examples above.
It is something like this:
val df = Seq(":option car, lorem :ipsum: :ison, ID R21234, llor ip", "lst ID X49329xas ipsum :ion: ip_s-").toDF("list")
The values are very long strings that don't have a specific pattern.
Somewhere in the string there is always a part written as ID XXXXX. The XXXXX varies, but it is always the same size (5 characters) and always comes after ID .
I have not been able to use either split or regexp_extract to get something matching this pattern.
It is not clear if you want the third item or the first number from the list, but here are a couple of examples which should help:
// Assign sample data to dataframe
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
df
.withColumn("t1", split($"list", "\\ ")(2))
.withColumn("t2", regexp_extract($"list", "\\d+", 0))
.withColumn("t3", regexp_extract($"list", "(id )(\\d+)", 2))
.withColumn("t4", regexp_extract($"list", "ID [A-Z](\\d{5})", 1))
.show()
You can use functions like split and regexp_extract with withColumn to create new columns based on existing values. split breaks the list value into an array based on the delimiter you pass in; I have used a space here, escaped with two backslashes. The array is zero-based, hence specifying 2 gets the third item in the array. regexp_extract uses regular expressions to extract values from strings; here I've used \\d, which represents a digit, and +, which matches it one or more times. The third column, t3, again uses regexp_extract with a similar regex, but uses brackets to group up sections and 2 to get the second group from the regex, i.e. the (\\d+). NB I'm using the extra backslash in the regex to escape the backslash used in \d.
My results:
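For the simple sample dataframe above, the output should look roughly like this (t4 is empty here because the uppercase ID followed by a letter and five digits only appears in the more complicated examples from the update):
+----------+--+--+--+--+
|      list|t1|t2|t3|t4|
+----------+--+--+--+--+
|samb id 12|12|12|12|  |
| car id 13|13|13|13|  |
| lxu id 88|88|88|88|  |
+----------+--+--+--+--+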
If your real data is more complicated please post a few simple examples where this code does not work and explain why.
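Separately, if you specifically want to keep using locate() as the position argument, one possible workaround (a sketch, not part of the answer above) is Column.substr, which accepts Column arguments, unlike the substring() function whose position and length parameters must be Ints:
import org.apache.spark.sql.functions.{lit, locate}
df
  .withColumn("id_value", $"list".substr(locate("id", $"list") + 3, lit(2))) // locate is 1-based; "+ 3" skips over "id " to the digits; add .cast("int") if you want a numeric column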

Spark: dropping multiple columns using regex

In a Spark (2.3.0) project using Scala, I would like to drop multiple columns using a regex. I tried using colRegex, but without success:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
// Hoping to get columns Array(id, a, b)
df_in.columns
// Getting Array(id, a, a_out, b, b_out)
On the other hand, the mechanism seems to work with select:
df.select(df.colRegex("`.*_(in|out)`")).columns
// Getting Array(a_in, a_out, b_in, b_out)
Several things are not clear to me:
what is this backquote syntax in the regex?
colRegex returns a Column: how can it actually represent several columns in the 2nd example?
can I combine drop and colRegex or do I need some workaround?
If you check the Spark code of the colRegex method, it expects regexes to be passed in the format below:
/** the column name pattern in quoted regex without qualifier */
val escapedIdentifier = "`(.+)`".r
/** the column name pattern in quoted regex with qualifier */
val qualifiedEscapedIdentifier = ("(.+)" + """.""" + "`(.+)`").r
Backticks (`) are necessary to enclose your regex; otherwise the above patterns will not match your input.
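To illustrate (a small sketch): with the backticks the argument is parsed as a quoted regex over column names; without them colRegex falls back to resolving it as an ordinary column name.
df.select(df.colRegex("`.*_(in|out)`")).columns // Array(a_in, a_out, b_in, b_out)
// df.select(df.colRegex(".*_(in|out)")) // no backticks: looked up as a literal column name, not a regex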
You can instead compute the matching column names yourself and drop them, as shown below:
val df = Seq(("id","a_in","a_out","b_in","b_out"))
.toDF("id","a_in","a_out","b_in","b_out")
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.drop(df.colRegex("`.*_(in|out)`"))
val columnsToDrop = df_in.columns.filter(p => p.matches(".*_(in|out)$")).toSeq // collect all column names matching the pattern
val final_df_in = df_in.drop(columnsToDrop:_*) // drop every column that matches your criteria
In addition to the workaround proposed by Waqar Ahmed and kavetiraviteja (accepted answer), here is another possibility based on select with some negative regex magic. More concise, but harder to read for non-regex-gurus...
val df_in = df
.withColumnRenamed("a_in","a")
.withColumnRenamed("b_in","b")
.select(df.colRegex("`^(?!.*_(in|out)$).*$`")) // regex with negative lookahead: keep only columns not ending in _in or _out

Capture and write string inside of dataframe using foreach row

I am trying to capture and write a string value after substituting contents obtained from specific fields of each row of a dataframe, using Scala. But since it is deployed on a cluster I am not able to capture any records. Can anyone provide a solution?
Assuming TEST_DB.finalresult has 2 fields input1 and input2:
val finalResult = spark.sql("select * from TEST_DB.finalresult")
finalResult.foreach { row =>
  val param1 = row.getAs[String]("input1")
  val param2 = row.getAs[String]("input2")
  val string = """new values of param1 and param2 are -> """ + param1 + """,""" + param2
  // how to append the modified string to a csv file continuously for each microbatch in hdfs ??
}
In your code you create the wanted string variable, but it is not saved anywhere, hence you can't see the result.
You could potentially open the target CSV file and append the new string in each foreach execution, but I'd like to propose a different solution.
If you can, try to always use the built-in functionality of Spark, since it is (usually) more optimised and better at handling null inputs. You can achieve the same with:
import org.apache.spark.sql.functions.{lit, concat, col}
val modifiedFinalResult = finalResult.select(
  concat(
    lit("new values of param1 and param2 are -> "),
    col("input1"),
    lit(","),
    col("input2")
  ).alias("string")
)
In the variable modifiedFinalResult you will have a Spark dataframe with a single column named string, which represents exactly the same output as the string variable in your code. Afterwards you can save the dataframe directly as a single CSV file (using repartition to force a single output file):
modifiedFinalResult.repartition(1).write.format("csv").save("path/to/your/csv/output")
PS: Also a suggestion for the future, try to avoid naming variables after data types.
UPDATE: Fixed the empty-rows issue by using concat_ws instead of concat and applying coalesce to each field. It seems some values which were null were turning the entire concatenated string into null after the transformation. Nevertheless this solution works for now!
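A sketch of what that adjusted version might look like (same assumed column names as before; concat_ws skips null inputs instead of turning the whole concatenated value into null):
import org.apache.spark.sql.functions.{coalesce, col, concat_ws, lit}
val modifiedFinalResult = finalResult.select(
  concat_ws("",
    lit("new values of param1 and param2 are -> "),
    coalesce(col("input1"), lit("")),
    lit(","),
    coalesce(col("input2"), lit(""))
  ).alias("string")
)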

How to conditionally remove the first two characters from a column

I have the below data of some phone records, and I want to remove the first two digits from each record when they are a country code. What is the way by which I can do this using Scala, Spark, or Hive?
phone
|917799423934|
|019331224595|
| 8981251522|
|917271767899|
I'd like the result to be:
phone
|7799423934|
|9331224595|
|8981251522|
|7271767899|
How can we remove the prefix 91 or 01 from each row of this column?
Since the phone number length can differ, a construction like this can be used (Scala):
df.withColumn("phone", expr("substring(phone,3,length(phone)-2)"))
Using regular expressions
Use regexp_replace (add more country-code prefixes if necessary):
select regexp_replace(trim(phone),'^(91|01)','') as phone --removes leading 91, 01 and all leading and trailing spaces
from table;
The same using regexp_extract:
select regexp_extract(trim(phone),'^(91|01)?(\\d+)',2) as phone --removes leading and trailing spaces, extract numbers except first (91 or 01)
from table;
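If you are doing this in Spark rather than Hive, the same regexp_replace idea carries over to the DataFrame API (a sketch, assuming phone is a string column):
import org.apache.spark.sql.functions.{col, regexp_replace, trim}
df.withColumn("phone", regexp_replace(trim(col("phone")), "^(91|01)", "")) // strips a leading 91 or 01 after trimming spaces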
An improvement, I believe; I would prefer a list with contains or the equivalent, but here goes:
import org.apache.spark.sql.functions._
case class Tel(telnum: String)
val ds = Seq(
Tel("917799423934"),
Tel("019331224595"),
Tel("8981251522"),
Tel("+4553")).toDS()
val ds2 = ds.withColumn("new_telnum", when(expr("substring(telnum,1,2)") === "91" || expr("substring(telnum,1,2)") === "01", expr("substring(telnum,3,length(telnum)-2)")).otherwise(col("telnum")))
ds2.show
returns:
+------------+----------+
| telnum|new_telnum|
+------------+----------+
|917799423934|7799423934|
|019331224595|9331224595|
| 8981251522|8981251522|
| +4553| +4553|
+------------+----------+
We may need to think about the +, but nothing was stated about it.
If they are strings then for a Hive query:
sql("select substring(phone,3) from table").show

Spark SQL change format of the number

After the show command, Spark prints the following:
+-----------------------+---------------------------+
|NameColumn |NumberColumn |
+-----------------------+---------------------------+
|name |4.3E-5 |
+-----------------------+---------------------------+
Is there a way to change NumberColumn format to something like 0.000043?
You can use the format_number function as
import org.apache.spark.sql.functions.format_number
df.withColumn("NumberColumn", format_number($"NumberColumn", 5))
Here 5 is the number of decimal places you want to show.
As you can see in the API documentation, the format_number function returns a string column:
format_number(Column x, int d)
Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
If you don't want the , you can call the regexp_replace function, which is defined as
regexp_replace(Column e, String pattern, String replacement)
Replace all substrings of the specified string value that match regexp with rep.
and use it as
import org.apache.spark.sql.functions.regexp_replace
df.withColumn("NumberColumn", regexp_replace(format_number($"NumberColumn", 5), ",", ""))
Thus the comma (,) will be removed for large numbers.
You can use cast operation as below:
val df = sc.parallelize(Seq(0.000043)).toDF("num")
df.createOrReplaceTempView("data")
spark.sql("select CAST (num as DECIMAL(8,6)) from data")
Adjust the precision and scale accordingly.
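The same cast can be written with the DataFrame API (a sketch; the precision and scale of 8 and 6 are the same assumption as above):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType
df.withColumn("num", col("num").cast(DecimalType(8, 6)))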
In newer versions of PySpark you can use the round() or bround() functions.
These functions return a numeric column and avoid the problem with ",".
It would be like:
df.withColumn("NumberColumn", bround("NumberColumn",5))
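In Scala the equivalent would be something like (a sketch):
import org.apache.spark.sql.functions.{bround, col}
df.withColumn("NumberColumn", bround(col("NumberColumn"), 5)) // rounds to 5 decimal places and keeps the column numeric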