Spark SQL change format of the number - scala

After the show command, Spark prints the following:
+-----------------------+---------------------------+
|NameColumn |NumberColumn |
+-----------------------+---------------------------+
|name |4.3E-5 |
+-----------------------+---------------------------+
Is there a way to change NumberColumn format to something like 0.000043?

You can use the format_number function:
import org.apache.spark.sql.functions.format_number
df.withColumn("NumberColumn", format_number($"NumberColumn", 5))
Here 5 is the number of decimal places to show. Note that 4.3E-5 rounded to 5 places prints as 0.00004, so use 6 to get 0.000043.
As the API documentation states, format_number returns a string column:
format_number(Column x, int d)
Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
If you don't need the comma (,) thousands separators, you can call the regexp_replace function, which is defined as
regexp_replace(Column e, String pattern, String replacement)
Replace all substrings of the specified string value that match regexp with rep.
and use it as
import org.apache.spark.sql.functions.regexp_replace
df.withColumn("NumberColumn", regexp_replace(format_number($"NumberColumn", 5), ",", ""))
This removes the comma (,) separators that format_number would otherwise insert into large numbers.
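To see what the decimal-places argument and the comma stripping do without starting a Spark session, here is a minimal plain-Python sketch. It is not Spark code: Python's built-in format only stands in for format_number, and the helper names format_fixed and strip_commas are made up for this illustration.

```python
def format_fixed(value, places):
    """Format a number with thousands separators and a fixed number of decimals."""
    return format(value, f",.{places}f")


def strip_commas(text):
    """Drop the thousands separators, similar in spirit to the regexp_replace call."""
    return text.replace(",", "")


# The value from the question, with 5 and then 6 decimal places.
print(format_fixed(4.3e-5, 5))
print(format_fixed(4.3e-5, 6))

# A large number: formatted with separators, then with the separators removed.
print(format_fixed(1234567.891, 2))
print(strip_commas(format_fixed(1234567.891, 2)))
```

As with format_number, the result here is a string, so it is only suitable for display and not for further arithmetic.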

You can use a cast operation, as below:
val df = sc.parallelize(Seq(0.000043)).toDF("num")
df.createOrReplaceTempView("data")
spark.sql("select CAST (num as DECIMAL(8,6)) from data")
Adjust the precision and scale as needed.
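As a rough illustration of what precision and scale mean in DECIMAL(8,6) (at most 8 digits in total, 6 of them after the decimal point), here is a plain-Python sketch using the decimal module. It is not Spark code; cast_decimal is a hypothetical helper, and returning None when the value does not fit is an assumption modelled on a non-ANSI cast producing null.

```python
from decimal import Decimal, ROUND_HALF_UP


def cast_decimal(value, precision, scale):
    """Return value with `scale` fractional digits, or None if it needs
    more than `precision` digits in total (assumed overflow behaviour)."""
    quantum = Decimal(1).scaleb(-scale)  # scale 6 -> 0.000001
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    # Digits to the left of the decimal point after rounding.
    integer_digits = max(len(rounded.as_tuple().digits) - scale, 0)
    if integer_digits > precision - scale:
        return None
    return rounded


# format(..., "f") forces fixed-point output instead of scientific notation.
print(format(cast_decimal(0.000043, 8, 6), "f"))
print(cast_decimal(123.456, 8, 6))  # three integer digits do not fit in (8,6)
```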

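In newer versions of PySpark you can use the round() or bround() functions.
These functions return a numeric column, which avoids the problem with ",". Because the result is still numeric, show may still print very small values in scientific notation.
It would look like: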
df.withColumn("NumberColumn", bround("NumberColumn",5))
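The two functions differ only in how they break ties: round rounds half up, while bround rounds half to even (banker's rounding). The plain-Python sketch below shows that difference with the decimal module. It is not PySpark code, the helper names are made up for illustration, and the inputs are passed as strings so that binary floating-point representation does not blur the tie cases.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_up(text, places):
    """Round like a half-up rounding mode: ties go away from zero."""
    quantum = Decimal(1).scaleb(-places)
    return Decimal(text).quantize(quantum, rounding=ROUND_HALF_UP)


def round_half_even(text, places):
    """Round like a half-even rounding mode: ties go to the even neighbour."""
    quantum = Decimal(1).scaleb(-places)
    return Decimal(text).quantize(quantum, rounding=ROUND_HALF_EVEN)


for text, places in [("2.5", 0), ("0.000045", 5), ("0.000043", 5)]:
    print(text, places, round_half_up(text, places), round_half_even(text, places))
```

For values that are not exactly halfway, such as 0.000043 at 5 places, both modes give the same result.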

Related

Pyspark : How to take Minimum in the timestamp column?

In PySpark, I tried to do this:
df = df.select(F.col("id"),
F.col("mp_code"),
F.col("mp_def"),
F.col("mp_desc"),
F.col("mp_code_desc"),
F.col("zdmtrt06_zstation").alias("station"),
F.to_timestamp(F.col("date_time"), "yyyyMMddHHmmss").alias("date_time_utc"))
df = df.groupBy("id", "mp_code", "mp_def", "mp_desc", "mp_code_desc", "station").min(F.col("date_time_utc"))
But I get this error:
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
Here is an extract of the pyspark documentation
GroupedData.min(*cols)[source]
Computes the min value for each numeric column for each group.
New in version 1.3.0.
Parameters: cols : str
In other words, the min function does not accept Column arguments. It only works with column names (strings), and, as the extract says, only for numeric columns:
df.groupBy("x").min("y")
# you can also specify several column names
df.groupBy("x").min("y", "z")
For a timestamp column such as date_time_utc, or whenever you want to use a Column object, you have to use agg:
df.groupBy("x").agg(F.min(F.col("date_time_utc")))
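For readers who want to check the expected result of such an aggregation on a small sample, here is a plain-Python sketch of "earliest timestamp per group". It is not PySpark code; the function name and the sample rows are hypothetical, and the strptime pattern %Y%m%d%H%M%S is assumed to correspond to the yyyyMMddHHmmss pattern used in the question.

```python
from datetime import datetime


def min_timestamp_per_group(rows):
    """rows: iterable of (key, 'yyyyMMddHHmmss' string).
    Returns a dict mapping each key to its earliest datetime."""
    earliest = {}
    for key, raw in rows:
        ts = datetime.strptime(raw, "%Y%m%d%H%M%S")
        if key not in earliest or ts < earliest[key]:
            earliest[key] = ts
    return earliest


# Hypothetical sample data: (id, date_time)
rows = [
    ("a", "20200102120000"),
    ("a", "20200101083000"),
    ("b", "20191231235959"),
]
print(min_timestamp_per_group(rows))
```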

In Spark Scala, how to create a column with substring() using locate() as a parameter?

I have a dataset that is like the following:
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
I want to create a column that contains only the value after id. The result would be something like:
val df_result = Seq(("samb id 12",12), ("car id 13",13), ("lxu id 88",88)).toDF("list", "id_value")
For that, I am trying to use substring. For the starting-position parameter of substring, I am trying to use locate. But it gives me an error saying that the parameter should be an Int and not a Column.
What I am trying is like:
df
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
The error I get is:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
.withColumn("id_value", substring($"list", locate("id", $"list") + 2, 2))
^
How can I fix this and continue using locate() as a parameter?
UPDATE
Updating to give an example in which @wBob's answer doesn't work for my real-world data: my data is indeed a bit more complicated than the examples above.
It is something like this:
val df = Seq(":option car, lorem :ipsum: :ison, ID R21234, llor ip", "lst ID X49329xas ipsum :ion: ip_s-").toDF("list")
The values are very long strings that don't have a specific pattern.
Somewhere in the string there is always a part written ID XXXXX. The XXXXX varies, but it is always the same size (5 characters) and always comes after "ID ".
I have not been able to use either split or regexp_extract to get something matching this pattern.
It is not clear if you want the third item or the first number from the list, but here are a couple of examples which should help:
// Assign sample data to dataframe
val df = Seq("samb id 12", "car id 13", "lxu id 88").toDF("list")
df
.withColumn("t1", split($"list", "\\ ")(2))
.withColumn("t2", regexp_extract($"list", "\\d+", 0))
.withColumn("t3", regexp_extract($"list", "(id )(\\d+)", 2))
.withColumn("t4", regexp_extract($"list", "ID [A-Z](\\d{5})", 1))
.show()
You can use functions like split and regexp_extract with withColumn to create new columns based on existing values.
split splits the string into an array based on the delimiter you pass in. I have used a space here, escaped with two backslashes. The array is zero-based, hence specifying 2 gets the third item in the array.
regexp_extract uses regular expressions to extract from strings. For t2 I've used \\d, which represents a digit, and +, which matches the digit one or more times. The third column, t3, again uses regexp_extract with a similar regex, but uses brackets to group sections and 2 to get the second group from the regex, i.e. the (\\d+). The fourth column, t4, captures the five digits that follow "ID " and one uppercase letter.
NB I'm using additional backslashes in the regex to escape the backslash used in \d.
My results:
If your real data is more complicated please post a few simple examples where this code does not work and explain why.
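To make it easier to see what each pattern picks out, here is a plain-Python sketch that applies the same regular expressions with the re module to the sample strings from the question. It is not Spark code: extract is a hypothetical helper, and returning an empty string when nothing matches is chosen to mirror regexp_extract. The patterns are simple enough that Python's regex syntax should behave the same way as the Java regex syntax Spark uses, but treat that as an assumption.

```python
import re


def extract(text, pattern, group):
    """Return the requested group of the first match, or "" if there is none."""
    match = re.search(pattern, text)
    return match.group(group) if match else ""


samples = [
    "samb id 12",
    "car id 13",
    ":option car, lorem :ipsum: :ison, ID R21234, llor ip",
    "lst ID X49329xas ipsum :ion: ip_s-",
]

for s in samples:
    t2 = extract(s, r"\d+", 0)               # first run of digits
    t3 = extract(s, r"(id )(\d+)", 2)        # digits after a lowercase "id "
    t4 = extract(s, r"ID [A-Z](\d{5})", 1)   # five digits after "ID " + letter
    print(repr(t2), repr(t3), repr(t4))
```

Note that in Python a raw string (r"...") is used, so a single backslash is enough in \d.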

pyspark aws glue UDF multi parameter function? [duplicate]

I was wondering whether it is possible to create a UDF that receives two arguments, a Column and another variable (object, dictionary, or any other type), then does some operations and returns the result.
Actually, I attempted to do this but I got an exception. Therefore, I was wondering if there was any way to avoid this problem.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00),
                                 ("Hayek", 60, 3000.00),
                                 ("Mises", 60, 1000.0)],
                                ["name", "age", "balance"])
comparatorUDF = udf(lambda c, n: c == n, BooleanType())
df.where(comparatorUDF(col("name"), "Bonsanto")).show()
And I get the following error:
AnalysisException: u"cannot resolve 'Bonsanto' given input columns
name, age, balance;"
So it's obvious that the UDF "sees" the string "Bonsanto" as a column name, and actually I'm trying to compare a record value with the second argument.
On the other hand, I know that it's possible to use some operators inside a where clause (but actually I want to know if it is achievable using an UDF), as follows:
df.where(col("name") == "Bonsanto").show()
#+--------+---+-------+
#| name|age|balance|
#+--------+---+-------+
#|Bonsanto| 20| 2000.0|
#+--------+---+-------+
Everything that is passed to an UDF is interpreted as a column / column name. If you want to pass a literal you have two options:
Pass argument using currying:
def comparatorUDF(n):
    return udf(lambda c: c == n, BooleanType())

df.where(comparatorUDF("Bonsanto")(col("name")))
This can be used with an argument of any type as long as it is serializable.
Use a SQL literal and the current implementation:
from pyspark.sql.functions import lit
df.where(comparatorUDF(col("name"), lit("Bonsanto")))
This works only with supported types (strings, numerics, booleans). For non-atomic types see How to add a constant column in a Spark DataFrame?
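The currying option relies on an ordinary Python closure: the outer function captures the extra argument, and the inner function receives only the column value. Here is a minimal sketch of just that mechanism in plain Python, without Spark or udf. The names make_comparator and make_lookup are hypothetical, and the rows are the sample data from the question.

```python
def make_comparator(n):
    """Return a one-argument function that compares its input with n."""
    return lambda c: c == n


def make_lookup(mapping, default=None):
    """Same idea with a non-atomic argument: a dictionary captured by the closure."""
    return lambda c: mapping.get(c, default)


rows = [("Bonsanto", 20, 2000.0), ("Hayek", 60, 3000.0), ("Mises", 60, 1000.0)]

is_bonsanto = make_comparator("Bonsanto")
print([row for row in rows if is_bonsanto(row[0])])

age_group = make_lookup({20: "young", 60: "senior"}, default="unknown")
print([age_group(row[1]) for row in rows])
```

In the Spark version, the inner function is what gets wrapped by udf, which is why the captured value has to be serializable.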

How to conditionally remove the first two characters from a column

I have the below data of some phone records, and I want to remove the first two values from each record as they are a country code. What is the way by which I can do this using Scala, Spark, or Hive?
phone
|917799423934|
|019331224595|
| 8981251522|
|917271767899|
I'd like the result to be:
phone
|7799423934|
|9331224595|
|8981251522|
|7271767899|
How can we remove the prefix 91,01 from each record or each row of this column?
Phone numbers can differ in length, so a construction like this can be used (Scala). Note that it removes the first two characters unconditionally, whatever they are:
df.withColumn("phone", expr("substring(phone,3,length(phone)-2)"))
Using regular expressions
Use regexp_replace (add more prefix codes to the alternation if necessary):
select regexp_replace(trim(phone),'^(91|01)','') as phone --removes leading 91, 01 and all leading and trailing spaces
from table;
The same using regexp_extract:
select regexp_extract(trim(phone),'^(91|01)?(\\d+)',2) as phone --removes leading and trailing spaces, extract numbers except first (91 or 01)
from table;
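The regexp_replace variant can be checked quickly outside Hive with the same pattern. The sketch below is plain Python, not HiveQL; strip_prefix is a hypothetical helper, and the sample values are the ones from the question.

```python
import re

# Anchored at the start, so only a leading 91 or 01 is removed.
PREFIX = re.compile(r"^(91|01)")


def strip_prefix(phone):
    """Trim surrounding spaces, then drop a leading 91 or 01 if present."""
    return PREFIX.sub("", phone.strip())


for phone in ["917799423934", "019331224595", " 8981251522", "917271767899"]:
    print(strip_prefix(phone))
```

One thing to keep in mind: a number without a country code that happens to start with 91 or 01 would also lose its first two digits, so a length check may be worth adding for real data.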
An improvement, I believe. I would have preferred a list of prefixes with contains or an equivalent, but here goes:
import org.apache.spark.sql.functions._
case class Tel(telnum: String)
val ds = Seq(
Tel("917799423934"),
Tel("019331224595"),
Tel("8981251522"),
Tel("+4553")).toDS()
val ds2 = ds.withColumn(
  "new_telnum",
  when(
    expr("substring(telnum,1,2)") === "91" || expr("substring(telnum,1,2)") === "01",
    expr("substring(telnum,3,length(telnum)-2)")
  ).otherwise(col("telnum")))
ds2.show
returns:
+------------+----------+
| telnum|new_telnum|
+------------+----------+
|917799423934|7799423934|
|019331224595|9331224595|
| 8981251522|8981251522|
| +4553| +4553|
+------------+----------+
We may need to think of the +, but nothing was stated.
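The "list of prefixes" idea mentioned above can be sketched in plain Python as follows. This is not Spark code and not a translation of the when/otherwise expression; drop_known_prefix and the PREFIXES tuple are hypothetical names used only for illustration.

```python
PREFIXES = ("91", "01")  # extend this tuple with further codes as needed


def drop_known_prefix(telnum, prefixes=PREFIXES):
    """Remove the first matching prefix; leave the value unchanged otherwise."""
    for prefix in prefixes:
        if telnum.startswith(prefix):
            return telnum[len(prefix):]
    return telnum


for telnum in ["917799423934", "019331224595", "8981251522", "+4553"]:
    print(telnum, "->", drop_known_prefix(telnum))
```

Keeping the prefixes in one collection means that adding a code does not require touching the condition itself, and prefixes of different lengths are handled by len(prefix).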
If they are strings then for a Hive query:
sql("select substring(phone,3) from table").show

How to get substring using patterns and replace quotes in json value field using scala?

I have a few JSON messages like
{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}
{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}
I need to add double quotes on both sides of the column3 value and replace the double quotes inside the column3 value with single quotes, using Scala.
You have mentioned in the comment above
I have a huge dataset in Kafka. I am trying to read from Kafka and write to HDFS through Spark using Scala. I am using a JSON parser but am unable to parse the messages because of the column3 issue, so I need to manipulate each message to turn it into valid JSON.
So you must have a collection of malformed JSON strings like those in the question. I have created a list as
val kafkaMsg = List("""{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}""", """{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}""")
and since you are reading it through Spark, you must have an RDD such as
val rdd = sc.parallelize(kafkaMsg)
All you need is some parsing of the malformed JSON text to turn it into a valid JSON string:
val validJson = rdd.map(msg => msg.replaceAll("[}\"{]", "").split(",").map(_.split(":").mkString("\"", "\":\"", "\"")).mkString("{", ",", "}"))
validJson should be
{"column1":"abc","column2":"123","column3":"qwerty","column4":"abc123"}
{"column1":"defhj","column2":"45","column3":"asdfgh","column4":"def12d"}
You can create a dataframe from the validJson rdd as
sqlContext.read.json(validJson).show(false)
which should give you
+-------+-------+-------+-------+
|column1|column2|column3|column4|
+-------+-------+-------+-------+
|abc |123 |qwerty |abc123 |
|defhj |45 |asdfgh |def12d |
+-------+-------+-------+-------+
Or you can process validJson further according to your requirements.
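The string manipulation in the validJson step is easier to follow when applied to a single message. Below is a plain-Python port of the same idea (strip braces and quotes, split on commas, split each field on the colon, re-quote everything), checked with json.loads. It is a sketch, not Spark code; repair is a hypothetical name. Like the Scala one-liner, it drops the inner quotes instead of turning them into single quotes, and it assumes that no key or value contains a comma or a colon.

```python
import json
import re


def repair(msg):
    """Rebuild a flat JSON object in which every key and value is a quoted string."""
    stripped = re.sub(r'[}"{]', "", msg)           # remove braces and all quotes
    fields = stripped.split(",")                    # one "key:value" per field
    pairs = ['"' + '":"'.join(f.split(":")) + '"' for f in fields]
    return "{" + ",".join(pairs) + "}"


msg = '{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}'
fixed = repair(msg)
print(fixed)
print(json.loads(fixed)["column3"])
```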
Goal
add double quotes both sides for column3 value and replace double quotes in the column3 value with single quotes using scala.
I would recommend using a regex, because it gives you more flexibility.
Here is the solution:
val kafkaMsg = List("""{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}""", """{"column1":"defhj","column2":"45","column3":asd"f"gh,"column4":"def12d"}""", """{"column1":"defhj","column2":"45","column3":without-quotes,"column4":"def12d"}""")
val rdd = sc.parallelize(kafkaMsg)
val rePattern = """(^\{.*)("column3":)(.*)(,"column4":.*)""".r
val newRdd = rdd.map(r =>
  r match {
    case rePattern(start, col3, col3Value, end) =>
      start + col3 + '"' + col3Value.replaceAll("\"", "'") + '"' + end
    case _ => r
  }
)
newRdd.foreach(println)
Explanation:
1. The first and second statements initialize the RDD.
2. The third statement defines the regex pattern. You may need to adjust it to your situation. The regex produces 4 groups of values (whatever is inside a () is a group):
   - the string starting with "{" and everything after it until we meet "column3":
   - "column3": itself
   - whatever comes after "column3": but before ,"column4":
   - whatever comes starting from ,"column4":
   I use these 4 groups in the next statement.
3. Iterate over your RDD, run each string against the regex, and change it: replace double quotes with single quotes, and add the opening and closing quotes. If there is no match, the original string is returned. Because the regex was defined with 4 groups, I use 4 variables to map the matches:
   case rePattern(start, col3, col3Value, end) =>
   Note: the code doesn't check whether the value contains a double quote or not, it just runs the update. You can add validation on your own if you need it.
4. Show the results.
Important notes:
The regex I used is tightly coupled to your source string format. Keep in mind that you have JSON, so the order of your keys is not guaranteed. As a result, "column4" (which is used to mark the end of the column3 value) may come before "column3".
If you use a comma as the key/value terminator, make sure it does not appear inside the column3 value.
Bottom line: you need to adjust my regex to properly identify the end of column3 value.
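For experimenting with the pattern outside Spark, here is a plain-Python port of the same regex approach, validated with json.loads. It is a sketch, not the Scala code itself; quote_column3 is a hypothetical name, and fullmatch is used because a Scala regex in a case pattern has to match the whole string. The same assumptions apply: "column3" comes directly before "column4", and the column3 value does not contain the text ,"column4":.

```python
import json
import re

# Same four groups as the Scala pattern: start, key, value, end.
PATTERN = re.compile(r'(^\{.*)("column3":)(.*)(,"column4":.*)')


def quote_column3(msg):
    """Wrap the column3 value in double quotes and turn inner " into '."""
    match = PATTERN.fullmatch(msg)
    if match is None:
        return msg  # no match: return the original string unchanged
    start, key, value, end = match.groups()
    return start + key + '"' + value.replace('"', "'") + '"' + end


messages = [
    '{"column1":"abc","column2":"123","column3":qwe"r"ty,"column4":"abc123"}',
    '{"column1":"defhj","column2":"45","column3":without-quotes,"column4":"def12d"}',
]
for msg in messages:
    fixed = quote_column3(msg)
    print(fixed, json.loads(fixed)["column3"])
```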
Hope it helps.