Input dataframe:
val ds = Seq((1,34.44),
(2,76.788),
(3,54.822)).toDF("id","mark")
Expected output:
val ds = Seq((1,34),
(2,76),
(3,54)).toDF("id","mark")
I want to remove the fractional part from the column mark as above. I have searched for a built-in function but did not find anything. What should a UDF look like to achieve the above result?
You can just cast the column to integer:
import org.apache.spark.sql.functions._
ds.withColumn("mark", $"mark".cast("integer")).show(false)
which should give you
+---+----+
|id |mark|
+---+----+
|1 |34 |
|2 |76 |
|3 |54 |
+---+----+
I hope the answer is helpful
Update
You commented:
But if there are any string values in the column, they become null since we are casting to integer. I don't want that kind of behaviour.
So I guess your mark column must be a StringType, and in that case you can use regexp_replace:
import org.apache.spark.sql.functions._
ds.withColumn("mark", regexp_replace($"mark", "(\\.\\d+)", "")).show(false)
Related
I have a dataframe like this...
val new_df =Seq(("a","b"),("b","a"),("a","c")).toDF("col1","col2")
and I want to create "col3" which is a string concatenation of "col1" and "col2". However, I want the concatenation of "ab" and "ba" to be treated the same, sorted alphabetically so that it's only "ab".
The resulting dataframe I would like to look like this:
val new_df =Seq(("a","b","ab"),("b","a","ab"),("a","c","ac")).toDF("col1","col2","col3")
thanks and have a great day!
With Spark SQL functions, to take advantage of the Spark SQL optimizations:
import org.apache.spark.sql.functions.{sort_array, array, concat_ws, col}

new_df.withColumn("col3",
  concat_ws("",
    sort_array(array(col("col1"), col("col2")))))
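Adding .show(false) on the sample new_df should display something like:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|a   |b   |ab  |
|b   |a   |ab  |
|a   |c   |ac  |
+----+----+----+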
You can just create a UDF that builds a sorted String:
import org.apache.spark.sql.functions.udf

val concatColumns = udf((c1: String, c2: String) => {
  List(c1, c2).sorted.mkString
})
Then use it in a withColumn call, passing the columns you want to concatenate:
new_df.withColumn("col3", concatColumns($"col1", $"col2")).show(false)
Result
+----+----+----+
|col1|col2|col3|
+----+----+----+
|a |b |ab |
|b |a |ab |
|a |c |ac |
+----+----+----+
Take for example the following DataFrame:
x.show(false)
+-----+--------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath                                                                              |timestamp    |
+-----+--------------------------------------------------------------------------------------+-------------+
|11   |hdfs://novus-nameservice/a/b/c/done/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12   |hdfs://novus-nameservice/a/b/c/done/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+--------------------------------------------------------------------------------------+-------------+
Now I am trying to update the existing DF to create a new DF based on the column hdfsPath.
The new DF should look like the following:
+-----+----------------------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath                                                                                            |timestamp    |
+-----+----------------------------------------------------------------------------------------------------+-------------+
|11   |hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12   |hdfs://novus-nameservice/a/b/c/target/20200219/12/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+----------------------------------------------------------------------------------------------------+-------------+
So the done part of the path changes to target, and then from the compiled-20200218050518-1-0-0-1582020318751.snappy portion I get the date 20200218, then the colId 11, and finally the snappy file itself. What would be the easiest and most efficient way to achieve this?
It's not a hard requirement to create a new DF; I can update the existing DF with a new column.
To summarize:
Current hdfsPath:
hdfs://novus-nameservice/a/b/c/done/compiled-20200218050518-1-0-0-1582020318751.snappy
Expected hdfsPath:
hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy
Based on colID.
The simplest way I can imagine doing this is converting your dataframe to a dataset, applying a map operation, and then going back to a dataframe:
// Define a case class; the field names need to match the column names
case class MyType(colId: Int, hdfsPath: String, timestamp: Long)

dataframe.as[MyType].map(x => <<Your Transformation code>>).toDF()
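For illustration only, a possible map body (a sketch that assumes the path layout from the question, i.e. a /done/ segment and a compiled-<yyyyMMddHHmmss>-... file name) could look like this:
dataframe.as[MyType].map { x =>
  // extract the 8-digit date from the file name, e.g. 20200218
  val date = "compiled-(\\d{8})".r
    .findFirstMatchIn(x.hdfsPath)
    .map(_.group(1))
    .getOrElse("")
  // swap /done/ for /target/<date>/<colId>/
  x.copy(hdfsPath = x.hdfsPath.replace("/done/", s"/target/$date/${x.colId}/"))
}.toDF()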
Here is what you can do with regexp_replace and regexp_extract: extract the values you want and build the replacement from them.
df.withColumn("hdfsPath", regexp_replace(
$"hdfsPath",
lit("/done"),
concat(
lit("/target/"),
regexp_extract($"hdfsPath", "compiled-([0-9]{1,8})", 1),
lit("/"),
$"colId")
))
Output:
+-----+----------------------------------------------------------------------------------------------------+-------------+
|colId|hdfsPath |timestamp |
+-----+----------------------------------------------------------------------------------------------------+-------------+
|11 |hdfs://novus-nameservice/a/b/c/target/20200218/11/compiled-20200218050518-1-0-0-1582020318751.snappy|1662157400000|
|12 |hdfs://novus-nameservice/a/b/c/target/20200219/12/compiled-20200219060507-1-0-0-1582023907108.snappy|1662158000000|
+-----+----------------------------------------------------------------------------------------------------+-------------+
Hope this helps!
In the below Spark-Scala code, a double value is being stored differently, even though in the table all columns are of string type. The same result appears in Impala as well.
Does someone know how to make sure the exact value gets saved and retrieved?
Thanks
val df = Seq(("one", 1324235345435.4546)).toDF("a", "b")
df.write.mode("append").insertInto("test")
spark.sql("select * from test").show(false)
+---+---------------------+
|a |b |
+---+---------------------+
|one|1.3242353454354546E12|
+---+---------------------+
Try casting to Decimal type and then inserting into the Hive table.
val df = Seq(("one", 1324235345435.4546))
.toDF("a", "b")
.select('a,'b.cast("Decimal(36,4)"))
df.show(false)
+---+------------------+
|a |b |
+---+------------------+
|one|1324235345435.4546|
+---+------------------+
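Then you can append it to the Hive table just as in the question (a sketch reusing the same test table):
df.write.mode("append").insertInto("test")
spark.sql("select * from test").show(false)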
scala> df.select(format_number(col("b"),4)).show(false)
+----------------------+
|format_number(b, 4) |
+----------------------+
|1,324,235,345,435.4546|
+----------------------+
You could use the format_number function on top of the column so that it is automatically converted to a string with the precision you require.
Hope this helps with generalizing.
I have a problem with adding days (numbers) to date format columns in Spark. I know that there is a function date_add that takes two arguments - date column and integer:
date_add(date startdate, tinyint/smallint/int days)
I'd like to use a column value that is of type integer instead (not an integer itself).
Say I have the following dataframe:
val data = Seq(
  (0, "2016-01-1"),
  (1, "2016-02-2"),
  (2, "2016-03-22"),
  (3, "2016-04-25"),
  (4, "2016-05-21"),
  (5, "2016-06-1"),
  (6, "2016-03-21")
).toDF("id", "date")
I can simply add integers to dates:
val date_add_fun =
  data.select(
    $"id",
    $"date",
    date_add($"date", 1)
  )
But I cannot use a column expression that contains the values:
val date_add_fun =
  data.select(
    $"id",
    $"date",
    date_add($"date", $"id")
  )
It gives error:
<console>:60: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Int
date_add($"date", $"id")
Does anyone know if it is possible to use a column in the date_add function? Or what is the workaround?
You can use expr:
import org.apache.spark.sql.functions.expr
data.withColumn("future", expr("date_add(date, id)")).show
// +---+----------+----------+
// | id| date| future|
// +---+----------+----------+
// | 0| 2016-01-1|2016-01-01|
// | 1| 2016-02-2|2016-02-03|
// | 2|2016-03-22|2016-03-24|
// | 3|2016-04-25|2016-04-28|
// | 4|2016-05-21|2016-05-25|
// | 5| 2016-06-1|2016-06-06|
// | 6|2016-03-21|2016-03-27|
// +---+----------+----------+
selectExpr could be used in a similar way:
data.selectExpr("*", "date_add(date, id) as future").show
The other answers work but aren't a drop-in replacement for the existing date_add function.
I had a case where expr wouldn't work for me, so here is a drop-in replacement:
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.DateAdd

def date_add(date: Column, days: Column) = {
  new Column(DateAdd(date.expr, days.expr))
}
Basically, all the machinery to do this is already there in Spark; the function signature of date_add just forces the days argument to be a literal.
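With that definition in scope, the call that failed in the question should now compile (a sketch, assuming spark.implicits._ is imported for the $ syntax):
data.select(
  $"id",
  $"date",
  date_add($"date", $"id")
).show()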
You can use a SQL expression:
data.createOrReplaceTempView("table")
sqlContext.sql("select id, date, date_add(`date`, `id`) as added_date from table").show(false)
which would give you
+---+----------+----------+
|id |date |added_date|
+---+----------+----------+
|0 |2016-01-1 |2016-01-01|
|1 |2016-02-2 |2016-02-03|
|2 |2016-03-22|2016-03-24|
|3 |2016-04-25|2016-04-28|
|4 |2016-05-21|2016-05-25|
|5 |2016-06-1 |2016-06-06|
|6 |2016-03-21|2016-03-27|
+---+----------+----------+
For the Python developers who are here, you can simply add a date column and an integer offset column together using +:
import pyspark.sql.functions as F
new_df = df.withColumn("new_date", F.col("date") + F.col("offset"))
Just make sure that the offset column is int/smallint/tinyint.
I am reading the data from HDFS into a DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data into the format I showed.
The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
F.regexp_extract($"value", r, 1).as("id"),
F.regexp_extract($"value", r, 2).as("community")
).show()
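Assuming the rows that are truncated in the show() output follow the same pattern as the first one, this should produce roughly the expected output, e.g.:
+----+---------+
|id  |community|
+----+---------+
|4056|1        |
|56  |56       |
|2056|20       |
+----+---------+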
A bunch of regular expressions should give you the required result.
import org.apache.spark.sql.functions._

df.select(
  regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
  explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))
If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
Then you can simply use the split and regexp_replace inbuilt functions to get your desired output dataframe as
import org.apache.spark.sql.functions._
df.select(
  regexp_replace(split(col("value"), ",")(0), "\\(", "").as("id"),
  regexp_replace(split(col("value"), ",")(1), "\\{community:", "").as("value")
).show()
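With rows shaped like the sample above, the result should be close to the expected output, e.g.:
+----+-----+
|id  |value|
+----+-----+
|4056|1    |
|56  |56   |
|2056|20   |
+----+-----+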
I hope the answer is helpful