how to use Regexp_replace in spark - scala

I am pretty new to spark and would like to perform an operation on a column of a dataframe so as to replace all the , in the column with .
Assume there is a dataframe x and column x4
x4
1,3435
1,6566
-0,34435
I want the output to be as
x4
1.3435
1.6566
-0.34435
The code I am using is
import org.apache.spark.sql.Column
def replace = regexp_replace((x.x4,1,6566:String,1.6566:String)x.x4)
But I get the following error
import org.apache.spark.sql.Column
<console>:1: error: ')' expected but '.' found.
def replace = regexp_replace((train_df.x37,0,160430299:String,0.160430299:String)train_df.x37)
Any help on the syntax, logic or any other suitable way would be much appreciated

Here's a reproducible example, assuming x4 is a string column.
import org.apache.spark.sql.functions.regexp_replace
val df = spark.createDataFrame(Seq(
(1, "1,3435"),
(2, "1,6566"),
(3, "-0,34435"))).toDF("Id", "x4")
The syntax is regexp_replace(str, pattern, replacement), which translates to:
df.withColumn("x4New", regexp_replace(df("x4"), "\\,", ".")).show
+---+--------+--------+
| Id| x4| x4New|
+---+--------+--------+
| 1| 1,3435| 1.3435|
| 2| 1,6566| 1.6566|
| 3|-0,34435|-0.34435|
+---+--------+--------+

We could use the map method to do this transformation:
scala> df.map(each => {
(each.getInt(0),each.getString(1).replaceAll(",", "."))
})
.toDF("Id","x4")
.show
Output:
+---+--------+
| Id| x4|
+---+--------+
| 1| 1.3435|
| 2| 1.6566|
| 3|-0.34435|
+---+--------+

Related

base64 decoding of a dataframe

I have an encoded dataframe and I managed to get it decoded using following code in PySpark. Is there any simple way where I can have an additional column in the dataframe itself through Scala/PySpark?
import base64
import numpy as np
df = spark.read.parquet("file_path")
encodedColumn = base64.decodestring(df.take(1)[0].column2)
t1 = np.frombuffer(encodedColumn ,dtype='<f4')
I looked up multiple similar questions, but couldnt get them to work.
Edit:
Got it working with help from a colleague.
def binaryToFloatArray(stringValue: String): Array[Float] = {
val t:Array[Byte] = Base64.getDecoder().decode(stringValue)
val b = ByteBuffer.wrap(t).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
val copy = new Array[Float](2048)
b.get(copy)
return copy
}
val binaryToFloatArrayUDF = udf(binaryToFloatArray _)
val finalResultDf = dftest.withColumn("myFloatArray", binaryToFloatArrayUDF(col("_2"))).drop("_2")
You have base64 and unbase64 functions for this.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=streaming#pyspark.sql.functions.base64
You could
from pyspark.sql.functions import unbase64,base64
got = spark.createDataFrame([(1, "Jon"), (2, "Danny"), (3, "Tyrion")], ("id", "name"))
+---+------+
| id| name|
+---+------+
| 1| Jon|
| 2| Danny|
| 3|Tyrion|
+---+------+
encoded_got = got.withColumn('encoded_base64_name', base64(got.name))
+---+------+-------------------+
| id| name|encoded_base64_name|
+---+------+-------------------+
| 1| Jon| Sm9u|
| 2| Danny| RGFubnk=|
| 3|Tyrion| VHlyaW9u|
+---+------+-------------------+
decoded_got = encoded_got.withColumn('decoded_base64', unbase64(encoded_got.encoded_base64).cast("string"))
# Need to use cast("string") to convert from binary to string
+---+------+--------------+--------------+
| id| name|encoded_base64|decoded_base64|
+---+------+--------------+--------------+
| 1| Jon| Sm9u| Jon|
| 2| Danny| RGFubnk=| Danny|
| 3|Tyrion| VHlyaW9u| Tyrion|
+---+------+--------------+--------------+

How to transform a string column of a dataframe into a column of Array[String] with Apache Spark and Scala

I have a DataFrame with a column 'title_from' as below.
.
This colume contains a sentence and I want to transform this column into a Array[String]. I have tried something like this but it does not works.
val newDF = df.select("title_from").map(x => x.split("\\\s+")
How can I achieve this? How can I transform a datafram of strings into a dataframe of Array[string]? I want evry line of newDF to be an array of words from df.
Thanks for any help!
You can use the withColumn function.
import org.apache.spark.sql.functions._
val newDF = df.withColumn("split_title_from", split(col("title_from"), "\\s+"))
.select("split_title_from")
Can you try following to get the list of all authors
scala> val df = Seq((1,"a1,a2,a3"), (2,"a1,a4,a10")).toDF("id","author")
df: org.apache.spark.sql.DataFrame = [id: int, author: string]
scala> df.show()
+---+---------+
| id| author|
+---+---------+
| 1| a1,a2,a3|
| 2|a1,a4,a10|
+---+---------+
scala> df.select("author").show
+---------+
| author|
+---------+
| a1,a2,a3|
|a1,a4,a10|
+---------+
scala> df.select("author").flatMap( row => { row.get(0).toString().split(",")}).show()
+-----+
|value|
+-----+
| a1|
| a2|
| a3|
| a1|
| a4|
| a10|
+-----+

How to get a value from Dataset and store it in a Scala value?

I have a dataframe which looks like this:
scala> avgsessiontime.show()
+-----------------+
| avg|
+-----------------+
|2.073455735838315|
+-----------------+
I need to store the value 2.073455735838315 in a variable. I tried using
avgsessiontime.collect
but that starts giving me Task not serializable exceptions. So to avoid that I started using foreachPrtition. But I dont know how to extract the value 2.073455735838315 in an array variable.
scala> avgsessiontime.foreachPartition(x => x.foreach(println))
[2.073455735838315]
But when I do this:
avgsessiontime.foreachPartition(x => for (name <- x) name.get(0))
I get a blank/empty result. Even the length returns empty.
avgsessiontime.foreachPartition(x => for (name <- x) name.length)
I know name is of type org.apache.spark.sql.Row then it should return both those results.
You might need:
avgsessiontime.first.getDouble(0)
Here use first to extract the Row object, and .getDouble(0) to extract value from the Row object.
val df = Seq(2.0743).toDF("avg")
df.show
+------+
| avg|
+------+
|2.0743|
+------+
df.first.getDouble(0)
// res6: Double = 2.0743
scala> val df = spark.range(10)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
scala> val variable = df.select("id").as[Long].collect
variable: Array[Long] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Same way you can extract values of any type i.e double,string. You just need to give data type while selecting values from df.
rdd and dataframes/datasets are distributed in nature, and foreach and foreachPartition are executed on executors, transforming dataframe or rdd on executors itself without returning anything. So if you want to return the variable to the driver node then you will have to use collect.
Supposing you have a dataframe as
+-----------------+
|avg |
+-----------------+
|2.073455735838315|
|2.073455735838316|
+-----------------+
doing the following will print all the values, which you can store in a variable too
avgsessiontime.rdd.collect().foreach(x => println(x(0)))
it will print
2.073455735838315
2.073455735838316
Now if you want only the first one then you can do
avgsessiontime.rdd.collect()(0)(0)
which will give you
2.073455735838315
I hope the answer is helpful

Pass Array[seq[String]] to UDF in spark scala

I am new to UDF in spark. I have also read the answer here
Problem statement: I'm trying to find pattern matching from a dataframe col.
Ex: Dataframe
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),
(3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
df.show()
+---+--------------------+
| id| text|
+---+--------------------+
| 1| z|
| 2| abs,abc,dfg|
| 3|a,b,c,d,e,f,abs,a...|
+---+--------------------+
df.filter($"text".contains("abs,abc,dfg")).count()
//returns 2 as abs exits in 2nd row and 3rd row
Now I want to do this pattern matching for every row in column $text and add new column called count.
Result:
+---+--------------------+-----+
| id| text|count|
+---+--------------------+-----+
| 1| z| 1|
| 2| abs,abc,dfg| 2|
| 3|a,b,c,d,e,f,abs,a...| 1|
+---+--------------------+-----+
I tried to define a udf passing $text column as Array[Seq[String]. But I am not able to get what I intended.
What I tried so far:
val txt = df.select("text").collect.map(_.toSeq.map(_.toString)) //convert column to Array[Seq[String]
val valsum = udf((txt:Array[Seq[String],pattern:String)=> {txt.count(_ == pattern) } )
df.withColumn("newCol", valsum( lit(txt) ,df(text)) )).show()
Any help would be appreciated
You will have to know all the elements of text column which can be done using collect_list by grouping all the rows of your dataframe as one. Then just check if element in text column in the collected array and count them as in the following code.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val df = Seq((1, Some("z")), (2, Some("abs,abc,dfg")),(3,Some("a,b,c,d,e,f,abs,abc,dfg"))).toDF("id", "text")
val valsum = udf((txt: String, array : mutable.WrappedArray[String])=> array.filter(element => element.contains(txt)).size)
df.withColumn("grouping", lit("g"))
.withColumn("array", collect_list("text").over(Window.partitionBy("grouping")))
.withColumn("count", valsum($"text", $"array"))
.drop("grouping", "array")
.show(false)
You should have following output
+---+-----------------------+-----+
|id |text |count|
+---+-----------------------+-----+
|1 |z |1 |
|2 |abs,abc,dfg |2 |
|3 |a,b,c,d,e,f,abs,abc,dfg|1 |
+---+-----------------------+-----+
I hope this is helpful.

Split 1 column into 3 columns in spark scala

I have a dataframe in Spark using scala that has a column that I need split.
scala> test.show
+-------------+
|columnToSplit|
+-------------+
| a.b.c|
| d.e.f|
+-------------+
I need this column split out to look like this:
+--------------+
|col1|col2|col3|
| a| b| c|
| d| e| f|
+--------------+
I'm using Spark 2.0.0
Thanks
Try:
import sparkObject.spark.implicits._
import org.apache.spark.sql.functions.split
df.withColumn("_tmp", split($"columnToSplit", "\\.")).select(
$"_tmp".getItem(0).as("col1"),
$"_tmp".getItem(1).as("col2"),
$"_tmp".getItem(2).as("col3")
)
The important point to note here is that the sparkObject is the SparkSession object you might have already initialized. So, the (1) import statement has to be compulsorily put inline within the code, not before the class definition.
To do this programmatically, you can create a sequence of expressions with (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")) (assume you need 3 columns as result) and then apply it to select with : _* syntax:
df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
(0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| d| e| f|
+----+----+----+
To keep all columns:
df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+-------------+---------+----+----+----+
|columnToSplit| temp|col0|col1|col2|
+-------------+---------+----+----+----+
| a.b.c|[a, b, c]| a| b| c|
| d.e.f|[d, e, f]| d| e| f|
+-------------+---------+----+----+----+
If you are using pyspark, use a list comprehension to replace the map in scala:
df = spark.createDataFrame([['a.b.c'], ['d.e.f']], ['columnToSplit'])
from pyspark.sql.functions import col, split
(df.withColumn('temp', split('columnToSplit', '\\.'))
.select(*(col('temp').getItem(i).alias(f'col{i}') for i in range(3))
).show()
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| d| e| f|
+----+----+----+
A solution which avoids the select part. This is helpful when you just want to append the new columns:
case class Message(others: String, text: String)
val r1 = Message("foo1", "a.b.c")
val r2 = Message("foo2", "d.e.f")
val records = Seq(r1, r2)
val df = spark.createDataFrame(records)
df.withColumn("col1", split(col("text"), "\\.").getItem(0))
.withColumn("col2", split(col("text"), "\\.").getItem(1))
.withColumn("col3", split(col("text"), "\\.").getItem(2))
.show(false)
+------+-----+----+----+----+
|others|text |col1|col2|col3|
+------+-----+----+----+----+
|foo1 |a.b.c|a |b |c |
|foo2 |d.e.f|d |e |f |
+------+-----+----+----+----+
Update: I highly recommend to use Psidom's implementation to avoid splitting three times.
This appends columns to the original DataFrame and doesn't use select, and only splits once using a temporary column:
import spark.implicits._
df.withColumn("_tmp", split($"columnToSplit", "\\."))
.withColumn("col1", $"_tmp".getItem(0))
.withColumn("col2", $"_tmp".getItem(1))
.withColumn("col3", $"_tmp".getItem(2))
.drop("_tmp")
This expands on Psidom's answer and shows how to do the split dynamically, without hardcoding the number of columns. This answer runs a query to calculate the number of columns.
val df = Seq(
"a.b.c",
"d.e.f"
).toDF("my_str")
.withColumn("letters", split(col("my_str"), "\\."))
val numCols = df
.withColumn("letters_size", size($"letters"))
.agg(max($"letters_size"))
.head()
.getInt(0)
df
.select(
(0 until numCols).map(i => $"letters".getItem(i).as(s"col$i")): _*
)
.show()
We can write using for with yield in Scala :-
If your number of columns exceeds just add it to desired column and play with it. :)
val aDF = Seq("Deepak.Singh.Delhi").toDF("name")
val desiredColumn = Seq("name","Lname","City")
val colsize = desiredColumn.size
val columList = for (i <- 0 until colsize) yield split(col("name"),".").getItem(i).alias(desiredColumn(i))
aDF.select(columList: _ *).show(false)
Output:-
+------+------+-----+--+
|name |Lname |city |
+-----+------+-----+---+
|Deepak|Singh |Delhi|
+---+------+-----+-----+
If you don't need name column then, drop the column and just use withColumn.
Example:
Without using the select statement.
Lets assume we have a dataframe having a set of columns and we want to split a column having column name as name
import spark.implicits._
val columns = Seq("name","age","address")
val data = Seq(("Amit.Mehta", 25, "1 Main st, Newark, NJ, 92537"),
("Rituraj.Mehta", 28,"3456 Walnut st, Newark, NJ, 94732"))
var dfFromData = spark.createDataFrame(data).toDF(columns:_*)
dfFromData.printSchema()
val newDF = dfFromData.map(f=>{
val nameSplit = f.getAs[String](0).split("\\.").map(_.trim)
(nameSplit(0),nameSplit(1),f.getAs[Int](1),f.getAs[String](2))
})
val finalDF = newDF.toDF("First Name","Last Name", "Age","Address")
finalDF.printSchema()
finalDF.show(false)
output: