How to perform arithmetic operation on two seperate dataframes in Apache Spark? - scala

I have two dataframes as follows which have only one row and one column each. Both holds two different numeric values.
How do I perform or achieve division or other arithmetic operation on those two dataframe values?
Please help.

First, if these DataFrames contain a single record each - any further use of Spark would likely be wasteful (Spark is intended for large data sets, small ones would be processed faster locally). So, you can simply collect these one-record values using first() an go on from there:
import spark.implicits._
val df1 = Seq(2.0).toDF("col1")
val df2 = Seq(3.5).toDF("col2")
val v1: Double = df1.first().getAs[Double](0)
val v2: Double = df2.first().getAs[Double](0)
val sum = v1 + v2
If, for some reason, you do want to use DataFrames all the way, you can use crossJoin to join the records together and then apply any arithmetic operation:
import spark.implicits._
val df1 = Seq(2.0).toDF("col1")
val df2 = Seq(3.5).toDF("col2")
df1.crossJoin(df2)
.select($"col1" + $"col2" as "sum")
.show()
// +---+
// |sum|
// +---+
// |5.5|
// +---+

If you have dataframes as
scala> df1.show(false)
+------+
|value1|
+------+
|2 |
+------+
scala> df2.show(false)
+------+
|value2|
+------+
|2 |
+------+
You can get the value by doing the following
scala> df1.take(1)(0)(0)
res3: Any = 2
But the dataType is Any, type casting is needed before we do arithmetic operations as
scala> df1.take(1)(0)(0).asInstanceOf[Int]*df2.take(1)(0)(0).asInstanceOf[Int]
res8: Int = 4

Related

Spark How can I filter out rows that contain char sequences from another dataframe?

So, I am trying to remove rows from df2 if the Value in df2 is "like" a key from df1. I'm not sure if this is possible, or if I might need to change df1 into a list first? It's a fairly small dataframe, but as you can see, we want to remove the 2nd and 3rd rows from df2 and just return back df2 without them.
df1
+--------------------+
| key|
+--------------------+
| Monthly Beginning|
| Annual Percentage|
+--------------------+
df2
+--------------------+--------------------------------+
| key| Value|
+--------------------+--------------------------------+
| Date| 1/1/2018|
| Date| Monthly Beginning on Tuesday|
| Number| Annual Percentage Rate for...|
| Number| 17.5|
+--------------------+--------------------------------+
I thought it would be something like this?
df.filter(($"Value" isin (keyDf.select("key") + "%"))).show(false)
But that doesn't work and I'm not surprised, but I think it helps show what I am trying to do if my previous explanation was not sufficient enough. Thank you for your help ahead of time.
Convert the first dataframe df1 to List[String] and then create one udf and apply filter condition
Spark-shell-
import org.apache.spark.sql.functions._
//Converting df1 to list
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
//Creating udf , spark stands for spark session
spark.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
//Applying filter
df2.filter("filterUDF(Value)=0").show
//output
+------+--------+
| key| Value|
+------+--------+
| Date|1/1/2018|
|Number| 17.5|
+------+--------+
Scala-IDE -
val sparkSession=SparkSession.builder().master("local").appName("temp").getOrCreate()
val df1=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df1.csv")
val df2=sparkSession.read.format("csv").option("header","true").load("C:\\spark\\programs\\df2.csv")
import sparkSession.implicits._
val df1List=df1.select("key").map(row=>row.getString(0).toLowerCase).collect.toList
sparkSession.udf.register("filterUDF", (str: String) => df1List.filter(str.toLowerCase.contains(_)).length)
df2.filter("filterUDF(Value)=0").show
Convert df1 to List. Convert df2 to Dataset.
case class s(key:String,Value:String)
df2Ds = df2.as[s]
Then we can use the filter method to filter out the records.
Somewhat like this.
def check(str:String):Boolean = {
var i = ""
for(i<-df1List)
{
if(str.contains(i))
return false
}
return true
}
df2Ds.filter(s=>check(s.Value)).collect

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame where I have a column with Vector values. The vector values are all n-dimensional, aka with the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponds to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join with the old one, but it requires spark context to use createDataFrame. I only want to transform the existing data frame. I also know .withColumn("fi", value) but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then using getItem to extract individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+

Process all columns / the entire row in a Spark UDF

For a dataframe containing a mix of string and numeric datatypes, the goal is to create a new features column that is a minhash of all of them.
While this could be done by performing a dataframe.toRDD it is expensive to do that when the next step will be to simply convert the RDD back to a dataframe.
So is there a way to do a udf along the following lines:
val wholeRowUdf = udf( (row: Row) => computeHash(row))
Row is not a spark sql datatype of course - so this would not work as shown.
Update/clarifiction I realize it is easy to create a full-row UDF that runs inside withColumn. What is not so clear is what can be used inside a spark sql statement:
val featurizedDf = spark.sql("select wholeRowUdf( what goes here? ) as features
from mytable")
Row is not a spark sql datatype of course - so this would not work as shown.
I am going to show that you can use Row to pass all the columns or selected columns to a udf function using struct inbuilt function
First I define a dataframe
val df = Seq(
("a", "b", "c"),
("a1", "b1", "c1")
).toDF("col1", "col2", "col3")
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// |a |b |c |
// |a1 |b1 |c1 |
// +----+----+----+
Then I define a function to make all the elements in a row as one string separated by , (as you have computeHash function)
import org.apache.spark.sql.Row
def concatFunc(row: Row) = row.mkString(", ")
Then I use it in udf function
import org.apache.spark.sql.functions._
def combineUdf = udf((row: Row) => concatFunc(row))
Finally I call the udf function using withColumn function and struct inbuilt function combining selected columns as one column and pass to the udf function
df.withColumn("contcatenated", combineUdf(struct(col("col1"), col("col2"), col("col3")))).show(false)
// +----+----+----+-------------+
// |col1|col2|col3|contcatenated|
// +----+----+----+-------------+
// |a |b |c |a, b, c |
// |a1 |b1 |c1 |a1, b1, c1 |
// +----+----+----+-------------+
So you can see that Row can be used to pass whole row as an argument
You can even pass all columns in a row at once
val columns = df.columns
df.withColumn("contcatenated", combineUdf(struct(columns.map(col): _*)))
Updated
You can achieve the same with sql queries too, you just need to register the udf function as
df.createOrReplaceTempView("tempview")
sqlContext.udf.register("combineUdf", combineUdf)
sqlContext.sql("select *, combineUdf(struct(`col1`, `col2`, `col3`)) as concatenated from tempview")
It will give you the same result as above
Now if you don't want to hardcode the names of columns then you can select the column names according to your desire and make it a string
val columns = df.columns.map(x => "`"+x+"`").mkString(",")
sqlContext.sql(s"select *, combineUdf(struct(${columns})) as concatenated from tempview")
I hope the answer is helpful
I came up with a workaround: drop the column names into any existing spark sql function to generate a new output column:
concat(${df.columns.tail.mkString(",'-',")}) as Features
In this case the first column in the dataframe is a target and was excluded. That is another advantage of this approach: the actual list of columns many be manipulated.
This approach avoids unnecessary restructuring of the RDD/dataframes.

Using of String Functions in Dataframe Join in scala

I am trying to join two dataframe with condition like "Wo" in "Hello World" i.e (dataframe1 col contains dataframe2 col1 value).
In HQL, we can use instr(t1.col1,t2.col1)>0
How can I achieve this same condtition in Dataframe in Scala ? I tried
df1.join(df2,df1("col1").indexOfSlice(df2("col1")) > 0)
But it throwing me the below error
error: value indexOfSlice is not a member of
org.apache.spark.sql.Column
I just want to achive the below hql query using DataFrames.
select t1.*,t2.col1 from t1,t2 where instr(t1.col1,t2.col1)>0
The following solution is tested with spark 2.2. You'll be needing to define a UDF and you can specify a join condition as part of where filter :
val indexOfSlice_ = (c1: String, c2: String) => c1.indexOfSlice(c2)
val islice = udf(indexOfSlice_)
val df10: DataFrame = Seq(("Hello World", 2), ("Foo", 3)).toDF("c1", "c2")
val df20: DataFrame = Seq(("Wo", 2), ("Bar", 3)).toDF("c3", "c4")
df10.crossJoin(df20).where(islice(df10.col("c1"), df20.col("c3")) > 0).show
// +-----------+---+---+---+
// | c1| c2| c3| c4|
// +-----------+---+---+---+
// |Hello World| 2| Wo| 2|
// +-----------+---+---+---+
PS: Beware ! Using a cross-join is an expensive operation as it yields a cartesian join.
EDIT: Consider reading this when you want to use this solution.

Scala: Performing comparision on query output

I am trying to compare count of 2 different queries/tables. Is it possible to perform this operation in Scala(Spark SQL)?
Here is my code:
val parquetFile1 = sqlContext.read.parquet("/user/njrbars2/ars/mbr_addr/2016/2016_000_njars_09665_mbr_addr.20161222031015221601.parquet")
val parquetFile2 =sqlContext.read.parquet("/user/njrbars2/ars/mbr_addr/2017/part-r-00000-70ce4958-57fe-487f-a45b-d73b7ef20289.snappy.parquet")
parquetFile1.registerTempTable("parquetFile1")
parquetFile2.registerTempTable("parquetFile2")
scala> var first_table_count=sqlContext.sql("select count(*) from parquetFile1")
first_table_count: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> var second_table_count=sqlContext.sql("select count(*) from parquetFile1 where LINE1_ADDR is NULL and LINE2_ADDR is NULL")
second_table_count: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> first_table_count.show()
+------+
| _c0|
+------+
|119928|
+------+
scala> second_table_count.show()
+---+
|_c0|
+---+
|617|
+---+
I am trying to get difference value of both these queries but getting error.
scala> first_table_count - second_table_count
<console>:30: error: value - is not a member of org.apache.spark.sql.DataFrame
first_table_count - second_table_count
whereas if I do normal substraction, it works
scala> 2 - 1
res7: Int = 1
It seems I have to do some data conversion but not able to find appropriate solution.
In newer version of spark count not return Long value instead it is reaped inside dataframe object i.e. Dataframe[BigInt].
you can try this
val diffrence = first_table_count.first.getLong(0) - second_table_count.first.getLong(0);
And subtract method is not available on dataframe.
You need something like the following to do the conversion:
first_table_count.first.getLong(0)
And here is why you need it:
A DataFrame represents a tabular data structure. So although your SQL seems to return a single value, it actually returns a table containing a single row, and the row contains a single column. Hence we use the above code to extract the first column (index 0) of the first row.