Comparing columns in two DataFrames in Spark - Scala

I have two DataFrames, each with a different number of columns.
I need to compare three fields between them to check whether they are equal.
I tried the following approach, but it's not working.
if(df_table_stats("rec_cnt").equals(df_aud("REC_CNT")) || df_table_stats("hashcount").equals(df_aud("HASH_CNT")) || round(df_table_stats("hashsum"),0).equals(round(df_aud("HASH_TTL"),0)))
{
println("Job executed successfully")
}
df_table_stats("rec_cnt") returns a Column rather than the actual value, so the condition evaluates to false.
Also, please explain the difference between df_table_stats.select("rec_cnt") and df_table_stats("rec_cnt").
Thanks.

Use SQL and inner join both DataFrames on your conditions.

Per my comment, the syntax you're using only creates column references; these don't actually return data. Assuming you MUST use Spark for this, you'd want a method that actually returns the data, known in Spark as an action. In this case you can use take to return the first Row of data and extract the desired columns:
import org.apache.spark.sql.Row
// take is an action: it runs the query and returns rows to the driver
val tableStatsRow: Row = df_table_stats.take(1).head
val audRow: Row = df_aud.take(1).head
val tableStatsRecCount = tableStatsRow.getAs[Int]("rec_cnt")
val audRecCount = audRow.getAs[Int]("REC_CNT")
// repeat for the other values you need to capture
However, Spark definitely is overkill if this is all you're using it for. You could use a simple JDBC library for Scala like ScalikeJDBC to do these queries and capture the primitives in the results.
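To make the difference concrete: df_table_stats("rec_cnt") is only a Column, an unevaluated reference, and df_table_stats.select("rec_cnt") is another DataFrame; neither holds a value until an action runs. Below is a minimal sketch of the comparison, assuming each DataFrame holds exactly one summary row. The names StatsCheckSketch and statsMatch are hypothetical, the casts are there only because the real column types are unknown, and the sketch requires all three fields to match (the snippet in the question joined them with ||).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, round}

// Hypothetical helper, not part of Spark.
object StatsCheckSketch {
  // Assumption: both DataFrames contain exactly one row of statistics.
  // head() is an action; it throws if the DataFrame is empty.
  def statsMatch(tableStats: DataFrame, aud: DataFrame): Boolean = {
    val left = tableStats.select(
      col("rec_cnt").cast("long"),
      col("hashcount").cast("long"),
      round(col("hashsum").cast("double"), 0)
    ).head()
    val right = aud.select(
      col("REC_CNT").cast("long"),
      col("HASH_CNT").cast("long"),
      round(col("HASH_TTL").cast("double"), 0)
    ).head()
    // Compare the materialized values, not the Column references.
    left.toSeq == right.toSeq
  }
}
```

With this in place the check becomes if (StatsCheckSketch.statsMatch(df_table_stats, df_aud)) println("Job executed successfully").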

Related

How to create a DataFrame from an array in Scala?

I have a use case where I need to create a DataFrame from an array.
I've created a DataFrame that reads a CSV then I am using a map to process/transform it further.
var mapTransform = df1.collect.map(
line => {
// line.split(",") logic for fields separation
//transformation logic here for various fields
(field1+","+field2+","+field3);
}
)
From this, I get an array (Array[String]), which is the transformed result.
I want to convert it further into a DataFrame with separate columns so that it can later be written to a DB or a file; however, I am facing an issue. Is it possible to do this? Any solutions?
This does your job:
spark.sparkContext.parallelize(mapTransform.toSeq)
But note that you should avoid methods that return a non-RDD result (such as collect), as they load the entire contents onto a single node, which is inefficient in the general case.
Also, by convention, prefer val over var wherever possible.
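The parallelize call above yields an RDD[String], not yet a DataFrame with separate columns. A minimal sketch of the remaining step follows, assuming every transformed line has exactly three comma-separated fields; the object name, the helper toThreeColumns, and the column names field1 to field3 are illustrative, not taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of Spark.
object ArrayToDataFrameSketch {
  // Assumption: each line looks like "a,b,c"; lines with any other
  // number of fields are dropped.
  def toThreeColumns(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.createDataset(lines)        // distribute the local strings
      .map(_.split(",", -1))          // -1 keeps trailing empty fields
      .filter(_.length == 3)
      .map(parts => (parts(0), parts(1), parts(2)))
      .toDF("field1", "field2", "field3")
  }
}
```

Usage would be along the lines of ArrayToDataFrameSketch.toThreeColumns(spark, mapTransform.toSeq). If the transformation can be applied to df1 directly with map, the collect step and the round trip through the driver can be skipped altogether.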

Using MLUtils.convertVectorColumnsToML() inside a UDF?

I have a Dataset/DataFrame with an mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column of type ml.linalg.Vector to this Dataset (so I will have both types of Vectors). The reason is that I am evaluating a few algorithms, and some of them expect an mllib vector while others expect an ml vector. Also, I have to feed the output of one algorithm to another, and each uses a different type.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append a new column to the Dataset at hand? I tried using MLUtils.convertVectorColumnsToML() inside a UDF and in regular functions but was not able to get it working. I am trying to avoid creating a new Dataset and then doing an inner join and dropping the columns, as the Dataset will eventually be huge and joins are expensive.
You can use the method asML to convert from an mllib vector to an ml vector. A UDF and usage example can look like this:
import org.apache.spark.sql.functions.udf
// asML converts an mllib.linalg.Vector to the equivalent ml.linalg.Vector
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) => mllibVec.asML)
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
This assumes df is the original DataFrame and the column holding the mllib vector is named mllibVector.

Getting the value of a DataFrame column in Spark

I am trying to retrieve the value of a DataFrame column and store it in a variable. I tried this :
val name=df.select("name")
val name1=name.collect()
But none of the above is returning the value of column "name".
Spark version :2.2.0
Scala version :2.11.11
There are a couple of things here. If you want to see all the data, collect is the way to go. However, if your data is too large, it will cause the driver to fail.
So the alternative is to check a few items from the DataFrame. What I generally do is
df.limit(10).select("name").as[String].collect()
This will return 10 elements, but the output doesn't look good.
So, the second alternative is
df.select("name").show(10)
This will print the first 10 elements. If the column values are long, show truncates them and puts "..." in place of the rest, which is annoying (show(10, false) disables the truncation).
Hence there is a third option
df.select("name").take(10).foreach(println)
This takes 10 elements and prints them.
In all these cases you won't get a fair sample of the data, as the first 10 rows will be picked. To truly pick rows at random from the DataFrame you can use
df.select("name").sample(true, .2).show(10)
or
df.select("name").sample(true, .2).take(10).foreach(println)
You can check the "sample" function on DataFrame; its first argument is withReplacement and its second is the fraction.
The first will do :)
val name = df.select("name") will return another DataFrame. You can, for example, call name.show() to show the content of the DataFrame. You can also call collect (or collectAsMap on a pair RDD) to materialize the results on the driver, but be aware that the amount of data should not be too big for the driver.
You can also do:
val names = df.select("name").as[String].collect()
This will return an array of the names in this DataFrame.
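If only a single value is needed in a variable, a minimal sketch along the same lines is shown below. It assumes "name" is a string column; the object ColumnValueSketch and the helper firstName are hypothetical names used for illustration.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not part of Spark.
object ColumnValueSketch {
  // Value of "name" in the first row; None if the DataFrame is empty
  // or that value is null. Assumes "name" is a string column.
  def firstName(df: DataFrame): Option[String] = {
    val spark = df.sparkSession
    import spark.implicits._
    df.select("name").as[String].take(1).headOption.flatMap(Option(_))
  }
}
```

Returning an Option avoids an exception on an empty DataFrame, which calling head directly would throw.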

Calculate row mean, ignoring NAs in Spark Scala

I'm trying to find a way to calculate the mean of rows in a Spark Dataframe in Scala where I want to ignore NAs. In R, there is a very convenient function called rowMeans where one can specify to ignore NAs:
rowMeans(df, na.rm=TRUE)
I'm unable to find a corresponding function for Spark DataFrames, and I wonder if anyone has a suggestion or input on whether this would be possible. Replacing them with 0 won't do, since this will affect the denominator.
I found a similar question here; however, my DataFrame will have hundreds of columns.
Any help and shared insights are appreciated, cheers!
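Usually such aggregate functions ignore nulls by default.
Even if there are mixed columns with numeric and string types, this one will drop strings and nulls and calculate over the numerics only. Note that mean is an aggregate, so it produces one value per column rather than one per row.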
df.select(df.columns.map(c => mean(col(c))) :_*).show
You can do this by first identifying which fields are numeric, and then computing their mean for each row...
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types._
val df = List(("a",1,2,3.0),("b",5,6,7.0)).toDF("s1","i1","i2","i3")
// grab numeric fields
val numericTypes: Set[DataType] = Set(ShortType, IntegerType, LongType, FloatType, DoubleType)
val numericFields = df.schema.fields.filter(f => numericTypes.contains(f.dataType)).map(_.name)
// compute mean: sum the numeric columns, then divide by how many there are
val rowMeans = df.select(numericFields.map(f => col(f)).reduce(_+_) / lit(numericFields.length) as "row_mean")
rowMeans.show
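In the snippet above, a null in any numeric column makes the whole sum null, and the denominator is fixed at the number of columns, so nulls are not yet ignored the way na.rm=TRUE does it. A minimal sketch of one way to skip them follows: nulls add 0 to the sum and 0 to the count. The object RowMeanSketch and the helper withRowMean are hypothetical names, and the sketch assumes the DataFrame has at least one numeric column.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, lit, when}
import org.apache.spark.sql.types.NumericType

// Hypothetical helper, not part of Spark.
object RowMeanSketch {
  // Adds a column with the per-row mean of all numeric columns, skipping nulls.
  // Assumption: df has at least one numeric column (reduce fails otherwise).
  def withRowMean(df: DataFrame, outCol: String = "row_mean"): DataFrame = {
    val numericCols = df.schema.fields.collect {
      case f if f.dataType.isInstanceOf[NumericType] => col(f.name)
    }
    // A null contributes 0 to the sum ...
    val total = numericCols
      .map(c => coalesce(c.cast("double"), lit(0.0)))
      .reduce(_ + _)
    // ... and 0 to the count of values actually present.
    val present = numericCols
      .map(c => when(c.isNotNull, 1).otherwise(0))
      .reduce(_ + _)
    // Rows where every numeric column is null get a null mean.
    df.withColumn(outCol, when(present > 0, total / present))
  }
}
```

Because the expression is built by folding over the schema, it should scale to hundreds of columns without listing them by hand.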

Append a column to Data Frame in Apache Spark 1.3

Is it possible, and what would be the most efficient, neat method, to add a column to a DataFrame?
More specifically, the column may serve as row IDs for the existing DataFrame.
In a simplified case, reading from a file and not tokenizing it, I can think of something like the code below (in Scala), but it completes with errors (at line 3), and anyway doesn't look like the best route possible:
var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))
It's been a while since I posted the question and it seems that some other people would like to get an answer as well. Below is what I found.
So the original task was to append a column with row identifiers (basically, a sequence 1 to numRows) to any given data frame, so that the row order/presence can be tracked (e.g. when you sample). This can be achieved by something along these lines:
sc.textFile(file).
zipWithIndex().
map { case (d, i) => i.toString + delimiter + d }.
map(_.split(delimiter)).
map(s => Row.fromSeq(s.toSeq))
Regarding the general case of appending any column to any data frame:
The "closest" to this functionality in the Spark API are withColumn and withColumnRenamed. According to the Scala docs, the former "Returns a new DataFrame by adding a column". In my opinion, this is a somewhat confusing and incomplete definition. Both of these functions can operate on this data frame only, i.e. given two data frames df1 and df2 with column col:
val df = df1.withColumn("newCol", df1("col") + 1) // -- OK
val df = df1.withColumn("newCol", df2("col") + 1) // -- FAIL
So unless you can manage to transform a column in an existing data frame to the shape you need, you can't use withColumn or withColumnRenamed for appending arbitrary columns (standalone or from other data frames).
As was commented above, a workaround may be to use a join. This would be pretty messy, although possible: attaching unique keys to both data frames or columns, as above with zipWithIndex, might work. Although efficiency is ...
It's clear that appending a column to a data frame is not an easy operation in a distributed environment, and there may not be a very efficient, neat method for it at all. But I think it's still very important to have this core functionality available, even with performance warnings.
Not sure if it works in Spark 1.3, but in Spark 1.5 I use withColumn:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
df.withColumn("newName",lit("newValue"))
I use this when I need a value that is not related to the existing columns of the DataFrame.
This is similar to @NehaM's answer but simpler.
I took help from the answer above. However, I find it incomplete if we want to change a DataFrame, and the current APIs are a little different in Spark 1.6.
zipWithIndex() returns a tuple of (Row, Long), which contains each row and its corresponding index. We can use it to create a new Row according to our needs.
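import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// prepend the index (as a string) to each row's values
val rdd = df.rdd.zipWithIndex()
  .map(indexedRow => Row.fromSeq(indexedRow._2.toString +: indexedRow._1.toSeq))
val newstructure = StructType(Seq(StructField("Row number", StringType, true)).++(df.schema.fields))
sqlContext.createDataFrame(rdd, newstructure).show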
I hope this will be helpful.
You can use row_number with a Window function, as below, to get a distinct ID for each row in a DataFrame.
df.withColumn("ID", row_number() over Window.orderBy("any column name in the dataframe"))
You can also use monotonically_increasing_id for the same purpose:
df.withColumn("ID", monotonically_increasing_id())
And there are some other ways too.
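The two functions can also be combined: monotonically_increasing_id yields unique, increasing but non-consecutive IDs, and row_number ordered by that value turns them into 1 to N. The sketch below assumes a Spark version newer than 1.3 in which both functions exist; the object RowIdSketch, the helper withConsecutiveId, and the temporary column name _mono_id are hypothetical. A Window without partitionBy moves all rows into a single partition, so this may be slow on large data.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, row_number}

// Hypothetical helper, not part of Spark.
object RowIdSketch {
  // Adds consecutive IDs 1..N following the DataFrame's current row order.
  def withConsecutiveId(df: DataFrame, idCol: String = "ID"): DataFrame = {
    // Assumption: no existing column is called "_mono_id".
    val tmp = "_mono_id"
    df.withColumn(tmp, monotonically_increasing_id())
      // No partitionBy: all rows are ranked together in one partition.
      .withColumn(idCol, row_number().over(Window.orderBy(col(tmp))))
      .drop(tmp)
  }
}
```

For very large DataFrames the zipWithIndex approach shown earlier stays distributed and may be the better fit.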