How to calculate a value for each row? - Scala

I have an input dataframe (created from a Hive table) containing more than 100 rows. For each row of the dataframe, I need to extract the column values (mostly strings) and pass those values to a user-defined function. For each row, the function uses these input values and other intermediate dataframes (created from Hive tables) to calculate a set of rows and store them in a result dataframe.
How do I achieve this? Please help.
I tried this:
var df1= hiveContext.sql("Select event_date,channelcode,st,tc,startsec,endsec from program_master")
var count1=df1.count()
df1 = df1.withColumn("INDEX", monotonically_increasing_id())
var i=1
while (i <= count1){
  var ed = df1.filter(df1("INDEX") === s"""$i""").select(to_date(unix_timestamp(df1("event_date"), "dd-MM-yy").cast(TimestampType)).cast(DateType)).first().getDate(0)
  var cc = df1.filter(df1("INDEX") === s"""$i""").select(df1("channelcode")).first().getInt(0)
  var st = df1.filter(df1("INDEX") === s"""$i""").select(df1("st")).first().getString(0)
  var tc = df1.filter(df1("INDEX") === s"""$i""").select(df1("tc")).first().getString(0)
  var ss = df1.filter(df1("INDEX") === s"""$i""").select(df1("startsec")).first().getInt(0)
  var es = df1.filter(df1("INDEX") === s"""$i""").select(df1("endsec")).first().getInt(0)
  calculate_values(ed, cc, st, tc, ss, es, sparkSession)
  i = i + 1
}
The calculate_values definition:
def calculate_values(ed: Date, cc: Integer, st: String, tc: String, ss: Integer, es: Integer, sparkSession: SparkSession): Unit =
There are 2 problems in what I tried, hence I do not have any output:
The withColumn("INDEX", monotonically_increasing_id()) line: I expected it to give numbers like 1, 2, 3, ..., 100 so that I can iterate using i, but it is generating seemingly random, very large numbers.
The filter(...).first() calls: they throw java.util.NoSuchElementException: next on empty iterator.

monotonically_increasing_id() generates numbers that are guaranteed to be increasing and unique, but not consecutive (they embed the partition id), so you cannot rely on it to generate serial numbers the way the row_number() function does. And row_number() is expensive to use on the whole dataset, as it collects all the data into one executor unless you apply row_number() with a partitioned window.
monotonically_increasing_id() is still helpful in cases where you just want to order/sort your data.
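If sequential 1, 2, 3, ... indexes are really needed, a minimal sketch (with the single-executor caveat above, since the window below has no partitionBy) is to wrap monotonically_increasing_id() in row_number():
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// A window without partitionBy moves all rows to a single partition.
val w = Window.orderBy(monotonically_increasing_id())
val indexed = df1.withColumn("INDEX", row_number().over(w)) // INDEX is now 1, 2, 3, ...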
It seems that you are trying to calculate some values row by row using event_date, channelcode, st, tc, startsec and endsec.
If it is a row-by-row calculation then I would suggest using a udf function. You can convert your calculate_values function into a udf function as
import org.apache.spark.sql.functions._
def calculate_values = udf((ed: Date, cc: Int, st: String, tc: String, ss: Int, es: Int) => {
  //write your calculation part here
})
And you call the udf function using withColumn as
df1.withColumn("calculated", calculate(col("ed"), col("cc"), col("ST"), col("TC"), col("ss"), col("es"))
a new column will be created with the calculated value
But if the calculations can be done column wise I would recommend you to look at inbuilt functions too
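Putting it together with the question's actual columns, a minimal sketch could look like the following (it assumes channelcode, startsec and endsec are integer columns and event_date is a "dd-MM-yy" string, as in the question; the udf body is just a placeholder):
import java.sql.Date
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Placeholder calculation: replace with the real row-level logic.
val calculate_values = udf((ed: Date, cc: Int, st: String, tc: String, ss: Int, es: Int) => es - ss)

val result = df1
  .withColumn("event_date_parsed",
    to_date(unix_timestamp(col("event_date"), "dd-MM-yy").cast(TimestampType)))
  .withColumn("calculated",
    calculate_values(col("event_date_parsed"), col("channelcode"), col("st"),
      col("tc"), col("startsec"), col("endsec")))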

Related

Storing Spark DataFrame Value In scala Variable

I need to check for a duplicate filename in my table, and if the file count is 0 then I need to load the file into my table using Spark SQL. I wrote the code below.
val s1=spark.sql("select count(filename) from mytable where filename='myfile.csv'") //giving '2'
s1: org.apache.spark.sql.DataFrame = [count(filename): bigint]
s1.show //giving 2 as output
//s1 gives me the file count from my table; then I need to compare this count value using an if statement.
I'm using the below code.
val s2=s1.count //not working always giving 1
val s2=s1.head.count() // error: value count is not a member of org.apache.spark.sql.Row
val s2=s1.size //value size is not a member of Unit
if(s1>0){ //code } //value > is not a member of org.apache.spark.sql.DataFrame
Can someone please give me a hint on how I should do this? How can I get the dataframe value and use it as a variable to check the condition?
i.e.
if(value of s1(i.e.2)>0){
//my code
}
You need to extract the value itself. Count will return the number of rows in the df, which is just one row.
So you can keep your original query and extract the value afterwards with the first and getLong methods (count(filename) comes back as a bigint, i.e. a Long):
val s1 = spark.sql("select count(filename) from mytable where filename='myfile.csv'")
val valueToCompare = s1.first().getLong(0)
And then:
if(valueToCompare>0){
//my code
}
Another option is performing the count outside the query, then the count will give you the desired value:
val s1 = spark.sql("select filename from mytable where filename='myfile.csv'")
if(s1.count>0){
//my code
}
I like the second option the most, but for no reason other than that I think it is clearer.
spark.sql("select count(filename) from mytable where filename='myfile.csv'") returns a dataframe and you need to extract both the first row and the first column of that row. It is much simpler to directly filter the dataset and count the number of rows in Scala:
val s1 = df.filter($"filename" === "myfile.csv").count
if (s1 > 0) {
...
}
where df is the dataset that corresponds to the mytable table.
If you got the table from some other source and not by registering a view, use SparkSession.table() to get a dataframe using the instance of SparkSession that you already have. For example, in Spark shell the pre-set variable spark holds the session and you'll do:
val df = spark.table("mytable")
val s1 = df.filter($"filename" === "myfile.csv").count

How to Compare columns of two tables using Spark?

I am trying to compare two tables by reading them as DataFrames, and for each common column in those tables I use the concatenation of a primary key, say order_id, with the other columns like order_date, order_name, order_event.
The Scala code I am using:
val primaryKey = "order_id"
for (i <- commonColumnsList) {
  val column_name = i
  val tempDataFrameForNew = newDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
  val tempDataFrameOld = oldDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
  //Get those records which are common in both old/new tables
  val matchedDF = tempDataFrameForNew.intersect(tempDataFrameOld)
  matchCountCalculated = matchedDF.count()
  //Get those records which aren't common in both old/new tables
  nonMatchCountCalculated = tempDataFrameOld.unionAll(tempDataFrameForNew).except(matchedDF).count()
  //Total Null/Non-Null counts in both old and new tables.
  nullsCountInNewDataFrame = newDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
  nullsCountInOldDataFrame = oldDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
  nonNullsCountInNewDataFrame = newDFCount - nullsCountInNewDataFrame
  nonNullsCountInOldDataFrame = oldDFCount - nullsCountInOldDataFrame
  //Put the result for a given column in a Seq variable, later convert it to a Dataframe.
  tempSeq = tempSeq :+ Row(column_name, matchCountCalculated.toString, nonMatchCountCalculated.toString,
    (nullsCountInNewDataFrame - nullsCountInOldDataFrame).toString,
    (nonNullsCountInNewDataFrame - nonNullsCountInOldDataFrame).toString)
}
// Final Step: Create DataFrame using Seq and some Schema.
spark.createDataFrame(spark.sparkContext.parallelize(tempSeq), schema)
The above code works fine for a medium-sized dataset, but as the number of columns and records in my new and old tables increases, the execution time increases too. Any advice is appreciated.
Thank you in advance.
You can do the following:
1. Outer join the old and new dataframes on the primary key
val joined_df = df_old.join(df_new, Seq(primaryKey), "outer")
2. Cache it if you possibly can. This will save you a lot of time
3. Now you can iterate over the columns and compare them using Spark functions (isNull for not matched, === for matched, etc.)
for (c <- df_new.columns) {
  val matchCount = joined_df.filter(df_new(c).isNotNull && df_old(c).isNotNull).count()
  val nonMatchCount = ...
}
This should be considerably faster, especially when you can cache your dataframe. If you can't, it might be a good idea to save the joined df to disk in order to avoid a shuffle each time.
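Putting the three steps together, a minimal sketch might look like this (it assumes df_old and df_new share the same schema and that primaryKey holds the key column name; the match/non-match definitions are illustrative):
import org.apache.spark.sql.functions._

val joined_df = df_old.join(df_new, Seq(primaryKey), "outer").cache()
val totalRows = joined_df.count() // also materialises the cache

val comparison = df_new.columns.filter(_ != primaryKey).map { c =>
  // Referencing each source frame's column disambiguates the same-named
  // columns on the two sides of the join.
  val matchCount = joined_df.filter(df_old(c) === df_new(c)).count()
  val nonMatchCount = totalRows - matchCount // includes rows where either side is null
  (c, matchCount, nonMatchCount)
}

comparison.foreach { case (c, m, n) => println(s"$c: matched=$m, not matched=$n") }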

How to calculate a value from one column between row 1 and row N, with a Scala/Spark data frame

Here is the example dataframe,
city, LONG, LAT
city1, 100.30, 50.11
city2, 100.20, 50.16
city3, 100.20, 51
..
We need to calculate the distance between city1 and all cities, between city2 and all cities, and so on for each city. A 'distance' function has already been created. In Python we could then use a for loop over each line, or a dict.
For a dataframe, how can I apply the for-loop or dict concept?
For example, in Python (not all code shown here):
citydict = dict()
citydict2 = copy.deepcopy(citydict)
for city1, cityinfo1 in citydict.items():
    citydict2.pop(city1)
    for city2, cityinfo2 in citydict2.items():
        s = distancecalc(cityinfo1, cityinfo2)
The crossJoin method does the trick. It returns the cartesian product of two dataframes. The idea is to cross the Dataframe with itself.
import org.apache.spark.sql.functions._
df.as("thisDF")
.crossJoin(df.as("toCompareDF"))
.filter($"thisDF.city" =!= $"toCompareDF.city")
.withColumn("distance", calculateDistance($"thisDF.lon", $"thisDF.lat", $"toCompareDF.lon", $"toCompareDF.lat"))
.show
First of all, we add an alias to our Dataframe so that we can identify it when we perform the join. The next step is to perform the crossJoin over the same Dataframe. Note that we also add an alias to this second Dataframe. To drop the pairs where a city is compared with itself, we filter on the city column.
Finally, we apply a Spark User Defined Function, passing the necessary columns to calculate the distance. This is the declaration of the UDF:
def calculateDistance = udf((lon1: Double, lat1: Double, lon2: Double, lat2: Double) => {
// add calculation here
})
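For example, if plain great-circle distance is what is wanted (an assumption; the question only says a 'distance' function exists), the body could be the haversine formula:
import org.apache.spark.sql.functions.udf

// Haversine great-circle distance in kilometres, assuming coordinates in degrees.
val calculateDistance = udf((lon1: Double, lat1: Double, lon2: Double, lat2: Double) => {
  val earthRadiusKm = 6371.0
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) * math.pow(math.sin(dLon / 2), 2)
  earthRadiusKm * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
})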
And that's all. Hope it helps.

Filter columns having count equal to the input file rdd Spark

I'm filtering integer columns from the input parquet file with the logic below, and have been trying to modify this logic to add an additional validation: to check whether any of the input columns has a count equal to the input parquet file's rdd count. I would want to filter out any such column.
Update
The number of columns and their names in the input file will not be static; they will change every time we get a file.
The objective is to also filter out any column whose count is equal to the input file's rdd count. Filtering integer columns is already achieved with the logic below.
e.g. input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
//Get array of structfields
val columns = df.schema.fields.filter(x =>
x.dataType.typeName.contains("integer"))
//Select only those columns
val z = df.select(columns.map(x => col(x.name)): _*)
//Get the column names as an array of strings
val m = z.columns
The new logic would be like:
val cnt = spark.read.parquet("inputfile").count()
val d = z.columns.filter(column whose count is not equal to cnt) // pseudocode
I do not want to pass the column name explicitly to the new condition, since the column whose count equals the input file count will change (the val d = ... above).
How do we write the logic for this?
According to my understanding of your question, you are trying to keep the columns whose dataType is integer and whose distinct count is not equal to the count of rows in another input parquet file. If my understanding is correct, you can add the column count check to your existing filter as
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code should follow as it is.
I hope the answer is helpful.
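Putting the filter and the select together, a minimal end-to-end sketch (assuming df is the dataframe read from the main parquet file, as in the question) could be:
import org.apache.spark.sql.functions.col

val cnt = spark.read.parquet("inputfile").count()

// Keep integer columns whose distinct count differs from the other file's row count.
val keptFields = df.schema.fields.filter { x =>
  x.dataType.typeName.contains("integer") &&
    df.select(x.name).distinct().count() != cnt
}
val z = df.select(keptFields.map(x => col(x.name)): _*)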
Jeanr and Ramesh suggested the right approach, and here is what I did to get the desired output; it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME").<(cnt))

Iterate across columns in spark dataframe and calculate min max value

I want to iterate across the columns of a dataframe in my Spark program and calculate the min and max values.
I'm new to Spark and Scala, and am not able to iterate over the columns once I fetch them in a dataframe.
I have tried running the code below, but it needs the column number to be passed to it. The question is how do I fetch the columns from the dataframe dynamically, pass them, and store the results in a collection?
val parquetRDD = spark.read.parquet("filename.parquet")
parquetRDD.collect.foreach ({ i => parquetRDD_subset.agg(max(parquetRDD(parquetRDD.columns(2))), min(parquetRDD(parquetRDD.columns(2)))).show()})
Appreciate any help on this.
You should not be iterating over rows or records. You should be using an aggregation function:
import org.apache.spark.sql.functions._
val df = spark.read.parquet("filename.parquet")
val aggCol = col(df.columns(2))
df.agg(min(aggCol), max(aggCol)).show()
First, when you do spark.read.parquet you are reading a dataframe.
Next we define the column we want to work on using the col function. The col function translates a column name to a Column. You could instead use df("name"), where name is the name of the column.
The agg function takes aggregation columns, and min and max are aggregation functions which take a column and return a column with an aggregated value.
Update
According to the comments, the goal is to have min and max for all columns. You can therefore do this:
val minColumns = df.columns.map(name => min(col(name)))
val maxColumns = df.columns.map(name => max(col(name)))
val allMinMax = minColumns ++ maxColumns
df.agg(allMinMax.head, allMinMax.tail: _*).show()
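If, as the question mentions, the results should end up in a Scala collection rather than just being shown, one option is to collect the single aggregated row into a map keyed by the generated column names (e.g. "min(colA)", "max(colA)"); the name minMaxMap is just illustrative:
val row = df.agg(allMinMax.head, allMinMax.tail: _*).first()
val minMaxMap: Map[String, Any] = row.getValuesMap[Any](row.schema.fieldNames)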
You can also simply do:
df.describe().show()
which gives you statistics on all columns, including count, mean, stddev, min and max.