I have created an empty DataFrame: var c = spark.emptyDataFrame
I also have a dataset with 200+ columns; below is my loop code:
for (x <- groupcols) {
  val t = df.groupBy(x).agg(countDistinct("ID") as "ID_Count", countDistinct("ID") / df.count as "Percentage")
  t.show
}
t.show gives me a table with 3 columns: column a: x (the grouping column), column b: ID_Count, column c: Percentage.
I want to append each result into the empty DataFrame.
I tried converting the result to a string and appending it to a string, but I am unable to view the result.
I would go with a reduce function:
val result = groupcols
  .map(g => df.groupBy(g).agg(countDistinct("ID") as "ID_Count", countDistinct("ID") / df.count as "Percentage"))
  .reduceLeft((x, y) => x.union(y))
This is written from scratch without testing, but it should work roughly like this. You don't need an empty DataFrame for this: it unions all the results together, and you should be able to do a result.show().
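If the column names of the per-group results bother you (each grouped column keeps its own name), a small variation is to rename everything to a common schema before the union. A hedged, untested sketch, reusing df, groupcols and ID from the question:
import org.apache.spark.sql.functions.{col, countDistinct, lit}

val total = df.count
val result = groupcols
  .map { g =>
    df.groupBy(g)
      .agg(countDistinct("ID") as "ID_Count", countDistinct("ID") / total as "Percentage")
      .select(lit(g) as "column_name", col(g).cast("string") as "value", col("ID_Count"), col("Percentage"))
  }
  .reduceLeft(_ union _)
result.show()
The cast to string is only there so that grouping columns of different types can be unioned into one value column.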
I need to check for a duplicate filename in my table, and if the file count is 0 then I need to load the file into my table using Spark SQL. I wrote the code below.
val s1=spark.sql("select count(filename) from mytable where filename='myfile.csv'") //giving '2'
s1: org.apache.spark.sql.DataFrame = [count(filename): bigint]
s1.show //giving 2 as output
s1 gives me the file count from my table; I then need to compare this count value using an if statement. I'm using the code below:
val s2=s1.count //not working always giving 1
val s2=s1.head.count() // error: value count is not a member of org.apache.spark.sql.Row
val s2=s1.size //value size is not a member of Unit
if(s1>0){ //code } //value > is not a member of org.apache.spark.sql.DataFrame
Can someone please give me a hint on how I should do this? How can I get the value out of the DataFrame and use it as a variable to check the condition?
i.e.
if(value of s1(i.e.2)>0){
//my code
}
You need to extract the value itself. count will return the number of rows in the DataFrame, which is just one here.
So you can keep your original query and extract the value afterwards with the first and getLong methods (the count comes back as a bigint):
val s1 = spark.sql("select count(filename) from mytable where filename='myfile.csv'")
val valueToCompare = s1.first().getLong(0)
And then:
if(valueToCompare>0){
//my code
}
Another option is performing the count outside the query, then the count will give you the desired value:
val s1 = spark.sql("select filename from mytable where filename='myfile.csv'")
if(s1.count>0){
//my code
}
I like the second option the most, but for no reason other than that I think it is clearer.
spark.sql("select count(filename) from mytable where filename='myfile.csv'") returns a dataframe and you need to extract both the first row and the first column of that row. It is much simpler to directly filter the dataset and count the number of rows in Scala:
val s1 = df.filter($"filename" === "myfile.csv").count
if (s1 > 0) {
...
}
where df is the dataset that corresponds to the mytable table.
If you got the table from some other source and not by registering a view, use SparkSession.table() to get a dataframe using the instance of SparkSession that you already have. For example, in Spark shell the pre-set variable spark holds the session and you'll do:
val df = spark.table("mytable")
val s1 = df.filter($"filename" === "myfile.csv").count
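If all you really need is the yes/no check, a slightly cheaper variant (assuming Spark 2.4+, where Dataset.isEmpty is available) skips the full count:
val df = spark.table("mytable")
if (!df.filter($"filename" === "myfile.csv").isEmpty) {
  //my code
}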
I have an input dataframe (created from a Hive table) containing more than 100 rows. For each row of the dataframe, I need to extract the column values (mostly strings) and pass them to a user-defined function. For each row, the function uses these input values and other intermediate dataframes (created from Hive tables) to calculate a set of rows and stores them in a result dataframe.
How do I achieve this?
I tried this:
var df1= hiveContext.sql("Select event_date,channelcode,st,tc,startsec,endsec from program_master")
var count1=df1.count()
df1 = df1.withColumn("INDEX", monotonically_increasing_id())
var i=1
while (i <= count1){
var ed = df1.filter(df1("INDEX") === s"""$i""").select(to_date(unix_timestamp(df1("ed"), "dd-MM-yy").cast(TimestampType)).cast(DateType)).first().getDate(0)
var cc = df1.filter(df1("INDEX") === s"""$i""").select(df1("cc")).first().getInt(0)
var ST = df1.filter(df1("INDEX") === s"""$i""").select(df1("ST")).first().getString(0)
var TC = df1.filter(df1("INDEX") === s"""$i""").select(df1("TC")).first().getString(0)
var ss = df1.filter(df1("INDEX") === s"""$i""").select(df1("ss")).first().getInt(0)
var es = df1.filter(df1("INDEX") === s"""$i""").select(df1("es")).first().getInt(0)
calculate_values(ed, cc, st, tc, ss, ss, sparkSession)
i=i+1
}
The calculate_values definition:
def calculate_values(ed: Date, cc: Integer, ST: String, TC: String, ss: Integer, es: Integer, sparkSession: SparkSession): Unit =
There are 2 problems in what I tried, hence I do not have an output:
line 3: I expected it to generate numbers like 1, 2, 3, ..., 100 so that I could iterate using i, but it generates very large, seemingly random numbers.
line 5: it throws java.util.NoSuchElementException: next on empty iterator.
monotonically_increasing_id() generates numbers that are increasing but not consecutive, so you cannot rely on it to produce serial numbers the way the row_number() function does. But row_number() is expensive to use on the whole dataset, as it will collect all the data onto one executor unless you apply row_number() over grouped (partitioned) data.
monotonically_increasing_id() is helpful in cases where you only want to order/sort your data.
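For completeness, here is a minimal, untested sketch of the row_number() approach, ordering by event_date from the original query; since the window has no partitionBy, Spark will move all the data into a single partition, which is exactly the cost mentioned above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.orderBy("event_date")
val indexed = df1.withColumn("INDEX", row_number().over(w))  // INDEX is 1, 2, 3, ...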
It seems that you are trying to calculate some values row by row using event_date, channelcode, st, tc, startsec and endsec.
If it's a row-by-row calculation then I would suggest using a udf function. You can convert your calculate_values function into a udf as:
import java.sql.Date
import org.apache.spark.sql.functions._
def calculate_value = udf((ed: Date, cc: Int, ST: String, TC: String, ss: Int, es: Int) => { /* write your calculation part here */ })
And you call the udf function using withColumn:
df1.withColumn("calculated", calculate_value(col("ed"), col("cc"), col("ST"), col("TC"), col("ss"), col("es")))
A new column will be created with the calculated value.
But if the calculations can be done column-wise, I would recommend looking at the built-in functions too.
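To make that concrete, here is a small hedged sketch with a made-up calculation (end time minus start time); the real body would be whatever calculate_values computes. It uses the startsec/endsec columns from the original query and shows both the udf version and the equivalent built-in column expression, which Spark can optimize better:
import org.apache.spark.sql.functions._

// udf version
val duration = udf((ss: Int, es: Int) => es - ss)
val withUdf = df1.withColumn("duration", duration(col("startsec"), col("endsec")))

// same thing with built-in column arithmetic, no udf needed
val withBuiltins = df1.withColumn("duration", col("endsec") - col("startsec"))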
I have a task where I need to compare 2 columns of a dataframe and get the differences. There are 200+ columns in the dataframe, so I would have to write 100+ queries to check the values in the columns.
e.g. DF1:
https://i.stack.imgur.com/Aj1ca.png
I need all the values where X1 = X2 and the paired columns have different values.
In simple terms-
select A1,A2 from DF1 where X1=X2 and A1!=A2
select B1,B2 from DF1 where X1=X2 and B1!=B2
select C1,C2 from DF1 where X1=X2 and C1!=C2
Now, since I have 100+ columns, I would have to write 100+ such queries. So I wanted to write a function in Scala to which I would just pass the column names (A1,A2 or B1,B2, etc.), which would be substituted into the Hive query.
def comp_col(a:Any, b:Any):Any= {
var ret = sqlc.sql("SELECT $a, $b from DF1 WHERE X1= X2 $a!= $b");
return ret;
}
Is there any way the query in the function could take the column names from the variables that I pass?
Any different approach is also welcome.
Thanks in Advance.
Yes, use Scala string interpolation (note the s prefix on the string literal; the query also needs an AND between the two conditions):
def comp_col(a: String, b: String) =
  sqlc.sql(s"SELECT $a, $b FROM DF1 WHERE X1 = X2 AND $a != $b")
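A different, pure-DataFrame sketch (untested), assuming df is the DataFrame for DF1 and that the paired columns always share a prefix and end in 1/2 (A1/A2, B1/B2, ...):
import org.apache.spark.sql.functions.col

val prefixes = df.columns.filter(c => c.endsWith("1") && c != "X1").map(_.dropRight(1))
val mismatches = prefixes.map { p =>
  df.filter(col("X1") === col("X2") && col(s"${p}1") =!= col(s"${p}2"))
    .select(col(s"${p}1"), col(s"${p}2"))
}
mismatches.foreach(_.show())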
I'm filtering Integer columns from the input parquet file with the logic below, and I have been trying to modify this logic to add an additional validation: if any input column has a count equal to the input parquet file's row count, I want to filter that column out.
Update
The number of columns and their names in the input file will not be static; they will change every time we get a file.
The objective is to also filter out any column whose count is equal to the input file's row count. Filtering integer columns is already achieved with the logic below.
e.g. input parquet file row count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
// Get an array of StructFields whose type is integer
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer"))
// Select only those columns
val z = df.select(columns.map(x => col(x.name)): _*)
// Get the column names as an array of strings
val m = z.columns
The new logic would be something like:
val cnt = spark.read.parquet("inputfile").count()
val d = z.column.where column count is not equals cnt
I do not want to pass the column name explicitly in the new condition, since the column whose count equals the input file count will change (the val d = ... above).
How do we write the logic for this?
According to my understanding of your question, you are trying to keep the columns whose dataType is integer and whose distinct count is not equal to the row count of the input parquet file. If my understanding is correct, you can add the count check to your existing filter as:
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code can stay as it is.
I hope the answer is helpful.
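Putting the pieces together, an untested sketch of the whole check, with the same names as above:
val cnt = spark.read.parquet("inputfile").count()
// keep only integer columns whose distinct count differs from the input file's row count
val kept = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
val z = df.select(kept.map(x => col(x.name)): _*)
val m = z.columns  // names of the surviving columns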
Jeanr and Ramesh suggested the right approach; here is what I did to get the desired output, and it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME") < cnt)
I have a spark dataframe like below
id|name|age|sub
1 |ravi|21 |[M,J,J,K]
I don't want to explode the column "sub", as that would create an extra set of rows. I want to generate the unique values from the "sub" column and assign them to a new column sub_unique.
My output should be like
id|name|age|sub_unique
1 |ravi|21 |[M,J,K]
You can use a udf:
val distinct = udf((x: Seq[String]) => if (x != null) x.distinct else Seq[String]())
df.withColumn("sub_unique", distinct($"sub"))