Variable substitution for column names in a Hive query - Scala

I have a task where I need to compare 2 columns of a dataframe and get the differences. There are 200+ columns in the dataframe, and I have to write 100+ queries to check the values in the columns.
e.g. DF1:
https://i.stack.imgur.com/Aj1ca.png
I need all the values where X1 = X2 and the column pairs have different values.
In simple terms:
select A1,A2 from DF1 where X1=X2 and A1!=A2
select B1,B2 from DF1 where X1=X2 and B1!=B2
select C1,C2 from DF1 where X1=X2 and C1!=C2
Since I have 100+ column pairs, I would have to write 100+ such queries. So I want to write a function in Scala where I would just pass the column names (A1, A2 or B1, B2, etc.) and they would be substituted into the Hive query.
def comp_col(a: Any, b: Any): Any = {
  var ret = sqlc.sql("SELECT $a, $b from DF1 WHERE X1= X2 $a!= $b");
  return ret;
}
Is there any way in which the query in the function would take the column names from the variables that I pass?
Any different approach is also welcome.
Thanks in Advance.

Yes, use Scala string interpolation (the s interpolator):
def comp_col(a: String, b: String): DataFrame = {
  // The s prefix interpolates the column names into the query string
  sqlc.sql(s"SELECT $a, $b FROM DF1 WHERE X1 = X2 AND $a != $b")
}
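To avoid writing 100+ calls by hand, the same interpolated query can also be driven from a list of column pairs. A minimal sketch, assuming sqlc is the SQL/Hive context from the question and the pair names below are just placeholders:

import org.apache.spark.sql.DataFrame

// Example pairs only; build this list from your real column names
val columnPairs = Seq(("A1", "A2"), ("B1", "B2"), ("C1", "C2"))

// One query per pair, each returning the rows where the pair differs
val results: Seq[DataFrame] = columnPairs.map { case (a, b) =>
  sqlc.sql(s"SELECT $a, $b FROM DF1 WHERE X1 = X2 AND $a != $b")
}

results.foreach(_.show())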

Related

How to append 'explode'd columns to a dataframe keeping all existing columns?

I'm trying to add exploded columns to a dataframe:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Convenience function for turning JSON strings into DataFrames.
def jsonToDataFrame(json, schema=None):
    # SparkSessions are available with Spark 2.0+
    reader = spark.read
    if schema:
        reader.schema(schema)
    return reader.json(sc.parallelize([json]))

schema = StructType().add("a", MapType(StringType(), IntegerType()))

events = jsonToDataFrame("""
{
  "a": {
    "b": 1,
    "c": 2
  }
}
""", schema)

display(
    events.withColumn("a", explode("a").alias("x", "y"))
)
However, I'm hitting the following error:
AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got a
Any ideas?
In the end, I used the following:
display(
    events.select(explode("a").alias("x", "y"), *[c for c in events.columns])
)
This approach uses select to specify the columns to return.
The first argument explodes the data:
explode("a").alias("x", "y")
The second argument specifies that all existing columns should be included in the select:
*[c for c in events.columns]
Note that I'm prefixing the list with * - this sends each column name as a separate parameter.
Simpler Method
The API docs specify:
Parameters
cols : str, Column, or list
    column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.
We can simplify the first approach by passing in "*" to select all the columns:
display(
    events.select("*", explode("a").alias("x", "y"))
)

How to calculate a value for each row?

I have an input dataframe (created from a hive table) containing more than 100 rows. For each row of the dataframe, I need to extract the column values (mostly strings) and pass those values to a user-defined function. For each row, the function uses these input values and other intermediate dataframes (created from hive tables) to calculate a set of rows and stores them in a result dataframe.
How do I achieve this - please help.
I tried this:
var df1 = hiveContext.sql("Select event_date,channelcode,st,tc,startsec,endsec from program_master")
var count1 = df1.count()
df1 = df1.withColumn("INDEX", monotonically_increasing_id())
var i = 1
while (i <= count1) {
  var ed = df1.filter(df1("INDEX") === s"""$i""").select(to_date(unix_timestamp(df1("ed"), "dd-MM-yy").cast(TimestampType)).cast(DateType)).first().getDate(0)
  var cc = df1.filter(df1("INDEX") === s"""$i""").select(df1("cc")).first().getInt(0)
  var ST = df1.filter(df1("INDEX") === s"""$i""").select(df1("ST")).first().getString(0)
  var TC = df1.filter(df1("INDEX") === s"""$i""").select(df1("TC")).first().getString(0)
  var ss = df1.filter(df1("INDEX") === s"""$i""").select(df1("ss")).first().getInt(0)
  var es = df1.filter(df1("INDEX") === s"""$i""").select(df1("es")).first().getInt(0)
  calculate_values(ed, cc, ST, TC, ss, es, sparkSession)
  i = i + 1
}
The calculate_values definition:
def calculate_values(ed: Date, cc: Integer, ST: String, TC: String, ss: Integer, es: Integer, sparkSession: SparkSession): Unit =
There are 2 problems in what I tried, and hence I do not have an output:
line 3: I expected it to give numbers like 1, 2, 3, ... 100 ... to iterate using i, but it generates random, very large numbers.
line 5: It throws java.util.NoSuchElementException: next on empty iterator
monotonically_increasing_id() generates numbers that are increasing but not consecutive, so you cannot rely on it to produce serial numbers the way the row_number() function does. And row_number() is expensive to use on a whole dataset, because without partitioning it pulls all the data into one executor; it is cheaper when you apply row_number() over grouped (partitioned) data.
monotonically_increasing_id() would be helpful in cases where you want to order/sort your data.
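If consecutive serial numbers are really needed, a window with row_number can produce them. A sketch, reusing df1 from the question (note that with no partitionBy all rows are moved to a single partition):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// row_number over an un-partitioned window yields 1, 2, 3, ...
val w = Window.orderBy(monotonically_increasing_id())
val indexed = df1.withColumn("INDEX", row_number().over(w))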
It seems that you are trying to calculate some values row by row using event_date, channelcode, st, tc, startsec and endsec.
If it is a row-by-row calculation then I would suggest you use a udf function. You can convert your calculate_values function into a udf function as:
import java.sql.Date
import org.apache.spark.sql.functions._
// Put your row-level calculation inside the lambda body
def calculate_value = udf((ed: Date, cc: Int, ST: String, TC: String, ss: Int, es: Int) => /* write your calculation part here */ )
And you call the udf function using withColumn as:
df1.withColumn("calculated", calculate_value(col("ed"), col("cc"), col("ST"), col("TC"), col("ss"), col("es")))
A new column will be created with the calculated value.
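For illustration, here is a minimal self-contained sketch of the udf approach with a made-up calculation standing in for the real calculate_values body, assuming df1 holds the columns selected from program_master and that event_date is already a date column (otherwise cast it first as in the question):

import java.sql.Date
import org.apache.spark.sql.functions._

// Hypothetical calculation; replace the body with the real logic
val calculateValue = udf((ed: Date, cc: Int, st: String, tc: String, ss: Int, es: Int) =>
  s"$cc/$st/$tc ran for ${es - ss} seconds on $ed"
)

// One calculated value per row, with no explicit looping
val withCalc = df1.withColumn(
  "calculated",
  calculateValue(col("event_date"), col("channelcode"), col("st"), col("tc"), col("startsec"), col("endsec"))
)
withCalc.show(false)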
But if the calculations can be done column-wise, I would recommend you look at the inbuilt functions too.

Filter columns having count equal to the input file rdd count in Spark

I'm filtering Integer columns from the input parquet file with the logic below, and I have been trying to modify this logic to add an additional validation: check whether any of the input columns has a count of values equal to the input parquet file's rdd count. I want to filter out any such column.
Update
The number of columns and their names in the input file will not be static; they will change every time we get the file.
The objective is to also filter out any column whose count is equal to the input file rdd count. Filtering integer columns is already achieved with the logic below.
e.g input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
// Get the StructFields of the integer columns
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer"))
// Select those columns
val z = df.select(columns.map(x => col(x.name)): _*)
// Get the column names as an array of Strings
val m = z.columns
The new logic would be something like:
val cnt = spark.read.parquet("inputfile").count()
val d = z.columns where the column count is not equal to cnt
I do not want to pass the column names explicitly to the new condition, since the columns having a count equal to the input file count will change (val d = ... above).
How do we write the logic for this?
According to my understanding of your question, you are trying to keep only the columns with integer as the dataType and whose distinct count is not equal to the count of rows in another input parquet file. If my understanding is correct, you can add the column count check to your existing filter as:
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code should follow as it is.
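If the per-column distinct counts turn out to be slow (they launch one job per column), a single aggregation pass can compute them all at once. A sketch, assuming df and cnt are the same as above:

import org.apache.spark.sql.functions._

// Names of the integer columns
val intCols = df.schema.fields.filter(_.dataType.typeName.contains("integer")).map(_.name)

// One job that computes the distinct count of every integer column
val countExprs = intCols.map(c => countDistinct(col(c)).alias(c))
val counts = df.agg(countExprs.head, countExprs.tail: _*).first()

// Keep only the columns whose distinct count differs from the input file count
val keep = intCols.filter(c => counts.getAs[Long](c) != cnt)
val d = df.select(keep.map(col): _*)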
I hope the answer is helpful.
Jeanr and Ramesh suggested the right approach; here is what I did to get the desired output, and it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME").<(cnt))

Iterate across columns in a Spark dataframe and calculate min and max values

I want to iterate across the columns of a dataframe in my Spark program and calculate the min and max values.
I'm new to Spark and Scala and am not able to iterate over the columns once I fetch them in a dataframe.
I have tried running the code below, but it needs the column number to be passed to it; the question is how do I fetch the columns from the dataframe dynamically, pass them, and store the results in a collection?
val parquetRDD = spark.read.parquet("filename.parquet")
parquetRDD.collect.foreach ({ i => parquetRDD_subset.agg(max(parquetRDD(parquetRDD.columns(2))), min(parquetRDD(parquetRDD.columns(2)))).show()})
Appreciate any help on this.
You should not be iterating over rows or records. You should be using aggregation functions:
import org.apache.spark.sql.functions._
val df = spark.read.parquet("filename.parquet")
val aggCol = col(df.columns(2))
df.agg(min(aggCol), max(aggCol)).show()
First, when you do spark.read.parquet you are reading a dataframe.
Next we define the column we want to work on using the col function. The col function translates a column name to a Column. You could instead use df("name") where name is the name of the column.
The agg function takes aggregation columns; min and max are aggregation functions which take a column and return a column with an aggregated value.
Update
According to the comments, the goal is to have min and max for all columns. You can therefore do this:
val minColumns = df.columns.map(name => min(col(name)))
val maxColumns = df.columns.map(name => max(col(name)))
val allMinMax = minColumns ++ maxColumns
df.agg(allMinMax.head, allMinMax.tail: _*).show()
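Since the question also asks to store the results in a collection, the single aggregated row can be collected and zipped back with the column names. A sketch, reusing allMinMax from above:

// The aggregation returns one row: first all the mins, then all the maxes
val row = df.agg(allMinMax.head, allMinMax.tail: _*).first()
val n = df.columns.length
val minMaxByColumn: Map[String, (Any, Any)] = df.columns.zipWithIndex.map {
  case (name, i) => (name, (row.get(i), row.get(i + n)))
}.toMap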
You can also simply do:
df.describe().show()
which gives you statistics on all columns: count, mean, stddev, min and max.

How to select all columns that start with a common label

I have a dataframe in Spark 1.6 and want to select just some columns out of it. The column names are like:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
I know I can do like this to select specific columns:
df.select("colA", "colB", "colE")
but how to select, say "colA", "colB" and all the colF-* columns at once? Is there a way like in Pandas?
The process can be broken down into the following steps:
First grab the column names with df.columns,
then filter down to just the column names you want with .filter(_.startsWith("colF")). This gives you an array of Strings.
But that form of select takes select(String, String*); luckily there is also select(Column*), so convert the Strings into Columns with .map(df(_)),
and finally turn the Array of Columns into a var arg with : _*.
df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
This filter could be made more complex (same as Pandas). It is however a rather ugly solution (IMO):
df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show
If the list of other columns is fixed, you could also merge a fixed array of column names with the filtered array:
df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show
Python (tested in Azure Databricks)
selected_columns = [column for column in df.columns if column.startswith("colF")]
df2 = df.select(selected_columns)
In PySpark, use colRegex to select columns starting with colF.
With the sample:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
Apply:
df.select(col("colA"), col("colB"), df.colRegex("`(colF)+?.+`")).show()
The result is:
colA, colB, colF-0, colF-1, colF-2
I wrote a function that does that. Read the comments to see how it works.
/**
 * Given a sequence of prefixes, select suitable columns from a [[DataFrame]].
 * @param columnPrefixes Sequence of prefixes
 * @param dF             Incoming [[DataFrame]]
 * @return [[DataFrame]] with only the prefixed columns selected
 */
def selectPrefixedColumns(columnPrefixes: Seq[String], dF: DataFrame): DataFrame = {
  // Find out if a given column name matches any of the provided prefixes
  def colNameStartsWith: String => Boolean = (colName: String) =>
    columnPrefixes.map(prefix => colName.startsWith(prefix)).reduce(_ || _)
  // Filter the columns list by checking against the given prefixes
  val columns = dF.columns.filter(colNameStartsWith)
  // Select the filtered columns
  dF.select(columns.head, columns.tail: _*)
}
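For example, with the sample columns from the question, the function could be called like this (a usage sketch, assuming df is the DataFrame above):

// Keeps colA plus every colF-* column
val selected = selectPrefixedColumns(Seq("colA", "colF"), df)
selected.show()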