variable not binding value in Spark - scala

I am passing a variable, but its value is not being substituted into the query. I populate the variable value here:
val Temp = sqlContext.read.parquet("Tabl1.parquet")
Temp.registerTempTable("temp")
val year = sqlContext.sql("""select value from Temp where name="YEAR"""")
year.show()
Here year.show() displays the proper value.
I am passing the parameter in the code below:
val data = sqlContext.sql("""select count(*) from Table where Year='$year' limit 10""")
data.show()

The value year is a DataFrame, not a specific value (Int or Long). So when you use it inside a string interpolation, you get the result of DataFrame.toString, which isn't something you can compare values against (the toString returns a string representation of the DataFrame's schema). Note also that interpolation only happens at all when the string literal has the s prefix, as in the corrected query below.
If you can assume the year DataFrame has a single Row with a single column of type Int, and you want to get the value of that column, you can use first().getAs[Int](0) to extract it and then use it to construct your next query:
val year: DataFrame = sqlContext.sql("""select value from Temp where name="YEAR"""")
// get the first column of the first row:
val actualYear: Int = year.first().getAs[Int](0)
val data = sqlContext.sql(s"select count(*) from Table where Year='$actualYear' limit 10")
If the value column in the Temp table has a different type (String, Long), just replace the Int with that type.
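For example, a minimal sketch of the String case (assuming, as above, that Temp has exactly one matching row):
val actualYear: String = year.first().getAs[String](0)
val data = sqlContext.sql(s"select count(*) from Table where Year='$actualYear' limit 10")
data.show()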

Related

Spark Cassandra Table Filter in Spark Rdd

I have to filter a Cassandra table in Spark: after getting the data from the table via Spark, I apply a filter function on the returned RDD. We don't want to use the where clause in the Cassandra API, which can filter but needs a custom SASI index on the filter column, and that has disk-overhead issues due to multiple SSTable scans in Cassandra.
for example:
val ct = sc.cassandraTable("keyspace1", "table1")
val fltr = ct.filter(x => x.contains("zz"))
table1 fields are :
dirid uuid
filename text
event int
eventtimestamp bigint
fileid int
filetype int
Basically we need to filter the data based on filename with an arbitrary string, since the returned RDD is of type
com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD
and filter operations are restricted to the methods of the CassandraRow type, which are listed below.
val ct = sc.cassandraTable("keyspace1", "table1")
scala> ct
res140: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[171] at RDD at CassandraRDD.scala:19
When I hit tab after "x." in the filter function below, it shows the following methods of the CassandraRow class:
scala> ct.filter(x=>x.
columnValues getBooleanOption getDateTime getFloatOption getLongOption getString getUUIDOption length
contains getByte getDateTimeOption getInet getMap getStringOption getVarInt metaData
copy getByteOption getDecimal getInetOption getRaw getTupleValue getVarIntOption nameOf
dataAsString getBytes getDecimalOption getInt getRawCql getTupleValueOption hashCode size
equals getBytesOption getDouble getIntOption getSet getUDTValue indexOf toMap
get getDate getDoubleOption getList getShort getUDTValueOption isNullAt toString
getBoolean getDateOption getFloat getLong getShortOption getUUID iterator
You need to get the string field from the CassandraRow object, and then perform the filtering on it. So the code will look as follows:
val fltr = ct.filter(x => x.getString("filename").contains("zz"))
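For completeness, a minimal sketch of the full flow; the keyspace, table and "zz" literal are just the names taken from the question, and the connector dependency is assumed to be on the classpath:
import com.datastax.spark.connector._ // brings sc.cassandraTable into scope

val ct = sc.cassandraTable("keyspace1", "table1") // RDD[CassandraRow]
val fltr = ct.filter(row => row.getString("filename").contains("zz"))
fltr.take(10).foreach(println) // inspect a few matching rows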

Storing Spark DataFrame Value In scala Variable

I need to check for a duplicate filename in my table, and if the file count is 0 I need to load the file into my table using Spark SQL. I wrote the code below.
val s1=spark.sql("select count(filename) from mytable where filename='myfile.csv'") //giving '2'
s1: org.apache.spark.sql.DataFrame = [count(filename): bigint]
s1.show //giving 2 as output
//s1 gives me the file count from my table; I then need to compare this count value in an if statement.
I'm using the code below.
val s2=s1.count //not working, always gives 1
val s2=s1.head.count() // error: value count is not a member of org.apache.spark.sql.Row
val s2=s1.size //value size is not a member of Unit
if(s1>0){ //code } //value > is not a member of org.apache.spark.sql.DataFrame
Can someone please give me a hint on how to do this? How can I get the dataframe value and use it as a variable to check the condition?
i.e.
if(value of s1(i.e.2)>0){
//my code
}
You need to extract the value itself. Count will return the number of rows in the df, which is just one row.
So you can keep your original query and extract the value afterwards with the first and getLong methods (count(filename) comes back as a bigint):
val s1 = spark.sql("select count(filename) from mytable where filename='myfile.csv'")
val valueToCompare = s1.first().getLong(0)
And then:
if(valueToCompare>0){
//my code
}
Another option is to perform the count outside the query; then count will give you the desired value:
val s1 = spark.sql("select filename from mytable where filename='myfile.csv'")
if(s1.count>0){
//my code
}
I like the second option the most, but for no reason other than that I think it is clearer.
spark.sql("select count(filename) from mytable where filename='myfile.csv'") returns a dataframe and you need to extract both the first row and the first column of that row. It is much simpler to directly filter the dataset and count the number of rows in Scala:
val s1 = df.filter($"filename" === "myfile.csv").count
if (s1 > 0) {
...
}
where df is the dataset that corresponds to the mytable table.
If you got the table from some other source and not by registering a view, use SparkSession.table() to get a dataframe using the instance of SparkSession that you already have. For example, in Spark shell the pre-set variable spark holds the session and you'll do:
val df = spark.table("mytable")
val s1 = df.filter($"filename" === "myfile.csv").count
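Note that the $"filename" column syntax needs the session's implicits in scope; a minimal sketch under that assumption:
import spark.implicits._ // enables the $"column" syntax

val df = spark.table("mytable")
val s1 = df.filter($"filename" === "myfile.csv").count // returns a Long
if (s1 > 0) {
  // load the file
}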

Getting max value out of a dataframe column of timestamp in scala/spark

I am working with a Spark dataframe that contains timestamp values in the column 'IMG_CREATED_DT'. I have used the collectAsList() and toString() methods to get the values as a List and convert them into a String, but I am not getting how to fetch the max value out of it. Please guide me on this.
val query_new =s"""(select IMG_CREATED_DT from
${conf.get(UNCAppConstants.DB2_SCHEMA)}.$table)"""
println(query_new)
val db2_op=ConnectionUtilities_v.createDataFrame(src_props,srcConfig.url,query_new)
val t3 = db2_op.select("IMG_CREATED_DT").collectAsList().toString
How do I get the max value out of t3?
You can calculate the max value from the dataframe itself. Try the following sample:
import org.apache.spark.sql.functions.max
val t3 = db2_op.agg(max("IMG_CREATED_DT").as("maxVal")).take(1)(0).get(0)
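If you prefer a typed value instead of an Any, a sketch assuming IMG_CREATED_DT is a SQL timestamp column (same functions.max import as above):
val maxCreated = db2_op
  .agg(max("IMG_CREATED_DT").as("maxVal"))
  .first()
  .getAs[java.sql.Timestamp]("maxVal") // typed max value
println(maxCreated)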

how to convert org.apache.spark.sql.DataFrame with single row and single column into an integer value

Consider querying Hive data from inside Spark using:
val selectMemCntQry = "select column1 from table1 where column2 = "+col_2_val
val table_col2 = sparkSession.sql(selectMemCntQry)
val diff = table_col2 - file_member_count
where file_member_count is an integer value. I know the result of table_col2 is always going to be a single number.
I want to subtract the result of the query from an integer value, but the error I am facing is:
value - is not a member of org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
val Row(colum1: Integer) = sparkSession.sql(selectMemCntQry).first
colum1 - file_member_count
or
sparkSession.sql(selectMemCntQry).first.getAs[Integer]("column1") - file_member_count
or
sparkSession.sql(selectMemCntQry).as[Integer].first - file_member_count
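Putting it together, a minimal sketch assuming column1 is stored as an Int and the query returns exactly one row:
// extract the single value from the single-row, single-column result
val countFromTable = sparkSession.sql(selectMemCntQry).first().getAs[Int]("column1")
val diff = countFromTable - file_member_count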

Filter columns having count equal to the input file rdd Spark

I'm filtering Integer columns from the input parquet file with the logic below, and I've been trying to modify this logic to add an additional validation: if any of the input columns has a count equal to the input parquet file's RDD count, I want to filter out that column as well.
Update
The number and names of the columns in the input file will not be static; they will change every time we get the file.
The objective is to also filter out any column for which the count is equal to the input file's RDD count. Filtering integer columns is already achieved with the logic below.
e.g input parquet file count = 100
count of values in column A in the input file = 100
Filter out any such column.
Current Logic
//Get the array of integer-typed StructFields
val columns = df.schema.fields.filter(x =>
x.dataType.typeName.contains("integer"))
//Select only those columns
val z = df.select(columns.map(x => col(x.name)): _*)
//Get the column names as an array of strings
val m = z.columns
The new logic would be something like:
val cnt = spark.read.parquet("inputfile").count()
val d = z.column.where column count is not equals cnt // pseudocode
I do not want to pass the column name explicitly to the new condition, since the column whose count equals the input file's count will change (val d = .. above).
How do we write logic for this ?
According to my understanding of your question, you are trying to keep columns with integer as the dataType and whose distinct count is not equal to the row count of another input parquet file. If my understanding is correct, you can add the column count check to your existing filter as:
val cnt = spark.read.parquet("inputfile").count()
val columns = df.schema.fields.filter(x =>
x.dataType.typeName.contains("integer") && df.select(x.name).distinct().count() != cnt)
The rest of the code can stay as it is.
I hope the answer is helpful.
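Putting the whole answer together, a minimal sketch (same df and input file as above, integer columns only):
import org.apache.spark.sql.functions.col

val cnt = spark.read.parquet("inputfile").count()

// keep only integer columns whose distinct count differs from the input file's row count
val columns = df.schema.fields.filter(x =>
  x.dataType.typeName.contains("integer") &&
  df.select(x.name).distinct().count() != cnt)

val z = df.select(columns.map(x => col(x.name)): _*)
val m = z.columns // names of the surviving columns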
Jeanr and Ramesh suggested the right approach; here is what I did to get the desired output, and it worked :)
val cnt = inputfiledf.count()
val r = df.select(df.col("*")).where(df.col("MY_COLUMN_NAME") < cnt)