I am trying to compare the counts of two different queries/tables. Is it possible to perform this operation in Scala (Spark SQL)?
Here is my code:
val parquetFile1 = sqlContext.read.parquet("/user/njrbars2/ars/mbr_addr/2016/2016_000_njars_09665_mbr_addr.20161222031015221601.parquet")
val parquetFile2 =sqlContext.read.parquet("/user/njrbars2/ars/mbr_addr/2017/part-r-00000-70ce4958-57fe-487f-a45b-d73b7ef20289.snappy.parquet")
parquetFile1.registerTempTable("parquetFile1")
parquetFile2.registerTempTable("parquetFile2")
scala> var first_table_count=sqlContext.sql("select count(*) from parquetFile1")
first_table_count: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> var second_table_count=sqlContext.sql("select count(*) from parquetFile1 where LINE1_ADDR is NULL and LINE2_ADDR is NULL")
second_table_count: org.apache.spark.sql.DataFrame = [_c0: bigint]
scala> first_table_count.show()
+------+
| _c0|
+------+
|119928|
+------+
scala> second_table_count.show()
+---+
|_c0|
+---+
|617|
+---+
I am trying to get the difference between the results of these two queries, but I am getting an error.
scala> first_table_count - second_table_count
<console>:30: error: value - is not a member of org.apache.spark.sql.DataFrame
first_table_count - second_table_count
whereas if I do normal subtraction, it works:
scala> 2 - 1
res7: Int = 1
It seems I have to do some data conversion, but I am not able to find an appropriate solution.
In newer versions of Spark, the count is not returned as a plain Long value; instead it is wrapped inside a DataFrame object, i.e. DataFrame[BigInt].
You can try this:
val difference = first_table_count.first.getLong(0) - second_table_count.first.getLong(0)
A subtraction method is not available on DataFrame itself.
You need something like the following to do the conversion:
first_table_count.first.getLong(0)
And here is why you need it:
A DataFrame represents a tabular data structure. So although your SQL seems to return a single value, it actually returns a table containing a single row, and the row contains a single column. Hence we use the above code to extract the first column (index 0) of the first row.
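As a side note, if all you need are the row counts, you can skip the SQL layer entirely: DataFrame.count() returns a Long, so ordinary arithmetic works. A minimal sketch, reusing the parquetFile1 DataFrame from the question:
// count() returns a Long, so the subtraction is plain Long arithmetic
val totalRows: Long = parquetFile1.count()
val nullAddrRows: Long = parquetFile1.filter("LINE1_ADDR is NULL and LINE2_ADDR is NULL").count()
val difference: Long = totalRows - nullAddrRows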
Related
I am currently using the following approach to concat the columns in a dataframe:
val Finalraw = raw.withColumn("primarykey", concat($"prod_id",$"frequency",$"fee_type_code"))
But the thing is that I do not want to hardcode the column names, as the number of columns changes every time. I have a list that contains the column names:
columnNames: List[String] = List("prod_id", "frequency", "fee_type_code")
So, the question is how to pass the list elements to the concat function instead of hardcoding the column names?
The concat function takes multiple columns as input while you have a list of strings. You need to transform the list to fit the method input.
First, use map to transform the strings into column objects and then unpack the list with :_* to correctly pass the arguments to concat.
val Finalraw = raw.withColumn("primarykey", concat(columnNames.map(col):_*))
For an explanation of the :_* syntax, see What does `:_*` (colon underscore star) do in Scala?
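As a related option, if you want a delimiter between the values (often handy for composite keys), concat_ws accepts the same unpacked list. This is just a sketch assuming a "-" separator:
import org.apache.spark.sql.functions.{col, concat_ws}
// same map + :_* pattern, but with a separator between the concatenated fields
val FinalrawWithSeparator = raw.withColumn("primarykey", concat_ws("-", columnNames.map(col): _*))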
Map the list elements to a List[org.apache.spark.sql.Column] in a separate variable. Check this out:
scala> val df = Seq(("a","x-","y-","z")).toDF("id","prod_id","frequency","fee_type_code")
df: org.apache.spark.sql.DataFrame = [id: string, prod_id: string ... 2 more fields]
scala> df.show(false)
+---+-------+---------+-------------+
|id |prod_id|frequency|fee_type_code|
+---+-------+---------+-------------+
|a |x- |y- |z |
+---+-------+---------+-------------+
scala> val arr = List("prod_id", "frequency", "fee_type_code")
arr: List[String] = List(prod_id, frequency, fee_type_code)
scala> val arr_col = arr.map(col(_))
arr_col: List[org.apache.spark.sql.Column] = List(prod_id, frequency, fee_type_code)
scala> df.withColumn("primarykey",concat(arr_col:_*)).show(false)
+---+-------+---------+-------------+----------+
|id |prod_id|frequency|fee_type_code|primarykey|
+---+-------+---------+-------------+----------+
|a |x- |y- |z |x-y-z |
+---+-------+---------+-------------+----------+
I have the following question. I am working with the following CSV file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a spark dataframe as follows:
My aim is to check the length and type of each field in the dataframe following the set of rules below:
col type
job char10
marital char7
I started implementing the check of the length of each field, but I am getting a compilation error:
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
  val fields = line.toString.split(";")
  fields(0).size
  fields(1).size
})
The expected output should be:
List(10,10)
As for the type check, I don't have any idea how to implement it since we are using DataFrames. Any idea about a function for verifying the data format?
Thanks a lot in advance for your replies.
I see you are trying to use a DataFrame directly, but since there are multiple double quotes, you can read the file as a textFile, remove them, and convert it to a DataFrame as below:
import org.apache.spark.sql.functions._
import spark.implicits._
val raw = spark.read.textFile("path to file ")
.map(_.replaceAll("\"", ""))
val header = raw.first
val data = raw.filter(row => row != header)
.map { r => val x = r.split(";"); (x(0), x(1)) }
.toDF(header.split(";"): _ *)
With data.show(false) you get:
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the sizes you can use withColumn and the length function, and adjust as you need.
data.withColumn("jobSize", length($"job"))
.withColumn("martialSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job       |marital|jobSize|maritalSize|
+----------+-------+-------+-----------+
|management|married|10     |7          |
|technician|single |10     |6          |
+----------+-------+-------+-----------+
All the column types are String.
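If you also need to verify the column types and not only the lengths, one possibility (just a sketch, not the only approach) is to inspect the DataFrame schema rather than the data itself:
// dtypes returns (columnName, typeName) pairs for every column
data.dtypes.foreach { case (name, tpe) => println(s"$name -> $tpe") }
// or look up a single column in the schema
println(data.schema("job").dataType)   // StringType here, since the values were read as text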
Hope this helps!
You are using a DataFrame, so when you use the map method you are processing a Row in your lambda; that is, line is a Row.
Row.toString returns a string representation of the Row, which in your case is two StructFields typed as String.
If you want to use map and process your Row, you have to get the values inside the fields manually, with getAs[String] or getString.
Usually when you use DataFrames, you should work in column logic, as in SQL, using select, where, etc., or directly the SQL syntax.
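For example, a minimal sketch of the length check written both ways, assuming the job and marital columns are already strings:
import org.apache.spark.sql.functions.length
import spark.implicits._

// Row-based version: read the fields explicitly instead of using line.toString
val sizes = data.map(row => (row.getAs[String]("job").length, row.getAs[String]("marital").length))

// Column-based version: stay in column logic, no map needed
data.select(length($"job"), length($"marital")).show(false)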
I have an array as a broadcast variable and it contains Integers:
broadcast_array.value
Array(72159153, 72159163, 72159202, 72159203, 72159238, 72159398, 72159447, 72159448, 72159455, 72159492...
I have a column in a dataset (call it col_id) which contains IntegerType values that might or might not be in broadcast_array.
I am only trying to create a new column (call it new_col) that checks whether the col_id value of each row is in broadcast_array. If so, the new column's value should be Available; otherwise it can be null.
So I have something like:
val my_new_df = df.withColumn("new_col", when(broadcast_array.value.contains($"col_id"), "Available"))
But I keep getting this error:
Name: Unknown Error
Message: <console>:45: error: type mismatch;
found : Boolean
required: org.apache.spark.sql.Column
val my_new_df = df.withColumn("new_col", when(broadcast_array.value.contains($"col_id"), "Available"))
^
StackTrace:
What is most confusing to me is that I thought the when statement requires a conditional that outputs some Boolean, but here it's saying it requires a Column.
How should I go about adding a value to a new column based on whether the value in an existing column can be found in a predefined Array or not?
If you look at the API of the when function:
def when(condition : org.apache.spark.sql.Column, value : scala.Any) : org.apache.spark.sql.Column
It's clear that the condition required is a Column and not a Boolean.
So you can do complicated lit combinations to convert your boolean check into a Column, as below:
import org.apache.spark.sql.functions._
df.withColumn("new_col", when(lit(broadcast_array.value.mkString(",")).contains($"col_id"), lit("Available"))).show(false)
OR
You can achieve what you are trying to do by writing a simple udf function, as
import org.apache.spark.sql.functions._
val broadcastContains = udf((id: Int) => broadcast_array.value.contains(id))
and just call the function as
df.withColumn("new_col", when(broadcastContains($"col_id"), lit("Available"))).show(false)
I added a broadcastArrayContains function to spark-daria that makes Ramesh's solution more reusable / accessible.
def broadcastArrayContains[T](col: Column, broadcastedArray: Broadcast[Array[T]]) = {
  when(col.isNull, null)
    .when(lit(broadcastedArray.value.mkString(",")).contains(col), lit(true))
    .otherwise(lit(false))
}
Suppose you have the following DataFrame (df):
+----+
| num|
+----+
| 123|
| hi|
|null|
+----+
You can identify all the values in the broadcasted array as follows:
val specialNumbers = spark.sparkContext.broadcast(Array("123", "456"))
df.withColumn(
  "is_special_number",
  functions.broadcastArrayContains[String](col("num"), specialNumbers)
)
I have two dataframes as follows, each with only one row and one column. They hold two different numeric values.
How do I perform division or another arithmetic operation on those two dataframe values?
Please help.
First, if these DataFrames contain a single record each, any further use of Spark would likely be wasteful (Spark is intended for large data sets; small ones are processed faster locally). So you can simply collect these one-record values using first() and go on from there:
import spark.implicits._
val df1 = Seq(2.0).toDF("col1")
val df2 = Seq(3.5).toDF("col2")
val v1: Double = df1.first().getAs[Double](0)
val v2: Double = df2.first().getAs[Double](0)
val sum = v1 + v2
If, for some reason, you do want to use DataFrames all the way, you can use crossJoin to join the records together and then apply any arithmetic operation:
import spark.implicits._
val df1 = Seq(2.0).toDF("col1")
val df2 = Seq(3.5).toDF("col2")
df1.crossJoin(df2)
.select($"col1" + $"col2" as "sum")
.show()
// +---+
// |sum|
// +---+
// |5.5|
// +---+
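Since the question asks about division specifically, the same two patterns carry over directly; for example, reusing v1, v2, df1 and df2 from above:
// plain Scala division on the collected values
val ratio = v1 / v2   // 0.5714285714285714
// or entirely within DataFrames
df1.crossJoin(df2).select($"col1" / $"col2" as "ratio").show()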
If you have dataframes such as:
scala> df1.show(false)
+------+
|value1|
+------+
|2 |
+------+
scala> df2.show(false)
+------+
|value2|
+------+
|2 |
+------+
You can get the value by doing the following:
scala> df1.take(1)(0)(0)
res3: Any = 2
But the data type is Any, so type casting is needed before we do arithmetic operations, as follows:
scala> df1.take(1)(0)(0).asInstanceOf[Int]*df2.take(1)(0)(0).asInstanceOf[Int]
res8: Int = 4
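A slightly safer variant (a sketch, assuming the columns really hold integers) is to use the typed getters on Row, which avoids the asInstanceOf casts:
// getAs[Int] looks the field up by name and returns a typed value
val product = df1.first().getAs[Int]("value1") * df2.first().getAs[Int]("value2")
// product: Int = 4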
I have two values of different types, as shown below, in Spark SQL:
scala> val ageSum = df.agg(sum("age"))
ageSum: org.apache.spark.sql.DataFrame = [sum(age): bigint]
scala> val totalEntries = df.count();
scala> totalEntries
res37: Long = 45211
The first value comes from an aggregate function on the data frame and the second comes from the count function on the data frame. They have different types: ageSum is a DataFrame with a bigint column and totalEntries is a Long. I want to perform a mathematical operation on them: Mean = ageSum/totalEntries
scala> val mean = ageSum/totalEntries
<console>:31: error: value / is not a member of org.apache.spark.sql.DataFrame val mean = ageSum/totalEntries
I also tried to convert ageSum to a Long, but was not able to do so:
scala> val ageSum = ageSum.longValue
<console>:29: error: recursive value ageSum needs type
val ageSum = ageSum.longValues
ageSum is a DataFrame, so you need to extract the value from it. One option is to use first() to get the value as a Row and then extract the value from the row:
ageSum.first().getAs[Long](0)/totalEntries
// res6: Long = 2
If you need a more exact value, you can use toDouble to convert before division:
ageSum.first().getAs[Long](0).toDouble/totalEntries
// res9: Double = 2.5
Or you can make the result another column of your ageSum:
ageSum.withColumn("mean", $"sum(age)"/totalEntries).show
+--------+----+
|sum(age)|mean|
+--------+----+
| 10| 2.5|
+--------+----+
Here df in the example above was created as:
val df = Seq(1,2,3,4).toDF("age")
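For completeness, the mean can also be computed in a single aggregation with avg, which sidesteps mixing a DataFrame and a Long altogether; a minimal sketch on the same df:
import org.apache.spark.sql.functions.avg
// avg returns a double column, so no separate count or cast is needed
val meanAge = df.agg(avg("age")).first().getAs[Double](0)
// meanAge: Double = 2.5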