I am working with a dataframe created from JSON, and I want to apply a filter condition to it.
val jsonStr = """{ "metadata": [{ "key": 84896, "value": 54 },{ "key": 1234, "value": 12 }]}"""
val rdd = sc.parallelize(Seq(jsonStr))
val df = sqlContext.read.json(rdd)
The schema of df is:
root
|-- metadata: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: long (nullable = true)
| | |-- value: long (nullable = true)
Now I need to filter the dataframe, which I am trying to do as
val df1 = df.where("key == 84896")
which throws the error
ERROR Executor - Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.sql.AnalysisException: cannot resolve '`key`' given input columns: [metadata]; line 1 pos 0;
'Filter ('key = 84896)
The reason I want to use the where clause is that I have an expression string which I want to use directly, e.g.
( (key == 999, value == 55) || (key == 1234, value == 12) )
First you should use explode to get an easier-to-work-with DataFrame. Then you can select both the key and value of your given input:
val explodedDF = df.withColumn("metadata", explode($"metadata"))
.select("metadata.key", "metadata.value")
Output:
+-----+-----+
| key|value|
+-----+-----+
|84896| 54|
| 1234| 12|
+-----+-----+
This way you'll be able to perform your filtering logic as usual:
scala> explodedDF.where("key == 84896").show
+-----+-----+
| key|value|
+-----+-----+
|84896| 54|
+-----+-----+
You can combine your filtering requirements; some examples below:
explodedDF.where("key == 84896 AND value == 54")
explodedDF.where("(key == 84896 AND value == 54) OR key = 1234")
From what I have understood from your question and comment, you are trying to apply the expression ( (key == 999, value == 55) || (key == 1234, value == 12) ) to filter the dataframe rows.
First of all, the expression needs changes, as it cannot be applied to a dataframe in Spark as it stands, so you need to transform it first:
val expression = """( (key == 999, value == 55) || (key == 1234, value == 12) )"""
val actualExpression = expression.replace(",", " and").replace("||", "or")
which should give you a new, valid expression:
( (key == 999 and value == 55) or (key == 1234 and value == 12) )
Now that you have a valid expression, your dataframe needs modification too, as you can't run such an expression against a column whose schema is an array of structs.
So you need the explode function to spread the array elements across rows, and then the .* notation to select all the fields of the struct into separate columns.
val df1 = df.withColumn("metadata", explode($"metadata"))
.select($"metadata.*")
which should give you a dataframe like
+-----+-----+
|key |value|
+-----+-----+
|84896|54 |
|1234 |12 |
+-----+-----+
And finally, use the valid expression on the generated dataframe:
df1.where(s"${actualExpression}")
I hope the answer is helpful
Related
In Spark (Scala), after the application jar is submitted to Spark, is it possible for the jar to fetch many strings from a database table, convert each string to a Catalyst Expression, turn that expression into a UDF, use the UDF to filter rows in another DataFrame, and finally union the results of each UDF?
(The expression needs some or all columns of the DataFrame, but which columns are needed is unknown at the time the jar's code is written; the schema of the DataFrame is known at development time.)
An example:
expression 1: "id == 1"
expression 2: "name == \"andy\""
DataFrame:
row 1: id = 1, name = "red", age = null
row 2: id = 2, name = "andy", age = 20
row 3: id = 3, name = "juliet", age = 21
the final result should be the first two rows
Note: it is not acceptable to first concatenate the two expressions with an or, because I need to track which expression produced each result row.
Edited: filter for each argument and union them all.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
val df = spark.read.option("header","true").option("inferSchema","true").csv("test1.csv")
val args = Array("id == 1", "name == \"andy\"")
val filters = args.zipWithIndex
var dfs = Array[DataFrame]()
filters.foreach {
case (filter, index) =>
val tempDf = df.filter(filter).withColumn("index", lit(index))
dfs = dfs :+ tempDf
}
val resultDF = dfs.reduce(_ unionAll _)
resultDF.show(false)
+---+----+----+-----+
|id |name|age |index|
+---+----+----+-----+
|1 |red |null|0 |
|2 |andy|20 |1 |
+---+----+----+-----+
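If you prefer to avoid the mutable Array, the same result can be built functionally; this is a sketch assuming Spark 2.x, where union replaces the deprecated unionAll:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// build one filtered DataFrame per expression, tag it with that expression's index,
// then union everything into a single result
val resultDF = args.zipWithIndex
  .map { case (filter, index) => df.filter(filter).withColumn("index", lit(index)) }
  .reduce(_ union _)

resultDF.show(false)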
Original: why not just pass the string to the filter?
val df = spark.read.option("header","true").option("inferSchema","true").csv("test.csv")
val condition = "id == 1 or name == \"andy\""
df.filter(condition).show(false)
+---+----+----+
|id |name|age |
+---+----+----+
|1 |red |null|
|2 |andy|20 |
+---+----+----+
Is there something I have missed?
I am looking to convert null values nested in an array of strings to empty strings in Spark. The data is in a dataframe. I plan on running a reduce function after making the dataframe null-safe; not sure if that helps in answering the question. I am using Spark 1.6.
Schema:
root
|-- carLineName: array (nullable = true)
| |-- element: string (containsNull = true)
Example input:
+--------------------+
|carLineName |
+--------------------+
|[null,null,null] |
|[null, null] |
|[Mustang, null] |
|[Pilot, Jeep]       |
+--------------------+
Desired output:
+--------------------+
|carLineName |
+--------------------+
|[,,] |
|[,] |
|[Mustang,] |
|[Pilot, Jeep]       |
+--------------------+
My attempt:
val safeString: Seq[String] => Seq[String] = s => if (s == null) "" else s
val udfSafeString = udf(safeString)
The input to the UDF is a sequence of strings, not a single string, so you need to map over its elements rather than null-checking the whole sequence. You can do this as follows:
val udfSafeString = udf((arr: Seq[String]) => {
arr.map(s => if (s == null) "" else s)
})
df.withColumn("carLineName", udfSafeString($"carLineName"))
Can I use the following code:
df.withColumn("id", df["id"].cast("integer")).na.drop(subset=["id"])
If id is not a valid integer, it will be NULL and dropped in the subsequent step.
Without changing the type
df = sqlContext.read.text("sample.txt")
df.select(
    df.value.substr(1, 2).alias('id'),
    df.value.substr(3, 13).alias('name'),
    df.value.substr(16, 8).alias('date'),
    df.value.substr(24, 3).alias('Yes/No')
).show()
valid = df.where(df["id"].cast("integer").isNotNull())
invalid = df.where(df["id"].cast("integer").isNull())
Here my df.printSchema() prints
root
|-- value: string (nullable = true)
+---+--------------+--------+------+
| id|          name|    date|Yes/No|
+---+--------------+--------+------+
| 01|abcdefghijklkm|010V2201|   9Ye|
| ab|abcdefghijklmm|010V2201|   9Ye|
+---+--------------+--------+------+
This is a sample output.
Expected result
Rows whose integer column contains null or invalid values should be removed. Can I use df.withColumn for this? If so, how?
How can I count the occurrences of a String in a df Column using Spark partitioned by id?
e.g. Find the value "test" in column "name" of a df
In SQL it would be:
SELECT
SUM(CASE WHEN name = 'test' THEN 1 else 0 END) over window AS cnt_test
FROM
mytable
WINDOW window AS (PARTITION BY id)
I've tried using map( v => match { case "test" -> 1.. })
and things like:
def getCount(df: DataFrame): DataFrame = {
val dfCnt = df.agg(
.withColumn("cnt_test",
count(col("name")==lit('test'))
)
Is this an expensive operation? What could be the best approach to check for occurrences of a specific string and then perform an action (sum, max, min, etc)?
thanks
You can use groupBy + agg in Spark. Here when($"name" === "test", 1) transforms the name column to 1 if name == 'test' and to null otherwise, and count counts the non-null values:
df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test"))
Example:
val df = Seq(("a", "joe"), ("b", "test"), ("b", "john")).toDF("id", "name")
df.groupBy("id").agg(count(when($"name" === "test", 1)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
| b| 1|
| a| 0|
+---+--------+
Or, closer to your SQL query:
df.groupBy("id").agg(sum(when($"name" === "test", 1).otherwise(0)).as("cnt_test")).show
+---+--------+
| id|cnt_test|
+---+--------+
| b| 1|
| a| 0|
+---+--------+
If you want to translate your SQL directly, you can also use window functions in Spark:
def getCount(df: DataFrame): DataFrame = {
import org.apache.spark.sql.expressions.Window
df.withColumn("cnt_test",
sum(when($"name" === "test", 1).otherwise(0)).over(Window.partitionBy($"id"))
)
}
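Unlike the groupBy version, the window version keeps one output row per input row; with the same sample DataFrame as above you would get output along the lines of:
getCount(df).show
+---+----+--------+
| id|name|cnt_test|
+---+----+--------+
|  a| joe|       0|
|  b|test|       1|
|  b|john|       1|
+---+----+--------+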
I have a column in a Hive table:
Column Name: Filters
Data Type:
|-- filters: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: string (nullable = true)
I want to get the value from this column by its corresponding name.
What I did so far:
val sdf: DataFrame = sqlContext.sql("select * from <tablename> where id='12345'")
val sdfFilters = sdf.select("filters").rdd.map(r => r(0).asInstanceOf[Seq[(String,String)]]).collect()
Output: sdfFilters: Array[Seq[(String, String)]] = Array(WrappedArray([filter_RISKFACTOR,OIS.SPD.*], [filter_AGGCODE,IR]), WrappedArray([filter_AGGCODE,IR_]))
Note: Casting to Seq because WrappedArray to Map conversion is not possible.
What to do next?
I want to get the value from this column by its corresponding name.
If you want a simple and reliable way to get all values by name, you can flatten your table using explode and filter:
case class Data(name: String, value: String)
case class Filters(filters: Array[Data])
val df = sqlContext.createDataFrame(Seq(Filters(Array(Data("a", "b"), Data("a", "c"))), Filters(Array(Data("b", "c")))))
df.show()
+--------------+
| filters|
+--------------+
|[[a,b], [a,c]]|
| [[b,c]]|
+--------------+
df.withColumn("filter", explode($"filters"))
.select($"filter.name" as "name", $"filter.value" as "value")
.where($"name" === "a")
.show()
+----+-----+
|name|value|
+----+-----+
| a| b|
| a| c|
+----+-----+
You can also collect your data any way you want:
val flatDf = df.withColumn("filter", explode($"filters")).select($"filter.name" as "name", $"filter.value" as "value")
flatDf.rdd.map(r => Array(r(0), r(1))).collect()
res0: Array[Array[Any]] = Array(Array(a, b), Array(a, c), Array(b, c))
flatDf.rdd.map(r => r(0) -> r(1)).groupByKey().collect() //not the best idea, if you have many values per key
res1: Array[(Any, Iterable[Any])] = Array((b,CompactBuffer(c)), (a,CompactBuffer(b, c)))
If you want to cast array[struct] to map[string, string] for later saving to some storage, that's a different story, and that case is better solved by a UDF. In any case, avoid collect() as long as possible to keep your code scalable.
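For the array[struct] to map[string, string] case, a minimal UDF sketch (assuming the name/value field names from the schema above) could look like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// each struct reaches the UDF as a Row; fold the whole array into a Map keyed by name
val toMap = udf((filters: Seq[Row]) =>
  filters.map(r => r.getAs[String]("name") -> r.getAs[String]("value")).toMap)

val withMap = df.withColumn("filtersMap", toMap($"filters"))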