Apache Spark SQL identifier expected exception - scala

My question is quite similar to this one: Apache Spark SQL issue: java.lang.RuntimeException: [1.517] failure: identifier expected, but I just can't figure out where my problem lies. I am using SQLite as the database backend. Connecting and simple select statements work fine.
The offending line:
val df = tableData.selectExpr(tablesMap(t).toSeq:_*).map(r => myMapFunc(r))
tablesMap contains the table name as key and an array of strings as expressions. Printed, the array looks like this:
WrappedArray([My Col A], [ColB] || [Col C] AS ColB)
The table name is also included in square brackets since it contains spaces. The exception I get:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: identifier expected
I already made sure not to use any Spark SQL keywords. In my opinion there are two possible reasons why this code fails: 1) I somehow handle spaces in column names wrong. 2) I handle the concatenation wrong.
I am using a resource file, CSV-like, which contains the expressions I want to be evaluated on my tables. Apart from this file, I want to allow the user to specify additional tables and their respective column expressions at runtime. The file looks like this:
TableName,`Col A`,`ColB`,CONCAT(`ColB`, ' ', `Col C`)
Apparently this does not work. Nevertheless I would like to reuse this file, modified of course. My idea is to map the columns with the expressions from an array of strings, like now, to a sequence of Spark columns. (This is the only solution I could think of, since I want to avoid pulling in all the Hive dependencies just for this one feature.) I would introduce a small syntax for my expressions to mark raw column names with a $ and some keywords for functions like concat and as. But how could I do this? I tried something like the following, but it's far from even compiling.
def columnsMapFunc(expr: String): Column = {
  if (expr(0) == '$')
    return expr.drop(1)
  else
    return concat(extractedColumnNames).as(newName)
}

Generally speaking, using names containing whitespace is asking for trouble, but replacing the square brackets with backticks should solve the problem:
val df = sc.parallelize(Seq((1,"A"), (2, "B"))).toDF("f o o", "b a r")
df.registerTempTable("foo bar")
df.selectExpr("`f o o`").show
// +-----+
// |f o o|
// +-----+
// |    1|
// |    2|
// +-----+
sqlContext.sql("SELECT `b a r` FROM `foo bar`").show
// +-----+
// |b a r|
// +-----+
// |    A|
// |    B|
// +-----+
For concatenation you have to use the concat function:
df.selectExpr("""concat(`f o o`, " ", `b a r`)""").show
// +----------------------+
// |'concat(f o o, ,b a r)|
// +----------------------+
// |                   1 A|
// |                   2 B|
// +----------------------+
but it requires HiveContext in Spark 1.4.0.
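For example, in Spark 1.4 you could create one like this (a minimal sketch, assuming a spark-shell session with an existing SparkContext sc and the spark-hive dependency on the classpath):
import org.apache.spark.sql.hive.HiveContext

// Spark 1.4.x: a HiveContext is needed for concat in selectExpr / SQL
val sqlContext = new HiveContext(sc)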
In practice I would simply rename the columns after loading the data:
df.toDF("foo", "bar")
// org.apache.spark.sql.DataFrame = [foo: int, bar: string]
and use functions instead of expression strings (the concat function is available only in Spark >= 1.5.0; for 1.4 and earlier you'll need a UDF):
import org.apache.spark.sql.functions.{concat, lit}
df.select($"f o o", concat($"f o o", lit(" "), $"b a r")).show
// +----------------------+
// |'concat(f o o, ,b a r)|
// +----------------------+
// |                   1 A|
// |                   2 B|
// +----------------------+
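For Spark 1.4 and earlier, such a UDF could look roughly like this (a sketch; concatWithSpace and the explicit cast are illustrative, not part of the original answer):
import org.apache.spark.sql.functions.udf

// Illustrative concat-with-space UDF for Spark < 1.5
val concatWithSpace = udf((a: String, b: String) => a + " " + b)
df.select($"f o o", concatWithSpace($"f o o".cast("string"), $"b a r")).show()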
There is also the concat_ws function, which takes the separator as its first argument:
df.selectExpr("""concat_ws(" ", `f o o`, `b a r`)""")
df.select($"f o o", concat_ws(" ", $"f o o", $"b a r"))

Related

Renaming columns in a PySpark DataFrame with a performant select operation

There are other threads on how to rename columns in a PySpark DataFrame, see here, here and here. I don't think the existing solutions are sufficiently performant or generic (I have a solution that should be better, but I'm stuck on an edge-case bug). Let's start by reviewing the issues with the current solutions:
Calling withColumnRenamed repeatedly will probably have the same performance problems as calling withColumn a lot, as outlined in this blog post. See Option 2 in this answer.
The toDF approach relies on schema inference and does not necessarily retain the nullable property of columns (toDF should be avoided in production code). I'm guessing this approach is slow as well.
This approach is close, but it's not generic enough and would be way too much manual work for a lot of columns (e.g. if you're trying to convert 2,000 column names to snake_case).
I created a function that's generic and works for all column types, except for column names that include dots:
import pyspark.sql.functions as F

def with_columns_renamed(fun):
    def _(df):
        cols = list(map(
            lambda col_name: F.col(col_name).alias(fun(col_name)),
            df.columns
        ))
        return df.select(*cols)
    return _
Suppose you have the following DataFrame:
+-------------+-----------+
|i like cheese|yummy stuff|
+-------------+-----------+
|         jose|          a|
|           li|          b|
|          sam|          c|
+-------------+-----------+
Here's how to replace all the whitespaces in the column names with underscores:
def spaces_to_underscores(s):
    return s.replace(" ", "_")

df.transform(with_columns_renamed(spaces_to_underscores)).show()
+-------------+-----------+
|i_like_cheese|yummy_stuff|
+-------------+-----------+
|         jose|          a|
|           li|          b|
|          sam|          c|
+-------------+-----------+
The solution works perfectly, except for when the column name contains dots.
Suppose you have this DataFrame:
+-------------+-----------+
|i.like.cheese|yummy.stuff|
+-------------+-----------+
|         jose|          a|
|           li|          b|
|          sam|          c|
+-------------+-----------+
This code will error out:
def dots_to_underscores(s):
    return s.replace(".", "_")

df.transform(quinn.with_columns_renamed(dots_to_underscores))
Here's the error message: pyspark.sql.utils.AnalysisException: "cannot resolve 'i.like.cheese' given input columns: [i.like.cheese, yummy.stuff];;\n'Project ['i.like.cheese AS i_like_cheese#242, 'yummy.stuff AS yummy_stuff#243]\n+- LogicalRDD [i.like.cheese#231, yummy.stuff#232], false\n"
How can I modify this solution to work for column names that have dots? I'm also assuming that the Catalyst optimizer will have the same optimization problems for multiple withColumnRenamed calls as it does for multiple withColumn calls. Let me know if Catalyst handles multiple withColumnRenamed calls better for some reason.
You could do something simple like this:
import pyspark.sql.functions as F

def with_columns_renamed(fun):
    def _(df):
        cols = list(map(
            lambda col_name: F.col('`' + col_name + '`').alias(fun(col_name)),
            df.columns
        ))
        return df.select(*cols)
    return _
I've read the other answers and can't understand why this isn't one of them; feel free to point out if I'm missing something! It's nothing new, but it's concise and performs well:
def with_columns_renamed(func):
    def _(df):
        return df.selectExpr(*['`{}` AS `{}`'.format(c, func(c)) for c in df.columns])
    return _
Try escaping using backticks:
import pyspark.sql.functions as F

def with_columns_renamed(fun):
    def _(df):
        cols = list(map(
            lambda col_name: F.col("`{0}`".format(col_name)).alias(fun(col_name)),
            df.columns
        ))
        return df.select(*cols)
    return _
Or use withColumnRenamed with reduce.
from functools import reduce

reduce(lambda new_df, col: new_df.withColumnRenamed(col, col.replace('.', '_')), df.columns, df)

Spark dataframe - replace all values from key/value list in Scala

I've found some similar solutions, but none that accomplish exactly what I want to do. I have a set of key/value pairs that I want to use for string substitution. e.g.
val replacements = Map( "STREET" -> "ST", "STR" -> "ST")
I am reading a table into a dataframe, and I would like to modify a column to replace all instances of the keys in my map with their values. So with the above map: look at the "street" column and replace all values of "STREET" with "ST", all values of "STR" with "ST", etc.
I've been looking at some foldLeft implementations, but haven't been able to finagle it into working.
A basic solution would be great, but an optimal solution would be something I could plug into a Column function that someone wrote that I was hoping to update. Specifically a line like this:
val CleanIt: Column = trim(regexp_replace(regexp_replace(regexp_replace(colName," OF "," ")," AT "," ")," AND "," "))
You can create this helper method that transforms a given column and a map of replacements into a new Column expression:
def withReplacements(column: Column, replacements: Map[String, String]): Column =
  replacements.foldLeft[Column](column) {
    case (col, (from, to)) => regexp_replace(col, from, to)
  }
Then use it on your street column with your replacements map:
val result = df.withColumn("street", withReplacements($"street", replacements))
For example:
df.show()
// +------------+------+
// | street|number|
// +------------+------+
// | Main STREET|     1|
// |Broadway STR|     2|
// |     1st Ave|     3|
// +------------+------+
result.show()
// +-----------+------+
// | street|number|
// +-----------+------+
// |    Main ST|     1|
// |Broadway ST|     2|
// |    1st Ave|     3|
// +-----------+------+
NOTE: the keys in the map must be valid regular expressions. That means, for example, that if you want to replace the string "St." with "ST", you should use Map("St\\." -> "ST") (escaping the dot, which would otherwise be interpreted as the regex "any character").
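To plug this into the CleanIt-style expression from the question, one option is to fold the OF/AT/AND removals into the same map (a sketch; colName is assumed to be a Column, as in the question):
// Sketch: the question's CleanIt expression rewritten on top of withReplacements
val allReplacements = replacements ++ Map(" OF " -> " ", " AT " -> " ", " AND " -> " ")
val CleanIt: Column = trim(withReplacements(colName, allReplacements))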

How to evaluate expressions that are the column values?

I have a big dataframe with millions of rows as follows:
A   B   C   Eqn
12  3   4   A+B
32  8   9   B*C
56  12  2   A+B*C
How to evaluate the expressions in the Eqn column?
You could create a custom UDF that evaluates these arithmetic expressions:
def evalUDF = udf((a: Int, b: Int, c: Int, eqn: String) => {
  val eqnParts = eqn
    .replace("A", a.toString)
    .replace("B", b.toString)
    .replace("C", c.toString)
    .split("""\b""")
    .toList
  val (sum, _) = eqnParts.tail.foldLeft((eqnParts.head.toInt, "")) {
    case ((runningTotal, "+"), num) => (runningTotal + num.toInt, "")
    case ((runningTotal, "-"), num) => (runningTotal - num.toInt, "")
    case ((runningTotal, "*"), num) => (runningTotal * num.toInt, "")
    case ((runningTotal, _), op) => (runningTotal, op)
  }
  sum
})

evalDf
  .withColumn("eval", evalUDF('A, 'B, 'C, 'Eqn))
  .show()
Output:
+---+---+---+-----+----+
| A| B| C| Eqn|eval|
+---+---+---+-----+----+
| 12|  3|  4|  A+B|  15|
| 32|  8|  9|  B*C|  72|
| 56| 12|  2|A+B*C| 136|
+---+---+---+-----+----+
As you can see this works, but it is very fragile (spaces, unknown operators, etc. will break the code) and doesn't adhere to order of operations (otherwise the last result should have been 80).
So you could write all that yourself, or perhaps find some library that already does it (like https://gist.github.com/daixque/1610753)?
Maybe the performance overhead will be very large (especially if you start using recursive parsers), but at least you can perform it on a DataFrame instead of collecting it first.
I think the only way to execute SQLs that are inside a DataFrame is to select("Eqn").collect them first, followed by executing the SQLs iteratively on the source Dataset.
Since the SQLs are in a DataFrame, which is nothing more than a description of a distributed computation that will be executed on Spark executors, there is no way you could submit Spark jobs while processing the SQLs on executors. It is simply too late in the execution pipeline. You have to be back on the driver to be able to submit new Spark jobs, say to execute the SQLs.
With the SQLs on the driver, you'd then take the corresponding rows per SQL and simply use withColumn to execute the SQLs against their rows.
It's easier to describe than to develop into a working Spark application, but that's how I'd go about it.
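A rough sketch of this idea, assuming Spark 2.x with spark.implicits._ in scope and the same evalDf with columns A, B, C and Eqn as in the question (the helper names are illustrative, not part of the original answer):
import org.apache.spark.sql.functions.{expr, lit, when}

// Collect the distinct equations to the driver...
val eqns = evalDf.select("Eqn").distinct.as[String].collect()

// ...then build one column that evaluates each equation as a SQL expression
// for exactly the rows that carry it.
val evalCol = eqns.foldLeft(lit(null).cast("int")) { (acc, e) =>
  when($"Eqn" === e, expr(e)).otherwise(acc)
}

val result = evalDf.withColumn("eval", evalCol)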
I am late, but in case someone is looking for
a generic math expression interpreter using variables, or
complex/unknown expressions that cannot be hardcoded into a UDF (as in the accepted answer),
then you can use javax.script.ScriptEngineManager:
import javax.script.SimpleBindings
import javax.script.ScriptEngineManager
import java.util.Map
import java.util.HashMap

def calculateFunction = (mathExpression: String, A: Double, B: Double, C: Double) => {
  val vars: Map[String, Object] = new HashMap[String, Object]()
  vars.put("A", A.asInstanceOf[Object])
  vars.put("B", B.asInstanceOf[Object])
  vars.put("C", C.asInstanceOf[Object])
  val engine = new ScriptEngineManager().getEngineByExtension("js")
  val result = engine.eval(mathExpression, new SimpleBindings(vars))
  result.asInstanceOf[Double]
}
val calculateUDF = spark.udf.register("calculateFunction",calculateFunction)
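It could then be applied to the question's DataFrame roughly like this (a sketch; the explicit casts are an assumption to match the UDF's Double parameters):
evalDf
  .withColumn("eval", calculateUDF('Eqn, 'A.cast("double"), 'B.cast("double"), 'C.cast("double")))
  .show()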
NOTE: this will handle generic expressions and is robust, but it gives much worse performance than the accepted answer and is heavy on memory.

Spark (scala) dataframes - Check whether strings in column contain any items from a set

I'm pretty new to Scala and Spark and I've been trying to find a solution for this issue all day - it's doing my head in. I've tried 20 different variations of the following code and keep getting type mismatch errors when I try to perform calculations on a column.
I have a spark dataframe, and I wish to check whether each string in a particular column contains any number of words from a pre-defined List (or Set) of words.
Here is some example data for replication:
// sample data frame
val df = Seq(
  (1, "foo"),
  (2, "barrio"),
  (3, "gitten"),
  (4, "baa")).toDF("id", "words")

// dictionary Set of words to check
val dict = Set("foo", "bar", "baaad")
Now, I am trying to create a third column with the results of a comparison to see if the strings in the $"words" column contain any of the words in the dict Set. So the result should be:
+---+-----------+-------------+
| id|      words|   word_check|
+---+-----------+-------------+
|  1|        foo|         true|
|  2|     barrio|         true|
|  3|     gitten|        false|
|  4|        baa|        false|
+---+-----------+-------------+
First, I tried to see if I could do it natively without using UDFs, since the dict Set will actually be a large dictionary of > 40K words, and as I understand it this would be more efficient than a UDF:
df.withColumn("word_check", dict.exists(d => $"words".contains(d)))
But I get the error:
type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
I have also tried to create a UDF to do this (using both mutable.Set and mutable.WrappedArray to describe the Set - not sure which is correct, but neither works):
val checker: ((String, scala.collection.mutable.Set[String]) => Boolean) = (col: String, array: scala.collection.mutable.Set[String] ) => array.exists(d => col.contains(d))
val udf1 = udf(checker)
df.withColumn("word_check", udf1($"words", dict )).show()
But get another type mismatch:
found : scala.collection.immutable.Set[String]
required: org.apache.spark.sql.Column
If the set were a fixed number, I should be able to use lit(Int) in the expression? But I don't really understand how performing more complex functions on a column by mixing different data types works in Scala.
Any help greatly appreciated, especially if it can be done efficiently (it is a large df of > 5m rows).
Regardless of efficiency, this seems to work:
df.withColumn("word_check", dict.foldLeft(lit(false))((a, b) => a || locate(b, $"words") > 0)).show
+---+------+----------+
| id| words|word_check|
+---+------+----------+
|  1|   foo|      true|
|  2|barrio|      true|
|  3|gitten|     false|
|  4|   baa|     false|
+---+------+----------+
Here's how you'd do it with a UDF:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The mistake in your implementation is that you've created a UDF expecting two arguments, which means you'd have to pass two Columns when applying it - but dict isn't a Column in your DataFrame but rather a local variable.
If your dict is large, you should not just reference it in your UDF, because the entire dict is sent over the network for every task. I would broadcast your dict in combination with a UDF:
import org.apache.spark.broadcast.Broadcast

def udf_check(words: Broadcast[scala.collection.immutable.Set[String]]) = {
  udf { (s: String) => words.value.exists(s.contains(_)) }
}

df.withColumn("word_check", udf_check(sparkContext.broadcast(dict))($"words"))
Alternatively, you could also use a join:
val dict_df = dict.toList.toDF("word")

df
  .join(broadcast(dict_df), $"words".contains($"word"), "left")
  .withColumn("word_check", $"word".isNotNull)
  .drop($"word")

Reference column by id in Spark Dataframe

I have multiple duplicate columns (due to joins). If I try to call them by alias, I get an ambiguous reference error:
Reference 'customers_id' is ambiguous, could be: customers_id#13, customers_id#85, customers_id#130
Is there a way to reference a column in a Scala Spark Dataframe by its order in the Dataframe or by numeric ID, not by an alias? The sanitized names suggest that columns do have an id assigned (13, 85, 130 in the example above).
LATER EDIT:
I found out that I can reference a specific column via the original dataframe it came from. But while I can use OriginalDataframe.customer_id in the select function, the withColumnRenamed function only accepts a string alias, so I cannot rename the duplicate column in the final dataframe.
So, I guess the end question is:
Is there a way to reference a column that has a duplicate alias, that works with all functions that require a string alias as argument?
LATER EDIT 2:
Renaming seemed to have worked via adding a new column and dropping one of the current ones:
joined_dataframe = joined_dataframe.withColumn("renamed_customers_id", original_dataframe("customers_id")).drop(original_dataframe("customers_id"))
But, I'd like to keep my question open:
Is there a way to reference a column that has a duplicate alias (so, using something other than alias) in a way that all functions which expect a string alias accept it?
One way to get out of such a situation would be to create a new Dataframe from the old one's rdd, but with a new schema in which you can name each column as you'd like. This, of course, requires you to explicitly describe the entire schema, including the type of each column. As long as the new schema you provide matches the number of columns and the column types of the old Dataframe, this should work.
For example, starting with a Dataframe with two columns named type, we can rename them type1 and type2:
df.show()
// +---+----+----+
// | id|type|type|
// +---+----+----+
// |  1| AAA| aaa|
// |  1| BBB| bbb|
// +---+----+----+
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val newDF = sqlContext.createDataFrame(df.rdd, new StructType()
  .add("id", IntegerType)
  .add("type1", StringType)
  .add("type2", StringType)
)
newDF.show()
// +---+-----+-----+
// | id|type1|type2|
// +---+-----+-----+
// |  1|  AAA|  aaa|
// |  1|  BBB|  bbb|
// +---+-----+-----+
The main problem is the join; I use Python.
h1.createOrReplaceTempView("h1")
h2.createOrReplaceTempView("h2")
h3.createOrReplaceTempView("h3")
joined1 = h1.join(h2, (h1.A == h2.A) & (h1.B == h2.B) & (h1.C == h2.C), 'inner')
Result dataframe columns:
A B Column1 Column2 A B Column3 ...
I don't like this, but the join has to be implemented like this:
joined1 = h1.join(h2, [*argv], 'inner')
We assume argv = ["A", "B", "C"]
Result columns:
A B column1 column2 column3 ...
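For reference, joining on a sequence of column names works the same way in Scala and likewise avoids the duplicate columns (a sketch, assuming two DataFrames h1 and h2 that share columns A, B and C):
// Joining on column names keeps a single copy of A, B and C in the result
val joined1 = h1.join(h2, Seq("A", "B", "C"), "inner")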