There is a table in which field "A" contains a SQL query. I need to add an additional field "B" that would contain the time spent executing the query from field "A". I wrote a UDF and everything works well, but when caching the resulting table or trying to write the final dataframe to a physical table, I get the error:
"Failed to execute user defined function ($anonfun$1: (string) =>
string)"
What could be the problem?
Example:
val set_time = udf((query: String) => {
val start = new Timestamp(new Date().getDate)
val count = spark.sql(s"${query}").count
val time_query = (new Timestamp(new Date().getTime)).getTime() - start.getTime()
time_query.toString
})
Source table "source":
+--------------------+
| A |
+--------------------+
|"Select * From ..." |
|"Select * From ..." |
|"Select * From ..." |
|"Select * From ..." |
|"Select * From ..." |
+--------------------+
val result = spark.sql("from source").
withColumn("B", set_time(col("A")))
result.show
+--------------------+------+
| A | B |
+--------------------+------+
|"Select * From ..." | 356 |
|"Select * From ..." | 642 |
|"Select * From ..." | 2745 |
|"Select * From ..." | 1324 |
|"Select * From ..." | 635 |
+--------------------+------+
But:
//ERROR
result.write.mode("overwrite").saveAsTable("dbName.result")
//ERROR
val result_cache = result.persist
result_cache.show
The issue here is that the UDF runs on the executors, where the Spark session isn't available. So I guess you get a NullPointerException on the "val count = spark.sql..." line.
You should do it on the driver, using a plain Scala function (a Function1) rather than a UDF. By using collect() I'm assuming the main table isn't big and will fit into driver memory:
Example:
import java.sql.Timestamp
import java.util.Date
val set_time = (query: String) => {
val start = new Timestamp(new Date().getTime)
val count = spark.sql(s"${query}").count
val time_query = (new Timestamp(new Date().getTime)).getTime() - start.getTime()
time_query.toString
}
val result = spark.sql("select 'select 1' as A union all select 'select 2' as A")
val s = result.collect()
  .map(x => (x(0).asInstanceOf[String], set_time(x(0).asInstanceOf[String])))
  .toList.toDF("A", "B")
s.show
s.cache().show
+--------+---+
| A| B|
+--------+---+
|select 1|171|
|select 2|135|
+--------+---+
PS: also val start = new Timestamp(new Date().getDate) in your example should be val start = new Timestamp(new Date().getTime)
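For illustration (my own snippet, not from the question): Date.getDate returns the day of the month, while Date.getTime returns milliseconds since the epoch, which is what the elapsed-time arithmetic needs.
import java.util.Date

val d = new Date()
d.getDate // day of the month (a deprecated accessor), e.g. 23
d.getTime // milliseconds since the epoch, suitable for computing durations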
Related
So I have a data set like
{"customer":"customer-1","attributes":{"att-a":"att-a-7","att-b":"att-b-3","att-c":"att-c-10","att-d":"att-d-10","att-e":"att-e-15","att-f":"att-f-11","att-g":"att-g-2","att-h":"att-h-7","att-i":"att-i-5","att-j":"att-j-14"}}
{"customer":"customer-2","attributes":{"att-a":"att-a-9","att-b":"att-b-7","att-c":"att-c-12","att-d":"att-d-4","att-e":"att-e-10","att-f":"att-f-4","att-g":"att-g-13","att-h":"att-h-4","att-i":"att-i-1","att-j":"att-j-13"}}
{"customer":"customer-3","attributes":{"att-a":"att-a-10","att-b":"att-b-6","att-c":"att-c-1","att-d":"att-d-1","att-e":"att-e-13","att-f":"att-f-12","att-g":"att-g-9","att-h":"att-h-6","att-i":"att-i-7","att-j":"att-j-4"}}
{"customer":"customer-4","attributes":{"att-a":"att-a-9","att-b":"att-b-14","att-c":"att-c-7","att-d":"att-d-4","att-e":"att-e-8","att-f":"att-f-7","att-g":"att-g-14","att-h":"att-h-9","att-i":"att-i-13","att-j":"att-j-3"}}
I have flattened the data in the DF like this
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a| att-b| att-c| att-d| att-e| att-f| att-g| att-h| att-i| att-j| customer|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a-7| att-b-3|att-c-10|att-d-10|att-e-15|att-f-11| att-g-2| att-h-7| att-i-5|att-j-14| customer-1|
| att-a-9| att-b-7|att-c-12| att-d-4|att-e-10| att-f-4|att-g-13| att-h-4| att-i-1|att-j-13| customer-2|
I want to complete the compareColumns function, which compares the columns of the two dataframes (userDF and flattenedDF) and returns a new DF like the sample output below.
How do I do that? For example, compare each row and column in flattenedDF with userDF and increment a count when they match, e.g. att-a with att-a, att-b with att-b.
def getCustomer(customerID: String)(dataFrame: DataFrame): DataFrame = {
dataFrame.filter($"customer" === customerID).toDF()
}
def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
val userDF = dataFrame.transform(getCustomer(customerID))
userDF.printSchema()
userDF
}
Sample Output:
+-----------+----------------+
|   customer|similarity_score|
+-----------+----------------+
|customer-1 |              -1|  (the reference customer itself, so it is marked -1 and ignored)
|customer-12|               2|
|customer-3 |               2|
|customer-44|               5|
|customer-5 |               1|
|customer-6 |              10|
+-----------+----------------+
Thanks
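One possible way to complete compareColumns (a sketch only, assuming every column except "customer" is an attribute and that the reference customer appears exactly once): collect the reference customer's row to the driver, then sum a per-column equality flag over the attribute columns.
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._

def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
  // Reference customer's row, brought to the driver.
  val userRow: Row = dataFrame.filter(col("customer") === customerID).head()
  val attributeCols = dataFrame.columns.filterNot(_ == "customer")

  // 1 for every attribute column equal to the reference customer's value, summed up.
  val score = attributeCols
    .map(c => when(col(c) === lit(userRow.getAs[String](c)), 1).otherwise(0))
    .reduce(_ + _)

  dataFrame.select(
    col("customer"),
    when(col("customer") === customerID, -1).otherwise(score).as("similarity_score")
  )
}
It can be used the same way as the existing transform, e.g. flattenedDF.transform(compareColumns("customer-1")).show().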
I'm trying to do something quite simple: I have 2 arrays that have been converted into a DataFrame, and I want to show all possible combinations. So for example my output at the moment looks something like this:
+-----------+-----------+
| A | B |
+-----------+-----------+
| First | T |
| Second | P |
+-----------+-----------+
However what I'm actually looking for is this:
+-----------+-----------+
| A | B |
+-----------+-----------+
| First | T |
| First | P |
| Second | T |
| Second | P |
+-----------+-----------+
So far I've got some fairly straightforward code to map my arrays into columns, but being quite new to both Scala and Spark I'm not sure how I'd grab all those combinations. Here is what I have so far:
val firstColumnValues = Array("First", "Second")
val secondColumnValues = Array("T", "P")
val xs = Array(firstColumnValues, secondColumnValues).transpose
val mapped = sparkContext.parallelize(xs).map(ys => Row(ys(0), ys(1)))
val df = mapped.toDF("A", "B")
df.show
...
case class Row(first: String, second: String)
Thanks in advance for any help
In Spark 2.3
val firstColumnValues = sc.parallelize(Array("First", "Second")).toDF("A")
val secondColumnValues = sc.parallelize(Array("T", "P")).toDF("B")
val combinations = firstColumnValues.crossJoin(secondColumnValues)
combinations.show()
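If you'd rather build the combinations before creating a DataFrame at all, a plain for-comprehension over the two arrays gives the same cross product. A small sketch, assuming spark.implicits._ is in scope for toDF:
import spark.implicits._

val firstColumnValues = Array("First", "Second")
val secondColumnValues = Array("T", "P")

// Cross product of the two arrays as (A, B) pairs, then to a DataFrame.
val combinations = for {
  a <- firstColumnValues
  b <- secondColumnValues
} yield (a, b)

combinations.toSeq.toDF("A", "B").show()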
I have a Dataframe :
| subcategory | subcategory_label | category |
| 00EEE | 00EEE FFF | Drink |
| 0000EEE | 00EEE FFF | Fruit |
| 0EEE | 000EEE FFF | Meat |
from which I need to remove the leading 0's from the columns and get a result like this:
| subcategory | subcategory_label | category |
| EEE | EEE FFF | Drink |
| EEE | EEE FFF | Fruit |
| EEE | EEE FFF | Meat |
So far, I am able to remove the leading 0's from one column using
df.withColumn("subcategory ", regexp_replace(df("subcategory "), "^0*", "")).show
How to remove the leading 0's from dataframe in one go?
With this as the provided dataframe:
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|0000FFFF |0000EE 000FF |ABC |
+-----------+-----------------+--------+
You can create a regexp_replace for all the columns. Something like:
val regex_all = df.columns.map( c => regexp_replace(col(c), "^0*", "" ).as(c) )
And then, use select, since it takes a varargs of type Column:
df.select(regex_all :_* ).show(false)
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|FFFF |EE 000FF |ABC |
+-----------+-----------------+--------+
EDIT:
Defining a function that returns a regexp_replaced sequence is straightforward:
/**
* @param origCols total cols in the DF, pass `df.columns`
* @param replacedCols `Seq` of columns for which the expression is to be generated
* @return `Seq[org.apache.spark.sql.Column]` Spark SQL expressions
*/
def createRegexReplaceZeroes(origCols : Seq[String], replacedCols: Seq[String] ) = {
origCols.map{ c =>
if(replacedCols.contains(c)) regexp_replace(col(c), "^0*", "" ).as(c)
else col(c)
}
}
This function will return a Seq[org.apache.spark.sql.Column]
Now, store the columns you want to replace in an Array:
val removeZeroes = Array( "subcategory", "subcategory_label" )
And then call the function with removeZeroes as the argument. This will return the regexp_replace expressions for the columns listed in removeZeroes:
df.select( createRegexReplaceZeroes(df.columns, removeZeroes) :_* )
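An equivalent way (my own sketch, not part of the answer above) is to fold over just the target columns with withColumn; it produces the same result as the select above:
import org.apache.spark.sql.functions.{col, regexp_replace}

// Strip the leading zeroes only from the listed columns; the others pass through untouched.
val noZeroes = removeZeroes.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, regexp_replace(col(c), "^0*", ""))
}
noZeroes.show(false)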
You can use a UDF to do the same.
I feel it looks more elegant.
scala> val removeLeadingZerosUDF = udf({ x: String => x.replaceAll("^0*", "") })
removeLeadingZerosUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df = Seq( "000012340023", "000123400023", "001234000230", "012340002300", "123400002300" ).toDF("cols")
df: org.apache.spark.sql.DataFrame = [cols: string]
scala> df.show()
+------------+
| cols|
+------------+
|000012340023|
|000123400023|
|001234000230|
|012340002300|
|123400002300|
+------------+
scala> df.withColumn("newCols", removeLeadingZerosUDF($"cols")).show()
+------------+------------+
| cols| newCols|
+------------+------------+
|000012340023| 12340023|
|000123400023| 123400023|
|001234000230| 1234000230|
|012340002300| 12340002300|
|123400002300|123400002300|
+------------+------------+
After a series of validations over a DataFrame,
I obtain a List of String with certain values like this:
List[String]=(lvalue1, lvalue2, lvalue3, ...)
And I have a Dataframe with n values:
dfield 1 | dfield 2 | dfield 3
___________________________
dvalue1 | dvalue2 | dvalue3
dvalue1 | dvalue2 | dvalue3
I want to append the values of the List at the beginning of my Dataframe, in order to get a new DF with something like this:
dfield 1 | dfield 2 | dfield 3 | dfield4 | dfield5 | dfield6
__________________________________________________________
lvalue1 | lvalue2 | lvalue3 | dvalue1 | dvalue2 | dvalue3
lvalue1 | lvalue2 | lvalue3 | dvalue1 | dvalue2 | dvalue3
I have found something using a UDF. Could this be correct for my purpose?
Regards.
TL;DR Use select or withColumn with lit function.
I'd use lit function with select operator (or withColumn).
lit(literal: Any): Column Creates a Column of literal value.
A solution could be as follows.
val values = List("lvalue1", "lvalue2", "lvalue3")
val dfields = values.indices.map(idx => s"dfield ${idx + 1}")
val dataset = Seq(
("dvalue1", "dvalue2", "dvalue3"),
("dvalue1", "dvalue2", "dvalue3")
).toDF("dfield 1", "dfield 2", "dfield 3")
val offsets = dataset.
columns.
indices.
map { idx => idx + values.size + 1 }
val offsetDF = offsets.zip(dataset.columns).
foldLeft(dataset) { case (df, (off, col)) => df.withColumnRenamed(col, s"dfield $off") }
val newcols = values.zip(dfields).
map { case (v, dfield) => lit(v) as dfield } :+ col("*")
scala> offsetDF.select(newcols: _*).show
+--------+--------+--------+--------+--------+--------+
|dfield 1|dfield 2|dfield 3|dfield 4|dfield 5|dfield 6|
+--------+--------+--------+--------+--------+--------+
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
| lvalue1| lvalue2| lvalue3| dvalue1| dvalue2| dvalue3|
+--------+--------+--------+--------+--------+--------+
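A sketch of the withColumn variant mentioned in the TL;DR (my own, using the same values and dataset as above): shift the existing columns to dfield 4..6, then add one literal column per list value as dfield 1..3.
import org.apache.spark.sql.functions.{col, lit}

val renamed = dataset.columns.zipWithIndex.foldLeft(dataset) {
  case (df, (c, idx)) => df.withColumnRenamed(c, s"dfield ${idx + values.size + 1}")
}
val withLiterals = values.zipWithIndex.foldLeft(renamed) {
  case (df, (v, idx)) => df.withColumn(s"dfield ${idx + 1}", lit(v))
}
// Reorder so the literal columns come first; with six columns a plain sort on the names works.
withLiterals.select(withLiterals.columns.sorted.map(col): _*).show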
In Scala/Spark code I have a Dataframe which contains some rows:
col1 col2
Abc someValue1
xyz someValue2
lmn someValue3
zmn someValue4
pqr someValue5
cda someValue6
And I have an ArrayBuffer[String] variable which contains [xyz, pqr, abc].
I want to filter the given dataframe based on the values in the ArrayBuffer, matched against col1.
In SQL it would be like:
select * from tableXyz where col1 in("xyz","pqr","abc");
Assuming you have your dataframe:
val df = sc.parallelize(Seq(("abc","someValue1"),
("xyz","someValue2"),
("lmn","someValue3"),
("zmn","someValue4"),
("pqr","someValue5"),
("cda","someValue6")))
.toDF("col1","col2")
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| lmn|someValue3|
| zmn|someValue4|
| pqr|someValue5|
| cda|someValue6|
+----+----------+
Then you can define a UDF to filter the dataframe based on the array's values:
import scala.collection.mutable.ArrayBuffer

val array = ArrayBuffer[String]("xyz","pqr","abc")
val function: (String => Boolean) = (arg: String) => array.contains(arg)
val udfFiltering = udf(function)
val filtered = df.filter(udfFiltering(col("col1")))
filtered.show()
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| pqr|someValue5|
+----+----------+
Alternatively, you can register your dataframe as a temp table and query it through the SQLContext:
val elements = array.map(el => "\"" + el + "\"").mkString(",")
val query = "select * from tableXyz where col1 in(" + elements + ")"
df.registerTempTable("tableXyz")
val filtered = sqlContext.sql(query)
filtered.show()
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| pqr|someValue5|
+----+----------+
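Note (not part of the answer above): Spark's built-in Column.isin also expresses the SQL IN clause directly, without a UDF or a hand-built query string. A minimal sketch with the same dataframe and buffer:
import org.apache.spark.sql.functions.col
import scala.collection.mutable.ArrayBuffer

val array = ArrayBuffer[String]("xyz", "pqr", "abc")

// isin takes a varargs of values, so splat the buffer into it.
df.filter(col("col1").isin(array: _*)).show()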