I have an RDD[String] of a whole lot of strings that look like "INSERT INTO hive_metastore.default.redirects VALUES (123,56),(589,32),(267,11)". I'd like to be able to run all of these commands to get the data into my actual table, instead of just having a bunch of strings with instructions to get them into the table. For context, I'm doing this on Databricks, and I don't know enough to have set up any odd settings there. (I hope.)
At first I tried just doing insertIntoLines.foreach{ x => spark.sql(x) }, but that does not seem to work. That is, code like this:
val test = sc.parallelize(Array("INSERT INTO hive_metastore.default.svwiki_redirect VALUES (18,0,'Användbarhet','','')","INSERT INTO hive_metastore.default.svwiki_redirect VALUES (25,0,'Apokryferna','','')")).toDS()
test.foreach{x => spark.sql(x)}
gives an error like this:
error: overloaded method value foreach with alternatives: (func: org.apache.spark.api.java.function.ForeachFunction[SqlCommand])Unit <and> (f: SqlCommand => Unit)Unit cannot be applied to (SqlCommand => org.apache.spark.sql.DataFrame)
It does, however, work if I insert a collect to get insertIntoLines.collect().foreach{ x => spark.sql(x) } - and that is fine for my toy data, but for the actual data, I really don't want to have to fit it all into memory on the driver.
Surely there is a nice and principled way of doing this, that doesn't either bottleneck hard at the driver or involve digging into the SQL commands with bespoke regexes?
Can you try using a Dataset or DataFrame instead of an RDD? That way you could avoid calling collect().
import org.apache.spark.sql._
case class SqlCommand(query:String)
val querySet = Seq(SqlCommand("query1"), SqlCommand("query2")).toDS()
querySet.foreach(query => spark.sql(query.query))
NOTE: spark.sql will not work inside foreach, because foreach expects a function that returns Unit while spark.sql returns a DataFrame. So this solution will not work as expected.
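One way to avoid collect() while still issuing the statements from the driver is RDD.toLocalIterator, which fetches one partition at a time. The following is a minimal sketch, not a tested Databricks recipe: the local SparkSession and the table name demo_redirects are assumptions made for illustration and do not come from the question. Note that a SparkSession is only usable on the driver, so the statements have to be run there in any case.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; on Databricks `spark` already exists.
val spark = SparkSession.builder.master("local[*]").appName("run-inserts").getOrCreate()

// Hypothetical target table, created here only so the sketch is self-contained.
spark.sql("CREATE TABLE IF NOT EXISTS demo_redirects (id INT, target INT) USING parquet")

val insertIntoLines = spark.sparkContext.parallelize(Seq(
  "INSERT INTO demo_redirects VALUES (123, 56), (589, 32)",
  "INSERT INTO demo_redirects VALUES (267, 11)"
))

// toLocalIterator brings one partition at a time to the driver,
// so the driver never has to hold the whole RDD at once.
// Iterator.foreach accepts any return type, so there is no Unit mismatch here.
insertIntoLines.toLocalIterator.foreach(stmt => spark.sql(stmt))
```

The driver then needs only as much memory as the largest partition, but the statements still run one after another, so this removes the memory bottleneck rather than the serial one.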
I have a spark-streaming application where I want to do some data transformations before my main operation, but the transformation involves some data validation.
When the validation fails, I want to log the failure cases, and then proceed on with the rest.
Currently, I have code like this:
def values: DStream[String] = ???
def validate(element: String): Either[String, MyCaseClass] = ???
val validationResults = values.map(validate)
validationResults.foreachRDD { rdd =>
rdd.foreach {
case Left(error) => logger.error(error)
case _ =>
}
}
val validatedValues: DStream[MyCaseClass] =
validationResults.mapPartitions { partition =>
partition.collect { case Right(record) => record }
}
This currently works, but it feels like I'm doing something wrong.
Questions
As far as I understand, this will perform the validation twice, which is potentially wasteful.
Is it correct to use values.map(validate).persist() to solve that problem?
Even if I persist the values, it still iterates and pattern matches on all the elements twice. It feels like there should be some method I can use to solve this. On a regular Scala collection, I might use some of the cats TraverseFilter API, or with fs2.Stream an evalMapFilter. What DStream API can I use for that? Maybe something with mapPartitions?
I would say that the best way to tackle this is to take advantage of the fact that flatMap accepts an Option:
def values: DStream[String] = ???
def validate(element: String): Either[String, MyCaseClass] = ???
val validatedValues: DStream[MyCaseClass] =
values.map(validate).flatMap {
case Left(error) =>
logger.error(error)
None
case Right(record) =>
Some(record)
}
You can also be a little more verbose and use mapPartitions, which should be slightly more efficient.
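To make the single-pass idea concrete without a streaming context, here is a small plain-Scala sketch of the function one could hand to mapPartitions. Everything in it is an assumption for illustration: the shape of MyCaseClass, the body of validate, and the loggedErrors buffer, which merely stands in for logger.error (a driver-side buffer like this would not be appropriate inside a real Spark closure).

```scala
import scala.collection.mutable.ListBuffer
import scala.util.Try

// Hypothetical record type and validation rule, for illustration only.
final case class MyCaseClass(value: Int)

def validate(element: String): Either[String, MyCaseClass] =
  Try(element.trim.toInt).toOption
    .map(MyCaseClass(_))
    .toRight(s"not a number: '$element'")

// Stand-in for logger.error so the side effect can be inspected.
val loggedErrors = ListBuffer.empty[String]

// Same shape as the function passed to mapPartitions:
// each element is validated and matched exactly once.
def keepValid(partition: Iterator[String]): Iterator[MyCaseClass] =
  partition.map(validate).flatMap {
    case Left(error) =>
      loggedErrors += error
      None
    case Right(record) =>
      Some(record)
  }

val validated = keepValid(Iterator("1", "x", "3")).toList
```

With a DStream this would be used as values.mapPartitions(keepValid), with the buffer replaced by the real logger.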
The 'best' option here depends a bit on the rest of your spark job and your version of spark.
Ideally you'd pick a mechanism natively understood by Catalyst. The Dataset observe mechanism in Spark 3 (with a listener) may be what you're looking for there. (I haven't seen many examples of using this in the wild, but it seems like this was the motivation behind such a thing.)
In pure Spark SQL, one option is to add a new column with the results of validation, e.g. a column named invalid_reason which is NULL if the record is valid, or some [enumerated] string containing the reason the record failed validation. At this point, you likely want to persist/cache the dataset before doing a groupBy/count/collect/log operation, then filter where invalid_reason is null on the persisted DataFrame and continue with the rest of the processing.
tl;dr: consider adding a validation column rather than just applying a 'validate' function. You then 'fork' processing here: log the records which have the invalid column specified, process the rest of your job on the records which don't. It does add some volume to your dataframe, but doesn't require processing the same records twice.
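A rough sketch of the validation-column approach, using a batch DataFrame for simplicity. The local SparkSession, the column name value, the integer check and the reason string not_an_integer are all assumptions chosen for illustration; only the invalid_reason column name comes from the description above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

// Assumption: a local session for illustration.
val spark = SparkSession.builder.master("local[*]").appName("validation-column").getOrCreate()
import spark.implicits._

val raw = Seq("1", "abc", "3").toDF("value")

// invalid_reason stays NULL for valid rows, because `when` without `otherwise` yields NULL.
val checked = raw
  .withColumn("invalid_reason", when(!col("value").rlike("^-?[0-9]+$"), lit("not_an_integer")))
  .cache()

// Fork 1: summarise and log the failures.
checked.filter(col("invalid_reason").isNotNull)
  .groupBy("invalid_reason").count()
  .collect()
  .foreach(row => println(s"${row.getString(0)}: ${row.getLong(1)}"))

// Fork 2: continue the job on the valid rows only.
val valid = checked.filter(col("invalid_reason").isNull).drop("invalid_reason")
```

Because checked is cached, both forks read the already-validated rows instead of running the validation expression again.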
I need to be able to register a UDF from a string which I will get from a web service, i.e. at run time I call a web service to get the Scala code which constitutes the UDF, compile it and register it as a UDF in the Spark context. As an example, let's say my web service returns the following Scala code in a JSON response:
(row: Row, field:String) => {
import scala.util.{Try, Success, Failure}
val index: Int = Try(row.fieldIndex(field)) match {
case Success(_) => 1
case Failure(_) => 0
}
index
}
I want to compile this code on the fly and then register it as a UDF. I have already tried multiple options such as using toolbox, the Twitter Eval util etc., but found that I need to explicitly specify the argument types of the method while creating an instance, for example:
val code =
q"""
(a:String, b:String) => {
a+b
}
"""
val compiledCode = toolBox.compile(code)
val compiledFunc = compiledCode().asInstanceOf[(String, String) => Option[Any]]
This udf takes two strings as arguments hence I need to specify the types while creating the object like
compiledCode().asInstanceOf[(String, String) => Option[Any]]
The other option I explored is
https://stackoverflow.com/a/34371343/1218856
In both cases I have to know the number of arguments, the argument types and the return type beforehand to instantiate the code as a method. But in my case, as the UDFs are created by my users, I have no control over the number of arguments and their types, so I would like to know if there is any way I can register the UDF by compiling the Scala code without knowing the argument number and type information.
In a nutshell, I get the code as a string, compile it and register it as a UDF without knowing the type information.
I think you'd be much better off by not trying to generate/execute code directly but defining a different kind of expression language and executing that. Something like ANTLR could help you with writing the grammar of that expression language and generating the parser and the Abstract Syntax Trees. Or even Scala's parser combinators. It's of course more work, but also a far less risky and error-prone way of allowing custom function execution.
I am trying to use Slick (3.0.2) to access a database in my Scala project.
Here is part of my code:
val query = table.filter(_.username === "LeoAshin").map { user => (user.username, user.password, user.role, user.delFlg) }
val f = db.run(query.result)
How can I read the data from "f"?
I have googled for a solution many times, but no answer has cleared up my confusion.
Thanks a lot.
f is a Future and there are several things you can do to get at the value, depending on just how urgently you need it. If there's nothing else you can do while you're waiting then you will just have to wait for it to complete and then get the value. Perhaps the easiest is along the following lines (from the Slick documentation):
val q = for (c <- coffees) yield c.name
val a = q.result
val f: Future[Seq[String]] = db.run(a)
f.onSuccess { case s => println(s"Result: $s") }
You could then go on to do other things that don't depend on the result of f and the result of the query will be printed on the console asynchronously.
However, most of the time you'll want to use the value for some other operation, perhaps another database query. In that case, the easiest thing to do is to use a for comprehension. Something like the following:
for (r <- f) yield db.run(q(r))
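where q is a function/method that takes the result of your first query and builds another one. The result of this for comprehension will also be a Future, of course (a nested Future[Future[...]] as written, since db.run itself returns a Future; a second generator in the for comprehension, or flatMap, flattens it).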
One thing to be aware of is that if you are running this in a program that will do an exit after all the code has been run, you will need to use Await (see the Scala API) to prevent the program exiting while one of your db queries is still working.
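The chaining and the final Await can be sketched without a database. In the snippet below, firstQuery and secondQuery are hypothetical stand-ins for two db.run(...) calls; they are not Slick API. Note also that Future.onSuccess was deprecated in Scala 2.12 and removed in 2.13, where foreach or onComplete take its place.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-ins for db.run(firstAction) and db.run(secondAction(...)).
def firstQuery(): Future[Seq[String]] = Future(Seq("LeoAshin"))
def secondQuery(names: Seq[String]): Future[Int] = Future(names.map(_.length).sum)

// Two generators in one for comprehension desugar to flatMap,
// so the result is a flat Future[Int] rather than a Future[Future[Int]].
val chained: Future[Int] = for {
  names <- firstQuery()
  total <- secondQuery(names)
} yield total

// Block only at the very edge of the program, e.g. right before it exits.
val result = Await.result(chained, 5.seconds)
```

The timeout of five seconds is arbitrary; Await.result throws a TimeoutException if the Future has not completed by then.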
The type of query.result is DBIO. When you call db.run on it, you get a Future.
If you want to print the data, use
import scala.concurrent.ExecutionContext.Implicits.global
f.foreach(println)
To continue working with the data, use f.map(_.map { case (username, password, role, delFlg) => ... }) (f holds a Seq of tuples, hence the inner map).
If you want to block on the Future and get the result (if you're playing around in the REPL, for example), use something like
import scala.concurrent.Await
import scala.concurrent.duration._
Await.result(f, 1.second)
Bear in mind, this is not what you want to do in production code — blocking on Futures is a bad practice.
I generally recommend learning about Scala core types and Futures specifically. Slick "responsibility" ends when you call db.run.
I have encoded a list of values to a single database column by joining them with a delimiter. This works fine, except when the list is empty. In that case the database column is filled with an empty string, and when mapping back this gives me a Seq("") instead of Seq.empty.
implicit val SeqUriColumnType = MappedColumnType.base[Seq[Uri], String](
p => p.map(_.toString).mkString(","),
s => if (s.isEmpty) Seq.empty else s.split(",").map(Uri(_)).toSeq
)
I've worked around this by using an if statement but that feels odd. I've tried using MappedColumnType.base[Seq[Uri], Option[String]], but that didn't compile. I think it requires me to also use an option for the Seq, and that's not what I'm looking for.
In essence I want an empty Seq to result in a null value in the db, and to return an empty Seq again when retrieving. How do I do this properly?
Oh, if they handle options, you can remove orNull from the end :). Also, note, that here you are not really converting a collection to option. You are converting a String to option. Does it make it better? :)
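The Option-based conversion discussed here can be sketched as two plain functions. This is only an illustration of the String-to-Option idea: String stands in for Uri to keep it self-contained, the names encode and decode are made up, and whether Slick actually passes a null column value through a MappedColumnType.base mapping is an assumption that is not verified here.

```scala
// Empty Seq -> null (stored as NULL); non-empty Seq -> comma-joined string.
def encode(uris: Seq[String]): String =
  Some(uris).filter(_.nonEmpty).map(_.mkString(",")).orNull

// null or "" -> empty Seq; anything else is split on the delimiter.
def decode(s: String): Seq[String] =
  Option(s).filter(_.nonEmpty).map(_.split(",").toSeq).getOrElse(Seq.empty)

val roundTripEmpty = decode(encode(Seq.empty))
val roundTripTwo   = decode(encode(Seq("http://a", "http://b")))
```

Wrapping the raw string in Option(...) covers both the null case and, via filter, the empty-string case, so the explicit if from the question is no longer needed.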
I want to convert a string column of a data frame to a list. What I can find from the Dataframe API is RDD, so I tried converting it back to RDD first, and then apply toArray function to the RDD. In this case, the length and SQL work just fine. However, the result I got from RDD has square brackets around every element like this [A00001]. I was wondering if there's an appropriate way to convert a column to a list or a way to remove the square brackets.
Any suggestions would be appreciated. Thank you!
This should return the values of that column as a single array:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
Without the mapping, you just get a Row object, which contains every column from the database.
Keep in mind that this will probably get you a list of type Any. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the mapping: r => r(0).asInstanceOf[YOUR_TYPE].
P.S. due to automatic conversion you can skip the .rdd part.
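The square brackets from the question come from Row's own toString, not from the data. A tiny sketch that needs only the spark-sql classes on the classpath and no running session (the value A00001 is taken from the question):

```scala
import org.apache.spark.sql.Row

// A one-column Row, as produced by select("YOUR_COLUMN_NAME").
val row = Row("A00001")

// Printing the Row itself wraps its values in square brackets.
val printedRow = row.toString

// Extracting the field gives the bare value.
val value = row.getString(0)
```

Here printedRow is "[A00001]" while value is "A00001", which is why mapping each Row to its first field removes the brackets.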
With Spark 2.x and Scala 2.11
I can think of 3 possible ways to convert the values of a specific column to a List.
Common code snippets for all the approaches
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // for .toDF() method
val df = Seq(
("first", 2.0),
("test", 1.5),
("choose", 8.0)
).toDF("id", "val")
Approach 1
df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(first, test, choose)
What happens now? We are collecting data to the driver with collect() and picking element zero from each record.
This may not be the best way of doing it. Let's improve it with the next approach.
Approach 2
df.select("id").rdd.map(r => r(0)).collect.toList
// res10: List[Any] = List(first, test, choose)
How is it better? We have distributed the map transformation load among the workers rather than running it on a single driver.
I know rdd.map(r => r(0)) may not seem elegant to you. So, let's address it in the next approach.
Approach 3
df.select("id").map(r => r.getString(0)).collect.toList
// res11: List[String] = List(first, test, choose)
Here we are not converting the DataFrame to an RDD. Note that map won't accept r => r(0) (or _(0)) as in the previous approach, due to encoder issues with DataFrame, so we end up using r => r.getString(0). This may be addressed in future versions of Spark.
Conclusion
All the options give the same output, but 2 and 3 are more efficient, and the 3rd is both efficient and elegant (I think).
Databricks notebook
I know the answer given and asked for is assumed for Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I actually have to reference the column name a second time in the mapping function and I do not need the select statement.
i.e. A DataFrame, containing a column named "Raw"
To get each row value in "Raw" combined as a list where each entry is a row value from "Raw" I simply use:
MyDataFrame.rdd.map(lambda x: x.Raw).collect()
In Scala and Spark 2+, try this (assuming your column name is "s"):
df.select("s").as[String].collect
sqlContext.sql(" select filename from tempTable").rdd.map(r => r(0)).collect.toList.foreach(out_streamfn.println) //remove brackets
it works perfectly
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
public String call(Row row) {
return row.getAs("column_name").toString();
}
}).collect();
logger.info(String.format("list is %s",whatever_list)); //verification
Since no one has given any solution in java(Real Programming Language)
Can thank me later
from pyspark.sql.functions import col
df.select(col("column_name")).collect()
Here collect() returns the result as a list of Row objects.
Beware of collecting a huge data set into a list; it will hurt performance.
It is good to check the data.
Below is for Python-
df.select("col_name").rdd.flatMap(lambda x: x).collect()
An updated solution that gets you a list:
dataFrame.select("YOUR_COLUMN_NAME").map(r => r.getString(0)).collect.toList
This is a Java answer.
df.select("id").collectAsList();