for loop alternative in Scala? (Improve performance)

I'm new to Scala. My requirement is to delete records matching a particular column value from almost 100 tables. I read the data from a CSV (my source), selected that particular column and converted it into a List.
val csvDF = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", true)
  .option("escape", "\"")
  .option("multiline", "true")
  .option("quotes", "")
  .load(inputPath)
val bad_records = csvDF.select("corrput_id").collect().map(_(0)).toList
Then I read the metadata from the Postgres schema to get the list of all tables. Here I wrote the two for loops below, which work fine, but the performance is way too bad. How can I improve this?
val query = "(select table_name from information_schema.tables where table_schema = '" + db + "' and table_name not in " + excludetables + ") temp "
val tablesdf = spark.read.jdbc(jdbcUrl, table = query, connectionProperties)
val tablelist = tablesdf.select($"table_name").collect().map(_(0)).toList
println(tablelist)
for (i <- tablelist) {
  val s2 = dbconnection.createStatement()
  for (j <- bad_records) {
    s2.execute("delete from " + db + "." + i + " where corrput_id = '" + j + "' ")
  }
  s2.close()
}
Thanks in advance

If you're looking to improve performance, in my opinion you should focus on optimizing your queries instead. Executing one query per row per table WILL hurt your performance; something like
" where corrput_id IN " + bad_records.map(str => s" '$str' ").mkString("(", ",", ")")
would be better. The second point: why don't you just use the Spark APIs? Calling collect on a DataFrame and then processing it in a single thread is a bit like awaiting a Future (you are not using the power you actually have). Spark is made for this kind of work and can do it efficiently.
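Putting the first point into code, here is a minimal sketch (reusing dbconnection, db, tablelist and bad_records from the question, and keeping its string-concatenation style; for untrusted input a PreparedStatement would be safer): one DELETE per table with all bad ids in a single IN clause, instead of one DELETE per id.

val inClause = bad_records.map(id => s"'$id'").mkString("(", ", ", ")")

val stmt = dbconnection.createStatement()
try {
  // One statement per table instead of one per (table, id) pair.
  for (table <- tablelist) {
    stmt.execute(s"delete from $db.$table where corrput_id in $inClause")
  }
} finally {
  stmt.close()
}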

Related

Simple TableAPI SQL query doesn't work on Flink 1.10 and Blink

I want to define a Kafka connector using the Table API and run SQL over the table it describes (backed by Kafka). Unfortunately, the rowtime definition doesn't seem to work as expected.
Here's a reproducible example:
object DefineSource extends App {
  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.table.api.scala._

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  val config = EnvironmentSettings.newInstance().inStreamingMode().useBlinkPlanner().build()
  val tEnv = StreamTableEnvironment.create(env, config)

  val rowtime = new Rowtime().watermarksPeriodicBounded(5000)
  val schema = new Schema()
    .field("k", "string")
    .field("ts", "timestamp(3)").rowtime(rowtime)

  tEnv.connect(new Kafka()
      .topic("test")
      .version("universal"))
    .withSchema(schema)
    .withFormat(new Csv())
    .createTemporaryTable("InputTable")

  val output = tEnv.sqlQuery(
    """SELECT k, COUNT(*)
      | FROM InputTable
      | GROUP BY k, TUMBLE(ts, INTERVAL '15' MINUTE)
      |""".stripMargin
  )
  tEnv.toAppendStream[(String, Long)](output).print()
  env.execute()
}
which yields
org.apache.flink.table.api.TableException: Window aggregate can only be defined over a time attribute column, but TIMESTAMP(3) encountered.
at org.apache.flink.table.planner.plan.rules.logical.StreamLogicalWindowAggregateRule.getInAggregateGroupExpression(StreamLogicalWindowAggregateRule.scala:51)
at org.apache.flink.table.planner.plan.rules.logical.LogicalWindowAggregateRuleBase.onMatch(LogicalWindowAggregateRuleBase.scala:79)
at org.apache.calcite.plan.AbstractRelOptPlanner.fireRule(AbstractRelOptPlanner.java:319)
at org.apache.calcite.plan.hep.HepPlanner.applyRule(HepPlanner.java:560)
at org.apache.calcite.plan.hep.HepPlanner.applyRules(HepPlanner.java:419)
at org.apache.calcite.plan.hep.HepPlanner.executeInstruction(HepPlanner.java:256)
at org.apache.calcite.plan.hep.HepInstruction$RuleInstance.execute(HepInstruction.java:127)
at org.apache.calcite.plan.hep.HepPlanner.executeProgram(HepPlanner.java:215)
at org.apache.calcite.plan.hep.HepPlanner.findBestExp(HepPlanner.java:202)
at org.apache.flink.table.planner.plan.optimize.program.FlinkHepProgram.optimize(FlinkHepProgram.scala:69)
at org.apache.flink.table.planner.plan.optimize.program.FlinkHepRuleSetProgram.optimize(FlinkHepRuleSetProgram.scala:87)
at org.apache.flink.table.planner.plan.optimize.program.FlinkChainedProgram.$anonfun$optimize$1(FlinkChainedProgram.scala:62)
at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:160)
at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:160)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:160)
at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:158)
at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:108)
at org.apache.flink.table.planner.plan.optimize.program.FlinkChainedProgram.optimize(FlinkChainedProgram.scala:58)
at org.apache.flink.table.planner.plan.optimize.StreamCommonSubGraphBasedOptimizer.optimizeTree(StreamCommonSubGraphBasedOptimizer.scala:170)
at org.apache.flink.table.planner.plan.optimize.StreamCommonSubGraphBasedOptimizer.doOptimize(StreamCommonSubGraphBasedOptimizer.scala:94)
at org.apache.flink.table.planner.plan.optimize.CommonSubGraphBasedOptimizer.optimize(CommonSubGraphBasedOptimizer.scala:77)
at org.apache.flink.table.planner.delegation.PlannerBase.optimize(PlannerBase.scala:248)
at org.apache.flink.table.planner.delegation.PlannerBase.translate(PlannerBase.scala:151)
at org.apache.flink.table.api.scala.internal.StreamTableEnvironmentImpl.toDataStream(StreamTableEnvironmentImpl.scala:210)
at org.apache.flink.table.api.scala.internal.StreamTableEnvironmentImpl.toAppendStream(StreamTableEnvironmentImpl.scala:107)
I'm on Flink 1.10.0.
This is a bug in 1.10; it is fixed in later releases:
https://issues.apache.org/jira/browse/FLINK-16160
Unfortunately it is a bug in 1.10 which, as #lijiayan said, should be fixed in 1.11+.
As a workaround in 1.10 you can use DDL instead:
tEnv.sqlUpdate(
  "CREATE TABLE InputTable (\n" +
    " k STRING,\n" +
    " ts TIMESTAMP(3),\n" +
    " WATERMARK FOR ts AS ts - INTERVAL '5' SECOND\n" +
    ") WITH (\n" +
    " 'connector.type' = 'kafka',\n" +
    " 'connector.version' = 'universal',\n" +
    " 'connector.topic' = 'test',\n" +
    " 'connector.properties.zookeeper.connect' = 'localhost:2181',\n" +
    " 'connector.properties.bootstrap.servers' = 'localhost:9092',\n" +
    " 'format.type' = 'csv'\n" +
    ")"
)
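With the table declared through DDL, ts becomes a proper event-time attribute, so the windowed query from the question should then run unchanged, e.g.:

val output = tEnv.sqlQuery(
  """SELECT k, COUNT(*)
    | FROM InputTable
    | GROUP BY k, TUMBLE(ts, INTERVAL '15' MINUTE)
    |""".stripMargin
)
tEnv.toAppendStream[(String, Long)](output).print()
env.execute()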

Outer join two Datasets (not DataFrames) in Spark Structured Streaming

I have some code that joins two streaming DataFrames and outputs to console.
val dataFrame1 =
  df1Input.withWatermark("timestamp", "40 seconds").as("A")
val dataFrame2 =
  df2Input.withWatermark("timestamp", "40 seconds").as("B")

val finalDF: DataFrame = dataFrame1.join(dataFrame2,
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp " +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")

finalDF.writeStream.format("console").start().awaitTermination()
What I now want is to refactor this part to use Datasets, so I can have some compile-time checking.
So what I tried was pretty straightforward:
val finalDS: Dataset[(A, B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp " +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")

finalDS.writeStream.format("console").start().awaitTermination()
However, this gives the following error:
org.apache.spark.sql.AnalysisException: Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;;
As you can see, the join code hasn't changed, so there is a watermark on both sides and a range condition. The only change was to use the Dataset API instead of DataFrame.
Also, it is fine when I use inner join:
val finalDS: Dataset[(A, B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp " +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour")
)
finalDS.writeStream.format("console").start().awaitTermination()
Does anyone know how this can happen?
Well, when you use the joinWith method instead of join you rely on a different implementation, and it seems that this implementation does not support leftOuter joins for streaming Datasets.
You can check the "outer joins with watermarking" section of the official documentation. It uses the join method, not joinWith. Note that the result type will be a DataFrame, which means you will most likely have to map the fields manually:
val finalDS = dataFrame1.as[A].join(dataFrame2.as[B],
  expr(
    "A.key = B.key" +
      " AND " +
      "B.timestamp >= A.timestamp " +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter").select(/* useful fields */).as[C]
If you are here to understand why this exception
org.apache.spark.sql.AnalysisException: Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;;
still appears even though you have introduced a watermark to the join and Spark 3 already supports stream-stream joins: you have probably added the watermark AFTER the join, but Spark wants you to add the watermark BEFORE the join, on each stream!
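In code, that ordering looks roughly like the sketch below, reusing df1Input and df2Input from the question (assumed to be streaming DataFrames with id and timestamp columns): the watermark is attached to each input stream first, and only then are the streams joined.

import org.apache.spark.sql.functions.expr

// Watermark each input stream BEFORE the join ...
val left  = df1Input.withWatermark("timestamp", "40 seconds").as("A")
val right = df2Input.withWatermark("timestamp", "40 seconds").as("B")

// ... and only then apply the outer join with the range condition.
val joined = left.join(right,
  expr(
    "A.id = B.id" +
      " AND B.timestamp >= A.timestamp" +
      " AND B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")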

List in the Case-When Statement in Spark SQL

I'm trying to convert a dataframe from long to wide as suggested at How to pivot DataFrame?
However, the SQL seems to misinterpret the countries list as column names from the table. Below are the messages I saw in the console, along with the sample data and code from the link above. Does anyone know how to resolve this?
Messages from the scala console:
scala> val myDF1 = sqlc2.sql(query)
org.apache.spark.sql.AnalysisException: cannot resolve 'US' given input columns id, tag, value;
id tag value
1 US 50
1 UK 100
1 Can 125
2 US 75
2 UK 150
2 Can 175
and I want:
id US UK Can
1 50 100 125
2 75 150 175
I can create a list with the value I want to pivot and then create a string containing the sql query I need.
val countries = List("US", "UK", "Can")
val numCountries = countries.length - 1

var query = "select *, "
for (i <- 0 to numCountries - 1) {
  query += "case when tag = " + countries(i) + " then value else 0 end as " + countries(i) + ", "
}
query += "case when tag = " + countries.last + " then value else 0 end as " + countries.last + " from myTable"

myDataFrame.registerTempTable("myTable")
val myDF1 = sqlContext.sql(query)
Country codes are literals and should be enclosed in quotes, otherwise the SQL parser will treat them as column names:
val caseClause = countries.map(
  x => s"""CASE WHEN tag = '$x' THEN value ELSE 0 END as $x"""
).mkString(", ")

val aggClause = countries.map(x => s"""SUM($x) AS $x""").mkString(", ")

val query = s"""
  SELECT id, $aggClause
  FROM (SELECT id, $caseClause FROM myTable) tmp
  GROUP BY id"""

sqlContext.sql(query)
The question is why even bother with building SQL strings from scratch?
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, sum, when}
import sqlContext.implicits._ // for the $"..." column syntax

def genCase(x: String) = {
  when($"tag" <=> lit(x), $"value").otherwise(0).alias(x)
}

def genAgg(f: Column => Column)(x: String) = f(col(x)).alias(x)

df
  .select($"id" :: countries.map(genCase): _*)
  .groupBy($"id")
  .agg($"id".alias("dummy"), countries.map(genAgg(sum)): _*)
  .drop("dummy")

JPQL "DISTINCT" returns only one result

I am confused by DISTINCT in JPQL. I have two JPQL queries identical except for "DISTINCT" in one of them:
String getObjectsForFlow =
        "SELECT " +
        " se.componentID " +
        "FROM " +
        " StatisticsEvent se " +
        "WHERE " +
        " se.serverID IS NOT NULL " +
        " AND se.flowID = :uuid " +
        " AND se.componentID IS NOT NULL " +
        "ORDER BY " +
        " se.timeStamp desc ";

String getObjectsForFlowDistinct =
        "SELECT DISTINCT " +
        " se.componentID " +
        "FROM " +
        " StatisticsEvent se " +
        "WHERE " +
        " se.serverID IS NOT NULL " +
        " AND se.flowID = :uuid " +
        " AND se.componentID IS NOT NULL " +
        "ORDER BY " +
        " se.timeStamp desc ";
I run a little code to get the results from each query and dump them to stdout. For the non-distinct query I get many rows with some duplicates, but for the distinct query I get only one row, which is part of the non-distinct list.
NOT DISTINCT
::: 01e2e915-35c1-6cf0-9d0e-14109fdb7235
::: 01e2e915-35c1-6cf0-9d0e-14109fdb7235
::: 01e2e915-35d9-afe0-9d0e-14109fdb7235
::: 01e2e915-35d9-afe0-9d0e-14109fdb7235
::: 01e2e915-35bd-c370-9d0e-14109fdb7235
::: 01e2e915-35bd-c370-9d0e-14109fdb7235
::: 01e2e915-35aa-1460-9d0e-14109fdb7235
::: 01e2e915-35d1-2460-9d0e-14109fdb7235
::: 01e2e915-35e1-7810-9d0e-14109fdb7235
::: 01e2e915-35e1-7810-9d0e-14109fdb7235
::: 01e2e915-35d0-12f0-9d0e-14109fdb7235
::: 01e2e915-35b0-cb20-9d0e-14109fdb7235
::: 01e2e915-35a8-66b0-9d0e-14109fdb7235
::: 01e2e915-35a8-66b0-9d0e-14109fdb7235
::: 01e2e915-35e2-6270-9d0e-14109fdb7235
::: 01e2e915-357f-33d0-9d0e-14109fdb7235
DISTINCT
::: 01e2e915-35e2-6270-9d0e-14109fdb7235
Where are the other entries? I would expect a DISTINCT list containing eleven (I think) entries.
Double-check the equals() method on your StatisticsEvent entity class. Maybe semantically different values compare as equal when equals() is called, producing this behavior.
The problem was the "ORDER BY se.timeStamp" clause. To fulfill the request, JPQL added the ORDER BY field to the SELECT DISTINCT clause.
This is something of a corner case in the interplay between JPQL and SQL. The JPQL syntax clearly applies the DISTINCT modifier only to se.componentID, but when translated into SQL the ORDER BY field gets inserted into the selection.
I am surprised that the ORDER BY field had to be selected at all. Some databases can return a data set ORDERed by a field not in the SELECTion. Oracle can do so. My underlying database is Derby -- could this be a limitation in Derby?
Oracle does not support SELECT DISTINCT with an ORDER BY unless the ORDER BY columns are in the SELECT; I am not sure any database does. It will work in Oracle if the DISTINCT is not actually required (it does not run because the rows are already unique), but if it does need to run you will get an error:
"ORA-01791: not a SELECTed expression"
If you are using EclipseLink, this behavior is controlled by the DatabasePlatform method shouldSelectDistinctIncludeOrderBy(). You can extend your platform to return false if your database does not require this.
Still, I don't see how adding the TIMESTAMP will change the query results?
Both queries are incorrect JPQL queries, because the ORDER BY clause refers to an item that is not in the select list. The JPA 2.0 specification contains an example that matches this case:
The following two queries are not legal because the orderby_item is
not reflected in the SELECT clause of the query.
SELECT p.product_name
FROM Order o JOIN o.lineItems l JOIN l.product p JOIN o.customer c
WHERE c.lastname = ‘Smith’ AND c.firstname = ‘John’
ORDER BY p.price
SELECT p.product_name
FROM Order o, IN(o.lineItems) l JOIN o.customer c
WHERE c.lastname = ‘Smith’ AND c.firstname = ‘John’
ORDER BY
o.quantity
Of course it would be nicer if the implementation could give a clear error message instead of trying to guess the expected result of an incorrect query.

scala newbie having troubles with Option, what's the equivalent of the ternary operator

I've already read that if in Scala is an expression and always returns a value.
So I'm trying to do the following (pseudo code)
sql = "select * from xx" + iif(order.isDefined, "order by " order.get, "")
I'm trying with
val sql: String = "select * from xx" + if (order.isDefined) {" order by " + order.get} else {""}
But I get this error:
illegal start of simple expression
order is an Option[String]
I just want an optional parameter on a method; if that parameter (in this case order) is not passed, it should just be skipped.
What would be the most idiomatic way to achieve this?
-- edit --
I guess I was too quick to ask.
I found this way:
val orderBy = order.map( " order by " + _ ).getOrElse("")
Is this the right way to do it?
I thought map was meant for other purposes...
First of all, you are not using Option[T] idiomatically; try this:
"select * from xx" + order.map(" order by " + _).getOrElse("")
or with different syntax:
"select * from xx" + (order map {" order by " + _} getOrElse "")
Which is roughly equivalent to:
"select * from xx" + order match {
case Some(o) => " order by " + o
case None => ""
}
Have a look at the scala.Option Cheat Sheet. But if you really want to go the ugly way of ifs (you were missing parentheses around the if):
"select * from xx" + (if(order.isDefined) {" order by " + order.get} else {""})
...or, if you really want to impress your friends:
order.foldLeft("")((_, b) => " order by " + b)
(I would still recommend Tomasz's answer, but I think this one is not included in the scala.Option cheat sheet, so I thought I'd mention it.)
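Coming back to the original goal (an optional method parameter that is simply skipped when absent), here is a minimal sketch using a hypothetical buildQuery helper with a defaulted Option parameter:

// `order` defaults to None, so callers may simply omit it.
def buildQuery(table: String, order: Option[String] = None): String =
  s"select * from $table" + order.map(" order by " + _).getOrElse("")

buildQuery("xx")                // "select * from xx"
buildQuery("xx", Some("name"))  // "select * from xx order by name"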