I need to perform a certain operation on a set of BQ tables, but I want to do it if and only if I know for certain that all of the BQ tables exist.
I have checked the Google BigQuery package and it has a sample for reading data from BQ tables, which is fine. But what if my tables are really huge? I can't load all the tables just for an existence check, as it would take too much time and seems redundant.
Is there another way to achieve this? I would be very glad if I could get some pointers in the right direction.
Thank you in advance.
Gaurav
spark.read.option(...).load will not load all the objects into a DataFrame.
spark.read.option(...) returns a DataFrameReader. When you call load on it, it will test the connection and issue a query like
SELECT * FROM (select * from objects) SPARK_GEN_SUBQ_11 WHERE 1=0
The query will not scan any records and will error out when the table does not exist. I am not sure about the BigQuery driver, but JDBC drivers throw a Java exception here, which you need to handle in a try {} catch {} block.
Thus you can just call load, catch exceptions and check whether all DataFrames could be instantiated. Here is some example code:
def query(q: String) = {
  val reader = spark.read.format("bigquery").option("query", q)
  try {
    Some(reader.load())
  } catch {
    case e: Exception => None
  }
}
val dfOpts = Seq(
  query("select * from foo"),
  query("select * from bar"),
  query("select * from baz")
)

if (dfOpts.exists(_.isEmpty)) {
  println("Some table is missing")
}
You could use the method tables.get
https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get
Otherwise, you can run the bq CLI command in a bash script, which can be called from your Spark program.
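If you go the tables.get route, here is a minimal sketch using the google-cloud-bigquery Java client (the dataset and table names below are placeholders, not from your setup); getTable returns null when a table does not exist, so no table data is scanned:
import com.google.cloud.bigquery.{BigQueryOptions, TableId}

val bigquery = BigQueryOptions.getDefaultInstance.getService

// returns true only if BigQuery knows the table; no rows are read
def tableExists(dataset: String, table: String): Boolean =
  bigquery.getTable(TableId.of(dataset, table)) != null

// placeholder dataset/table names
val allExist = Seq("foo", "bar", "baz").forall(tableExists("my_dataset", _))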
I have a question: how can I copy a DataFrame without unloading the data from Redshift again?
val companiesData = spark.read.format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://xxxx:5439/cf?user=" + user + "&password=" + password)
  .option("query", "select * from cf_core.company")
  //.option("dbtable", schema + "." + table)
  .option("aws_iam_role", "arn:aws:iam::xxxxxx:role/somerole")
  .option("tempdir", "s3a://xxxxx/Spark")
  .load()
import class.companiesData

class test {
  val secondDF = filteredDF(companiesData)

  def filteredDF(df: DataFrame): DataFrame = {
    val result = df.select("companynumber")
    result
  }
}
In this case the data will be unloaded twice: first by the select * from the table, and second by the select of only companynumber. How can I unload the data once and operate on it many times? This is a serious problem for me. Thanks for your help.
By "unload", do you mean read the data? If so, why are you sure it's being read twice? In fact, you don't have any action in your code, so I'm not even sure if the data is being read at all. If you do try to access secondDF somewhere else in the code, spark should only read the column you select in your class 'test'. I'm not 100% sure of this because I've never used redshift to load data into spark before.
In general, if you want to reuse a DataFrame, you should cache it using
companiesData.cache()
Then, the first time you call an action on the DataFrame it will be materialized in memory, and subsequent actions will reuse the cached data.
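A minimal sketch of that pattern with the companiesData DataFrame from above (the employees column in the second filter is purely hypothetical, for illustration only):
import org.apache.spark.sql.functions.col

companiesData.cache()

val companyNumbers = companiesData.select("companynumber")        // derived from the cached DF
val bigCompanies   = companiesData.filter(col("employees") > 100) // hypothetical column

companyNumbers.count()  // first action materializes the cache
bigCompanies.count()    // served from the cached data, no second read from Redshift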
In our application, most of our code just applies filter, group by and aggregate operations on a DataFrame and saves the DF to a Cassandra database.
Like the code below, we have several methods which do the same kind of operations [filter, group by, join, agg] on different numbers of fields, return a DF, and that DF is saved to Cassandra tables.
Sample code is:
val filteredDF = df.filter(col("hour") <= LocalDateTime.now().getHour())
  .groupBy("country")
  .agg(sum(col("volume")) as "pmtVolume")

saveToCassandra(filteredDF)
def saveToCassandra(df: DataFrame): Unit = {
  try {
    df.write.format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "tableName", "keyspace" -> keyspace))
      .mode("append").save()
  } catch {
    case e: Throwable => log.error(e)
  }
}
Since I am calling the action by saving the DF to Cassandra, I assume I only need to handle the exception on that line, as per this thread.
If I get any exception, I can see it in the detailed Spark log by default.
Do I really have to surround the filter and group by code with Try or try/catch?
I don't see any examples of exception handling in the Spark SQL DataFrame API examples.
How do I use Try on the saveToCassandra method? It returns Unit.
There is no point wrapping the lazy DAG in try/catch.
You would need to wrap the lambda function in Try().
Unfortunately, AFAIK there is no way to do row-level exception handling in DataFrames.
You can use RDD or Dataset as mentioned in the answer to this post:
apache spark exception handling
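For the Unit-returning saveToCassandra itself, a minimal sketch of wrapping the action in Try (instead of, or in addition to, the internal try/catch) could look like this, reusing the filteredDF and log from the question:
import scala.util.{Failure, Success, Try}

// The transformations are lazy; any failure in filter/groupBy/agg only surfaces
// here, when the write action is actually executed.
Try(saveToCassandra(filteredDF)) match {
  case Success(_) => log.info("Write to Cassandra succeeded")
  case Failure(e) => log.error("Write to Cassandra failed: " + e.getMessage)
}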
You don't really need to surround the filter and group by code with Try or try/catch. Since all of these operations are transformations, they don't get executed until an action is performed on them, like saveToCassandra in your case.
However, if an error occurs while filtering, grouping or aggregating the DataFrame, the catch clause in the saveToCassandra function will log it, since that is where the action is performed.
I spent quite some time coding multiple SQL queries that were formerly used to fetch data for various R scripts. This is how it worked:
sqlContent = readSQLFile("file1.sql")
sqlContent = setSQLVariables(sqlContent, variables)
results = executeSQL(sqlContent)
The catch is that for some queries a result from a prior query is required, which is why creating VIEWs in the database itself does not solve the problem. With Spark 2.0 I already figured out a way to do just that:
// create a DataFrame using a JDBC connection to the database
val tableDf = spark.read.jdbc(...)

var tempTableName = "TEMP_TABLE" + java.util.UUID.randomUUID.toString.replace("-", "").toUpperCase
var sqlQuery = scala.io.Source.fromURL(getClass.getResource("/sql/" + sqlFileName)).mkString
sqlQuery = setSQLVariables(sqlQuery, sqlVariables)
sqlQuery = sqlQuery.replace("OLD_TABLE_NAME", tempTableName)

tableDf.createOrReplaceTempView(tempTableName)
var data = spark.sql(sqlQuery)
But this is, in my humble opinion, very fiddly. Also, more complex queries, e.g. queries that incorporate subquery factoring, currently don't work. Is there a more robust way, like re-implementing the SQL code as Spark SQL code using filter($""), .select($""), etc.?
The overall goal is to get multiple org.apache.spark.sql.DataFrames, each representing the results of one former SQL query (which always involves a few JOINs, WITHs, etc.). So n queries lead to n DataFrames.
Is there a better option than the provided two?
Setup: Hadoop v.2.7.3, Spark 2.0.0, Intelli J IDEA 2016.2, Scala 2.11.8, Testcluster on Win7 Workstation
It's not especially clear what your requirement is, but I think you're saying you have queries something like:
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM (SELECT * FROM people LEFT OUTER JOIN places ON ...) WHERE age>20
and you would want to declare and execute this efficiently as
SELECT * FROM people LEFT OUTER JOIN places ON ...
SELECT * FROM <cachedresult> WHERE age>20
To achieve that I would enhance the input file so each sql statement has an associated table name into which the result will be stored.
e.g.
PEOPLEPLACES\tSELECT * FROM people LEFT OUTER JOIN places ON ...
ADULTS\tSELECT * FROM PEOPLEPLACES WHERE age>20
Then execute in a loop like
parseSqlFile().foreach { case (name, query) =>
  val data: DataFrame = execute(query)
  data.createOrReplaceTempView(name)
}
Make sure you declare the queries in order so all required tables have been created. Otherwise, do a little more parsing and sort by dependencies.
In an RDBMS I'd call these tables materialised views, i.e. a transform on other data, like a view, but with the result cached for later reuse.
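For completeness, a minimal sketch of what such a parseSqlFile helper could look like, assuming one name/query pair per line separated by a tab (the file path and line format here are assumptions, not part of the original setup):
import scala.io.Source

def parseSqlFile(path: String = "/sql/queries.sql"): Seq[(String, String)] =
  Source.fromURL(getClass.getResource(path))
    .getLines()
    .filter(_.trim.nonEmpty)
    .map { line =>
      // each line: TABLE_NAME<TAB>SELECT ...
      val Array(name, query) = line.split("\t", 2)
      (name.trim, query.trim)
    }
    .toSeq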
I am taking my first steps with Spark and am currently looking into ways to import some data from a database via a JDBC driver.
My plan is to prepare access to many tables from the DB for possible later use by another team with pure SparkSQL commands, so they can focus on the data and have no contact with the code anymore.
My connection to the DB is working, and so far I have found two working ways to get some data.
Way 1:
sqlContext.read.jdbc(url,"tab3",myProp).registerTempTable("tab3")
Way 2:
case class RowClass_TEST(COL1: String, COL2: String)

val myRDD_TEST = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(url, username, pw),
  "select * from TEST where ? < ?",
  0, 1, 1,
  row => RowClass_TEST(row.getString("COL1"), row.getString("COL2"))
)
myRDD_TEST.toDF().registerTempTable("TEST")
But both ways have some drawbacks:
Way 1 is not so fast if you have to prepare a larger number of tables that are not used later.
(I traced 5 JDBC commands during the execution of the example: create connection, login, settings, query for header, terminate connection.)
Way 2 works very fast, but the Scala case class has a heavy limitation: you can only define 22 fields with this kind of class.
So is there an easy way to set up way 2 without a case class?
I want to access some DB tables with more than 22 columns.
I have already tried to get it working, but my Scala know-how is not good enough yet.
You can write something like this:
sqlContext.load("jdbc",
Map(
"url" -> "jdbc:mysql://dbConnectionString",
"dbtable" ->
"(SELECT * FROM someTable WHERE someField > 10 ) AS a"
)
).registerTempTable("tmp_table")
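Once the temp table is registered, the other team can work with it in pure SparkSQL, for example (reusing the someField column from the subquery above):
sqlContext.sql("SELECT COUNT(*) FROM tmp_table").show()
sqlContext.sql("SELECT someField, COUNT(*) FROM tmp_table GROUP BY someField").show()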
I am currently incrementing a column (not an auto-increment PK column) in my database using the following:
def incrementLikeCount(thingId: Int)(implicit session: Session) = {
  sqlu"update things set like_count = like_count + 1 where id = $thingId".first
}
Is this currently (Slick 2.0.2) the best and fastest way to do this? (I'm using PostgreSQL.)
I was hoping for a more typesafe way of doing this, e.g. if I rename my table or column I want compile-time errors.
I don't want to read in the row and then update, because then I would have to wrap the call in a transaction during the read + write operation, and that is not as efficient as I would want.
I would love it if there were a way to do this using the normal Slick API, and also to be able to update/increment multiple counters at the same time in a single operation (but even one column increment/decrement at a time would be lovely).
Not on Slick, in the lovely ScalaQuery stone ages here, but you should be able to use what was called a MutatingUnitInvoker to modify a DB row in place (i.e. perform a single query).
Something like:
val q = for { id <- Parameters[Int]; t <- Things if t.id is id } yield t

def incrementLikeCount(thingId: Int)(implicit session: Session) = {
  q(thingId).mutate(r => r.row.copy(like_count = r.row.like_count + 1))
}
Performance should be acceptable: the prepared statement is generated once at compile time and a single query is issued against the database. Not sure how you can improve on that in a type-safe manner with what Slick currently has on offer.
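If you stay with plain SQL, the same sqlu interpolator from the question can also bump several counters in a single statement; a sketch (the second column, view_count, is purely hypothetical):
def incrementCounts(thingId: Int)(implicit session: Session) =
  sqlu"""update things
         set like_count = like_count + 1,
             view_count = view_count + 1
         where id = $thingId""".first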