I have a Spark Streaming job which reads from Kafka and does some comparisons with an existing table in Postgres before writing to Postgres again. This is what it looks like:
val message = KafkaUtils.createStream(...).map(_._2)
message.foreachRDD( rdd => {
  if (!rdd.isEmpty) {
    val kafkaDF = sqlContext.read.json(rdd)
    println("First")
    kafkaDF.foreachPartition( i => {
      val jdbcDF = sqlContext.read.format("jdbc").options(
        Map("url" -> "jdbc:postgresql://...",
            "dbtable" -> "table", "user" -> "user", "password" -> "pwd")).load()
      createConnection()
      i.foreach( row => {
        println("Second")
        connection.sendToTable()
      })
      closeConnection()
    })
  }
})
This code is giving me a NullPointerException at the line val jdbcDF = ...
What am I doing wrong? Also, my "First" log prints, but "Second" doesn't show up anywhere in the logs. I tried the entire code with kafkaDF.collect().foreach(...) and it works perfectly, but has very poor performance. I am looking to replace it with foreachPartition.
Thanks
It is not clear if there are any issues inside createConnection, closeConnection or connection.sendToTable, but the fundamental problem is an attempt to nest actions/transformations. This is not supported in Spark, and Spark Streaming is no different.
It means that the nested DataFrame initialization (val jdbcDF = sqlContext.read.format ...) simply cannot work and should be removed. If you use it as reference data, it should be created at the same level as kafkaDF and referenced using standard transformations (unionAll, join, ...).
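For illustration, a rough sketch of that inside foreachRDD, next to kafkaDF (the join column "id" is only an assumption, not from your code):

val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql://...",
      "dbtable" -> "table", "user" -> "user", "password" -> "pwd")).load()

// created at the same level as kafkaDF, then combined with a standard transformation
val joinedDF = kafkaDF.join(jdbcDF, Seq("id"), "left_outer")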
If for some reason that is not an acceptable solution, you can create a plain JDBC connection inside foreachPartition and operate on the PostgreSQL table directly (I guess that is what you already do inside sendToTable).
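A rough sketch of what that could look like (the target column "col" and the INSERT statement are assumptions for illustration):

kafkaDF.foreachPartition { rows =>
  // one plain JDBC connection per partition, not per record
  val conn = java.sql.DriverManager.getConnection(
    "jdbc:postgresql://...", "user", "pwd")
  try {
    val stmt = conn.prepareStatement("INSERT INTO table (col) VALUES (?)")
    rows.foreach { row =>
      // do any comparisons / lookups with plain SQL on conn here,
      // then write back over the same connection
      stmt.setString(1, row.getAs[String]("col"))
      stmt.executeUpdate()
    }
  } finally {
    conn.close()
  }
}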
As @zero323 correctly pointed out, you can't broadcast your JDBC connection around, and you cannot create nested RDDs either. Spark simply does not support using the sparkContext or sqlContext within an existing closure, i.e. inside foreachPartition, hence the NullPointerException.
The only way to solve this efficiently is to create a JDBC connection within foreachPartition and execute SQL directly on it to do whatever you intended, and then use that same connection to write the records back.
As to your second, edited, question:
Change:
kafkaDF.foreachPartition(..)
to
kafkaDF.repartition(numPartition).foreachPartition(..)
where numPartition is the desired number of partitions. This will increase the number of partitions. If you have multiple executors (and multiple tasks per executor), these will run in parallel.
Related
Scenario: Cassandra is hosted on a server a.b.c.d and Spark runs on a server, say w.x.y.z.
Assume I want to transform the data from a table (say table) in Cassandra and rewrite it to another table (say tableNew) in Cassandra using Spark. The code that I write looks something like this:
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "a.b.c.d")
.set("spark.cassandra.auth.username", "<UserName>")
.set("spark.cassandra.auth.password", "<Password>")
val spark = SparkSession.builder().master("yarn")
.config(conf)
.getOrCreate()
val dfFromCassandra = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "<table>", "keyspace" -> "<Keyspace>"))
  .load()

val filteredDF = dfFromCassandra.filter(filterCriteria)
  .write.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "<tableNew>", "keyspace" -> "<Keyspace>"))
  .save()
Here filterCriteria represents the transformation/filtering that I do. I am not sure how the Spark Cassandra connector works internally in this case.
This is the confusion that I have:
1: Does Spark load the data from the Cassandra source table into memory, then filter it and reload it into the target table? Or
2: Does the Spark Cassandra connector convert the filter criteria to a WHERE clause, load only the relevant data to form the RDD, and write it back to the target table in Cassandra? Or
3: Does the entire operation happen as a CQL operation, where the query is converted to a CQL-like query and executed in Cassandra itself? (I am almost sure that this is not what happens.)
It is either 1. or 2., depending on your filterCriteria. Naturally, Spark itself can't do any CQL filtering, but custom data sources can implement it using predicate pushdown. In the case of the Cassandra driver, it is implemented here, and the answer depends on whether it covers the filterCriteria you use.
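One way to check which case applies is to look at the physical plan: filters listed under PushedFilters are handled by the connector (case 2), anything else is applied by Spark after loading (case 1). The column name "date" below is only an example:

import org.apache.spark.sql.functions.col

val checkDF = dfFromCassandra.filter(col("date") === "2017-01-01")
checkDF.explain()
// look for "PushedFilters: [...]" in the printed physical plan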
I have a question. How can I copy a dataframe without unloading it from Redshift again?
val companiesData = spark.read.format("com.databricks.spark.redshift")
.option("url","jdbc:redshift://xxxx:5439/cf?user="+user+"&password="+password)
.option("query","select * from cf_core.company")
//.option("dbtable",schema+"."+table)
.option("aws_iam_role","arn:aws:iam::xxxxxx:role/somerole")
.option("tempdir","s3a://xxxxx/Spark")
.load()
import class.companiesData
class test {
  val secondDF = filteredDF(companiesData)

  def filteredDF(df: DataFrame): DataFrame = {
    val result = df.select("companynumber")
    result
  }
}
In this case the data will be unloaded twice: first with select * from the table, and second when it unloads again selecting only companynumber. How can I unload the data once and operate on it many times? This is a serious problem for me. Thanks for the help.
By "unload", do you mean read the data? If so, why are you sure it's being read twice? In fact, you don't have any action in your code, so I'm not even sure if the data is being read at all. If you do try to access secondDF somewhere else in the code, spark should only read the column you select in your class 'test'. I'm not 100% sure of this because I've never used redshift to load data into spark before.
In general, if you want to reuse a dataframe, you should cache it using
companiesData.cache()
Then, the first time you call an action on the dataframe, it will be cached in memory and reused by any later actions.
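A short sketch based on your code (the count is only there to force the first read):

companiesData.cache()
// the first action reads from Redshift once and populates the cache
companiesData.count()
// later actions, e.g. anything built on select("companynumber"), reuse the cached data
companiesData.select("companynumber").show()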
I am new to Spark. I have some JSON data that comes as an HttpResponse. I need to store this data in Hive tables. Every HttpGet request returns a JSON which will be a single row in the table. Due to this, I am having to write single rows as files in the Hive table directory.
But I feel having too many small files will reduce speed and efficiency. So is there a way I can iteratively add new rows to a DataFrame and write it to the Hive table directory all at once? I feel this will also reduce the runtime of my Spark code.
Example:
for (i <- 1 to 10) {
  newDF = hiveContext.read.json("path")
  df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track; what you want to do is obtain the multiple single records as a Seq[DataFrame], and then reduce the Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
map(_ => hiveContext.read.json("path")).
reduce(_ union _).
write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
  jsonArray.
    map { parameter =>
      obtainRecord(parameter)
    }.
    reduce(_ union _)

batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is an analytical system, not a database API. There is no benefit in using Spark to modify a Hive database like this. It will only bring a severe performance penalty without benefiting from any of the Spark features, including distributed processing.
Instead, you should use a Hive client directly to perform transactional operations.
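A minimal sketch of what that could look like through the Hive JDBC client (host, port, database and table name are assumptions, not from your setup):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "user", "")
val stmt = conn.createStatement()
// one INSERT per HTTP response instead of one small file per response
stmt.execute("INSERT INTO TABLE my_table VALUES ('...json payload...')")
stmt.close()
conn.close()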
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or many files; Spark can load an entire directory at once), you can then load that file (or files) all at once into Spark to do your processing. I would also check to see if the web API has any endpoints to fetch all the data you need instead of just one record at a time.
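For instance, if the responses are first dumped into a directory (the path here is an assumption), the whole directory can be read in one go:

// every JSON file under the directory becomes rows of a single DataFrame
val allRecords = hiveContext.read.json("/data/http_dumps/")
allRecords.write.insertInto("table")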
I've done some experiments in the spark-shell with the elasticsearch-spark connector. Invoking Spark:
] $SPARK_HOME/bin/spark-shell --master local[2] --jars ~/spark/jars/elasticsearch-spark-20_2.11-5.1.2.jar
In the scala shell:
scala> import org.elasticsearch.spark._
scala> val es_rdd = sc.esRDD("myindex/mytype",query="myquery")
It works well; the result contains the right records as specified in myquery. The only thing is that I get all the fields, even if I specify a subset of these fields in the query. Example:
myquery = """{"query":..., "fields":["a","b"], "size":10}"""
returns all the fields, not only a and b (BTW, I noticed that the size parameter is not taken into account either: the result contains more than 10 records). Maybe it's important to add that the fields are nested; a and b are actually doc.a and doc.b.
Is it a bug in the connector or do I have the wrong syntax?
The Spark Elasticsearch connector uses fields, thus you cannot apply projection.
If you wish to have fine-grained control over the mapping, you should be using DataFrames instead, which are basically RDDs plus a schema.
Predicate pushdown should also be enabled to translate (push down) Spark SQL into the Elasticsearch Query DSL.
Now a semi-full example:
myQuery = """{"query":..., """
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myQuery)
.option("pushdown", "true")
.load("myindex/mytype")
.limit(10) // instead of size
.select("a","b") // instead of fields
What about calling:
scala> val es_rdd = sc.esRDD("myindex/mytype",query="myquery", Map[String, String] ("es.read.field.include"->"a,b"))
You want to restrict the fields returned from the Elasticsearch _search HTTP API? (I guess to improve download speed.)
First of all, use an HTTP proxy to see what the elastic4hadoop plugin is doing (I use Apache Zeppelin with Charles Proxy on macOS). This will help you understand how pushdown works.
There are several solutions to achieve this:
1. dataframe and pushdown
You specify fields, and the plugin will "forward" them to ES (here the _source parameter):
POST ../events/_search?search_type=scan&scroll=5m&size=50&_source=client&preference=_shards%3A3%3B_local
(-) Not fully working for nested fields.
(+) Simple, straightforward, easy to read
2. RDD & query fields
With JavaEsSpark.esRDD, you can specify fields inside the JSON query, like you did. This only works with RDDs (with DataFrames, the fields are not sent).
(-) no DataFrame -> no Spark way
(+) more flexible, more control
In our application, most of our code just applies filter, group by and aggregate operations on a DataFrame and saves the DF to the Cassandra database.
Like the code below, we have several methods which do the same kind of operations [filter, group by, join, agg] on different numbers of fields and return a DF that will be saved to Cassandra tables.
Sample code is:
val filteredDF = df.filter(col("hour") <= LocalDateTime.now().getHour())
.groupBy("country")
.agg(sum(col("volume")) as "pmtVolume")
saveToCassandra(df)
def saveToCassandra(df: DataFrame) {
  try {
    df.write.format("org.apache.spark.sql.cassandra")
      .options(Map("Table" -> "tableName", "keyspace" -> keyspace))
      .mode("append").save()
  }
  catch {
    case e: Throwable => log.error(e)
  }
}
Since I am calling the action by saving the DF to Cassandra, I expect I need to handle the exception only on that line, as per this thread.
If I get any exception, I can see it in the detailed Spark log by default.
Do I really have to surround the filter and group by code with Try or try/catch?
I don't see any examples of exception handling in the Spark SQL DataFrame API examples.
How do I use Try on the saveToCassandra method? It returns Unit.
There is no point in wrapping the lazy DAG in try/catch.
You would need to wrap the lambda function in Try().
Unfortunately, AFAIK there is no way to do row-level exception handling in DataFrames.
You can use RDD or Dataset as mentioned in the answer to this post below
apache spark exception handling
You don't really need to surround the filter and group by code with Try or try/catch. Since all of these operations are transformations, they don't get executed until an action is performed on them, like saveToCassandra in your case.
However, if an error occurs while filtering, grouping or aggregating the dataframe, the catch clause in the saveToCassandra function will log it, since the action is being performed there.
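If you still want Try around saveToCassandra itself (as asked above), a minimal sketch; note it only observes a failure if the try/catch inside saveToCassandra is removed or rethrows:

import scala.util.{Failure, Success, Try}

// saveToCassandra returns Unit, so this is a Try[Unit]
Try(saveToCassandra(filteredDF)) match {
  case Success(_) => // write succeeded
  case Failure(e) => log.error(e)
}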