Spark: how to parallelize subsequent specific work on each dataframe partition - Scala

My Spark application is as follows:
1) execute a large query with Spark SQL into the dataframe "dataDF"
2) for each partition involved in "dataDF":
2.1) get the associated "filtered" dataframe, in order to have only that partition's data
2.2) do specific work with that "filtered" dataframe and write the output
The code is as follows:
val dataSQL = spark.sql("SELECT ...")
val dataDF = dataSQL.repartition($"partition")

for {
  row <- dataDF.dropDuplicates("partition").collect
} yield {
  val partition_str: String = row.getAs[String](0)
  val filtered = dataDF.filter($"partition".equalTo(lit(partition_str)))

  // ... on each partition, do work depending on the partition, and write the result to HDFS
  // Example:
  if (partition_str == "category_A") {
    // do group by, do pivot, do mean, ...
    val x = filtered
      .groupBy("column1", "column2")
      ...
    // write final DF
    x.write.parquet("some/path")
  } else if (partition_str == "category_B") {
    // select specific fields and apply a calculation on them
    val y = filtered.select(...)
    // write final DF
    y.write.parquet("some/path")
  } else if ( ... ) {
    // other kind of calculation
    // write results
  } else {
    // other kind of calculation
    // write results
  }
}
This algorithm works successfully. The Spark SQL query is fully distributed. However, the particular work done on each resulting partition is performed sequentially, which is inefficient, especially because each per-partition write is done sequentially.
In that case, what are the ways to replace the "for yield" with something parallel/asynchronous?
Thanks

You could use foreachPartition if you are writing to data stores outside the Hadoop scope, with logic specific to that particular environment.
Otherwise map, etc.
.par parallel collections (Scala) can be used, but with caution: they are fine for reading files and pre-processing them, otherwise possibly risky (a sketch of this driver-side parallelism follows the code examples at the end of this answer).
Threads.
You need to check what you are doing and whether the operations can be referenced and used within a foreachPartition block, etc. You need to try it out, as some aspects can only be written on the driver and are then distributed by Spark to the executors on the workers. For example, you cannot call spark.sql from a worker, as in the first snippet below.
Likewise, df.write or df.read cannot be used there either. What you can do is issue individual execute/mutate statements against, say, Oracle or MySQL.
Hope this helps.
rdd.foreachPartition(iter => {
  while (iter.hasNext) {
    val item = iter.next()
    // do something
    spark.sql("INSERT INTO tableX VALUES(2,7, 'CORN', 100, item)")  // NOT allowed: spark.sql only works on the driver
    // do some other stuff
  }
})
or
RDD.foreachPartition(records => {
  val JDBCDriver = "com.mysql.jdbc.Driver" ...
  ...
  connectionProperties.put("user", s"${jdbcUsername}")
  connectionProperties.put("password", s"${jdbcPassword}")
  val connection = DriverManager.getConnection(ConnectionURL, jdbcUsername, jdbcPassword)
  ...
  val mutateStatement = connection.createStatement()
  val queryStatement = connection.createStatement()
  ...
  records.foreach(record => {
    val val1 = record._1
    val val2 = record._2
    ...
    mutateStatement.execute(s"insert into sample (k,v) values(${val1}, ${val2})")
  })
})
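To come back to the original for/yield loop: below is a minimal, hedged sketch of the parallel-collections / Futures route mentioned above, reusing dataDF, the "partition" column and the category names from the question. The aggregations, column names and output paths are placeholders, not the asker's actual logic. Each Future only submits a Spark job from the driver; the work itself stays distributed.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// assumes spark.implicits._ is already in scope, as in the question
val cachedDF = dataDF.cache()   // avoid re-running the SQL query for every category

val partitionValues = cachedDF.select("partition").distinct.collect.map(_.getString(0))

val jobs = partitionValues.toSeq.map { partition_str =>
  Future {
    val filtered = cachedDF.filter($"partition" === partition_str)
    partition_str match {
      case "category_A" =>
        // placeholder aggregation standing in for the group by / pivot / mean
        filtered.groupBy("column1", "column2").count()
          .write.parquet(s"some/path/$partition_str")
      case "category_B" =>
        // placeholder selection / calculation
        filtered.select("column1")
          .write.parquet(s"some/path/$partition_str")
      case _ =>
        filtered.write.parquet(s"some/path/$partition_str")
    }
  }
}

Await.result(Future.sequence(jobs), Duration.Inf)
partitionValues.par with a foreach would achieve much the same thing. In both cases it can help to enable Spark's FAIR scheduler (spark.scheduler.mode=FAIR) so that the concurrently submitted jobs share the executors instead of queuing behind each other.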

Related

Scala: get data from scylla using spark

Scala/Spark newbie here. I have inherited some old code which I have refactored and have been trying to use to retrieve data from Scylla. The code looks like:
val TEST_QUERY = s"SELECT user_id FROM test_table WHERE name = ? AND id_type = 'test_type';"
var selectData = List[Row]()
dataRdd.foreachPartition {
  iter => {
    // Build up a cluster that we can connect to
    // Start a session with the cluster by connecting to it.
    val cluster = ScyllaConnector.getCluster(clusterIpString, scyllaPreferredDc, scyllaUsername, scyllaPassword)
    var batchCounter = 0
    val session = cluster.connect(tableConfig.keySpace)
    val preparedStatement: PreparedStatement = session.prepare(TEST_QUERY)
    iter.foreach {
      case (test_name: String) => {
        // Get results
        val testResults = session.execute(preparedStatement.bind(test_name))
        if (testResults != null) {
          val testResult = testResults.one()
          if (testResult != null) {
            val user_id = testResult.getString("user_id")
            selectData ::= Row(user_id, test_name)
          }
        }
      }
    }
    session.close()
    cluster.close()
  }
}
println("Head is =======> ")
println(selectData.head)
The above does not return any data and fails with a NullPointerException because the selectData list is empty, although there is definitely data that matches the select statement. I feel like the way I'm doing it is not correct, but I can't figure out what needs to change in order to fix it, so any help is much appreciated.
PS: The whole idea of using a list to keep the results is so that I can use that list to create a dataframe. I'd be grateful if you could point me in the right direction here.
If you look into the definition of the foreachPartition function, you will see that by definition it can't return anything, because its return type is Unit.
Anyway, it's a very bad way of querying data from Cassandra/Scylla from Spark. For that there is the Spark Cassandra Connector, which should be able to work with Scylla as well because of the protocol compatibility.
To read a dataframe from Cassandra just do:
spark.read
  .format("cassandra")
  .option("keyspace", "ksname")
  .option("table", "tab")
  .load()
The documentation is quite detailed, so just read it.
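For the asker's actual lookup (building a dataframe of user_id per name), one hedged option along these lines is to load the Scylla table through the connector and join it with the names, instead of issuing point queries per row. A minimal sketch: the keyspace/table options and the namesDF input are assumptions, and the long-form format name is used here (depending on the connector version, the short "cassandra" alias shown above may also work).
import org.apache.spark.sql.functions.col
import spark.implicits._   // for toDF

// hypothetical stand-in for the names carried by dataRdd in the question
val namesDF = Seq("alice", "bob").toDF("name")

val scyllaDF = spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "ksname")      // assumption: same keyspace as tableConfig.keySpace
  .option("table", "test_table")
  .load()

// equivalent of: SELECT user_id FROM test_table WHERE name = ? AND id_type = 'test_type'
val selectDataDF = scyllaDF
  .filter(col("id_type") === "test_type")
  .join(namesDF, Seq("name"))
  .select("user_id", "name")
This stays distributed end to end and gives you a dataframe directly, instead of mutating a driver-side list from the executors (which never makes it back to the driver).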

How to determine partition key/column with Spark

Suppose I use partitionBy to save some data to disk, e.g. by date so my data looks like this:
/mydata/d=01-01-2018/part-00000
/mydata/d=01-01-2018/part-00001
...
/mydata/d=02-01-2018/part-00000
/mydata/d=02-01-2018/part-00001
...
When I read the data back using the Hive config and the DataFrame API, e.g.
val df = sparkSession.sql(s"select * from $database.$tableName")
I know that:
Filter queries on column d will push down
No shuffles will occur if I try to partition by d (e.g. GROUP BY d)
BUT, suppose I don't know what the partition key is (some upstream job writes the data and follows no conventions). How can I get Spark to tell me which column is the partition key, in this case d? Similarly, how does this work if we have multiple partition levels (e.g. by month, week, then day)?
Currently the best code we have is really ugly:
def getPartitionColumnsForHiveTable(databaseTableName: String)(implicit sparkSession: SparkSession): Set[String] = {
  val cols = sparkSession
    .sql(s"desc $databaseTableName")
    .select("col_name")
    .collect
    .map(_.getAs[String](0))
    .dropWhile(r => !r.matches("# col_name"))
  if (cols.isEmpty) {
    Set()
  } else {
    cols.tail.toSet
  }
}
Assuming you don't have = and / in your partitioned column values, you can do:
import org.apache.spark.sql.AnalysisException

val df = spark.sql("show partitions database.test_table")
val partitionedCols: Set[String] = try {
  df.map(_.getAs[String](0)).first.split('/').map(_.split("=")(0)).toSet
} catch {
  case e: AnalysisException => Set.empty[String]
}
You should get a Set[String] with the partitioned column names.
You can use SQL statements to get this info: either show create table <tablename>, describe extended <tablename> or show partitions <tablename>. The last one gives the simplest output to parse:
val partitionCols = spark.sql("show partitions <tablename>").as[String].first.split('/').map(_.split("=").head)
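For the layout in the question, show partitions returns rows like d=01-01-2018, so the expression above should give Array("d"); with nested partitions (e.g. month=01/week=2/day=3) it should give the full list, Array("month", "week", "day").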
Use the metadata to get the partition column names as a comma-separated string.
First check whether the table is partitioned; if it is, get the partition columns:
val table = "default.country"

def isTablePartitioned(spark: org.apache.spark.sql.SparkSession, table: String): Boolean = {
  val col_details = spark.sql(s" describe extended ${table} ").select("col_name")
    .select(collect_list(col("col_name"))).as[Array[String]].first
  col_details.filter(x => x.contains("# Partition Information")).length > 0
}

def getPartitionColumns(spark: org.apache.spark.sql.SparkSession, table: String): String = {
  val pat = """(?ms)^\s*#( Partition Information)(.+)(Detailed Table Information)\s*$""".r
  val col_details = spark.sql(s" describe extended ${table} ").select("col_name")
    .select(collect_list(col("col_name"))).as[Array[String]].first
  val col_details2 = col_details.filter(_.trim.length > 0).mkString("\n")
  val arr = pat.findAllIn(col_details2).matchData.collect { case pat(a, b, c) => b }
    .toList(0).split("\n").filterNot(x => x.contains("#")).filter(_.length > 0)
  arr.mkString(",")
}

if (isTablePartitioned(spark, table)) {
  getPartitionColumns(spark, table)
} else {
  "--NO_PARTITIONS--"
}
Note: the other two answers assume the table has data; they will fail if the table is empty.
Here's a one-liner. When the table is not partitioned, the Spark call throws an AnalysisException (SHOW PARTITIONS is not allowed on a table that is not partitioned). I'm handling that with scala.util.Try, but this could be improved by catching the correct type of exception.
def getPartitionColumns(table: String) = scala.util.Try(spark.sql(s"show partitions $table").columns.toSeq).getOrElse(Seq.empty)
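Another hedged option, which avoids parsing SQL output altogether, is the Catalog API: spark.catalog.listColumns exposes an isPartition flag per column. A minimal sketch, using a hypothetical table default.mydata:
// listColumns returns a Dataset[org.apache.spark.sql.catalog.Column];
// its isPartition flag marks the partitioning columns.
val partitionCols: Seq[String] =
  spark.catalog.listColumns("default", "mydata")
    .collect()
    .filter(_.isPartition)
    .map(_.name)
    .toSeq
Like the SHOW PARTITIONS approaches, this relies on the table being registered in the metastore, but it does not depend on the table containing data.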

How to associate some data to each partition in spark and re-use it?

I have a partitioned rdd and I want to extract some data from each partition so that I can re-use it later. An over-simplification could be:
val rdd = sc.parallelize(Seq("1-a", "2-b", "3-c"), 3)
val mappedRdd = rdd.mapPartitions { dataIter =>
  val bufferedIter = dataIter.buffered
  // extract data which we want to re-use inside each partition
  val reusableData = bufferedIter.head.charAt(0)
  // use that data and return (but this does not allow me to re-use it)
  bufferedIter.map(_ + reusableData)
}
My solution is to extract the re-usable data into an RDD:
val reusableDataRdd = rdd.mapPartitions { dataIter =>
  // return an iterator with only one item on each partition
  Iterator(dataIter.buffered.head.charAt(0))
}
and then zip the partitions
rdd.zipPartitions(reusableDataRdd) { (dataIter, reusableDataIter) =>
  val reusableData = reusableDataIter.next
  dataIter.map(_ + reusableData)
}
I get the same result as mappedRdd, but I also get my reusable data RDD.
Is there a better option to extract and re-use the data? Maybe something more elegant or optimized?
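For the sample input above, both versions should produce "1-a1", "2-b2" and "3-c3" (one element per partition, each suffixed with that partition's reusable character), with reusableDataRdd holding '1', '2' and '3'.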

EsHadoopException: Could not write all entries for bulk operation Spark Streaming

I want to traverse the stream of data, run a query on it and return the results, which should be written into Elasticsearch. I tried to use the mapPartitions method to create the connection to the database; however, I get the following error, which suggests the partition returns nothing to the RDD (I guess some action should be added after the transformations):
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [10/10]. Error sample (first [5] error messages)
What can be changed in the code to get the data into the RDD and send it to Elasticsearch without any trouble?
Also, I had a variant of the solution using flatMap in foreachRDD; however, it creates a connection to the database for each RDD, which is inefficient in terms of performance.
This is the code for streaming data processing:
wordsArrays.foreachRDD(rdd => {
  rdd.mapPartitions { part => {
      val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
      part.map(
        data => {
          val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
          val calendarTime = Calendar.getInstance.getTime
          val recommendationsMap = convertDataToMap(recommendations, calendarTime)
          recommendationsMap
        })
    }
  }
}.saveToEs("rdd-timed/output")
)
The problem was that I tried to convert the iterator directly into an Array, although it holds multiple rows of my records. That is why Elasticsearch was not able to map this collection of records to the defined single-record schema. Flattening the per-element results and returning an iterator means each emitted item is a single record map that the ES-Hadoop connector can index.
Here is the code that works properly:
wordsArrays.foreachRDD(rdd => {
  rdd.mapPartitions { partition => {
      val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
      val result = partition.map(data => {
        val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
        val calendarTime = Calendar.getInstance.getTime
        convertDataToMap(recommendations, calendarTime)
      }).toList.flatten
      result.iterator
    }
  }.saveToEs("rdd-timed/output")
})

Load PostgreSQL database to SchemaRDD

I have a data source in PostgreSQL with 1 million rows and 100+ columns, and I want to use Spark SQL, so I need to transform this data source into a SchemaRDD.
Two approaches are introduced in the Spark SQL Programming Guide.
One is through reflection, which means I need to define:
case class Row(Var1: Int, Var2: String, ...)
This is tedious because I have 100+ columns.
Another approach is "Programmatically Specifying the Schema", which means I need to define:
val schema =
  StructType(
    Seq(StructField("Var1", IntegerType), StructField("Var2", StringType), ...))
This is also tedious for me.
Actually, there's another problem: I load my PostgreSQL database using the JdbcRDD class, but I found I also need to define the schema in the mapRow parameter of the JdbcRDD constructor, which looks like:
def extractValues(r: ResultSet) = {
  (r.getInt("Var1"), r.getString("Var2"), ...)
}
val dbRDD = new JdbcRDD(sc, createConnection,
  "SELECT * FROM PostgreSQL OFFSET ? LIMIT ?",
  0, 1000000, 1, extractValues)
This API still asks me to create the schema myself, and what's worse, I need to redo a similar thing to transform this JdbcRDD into a SchemaRDD, which would be really clumsy code.
So I want to know: what's the best approach for this task?
There are only a limited number of data types that you need to support. Why not use the
java.sql.ResultSetMetaData
e.g.
val rs = jdbcStatement.executeQuery("select * from myTable limit 1")
val rmeta = rs.getMetaData
to read one row and then dynamically generate the required StructField for each of the columns.
You would need a case statement to handle the type mapping:
val myStructFields = for (cx <- 1 to rmeta.getColumnCount) yield {  // JDBC column indices are 1-based
  val jdbcType = rmeta.getColumnType(cx)
  StructField(rmeta.getColumnName(cx), jdbcToSparkType(jdbcType))
}
val mySchema = StructType(myStructFields.toSeq)
Where jdbcToSparkType is along the following lines:
def jdbcToSparkType(jdbcType: Int) = {
  jdbcType match {
    case 4 => IntegerType   // java.sql.Types.INTEGER
    case 6 => FloatType     // java.sql.Types.FLOAT
    ..
  }
}
UPDATE: To generate the RDD[Row] you would follow a similar pattern. In this case you would:
val rows = scala.collection.mutable.ArrayBuffer[Row]()
while (rs.next()) {
  rows += jdbcToSpark(rs)
}
val rowRDD = sc.parallelize(rows)
where
def jdbcToSpark(rs: ResultSet) = {
  var rowSeq = Seq[Any]()
  for (cx <- 1 to rs.getMetaData.getColumnCount) {  // JDBC columns are 1-based
    rs.getMetaData.getColumnType(cx) match {
      case 4 => rowSeq = rowSeq :+ rs.getInt(cx)
      ..
    }
  }
  Row.fromSeq(rowSeq)
}
then apply the schema generated earlier to get the SchemaRDD:
val schemaRDD = sqlContext.applySchema(rowRDD, mySchema)