Suppose I use partitionBy to save some data to disk, e.g. by date, so my data looks like this:
/mydata/d=01-01-2018/part-00000
/mydata/d=01-01-2018/part-00001
...
/mydata/d=02-01-2018/part-00000
/mydata/d=02-01-2018/part-00001
...
When I read the data back using the Hive config and the DataFrame API, e.g.
val df = sparkSession.sql(s"select * from $database.$tableName")
I know that:
Filter queries on column d will push down
No shuffles will occur if I try to partition by d (e.g. GROUP BY d)
BUT, suppose I don't know what the partition key is (some upstream job writes the data and follows no conventions). How can I get Spark to tell me which column is the partition key, in this case d? Similarly for tables with multiple partition columns (e.g. by month, week, then day).
Currently the best code we have is really ugly:
def getPartitionColumnsForHiveTable(databaseTableName: String)(implicit sparkSession: SparkSession): Set[String] = {
  val cols = sparkSession
    .sql(s"desc $databaseTableName")
    .select("col_name")
    .collect
    .map(_.getAs[String](0))
    .dropWhile(r => !r.matches("# col_name"))
  if (cols.isEmpty) {
    Set()
  } else {
    cols.tail.toSet
  }
}
Assuming you don't have = and / in your partitioned column values, you can do:
val df = spark.sql("show partitions database.test_table")
val partitionedCols: Set[String] = try {
df.map(_.getAs[String](0)).first.split('/').map(_.split("=")(0)).toSet
} catch {
case e: AnalysisException => Set.empty[String]
}
You should get a Set[String] with the partition column names.
You can use SQL statements to get this info: either show create table <tablename>, describe extended <tablename> or show partitions <tablename>. The last one gives the simplest output to parse:
val partitionCols = spark.sql("show partitions <tablename>").as[String].first.split('/').map(_.split("=").head)
Use the metadata to get the partition column names as a comma-separated string.
First check whether the table is partitioned; if it is, get the partition columns:
val table = "default.country"
import org.apache.spark.sql.functions.{col, collect_list}

def isTablePartitioned(spark: org.apache.spark.sql.SparkSession, table: String): Boolean = {
  import spark.implicits._ // for .as[Array[String]]
  val col_details = spark.sql(s" describe extended ${table} ").select("col_name").select(collect_list(col("col_name"))).as[Array[String]].first
  col_details.filter(x => x.contains("# Partition Information")).length > 0
}

def getPartitionColumns(spark: org.apache.spark.sql.SparkSession, table: String): String = {
  import spark.implicits._ // for .as[Array[String]]
  val pat = """(?ms)^\s*#( Partition Information)(.+)(Detailed Table Information)\s*$""".r
  val col_details = spark.sql(s" describe extended ${table} ").select("col_name").select(collect_list(col("col_name"))).as[Array[String]].first
  val col_details2 = col_details.filter(_.trim.length > 0).mkString("\n")
  val arr = pat.findAllIn(col_details2).matchData.collect { case pat(a, b, c) => b }.toList(0).split("\n").filterNot(x => x.contains("#")).filter(_.length > 0)
  arr.mkString(",")
}
if (isTablePartitioned(spark, table)) {
  getPartitionColumns(spark, table)
} else {
  "--NO_PARTITIONS--"
}
Note: the other two answers assume the table has data, which will fail if the table is empty.
Here's a one-liner. When no partitions are present, the Spark call throws an AnalysisException (SHOW PARTITIONS is not allowed on a table that is not partitioned). I'm handling that with scala.util.Try, but this could be improved by catching the correct type of exception.
def getPartitionColumns(table: String) = scala.util.Try(spark.sql(s"show partitions $table").columns.toSeq).getOrElse(Seq.empty)
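If you are on Spark 2.x, another option worth considering (a sketch, not taken from the answers above) is the Catalog API: each catalog column carries an isPartition flag, so there is no string parsing at all, and it also works for empty tables.

import org.apache.spark.sql.SparkSession

// Sketch, assuming Spark 2.x+: ask the session catalog directly instead of parsing
// DESCRIBE / SHOW PARTITIONS output.
def partitionColumnsViaCatalog(database: String, table: String)
                              (implicit spark: SparkSession): Seq[String] =
  spark.catalog
    .listColumns(database, table)   // Dataset[org.apache.spark.sql.catalog.Column]
    .collect()
    .filter(_.isPartition)          // keep only partition columns
    .map(_.name)
    .toSeq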
Related
I am trying to use Spark mapPartitions with Datasets [Spark 2.x] to copy a large list of files [1 million records] from one location to another in parallel.
However, at times, I am seeing that one record is getting copied multiple times.
The idea is to split the 1 million files into a number of partitions (here, 24), then, for each partition, perform the copy operation in parallel, and finally get the result from each partition to perform further actions.
Can someone please tell me what I am doing wrong?
def process(spark: SparkSession): DataFrame = {
  import spark.implicits._

  // Get source and target List for 1 million records
  val sourceAndTargetList =
    List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))

  // convert list to dataframe with number of partitions as 24
  val SourceTargetDataSet =
    sourceAndTargetList.toDF.repartition(24).as[(String, String)]

  var dfBuffer = new ListBuffer[DataFrame]()
  dfBuffer += SourceTargetDataSet
    .mapPartitions(partition => {
      println("partition id: " + TaskContext.getPartitionId)
      // for each partition
      val result = partition
        .map(row => {
          val source = row._1
          val target = row._2
          val copyStatus = copyFiles(source, target) // Function to copy files that returns a boolean
          val dataframeRow = (target, copyStatus)
          dataframeRow
        })
        .toList
      result.toIterator
    })
    .toDF()

  val dfList = dfBuffer.toList
  val newDF = dfList.tail.foldLeft(dfList.head)(
    (accDF, newDF) => accDF.join(newDF, Seq("_1"))
  )
  println("newDF Count " + newDF.count)
  newDF
}
Update 2: I changed the function as shown below, and so far it is giving me consistent results as expected. May I know what I was doing wrong, and am I getting the required parallelization using the function below? If not, how can this be optimized?
def process(spark: SparkSession): DataFrame = {
  import spark.implicits._

  // Get source and target List for 1 million records
  val sourceAndTargetList =
    List(("source1" -> "target1"), ("source 1 Million" -> "Target 1 Million"))

  // convert list to dataframe with number of partitions as 24
  val SourceTargetDataSet =
    sourceAndTargetList.toDF.repartition(24).as[(String, String)]

  val iterator = SourceTargetDataSet.toDF
    .mapPartitions(
      (it: Iterator[Row]) =>
        it.toList
          .map(row => {
            println(row)
            val source = row.toString.split(",")(0).drop(1)
            val target = row.toString.split(",")(1).dropRight(1)
            println("source : " + source)
            println("target: " + target)
            val copyStatus = copyFiles() // Function to copy files that returns a boolean
            val dataframeRow = (target, copyStatus)
            dataframeRow
          })
          .iterator
    )
    .toLocalIterator

  val df = iterator.toList.toDF("targetKey", "copyStatus")
  df
}
One should avoid performing write operations inside map transformations, because a map can be replayed when an executor dies and the same task has to be performed by another executor.
I'd choose foreach instead.
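As a sketch of the "foreach instead of map" suggestion: run the copy as a pure side effect with foreachPartition and don't build a DataFrame from the copy results at all. copyFiles below is a stand-in for your own helper and should be idempotent, because Spark may re-run a task after a failure or for speculation.

import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

// Stand-in for the user's own copy helper; the real one should tolerate being called twice.
def copyFiles(source: String, target: String): Boolean = true

def copyAll(spark: SparkSession, pairs: Seq[(String, String)]): Unit = {
  import spark.implicits._

  // Explicitly typed function value avoids the Scala/Java foreachPartition overload ambiguity.
  val copyPartition: Iterator[(String, String)] => Unit = { it =>
    println("partition id: " + TaskContext.getPartitionId())
    it.foreach { case (source, target) => copyFiles(source, target) }
  }

  pairs.toDF("source", "target")
    .repartition(24)                 // 24 parallel copy tasks, as in the question
    .as[(String, String)]
    .foreachPartition(copyPartition) // side effect only, nothing is shipped back as a DataFrame
}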
My Spark application is as follows:
1) execute a large query with Spark SQL into the dataframe "dataDF"
2) for each partition involved in "dataDF":
2.1) get the associated "filtered" dataframe, in order to have only that partition's associated data
2.2) do specific work with that "filtered" dataframe and write the output
The code is as follows:
val dataSQL = spark.sql("SELECT ...")
val dataDF = dataSQL.repartition($"partition")
for {
row <- dataDF.dropDuplicates("partition").collect
} yield {
val partition_str : String = row.getAs[String](0)
val filtered = dataDF.filter($"partition" .equalTo( lit( partition_str ) ) )
// ... on each partition, do work depending on the partition, and write result on HDFS
// Example :
if( partition_str == "category_A" ){
// do group by, do pivot, do mean, ...
val x = filtered
.groupBy("column1","column2")
...
// write final DF
x.write.parquet("some/path")
} else if( partition_str == "category_B" ) {
  // select specific fields and apply a calculation on them
  val y = filtered.select(...)
  // write final DF
  y.write.parquet("some/path")
} else if ( ... ) {
// other kind of calculation
// write results
} else {
// other kind of calculation
// write results
}
}
This algorithm works. The Spark SQL query is fully distributed. However, the particular work done on each resulting partition is done sequentially, which is inefficient, especially because each per-partition write is done sequentially.
In such a case, what are the ways to replace the "for yield" with something parallel/async?
Thanks
You could use foreachPartition if writing to data stores outside the Hadoop scope, with specific logic needed for that particular environment. Else map, etc.
.par parallel collections (Scala) are another option, but to be used with caution - typically for reading files and pre-processing them; otherwise it is possibly considered risky.
Threads are another option.
You need to check what you are doing and whether the operations can be referenced and used within a foreachPartition block, etc. You need to try, as some aspects can only be written for the driver and then get distributed to the executors via Spark to the workers. But you cannot use, for example, spark.sql on the workers, as in the first snippet below.
Likewise, df.write and df.read cannot be used there either. What you can do is issue individual execute/mutate statements to, say, Oracle or MySQL.
Hope this helps.
// Note: this will NOT work - spark.sql cannot be called from executor-side code:
rdd.foreachPartition(iter => {
  while (iter.hasNext) {
    val item = iter.next()
    // do something
    spark.sql(s"INSERT INTO tableX VALUES(2, 7, 'CORN', 100, $item)")
    // do some other stuff
  }
})
or
RDD.foreachPartition(records => {
  val JDBCDriver = "com.mysql.jdbc.Driver" ...
  ...
  connectionProperties.put("user", s"${jdbcUsername}")
  connectionProperties.put("password", s"${jdbcPassword}")
  val connection = DriverManager.getConnection(ConnectionURL, jdbcUsername, jdbcPassword)
  ...
  val mutateStatement = connection.createStatement()
  val queryStatement = connection.createStatement()
  ...
  records.foreach(record => {
    val val1 = record._1
    val val2 = record._2
    ...
    mutateStatement.execute(s"insert into sample (k,v) values(${val1}, ${val2})")
  })
})
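As a sketch of the ".par" option for the question above (replacing the for/yield over partitions): collect the distinct partition values to the driver, then launch the per-partition Spark jobs concurrently from driver-side threads. processCategory below is a hypothetical stand-in for the per-category if/else logic, and this assumes Scala 2.11/2.12, where .par is available without an extra module.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical stand-in for the per-category group by / pivot / select logic.
def processCategory(filtered: DataFrame, partition: String): Unit =
  filtered.write.mode("overwrite").parquet(s"some/path/partition=$partition")

def processAll(dataDF: DataFrame): Unit = {
  val partitions: Array[String] =
    dataDF.select("partition").distinct.collect.map(_.getString(0))

  // Each loop body submits its own Spark job from a driver-side thread,
  // so writes for different partitions can overlap instead of running one by one.
  partitions.par.foreach { p =>
    processCategory(dataDF.filter(col("partition") === p), p)
  }
}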
I have a function that I want to apply to every row of a .csv file:
def convert(inString: Array[String]): String = {
  val country = inString(0)
  val sellerId = inString(1)
  val itemID = inString(2)
  try {
    val minidf = sqlContext.read.json(sc.makeRDD(inString(3) :: Nil))
      .withColumn("country", lit(country))
      .withColumn("seller_id", lit(sellerId))
      .withColumn("item_id", lit(itemID))
    val finalString = minidf.toJSON.collect().mkString(",")
    finalString
  } catch {
    case e: Exception =>
      println("AN EXCEPTION " + inString.mkString(","))
      "this is an exception " + e + " " + inString.mkString(",")
  }
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
Where I have 4 columns, the 4th being a json blob, into
[{"country":"CA", "seller":112578240", "product":112578240, "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the json object where the first 3 columns have been inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throws a java.lang.NullPointerException.
I included a try/catch clause to see where exactly this is failing, and it's failing for every single row.
What am I doing wrong here?
You cannot use sqlContext or sparkContext inside a Spark map, since those objects can only exist on the driver node. Essentially, they are in charge of distributing your tasks.
You could rewrite the JSON parsing bit using one of these libraries in pure Scala: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/
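As a hedged sketch of that "pure Scala" route: instead of creating a DataFrame per row, splice the three leading columns into the JSON string directly. This is naive - it assumes the fourth field is a JSON array of flat objects, as in the example above - but it runs inside map because it needs no Spark context at all.

// Naive string splicing; assumes the 4th column is a JSON array of flat objects.
def convertPure(inString: Array[String]): String = {
  val Array(country, sellerId, itemId, jsonBlob) = inString.take(4)
  val extraFields = s"\"country\":\"$country\",\"seller_id\":\"$sellerId\",\"item_id\":\"$itemId\","
  jsonBlob.replace("{", "{" + extraFields) // prepend the fields to every object in the array
}

// Usage - the whole pipeline now stays distributed:
// val converted = sc.textFile(path_to_file).map(_.split('\t')).map(convertPure)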
After implementing the code below:
def ngram(s: String, inSep: String, outSep: String, n:Int): Set[String] = {
s.toLowerCase.split(inSep).sliding(n).map(_.sorted.mkString(outSep)).toSet
}
val fPath = "/user/root/data/data220K.txt"
val resultPath = "data/result220K"
val lines = sc.textFile(fPath) // lines: RDD[String]
val ngramNo = 2
val result = lines.flatMap(line => ngram(line, " ", "+", ngramNo)).map(word => (word, 1)).reduceByKey((a, b) => a+b)
val sortedResult = result.map(pair => pair.swap).sortByKey(true)
println(sortedResult.count + "============================")
sortedResult.take(10)
sortedResult.saveAsTextFile(resultPath)
I'm getting a large number of files in HDFS with this schema:
(Freq_Occurrencies, FieldA, FieldB)
Is it possible to join all the files from that directory? Every row is different, but I want to have only one file, sorted by Freq_Occurrencies. Is that possible?
Many thanks!
sortedResult
  .coalesce(1, shuffle = true)
  .saveAsTextFile(resultPath)
coalesce makes Spark use a single task for saving, thus creating only one part. The downside is, of course, performance - all data will have to be shuffled to a single executor and saved using a single thread.
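An alternative sketch, if you want to keep the parallel write: save normally and merge the part files afterwards with Hadoop's FileUtil.copyMerge. This assumes Hadoop 2.x (copyMerge was removed in 3.x); the parts are concatenated in name order (part-00000, part-00001, ...), which preserves the sortByKey ordering.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

sortedResult.saveAsTextFile(resultPath)   // many part files, written in parallel
FileUtil.copyMerge(
  fs, new Path(resultPath),               // source directory with the part files
  fs, new Path(resultPath + "_merged"),   // single merged output file
  false,                                  // don't delete the source directory
  conf, null)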
I have a data source in PostgreSQL with 1 million rows and 100+ columns, and I want to use Spark SQL, so I want to transform this data source to get a SchemaRDD.
Two approaches are introduced in the Spark SQL Programming Guide.
One is through reflection, which means I need to define:
case class Row(Var1: Int, Var2: String, ...)
This is tedious because I have 100+ columns.
Another approach is "Programmatically Specifying the Schema", which means I need to define:
val schema =
StructType(
Seq(StructField("Var1", IntegerType), StructField("Var2", StringType), ...))
This is also tedious for me.
Actually, there's still another problem: I load my PostgreSQL database using the JdbcRDD class, but I found I also need to define the schema in the mapRow parameter of the JdbcRDD constructor, which looks like:
def extractValues(r: ResultSet) = {
(r.getInt("Var1"), r.getString("Var2"), ...)
}
val dbRDD = new JdbcRDD(sc, createConnection,
"SELECT * FROM PostgreSQL OFFSET ? LIMIT ?",
0, 1000000, 1, extractValues)
This API still asks me to create the schema myself; what's worse, I need to redo a similar thing to transform this JdbcRDD into a SchemaRDD, which would be really clumsy code.
So I want to know: what is the best approach for this task?
There are only a limited number of data types that you need to support. Why not use java.sql.ResultSetMetaData? E.g.
val rs = jdbcStatement.executeQuery("select * from myTable limit 1")
val rmeta = rs.getMetaData
to read one row and then dynamically generate the required StructField for each of the columns.
You would need a case statement to handle the type conversion; note that JDBC column indexes are 1-based:

val myStructFields = for (cx <- 1 to rmeta.getColumnCount) yield {
  val jdbcType = rmeta.getColumnType(cx)
  StructField(rmeta.getColumnName(cx), jdbcToSparkType(jdbcType))
}
val mySchema = StructType(myStructFields)
Where jdbcToSparkType is along the following lines:
def jdbcToSparkType(jdbcType: Int) = {
  jdbcType match {
    case 4 => IntegerType
    case 6 => FloatType
    ..
  }
}
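A slightly fuller sketch of that mapping, using the java.sql.Types constants rather than raw integer codes (the exact set of cases depends on the columns you actually have):

import java.sql.Types
import org.apache.spark.sql.types._

// Map common JDBC type codes to Spark SQL types; extend as needed for your schema.
def jdbcToSparkType(jdbcType: Int): DataType = jdbcType match {
  case Types.INTEGER                 => IntegerType
  case Types.BIGINT                  => LongType
  case Types.FLOAT | Types.REAL      => FloatType
  case Types.DOUBLE                  => DoubleType
  case Types.NUMERIC | Types.DECIMAL => DoubleType   // or DecimalType with precision/scale
  case Types.BOOLEAN | Types.BIT     => BooleanType
  case Types.DATE                    => DateType
  case Types.TIMESTAMP               => TimestampType
  case _                             => StringType   // fall back to strings
}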
UPDATE To generate the RDD[Row]: you would follow a similar pattern. In this case you would do something like:

val rows = scala.collection.mutable.ArrayBuffer[Row]()
while (rs.next()) {
  rows += jdbcToSpark(rs)
}
val rowRDD = sc.parallelize(rows)
where
def jdbcToSpark(rs: ResultSet) = {
  var rowSeq = Seq[Any]()
  val meta = rs.getMetaData
  for (cx <- 1 to meta.getColumnCount) { // JDBC columns are 1-based
    meta.getColumnType(cx) match {
      case 4 => rowSeq :+= rs.getInt(cx)
      ..
    }
  }
  Row.fromSeq(rowSeq)
}
then
val rows