Spark sc.binaryFiles() partitioning small files and YARN - scala

Using the sc.binaryFiles() function in Spark 2.3.0 on a Hortonworks 2.6.5 server, I noticed behavior I cannot explain regarding the default partitioning in a YARN-managed cluster. Please see the sample code below:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object ReadTestYarn extends App {
  Logger.getLogger("org").setLevel(Level.ERROR)

  val sc = new SparkContext("yarn", "ReadTestYarn")

  val inputRDD1 = sc.textFile("hdfs:/user/maria_dev/readtest/input/*")
  val inputRDD2 = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*")

  println("Num of RDD1 partitions: " + inputRDD1.getNumPartitions)
  println("Num of RDD2 partitions: " + inputRDD2.getNumPartitions)
}
[maria_dev@sandbox-hdp readtest]$ spark-submit --master yarn --deploy-mode client --class ReadTestYarn ReadTest.jar
Num of RDD1 partitions: 10
Num of RDD2 partitions: 1
The data I use is small: 10 CSV files, each about 4-5 MB in size, 43 MB in total. In the case of RDD1, the number of resulting partitions is understandable, and the calculation method is well explained in the following post and article:
Spark RDD default number of partitions
https://medium.com/swlh/building-partitions-for-processing-data-files-in-apache-spark-2ca40209c9b7
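As a rough illustration of the rule described in those resources (my sketch of the Hadoop FileInputFormat split computation that textFile() builds on, not the exact Spark source):

// splitSize = max(minSize, min(goalSize, blockSize)), where goalSize = totalSize / minPartitions.
// Splits never cross file boundaries, so 10 files of ~4-5 MB (each far below the 128 MB HDFS
// block size) yield one split per file, i.e. the 10 partitions seen for inputRDD1.
def splitSize(minSize: Long, goalSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))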
But with RDD2, created via the binaryFiles() function with the master URL passed to Spark as "yarn", only 1 partition is created, which I don't fully understand.
@Mark Rajcok has given some explanation in the post below, but the link to the commit changes there is not working. Could someone please provide a detailed explanation of why only one partition is created in this case?
PySpark: Partitioning while reading a binary file using binaryFiles() function
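For what it's worth, I can work around it (a sketch of a workaround, not an explanation of the internals): binaryFiles() accepts an optional minPartitions hint, and an explicit repartition() after reading always works.

// minPartitions is only a hint and may still be ignored by binaryFiles().
val hinted = sc.binaryFiles("hdfs:/user/maria_dev/readtest/input/*", minPartitions = 10)
println("Hinted partitions: " + hinted.getNumPartitions)

// Forcing parallelism for downstream stages regardless of the read-side split:
val repartitioned = hinted.repartition(10)
println("After repartition: " + repartitioned.getNumPartitions)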

Related

How to use kafka.group.id and checkpoints in spark 3.0 structured streaming to continue to read from Kafka where it left off after restart?

Based on the introduction in Spark 3.0, https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html, it should be possible to set "kafka.group.id" to track the offsets. For our use case, I want to avoid potential data loss if the streaming Spark job fails and restarts. Based on my previous questions, I have a feeling that kafka.group.id in Spark 3.0 is something that will help:
How to specify the group id of kafka consumer for spark structured streaming?
How to ensure no data loss for kafka data ingestion through Spark Structured Streaming?
However, I tried the settings in Spark 3.0 as below:
package com.example
/**
 * @author ${user.name}
 */
import scala.math.random
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, BooleanType, LongType}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SaveMode
import org.apache.spark.SparkFiles
import java.util.Properties
import org.postgresql.Driver
import org.apache.spark.sql.streaming.Trigger
import java.time.Instant
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
import java.sql.Connection
import java.sql.DriverManager
import java.sql.ResultSet
import java.sql.SQLException
import java.sql.Statement
//import org.apache.spark.sql.hive.HiveContext
import scala.io.Source
import java.nio.charset.StandardCharsets
import com.amazonaws.services.kms.{AWSKMS, AWSKMSClientBuilder}
import com.amazonaws.services.kms.model.DecryptRequest
import java.nio.ByteBuffer
import com.google.common.io.BaseEncoding
object App {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .appName("MY-APP")
      .getOrCreate()

    import spark.sqlContext.implicits._

    spark.catalog.clearCache()
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
    spark.sparkContext.setLogLevel("ERROR")
    spark.sparkContext.setCheckpointDir("/home/ec2-user/environment/spark/spark-local/checkpoint")
    System.gc()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "mybroker.io:6667")
      .option("subscribe", "mytopic")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.ssl.truststore.location", "/home/ec2-user/environment/spark/spark-local/creds/cacerts")
      .option("kafka.ssl.truststore.password", "changeit")
      .option("kafka.ssl.truststore.type", "JKS")
      .option("kafka.sasl.kerberos.service.name", "kafka")
      .option("kafka.sasl.mechanism", "GSSAPI")
      .option("kafka.group.id", "MYID")
      .load()

    df.printSchema()

    val schema = new StructType()
      .add("id", StringType)
      .add("x", StringType)
      .add("eventtime", StringType)

    val idservice = df.selectExpr("CAST(value AS STRING)")
      .select(from_json(col("value"), schema).as("data"))
      .select("data.*")

    val monitoring_df = idservice
      .selectExpr("cast(id as string) id",
        "cast(x as string) x",
        "cast(eventtime as string) eventtime")

    val monitoring_stream = monitoring_df.writeStream
      .trigger(Trigger.ProcessingTime("120 seconds"))
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        if (!batchDF.isEmpty) {
          batchDF.persist()
          printf("At %d, the %dth microbatch has %d records and %d partitions \n", Instant.now.getEpochSecond, batchId, batchDF.count(), batchDF.rdd.partitions.size)
          batchDF.show()
          batchDF.write.mode(SaveMode.Overwrite).option("path", "/home/ec2-user/environment/spark/spark-local/tmp").saveAsTable("mytable")
          spark.catalog.refreshTable("mytable")
          batchDF.unpersist()
          spark.catalog.clearCache()
        }
      }
      .start()
      .awaitTermination()
  }
}
The Spark job is tested in standalone mode using the spark-submit command below, but the same problem exists when I deploy it in cluster mode on AWS EMR.
spark-submit --master local[1] --files /home/ec2-user/environment/spark/spark-local/creds/client_jaas.conf,/home/ec2-user/environment/spark/spark-local/creds/cacerts,/home/ec2-user/environment/spark/spark-local/creds/krb5.conf,/home/ec2-user/environment/spark/spark-local/creds/my.keytab --driver-java-options "-Djava.security.auth.login.config=/home/ec2-user/environment/spark/spark-local/creds/client_jaas.conf -Djava.security.krb5.conf=/home/ec2-user/environment/spark/spark-local/creds/krb5.conf" --conf spark.dynamicAllocation.enabled=false --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/ec2-user/environment/spark/spark-local/creds/client_jaas.conf -Djava.security.krb5.conf=/home/ec2-user/environment/spark/spark-local/creds/krb5.conf" --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/ec2-user/environment/spark/spark-local/creds/client_jaas.conf -Djava.security.krb5.conf=/home/ec2-user/environment/spark/spark-local/creds/krb5.conf" --conf spark.yarn.maxAppAttempts=1000 --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 --class com.example.App ./target/sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar
Then I started the streaming job to read the streaming data from the Kafka topic. After some time, I killed the Spark job, waited for 1 hour, and started the job again. If I understand correctly, the new streaming data should start from the offset at which I killed the Spark job. However, it still starts from the latest offset, which causes data loss during the time the job was stopped.
Do I need to configure more options to avoid data loss? Or do I have some misunderstanding of Spark 3.0? Thanks!
Problem solved
The key issue here is that the checkpoint must be added to the query specifically; just setting a checkpoint directory on the SparkContext is not enough. After adding the checkpoint, it works. In the checkpoint folder, an offsets subfolder is created, containing offset files 0, 1, 2, 3, .... Each file shows the offset information for the different partitions:
{"8":109904920,"2":109905750,"5":109905789,"4":109905621,"7":109905330,"1":109905746,"9":109905750,"3":109905936,"6":109905531,"0":109905583}}
One suggestion is to put the checkpoint on some external storage, such as S3. It can help recover the offsets even in case you need to rebuild the EMR cluster itself.
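For illustration, a minimal sketch of where the option goes in the writeStream query above (the checkpoint path is just a placeholder; the per-batch logic is unchanged):

val monitoring_stream = monitoring_df.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .option("checkpointLocation", "s3://my-bucket/checkpoints/mytopic")  // placeholder path
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.show()  // same per-batch logic as in the question goes here
  }
  .start()
  .awaitTermination()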
According to the Spark Structured Streaming Kafka Integration Guide, Spark itself keeps track of the offsets and no offsets are committed back to Kafka. That means if your Spark Streaming job fails and you restart it, all necessary information on the offsets is stored in Spark's checkpointing files.
Even if you set the ConsumerGroup name with kafka.group.id, your application will still not commit the messages back to Kafka. The information on the next offset to read is only available in the checkpointing files of your Spark application.
If you stop and restart your application without a re-deployment and ensure that you do not delete old checkpoint files, your application will continue reading from where it left off.
In the Spark Structured Streaming documentation on Recovering from Failures with Checkpointing it is written that:
"In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) [...]"
This can be achieved by setting the following option in your writeStream query (it is not sufficient to set the checkpoint directory in your SparkContext configurations):
.option("checkpointLocation", "path/to/HDFS/dir")
In the docs it is also noted that "This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."
In addition, the fault tolerance capabilities of Spark Structured Streaming also depend on your output sink, as described in the section Output Sinks.
As you are currently using the ForeachBatch Sink, you might not have restart capabilities in your application.

Spark Cassandra integration not using C* optimizations

I am running code from the IntelliJ IDE. My Spark-Cassandra cluster has 3 nodes. Cassandra nodes and Spark workers are on the same machines.
val sparkConf = new SparkConf()
  .set("spark.sql.catalog.mycatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
  .set("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .set("spark.sql.catalog.casscatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")

val sc = SparkSession.builder()
  .config(sparkConf)
  .master("spark://master")
  .withExtensions(new CassandraSparkExtensions)
  .getOrCreate()

val table = sc.sql("select * from table where primarykeyA = 1")
table.show(10)
Now, when I run the above query directly, it returns in milliseconds, since I have specified the partition key.
Expectation: this query should hit only the one worker node that holds the partition data and return.
Somehow, when this runs, it ends up going to all worker nodes, which indicates the DataStax optimizations are not in place.
Is there a way I can supply the parameter below via code instead of on the spark-submit command line? --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta
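One thing I have considered (a sketch under the assumption that the SparkSession is built in this same JVM before any job runs; I have not verified it for every deployment mode) is setting the standard spark.jars.packages property in code instead of on the command line:

// Programmatic equivalent of --packages, set before getOrCreate().
val sparkConf = new SparkConf()
  .set("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta")
  .set("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .set("spark.sql.catalog.casscatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")

val spark = SparkSession.builder()
  .config(sparkConf)
  .master("spark://master")
  .getOrCreate()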

Execute job on distributed cassandra DSE spark cluster

I have a three-node Cassandra DSE cluster and a DB schema with RF=3. Now I'm creating a Scala application to be executed on DSE Spark. The Scala code is as follows:
package com.spark

import com.datastax.spark.connector._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext

object sample {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("testing")
      .set("spark.cassandra.connection.host", "192.168.0.40")
      .set("spark.driver.allowMultipleContexts", "true")
      .set("spark.executor.memory", "1g")
      .set("spark.driver.memory", "1g")
      .set("spark.driver.maxResultSize", "500M")
      .set("spark.executor.heartbeatInterval", "30s")
      .set("spark.submit.deployMode", "cluster")

    val sc = new SparkContext(conf)

    val lRDD = sc.cassandraTable("dbname", "tablename")
    lRDD.collect.foreach(println)
  }
}
I'm running the script using:
dse> bin/dse spark-submit --class com.spark.sample --total-executor-cores 4 /home/db-svr/sample.jar
So now I want to execute my Spark application from one node, but the system should do the processing on all 3 nodes internally, and I want to monitor this so that I can utilize the RAM and processors of the 3 nodes collectively. How can I do that?
Also, the current script is taking a lot of time to return results (table size: 1 million rows of 128 bytes each). Are there any performance tuning parameters that I'm missing?
There are a few things you probably want to change. The main thing stopping you from running on multiple machines is
.setMaster("local")
which instructs the application not to use a distributed resource manager and instead to run everything locally in the application process. With DSE you should follow the relevant documentation or start with the Spark Build Examples.
In addition, you most likely never want to set
.set("spark.driver.allowMultipleContexts", "true")
Having multiple SparkContexts in one JVM is fraught with problems and usually means things are not set up correctly.
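A minimal sketch of the adjusted setup, assuming dse spark-submit supplies the master URL so that neither setMaster("local") nor allowMultipleContexts is needed:

val conf = new SparkConf()
  .setAppName("testing")
  .set("spark.cassandra.connection.host", "192.168.0.40")
  .set("spark.executor.memory", "1g")
  // no setMaster("local") and no allowMultipleContexts: the resource manager is chosen by dse spark-submit

val sc = new SparkContext(conf)
val lRDD = sc.cassandraTable("dbname", "tablename")
println(lRDD.count())  // count() instead of collect(), so a million rows are not pulled onto the driver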

Unable to create dataframe using SQLContext object in spark2.2

I am using Spark version 2.2 on Microsoft Windows 7. I want to load a CSV file into a variable to perform SQL-related actions later on, but I am unable to do so. I referred to the accepted answer from this link, but it was of no use. I followed the steps below for creating the SparkContext and SQLContext objects:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sc=SparkContext.getOrCreate() // Creating spark context object
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Creating SQL object for query related tasks
The objects are created successfully, but when I execute the code below, it throws an error which can't be posted here.
val df = sqlContext.read.format("csv").option("header", "true").load("D://ResourceData.csv")
And when I try something like df.show(2), it says that df was not found. I tried the Databricks solution for loading CSV from the attached link. It downloads the packages but doesn't load the CSV file. So how can I rectify my problem? Thanks in advance :)
I solved my problem of loading a local file into a dataframe using version 1.6 in the Cloudera VM with the help of the code below:
1) sudo spark-shell --jars /usr/lib/spark/lib/spark-csv_2.10-1.5.0.jar,/usr/lib/spark/lib/commons-csv-1.5.jar,/usr/lib/spark/lib/univocity-parsers-1.5.1.jar
2) val df1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("treatEmptyValuesAsNulls", "true" ).option("parserLib", "univocity").load("file:///home/cloudera/Desktop/ResourceData.csv")
NOTE: sc and sqlContext variables are automatically created
But there are many improvements in the latest version, i.e. 2.2.1, which I am unable to use because metastore_db doesn't get created on Windows 7. I'll post a new question regarding this.
In reference to your comment that you are able to access the SparkSession variable, follow the steps below to process your CSV file using Spark SQL.
Spark SQL is a Spark module for structured data processing.
There are mainly two abstractions - Dataset and DataFrame:
A Dataset is a distributed collection of data.
A DataFrame is a Dataset organized into named columns.
In the Scala API, DataFrame is simply a type alias of Dataset[Row].
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
You have a csv file and you can simply create a dataframe by doing one of the following:
From your spark-shell using the SparkSession variable spark:
val df = spark.read
.format("csv")
.option("header", "true")
.load("sample.csv")
After reading the file into a dataframe, you can register it as a temporary view.
df.createOrReplaceTempView("foo")
SQL statements can be run using the sql method provided by Spark:
val fooDF = spark.sql("SELECT name, age FROM foo WHERE age BETWEEN 13 AND 19")
You can also query that file directly with SQL:
val df = spark.sql("SELECT * FROM csv.'file:///path to the file/'")
Make sure that you run Spark in local mode when you load data from a local path, or else you will get an error. The error occurs when you have already set the HADOOP_CONF_DIR environment variable, which makes Spark expect "hdfs://..." paths instead of "file://" paths.
Set your spark.sql.warehouse.dir (default: ${system:user.dir}/spark-warehouse).
.config("spark.sql.warehouse.dir", "file:///C:/path/to/my/")
It is the default location of the Hive warehouse directory (using Derby) with managed databases and tables. Once you set the warehouse directory, Spark will be able to locate your files, and you can load the CSV.
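Putting these pieces together, a minimal sketch (the warehouse directory and CSV path are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CsvLoad")
  .master("local[*]")  // local mode, since the file is read from the local file system
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")  // placeholder
  .getOrCreate()

val df = spark.read
  .option("header", "true")
  .csv("file:///C:/path/to/ResourceData.csv")  // placeholder path

df.createOrReplaceTempView("resource_data")
spark.sql("SELECT * FROM resource_data LIMIT 5").show()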
Reference : Spark SQL Programming Guide
Spark version 2.2.0 has built-in support for csv.
In your spark-shell, run the following code:
val df = spark.read
  .option("header", "true")
  .csv("D:/abc.csv")
df: org.apache.spark.sql.DataFrame = [Team_Id: string, Team_Name: string ... 1 more field]

Simple spark job fail due to GC overhead limit

I've created a standalone Spark (2.1.1) cluster on my local machines, with 9 cores / 80 GB per machine (a total of 27 cores / 240 GB RAM).
I've got a sample Spark job that sums all the numbers from 1 to x. This is the code:
package com.example

import org.apache.spark.sql.SparkSession

object ExampleMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("spark://192.168.1.2:7077")
      .config("spark.driver.maxResultSize", "3g")
      .appName("ExampleApp")
      .getOrCreate()

    val sc = spark.sparkContext
    val rdd = sc.parallelize(List.range(1, 1000))
    val sum = rdd.reduce((a, b) => a + b)
    println(sum)
    done
  }

  def done = {
    println("\n\n")
    println("-------- DONE --------")
  }
}
When running the above code I get results after a few seconds, so I cranked the code up to sum all the numbers from 1 to 1B (1,000,000,000), and then I get a "GC overhead limit exceeded" error.
I read that Spark should spill to disk if there isn't enough memory. I've tried to play with my cluster configuration, but that didn't help.
Driver memory = 6G
Number of workers = 24
Cores per worker = 1
Memory per worker = 10G
I'm not a developer and have no knowledge of Scala, but I would like to find a solution to run this code without GC issues.
Per @philantrovert's request, I'm adding my spark-submit command:
/opt/spark-2.1.1/bin/spark-submit \
--class "com.example.ExampleMain" \
--master spark://192.168.1.2:6066 \
--deploy-mode cluster \
/mnt/spark-share/example_2.11-1.0.jar
In addition, my spark/conf files are as follows:
The slaves file contains the 3 IP addresses of my nodes (including the master).
spark-defaults contains:
spark.master spark://192.168.1.2:7077
spark.driver.memory 10g
spark-env.sh contains:
SPARK_LOCAL_DIRS= shared folder among all nodes
SPARK_EXECUTOR_MEMORY=10G
SPARK_DRIVER_MEMORY=10G
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=10G
SPARK_WORKER_INSTANCES=8
SPARK_WORKER_DIR= shared folder among all nodes
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
Thanks
I suppose the problem is that you create a List with 1 billion entries on the driver, which is a huge data structure (4 GB). There is a more efficient way to programmatically create a Dataset/RDD:
val rdd = spark.range(1000000000L).rdd
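For completeness, a sketch of how the sum itself could then be computed without building anything large on the driver (range bounds taken from the question; note that spark.range's end is exclusive):

// The numbers are generated on the executors; the result (about 5.0e17) still fits in a Long.
val total = spark.range(1L, 1000000001L).rdd.map(_.longValue).reduce(_ + _)
println(total)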