I'm trying to write to Cassandra through Spark.
I have 6 nodes in my cluster, and on it I created the keyspace in which I want to write data:
CREATE KEYSPACE traffic WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'} AND durable_writes = true;
When I try to write from Spark, I get this kind of error:
16/08/17 16:14:57 ERROR QueryExecutor: Failed to execute: com.datastax.spark.connector.writer.RichBatchStatement#7409fd2d
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)
This is a snippet of the code showing exactly what I'm doing:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.types.{StructType, StructField, DateType, IntegerType}

object ff {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .set("spark.cassandra.output.consistency.level", "ONE") // write consistency level
      .setMaster("local[4]")
      .setAppName("ff")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    def time[A](f: => A) = {
      val s = System.nanoTime
      val ret = f
      println("time: " + (System.nanoTime - s) / 1e6 + "ms")
      ret
    }

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // use first line of all files as header
      .option("inferSchema", "true")
      .load("test.csv")
    df.registerTempTable("ff_table")
    //df.printSchema()
    df.count

    time {
      df.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("table" -> "ff_table", "keyspace" -> "traffic"))
        .save()
    }
  }
}
Also, if I run nodetool describecluster, I get these results:
Cluster Information:
    Name: Test Cluster
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        bf6c3ae7-5c8b-3e5d-9794-8e34bee9278f: [127.0.0.1, 127.0.0.2, 127.0.0.3, 127.0.0.4, 127.0.0.5, 127.0.0.6]
I tried inserting a row from the CLI with replication_factor 2 and it works, so every node can see each other.
Why then can't Spark insert anything? Why can't the nodes see each other when data is inserted from Spark? Does anyone have an idea?
It looks like you are running 6 nodes on one machine via loopback. This means there is a rather good chance that the machine's resources are being oversubscribed. The various Cassandra instances are most likely taking turns or swapping, which causes them to go missing under heavy load. Increasing the replication factor increases the chance that a valid target is up, but will increase load even further.
At its core, C* requires several different resources from your system; if any one of these becomes a bottleneck, there is a chance that a node will not respond to gossip in sufficient time.
These resources are:
RAM - How much memory the JVM is able to acquire; this is affected by OS swap as well. If you allocate a large JVM heap but the OS swaps it to disk, you are likely to see massive performance issues. With multiple nodes on the same machine you need to make sure there is ample RAM for the JVM of every node you are starting. In addition, if any one instance's JVM gets too close to full, you will enter GC and possibly a GC storm, which will basically lock up that instance. Many of these details will be clear in the system.log.
CPU - Without exclusive access to at least one CPU, you are almost guaranteed to have some important threads in C* scheduled with long delays between them. This can cause gossip threads to be ignored and gossip to fail, which will give some nodes a view of a cluster with failed machines and cause unavailable errors.
DISK - Every Cassandra instance maintains its own commit log and data (SSTable) files. The commit log flushes every 10 seconds, and if you have multiple instances and only one hard drive, the commit log flushes and the normal memtable flushes can easily block one another. This is further compounded by compaction, which requires another large amount of IO.
NETWORK - This usually isn't an issue with multiple nodes on the same machine, since traffic goes over loopback.
In summation,
It is important to make sure the resources allocated to your C* instances are small enough that no instance will overrun the space/RAM/CPU of another. If they do, you will end up with a cluster whose communication fails under load because one of the above resources is bottlenecked. This doesn't mean it's impossible to run multiple nodes on the same machine, but it does mean you must take care in provisioning. You can also try to lessen the load by throttling your write speed, which gives the nodes less of a chance of clobbering one another.
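If you go the throttling route, here is a minimal sketch of the relevant settings (property names are from the DataStax Spark Cassandra Connector and vary slightly between connector versions, so verify them against your connector's docs; the values are illustrative only):
import org.apache.spark.SparkConf

val throttledConf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  // Slow the writes down so the co-located C* instances are not overwhelmed
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")
  .set("spark.cassandra.output.concurrent.writes", "2")
  .set("spark.cassandra.output.batch.size.rows", "100")
  .setMaster("local[4]")
  .setAppName("ff")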
Related
I have several compressed CSV files in a Google Cloud Storage bucket. They are grouped in folders by hour, meaning another application saves several of those files in folders whose names contain the hour.
I basically have a Spark application that reads all of those files - thousands of them - with simple code like the one below:
sparkSession.read
  .format("csv")
  .option("sep", "\t")
  .option("header", false)
  .option("inferSchema", false)
  .csv(path)
It takes more than an hour to read; is that because the files are compressed?
I also noted in the Spark UI that I only ever have one executor, never more than one. Can't I use several executors to read those files in parallel and do the processing faster? How can I do that? I am basically trying to create a temp view from the files for further SQL statements in Spark.
I am running on Dataproc with the default YARN configuration.
According to this article, there are ten things you should take into consideration if you want to improve your cluster performance.
Perhaps it would be a good idea to let Spark scale the number of executors automatically by setting the spark.dynamicAllocation.enabled parameter to true. Please note that this configuration also requires enabling spark.shuffle.service.enabled; refer to the documentation.
A second approach to executors is explained here; if you want to try this configuration, another Stack Overflow thread explains how to configure the yarn.scheduler.capacity.resource-calculator parameter in Dataproc.
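As a minimal sketch of the first approach (both properties are standard Spark settings; the application name is just a placeholder):
import org.apache.spark.sql.SparkSession

// Let YARN scale the number of executors for this application up and down;
// dynamic allocation requires the external shuffle service to be enabled.
val spark = SparkSession.builder()
  .appName("ReadManyCsvFiles")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()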
EDIT:
I have recreated your scenario by reading many files from a GCS bucket, and I was able to see that more than one executor was used for this operation.
How?
Using RDD.
Resilient Distributed Datasets (RDDs) are collections of immutable JVM objects that are distributed across an Apache Spark cluster. Data in an RDD is split into chunks based on a key and then dispersed across all the executor nodes. RDDs are highly resilient, that is, they are able to recover quickly from any issues, as the same data chunks are replicated across multiple executor nodes. Thus, even if one executor fails, another will still process the data.
There are two ways to create RDDs: parallelizing an existing collection, or referencing a dataset in an external storage system (e.g. a GCS bucket). RDDs can be created using SparkContext's textFile()/wholeTextFiles() methods.
SparkContext.wholeTextFiles lets you read a directory containing multiple small files, and returns each of them as (filename, content) pairs. This is in contrast with SparkContext.textFile, which would return one record per line in each file.
I wrote code in Python and ran a PySpark job in Dataproc:
import pyspark
sc = pyspark.SparkContext()
rdd_csv = sc.wholeTextFiles("gs://<BUCKET_NAME>/*.csv")
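# Note: collect() pulls all file contents back to the driver; fine for a quick test,
# but keep the data distributed (e.g. build a DataFrame/temp view) for real workloads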
rdd_csv.collect()
I see that you are using the Scala language. Please refer to the Spark documentation for Scala code snippets. I assume it will be similar to this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object ReadManyFiles {
  def main(args: Array[String]) {
    if (args.length != 1) {
      throw new IllegalArgumentException(
        "1 argument is required: <inputPath>")
    }
    val inputPath = args(0)
    val sc = new SparkContext(new SparkConf().setAppName("<APP_NAME>"))
    val rdd_csv = sc.wholeTextFiles(inputPath)
    rdd_csv.collect()
  }
}
where inputPath can be specified when running the Dataproc job (or you can hardcode it in your .scala file):
gcloud dataproc jobs submit spark \
    --cluster=${CLUSTER} \
    --class <CLASS> \
    --jars gs://${BUCKET_NAME}/<PATH>.jar \
    -- gs://${BUCKET_NAME}/input/
I hope it will help you. If you have more questions, please ask.
The resources should already be scaled to your app dynamically; usually you don't need to explicitly set executor numbers.
In your case, depending on how big your dataset is, the cluster size or the VMs may be too small to handle the increased input data size. Try increasing the number of VMs/nodes in your cluster, or use VMs with more RAM.
I want to run multiple Spark SQL queries in parallel in a Spark cluster, so that I can utilize the complete resources cluster-wide. I'm using sqlContext.sql(query).
I saw some sample code here like the following,
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}

val parallelism = 10
val executor = Executors.newFixedThreadPool(parallelism)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)

val tasks: Seq[String] = ???
val results: Seq[Future[Int]] = tasks.map(query => {
  Future {
    //spark stuff here
    0
  }(ec)
})
val allDone: Future[Seq[Int]] = Future.sequence(results)
//wait for results
Await.result(allDone, scala.concurrent.duration.Duration.Inf)
executor.shutdown() //otherwise the jvm will probably not exit
As I understand it, the ExecutionContext computes the available cores on the machine (using a ForkJoinPool) and parallelizes accordingly. But what happens if we consider a Spark cluster rather than a single machine, and how can it guarantee complete cluster resource utilization?
E.g., if I have a 10-node cluster with 4 cores each, then how can the above code guarantee that the 40 cores will be utilized?
EDIT:
Let's say there are 2 SQL queries to be executed; we have 2 ways to do this:
1. Submit the queries sequentially, so that the second query completes only after the execution of the first (because sqlContext.sql(query) is a synchronous call).
2. Submit both queries in parallel using Futures, so that the queries execute independently and in parallel in the cluster,
assuming there are enough resources (in both cases).
I think the second one is better because it uses the maximum resources available in the cluster, and if the first query has fully utilized the resources, the scheduler will wait for the completion of the job (depending upon the policy), which is fair in this case.
But as user9613318 mentioned, "increasing pool size will saturate the driver".
So how can I efficiently control the threads for better resource utilization?
Parallelism will have a minimal impact here, and additional cluster resources don't really affect the approach. Futures (or Threads) are used not to parallelize execution, but to avoid blocking execution. Increasing the pool size can only saturate the driver.
What you really should be looking at is Spark in-application scheduling pools and tuning the number of partitions for narrow (How to change partition size in Spark SQL, Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?) and wide (What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?) transformations.
If the jobs are completely independent (the code structure suggests that), it could be preferable to submit each one separately, with its own set of allocated resources, and to configure cluster scheduling pools accordingly.
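A minimal sketch of the scheduling-pool approach (this assumes spark.scheduler.mode=FAIR is set and a fair scheduler allocation file defines "pool1"/"pool2"; sc and sqlContext are assumed to be in scope as in your snippet, and query1/query2 are placeholders):
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

val query1: String = ??? // your first SQL statement (placeholder)
val query2: String = ??? // your second SQL statement (placeholder)

// setLocalProperty is per-thread, so each Future's jobs land in their own pool
val q1 = Future {
  sc.setLocalProperty("spark.scheduler.pool", "pool1")
  sqlContext.sql(query1).count()
}
val q2 = Future {
  sc.setLocalProperty("spark.scheduler.pool", "pool2")
  sqlContext.sql(query2).count()
}
Await.result(Future.sequence(Seq(q1, q2)), Duration.Inf)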
I have a modest-sized XML file (200 MB, bz2) that I am loading using spark-xml on an AWS EMR cluster with 1 master and two core nodes, each with 8 CPUs and 32 GB RAM.
import org.apache.spark.sql.SQLContext
import com.databricks.spark.xml._

val sqlContext = new SQLContext(sc)
val experiment = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "EXPERIMENT")
  .load("s3n://bucket/path/meta_experiment_set.xml.bz2")
This load takes quite a while and, from what I can tell, is done with only one partition. Is it possible to tell Spark to partition the file on loading to make better use of the compute resources? I know I can repartition after loading.
You can repartition to increase the parallelism:
experiment.repartition(200)
where 200 is the number of partitions you want (typically a small multiple of the total number of executor cores).
See repartition
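A minimal usage sketch (200 is just an illustrative value; with your file the initial load apparently still happens in a single task, but everything downstream of the repartition is spread across the cluster):
val experimentPartitioned = experiment.repartition(200)
experimentPartitioned.count() // materializes the repartitioned DataFrame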
I recently discovered that adding parallel computing (e.g. using parallel collections) inside UDFs increases performance considerably, even when running Spark in local[1] mode or using YARN with 1 executor and 1 core.
E.g. in local[1] mode, the Spark job consumes as much CPU as possible (i.e. 800% if I have 8 cores, measured using top).
This seems strange, because I thought Spark (or YARN) limits the CPU usage per Spark application?
So I wonder why that is, and whether it's recommended to use parallel processing/multi-threading in Spark, or whether I should stick to Spark's parallelization pattern.
Here is an example to play with (times measured in YARN client mode with 1 instance and 1 core):
case class MyRow(id:Int,data:Seq[Double])
// create dataFrame
val rows = 10
val points = 10000
import scala.util.Random.nextDouble
val data = {1 to rows}.map{i => MyRow(i, Stream.continually(nextDouble()).take(points))}
val df = sc.parallelize(data).toDF().repartition($"id").cache()
df.show() // trigger computation and caching
// some expensive dummy-computation for each array-element
val expensive = (d:Double) => (1 to 10000).foldLeft(0.0){case(a,b) => a*b}*d
val serialUDF = udf((in:Seq[Double]) => in.map{expensive}.sum)
val parallelUDF = udf((in:Seq[Double]) => in.par.map{expensive}.sum)
df.withColumn("sum",serialUDF($"data")).show() // takes ~ 10 seconds
df.withColumn("sum",parallelUDF($"data")).show() // takes ~ 2.5 seconds
Spark does not limit CPU directly; instead it defines the number of concurrent task threads Spark creates. So for local[1] it basically runs one task at a time. When you do in.par.map{expensive} you are creating threads that Spark does not manage, and which are therefore not covered by this limit. In other words, you told Spark to limit itself to a single thread and then created other threads without Spark knowing about it.
In general, it is not a good idea to spawn parallel threads inside a Spark operation. Instead, it is better to tell Spark how many threads it can work with and make sure you have enough partitions for parallelism.
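If you do want to keep the parallelism inside Spark, here is a hedged sketch of that alternative (it assumes you give Spark more than one core, e.g. local[8]; expensiveUDF below just re-wraps the expensive function from your example): explode the array so Spark distributes the per-element work across tasks.
import org.apache.spark.sql.functions._

// Wrap the same per-element dummy computation in a scalar UDF
val expensiveUDF = udf((d: Double) => (1 to 10000).foldLeft(0.0){ case (a, b) => a * b } * d)

val summed = df
  .select($"id", explode($"data").as("d"))   // one row per array element
  .repartition(8)                            // enough partitions for the cores you actually have
  .withColumn("contrib", expensiveUDF($"d"))
  .groupBy($"id")
  .agg(sum($"contrib").as("sum"))
summed.show()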
Spark's CPU usage is controlled by configuration, for example:
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")
val sc = new SparkContext(conf)
Change local[2] to local[*] and Spark will utilize all of your CPU cores.
I am trying to write a (wordcount) program to simulate a use case where the network traffic is very high because of Spark's shuffle process. I have a 3-node Apache Spark cluster (2 cores each, 8 GB RAM each) configured with 1 master and 2 workers. I processed a 5 GB file for wordcount and saw the network traffic between the 2 worker nodes rise to about 1 GB over 10-15 minutes. I am looking for a way to push the traffic between nodes to at least 1 GB within 30-60 seconds. The inefficiency of the program or best practices don't matter in my current use case, as I am just trying to simulate traffic.
This is the program I have written:
val sc = new SparkContext(new SparkConf().setAppName("LetterCount-GroupBy-Map"))
val x = sc.textFile(args(0)).flatMap(t => t.split(" "))
val y = x.map(w => (w.charAt(0),w))
val z = y.groupByKey().mapValues(n => n.size)
z.collect().foreach(println)
More shuffled data can be generated by doing operations that do not combine data well on each node. For example, in the code you have written, the groupBy will combine common keys (i.e. do a partial group-by locally). Instead, choose keys with high cardinality (in the example above there are only 26). In addition, the size of the values after the map operation can be increased; in your case it's the text line, and you might want to use a very long string of values for each key.
Apart from this, if you take 2 different files/tables and join them on some parameter, that will also cause shuffling (see the sketch below).
Note: I am assuming the content does not matter and you are only interested in generating heavily shuffled data.
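A hedged sketch combining those ideas (high-cardinality keys, inflated values, and a join); the app name and sizes are illustrative only:
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ShuffleHeavy"))
val words = sc.textFile(args(0)).flatMap(_.split(" "))

// Use the whole word as the key (high cardinality) and inflate the value size
val inflated = words.map(w => (w, w * 50)) // "w * 50" repeats the string 50 times
val grouped = inflated.groupByKey().mapValues(_.size)

// Joining two independently derived datasets forces another shuffle
val counts = words.map(w => (w, 1L)).reduceByKey(_ + _)
println(grouped.join(counts).count())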