I have some huge files (19 GB, 40 GB, etc.). I need to execute the following algorithm on these files:
1. Read the file.
2. Sort it on the basis of one column.
3. Take the first 70% of the data:
   a) Take all the distinct records of a subset of the columns.
   b) Write them to the train file.
4. Take the last 30% of the data:
   a) Take all the distinct records of a subset of the columns.
   b) Write them to the test file.
I tried running the following code in Spark (using Scala).
import scala.collection.mutable.ListBuffer
import java.io.FileWriter
import org.apache.spark.sql.functions.{max, year}
val offers = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.option("delimiter", ",")
.load("input.txt")
val csvOuterJoin = offers.orderBy("utcDate")
val trainDF = csvOuterJoin.limit((csvOuterJoin.count*.7).toInt)
val maxTimeTrain = trainDF.agg(max("utcDate"))
val maxtimeStamp = maxTimeTrain.collect()(0).getTimestamp(0)
val testDF = csvOuterJoin.filter(csvOuterJoin("utcDate") > maxtimeStamp)
val inputTrain = trainDF.select("offerIdClicks","userIdClicks","userIdOffers","offerIdOffers").distinct
val inputTest = testDF.select("offerIdClicks","userIdClicks","userIdOffers","offerIdOffers").distinct
inputTrain.rdd.coalesce(1,false).saveAsTextFile("train.csv")
inputTest.rdd.coalesce(1,false).saveAsTextFile("test.csv")
This is how I initiate spark-shell:
./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --total-executor-cores 70 --executor-memory 10G --driver-memory 20G
I execute this code on a distributed cluster with 1 master and many slaves, each having a sufficient amount of RAM. As of now, this code ends up taking a lot of memory and I get Java heap space issues.
Is there a way to optimize the above code (preferably in Spark)? I'd appreciate any help in optimizing it.
The problem is that you don't distribute at all, and the source is here:
val csvOuterJoin = offers.orderBy("utcDate")
val trainDF = csvOuterJoin.limit((csvOuterJoin.count*.7).toInt)
The limit operation is not designed for large-scale operations; it moves all records to a single partition:
val df = spark.range(0, 10000, 1, 1000)
df.rdd.partitions.size
Int = 1000
// Take all records by limit
df.orderBy($"id").limit(10000).rdd.partitions.size
Int = 1
You can use the RDD API instead:
val ordered = offers.orderBy($"utcDate")
val cnt = ordered.count * 0.7

// With Spark 1.x (as in the question), use sqlContext.createDataFrame instead of spark.createDataFrame.
val train = spark.createDataFrame(ordered.rdd.zipWithIndex.filter {
  case (_, i) => i <= cnt
}.map(_._1), ordered.schema)

val test = spark.createDataFrame(ordered.rdd.zipWithIndex.filter {
  case (_, i) => i > cnt
}.map(_._1), ordered.schema)
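As a side note (my addition, not part of the original answer): ordered.rdd.zipWithIndex is traversed twice above, once for the train split and once for the test split, so the sort is executed twice. Caching the indexed RDD avoids that, roughly:
// Variant of the snippet above: cache the indexed RDD so the expensive sort runs only once.
val indexed = ordered.rdd.zipWithIndex.cache()

val train = spark.createDataFrame(indexed.filter {
  case (_, i) => i <= cnt
}.map(_._1), ordered.schema)

val test = spark.createDataFrame(indexed.filter {
  case (_, i) => i > cnt
}.map(_._1), ordered.schema)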
coalesce(1, false) means merging all the data into one partition, i.e. keeping 40 GB of data in the memory of one node.
Never try to get all the data into one file with coalesce(1, false).
Instead, just call saveAsTextFile (so the output looks like part-00001, part-00002, etc.) and then merge these partition files outside of Spark.
The merge operation depends on your file system. In the case of HDFS, you can use getmerge: http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/FileSystemShell.html#getmerge
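For example (a sketch only, reusing the spark-csv package already loaded in the question; the output directory names are made up), the distinct projections can be written out in parallel and merged afterwards:
// Write each split as a directory of part-* files instead of coalescing to one partition;
// the part files can then be merged outside of Spark (e.g. with getmerge on HDFS).
inputTrain.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("train_output")

inputTest.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("test_output")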
I intend to apply linear regression on a dataset. It works fine when I apply it to a subset of the data in *.txt format, as below:
// how could I read 26 *.tar.gz compressed files into a DataFrame?
val inputpath = "/Users/jasonzhu/Downloads/a.txt"
val rawDF = sc.textFile(inputpath).toDF()
val df = se.kth.spark.lab1.task2.Main.body(sqlContext, rawDF)
val splitDf = df.randomSplit(Array(0.95, 0.05), seed = 42L)
val (obsDF, testDF) =(splitDf(0).cache(), splitDf(1))
val maxIter = 6
val regParam = 0.07
val elasticNetParam = 0.1
println(s"maxIter=${maxIter}, regParam=${regParam}, elasticNetParam=${elasticNetParam}")
val myLR = new LinearRegression()
.setMaxIter(maxIter)
.setRegParam(regParam)
.setElasticNetParam(elasticNetParam)
val lrStage = 0
val pipeline = new Pipeline().setStages(Array(myLR))
val pipelineModel: PipelineModel = pipeline.fit(obsDF)
val lrModel = pipelineModel.stages(lrStage).asInstanceOf[LinearRegressionModel]
val trainingSummary = lrModel.summary
//print rmse of our model
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
//do prediction - print first k
val predictedDF = pipelineModel.transform(testDF)
predictedDF.show(5, false)
After this spike, I intend to apply the linear regression model to the whole dataset, which resides in 26 *.tar.gz files. I'd like to know how I should read these compressed files into a Spark DataFrame and consume them efficiently by taking advantage of Spark's parallelism. Thanks!
The textFile() method can take wildcards as well. From the documentation:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
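For instance, a sketch (the path is hypothetical; also note that Spark decompresses plain .gz files transparently, whereas .tar.gz archives may need to be unpacked first):
// Read every .gz file under the directory into one RDD, then convert it to a DataFrame.
val rawDF = sc.textFile("/Users/jasonzhu/Downloads/data/*.gz").toDF()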
Start with an empty RDD and run a loop that reads each of the files as an RDD, adding it to the accumulated RDD with a union operation in each iteration.
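A rough sketch of that loop (the file paths are hypothetical):
// Start with an empty RDD and union each file's RDD into it.
// Alternatively, sc.union(paths.map(sc.textFile(_))) does the same in a single call.
val paths = (1 to 26).map(i => s"/path/to/data/part_$i.gz")
var combined = sc.emptyRDD[String]
for (p <- paths) {
  combined = combined.union(sc.textFile(p))
}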
I'm new to Spark + Scala and still developing my intuition. I have a file containing many samples of data. Every 2048 lines represent a new sample. I'm attempting to convert each sample into a vector and then run them through a k-means clustering algorithm. The data file looks like this:
123.34 800.18
456.123 23.16
...
When I'm playing with a very small subset of the data, I create an RDD from the file like this:
val fileData = sc.textFile("hdfs://path/to/file.txt")
and then create the vector using this code:
val freqLineCount = 2048
val numSamples = 200
val freqPowers = fileData.map( _.split(" ")(1).toDouble )
val allFreqs = freqPowers.take(numSamples*freqLineCount).grouped(freqLineCount)
val lotsOfVecs = allFreqs.map(spec => Vectors.dense(spec) ).toArray
val lotsOfVecsRDD = sc.parallelize( lotsOfVecs ).cache()
val numClusters = 2
val numIterations = 2
val clusters = KMeans.train(lotsOfVecsRDD, numClusters, numIterations)
The key here is that I can call .grouped on the collected array and it yields groups of 2048 sequential values each. Those are then trivial to convert to vectors and run through the KMeans training algorithm.
I'm attempting to run this code on a much larger data set and am running into java.lang.OutOfMemoryError: Java heap space errors, presumably because I'm calling the take method on my freqPowers variable and then performing operations on that data on the driver.
How would I go about achieving my goal of running KMeans on this data set keeping in mind that
each data sample occurs every 2048 lines in the file (so the file should be parsed somewhat sequentially)
this code needs to run on a distributed cluster
I need to not run out of memory :)
thanks in advance
You can do something like:
val freqLineCount = 2048
val freqPowers = fileData.map(_.split(" ")(1).toDouble)
// Replacement for your current take/grouped code.
val groupedRDD = freqPowers.zipWithIndex().groupBy(_._2 / freqLineCount)
val vectorRDD = groupedRDD.map(grouped => Vectors.dense(grouped._2.map(_._1).toArray))
val numClusters = 2
val numIterations = 2
val clusters = KMeans.train(vectorRDD, numClusters, numIterations)
The replacement code uses zipWithIndex() and integer division of the index to group the RDD elements into chunks of freqLineCount. After the grouping, the elements in question are extracted into their actual vectors.
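One caveat worth adding (my note, not part of the original answer): groupBy does not guarantee the order of elements within each group, so if the 2048 values of a sample must stay in file order, sort each group by its original index before building the vector, roughly:
// Keep the 2048 values of each sample in their original file order.
val orderedVectorRDD = groupedRDD.map { case (_, grouped) =>
  Vectors.dense(grouped.toArray.sortBy(_._2).map(_._1))
}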
I have a cluster of 4 machines, 1 master and three workers, each with 128G memory and 64 cores. I'm using Spark 1.5.0 in standalone mode. My program reads data from Oracle tables using JDBC, then does ETL, manipulates the data, and runs machine learning tasks like k-means.
I have a DataFrame (myDF.cache()) which is the result of joining two other DataFrames, and it is cached. The DataFrame contains 27 million rows and the size of the data is around 1.5 GB. I need to filter the data and calculate 24 histograms as follows:
val h1 = myDF.filter("pmod(idx, 24) = 0").select("col1").histogram(arrBucket)
val h2 = myDF.filter("pmod(idx, 24) = 1").select("col1").histogram(arrBucket)
// ......
val h24 = myDF.filter("pmod(idx, 24) = 23").select("col1").histogram(arrBucket)
Problems:
Since my DataFrame is cached, I expect the filter, select, and histogram to be very fast. However, the actual time is about 7 seconds for each calculation, which is not acceptable. The UI shows that GC time takes 5 seconds and Task Deserialization Time 4 seconds. I've tried different JVM parameters but cannot improve further. Right now I'm using:
-Xms25G -Xmx25G -XX:MaxPermSize=512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
-XX:ParallelGCThreads=32 \
-XX:ConcGCThreads=8 -XX:InitiatingHeapOccupancyPercent=70
What puzzles me is that the size of the data is nothing compared with the available memory. Why does GC kick in every time filter/select/histogram runs? Is there any way to reduce the GC time and Task Deserialization Time?
I have to compute h[1-24] in parallel instead of sequentially. I tried Future, something like:
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
val f1 = Future{myDF.filter("pmod(idx, 24) = 1").count}
val f2 = Future{myDF.filter("pmod(idx, 24) = 2").count}
val f3 = Future{myDF.filter("pmod(idx, 24) = 3").count}
val future = for {c1 <- f1; c2 <- f2; c3 <- f3} yield {
c1 + c2 + c3
}
val summ = Await.result(future, 180 second)
The problem is that Future here only means the jobs are submitted to the scheduler near-simultaneously, not that they end up being scheduled and run simultaneously. Using Future here doesn't improve performance at all.
How to make the 24 computation jobs run simultaneously?
A couple of things you can try:
Don't compute pmod(idx, 24) over and over again. Instead, you can simply compute it once:
import org.apache.spark.sql.functions.{pmod, lit}
val myDfWithBuckets = myDF.withColumn("bucket", pmod($"idx", lit(24)))
Use SQLContext.cacheTable instead of cache. It stores the table using compressed in-memory columnar storage, which can be used to access only the required columns and, as stated in the Spark SQL and DataFrame Guide, "will automatically tune compression to minimize memory usage and GC pressure".
myDfWithBuckets.registerTempTable("myDfWithBuckets")
sqlContext.cacheTable("myDfWithBuckets")
If you can, cache only the columns you actually need instead of projecting each time.
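For example (a sketch only; the column names are the ones used in the question):
// Register and cache only the two columns the histograms actually need.
val slim = myDfWithBuckets.select($"bucket", $"col1")
slim.registerTempTable("myDfWithBucketsSlim")
sqlContext.cacheTable("myDfWithBucketsSlim")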
It is not clear to me what the source of the histogram method is (do you convert to RDD[Double] and use DoubleRDDFunctions.histogram?) or what the argument is, but if you want to compute all the histograms at the same time you can try to group by bucket and apply histogram once, for example using the histogram_numeric UDF:
import org.apache.spark.sql.functions.callUDF
val n: Int = ???
myDfWithBuckets
.groupBy($"bucket")
.agg(callUDF("histogram_numeric", $"col1", lit(n)))
If you use predefined ranges you can obtain a similar effect using a custom UDF.
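A rough sketch of that idea, assuming arrBucket is the sorted Array[Double] of bucket boundaries from the question:
import org.apache.spark.sql.functions.udf

// Map each value to the index of the highest boundary not exceeding it,
// then count per (bucket, bin) to get all 24 histograms in a single pass.
val toBin = udf((x: Double) => arrBucket.lastIndexWhere(_ <= x))

val histogramsByBucket = myDfWithBuckets
  .withColumn("bin", toBin($"col1"))
  .groupBy($"bucket", $"bin")
  .count()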
Notes
How do you extract the values computed by histogram_numeric? First, let's create a small helper:
import org.apache.spark.sql.Row
def extractBuckets(xs: Seq[Row]): Seq[(Double, Double)] =
xs.map(x => (x.getDouble(0), x.getDouble(1)))
Now we can map using pattern matching as follows:
import org.apache.spark.rdd.RDD
// histograms is the DataFrame produced by the groupBy/agg above
val histogramsRDD: RDD[(Int, Seq[(Double, Double)])] = histograms.map {
  case Row(k: Int, hs: Seq[Row @unchecked]) => (k, extractBuckets(hs)) }
I'm trying to fuzzy join two datasets, one of quotes and one of sales. For argument's sake, the joining attributes are first name, surname, dob and email.
I have 26m+ quotes and 1m+ sales. Customers may not have entered accurate information for one or more of the attributes, so I'm giving each pair a score for the match: (1,1,1,1) where all attributes match, (0,0,0,0) where none match.
So I end up with something similar to
q1, s1, (0,0,1,0)
q1, s2, (0,1,0,1)
q1, s3, (1,1,1,1)
q2, s1, (1,0,0,1)
...
q26000000 s1 (0,1,0,0)
So I think this is equivalent to a Cartesian product, which I'm managing by making a large number of partitions for the quotes:
val quotesRaw = sc.textFile(....)
val quotes = quotesRaw.repartition(quotesRaw.count().toInt / 100000)
val sales = sc.textFile(...)
val sb = sc.broadcast(sales.collect())
quotes.mapPartitions(p =>
  p.flatMap(q =>
    sb.value.map(s =>
      (q._1, s._1, (if (q._2 == s._2) 1 else 0 /* etc. for the other attributes */))
    )
  )
)
This all works if I keep the numbers low, like 26m quotes but only 1000 sales, but if I run it with all the sales it just stops responding.
I'm running it with the following config:
spark-submit --conf spark.akka.frameSize=1024 \
--conf spark.executor.memory=3g --num-executors=30 \
--driver-memory 6g --class SalesMatch --deploy-mode client \
--master yarn SalesMatching-0.0.1-SNAPSHOT.jar \
hdfs://cluster:8020/data_import/Sales/SourceSales/2014/09/01/SourceSales_20140901.txt \
hdfs://cluster:8020/data_import/CDS/Enquiry/2014/01/01/EnquiryBackFill_20140101.txt \
hdfs://cluster:8020/tmp/_salesdata_matches_new
Is there anything that jumps out as obviously incorrect here?
Assuming 100k quotes per partition and 11M sales of total size 40MB, your code generates roughly 4TB of data per partition, so it is rather unlikely your workers can handle this, and it definitely cannot be done in memory.
I assume you're interested only in close matches, so it makes sense to filter early. Simplifying your code a little (as far as I can tell, there is no reason to use mapPartitions):
// Check if a match is close enough, where T is the type of (q._1, s._1, (...))
def isCloseMatch(m: T): Boolean = ???
quotes.flatMap(q => sb.value
.map(s => (q._1, s._1, (....))) // Map as before
.filter(isCloseMatch) // yield only close matches
)
General remarks:
Creating a broadcast from an RDD is an expensive process. First you have to transfer all the data to the driver and then distribute it among the workers. This means repeated serialization/deserialization, network traffic, and the cost of storing the data.
For relatively simple operations like this, it could be a good idea to use the high-level Spark SQL API:
import org.apache.spark.sql.DataFrame
val quotesDF: DataFrame = ???
val salesDF: DataFrame = ???
val featureCols: Seq[String] = ???
val threshold: Int = ???
val inds = featureCols // Boolean columns
.map(col => (quotesDF(col) === salesDF(col)).alias(s"${col}_ind"))
val isSimilar = inds // sum(q == s) > threshold
.map(c => c.cast("integer").alias(c.toString))
.reduce(_ + _)
.geq(threshold)
val combined = quotesDF
.join(salesDF, isSimilar, "left")
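If you also need the per-attribute indicator tuples from the question, the same inds columns can be selected alongside the id columns (quoteId and saleId here are placeholders for whatever ids your DataFrames actually carry):
// Keep the individual match indicators next to the joined ids.
val scored = combined.select(
  (Seq(quotesDF("quoteId"), salesDF("saleId")) ++ inds): _*
)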
I have a CSV file storing user-item data of dimension 6,365 x 214, and I am finding user-user similarity by using columnSimilarities() of org.apache.spark.mllib.linalg.distributed.CoordinateMatrix.
My code looks like this:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix,
MatrixEntry, CoordinateMatrix}
import org.apache.spark.rdd.RDD
def rddToCoordinateMatrix(input_rdd: RDD[String]) : CoordinateMatrix = {
// Convert RDD[String] to RDD[Tuple3]
val coo_matrix_input: RDD[Tuple3[Long,Long,Double]] = input_rdd.map(
line => line.split(',').toList
).map{
e => (e(0).toLong, e(1).toLong, e(2).toDouble)
}
// Convert RDD[Tuple3] to RDD[MatrixEntry]
val coo_matrix_matrixEntry: RDD[MatrixEntry] = coo_matrix_input.map(e => MatrixEntry(e._1, e._2, e._3))
// Convert RDD[MatrixEntry] to CoordinateMatrix
val coo_matrix: CoordinateMatrix = new CoordinateMatrix(coo_matrix_matrixEntry)
return coo_matrix
}
// Read CSV File to RDD[String]
val input_rdd: RDD[String] = sc.textFile("user_item.csv")
// Read RDD[String] to CoordinateMatrix
val coo_matrix = rddToCoordinateMatrix(input_rdd)
// Transpose CoordinateMatrix
val coo_matrix_trans = coo_matrix.transpose()
// Convert CoordinateMatrix to RowMatrix
val mat: RowMatrix = coo_matrix_trans.toRowMatrix()
// Compute similar columns perfectly, with brute force
// Return CoordinateMatrix
val simsPerfect: CoordinateMatrix = mat.columnSimilarities()
// CoordinateMatrix to RDD[MatrixEntry]
val simsPerfect_entries = simsPerfect.entries
simsPerfect_entries.count()
// Write results to file
val results_rdd = simsPerfect_entries.map(line => line.i+","+line.j+","+line.value)
results_rdd.saveAsTextFile("similarity-output")
// Close the REPL terminal
System.exit(0)
And when I run this script in spark-shell, I get the following error after running simsPerfect_entries.count():
java.lang.OutOfMemoryError: GC overhead limit exceeded
Updated:
I have already tried many solutions given by others, but without success:
1. Increasing the amount of memory to use per executor process: spark.executor.memory=1g
2. Decreasing the number of cores to use for the driver process: spark.driver.cores=1
Please suggest some way to resolve this issue.
All Spark transformations are lazy until you actually materialize the result. When you define RDD-to-RDD data manipulations, Spark just chains the operations together without performing the actual computation. So when you call simsPerfect_entries.count(), the whole chain of operations is executed and you get your number.
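A toy illustration (unrelated to your data):
// Nothing executes while the transformations are being defined...
val lazyChain = sc.parallelize(1 to 1000000).map(_ * 2).filter(_ % 3 == 0)
// ...the whole chain runs only when an action such as count() is called.
val n = lazyChain.count()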
The error GC overhead limit exceeded means that JVM garbage collector activity was so high that execution of your code was stopped. GC activity can be this high for the following reasons:
You produce too many small objects and immediately discard them. It looks like you're not.
Your data does not fit into your JVM heap, as when you try to load a 2GB text file into RAM but have only 1GB of JVM heap. This looks like your case.
To fix this issue, try to increase the amount of JVM heap on:
your worker nodes, if you have a distributed Spark setup;
your spark-shell app.
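For example (the values are illustrative only, not tuned for this workload), the heap sizes can be passed when launching spark-shell, in the same style as the invocation shown in the first question:
./bin/spark-shell --executor-memory 8G --driver-memory 8G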