Spark optimization - joins - very low number of tasks - OOM - PostgreSQL

My Spark application fails with this error: Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
This is what I get when I inspect the container log: java.lang.OutOfMemoryError: Java heap space
My application mainly gets a table and then joins it with different tables that I read from AWS S3:
var result = readParquet(table1)
val table2 = readParquet(table2)
result = result.join(table2 , result(primaryKey) === table2(foreignKey))
val table3 = readParquet(table3)
result = result.join(table3 , result(primaryKey) === table3(foreignKey))
val table4 = readParquet(table4)
result = result.join(table4 , result(primaryKey) === table4(foreignKey))
and so on
My application fails when I try to save my result DataFrame to PostgreSQL using:
result.toDF(df.columns.map(x => x.toLowerCase()): _*).write
.mode("overwrite")
.format("jdbc")
.option(JDBCOptions.JDBC_TABLE_NAME, table)
.save()
On my failed join stage I have a very low number of tasks: 6 tasks for 4 executors.
Why does my stage generate 2 jobs?
The first one completed with 426 tasks:
and the second one is failing:
My spark-submit conf :
dynamicAllocation = true
num core = 2
driver memory = 6g
executor memory = 6g
max num executor = 10
min num executor = 1
spark.default.parallelism = 400
spark.sql.shuffle.partitions = 400
I tried with more resources but got the same problem:
num core = 5
driver memory = 16g
executor memory = 16g
num executor = 20
I think that all the data goes to the same partition/executor even with a default of 400 partitions, and this causes an OOM error.
I tried (without success):
persisting data
broadcast join, but my table is not small enough to broadcast at the end.
repartitioning to a higher number (4000) and doing a count between each join to force an action:
my main table seems to grow very fast:
(number of rows) 40 -> 68 -> 7304 -> 946 832 -> 123 032 864 -> 246 064 864 -> (too much time after)
However the data size seems very low.
If I look at the task metrics, an interesting thing is that my data seems skewed (I am really not sure):
In the last count action, I can see that ~120 tasks perform the action, with ~10MB of input data for 100 records and 12 seconds, and the other 3880 tasks do absolutely nothing (3ms, 0 records, 16B (metadata?)):

driver memory = 16g is too high and not needed. Use that much only when you have a huge amount of data to bring to the driver with actions like collect(); make sure to increase spark.driver.maxResultSize if that is the case.
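For example, a minimal sketch of setting that limit when building the session (the app name and the 4g value are just illustrations, not from the question):
import org.apache.spark.sql.SparkSession
// Raise the cap on results collected to the driver; only relevant for large collect()-style actions.
val spark = SparkSession.builder()
  .appName("join-pipeline")                    // hypothetical app name
  .config("spark.driver.maxResultSize", "4g")  // default is 1g
  .getOrCreate()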
You can do the following things:
-- Repartition while reading the files: readParquet(table1).repartition(x). If one of the tables is small, you can broadcast it and remove the join; instead, use mapPartitions with a broadcast variable as a lookup cache (see the sketch after these two options).
(OR)
-- Select a column that is uniformly distributed and repartition your table accordingly using that particular column.
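A minimal Scala sketch of both suggestions; the paths, key columns, and partition count are placeholders, not values from the original question:
import org.apache.spark.sql.functions.{broadcast, col}
// 1) Repartition right after reading, keyed by the join column, so the join work is spread over more tasks.
val table1 = spark.read.parquet("s3://bucket/table1")   // placeholder path
  .repartition(400, col("primaryKey"))
// 2) If one side is small enough, hint Spark to broadcast it so the big side is not shuffled.
val table2 = spark.read.parquet("s3://bucket/table2")   // placeholder path
val joined = table1.join(broadcast(table2), table1("primaryKey") === table2("foreignKey"))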
Two points I need to press by looking at the above stats: your job has a high scheduling delay caused by too many tasks, and in your task stats some tasks are launched with 10 bytes of input data and some with 9MB... obviously there is data skew here. As you said, the first job completed with 426 tasks, but with a repartition count of 4000 it should launch more tasks.
Please look at https://towardsdatascience.com/the-art-of-joining-in-spark-dcbd33d693c for more insights.

Related

How to load large amount of data from MySQL and save as text file?

I'm fetching a large amount of data from a MySQL database using LIMIT and OFFSET like:
var offset = 0
for (s <- a to partition) {
  val query = "(select * from destination LIMIT 100000 OFFSET " + offset + ") as src"
  data = data.union(spark.read.jdbc(url, query, connectionProperties).rdd.map(_.mkString(","))).persist(StorageLevel.DISK_ONLY)
  offset += 100000
}
val dest = data.collect.toArray
val s = spark.sparkContext.parallelize(dest, 1).persist(StorageLevel.DISK_ONLY).saveAsTextFile("/home/hduser/Desktop/testing")
For a small amount of data it works fine, whereas for a large amount of data it throws an error like java.lang.OutOfMemoryError: Java heap space. If I could persist val dest = data.collect.toArray it would work as expected. Sorry for such a naive question, I'm new to Spark.
Partition method:
val query = "(select * from destination) as dest"
val options = Map(
  "url" -> "jdbc:mysql://192.168.175.35:3306/sample?useSSL=false",
  "dbtable" -> query,
  "user" -> "root",
  "password" -> "root")
val destination = spark.read.options(options).jdbc(options("url"), options("dbtable"), "0", 1, 5, 4, new java.util.Properties()).rdd.map(_.mkString(","))
  .persist(StorageLevel.DISK_ONLY).saveAsTextFile("/home/hduser/Desktop/testing")
Thank you
I'm fetching large amount of data
That's why you use Spark, isn't it? :)
for (s <- a to partition)
val dest = data.collect.toArray
spark.sparkContext.parallelize(dest, 1)
NOTE: Don't do that. I'd even call it a Spark anti-pattern where you load a dataset on executors (from MySQL using JDBC) only to transfer this "large amount of data" to the driver that in turn will transfer it back to the executors to save it to disk.
It's as if you wanted to get rid of Spark doing these network round trips.
spark.read.jdbc supports partitioning your dataset at load time out of the box using the partitionColumn, lowerBound, upperBound options (see JDBC To Other Databases) or the (undocumented) predicates option.
partitionColumn, lowerBound, upperBound describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
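A minimal sketch of a partitioned JDBC read for this case; it assumes the destination table has a numeric id column to split on, and the bounds and partition count are only illustrative:
// Spark issues 8 parallel queries, each covering a range of "id" values.
val destination = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://192.168.175.35:3306/sample?useSSL=false")
  .option("dbtable", "destination")
  .option("user", "root")
  .option("password", "root")
  .option("partitionColumn", "id")   // assumed numeric column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")   // only used to compute the stride, not to filter rows
  .option("numPartitions", "8")
  .load()
// Write straight from the executors; no collect() on the driver.
destination.rdd.map(_.mkString(",")).saveAsTextFile("/home/hduser/Desktop/testing")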
Let Spark do its job(s).

How to tune mapping/filtering on big datasets (cross joined from two datasets)?

Spark 2.2.0
I have the following code converted from a SQL script. It has been running for two hours and it's still running, even slower than SQL Server. Is anything done incorrectly?
The following is the plan:
Push table2 to all executors.
Partition table1 and distribute the partitions to the executors.
Each row in table2/t2 joins (cross join) with each partition of table1.
So the calculation on the result of the cross join can be run in a distributed/parallel way. (I wanted to, for example, supposing I have 16 executors, keep a copy of t2 on all 16 executors, then divide table1 into 16 partitions, one for each executor, and have each executor do the calculation on one partition of table1 and t2.)
case class Cols(Id: Int, F2: String, F3: BigDecimal, F4: Date, F5: String,
                F6: String, F7: BigDecimal, F8: String, F9: String, F10: String)
case class Result(Id1: Int, ID2: Int, Point: Int)

def getDataFromDB(source: String) = {
  import sqlContext.sparkSession.implicits._
  sqlContext.read.format("jdbc").options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConn,
    "dbtable" -> s"$source"
  )).load()
    .select("Id", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10")
    .as[Cols]
}

val sc = new SparkContext(conf)
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
println(table1.count()) // about 300K rows
val table2: Dataset[Cols] = getDataFromDB("table2") // ~20K rows
table2.take(1)
println(table2.count())
val t2 = sc.broadcast(table2)
import org.apache.spark.sql.{functions => func}
val j = table1.joinWith(t2.value, func.lit(true))
j.map(x => {
  val (l, r) = x
  Result(l.Id, r.Id,
    (if (l.F1 != null && r.F1 != null && l.F1 == r.F1) 3 else 0)
      + (if (l.F2 != null && r.F2 != null && l.F2 == r.F2) 2 else 0)
      + ..... // All kinds of similar expressions
      + (if (l.F8 != null && r.F8 != null && l.F8 == r.F8) 1 else 0)
  )
}).filter(x => x.Point >= 10)
println("Total count %d", j.count()) // This takes forever, the count will be about 100
How do I rewrite it in an idiomatic Spark way?
Ref: https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
(Somehow I feel as if I have seen the code already)
The code is slow because you use just a single task to load the entire dataset from the database using JDBC, and despite the cache it does not benefit from it.
Start by checking out the physical plan and the Executors tab in the web UI to find the single executor and the single task doing all the work.
You should use one of the following to fine-tune the number of tasks for loading:
Use partitionColumn, lowerBound, upperBound options for the JDBC data source
Use the predicates option (see the sketch below)
See JDBC To Other Databases in Spark's official documentation.
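For instance, a minimal sketch of the predicates variant for this code; the Id ranges are placeholders and assume Id is roughly evenly distributed, and .as[Cols] assumes the same implicits import used in getDataFromDB:
// Each predicate becomes one partition, i.e. one task issuing one query.
val predicates = Array(
  "Id BETWEEN 1 AND 100000",
  "Id BETWEEN 100001 AND 200000",
  "Id BETWEEN 200001 AND 300000"
)
val props = new java.util.Properties()
props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
val table1 = sqlContext.read.jdbc(jdbcSqlConn, "table1", predicates, props)
  .select("Id", "F2", "F3", "F4", "F5", "F6", "F7", "F8", "F9", "F10")
  .as[Cols]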
After you're fine with the loading, you should work on improving the last count action and add...another count action right after the following line:
val table1: Dataset[Cols] = getDataFromDB("table1").repartition(32).cache()
// trigger caching as it's lazy in Dataset API
table1.count
The reason why the entire query is slow is that you only mark table1 to be cached when an action gets executed, which is exactly at the end (!). In other words, cache does nothing useful and, more importantly, makes the query performance even worse.
Performance should also increase after you do table2.cache.count.
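For example, a small sketch following the same pattern as the table1 snippet above:
val table2: Dataset[Cols] = getDataFromDB("table2").cache()
table2.count  // materialize the cache before the cross join uses table2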
If you want to do cross join, use crossJoin operator.
crossJoin(right: Dataset[_]): DataFrame Explicit cartesian join with another DataFrame.
Please note the note from the scaladoc of crossJoin (no pun intended).
Cartesian joins are very expensive without an extra filter that can be pushed down.
The following requirement is already handled by Spark given all the optimizations available.
So the calculation on the result of the cross join can be run in a distributed/parallel way.
That's Spark's job (again, no pun intended).
The following requirement begs for broadcast.
I wanted to, for example, supposing I have 16 executors, keep a copy of t2 on all 16 executors, then divide table1 into 16 partitions, one for each executor, and have each executor do the calculation on one partition of table1 and t2.
Use broadcast function to hint Spark SQL's engine to use table2 in broadcast mode.
broadcast[T](df: Dataset[T]): Dataset[T] Marks a DataFrame as small enough for use in broadcast joins.
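Putting the two together, a minimal sketch (not the full scoring logic from the question):
import org.apache.spark.sql.functions.broadcast
// Explicit cartesian join with the small side broadcast to every executor,
// so table1 is not shuffled; the scoring map and filter go on top of this result.
val pairs = table1.crossJoin(broadcast(table2))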

Cassandra ignores FetchSize

I have a Cassandra table with 1000 rows. I am using the DataStax Java driver 2.1.8 with Cassandra 2.1.3.
I have set the fetchSize to 10 for a prepared select statement. The code in Scala:
val stmt = preparedStatement.bind()
stmt.setFetchSize(10)
if (nextPage != null)
  stmt.setPagingState(nextPage)
val rs = session.execute(stmt)
println(rs.all().size())
val nextPage = rs.getExecutionInfo().getPagingState()
I run this in a loop, passing the nextPage value each time, starting with a null value.
But the result returned ignores the fetchSize, although the PagingState object is created according to the fetchSize. Result counts for each run of the loop are...
1000
990
980
... and so on.
What I want is for the driver to return 10 results each time. What am I missing here?
Calling all() will automatically consume all remaining pages. See javadoc on fetchMoreResults for details on how to manually fetch batches of the result.
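A minimal sketch of page-by-page processing with the 2.x driver; it drains only the rows already fetched for the current page instead of calling all():
val stmt = preparedStatement.bind()
stmt.setFetchSize(10)
if (nextPage != null)
  stmt.setPagingState(nextPage)
val rs = session.execute(stmt)
// Rows already available for this page; iterating past this number would trigger the next fetch.
val rowsInPage = rs.getAvailableWithoutFetching
val it = rs.iterator()
for (_ <- 1 to rowsInPage) {
  val row = it.next()
  // process row ...
}
val newPagingState = rs.getExecutionInfo.getPagingState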

Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

I have a local file movies.dat formatted as movie_id:movie_title:genre. For example:
1:movie1:Comedy
2:movie2:Drama
3:movie3:Horror
...
I create an external table using the following command.
CREATE EXTERNAL TABLE movies(movie_id INT, movie_title String, genre String)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\:' -- need backslash!!
LOCATION '/exc103320/movies_copy'; -- name of the directory to copy the original file
Then, I load the data into the table with:
LOAD DATA LOCAL INPATH 'movies.dat' OVERWRITE INTO TABLE movies;
When I run SELECT * FROM movies LIMIT 3;
I see the first 3 rows.
When I run SELECT movie_id FROM movies LIMIT 3; I get the following error
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1420729875693_6595, Tracking URL = http://cshadoop1.utdallas.edu:8088/proxy/application_1420729875693_6595/
Kill Command = /usr/local/hadoop-2.4.1/bin/hadoop job -kill job_1420729875693_6595
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2015-03-29 17:14:54,820 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1420729875693_6595 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://cshadoop1.utdallas.edu:8088/cluster/app/application_1420729875693_6595
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Any idea why this happens?
I believe you don't need the backslash in the ROW FORMAT DELIMITED FIELDS TERMINATED BY clause.
Try the DDL statement like this and see if it works:
CREATE EXTERNAL TABLE movies(movie_id INT, movie_title String, genre String)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ':'
LOCATION '/exc103320/movies_copy';

delete row key from cassandra cli

I set my column family's gc_grace_seconds to 0,
but the row key is still not deleted; it remains in my column family.
create column family workInfo123
with column_type = 'Standard'
and comparator = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and key_validation_class = 'UTF8Type'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and populate_io_cache_on_flush = true
and gc_grace = 0
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'KEYS_ONLY'
and default_time_to_live = 0
and speculative_retry = 'NONE'
and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor'}
and index_interval = 128;
See below the output of:
[default#winoriatest] list workInfo123;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: a
-------------------
RowKey: xx
2 Rows Returned.
Elapsed time: 17 msec(s).
I am using cassandra-cli.
Should I change anything else?
UPDATE:
After using ./nodetool -host 127.0.0.1 compact:
[default#winoriatest] list workInfo123;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: xx
2 Rows Returned.
Elapsed time: 11 msec(s).
Why does xx remain?
When you delete a row in Cassandra, it does not get deleted straight away. Instead it is marked with a tombstone. The effect is that you still get a result for the key, but no columns will be delivered. The tombstone is required because
Cassandra data files become read-only once they are "full"; the tombstone is added to the currently open data file containing the deleted row, and
you have to give the cluster a chance to propagate the delete to all nodes holding a copy of the row.
For the row and its tombstone to be removed, a compaction is required. This process re-organizes the data files and, while it does that, it prunes deleted rows, provided the GC grace period of the tombstone has been reached. For single-node(!) clusters it is OK to set the grace period to 0, because the delete does not have to be propagated to any other node (that might be down at the point in time you issued the delete).
If you want to enforce the removal of deleted rows, you can trigger a flush (sync memory with data files) and a major compaction via the nodetool utility. E.g.
./nodetool flush your_key_space the_column_family && ./nodetool compact your_key_space the_column_family
After the compaction completes, the deleted rows should truly be gone.
The default GC grace period is ten days (864000 seconds). To remove the row key immediately, run:
UPDATE COLUMN FAMILY column_family_name WITH GC_GRACE = 0;
Execute the above CLI query, then follow with the nodetool flush and compact operations.