How to efficiently self-join an RDD of 1.7 million lines? - scala

I am new to Spark/Big data, and this totally has me stumped.
I have an RDD of around 1.7 million lines, of the form
SetA = (1,7,9,12,.....)
They are all integers, with one value per line.
I need to do a self-join on this RDD, so that I get pairs of elements satisfying certain conditions.
For example, like this:
result = ((1, 7), (1, 9), (12, 1),......)
The result is supposed to have around 24 million lines.
I tried doing a Cartesian, followed by several filters, like this:
setA.cartesian(setA).filter('do something').filter('do something more')
This works fine for small datasets and I get what is required, but for the huge dataset of 1.7 million lines, the job never completes even after waiting for several hours. The target time for completion is around 30 minutes.
The console keeps showing lines like this continuously (the RDD setA is cached):
2021-12-02 08:36:17,758 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:18,926 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:19,892 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:20,848 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:21,666 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:22,532 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:23,495 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:24,463 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:25,438 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:26,270 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:27,153 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:28,121 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:29,121 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:30,095 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:31,209 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:32,199 INFO storage.BlockManager: Found block rdd_2_0 locally
2021-12-02 08:36:33,033 INFO storage.BlockManager: Found block rdd_2_0 locally
When I just read the input and write it to a textfile, the job completes without issues in less than a minute.
I am pretty sure I am missing something basic here. Is the Cartesian product being performed first, and then the filters are being applied? I thought that RDD transformations aren't evaluated until the last step.
Is there a better way of doing a conditional self-join without using Cartesian?
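On the evaluation-order question: the filters are pipelined with the Cartesian product rather than run after it has been fully materialized, but every candidate pair still has to be generated and tested, and that count is what makes the job infeasible. The sketch below is plain Python rather than Spark. It first works out the number of candidate pairs, then shows one way to avoid enumerating them. Because the real conditions are not given in the question, it assumes a hypothetical condition (0 < |a - b| <= max_gap); the names windowed_self_join, brute_force_self_join and max_gap are made up for illustration.

```python
from bisect import bisect_left, bisect_right

N = 1_700_000
TARGET_SECONDS = 30 * 60

# Pairs a full Cartesian product has to generate and test.
candidate_pairs = N * N
# Pairs per second needed to finish within the 30-minute target.
required_rate = candidate_pairs / TARGET_SECONDS


def windowed_self_join(values, max_gap):
    """Return ordered pairs (a, b) with 0 < |a - b| <= max_gap.

    Hypothetical condition, for illustration only. Sorting lets each
    element look only at neighbours inside the window instead of at
    every other element.
    """
    ordered = sorted(values)
    pairs = []
    for a in ordered:
        lo = bisect_left(ordered, a - max_gap)
        hi = bisect_right(ordered, a + max_gap)
        pairs.extend((a, b) for b in ordered[lo:hi] if b != a)
    return pairs


def brute_force_self_join(values, max_gap):
    """Reference version: Cartesian product followed by a filter."""
    return [(a, b) for a in values for b in values if 0 < abs(a - b) <= max_gap]


sample = [1, 7, 9, 12, 30]
fast = sorted(windowed_self_join(sample, 5))
slow = sorted(brute_force_self_join(sample, 5))
print(candidate_pairs, round(required_rate), fast == slow)
```

In Spark the same idea is usually expressed by deriving a key (for example a bucket id) from each element and joining on that key, so that only elements whose keys match are ever paired. Whether that applies here depends on what the actual filter conditions are.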

Related

What does one enter on the command line to run spark in a bokeh serve app? Do I simply separate the two command line entries by &&?

My effort does not work:
/usr/local/spark/spark-2.3.2-bin-hadoop2.7/bin/spark-submit --driver-memory 6g --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.2 runspark.py && bokeh serve --show bokeh_app
runspark.py contains the instantiation of Spark, and bokeh_app is the folder of the Bokeh server app. Spark is being used to update a streaming dask dataframe.
WHAT HAPPENS:
The Spark instance starts and loads as it normally would without the Bokeh server. However, as soon as the Bokeh server app kicks in (i.e., the web page opens), the Spark instance shuts down. It doesn't report any errors in the console output.
OUTPUT BELOW:
2018-11-26 21:04:05 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4f0492c9{/static/sql,null,AVAILABLE,@Spark}
2018-11-26 21:04:06 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2018-11-26 21:04:06 INFO SparkContext:54 - Invoking stop() from shutdown hook
2018-11-26 21:04:06 INFO AbstractConnector:318 - Stopped Spark@4f3c4272{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
2018-11-26 21:04:06 INFO SparkUI:54 - Stopped Spark web UI at http://192.168.1.25:4041
2018-11-26 21:04:06 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-11-26 21:04:06 INFO MemoryStore:54 - MemoryStore cleared
2018-11-26 21:04:06 INFO BlockManager:54 - BlockManager stopped
2018-11-26 21:04:06 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-11-26 21:04:07 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-11-26 21:04:07 INFO SparkContext:54 - Successfully stopped SparkContext
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Shutdown hook called
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-c42ce0b3-d49e-48ce-962c-277b42166267
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bd448b2e-6b0f-467a-9e43-689542c42a6f
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bd448b2e-6b0f-467a-9e43-689542c42a6f/pyspark-117d2a10-7cb9-4eb3-b4d0-f92f9046522c
2018-11-26 21:04:08,542 Starting Bokeh server version 0.13.0 (running on Tornado 5.1.1)
2018-11-26 21:04:08,547 Bokeh app running at: http://localhost:5006/aion_analytics
2018-11-26 21:04:08,547 Starting Bokeh server with process id: 10769
OK, I found the answer. The idea is to embed the Bokeh server in the PySpark code instead of running the Bokeh server from the command line, and then use the spark-submit command as normal.
https://github.com/bokeh/bokeh/blob/1.0.1/examples/howto/server_embed/standalone_embed.py
I did exactly what is shown in the link above.
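A likely reason the original command line could not work: in a POSIX shell, `a && b` starts `b` only after `a` has exited with status 0, so the two commands never run side by side. That fits the log above, where the SparkContext is stopped at 21:04:07 and the Bokeh server starts at 21:04:08. The small Python sketch below demonstrates that ordering with echo placeholders standing in for the real commands; run_chained is a hypothetical helper, not part of Spark or Bokeh.

```python
import subprocess


def run_chained(first, second):
    """Run `first && second` through the shell and return stdout lines in order."""
    result = subprocess.run(
        f"{first} && {second}",
        shell=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


# Placeholder commands: the first prints and exits, just as spark-submit
# exits once runspark.py has finished; only then does the second start.
sequential = run_chained("echo spark-job-exited", "echo bokeh-started")

# If the first command fails, the second never runs at all.
skipped = run_chained("false", "echo bokeh-started")

print(sequential, skipped)
```

Embedding the server in the same process, as in the linked example, avoids the problem because Spark and Bokeh then share one lifetime.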

zeppelin spark context closed after one paragraph

I have a notebook in Zeppelin containing multiple paragraphs which were running fine earlier; suddenly, after a cluster restart, it has started behaving weirdly.
The first paragraph runs fine while anything that runs afterwards says Connection Refused.
I checked the log file zeppelin-interpreter-spark-root-mn.log in the $ZEPPELIN_HOME/logs folder (where mn is the machine name):
INFO [2018-02-21 21:42:43,301] ({dispatcher-event-loop-15} Logging.scala[logInfo]:54) - Removed broadcast_12_piece0 on mn5:45284 in memory (size: 88.2 KB, free: 2004.5 MB)
INFO [2018-02-21 21:42:43,401] ({Thread-3} Logging.scala[logInfo]:54) - Invoking stop() from shutdown hook
INFO [2018-02-21 21:42:43,412] ({Thread-3} AbstractConnector.java[doStop]:310) - Stopped Spark@7de3e842{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
INFO [2018-02-21 21:42:43,416] ({Thread-3} Logging.scala[logInfo]:54) - Stopped Spark web UI at http://10.28.37.82:4040
INFO [2018-02-21 21:42:43,440] ({Yarn application state monitor} Logging.scala[logInfo]:54) - Interrupting monitor thread
INFO [2018-02-21 21:42:43,442] ({Thread-3} Logging.scala[logInfo]:54) - Shutting down all executors
INFO [2018-02-21 21:42:43,443] ({dispatcher-event-loop-4} Logging.scala[logInfo]:54) - Asking each executor to shut down
INFO [2018-02-21 21:42:43,447] ({Thread-3} Logging.scala[logInfo]:54) - Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
INFO [2018-02-21 21:42:43,450] ({Thread-3} Logging.scala[logInfo]:54) - Stopped
INFO [2018-02-21 21:42:43,454] ({dispatcher-event-loop-9} Logging.scala[logInfo]:54) - MapOutputTrackerMasterEndpoint stopped!
INFO [2018-02-21 21:42:43,466] ({Thread-3} Logging.scala[logInfo]:54) - MemoryStore cleared
INFO [2018-02-21 21:42:43,466] ({Thread-3} Logging.scala[logInfo]:54) - BlockManager stopped
INFO [2018-02-21 21:42:43,467] ({Thread-3} Logging.scala[logInfo]:54) - BlockManagerMaster stopped
INFO [2018-02-21 21:42:43,471] ({dispatcher-event-loop-0} Logging.scala[logInfo]:54) - OutputCommitCoordinator stopped!
INFO [2018-02-21 21:42:43,472] ({Thread-3} Logging.scala[logInfo]:54) - Successfully stopped SparkContext
INFO [2018-02-21 21:42:43,473] ({Thread-3} Logging.scala[logInfo]:54) - Shutdown hook called
So the shutdown hook is getting called. I checked other posts on SO (like this and this), but they didn't help. The logs are not much help either.
Do I need to tweak the code to add additional logging to fix this problem? Has anyone already faced and resolved the same issue?
It turned out to be a case of bad logging. I had checked the YARN logs as well but couldn't find anything. The second paragraph was throwing a RuntimeException that wasn't visible in any of the logs; when I tried the same command in spark-shell, I realized what the problem was and fixed it.
Run the Scala command in spark-shell and see what exception it throws.

using boilerpipe with pyspark

I am using boilerpipe to extract text from HTML, but I have hit an issue that I have not been able to resolve. I have a list of 50k elements. I create an RDD of 1000 elements at a time, process it, and save the resulting RDD to HDFS. The error I get is this:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
response = connection.send_command(command)
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "/home/hadoopuser/CommonCrawl_Spark/CommonCrawl_Spark/all.py", line 265, in <module>
x = get_data(line[:-1],c)
File "/home/hadoopuser/CommonCrawl_Spark/CommonCrawl_Spark/all.py", line 208, in get_data
sc.parallelize(warcrecords).repartition(72).map(lambda s: classify(s)).saveAsTextFile(file_name)
File "/home/hadoopuser/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1552, in saveAsTextFile
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 327, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o40.saveAsTextFile
17/09/19 18:11:10 INFO SparkContext: Invoking stop() from shutdown hook
17/09/19 18:11:10 INFO SparkUI: Stopped Spark web UI at http://192.168.0.255:4040
17/09/19 18:11:10 INFO DAGScheduler: Job 0 failed: saveAsTextFile at NativeMethodAccessorImpl.java:0, took 14.746797 s
17/09/19 18:11:10 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at NativeMethodAccessorImpl.java:0) failed in 7.906 s due to Stage cancelled because SparkContext was shut down
17/09/19 18:11:10 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@ec3ca3)
17/09/19 18:11:10 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1505824870317,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down))
17/09/19 18:11:10 INFO StandaloneSchedulerBackend: Shutting down all executors
17/09/19 18:11:10 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
17/09/19 18:11:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/09/19 18:11:10 INFO MemoryStore: MemoryStore cleared
17/09/19 18:11:10 INFO BlockManager: BlockManager stopped
17/09/19 18:11:10 INFO BlockManagerMaster: BlockManagerMaster stopped
17/09/19 18:11:10 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/09/19 18:11:10 INFO SparkContext: Successfully stopped SparkContext
17/09/19 18:11:10 INFO ShutdownHookManager: Shutdown hook called
17/09/19 18:11:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-35ea0cd4-4b78-408b-8c3a-9966c1f84763/pyspark-b73e541b-1182-4449-96bc-26eabca1803d
17/09/19 18:11:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-35ea0cd4-4b78-408b-8c3a-9966c1f84763
The results for the first 1000 elements are saved to HDFS, but every batch after that throws the error above. What is the fix for this?
Removing this line from the code did the trick. I still don't know why.
from boilerpipe.extract import Extractor

Why are the executors getting killed by the driver?

The first stage of my spark job is quite simple.
It reads from a big number of files (around 30,000 files and 100GB in total) -> RDD[String]
does a map (to parse each line) -> RDD[Map[String,Any]]
filters -> RDD[Map[String,Any]]
coalesces (.coalesce(100, true))
When running it, I observe quite peculiar behavior. The number of executors grows until it reaches the limit I specified in spark.dynamicAllocation.maxExecutors (typically 100 or 200 in my application). Then it starts decreasing quickly (at approx. 14000/33428 tasks) and only a few executors remain. They are killed by the driver. When this task is done, the number of executors increases back to its maximum value.
Below is a screenshot of the number of executors at its lowest.
And here is a screenshot of the task summary.
I guess that these executors are killed because they are idle. But in that case, I do not understand why they would become idle. There are still a lot of tasks left to do in the stage...
Do you have any idea why this happens?
EDIT
More details about the driver logs when an executor is killed:
16/09/30 12:23:33 INFO cluster.YarnClusterSchedulerBackend: Disabling executor 91.
16/09/30 12:23:33 INFO scheduler.DAGScheduler: Executor lost: 91 (epoch 0)
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 91 from BlockManagerMaster.
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(91, server.com, 40923)
16/09/30 12:23:33 INFO storage.BlockManagerMaster: Removed 91 successfully in removeExecutor
16/09/30 12:23:33 INFO cluster.YarnClusterScheduler: Executor 91 on server.com killed by driver.
16/09/30 12:23:33 INFO spark.ExecutorAllocationManager: Existing executor 91 has been removed (new total is 94)
Logs on the executor:
16/09/30 12:26:28 INFO rdd.HadoopRDD: Input split: hdfs://...
16/09/30 12:26:32 INFO executor.Executor: Finished task 38219.0 in stage 0.0 (TID 26519). 2312 bytes result sent to driver
16/09/30 12:27:33 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/09/30 12:27:33 INFO storage.DiskBlockManager: Shutdown hook called
16/09/30 12:27:33 INFO util.ShutdownHookManager: Shutdown hook called
I'm seeing this problem on executors that are killed as a result of an idle timeout. I have an exceedingly demanding computational load, but it's mostly computed in a UDF, invisible to Spark. I believe that there's some Spark parameter that can be adjusted.
Try looking through the spark.executor parameters in https://spark.apache.org/docs/latest/configuration.html#spark-properties and see if anything jumps out.
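If the idle-timeout guess is right, the relevant settings are probably the dynamic allocation ones rather than the spark.executor ones: spark.dynamicAllocation.executorIdleTimeout (60s by default) controls how long an executor may sit idle before it is removed. The executor log in the question is consistent with that default. The Python sketch below measures the gap between the last "Finished task" line and the SIGTERM line, then renders some overrides as spark-submit flags. The values 300s and 50 are illustrative assumptions, and seconds_between and to_submit_flags are hypothetical helpers.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def seconds_between(earlier, later):
    """Seconds elapsed between two Spark log timestamps like '16/09/30 12:26:32'."""
    start = datetime.strptime(earlier, LOG_TIME_FORMAT)
    end = datetime.strptime(later, LOG_TIME_FORMAT)
    return int((end - start).total_seconds())


# Timestamps copied from the executor log in the question:
# the last "Finished task" line and the SIGTERM line.
idle_gap = seconds_between("16/09/30 12:26:32", "16/09/30 12:27:33")

# Real Spark property names; the values are illustrative assumptions only.
dynamic_allocation_overrides = {
    "spark.dynamicAllocation.executorIdleTimeout": "300s",
    "spark.dynamicAllocation.minExecutors": "50",
}


def to_submit_flags(conf):
    """Render a dict of properties as spark-submit --conf arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


print(idle_gap, " ".join(to_submit_flags(dynamic_allocation_overrides)))
```

Raising the timeout or the minimum executor count only hides the symptom. It does not explain why executors go idle while tasks are still pending, which is the open question here.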

Unable to connect to Spark master

I start my DataStax Cassandra instance with Spark:
dse cassandra -k
I then run this program (from within Eclipse):
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object Start {
  def main(args: Array[String]): Unit = {
    println("***** 1 *****")
    val sparkConf = new SparkConf().setAppName("Start").setMaster("spark://127.0.0.1:7077")
    println("***** 2 *****")
    val sparkContext = new SparkContext(sparkConf)
    println("***** 3 *****")
  }
}
And I get the following output
***** 1 *****
***** 2 *****
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/29 15:27:50 INFO SparkContext: Running Spark version 1.5.2
15/12/29 15:27:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/29 15:27:51 INFO SecurityManager: Changing view acls to: nayan
15/12/29 15:27:51 INFO SecurityManager: Changing modify acls to: nayan
15/12/29 15:27:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nayan); users with modify permissions: Set(nayan)
15/12/29 15:27:52 INFO Slf4jLogger: Slf4jLogger started
15/12/29 15:27:52 INFO Remoting: Starting remoting
15/12/29 15:27:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.1.88:55126]
15/12/29 15:27:53 INFO Utils: Successfully started service 'sparkDriver' on port 55126.
15/12/29 15:27:53 INFO SparkEnv: Registering MapOutputTracker
15/12/29 15:27:53 INFO SparkEnv: Registering BlockManagerMaster
15/12/29 15:27:53 INFO DiskBlockManager: Created local directory at /private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/blockmgr-21a96671-c33e-498c-83a4-bb5c57edbbfb
15/12/29 15:27:53 INFO MemoryStore: MemoryStore started with capacity 983.1 MB
15/12/29 15:27:53 INFO HttpFileServer: HTTP File server directory is /private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/spark-fce0a058-9264-4f2c-8220-c32d90f11bd8/httpd-2a0efcac-2426-49c5-982a-941cfbb48c88
15/12/29 15:27:53 INFO HttpServer: Starting HTTP Server
15/12/29 15:27:53 INFO Utils: Successfully started service 'HTTP file server' on port 55127.
15/12/29 15:27:53 INFO SparkEnv: Registering OutputCommitCoordinator
15/12/29 15:27:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/12/29 15:27:53 INFO SparkUI: Started SparkUI at http://10.0.1.88:4040
15/12/29 15:27:54 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/29 15:27:54 INFO AppClient$ClientEndpoint: Connecting to master spark://127.0.0.1:7077...
15/12/29 15:27:54 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@127.0.0.1:7077] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/12/29 15:28:14 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@1f22aef0 rejected from java.util.concurrent.ThreadPoolExecutor@176cb4af[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
So something is happening during the creation of the spark context.
When I look in $DSE_HOME/logs/spark, it is empty. I am not sure where else to look.
It turns out that the problem was the Spark library version AND the Scala version. DataStax was running Spark 1.4.1 and Scala 2.10.5, while my Eclipse project was using 1.5.2 and 2.11.7 respectively.
Note that BOTH the Spark library and Scala appear to have to match. I tried other combinations, but it only worked when both matched.
I am getting pretty familiar with this part of your posted error:
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://...
It can have numerous causes, pretty much all related to misconfigured IPs. First I would do whatever zero323 says, then here's my two cents: I have solved my own problems recently by using IP addresses, not hostnames, and the only config I use in a simple standalone cluster is SPARK_MASTER_IP.
Setting SPARK_MASTER_IP in $SPARK_HOME/conf/spark-env.sh on your master should then make the master web UI show the IP address you set:
spark://your.ip.address.numbers:7077
And your SparkConf setup can refer to that.
Having said that, I am not familiar with your specific implementation but I notice in the error two occurrences containing:
/private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/
Have you looked there to see if there's a logs directory? Is that where $DSE_HOME points? Alternatively, connect to the driver where it creates its web UI:
INFO SparkUI: Started SparkUI at http://10.0.1.88:4040
and you should see a link to an error log there somewhere.
More on the IP vs. hostname issue: this very old bug is marked as Resolved, but I have not figured out what they mean by Resolved, so I just tend toward IP addresses.