Hadoop - Map and reduce stuck at 0% all the time - Eclipse

I'm trying to implement the PageRank algorithm on the Hadoop platform with Eclipse, but I'm facing some unusual problems :). I tried this locally: installed Cygwin, set up Hadoop 0.19.2 (and 0.18.0), started the necessary daemons, and installed Eclipse 3.3.1. I uploaded a test input .txt file and then tried to run the WordCount example, or even a simple .java file, and I got this output (repeated about 100 times):
10/07/22 22:10:23 INFO mapred.FileInputFormat: Total input paths to process : 1
10/07/22 22:10:23 INFO mapred.JobClient: Running job: job_201007220415_0017
10/07/22 22:10:24 INFO mapred.JobClient: map 0% reduce 0%
Map and reduce stay at 0% the whole time. I tried Hadoop on a virtual machine and got the same situation.
I followed all the instructions from the Hadoop page and other useful pages, but that didn't resolve my problem. Any suggestions? :)

That sounds like a problem with your Hadoop setup rather than with Eclipse. Make sure you have all the pieces of your cluster running, i.e. the DataNode(s), TaskTracker(s), and JobTracker. If those are all running, it might be a problem with the way you're setting up the job.

Are you set on doing this in Java? If not, you can use a Ruby gem called Wukong that has a PageRank example: http://github.com/mrflip/wukong/tree/master/examples/pagerank/

Related

Running Scala module with databricks-connect

I've tried to follow the instructions here to set up databricks-connect with IntelliJ. My understanding is that I can run code from the IDE and it will run on the databricks cluster.
I added the jar directory from the miniconda environment and moved it above all of the maven dependencies in File -> Project Structure...
However, I think I did something wrong. When I tried to run my module, I got the following error:
21/07/17 22:44:24 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.
at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:221)
at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:201)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:413)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:262)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:291)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:495)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2834)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:1016)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:1010)
at com.*.sitecomStreaming.sitecomStreaming$.main(sitecomStreaming.scala:184)
at com.*.sitecomStreaming.sitecomStreaming.main(sitecomStreaming.scala)
The system memory being 259 GB makes me think it's trying to run locally on my laptop instead of on the dbx cluster? I'm not sure if this is correct, or what I can do to get this up and running properly...
Any help is appreciated!
The driver in databricks-connect always runs locally; only the executors run in the cloud. Also, the reported memory is in bytes, so 259522560 is ~256 MB; you can increase it using the option that the error message reports.
P.S. But if you're using Structured Streaming, then yes, that's a known limitation of databricks-connect.
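For reference, a minimal Scala sketch of where these settings go; the object name and the 2g value are placeholders, not taken from the question. Note that when the driver runs inside the IDE process, the heap Spark checks is the JVM's own -Xmx (set in the Run Configuration's VM options), whereas spark.driver.memory / --driver-memory apply when spark-submit launches the driver JVM:

import org.apache.spark.sql.SparkSession

object DriverMemoryExample {  // illustrative name, not the question's sitecomStreaming object
  def main(args: Array[String]): Unit = {
    // If launching from IntelliJ, also add e.g. "-Xmx2g" to the Run Configuration's
    // VM options, since the driver JVM is already running by the time this code executes.
    val spark = SparkSession.builder()
      .appName("driver-memory-example")    // placeholder app name
      .config("spark.driver.memory", "2g") // placeholder value; used when spark-submit launches the driver
      .getOrCreate()

    // ... your job logic ...

    spark.stop()
  }
}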

How can I execute an s3-dist-cp command within a spark-submit application

I have a jar file that is provided to spark-submit. Within a method in the jar, I'm trying to do:
import sys.process._
"s3-dist-cp --src hdfs:///tasks/ --dest s3://<destination-bucket>".!
I also installed s3-dist-cp on all the slaves along with the master.
The application starts and succeeds without error, but it does not move the data to S3.
This isn't a proper direct answer to your question, but I've used hadoop distcp (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html) instead and it successfully moved the data. In my tests it's quite slow compared to spark.write.parquet(path), though (when accounting for the time taken by the additional write to HDFS that is required in order to use hadoop distcp). I'm also very interested in the answer to your question; I think s3-dist-cp might be faster given the additional optimizations done by Amazon.
s3-dist-cp is now installed by default on the master node of an EMR cluster.
I was able to run s3-dist-cp from within the spark-submit application successfully when the Spark application is submitted in "client" mode.
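For what it's worth, here is a minimal Scala sketch of that call with the command's output and exit code surfaced, so a silent failure shows up in the driver logs. The helper name is illustrative and the destination bucket is still a placeholder:

import scala.sys.process._

// Hypothetical helper: runs s3-dist-cp on the driver (client mode, so it executes
// on the EMR master node where s3-dist-cp is installed) and logs its output.
def runS3DistCp(src: String, dest: String): Int = {
  val cmd = Seq("s3-dist-cp", "--src", src, "--dest", dest)
  val logger = ProcessLogger(out => println(s"[s3-dist-cp] $out"),
                             err => Console.err.println(s"[s3-dist-cp] $err"))
  val exitCode = cmd.!(logger)
  if (exitCode != 0) Console.err.println(s"s3-dist-cp exited with code $exitCode")
  exitCode
}

runS3DistCp("hdfs:///tasks/", "s3://<destination-bucket>") // replace the placeholder bucket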

Rundeck - any command execution fails when running on 5.8k nodes

I'm running a Rundeck server to delegate a simple script to 5.8k other Linux servers.
The very simple script is below:
#!/bin/bash
A=$(hostname)
echo "$A"
When I run the same job with a smaller number of targets (4089 nodes),
the commands work fine.
I tried looking at my service.log page, and it's not logging anything new.
Any ideas on how to run against all 5.8k nodes? And where should I look for errors?
Rundeck does not have a hard limit on the number of nodes; it depends on how many executions you want to run, how much RAM and disk space you have, and how many processors.
Maybe you need to increase the Java heap size:
https://rundeck.org/docs/administration/maintenance/tuning-rundeck.html#java-heap-size
And how to adapt this to your SSH plugin:
https://rundeck.org/docs/administration/maintenance/tuning-rundeck.html#built-in-ssh-plugins

Spark: long delay between jobs

So we are running a Spark job that extracts data, does some expensive data conversion, and writes to several different files. Everything is running fine, but I'm getting random, lengthy delays between a resource-intensive job finishing and the next job starting.
In the picture below, we can see that the job scheduled at 17:22:02 took 15 minutes to finish, which means I'm expecting the next job to be scheduled around 17:37:02. However, the next job was scheduled at 22:05:59, which is more than 4 hours after the previous job succeeded.
When I dig into the next job's Spark UI, it shows <1 sec scheduler delay, so I'm confused about where this 4-hour delay is coming from.
(Spark 1.6.1 with Hadoop 2)
Updated:
I can confirm that David's answer below is spot on; the way I/O operations are handled in Spark is a bit unexpected. (It makes sense that a file write essentially does a "collect" behind the curtain before it writes, considering ordering and/or other operations.) But I'm a bit discomforted by the fact that I/O time is not included in the job execution time. I guess you can see it in the "SQL" tab of the Spark UI, as queries are still running even when all jobs have succeeded, but you cannot dive into it at all.
I'm sure there are more ways to improve this, but the two methods below were sufficient for me (a short sketch follows the list):
reduce file count
set parquet.enable.summary-metadata to false
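A minimal sketch of both tweaks in Scala, using the Spark 1.6-style API from the question; it assumes a SparkContext sc and a DataFrame df, and the partition count and output path are placeholders:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") // skip _metadata / _common_metadata generation

df.coalesce(64)                      // fewer output files means fewer per-file moves on the driver
  .write
  .parquet("s3://my-bucket/output/") // hypothetical path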
I/O operations often come with significant overhead that occurs on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node:
Spark will write to temporary s3 directories, then move the files using the master node
Reading of text files often occurs on the master node
When writing parquet files, the master node will scan all the files post-write to check the schema
These issues can be solved by tweaking yarn settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
Discussion of writing I/O Overhead with Parquet and s3
Discussion of reading I/O Overhead "s3 is not a filesystem"
Problem:
I faced a similar issue when writing parquet data to S3 with PySpark on EMR 5.5.1. All workers would finish writing data to the _temporary directory in the output folder, and the Spark UI would show that all tasks had completed. But the Hadoop Resource Manager UI would not release resources for the application, nor mark it as complete. On checking the S3 bucket, it seemed like the Spark driver was moving the files one by one from the _temporary directory to the output bucket, which was extremely slow, and the whole cluster was idle except the driver node.
Solution:
The solution that worked for me was to use the committer class from AWS (EmrOptimizedSparkSqlParquetOutputCommitter) by setting the configuration property spark.sql.parquet.fs.optimized.committer.optimization-enabled to true.
e.g.:
spark-submit ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
or
pyspark ....... --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
Note that this property is available in EMR 5.19 or higher.
Result:
After running the Spark job on EMR 5.20.0 with the above solution, it did not create any _temporary directory, and all the files were written directly to the output bucket, so the job finished very quickly.
For more details:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html
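If you prefer to set the property in code rather than on the spark-submit/pyspark command line, here is a minimal Scala sketch (the answer above used pyspark; this assumes EMR 5.19+, where the property is accepted like any other Spark conf at session creation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("emr-optimized-committer") // placeholder name
  .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
  .getOrCreate()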

PBS - nodes are free, but they do not start a job

I am a new administrator of PBS. I downloaded and installed torque-4.2.6. I used the default configuration provided by torque.setup.
The OS is CentOS with kernel 2.6.18.
I stopped the firewall. I confirmed that ssh/scp works bi-directionally between the server and the nodes.
After configuration, everything looked fine; a small number of jobs finished well.
When I submitted 10000 jobs, about 70% of them finished, but the remainder never start. I found that the server_priv/jobs directory still contains the jobs.
I checked the log files... but I could not find any clue to the problem.
I checked disk space using df; there is 10% (more than 100 GB) free, which looks like enough to run PBS jobs.
Before I check other things, I am asking for help from others on this site.