Scalding Tutorial with HDFS: Data is missing from one or more paths in: List(tutorial/data/hello.txt) - scala

After configuring ssh and rsync when I try to run Scalding tutorial (https://github.com/Cascading/scalding-tutorial/) with command:
$ scripts/scald.rb --hdfs tutorial/Tutorial0.scala
I get the following error:
com.twitter.scalding.InvalidSourceException: [com.twitter.scalding.TextLineWrappedArray(tutorial/data/hello.txt)] Data is missing from one or more paths in: List(tutorial/data/hello.txt)
This error happens notwithstanding file tutorial/data/hello.txt really exists.
How to fix this?
Stdout:
$ scripts/scald.rb --hdfs tutorial/Tutorial0.scala
scripts/scald.rb:194: warning: already initialized constant SCALA_LIB_DIR
dkondratev#hadoop-n002.maxus.lan's password:
dkondratev#hadoop-n002.maxus.lan's password:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/phoenix/phoenix-4.0.0.2.1.2.1-471-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/07/07 19:05:45 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/07/07 19:05:45 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
Exception in thread "main" java.lang.Throwable: GUESS: Data is missing from the path you provided.
If you know what exactly caused this error, please consider contributing to GitHub via following link.
https://github.com/twitter/scalding/wiki/Common-Exceptions-and-possible-reasons#comtwitterscaldinginvalidsourceexception
at com.twitter.scalding.Tool$.main(Tool.scala:132)
at com.twitter.scalding.Tool.main(Tool.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

I think the problem you are having is that you are telling Scalding to run in HDFS, but the file you are providing as input is in your local file system, not in HDFS. Before running the example, upload the file to your HDFS:
hadoop fs -mkdir tutorial
hadoop fs -mkdir tutorial/data
hadoop fs -put tutorial/data/hello.txt tutorial/data/hello.txt

Try to pack your job into a fat jar using Maven shade plugin and then run your Scalding job via the hadoop command:
hadoop jar your-uber.jar com.twitter.scalding.Tool bar.foo.MyClassJob --hdfs --input ... --output ...

Related

Problems with Kafka Source initialization in Siddhi

Can't create stream from Kafka topic using Siddhi. Even if I create string with Design View.
I copied all required jars to lib and bundle folders. Even started Kafka with Zookeeper locally (dunno why I need it locally but nwm).
On tooling.sh start I have following error:
[2020-02-26 22:15:43,041] WARNING {org.wso2.carbon.launcher.extensions.OSGiLibBundleDeployerUtils lambda$getBundlesInfo$1} - Error when loading the OSGi bundle information from /home/Hed/StreamProcessor/siddhi-tooling-5.1.2/lib/kafka-clients-2.3.0.jar
java.io.IOException: Required bundle manifest headers do not exist
at org.wso2.carbon.launcher.extensions.OSGiLibBundleDeployerUtils.getBundleInfo(OSGiLibBundleDeployerUtils.java:183)
at org.wso2.carbon.launcher.extensions.OSGiLibBundleDeployerUtils.lambda$getBundlesInfo$1(OSGiLibBundleDeployerUtils.java:135)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:313)
at java.util.stream.StreamSpliterators$DistinctSpliterator.forEachRemaining(StreamSpliterators.java:1291)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
For this script:
#App:name("HelloKafka")
#App:description('Consume events from a Kafka Topic and publish to a different Kafka Topic')
#source(type='kafka',
topic.list='kafka_topic',
partition.no.list='0',
threading.option='single.thread',
group.id="group",
bootstrap.servers='localhost:9092',
#map(type='json'))
define stream SweetProductionStream (name string, amount double);
I have see error on Run command:
io.siddhi.core.exception.SiddhiAppCreationException: Error on 'HelloKafka' # Line: 10. Position: 26, near '#source(type='kafka',
topic.list='kafka_topic',
partition.no.list='0',
threading.option='single.thread',
group.id="group",
bootstrap.servers='localhost:9092',
#map(type='json'))'. org/apache/kafka/clients/producer/Producer
at io.siddhi.core.util.ExceptionUtil.populateQueryContext(ExceptionUtil.java:43)
at io.siddhi.core.util.parser.helper.DefinitionParserHelper.addEventSource(DefinitionParserHelper.java:388)
at io.siddhi.core.util.SiddhiAppRuntimeBuilder.defineStream(SiddhiAppRuntimeBuilder.java:117)
at io.siddhi.core.util.parser.SiddhiAppParser.defineStreamDefinitions(SiddhiAppParser.java:374)
at io.siddhi.core.util.parser.SiddhiAppParser.parse(SiddhiAppParser.java:230)
at io.siddhi.core.SiddhiManager.createSiddhiAppRuntime(SiddhiManager.java:85)
at io.siddhi.core.SiddhiManager.createSiddhiAppRuntime(SiddhiManager.java:95)
at io.siddhi.distribution.editor.core.internal.DebugRuntime.createRuntime(DebugRuntime.java:201)
at io.siddhi.distribution.editor.core.internal.DebugRuntime.(DebugRuntime.java:56)
at io.siddhi.distribution.editor.core.internal.DebugProcessorService.start(DebugProcessorService.java:38)
at io.siddhi.distribution.editor.core.internal.EditorMicroservice.start(EditorMicroservice.java:761)
at io.siddhi.distribution.editor.core.internal.EditorMicroservice.startWithVariables(EditorMicroservice.java:781)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.wso2.msf4j.internal.router.HttpMethodInfo.invokeResource(HttpMethodInfo.java:187)
at org.wso2.msf4j.internal.router.HttpMethodInfo.invoke(HttpMethodInfo.java:143)
at org.wso2.msf4j.internal.MSF4JHttpConnectorListener.dispatchMethod(MSF4JHttpConnectorListener.java:218)
at org.wso2.msf4j.internal.MSF4JHttpConnectorListener.lambda$onMessage$58(MSF4JHttpConnectorListener.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/Producer
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
at java.lang.Class.getConstructor0(Class.java:3075)
at java.lang.Class.newInstance(Class.java:412)
at io.siddhi.core.util.SiddhiClassLoader.loadClass(SiddhiClassLoader.java:32)
at io.siddhi.core.util.SiddhiClassLoader.loadExtensionImplementation(SiddhiClassLoader.java:48)
at io.siddhi.core.util.parser.helper.DefinitionParserHelper.addEventSource(DefinitionParserHelper.java:346)
... 21 more
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.clients.producer.Producer cannot be found by siddhi-io-kafka_5.0.7
at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:448)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:361)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:353)
at org.eclipse.osgi.internal.loader.ModuleClassLoader.loadClass(ModuleClassLoader.java:161)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
... 28 more
Can somebody tell me what am I doing wrong? :(
Please make sure you had the OSGi-converted jars to the "C:\Program Files\WSO2\Enterprise Integrator\7.0.2\streaming-integrator\lib".
The OSGi-converted jar list:
kafka_2.12_2.3.0_1.0.0
kafka_clients_2.3.0_1.0.0
metrics_core_2.2.0_1.0.0
scala_library_2.12.8_1.0.0
zkclient_0.11_1.0.0
zookeeper_3.4.14_1.0.0
The, copy the original jars to to the "C:\Program Files\WSO2\Enterprise Integrator\7.0.2\streaming-integrator\samples\sample-clients\lib"
The list of original jars:
kafka_2.12-2.3.0
kafka-clients-2.3.0
metrics-core-2.2.0
scala-library-2.12.8
zkclient-0.11
zookeeper-3.4.14
In order to generate the OSGi-converted jars, copy all original jars to a folder called "source" and create an empty folder called "destination". Then run the following command in the terminal:
MINGW32 /c/Program Files/WSO2/Enterprise Integrator/7.0.2/streaming-integrator/bin
$ ./jartobundle.sh C:/DevTools/source C:/DevTools/destination
Finally, distribute the OSGis and original in accordance with the directories above.
PS1: in my case i am using kafka_2.12-2.4.1, but the basename of the jars does not change.
PS2: adapt the directories to your installation path
For more details check WSO2 documentation: Kafka transport

Unable to run Scala Play app throwing com.typesafe.config.ConfigException$Missing: No configuration setting found for key

I am trying run a Scala Play app and got this exception:
Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'memo'
at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:152) ~[config-1.3.1.jar:na]
at com.typesafe.config.impl.SimpleConfig.findOrNull(SimpleConfig.java:170) ~[config-1.3.1.jar:na]
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:184) ~[config-1.3.1.jar:na]
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:189) ~[config-1.3.1.jar:na]
at com.typesafe.config.impl.SimpleConfig.getObject(SimpleConfig.java:264) ~[config-1.3.1.jar:na]
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:270) ~[config-1.3.1.jar:na]
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:37) ~[config-1.3.1.jar:na]
at lila.common.PlayApp$.loadConfig(PlayApp.scala:24) ~[na:na]
at lila.memo.Env$.lila$memo$Env$$$anonfun$1(Env.scala:28) ~[na:na]
at lila.common.Chronometer$.sync(Chronometer.scala:56) ~[na:na]
I put external reference to the configuration file such as :
run -Dconfig.resource=conf/base.conf
I am new to Scala.Tried to find out conifg-1.3.1-jar but unable to find out.Can you please suggest how to overcome this situation.
**In the base.conf there is configuration for the memo is present.
Try without conf/, run -Dconfig.resource=base.conf
From Play 2.6 docs
Using -Dconfig.resource
This will search for an alternative configuration file in the application classpath (you usually provide these alternative configuration files into your application conf/ directory before packaging). Play will look into conf/ so you don’t have to add conf/.
$ /path/to/bin/ -Dconfig.resource=prod.conf
Using -Dconfig.file
You can also specify another local configuration file not packaged into the application artifacts:
$ /path/to/bin/ -Dconfig.file=/opt/conf/prod.conf
If you are configuring an sbt task from Intellij you must quote the entire command.
"run -Dconfig.resource=prod.conf"

Apache-Beam exception while running WordCount example in eclipse

Downloaded maven dependecies in eclipse using
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-direct-java</artifactId>
<version>0.2.0-incubating</version></dependency>
When download and run WordCount example after changing the gs path to the C://examples//misc.txt.Getting the below exception.I did not pass any runner.How to pass the runner option and output params while running from eclipse??
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" java.lang.IllegalStateException: Failed to validate C://examples//misc.txt
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:288)
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:195)
at org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
at org.apache.beam.runners.direct.DirectRunner.apply(DirectRunner.java:205)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:401)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:324)
at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:59)
at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:174)
at org.apache.beam.examples.WordCount.main(WordCount.java:206)
Caused by: java.io.IOException: Unable to find handler for C://examples//misc.txt
at org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:188)
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:283)
... 8 more
I believe Apache Beam will parse syntax C://directory/file similar to http://domain/file -- it will think that C is a protocol name and that directory is a domain. The inner exception is saying that C is an unknown protocol.
Please try to avoid using :// symbol when referring to local files. I'd suggest using regular Windows standard: C:\directory\file.

PriviledgedActionException while running kmeans on hadoop

I am trying to run KMeans on hadoop, using this guidelines.
http://www.slideshare.net/titusdamaiyanti/hadoop-installation-k-means-clustering-mapreduce?qid=44b5881c-089d-474b-b01d-c35a2f91cc67&v=qf1&b=&from_search=1#likes-panel
I am running this in eclipse-luna. when I executed, both map and reduce are showing they are complete 100%. But i am not getting output. Instead i am getting following error at the end. Please some help me to solve this..
15/03/20 11:29:44 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-hduser/mapred/staging/hduser378797276/.staging/job_local378797276_0002
15/03/20 11:29:44 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:java.io.IOException: No input paths specified in job
Exception in thread "main" java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:193)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1054)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1071)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:550)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:580)
at com.clustering.mapreduce.KMeansClusteringJob.main(KMeansClusteringJob.java:114)
You have to provide the input file location before running the map reduce program. There are two ways for providing the input file location.
Using ecliple go to run configration and provide the file name as arguments
Convert your program to jar file and run the below command inside your hadoop cluster
hadoop jar NameOfYourJarFile InputFileLocation OutPutFileLocation `

zookeeper configs for Giraph 1.0 on Hadoop 2.2.0

New to stack exchange and Giraph so please overlook mistakes and ask any clarifying questions.
OS: ubuntu 13.10
Hadoop/Yarn: hadoop-2.2.0/ (2-node cluster)
Giraph: 1.0.0 (EDIT: trunk)
I'm getting a NullPointerException (NPE) when I attempt to run the following example:
$ hadoop jar
$GIRAPH_HOME/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar
org.apache.giraph.GiraphRunner
org.apache.giraph.examples.SimpleShortestPathsComputation -vif
org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
-vip /user/hduser/rrdata/tiny_graph.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
/user/hduser/rrdata/output/tiny_graph.out -w 1
Stack Trace:
Exception in thread "main" java.lang.NullPointerException at
org.apache.giraph.yarn.GiraphYarnClient.checkJobLocalZooKeeperSupported(GiraphYarnClient.java:460)
at
org.apache.giraph.yarn.GiraphYarnClient.run(GiraphYarnClient.java:116)
at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:96) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at
org.apache.giraph.GiraphRunner.main(GiraphRunner.java:126) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at
org.apache.hadoop.util.RunJar.main(RunJar.java:212)
It seems zookeeper related. I installed zookeeper but not having used it before it seems like the configs are wrong. I've tried -Dgiraph.zkList=hostname:port and related options but get 'Unrecognized option' exception.
Looking for the correct zookeeper settings for this scenarios. I'll post a reply if I figure it out.
This is an example how you can specify -D flags:
hadoop jar giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -D giraph.zkList="zkNode.net:2081" org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/rav/giraph/input/tiny_graph.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/rav/giraph/output/shortestpaths -w 1
Btw local zookeeper is not supported in Giraph yet (GiraphYarnClient):
/**
* Check if the job's configuration is for a local run. These can all be
* removed as we expand the functionality of the "pure YARN" Giraph profile.
*/
private void checkJobLocalZooKeeperSupported() {
final boolean isZkExternal = giraphConf.isZookeeperExternal();
final String checkZkList = giraphConf.getZookeeperList();
if (!isZkExternal || checkZkList.isEmpty()) {
throw new IllegalArgumentException("Giraph on YARN does not currently" +
"support Giraph-managed ZK instances: use a standalone ZooKeeper.");
}
}
Unfortunately checkZkList is NULL so you will never see this exception :)
The reason for the NPE is probably the lack of a giraphConf to check for ZK settings. i think this is due to earlier problems in the run. Looks like the examples jar was not supplied using a -yj argument. the jar you run with "hadoop jar" is typically giraph-core itself.
Good luck, please post on the Giraph user list if you have more trouble.