hadoop-streaming example failed to run - Type mismatch in key from map - streaming

I was running:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer
What should the input file be when IdentityMapper is the mapper?
I was hoping to see it sort on selected key fields rather than on the entire key. My input file is simple:
"aa bb".
"cc dd"
I'm not sure what I missed. I always get this error:
java.lang.Exception: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:371)
Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable

This is a known bug and here is the JIRA. The bug was identified in Hadoop 0.21.0, but I don't think the fix is in any released version of Hadoop. If you really want to fix this, you can:
download the source code for Hadoop (for the release you are working with)
download the patch from the JIRA and apply it
build and test Hadoop
Here are the instructions on how to apply a patch.
Alternatively, instead of using IdentityMapper and IdentityReducer, use Python or Perl scripts that read the key/value pairs from STDIN and write the same pairs to STDOUT without any processing. This amounts to writing your own IdentityMapper and IdentityReducer without using Java.
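A minimal sketch of such a pass-through script in Python, usable as both the mapper and the reducer. The file name identity.py and the function name identity are my own choices for illustration, not anything from the Hadoop distribution.

```python
#!/usr/bin/env python3
"""Pass-through script: copies every line from STDIN to STDOUT unchanged."""
import sys


def identity(lines, out):
    """Write each input line to out as-is and return how many lines were copied."""
    count = 0
    for line in lines:
        out.write(line)
        count += 1
    return count


# Only consume STDIN when input is piped in, as Hadoop streaming does.
if __name__ == "__main__" and not sys.stdin.isatty():
    identity(sys.stdin, sys.stdout)
```

Assuming the script is saved as identity.py and made executable, it could replace the two Java class names in the command from the question: -mapper identity.py -reducer identity.py -file identity.py.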

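I was trying out Hadoop with my own example and got the same error. Using KeyValueTextInputFormat resolved the issue: it produces Text keys, whereas the default TextInputFormat produces LongWritable keys (byte offsets), which the identity mapper passes through unchanged. You can have a look at the following blog post for details.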
http://sanketraut.blogspot.in/2012/06/hadoop-example-setting-up-hadoop-on.html
Hope it helps you.
Peace.
Sanket Raut

Cannot apply count() or collect() on RDD from textFile (Spark)

I am new to Spark and I have a Databricks Community Edition account. I'm working through a lab and ran into the following error:
!rm README.md* -f
!wget https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md
textfile_rdd = sc.textFile("README.md")
textfile_rdd.count()
Output:
IllegalArgumentException: Path must be absolute: dbfs:/../dbfs/README.md
By default, wget will download your file to /databricks/driver.
To be able to read it, you have to store it in the Databricks File System (DBFS), which you can do with wget's -P option. See the wget manual for reference.
It also seems that the !wget magic creates a file that is not available under the dbfs:/ path. On Databricks Community, !wget leads to a file-not-found error, as you mentioned.
You can do the following in a %sh cell first:
%sh
rm README.md* -f
wget https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md -P /dbfs/downloads/
Then, in a second Python cell, you can access the file through the Files API (note the path starting with file:/):
textfile_rdd = sc.textFile("file:/dbfs/downloads/README.md")
textfile_rdd.count()
--2022-02-11 13:48:19-- https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3624 (3.5K) [text/plain]
Saving to: ‘/dbfs/FileStore/README.md.1’
README.md.1 100%[===================>] 3.54K --.-KB/s in 0.001s
2022-02-11 13:48:19 (4.10 MB/s) - ‘/dbfs/FileStore/README.md.1’ saved [3624/3624]
Out[25]: 98
This solution has been tested on Databricks Community Edition with the 7.1 LTS ML and 9.1 LTS ML Databricks Runtimes.
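If you would rather stay in Python than use a %sh cell, the same two steps (download into a /dbfs directory, then read through a file:/ path) can be sketched as below. This is an untested alternative, not part of the tested solution above; the helper names download_to and spark_file_uri are made up for illustration.

```python
import os
import urllib.request


def spark_file_uri(local_path):
    """Return a file: URI for an absolute local path, e.g. file:/dbfs/x.md."""
    if not os.path.isabs(local_path):
        raise ValueError("path must be absolute: " + local_path)
    return "file:" + local_path


def download_to(url, directory):
    """Download url into directory (created if missing) and return the local path."""
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, os.path.basename(url))
    urllib.request.urlretrieve(url, target)
    return target


# Usage in a notebook cell (sc is the SparkContext the notebook provides):
# path = download_to(
#     "https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md",
#     "/dbfs/downloads")
# textfile_rdd = sc.textFile(spark_file_uri(path))
# textfile_rdd.count()
```

Raising an error for relative paths is deliberate: the original failure came from passing the relative name README.md to sc.textFile.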

Google Cloud AI Platform: I can't create a model version using the "--accelerator" parameter

In order to get online predictions, I'm creating a model version on the ai-platform. It works fine unless I want to use the --accelerator parameter.
Here is the command that works:
gcloud alpha ai-platform versions create [...] --model [...] --origin=[...] --python-version=3.5 --runtime-version=1.14 --package-uris=[...] --machine-type=mls1-c4-m4 --prediction-class=[...]
Here is the parameter that makes it not work:
--accelerator=^:^count=1:type=nvidia-tesla-k80
This is the error message I get:
ERROR: (gcloud.alpha.ai-platform.versions.create) INVALID_ARGUMENT: Request contains an invalid argument.
I expect it to work, since 1) the parameter exists and uses these two keys (count and type), 2) I use the correct syntax for the parameter, any other syntaxes would return a syntax error, and 3) the "nvidia-tesla-k80" value exists (it is described in --help) and is available in the region in which the model is deployed.
Make sure you are using a recent version of the Google Cloud SDK.
Then you can use the following command:
gcloud beta ai-platform versions create $VERSION_NAME \
--model $MODEL_NAME \
--origin gs://$MODEL_DIRECTORY_URI \
--runtime-version 1.15 \
--python-version 3.7 \
--framework tensorflow \
--machine-type n1-standard-4 \
--accelerator count=1,type=nvidia-tesla-t4
For reference you can enable logging during model creation:
gcloud beta ai-platform models create {MODEL_NAME} \
--regions {REGION} \
--enable-logging \
--enable-console-logging
The format for the --accelerator parameter, as you can check in the official documentation, is:
--accelerator=count=1,type=nvidia-tesla-k80
I think the format you used might be causing your issue; let me know.

Apache Zeppelin cannot deserialize dataset: "NoSuchMethodError"

I am trying to use Apache Zeppelin (0.7.2, net install running locally on a Mac) to explore data loaded from an s3 bucket. The data seems to load just fine, as the command:
val p = spark.read.textFile("s3a://sparkcookbook/person")
gives the result:
p: org.apache.spark.sql.Dataset[String] = [value: string]
However, when I try to call methods on the object p, I get an error. For example:
p.take(1)
results in:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
My conf/zeppelin-env.sh is the same as the default, except that I have amazon access key and secret key environment variables defined there. In the Spark interpreter in the Zeppelin notebook, I have added the following artifacts:
org.apache.hadoop:hadoop-aws:2.7.3
com.amazonaws:aws-java-sdk:1.7.9
com.fasterxml.jackson.core:jackson-core:2.9.0
com.fasterxml.jackson.core:jackson-databind:2.9.0
com.fasterxml.jackson.core:jackson-annotations:2.9.0
(I think only the first two are necessary). The two commands above work fine in the Spark shell, just not in the Zeppelin notebook (see How to use s3 with Apache spark 2.2 in the Spark shell for how that was set up).
So it seems that there is a problem with one of the Jackson libraries. Maybe I'm using the wrong artifacts above for the Zeppelin interpreter?
UPDATE: Following the advice in the proposed answer below, I removed the jackson jars that came with Zeppelin, and replaced them with the following:
jackson-annotations-2.6.0.jar
jackson-core-2.6.7.jar
jackson-databind-2.6.7.jar
And replaced the artifacts with these, so my artifacts are now:
org.apache.hadoop:hadoop-aws:2.7.3
com.amazonaws:aws-java-sdk:1.7.9
com.fasterxml.jackson.core:jackson-core:2.6.7
com.fasterxml.jackson.core:jackson-databind:2.6.7
com.fasterxml.jackson.core:jackson-annotations:2.6.0
The error I get, however, from running the above commands is the same.
UPDATE2: I removed the Jackson libraries from the list of artifacts, since they are now already in the jars/ folder; the only added artifacts are now the AWS artifacts above. I then cleaned the classpath by entering the following in the notebook (as per the instructions):
%spark.dep
z.reset()
I get a different error now:
val p = spark.read.textFile("s3a://sparkcookbook/person")
p.take(1)
p: org.apache.spark.sql.Dataset[String] = [value: string]
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:20)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:37)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
UPDATE3: As per the suggestion in a comment to the proposed answer below, I cleaned the class path by deleting all the files in the local repo:
rm -rf local-repo/*
I then restarted the Zeppelin server. To check the class path, I executed the following in the notebook:
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
This gave the following output (I include only the jackson libraries from the output here, otherwise the output is too long to paste):
...
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-annotations-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-annotations-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-databind-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-databind-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-jaxrs-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-xc-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-annotations-2.6.0.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-core-2.6.7.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-databind-2.6.7.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-annotations-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-core-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-databind-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-annotations-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-core-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-databind-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-jaxrs-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-module-paranamer-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-module-scala_2.11-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-xc-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/json4s-jackson_2.11-3.2.11.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/parquet-jackson-1.8.1.jar
...
It seems that multiple versions are fetched from the repo. Should I exclude the older versions? If so, how do I do that?
Use these jar versions:
aws-java-sdk-1.7.4.jar
hadoop-aws-2.6.0.jar
as in this script: https://github.com/2dmitrypavlov/sparkDocker/blob/master/zeppelin.sh
Do not use packages; instead, download the jars and put them in a path, say /root/jars/. Then edit your zeppelin-env.sh by running this command from the zeppelin/conf directory:
echo 'export SPARK_SUBMIT_OPTIONS="--jars /root/jars/mysql-connector-java-5.1.39.jar,/root/jars/aws-java-sdk-1.7.4.jar,/root/jars/hadoop-aws-2.6.0.jar"'>>zeppelin-env.sh
After that, restart Zeppelin.
The code at the link above is pasted below (just in case the link becomes stale):
#!/bin/bash
# Download jars
cd /root/jars
wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.39/mysql-connector-java-5.1.39.jar
cd /usr/share/
wget http://archive.apache.org/dist/zeppelin/zeppelin-0.7.1/zeppelin-0.7.1-bin-all.tgz
tar -zxvf zeppelin-0.7.1-bin-all.tgz
cd zeppelin-0.7.1-bin-all/conf
cp zeppelin-env.sh.template zeppelin-env.sh
echo 'export MASTER=spark://'$MASTERZ':7077'>>zeppelin-env.sh
echo 'export SPARK_SUBMIT_OPTIONS="--jars /root/jars/mysql-connector-java-5.1.39.jar,/root/jars/aws-java-sdk-1.7.4.jar,/root/jars/hadoop-aws-2.6.0.jar"'>>zeppelin-env.sh
echo 'export ZEPPELIN_NOTEBOOK_STORAGE="org.apache.zeppelin.notebook.repo.VFSNotebookRepo, org.apache.zeppelin.notebook.repo.zeppelinhub.ZeppelinHubRepo"'>>zeppelin-env.sh
echo 'export ZEPPELINHUB_API_ADDRESS="https://www.zeppelinhub.com"'>>zeppelin-env.sh
echo 'export ZEPPELIN_PORT=9999'>>zeppelin-env.sh
echo 'export SPARK_HOME=/usr/share/spark'>>zeppelin-env.sh
cd ../bin/
./zeppelin.sh
You are probably using too recent a Jackson version. Even Spark 2.3 is still on 2.6.7. Downgrade, and make sure that all your Jackson JARs are consistent.
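To check that consistency, one option is to paste the classpath listing from the question into a small script that groups the jars by artifact and reports every artifact present in more than one version. A rough sketch in Python; the function names and the file-name pattern are my own assumptions, and the pattern only covers names of the form artifact-version.jar.

```python
import re
from collections import defaultdict

# Splits e.g. "jackson-core-2.6.5.jar" into ("jackson-core", "2.6.5").
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.]*)\.jar$")


def versions_by_artifact(classpath_entries, prefix="jackson-"):
    """Collect the set of versions seen for each artifact starting with prefix."""
    found = defaultdict(set)
    for entry in classpath_entries:
        name = entry.rsplit("/", 1)[-1]  # keep only the file name
        match = JAR_NAME.match(name)
        if match and match.group("artifact").startswith(prefix):
            found[match.group("artifact")].add(match.group("version"))
    return found


def conflicts(classpath_entries, prefix="jackson-"):
    """Return artifacts with more than one version (versions sorted as strings)."""
    grouped = versions_by_artifact(classpath_entries, prefix)
    return {a: sorted(v) for a, v in grouped.items() if len(v) > 1}
```

Run against the listing in the question, this would flag jackson-core, jackson-databind and jackson-annotations, each of which appears in four different versions across local-repo, lib and the Spark jars folder.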

When using Mallet, how do I get a list of topics associated with each document

When using Mallet, how do I get a list of topics associated with each document? I think I need to use train-topics and --output-topic-docs, but when I do, I get an error.
I'm using Mallet (2.0.8), and I use the following bash script to do my modeling:
MALLET=/Users/emorgan/desktop/mallet/bin/mallet
INPUT=/Users/emorgan/desktop/sermons
OBJECT=./object.mallet
$MALLET import-dir --input $INPUT --output $OBJECT --keep-sequence --remove-stopwords
$MALLET train-topics --input $OBJECT --num-topics 10 --num-top-words 1 \
--num-iterations 50 \
--output-doc-topics ./topics.txt \
--output-topic-keys ./keys.txt \
--xml-topic-report ./topic.xml \
--output-topic-docs ./docs.txt
Unfortunately, ./docs.txt does not get created. Instead I get the following error:
Exception in thread "main" java.lang.ClassCastException: java.net.URI cannot be cast to java.lang.String
at cc.mallet.topics.ParallelTopicModel.printTopicDocuments(ParallelTopicModel.java:1773)
at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:281)
More specifically, I want Mallet to generate a list of documents and the associated topics assigned to them, or I want a list of topics and then the list of associated documents. How do I create such lists?
At least in mallet 2.0.7, it is --output-doc-topics ./topics.txt that gives the desired table (a topic composition of each document). While the output format has changed from 2.0.7 to 2.0.8, the main content of the file stayed the same.
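Once you have topics.txt, turning it into a list of topics per document takes a few lines of Python. The sketch below assumes the 2.0.8 layout is one tab-separated line per document (document index, document name, then one proportion per topic in topic order); that layout is my assumption, so check it against your own file, and note that it will not parse the 2.0.7 format. The function name top_topics is made up.

```python
def top_topics(doc_topics_lines, n=3):
    """Map each document name to its n highest-weighted (topic, weight) pairs."""
    result = {}
    for line in doc_topics_lines:
        # Skip blank lines and any header line starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        name = fields[1]
        weights = [float(value) for value in fields[2:] if value]
        ranked = sorted(enumerate(weights), key=lambda pair: pair[1], reverse=True)
        result[name] = ranked[:n]
    return result


# Usage, assuming the output file from the question:
# with open("topics.txt") as handle:
#     for name, topics in top_topics(handle).items():
#         print(name, topics)
```

Inverting the result (grouping document names by their top topic) would give the other list you asked for, topics with their associated documents.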

PredictionIO - getting error when build and run Evaluation metrics

I followed this quickstart:
https://docs.prediction.io/templates/classification/quickstart/
and this document for evaluation metrics
https://docs.prediction.io/evaluation/paramtuning/
Everything seemed OK until the step where I build and run the evaluation metrics:
pio eval org.template.classification.AccuracyEvaluation \
org.template.classification.EngineParamsList
I am getting the exception:
Exception in thread "main" scala.reflect.internal.MissingRequirementError: object org.template.classification.AccuracyEvaluation not found.
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at io.prediction.workflow.WorkflowUtils$.getEvaluation(WorkflowUtils.scala:103)
at io.prediction.workflow.CreateWorkflow$$anonfun$19.apply(CreateWorkflow.scala:146)
at io.prediction.workflow.CreateWorkflow$$anonfun$19.apply(CreateWorkflow.scala:144)
Could anyone help me with this?
Thank you very much.
Had the exact same problem. Fixed it by doing the following:
For each .scala file in engine_dir/src/main/scala/org/template/engine_name/ you need to change the first line from...
package <SomeTemplateName>
To the following (replacing engine_name with the name of the folder in the path mentioned above):
package org.template.<engine_name>
Then, in engine.json you need to change the following line...
"engineFactory": "<template name>.<template engine>",
To the following (once again replacing engine_name with the name of the folder in the path mentioned above):
"engineFactory": "org.template.<engine_name>.<template engine>",
Now re-run...
pio build
pio train
pio deploy
Then you should be able to run the model evaluation without errors.
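If you want to double-check that the package lines and engine.json agree before rebuilding, something like the following Python sketch could help. It is only an illustration: the function names are made up, and it assumes each .scala file declares its package on a single package line.

```python
import json
import re

# First "package a.b.c" declaration in a Scala source file.
PACKAGE_LINE = re.compile(r"^\s*package\s+([\w.]+)", re.MULTILINE)


def declared_package(scala_source):
    """Return the package named by the first package line, or None."""
    match = PACKAGE_LINE.search(scala_source)
    return match.group(1) if match else None


def factory_package(engine_json_text):
    """Return the package part of engineFactory (everything before the last dot)."""
    factory = json.loads(engine_json_text)["engineFactory"]
    return factory.rpartition(".")[0]


def mismatched_sources(engine_json_text, sources):
    """Return names of sources whose package differs from the engineFactory package."""
    expected = factory_package(engine_json_text)
    return [name for name, text in sources.items()
            if declared_package(text) != expected]
```

Here sources is a dict mapping each file name to its contents; an empty result means every file declares the same package that engineFactory points at.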
Simply run it like this
$ pio eval org.example.classification.AccuracyEvaluation \
org.example.classification.EngineParamsList
You don't have to change anything. The class package in the sample was org.example.classification, not org.template.classification.