Cannot apply count() or collect() on RDD from textFile (Spark) - pyspark

I am new to Spark and I have a Databricks Community Edition account. Right now I'm doing a lab and encountered the following error:
!rm README.md* -f
!wget https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md
textfile_rdd = sc.textFile("README.md")
textfile_rdd.count()
Output:
IllegalArgumentException: Path must be absolute: dbfs:/../dbfs/README.md

By default, wget will download your file to /databricks/driver.
You have to store it in the Databricks File System (DBFS) in order to be able to read it; wget's -P option sets the download directory (see the wget manual for reference).
It also seems that the !wget magic creates a file that is not available under the dbfs:/ path. On Databricks Community Edition, !wget leads to a file not found, as you mentioned.
You can do the following in a %sh cell first:
%sh
rm README.md* -f
wget https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md -P /dbfs/downloads/
And then in a second Python cell, you can access the file through the Files API (note the path starting with file:/):
textfile_rdd = sc.textFile("file:/dbfs/downloads/README.md")
textfile_rdd.count()
--2022-02-11 13:48:19-- https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3624 (3.5K) [text/plain]
Saving to: ‘/dbfs/FileStore/README.md.1’
README.md.1 100%[===================>] 3.54K --.-KB/s in 0.001s
2022-02-11 13:48:19 (4.10 MB/s) - ‘/dbfs/FileStore/README.md.1’ saved [3624/3624]
Out[25]: 98
This solution has been tested on Databricks Community Edition with the 7.1 LTS ML and 9.1 LTS ML Databricks Runtimes.
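If you prefer to stay in a Python cell rather than using %sh, a minimal sketch along the same lines (assuming dbutils and sc are available in the notebook, as they are in Databricks notebooks; the dbfs:/downloads/ folder is just an arbitrary target) would be:

import urllib.request

# Download to the driver's local filesystem (equivalent to what wget does by default).
local_path = "/tmp/README.md"
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md",
    local_path,
)

# Copy the driver-local file into DBFS so the Spark executors can read it.
dbutils.fs.cp("file:" + local_path, "dbfs:/downloads/README.md")

# Read it back as an RDD and count the lines.
textfile_rdd = sc.textFile("dbfs:/downloads/README.md")
textfile_rdd.count()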

How to fix the sdk_addon data copy build error?

I tried to build the Android SDK Addon System Image of halogenOS 13 (AOSP 13, nothing was changed in the AOSP source code yet).
The steps to build are as usual:
source build/envsetup.sh
lunch aosp_sdk_phone_x86_64-eng
m sdk_addon
At 100%, the build fails with following error:
[100% 12483/12483] Packaging SDK Addon System-Image: out/host/linux-x86/sdk_addon/custom-eng.simao--
FAILED: out/host/linux-x86/sdk_addon/custom-eng.simao--img.zip
/bin/bash -c "(cp -R out/target/product/emulator_x86_64/data out/host/linux-x86/obj/SDK_ADDON/custom_intermediates/custom-eng.simao--img/images/x86_64/data ) && (out/host/linux-x86/bin/soong_zip -o out/host/linux-x86/sdk_addon/custom-eng.simao--img.zip -C out/host/linux-x86/obj/SDK_ADDON/custom_intermediates/custom-eng.simao--img/images/ -D out/host/linux-x86/obj/SDK_ADDON/custom_intermediates/custom-eng.simao--img/images/x86_64 )"
cp: bad 'out/target/product/emulator_x86_64/data': No such file or directory
13:17:13 ninja failed with: exit status 1
If I just mkdir out/target/product/emulator_x86_64/data, that of course resolves the build error, but the SDK addon does not actually boot in the emulator due to encryption issues with the userdata partition, so I think this is related. This makes me guess that there should be some files in the data directory, but for some reason they are not created.
EDIT:
What's really odd here is that the file device/generic/goldfish/vendor.mk explicitly adds some data files to PRODUCT_COPY_FILES, notably:
PRODUCT_COPY_FILES += \
device/generic/goldfish/data/etc/dtb.img:dtb.img \
device/generic/goldfish/emulator-info.txt:data/misc/emulator/version.txt \
device/generic/goldfish/data/etc/apns-conf.xml:data/misc/apns/apns-conf.xml \
device/generic/goldfish/radio/RadioConfig/radioconfig.xml:data/misc/emulator/config/radioconfig.xml \
device/generic/goldfish/data/etc/iccprofile_for_sim0.xml:data/misc/modem_simulator/iccprofile_for_sim0.xml \
If I manually build them using, for example, m out/target/product/emulator_x86_64/data/misc/emulator/version.txt, the file is created at the correct location, as expected. This leads me to wonder when entries in PRODUCT_COPY_FILES are considered targets to be built and when they aren't.
EDIT2:
I got the emulator to boot, but the data directory is still not being created. (Creating it manually or building one target in the data dir is a workaround.)

Is it possible to configure Azure Windows VMs using Ansible on Azure DevOps Microsoft Hosted Ubuntu agents?

We are trying to configure an Azure VM using an Azure DevOps pipeline. We first create the machine using Terraform and then we need to configure it. Right now the pipeline is functional when we use a customized Ubuntu Azure DevOps agent (a VM we set up ourselves in Azure).
We would prefer to use a Microsoft-hosted Ubuntu agent. When we try to run our pipeline using the Microsoft-hosted Ubuntu agent, it fails with the message "winrm or requests is not installed".
We have done a lot of research and made many attempts to install the needed components, but none have been fruitful.
None of the examples and documentation we can find on the internet mention our specific use case: Ansible configuration of Windows VMs in Azure from a Microsoft-hosted Ubuntu agent. Is it not possible for some reason?
If it is, any pointers in the right direction would be much appreciated!
The error we see in the Azure DevOps pipeline is this:
ansible-playbook -vvvv -i inventory/hosts.cfg main.yml --extra-vars '{"customer_name": "<REMOVED>" }'
ansible-playbook [core 2.12.5]
config file = None
configured module search path = ['/home/vsts/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/vsts/.local/lib/python3.8/site-packages/ansible
ansible collection location = /home/vsts/.ansible/collections:/usr/share/ansible/collections
executable location = /home/vsts/.local/bin/ansible-playbook
python version = 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
jinja version = 2.10.1
libyaml = True
No config file found; using defaults
setting up inventory plugins
host_list declined parsing /home/vsts/work/1/s/ansible/inventory/hosts.cfg as it did not pass its verify_file() method
auto declined parsing /home/vsts/work/1/s/ansible/inventory/hosts.cfg as it did not pass its verify_file() method
yaml declined parsing /home/vsts/work/1/s/ansible/inventory/hosts.cfg as it did not pass its verify_file() method
Parsed /home/vsts/work/1/s/ansible/inventory/hosts.cfg inventory source with ini plugin
Loading collection ansible.windows from /home/vsts/.local/lib/python3.8/site-packages/ansible_collections/ansible/windows
Loading collection community.windows from /home/vsts/.local/lib/python3.8/site-packages/ansible_collections/community/windows
redirecting (type: modules) ansible.builtin.win_service to ansible.windows.win_service
redirecting (type: modules) ansible.builtin.win_service to ansible.windows.win_service
redirecting (type: modules) ansible.builtin.win_service to ansible.windows.win_service
redirecting (type: modules) ansible.builtin.win_service to ansible.windows.win_service
redirecting (type: modules) ansible.builtin.win_service to ansible.windows.win_service
Loading callback plugin default of type stdout, v2.0 from /home/vsts/.local/lib/python3.8/site-packages/ansible/plugins/callback/default.py
Skipping callback 'default', as we already have a stdout callback.
Skipping callback 'minimal', as we already have a stdout callback.
Skipping callback 'oneline', as we already have a stdout callback.
PLAYBOOK: main.yml *************************************************************
Positional arguments: main.yml
verbosity: 4
connection: smart
timeout: 10
become_method: sudo
tags: ('all',)
inventory: ('/home/vsts/work/1/s/ansible/inventory/hosts.cfg',)
extra_vars: ('{"customer_name": "<REMOVED>"}',)
forks: 5
1 plays in main.yml
PLAY [windows:pro] *********************************************************
TASK [Gathering Facts] *********************************************************
task path: /home/vsts/work/1/s/ansible/main.yml:1
redirecting (type: modules) ansible.builtin.setup to ansible.windows.setup
Using module file /home/vsts/.local/lib/python3.8/site-packages/ansible_collections/ansible/windows/plugins/modules/setup.ps1
Pipelining is enabled.
fatal: [51.144.125.149]: FAILED! => {
"msg": "winrm or requests is not installed: No module named 'winrm'"
}
PLAY RECAP *********************************************************************
51.144.125.149 : ok=0 changed=0 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
We tried to fix the problem by installing various potentially relevant components in the pipeline just before running the ansible-playbook command, for instance this one:
pip3 install pywinrm
Later, based on input on this SO question we tried this in the pipeline:
python3 -m pip install --ignore-installed pywinrm
find / -name winrm.py
ansible-playbook -vvv -i inventory/hosts.cfg main.yml
The find command finds winrm.py here:
/opt/pipx/venvs/ansible-core/lib/python3.8/site-packages/ansible/plugins/connection/winrm.py
The ansible-playbook configuration we are using is:
ansible-playbook [core 2.12.5]
config file = None
configured module search path = ['/home/vsts/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /opt/pipx/venvs/ansible-core/lib/python3.8/site-packages/ansible
ansible collection location = /home/vsts/.ansible/collections:/usr/share/ansible/collections
executable location = /opt/pipx_bin/ansible-playbook
python version = 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
jinja version = 3.1.2
libyaml = True
No config file found; using defaults
The error we get is:
task path: /home/vsts/work/1/s/ansible/main.yml:1
redirecting (type: modules) ansible.builtin.setup to ansible.windows.setup
Using module file /opt/pipx/venvs/ansible-core/lib/python3.8/site-packages/ansible_collections/ansible/windows/plugins/modules/setup.ps1
Pipelining is enabled.
fatal: [13.73.148.141]: FAILED! => {
"msg": "winrm or requests is not installed: No module named 'winrm'"
}
You can try the solution in the Red Hat knowledge base:
https://access.redhat.com/solutions/3356681
The last comment there suggests the following (replace the yum commands with their apt equivalents):
I was getting this error even though python2-winrm version 0.3.0 was already installed via yum:
yum list installed | grep winrm
python2-winrm.noarch   0.3.0-1.el7   #epel
pip install "pywinrm>=0.2.2" only resulted in "Requirement already satisfied".
I ran this to resolve the error:
1) yum autoremove python2-winrm.noarch
2) pip install "pywinrm>=0.2.2"
Then ping: pong worked just fine over https, port=5986:
ram@thinkred1cartoon$ ansible all -i hosts.txt -m win_ping
172.16.96.135 | SUCCESS => {
"changed": false,
"ping": "pong" }
Conversely, if you don't want to run command 1, then command 2 won't work for you. In that case, run command 3:
3) pip install --ignore-installed "pywinrm>=0.2.2"
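One thing the verbose output above suggests: in the second run, ansible-playbook is the pipx-installed one (/opt/pipx/venvs/ansible-core), while python3 -m pip install pywinrm installs into the user site-packages, so the module may simply not be visible to the interpreter Ansible uses. A small, hypothetical diagnostic sketch to check whether winrm is importable from a given interpreter:

# check_winrm.py - hypothetical diagnostic; run it with the interpreter reported
# under "ansible python module location" in the ansible-playbook -vvvv output.
import importlib.util
import sys

print("interpreter:", sys.executable)

spec = importlib.util.find_spec("winrm")
if spec is None:
    print("winrm is NOT importable from this interpreter")
else:
    print("winrm found at:", spec.origin)

If it reports that winrm is not importable from /opt/pipx/venvs/ansible-core/bin/python, then pywinrm would have to be installed into that environment (for example with pipx inject ansible-core pywinrm) rather than into the user site.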

postgresql pgbadger error - can not load incompatible binary data, binary file is from version < 4.0

I am trying to use pgbadger to make an HTML report for Postgres slow query log files. My Postgres log files are in csvlog format in the pg_log folder. I transferred all log files (80 files of 10 MB each) to my local Windows machine and am trying to generate a single HTML report for all of them. I created one combined file from all the files in the following way:
type postgresql-2020-06-18_075333.csv > postgresql.csv
type postgresql-2020-06-18_080011.csv >> postgresql.csv
....
....
type postgresql-2020-06-18_094812.csv >> postgresql.csv
I downloaded "pgbadger-11.2" and tried the command below, but I am getting an error.
D:\pgbadger-11.2>perl --version
This is perl 5, version 28, subversion 1 (v5.28.1) built for MSWin32-x64-multi-thread
D:\pgbadger-11.2>perl pgbadger "D:\June-Logs\postgresql.csv" -o postgresql.html
[========================>] Parsed 923009530 bytes of 923009530 (100.00%), queries: 1254764, events: 53
can not load incompatible binary data, binary file is from version < 4.0.
LOG: Ok, generating html report...
postgresql.html is created but there is no data in any tab. But it works when I create a separate report for each individual csv, like below:
D:\pgbadger-11.2>perl pgbadger "D:\June-Logs\postgresql-2020-06-18_075333.csv" -o postgresql-2020-06-18_075333.html
D:\pgbadger-11.2>perl pgbadger "D:\June-Logs\postgresql-2020-06-18_080011.csv" -o postgresql-2020-06-18_080011.html
...
D:\pgbadger-11.2>perl pgbadger "D:\June-Logs\postgresql-2020-06-18_094812.csv" -o postgresql-2020-06-18_094812.html
Please suggest something to fix this issue.
I am going to say this is due to:
type postgresql-2020-06-18_075333.csv > postgresql.csv
type postgresql-2020-06-18_080011.csv >> postgresql.csv
Pretty sure that is introducing Windows line endings, and pgBadger is looking for Unix line endings. Can you do the concatenation on the server?
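If concatenating on the server is not an option, here is a minimal sketch of a byte-for-byte concatenation on the Windows side (no newline translation; the paths and file name pattern are taken from the question and may need adjusting):

# concat_logs.py - concatenate the csvlog files byte-for-byte, so nothing
# rewrites the line endings along the way.
import glob

with open(r"D:\June-Logs\postgresql.csv", "wb") as out:
    for name in sorted(glob.glob(r"D:\June-Logs\postgresql-2020-06-18_*.csv")):
        with open(name, "rb") as part:
            out.write(part.read())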
UPDATE. Hmm. Ran across this:
https://github.com/darold/pgbadger/releases
"This new release breaks backward compatibility with old binary or JSON
files. This also mean that incremental mode will not be able to read
old binary file [...] Add a warning about version and skip loading incompatible binary file.
Update code formatter to pgFormatter 4.0."
Not sure why it is failing on CSV logs; still, what version of pgBadger is generating the logs?

Apache Zeppelin cannot deserialize dataset: "NoSuchMethodError"

I am trying to use Apache Zeppelin (0.7.2, net install, running locally on a Mac) to explore data loaded from an S3 bucket. The data seems to load just fine, as the command:
val p = spark.read.textFile("s3a://sparkcookbook/person")
gives the result:
p: org.apache.spark.sql.Dataset[String] = [value: string]
However, when I try to call methods on the object p, I get an error. For example:
p.take(1)
results in:
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
My conf/zeppelin-env.sh is the same as the default, except that I have amazon access key and secret key environment variables defined there. In the Spark interpreter in the Zeppelin notebook, I have added the following artifacts:
org.apache.hadoop:hadoop-aws:2.7.3
com.amazonaws:aws-java-sdk:1.7.9
com.fasterxml.jackson.core:jackson-core:2.9.0
com.fasterxml.jackson.core:jackson-databind:2.9.0
com.fasterxml.jackson.core:jackson-annotations:2.9.0
(I think only the first two are necessary). The two commands above work fine in the Spark shell, just not in the Zeppelin notebook (see How to use s3 with Apache spark 2.2 in the Spark shell for how that was set up).
So it seems that there is a problem with one of the Jackson libraries. Maybe I'm using the wrong artifacts above for the Zeppelin interpreter?
UPDATE: Following the advice in the proposed answer below, I removed the jackson jars that came with Zeppelin, and replaced them with the following:
jackson-annotations-2.6.0.jar
jackson-core-2.6.7.jar
jackson-databind-2.6.7.jar
And replaced the artifacts with these, so my artifacts are now:
org.apache.hadoop:hadoop-aws:2.7.3
com.amazonaws:aws-java-sdk:1.7.9
com.fasterxml.jackson.core:jackson-core:2.6.7
com.fasterxml.jackson.core:jackson-databind:2.6.7
com.fasterxml.jackson.core:jackson-annotations:2.6.0
The error I get, however, from running the above commands is the same.
UPDATE2: As suggested, I removed the jackson libraries from the list of artifacts, since they are now already in the jars/ folder; the only added artifacts are now the aws artifacts above. I then cleaned the classpath by entering the following in the notebook (as per the instructions):
%spark.dep
z.reset()
I get a different error now:
val p = spark.read.textFile("s3a://sparkcookbook/person")
p.take(1)
p: org.apache.spark.sql.Dataset[String] = [value: string]
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:20)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:37)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
UPDATE3: As per the suggestion in a comment to the proposed answer below, I cleaned the class path by deleting all the files in the local repo:
rm -rf local-repo/*
I then restarted the Zeppelin server. To check the class path, I executed the following in the notebook:
val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)
This gave the following output (I include only the jackson libraries from the output here, otherwise the output is too long to paste):
...
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-annotations-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-annotations-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-databind-2.1.1.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-databind-2.2.3.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-jaxrs-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/local-repo/2CT9CPAA9/jackson-xc-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-annotations-2.6.0.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-core-2.6.7.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/lib/jackson-databind-2.6.7.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-annotations-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-core-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-databind-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/zeppelin-0.7.2-bin-netinst/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-annotations-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-core-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-core-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-databind-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-jaxrs-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-mapper-asl-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-module-paranamer-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-module-scala_2.11-2.6.5.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/jackson-xc-1.9.13.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/json4s-jackson_2.11-3.2.11.jar
file:/Users/shafiquejamal/allfiles/scala/spark/spark-2.1.0-bin-hadoop2.7/jars/parquet-jackson-1.8.1.jar
...
It seems that multiple versions are fetched from the repo. Should I exclude the older versions? If so, how do I do that?
Use these jar versions:
aws-java-sdk-1.7.4.jar
hadoop-aws-2.6.0.jar
as in this script: https://github.com/2dmitrypavlov/sparkDocker/blob/master/zeppelin.sh
Do not load them as packages; instead download the jars and put them in a path, let's say "/root/jars/", then edit your zeppelin-env.sh.
Then run this command from the zeppelin/conf dir:
echo 'export SPARK_SUBMIT_OPTIONS="--jars /root/jars/mysql-connector-java-5.1.39.jar,/root/jars/aws-java-sdk-1.7.4.jar,/root/jars/hadoop-aws-2.6.0.jar"'>>zeppelin-env.sh
After that, restart Zeppelin.
The code at the link above is pasted below (just in case the link becomes stale):
#!/bin/bash
# Download jars
cd /root/jars
wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.39/mysql-connector-java-5.1.39.jar
cd /usr/share/
wget http://archive.apache.org/dist/zeppelin/zeppelin-0.7.1/zeppelin-0.7.1-bin-all.tgz
tar -zxvf zeppelin-0.7.1-bin-all.tgz
cd zeppelin-0.7.1-bin-all/conf
cp zeppelin-env.sh.template zeppelin-env.sh
echo 'export MASTER=spark://'$MASTERZ':7077'>>zeppelin-env.sh
echo 'export SPARK_SUBMIT_OPTIONS="--jars /root/jars/mysql-connector-java-5.1.39.jar,/root/jars/aws-java-sdk-1.7.4.jar,/root/jars/hadoop-aws-2.6.0.jar"'>>zeppelin-env.sh
echo 'export ZEPPELIN_NOTEBOOK_STORAGE="org.apache.zeppelin.notebook.repo.VFSNotebookRepo, org.apache.zeppelin.notebook.repo.zeppelinhub.ZeppelinHubRepo"'>>zeppelin-env.sh
echo 'export ZEPPELINHUB_API_ADDRESS="https://www.zeppelinhub.com"'>>zeppelin-env.sh
echo 'export ZEPPELIN_PORT=9999'>>zeppelin-env.sh
echo 'export SPARK_HOME=/usr/share/spark'>>zeppelin-env.sh
cd ../bin/
./zeppelin.sh
You are probably using a Jackson version that is too recent. Even Spark 2.3 is still on 2.6.7. Downgrade, and make sure that all your Jackson JARs are consistent.

wget files from FTP-like listings

So, a site that used to use FTP now has an HTTP front-end and won't allow FTP connections. The site in question (for an example directory) shows a page with links to different dates. Inside each of these date directories there are many files, and I typically just need to get some file matching a clear pattern, e.g. *h17v04*.hdf. I thought this could work:
wget -I "${PLATFORM}/${PRODUCT}/${YEAR}.*" -r -l 4 \
--user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" \
--verbose -c -np -nc -nd \
-A "*h17v04*.hdf" http://e4ftl01.cr.usgs.gov/$PLATFORM/$PRODUCT/
where PLATFORM=MOLT, PRODUCT=MOD09GA.005 and YEAR=2004, for example. This seems to start looking into all the useful dates, finds the index.html, and then just skips to the next directory, without downloading the relevant hdf file:
--2013-06-14 13:09:18-- http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/
Reusing existing connection to e4ftl01.cr.usgs.gov:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html'
[ <=> ] 174,182 134K/s in 1.3s
2013-06-14 13:09:20 (134 KB/s) - `e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html' saved [174182]
Removing e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.01/index.html since it should be rejected.
--2013-06-14 13:09:20-- http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2004.01.02/
[...]
If I ignore the -A option, only the index.html file is downloaded to my system, but it appears it's not parsed and the links are not followed. I don't really know what more is required to make this work, as I can't see why it doesn't!
SOLUTION
In the end, the problem was due to an old bug in the local version of wget. However, I ended up writing my own script for downloading MODIS data from the server above. The script is pure Python, and is available from here.
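The script itself isn't reproduced here, but a minimal sketch of the same idea in pure Python (fetch the product index, follow each date directory, and download the links matching the tile pattern; the URL layout and patterns are assumptions taken from the question, and the real server may require authentication these days):

# Hypothetical sketch, not the actual script from the post.
import re
import urllib.request

BASE = "http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/"      # PLATFORM/PRODUCT
DATE_LINK = re.compile(r'href="(2004\.\d{2}\.\d{2}/)"')    # YEAR=2004
FILE_LINK = re.compile(r'href="([^"]*h17v04[^"]*\.hdf)"')  # tile pattern

# Fetch the product index, then each date directory, and download matching files.
index = urllib.request.urlopen(BASE).read().decode("utf-8", "replace")
for date_dir in DATE_LINK.findall(index):
    page = urllib.request.urlopen(BASE + date_dir).read().decode("utf-8", "replace")
    for fname in FILE_LINK.findall(page):
        print("downloading", date_dir + fname)
        urllib.request.urlretrieve(BASE + date_dir + fname, fname)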
Consider using pyModis instead of wget; it is a free and open source Python-based library for working with MODIS data. It offers bulk download for user-selected time ranges, mosaicking of MODIS tiles, reprojection from Sinusoidal to other projections, and conversion of HDF format to other formats. See
http://www.pymodis.org/