Unable to read file from HDFS

Unable to read file from HDFS - scala

I am trying to read files from HDFS. I am using the following code:
val sam = sc.wholeTextFiles("hdfs://localhost:9000"+inputFolder,4)
I am getting the following error:
java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost, expected: hdfs://localhost:9000
I had referenced this question for adding the URI in my file path:
Cannot Read a file from HDFS using Spark
But I am still not able to read the file due t the above error. How can I resolve this?

Can you check adding winutils.exe in your system and setting a environment variable for the same . Spark needs winutils.exe to do hdfs operations.

Try using IP instead of localhost

Related

How to set Spark structured streaming check point dir to windows local directory?

My OS is Windows 11 and Apache Spark version is spark-3.1.3-bin-hadoop3.2
I try to use Spark structured streaming with pyspark. Belows are my simple spark structured streaming codes.
spark = SparkSession.builder.master("local[*]").appName(appName).getOrCreate()
spark.sparkContext.setCheckpointDir("/C:/tmp")
The same Spark codes without spark.sparkContext.setCheckpointDir line throws no errors on Ubuntu 22.04. However the above codes do not work successfully on Windows 11. The exemptions are
pyspark.sql.utils.IllegalArgumentException: Pathname /C:/tmp/67b1f386-1e71-4407-9713-fa749059191f from C:/tmp/67b1f386-1e71-4407-9713-fa749059191f is not a valid DFS filename.
I think the error codes mean checkpoint directory are generated on Hadoop file system of Linux, not on Windows 11. My operating system is Windows and checkpoint directory should be Windows 11 local directory. How can I configure Apache Spark checkpoint with Windows 11 local directory? I used file:///C:/temp and hdfs://C:/temp URL for test. But the errors are still thrown.
Update
I set below line to be comments.
#spark.sparkContext.setCheckpointDir("/C:/tmp")
Then the exceptions are thrown.
WARN streaming.StreamingQueryManager: Temporary checkpoint location created which is deleted normally when the query didn't fail: C:\Users\joseph\AppData\Local\Temp\temporary-be4f3586-d56a-4830-986a-78124ab5ee74. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
pyspark.sql.utils.IllegalArgumentException: Pathname /C:/Users/joseph/AppData/Local/Temp/temporary-be4f3586-d56a-4830-986a-78124ab5ee74 from hdfs://localhost:9000/C:/Users/joseph/AppData/Local/Temp/temporary-be4f3586-d56a-4830-986a-78124ab5ee74 is not a valid DFS filename.
I wonder why hdfs url contains c:/ driver letters and I want to know how to set spark.sql.streaming.forceDeleteTempCheckpointLocation to true.

step 1)
Since you are running spark from a windows machine, make sure winutils.exe file added in hadoop bin folder reference link for same (6th Step) https://phoenixnap.com/kb/install-spark-on-windows-10.
step 2)
then try to add like this
spark.sparkContext.setCheckpointDir("D:\Learn\Checkpoint")
spark.sparkContext.setCheckpointDir("D:\Learn\Checkpoint")
Make sure spark user does have the permission to write in mentioned checkpoint directory

Read Parquet files directly while using Postgres

I am trying to install an extension to postgres that will help me write postgres queries to read data directly from parquet files.
This is the extension I found - https://github.com/pgspider/parquet_s3_fdw
After installing the required dependencies I went ahead and tried running the 'make' command.
make install
But ends up with an error
Makefile:45: /contrib/contrib-global.mk: No such file or directory
make: *** No rule to make target '/contrib/contrib-global.mk'. Stop.
Has anyone else tried using this extension ? Or can you suggest me some other way to read data directly from parquet files while using postgres ? (Please note: conversion from parquet to any other format is not allowed under the circumstances that I'm trying this)
Thanks

I'm not sure about the error there but the FDW you referenced is for accessing parquet files on S3 which you didn't mention as a requirement. You might want to try a simpler version like https://github.com/adjust/parquet_fdw

Using Postgresql JDBC source with Apache Spark on EMR

I have existing EMR cluster running and wish to create DF from Postgresql DB source.
To do this, it seems you need to modify the spark-defaults.conf with the updated spark.driver.extraClassPath and point to the relevant PostgreSQL JAR that has been already downloaded on master & slave nodes, or you can add these as arguments to a spark-submit job.
Since I want to use existing Jupyter notebook to wrangle the data, and not really looking to relaunch cluster, what is the most efficient way to resolve this?
I tried the following:
Create new directory (/usr/lib/postgresql/ on master and slaves and copied PostgreSQL jar to it. (postgresql-9.41207.jre6.jar)
Edited spark-default.conf to include wildcard location
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create dataframe in Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get a Java error as per below:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
Help appreciated.

I think you don't need to copy postgres jar in slaves as the driver programme and cluster manager take care everything. I've created dataframe from Postgres external source by the following way:
Download postgres driver jar:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create dataframe:
atrribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
.format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
'database' : <db>,
'dbtable' : <select * from table>}
df=spark.read.format('jdbc').options(**attribute).load()
Submit to spark job:
Add the the downloaded jar to driver class path while submitting the spark job.
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5

Check the github repo of the Driver. The class path seems to be something like this org.postgresql.Driver. Try using the same.

spark read from hdfs with kerberos and write on local filesystem

I am trying to get the following use case:
spark read files from HDFS with Kerberos in parquet format
spark write this files in csv format
If I write to hdfs, it works perfectly. If I try to write to local filesystem, it doesn´t work: "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
I am using Spark 1.6.2.
To sumarize, my code is
val dfIn = sqc.read.parquet(pathIsilon)
dfIn.coalesce(1).write.format("com.databricks.spark.csv").save(pathFilesystem)

HBase Export/Import: Unable to find output directory

I am using HBase for my application and I am trying to export the data using org.apache.hadoop.hbase.mapreduce.Export as it was directed here. The issue I am facing with the command is that once the command is executed, there are no errors while creating the export. But the specified output directoy does not appear at its place.The command I used was
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name db_dump/

I got the solution hence I am replying my own answer
You must have following two lines in hadoop-env.sh in conf directory of hadoop
export HBASE_HOME=/home/sitepulsedev/hbase/hbase-0.90.4
export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.90.4.jar:$HBASE_HOME/conf:$HBASE_HOME/hbase-0.90.4-test.jar:$HBASE_HOME/lib/zookeeper-3.3.2.jar:$HBASE_HOME
save it and restart mapred by ./stop-mapred.sh and ./start-mapred.sh
now run in bin directory of hadoop
./hadoop jar ~/hbase/hbase-0.90.4/hbase-0.90.4.jar export your_table /export/your_table
Now you can verify the dump by hitting
./hadoop fs -ls /export
finally you need to copy the whole thing into your local file system for which run
./hadoop fs -copyToLocal /export/your_table ~/local_dump/your_table
here are the References that helped me out in export/import and in hadoop shell commands
Hope this one helps you out!!

As you noticed the HBase export tool will create the backup in the HDFS, if you instead want the output to be written on your local FS you can use the file URI. In your example it would be something similar to:
bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name file:///tmp/db_dump/
Related to your own answer, this would also avoid going through the HDFS. Just be very careful if your are running this is a cluster of servers, because each server will write the result files in their own local file systems.
This is true for HBase 0.94.6 at least.
Hope this helps

I think the previous answer needs some modification:
Platform: AWS EC2,
OS: Amazon Linux
Hbase Version: 0.96.1.1
Hadoop Distribution: Cloudera CDH5.0.1
MR engine: MRv1
To export data from Hbase Table to local filesystem:
sudo -u hdfs /usr/bin/hbase org.apache.hadoop.hbase.mapreduce.Export -Dmapred.job.tracker=local "table_name" "file:///backups/"
This command will dump data in HFile format with number of files equaling the number of regions of that table in Hbase.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse