I tried to load a CSV from my local directory into HBase using a command similar to the one we have for HDFS, but it failed.
I don't want to put my file into HDFS first and then load it into HBase. Is there a way to load the data into HBase directly from the local filesystem?
The command below did not help:
hbase(main):010:0> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," \
-Dimporttsv.columns="HBASE_ROW_KEY,events:driverId,events:driverName,events:eventTime,events:eventType,events:latitudeColumn,events:longitudeColumn,events:routeId,events:routeName,events:truckId" \
irfan_ns:driver_dangerous_event \
file:///home/aziz/driver
Error
SyntaxError: (hbase):10: syntax error, unexpected tIDENTIFIER
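The `(hbase):10` prefix in that error suggests the command was typed inside the `hbase shell`, which only parses its own (JRuby) syntax. `ImportTsv` is invoked from the OS shell via the `hbase` launcher script. A sketch, reusing the exact paths and table from above (whether `file:///` works depends on your setup: in a distributed cluster the mapper tasks cannot see a path that exists only on one machine, so this is mainly viable when HBase/MapReduce run locally):

```shell
# Run this from the OS shell (bash), NOT from inside `hbase shell`;
# the hbase shell cannot parse it, hence "unexpected tIDENTIFIER".
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator="," \
  -Dimporttsv.columns="HBASE_ROW_KEY,events:driverId,events:driverName,events:eventTime,events:eventType,events:latitudeColumn,events:longitudeColumn,events:routeId,events:routeName,events:truckId" \
  irfan_ns:driver_dangerous_event \
  file:///home/aziz/driver
```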
I'm currently running PySpark in local mode. I want to be able to efficiently output Parquet files to S3 via the S3 directory committer. This PySpark instance uses the local disk, not HDFS, as it is submitted via spark-submit --master local[*].
I can successfully write to my S3 bucket without enabling the directory committer. However, that path writes staging files to S3 and renames them, which is slow and unreliable. I would like Spark to write to my local filesystem as a temporary store, and then copy to S3.
I have the following configuration in my PySpark conf:
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "directory")
self.spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
My spark-submit command looks like this:
spark-submit --master local[*] --py-files files.zip --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark.internal.io.cloud.PathOutputCommitProtocol --driver-memory 4G --name clean-raw-recording_data main.py
spark-submit gives me the following error, due to the requisite JAR not being in place:
java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
My questions are:
Which JAR (specifically, the maven coordinates) do I need to include in spark-submit --packages in order to be able to reference PathOutputCommitProtocol?
Once I have (1) working, will I be able to use PySpark's local mode to stage temporary files on the local filesystem? Or is HDFS a strict requirement?
I need this to be running in local mode, not cluster mode.
EDIT:
I got this to work with the following configuration: PySpark version 3.1.2 and the package org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253. I needed to add the Cloudera repository using the --repositories option of spark-submit:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
You need the spark-hadoop-cloud module for the release of Spark you are using.
The committer is happy using the local FS (that is how the public integration test suite works: https://github.com/hortonworks-spark/cloud-integration). All that's needed is a "real" filesystem shared across all workers and the Spark driver, so the driver gets the manifests of each pending commit.
Print the _SUCCESS file after a job to see what the committer did: a 0-byte file means the old committer, JSON with diagnostics means the new one.
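Following the _SUCCESS hint above, here is a minimal sketch of that check in Python. The "committer" manifest field name is an assumption based on the S3A committer's JSON _SUCCESS format; treat it as such:

```python
import json
import os

def committer_kind(success_path):
    """Classify a Spark/Hadoop _SUCCESS marker file.

    The classic FileOutputCommitter writes a zero-byte marker;
    the S3A committers write a JSON manifest with diagnostics.
    """
    if os.path.getsize(success_path) == 0:
        return "classic FileOutputCommitter"
    with open(success_path) as f:
        manifest = json.load(f)
    # "committer" as the field naming which committer ran is an
    # assumption about the S3A manifest layout.
    return "S3A committer: " + manifest.get("committer", "unknown")
```

Point it at `<output-dir>/_SUCCESS` after a write to see which path your job actually took.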
I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL DB source.
To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR already downloaded on the master and slave nodes, or you can add these as arguments to a spark-submit job.
Since I want to use an existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
I tried the following:
Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL JAR into it (postgresql-9.4.1207.jre6.jar).
Edited spark-defaults.conf to include the wildcard location:
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create the DataFrame in a Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get the following Java error:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
Help appreciated.
I think you don't need to copy the postgres jar to the slaves, as the driver program and cluster manager take care of everything. I created a DataFrame from a Postgres external source in the following way:
Download postgres driver jar:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create dataframe:
attribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
             .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
             'database' : <db>,
             'dbtable' : <select * from table>}
df = spark.read.format('jdbc').options(**attribute).load()
Submit the Spark job:
Add the downloaded jar to the driver class path when submitting the Spark job:
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
Check the GitHub repo of the driver. The driver class seems to be org.postgresql.Driver, not com.postgresql.jdbc.Driver. Try using that.
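Applied to the question's snippet, the only change needed is the driver class string. The connection details below are the question's own placeholders, not real credentials:

```python
# Placeholder connection details copied from the question. The fix is
# the "driver" entry: the PostgreSQL JDBC class is org.postgresql.Driver,
# not com.postgresql.jdbc.Driver.
jdbc_options = {
    "url": "jdbc:postgresql://some_postgresql_db:5432/dbname",
    "user": "user",
    "password": "password",
    "dbtable": "someTable",
    "driver": "org.postgresql.Driver",
}

# With a live SparkSession this would then be:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```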
Just as the title says, I'm trying to move some data from Redshift to S3 via Sqoop:
sqoop-import -Dmapreduce.job.user.classpath.first=true --connect "jdbc:redshift://redshiftinstance.us-east-1.redshift.amazonaws.com:9999/stuffprd;database=ourDB;user=username;password=password;" --table ourtable -m 1 --as-avrodatafile --target-dir s3n://bucket/folder/folder1/
All drivers are in the proper folders; however, the error being thrown is:
ERROR tool.BaseSqoopTool: Got error creating database manager: java.io.IOException: No manager for connect string:
Not sure if you have already got the answer to this, but you need to add the following to your sqoop command:
--driver com.amazon.redshift.jdbc42.Driver
--connection-manager org.apache.sqoop.manager.GenericJdbcManager
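Putting those two flags into the original command gives something like this sketch (the connect string, table, and target dir are unchanged from the question):

```shell
# Sketch: original sqoop-import command with the driver and
# connection-manager flags from the answer added.
sqoop-import -Dmapreduce.job.user.classpath.first=true \
  --driver com.amazon.redshift.jdbc42.Driver \
  --connection-manager org.apache.sqoop.manager.GenericJdbcManager \
  --connect "jdbc:redshift://redshiftinstance.us-east-1.redshift.amazonaws.com:9999/stuffprd;database=ourDB;user=username;password=password;" \
  --table ourtable -m 1 --as-avrodatafile \
  --target-dir s3n://bucket/folder/folder1/
```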
I can't help with the error, but I recommend you not do it this way. Sqoop will retrieve the table with a SELECT * and all results will have to pass through the leader node. This will be much slower than using UNLOAD to export the data directly to S3 in parallel. You can then convert the unloaded text files to Avro using Sqoop.
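A sketch of that UNLOAD alternative; the bucket path reuses the question's target and the IAM role ARN is a placeholder:

```sql
-- Each Redshift slice writes its part of the result to S3 in parallel,
-- bypassing the leader-node bottleneck of SELECT *.
-- The IAM role below is a placeholder; substitute your own.
UNLOAD ('SELECT * FROM ourtable')
TO 's3://bucket/folder/folder1/ourtable_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER ',' GZIP;
```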
I am trying to implement the following use case:
Spark reads files from HDFS (with Kerberos) in Parquet format.
Spark writes these files out in CSV format.
If I write to HDFS, it works perfectly. If I try to write to the local filesystem, it doesn't work: "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
I am using Spark 1.6.2.
To summarize, my code is:
val dfIn = sqc.read.parquet(pathIsilon)
dfIn.coalesce(1).write.format("com.databricks.spark.csv").save(pathFilesystem)
I am trying to read files from HDFS. I am using the following code:
val sam = sc.wholeTextFiles("hdfs://localhost:9000"+inputFolder,4)
I am getting the following error:
java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost, expected: hdfs://localhost:9000
I had referenced this question for adding the URI in my file path:
Cannot Read a file from HDFS using Spark
But I am still not able to read the file due to the above error. How can I resolve this?
Can you check adding winutils.exe to your system and setting an environment variable for it? Spark needs winutils.exe to do HDFS operations on Windows.
Try using the IP address instead of localhost.
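One more thing worth checking, though this is an assumption based on the string concatenation in the question: if inputFolder does not start with "/", then "hdfs://localhost:9000" + inputFolder produces a malformed URI whose authority no longer parses cleanly, which can surface as a "Wrong FS" style mismatch. The guard is language-agnostic; a small sketch in Python:

```python
def hdfs_uri(namenode, folder):
    """Join a namenode URI and a path without mangling the authority.

    Plain concatenation with a folder that lacks a leading slash
    yields e.g. hdfs://localhost:9000data, which the FileSystem
    layer will not resolve against the expected hdfs://localhost:9000.
    """
    return namenode.rstrip("/") + "/" + folder.lstrip("/")
```

For example, hdfs_uri("hdfs://localhost:9000", "input/data") and hdfs_uri("hdfs://localhost:9000/", "/input/data") both yield "hdfs://localhost:9000/input/data".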