HBase Export/Import: Unable to find output directory - import

I am using HBase for my application and I am trying to export the data using org.apache.hadoop.hbase.mapreduce.Export as directed here. The issue I am facing is that once the command is executed, there are no errors while creating the export, but the specified output directory does not appear in its place. The command I used was
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name db_dump/

I found the solution, so I am posting my own answer.
You must have the following two lines in hadoop-env.sh in the conf directory of Hadoop:
export HBASE_HOME=/home/sitepulsedev/hbase/hbase-0.90.4
export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.90.4.jar:$HBASE_HOME/conf:$HBASE_HOME/hbase-0.90.4-test.jar:$HBASE_HOME/lib/zookeeper-3.3.2.jar:$HBASE_HOME
Save it and restart MapReduce with ./stop-mapred.sh and ./start-mapred.sh.
Now run, in the bin directory of Hadoop:
./hadoop jar ~/hbase/hbase-0.90.4/hbase-0.90.4.jar export your_table /export/your_table
Now you can verify the dump by running:
./hadoop fs -ls /export
Finally, you need to copy the whole thing into your local file system, for which you run:
./hadoop fs -copyToLocal /export/your_table ~/local_dump/your_table
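If you later need to load this dump back into HBase (the Import counterpart of the Export above), something along these lines should work; this is only a sketch, assuming the target table already exists and the dump is still at /export/your_table:
./hadoop jar ~/hbase/hbase-0.90.4/hbase-0.90.4.jar import your_table /export/your_table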
Here are the references that helped me out with export/import and with the Hadoop shell commands.
Hope this one helps you out!!

As you noticed, the HBase export tool creates the backup in HDFS. If you instead want the output to be written to your local FS, you can use a file URI. In your example it would be something similar to:
bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name file:///tmp/db_dump/
Related to your own answer, this also avoids going through HDFS. Just be very careful if you are running this on a cluster of servers, because each server will write its result files to its own local file system.
This is true for HBase 0.94.6 at least.
Hope this helps

I think the previous answer needs some modification:
Platform: AWS EC2,
OS: Amazon Linux
HBase Version: 0.96.1.1
Hadoop Distribution: Cloudera CDH5.0.1
MR engine: MRv1
To export data from an HBase table to the local filesystem:
sudo -u hdfs /usr/bin/hbase org.apache.hadoop.hbase.mapreduce.Export -Dmapred.job.tracker=local "table_name" "file:///backups/"
This command will dump the table data as SequenceFiles, with the number of output files equal to the number of regions of that table in HBase.

Related

uploading large file to AWS aurora postgres serverless

I have been trying for days to copy a large CSV file into a table in PostgreSQL. I am using pgAdmin 4 to access the database. The file on my system is 10 GB, so I am getting errors when I try to upload it via the UI or the \copy command.
When talking about a 10 GB CSV file, you may consider a few different options:
I believe \copy should work (see the sketch below); you did not provide any more information about the issue.
I'd personally use AWS Glue, an ETL service which can read the file from S3.
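A minimal sketch of the \copy route, assuming a hypothetical table big_table and a local file /data/big.csv with a header row; \copy streams the file from the client, so the 10 GB file never needs to exist on the server's filesystem:
psql -h <aurora-endpoint> -U <user> -d <db> -c "\copy big_table FROM '/data/big.csv' WITH CSV HEADER"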

Using Postgresql JDBC source with Apache Spark on EMR

I have existing EMR cluster running and wish to create DF from Postgresql DB source.
To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR that has already been downloaded onto the master & slave nodes, or you can add these as arguments to a spark-submit job.
Since I want to use an existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
I tried the following:
Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL jar (postgresql-9.4.1207.jre6.jar) to it.
Edited spark-defaults.conf to include the wildcard location:
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create a dataframe in a Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get a Java error as per below:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
Help appreciated.
I think you don't need to copy the Postgres jar to the slaves, as the driver program and cluster manager take care of everything. I've created a dataframe from a Postgres external source in the following way:
Download the Postgres driver jar:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create the dataframe (note that the variable name must match the one passed to options(), and a query passed as dbtable needs to be parenthesized and aliased):
attribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
.format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
'database' : <db>,
'dbtable' : '(select * from <table>) as tbl'}
df = spark.read.format('jdbc').options(**attribute).load()
Submit the Spark job:
Add the downloaded jar to the driver class path while submitting the Spark job:
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
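If you submit with plain spark-submit instead of a managed --properties wrapper, the equivalent would be roughly the following; a sketch assuming the jar is still in $HOME as downloaded above and a hypothetical script name load_postgres.py:
spark-submit --driver-class-path $HOME/postgresql-42.2.5.jar --jars $HOME/postgresql-42.2.5.jar load_postgres.py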
Check the GitHub repo of the driver. The driver class name should be org.postgresql.Driver, not com.postgresql.jdbc.Driver. Try using that instead.

AWS DMS Streaming replication : Logical Decoding Output Plugins(test_decoding) not accessible

I'm trying to migrate a PostgreSQL DB persisted on cloud (on DO droplet) to RDS using AWS Database Migration Service (DMS).
I've successfully configured the replication instance and endpoints.
I've created a task with Migrate existing data and replicate ongoing changes. When I start the task, it shows the error ERROR: could not access file "test_decoding": No such file or directory.
I've tried to create a replication slot manually in my DB console; it throws the same error.
I've followed the procedures suggested in the DMS documentation for Postgres.
I'm using PostgreSQL 9.4.6 on my source endpoint.
I presume the problem is that the output plugin test_decoding is not accessible to perform the replication.
Please assist me to resolve this. Thanks in advance!
You must install the postgresql-contrib additional supplied modules on your source endpoint.
If it is installed, make sure the directory where the test_decoding module is located is the same as the directory where PostgreSQL expects it.
On *nix, you can check the module directory with the command:
pg_config --pkglibdir
If it is not the same, copy the module, make a symlink, or use whatever other solution you prefer.
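As a rough example, on a Debian/Ubuntu droplet with PostgreSQL 9.4 the install and check might look like this (package names differ per distribution, so treat them as assumptions):
sudo apt-get install postgresql-contrib-9.4
ls "$(pg_config --pkglibdir)" | grep test_decoding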

How to migrate/shift/copy/move data in Neo4j

Does anyone know how to migrate data from one instance of Neo4j to another? To be more precise, I want to know how to move the data from a Neo4j instance on my local machine to another instance on a remote machine. Does anyone have any idea how to do this?
I'm working on my Windows machine with Eclipse and embedded Neo4j. I need to transfer this data to a remote Neo4j instance on a CentOS machine. Please help me with this.
Not sure how to do it for an embedded Neo4j DB.
But for a standalone server, and in case you have something like the command-line tool PuTTY on your Windows machine, this should work. Instead of $NEO4J_HOME you can also use the normal path without the env variable.
$NEO4J_HOME/bin/neo4j stop
cd $NEO4J_HOME/data
tar -cvf graph.db.tar graph.db
gzip graph.db.tar
scp -i ~/some_path/key_for_remote_server.pem ./graph.db.tar.gz username@your_remote_domain.com:~/
ssh -i ~/some_path/key_for_remote_server.pem username@your_remote_domain.com
On your remote server (at least this works for Ubuntu):
Maybe you need to use "sudo" (prefix the commands with sudo).
mv ./graph.db.tar.gz /some_path/
cd /some_path/
gunzip graph.db.tar.gz
tar -xvf graph.db.tar
$NEO4J_HOME/bin/neo4j start
$NEO4J_HOME/bin/neo4j status
You can migrate the data using the APOC procedures, by running the query below in cypher-shell on the instance from which the data needs to be exported:
CALL apoc.export.cypher.all('myfilename.cypher');
This will write a file with Cypher queries into the import folder.
Go to the database instance where the data needs to be imported and copy the file into its import folder. Run the command below using cypher-shell:
CALL apoc.cypher.runFile("myfilename.cypher",{}) YIELD row, result;
For more advanced options, follow the links below:
https://neo4j.com/docs/labs/apoc/current/export/cypher/
http://neo4j-contrib.github.io/neo4j-apoc-procedures/3.4/cypher-execution/run-cypher-scripts/
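For example, the export call above can also be run non-interactively with cypher-shell; a sketch assuming a local instance and that file export has been enabled in neo4j.conf (apoc.export.file.enabled=true):
cypher-shell -u neo4j -p <password> "CALL apoc.export.cypher.all('myfilename.cypher');"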
I found the following workaround for copying the data from one server in the cluster to all the others, after using the neo4j-import tool:
Stop all nodes.
On the new node/server, where you need your data to be copied, you have to create the database folder for that graph (in my case loadTest):
/neo4j-enterprise-3.1.0/data/databases/loadTest.db
Then, from the source node/server that holds the data, you have to copy the neostore.id file to the destination server's db folder (loadTest.db from the previous step).
Start all nodes. In the background neo4j will copy data from other cluster servers to the new node.
For embedded mode, you just need to locate the graph's neo4j-db folder, then zip it and send it to the remote system.
In your code, where you called GraphDatabaseService, you would have given the target location.
Check whether it is a relative path; the database might then be in your project folder.
Now, for running the DB instance in the browser, you will need to use the Neo4j community server and point it to the folder containing the index folder. So if your neo4j-db is located at $project/tmp/neo4j-db, then you give the file path up to this folder (the index folder will be inside it).
Edit
The folder that contains the schema and index folders needs to be zipped. You can upload and unzip the folder at a certain location on your standalone server using PuTTY. Then just change org.neo4j.server.database.location in the conf/neo4j-server.properties file.
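For example, if you unpack the database folder to data/my-graph.db under the server's home directory (a hypothetical path), the relevant line in conf/neo4j-server.properties would look roughly like:
org.neo4j.server.database.location=data/my-graph.db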

Can PostgreSQL COPY read CSV from a remote location?

I've been using JDBC with a local Postgres DB to copy data from CSV files into the database with the Postgres COPY command. I use Java to parse the existing CSV file into a CSV format that matches the tables in the DB. I then save this parsed CSV to my local disk and have JDBC execute a COPY command using the parsed CSV against my local DB. Everything works as expected.
Now I'm trying to perform the same process on a Postgres DB on a remote server using JDBC. However, when JDBC tries to execute the COPY, I get:
org.postgresql.util.PSQLException: ERROR: could not open file "C:\data\datafile.csv" for reading: No such file or directory
Am I correct in understanding that the COPY command tells the DB to look locally for this file? I.e., the remote server is looking on its own C: drive (which doesn't exist).
If this is the case, is there any way to tell the COPY command to look on my computer rather than "locally" on the remote machine? Reading through the COPY documentation, I didn't find anything that indicated this functionality.
If the functionality doesn't exist, I'm thinking of just populating the whole database locally and then copying the database to the remote server, but I just wanted to check that I wasn't missing anything.
Thanks for your help.
Create your SQL file as follows on your client machine:
COPY testtable (column1, c2, c3) FROM STDIN WITH CSV;
1,2,3
4,5,6
\.
Then execute, on your client:
psql -U postgres -f /mylocaldrive/copy.sql -h remoteserver.example.com
If you use JDBC, the best solution for you is to use the PostgreSQL COPY API:
http://jdbc.postgresql.org/documentation/publicapi/org/postgresql/copy/CopyManager.html
Otherwise (as already noted by others) you can use \copy from psql, which can access local files on the client machine.
To my knowledge, COPY with a file name can only read files on the machine where the database is running; reading from the client has to go through STDIN.
You could make a shell script where you run the Java conversion, then use psql to run a \copy command, which reads from a file on the client machine.
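A rough sketch of such a script, assuming a hypothetical csv-parser.jar that writes parsed.csv, plus a hypothetical database mydb and target table mytable:
#!/bin/sh
java -jar csv-parser.jar raw.csv parsed.csv
psql -h remoteserver.example.com -U postgres -d mydb -c "\copy mytable FROM 'parsed.csv' WITH CSV"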