Import data from Postgres to Cassandra - postgresql

I need to import data from Postgres to Cassandra using open source technologies only.
Can anyone please outline the steps I need to take?
As per instructions, I have to refrain from using DataStax software, as it comes with a license.
Steps I have already tried:
Exported one table from Postgres in CSV format and imported it into HDFS (using Sqoop). (If I take this approach, do I need to use MapReduce after this?)
Tried to import the CSV file into Cassandra using CQL; however, I got this error:
Cassandra: Unable to import null value from csv
I have tried several methods, but I am unable to find a solid plan of attack.
Can anyone please provide the steps required for the whole process? I believe many people have already done this.
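For reference, one open-source route that avoids the cqlsh null error entirely is to skip the HDFS hop and move the rows with a short Python script: psycopg2 reads from Postgres and the Apache-licensed Cassandra Python driver writes to Cassandra. This is only a sketch; the keyspace, table, and column names below are hypothetical.

    # Sketch only: keyspace, table, and column names are hypothetical.
    import csv

    import psycopg2                         # open-source Postgres adapter
    from cassandra.cluster import Cluster   # Apache-licensed Cassandra driver

    # 1. Dump one Postgres table to a local CSV file via server-side COPY.
    pg = psycopg2.connect("dbname=mydb user=me")
    with pg.cursor() as cur, open("users.csv", "w") as f:
        cur.copy_expert("COPY users TO STDOUT WITH CSV HEADER", f)
    pg.close()

    # 2. Insert the rows into Cassandra, converting empty CSV fields to None
    #    so the driver handles missing values instead of cqlsh rejecting them.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("mykeyspace")
    insert = session.prepare("INSERT INTO users (id, name, email) VALUES (?, ?, ?)")

    with open("users.csv") as f:
        for row in csv.DictReader(f):
            session.execute(insert, (int(row["id"]),
                                     row["name"] or None,
                                     row["email"] or None))
    cluster.shutdown()

Taking this route, MapReduce is not needed at all; the Sqoop/HDFS detour mainly makes sense if you also want to transform the data at scale before it lands in Cassandra.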

Related

Is there any import option available for Accumulo?

I am new to Accumulo. We are migrating from MongoDB to Accumulo, and we got a file with all of the table information from MongoDB. Is there any option available in Accumulo to import the file and create the tables on its own? From the API documentation I learned that tables can be created through the shell and also programmatically. Can anyone tell me whether there is an import option available in Accumulo to import the file and create the tables?
There is no native way to do that. MongoDB deals with JSON documents, which have an entirely different layout/schema than the way Accumulo stores things. You could try using something like http://gora.apache.org/index.html, but that requires you to change the format of your MongoDB data. If you can't do that, then you'll more than likely have to do this programmatically yourself.

Connect Neo4j to an existing PostgreSQL database

I'm a new Neo4j user and I have played around with the webadmin interface of Neo4j to create small databases and simple queries in Cypher. Now I want to use Neo4j to create a graph from my existing database. It's a PostgreSQL database with millions of entries with the same structure (Neo4j is well suited to representing this data). My question is: how do I import this data? What is the easiest way to do that? I have already seen that Cypher recognizes CSV files, but do I have to create a CSV file with my data, or is there another way to import it? Thank you for your help. Sam
One option is to export your Postgres data to CSV and use LOAD CSV to import it into the graph.
Another way is to write a script in a language of your choice (I'd vote for Groovy here) that connects to Postgres using JDBC and to Neo4j, and then applies the business logic to transform between the two (see the sketch below).
A third option is to use an ETL tool like Talend. It basically does the same as your custom script but provides a point-and-click interface to define the transformation; see http://neo4j.com/blog/fun-with-music-neo4j-and-talend/ for more details.
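As an illustration of the second option, here is a rough sketch in Python rather than Groovy, using psycopg2 for Postgres and the official Neo4j Python driver; the table, label, and property names are made up for the example.

    # Sketch only: table, label, and property names are hypothetical.
    import psycopg2
    from neo4j import GraphDatabase

    pg = psycopg2.connect("dbname=mydb user=me")
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Stream rows from Postgres with a server-side cursor and MERGE them as nodes.
    with pg.cursor(name="export") as cur, driver.session() as session:
        cur.itersize = 10000
        cur.execute("SELECT id, name FROM people")
        for pg_id, name in cur:
            session.run(
                "MERGE (p:Person {pgId: $id}) SET p.name = $name",
                id=pg_id, name=name,
            )

    driver.close()
    pg.close()

With millions of rows, batching the writes (for example, sending lists of rows through a single UNWIND query) or falling back to the LOAD CSV route from the first option will be considerably faster than one MERGE per row.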

Incrementally updating/adding data on HDFS

In my application there are 4 tables, and each table has more than 1 million rows.
Currently my Java-based reporting engine joins all the tables and gets the data to show in reports.
Now I want to introduce Hadoop using Sqoop. I have installed Hadoop 2.2 and Sqoop 1.9.
I have done a small POC to import the data into HDFS. The problem is that it creates a new data file every time.
What I need is:
There would be a scheduler which will run once a day, and it will:
Pick up the data from all four tables and load it into HDFS using Sqoop.
Pig will do some transformation and joining of the data and will prepare the concrete denormalized data.
Sqoop will then export this data into a separate reporting table.
I have a few questions around this:
Do I need to import the whole dataset from the DB to HDFS on every Sqoop import call?
In the master table some data is updated and some data is new; how can I handle that if I merge the data while loading it into HDFS?
At the time of export, do I need to export the whole dataset again to the reporting table? If yes, how would I do that?
Please help me out in this case, and please suggest a better solution if you have one.
Sqoop supports incremental and delta imports. Check the Sqoop documentation on incremental imports for more details.
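For illustration, this is roughly what incremental imports look like with the classic Sqoop 1.x command-line client; the connection string, table, and column names below are placeholders.

    # Append mode: pull only rows whose id is greater than the last recorded value.
    sqoop import \
      --connect jdbc:postgresql://dbhost/reportdb \
      --username report --password-file /user/report/.pg-pass \
      --table master_table \
      --target-dir /data/master_table \
      --incremental append \
      --check-column id \
      --last-value 0

    # Lastmodified mode with a merge key: rows updated in the source replace
    # their earlier versions in HDFS instead of piling up as duplicates.
    sqoop import \
      --connect jdbc:postgresql://dbhost/reportdb \
      --username report --password-file /user/report/.pg-pass \
      --table master_table \
      --target-dir /data/master_table \
      --incremental lastmodified \
      --check-column updated_at \
      --merge-key id \
      --last-value "2014-01-01 00:00:00"

Wrapping these in saved jobs (sqoop job --create ...) lets Sqoop remember --last-value between scheduled runs, so each daily run only moves new and changed rows; the denormalized result can then be pushed back to the reporting table with sqoop export.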

Import data to Cassandra and create the primary key

I've got some CSV data to import into Cassandra. This could work with the COPY command. The problem is that the CSV doesn't provide a unique ID for the data, so I need to create a timeuuid on import.
Is it possible to do this via the COPY command, or do I need to write an external script for the import?
I would write a quick script to do it; the COPY command can really only handle small amounts of data anyway. Try the new Python driver. I find it quite fast to set up loading scripts with, especially if you need to make any sort of minor modifications to the data before it is loaded.
If you have a really big set of data, bulk loading is still the way to go.
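A minimal sketch of such a script with the Python driver, assuming a hypothetical table events (id timeuuid PRIMARY KEY, name text, value text) and a headerless two-column CSV:

    # Sketch only: keyspace, table, and CSV layout are hypothetical.
    import csv
    import uuid

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("mykeyspace")
    insert = session.prepare("INSERT INTO events (id, name, value) VALUES (?, ?, ?)")

    with open("data.csv") as f:
        for name, value in csv.reader(f):
            # uuid.uuid1() generates a version-1 (time-based) UUID,
            # which is what Cassandra's timeuuid type expects.
            session.execute(insert, (uuid.uuid1(), name, value))

    cluster.shutdown()

If the file is large, the driver's cassandra.concurrent.execute_concurrent_with_args helper (or async execution) will load it much faster than one synchronous insert per row.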

Download HTTP data with a Postgres stored procedure

I am wondering if there is a way to import data from an HTTP source from within a pgsql function.
I am porting an old system that harvests data from a website. Rather than maintaining a separate set of files to manage the downloading of the data, I was hoping to put the import routines directly into stored procedures.
I do know how to import data with COPY, but that requires the data to already be available locally. Is there a way to download the data with PL/pgSQL? Am I out to lunch?
Related: How to import CSV file data into a PostgreSQL table?
Depending on what you're after, the Postgres extension www_fdw might work for you: http://pgxn.org/dist/www_fdw/
If you want to download custom data over HTTP, then PostgreSQL's extensive support for different procedural languages might be handy. Here is an example of connecting to the Google Translate service from a Postgres function written in Python:
https://wiki.postgresql.org/wiki/Google_Translate
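Along the lines of that example, a minimal PL/Python sketch (assuming the plpython3u extension can be installed, which needs superuser rights, and that returning the body as text is enough):

    -- Sketch only: requires CREATE EXTENSION plpython3u;
    CREATE OR REPLACE FUNCTION fetch_url(url text) RETURNS text AS $$
        from urllib.request import urlopen
        return urlopen(url).read().decode('utf-8')
    $$ LANGUAGE plpython3u;

    -- Usage: fetch a remote file, then parse it and INSERT from SQL or
    -- another function (the URL here is just a placeholder).
    SELECT fetch_url('https://example.com/data.csv');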