How to do an ultra-fast batch import in OrientDB from CSV files?

We are evaluating graph databases to store our networked communication data and have narrowed the choice down to Neo4j and OrientDB.
Is there a batch importer tool or script for OrientDB similar to the one Neo4j has? I was able to import CSV files with 150M relationships and 18M nodes into Neo4j in under 25 minutes. From the documentation on the OrientDB site, it looks like I need to use the ETL feature and modify a JSON file to do the import. Is there no simpler and faster way to import from CSV files?

Using OrientDB ETL is pretty easy. Look at: http://orientdb.com/docs/last/Import-from-CSV-to-a-Graph.html. Just create your json with the ETL steps and it's done.
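As a rough illustration of what such a JSON file can look like, here is a minimal sketch that generates one from Python. The key layout (source, extractor, transformers, loader) follows the linked ETL documentation as I remember it, so check it against your OrientDB version. The file paths, the database URL and the "Node" class name are made-up placeholders.

```python
import json


def build_etl_config(csv_path, db_url, vertex_class):
    """Return an OrientDB ETL configuration as a plain dict.

    All argument values are placeholders chosen by the caller; the key
    names are assumed from the OrientDB ETL documentation.
    """
    return {
        # Where the CSV file lives on disk.
        "source": {"file": {"path": csv_path}},
        # Parse the input as CSV with default settings.
        "extractor": {"csv": {}},
        # Turn every CSV row into a vertex of the given class.
        "transformers": [{"vertex": {"class": vertex_class}}],
        # Write the result into a graph database.
        "loader": {
            "orientdb": {
                "dbURL": db_url,
                "dbType": "graph",
                "classes": [{"name": vertex_class, "extends": "V"}],
            }
        },
    }


if __name__ == "__main__":
    config = build_etl_config(
        "/tmp/nodes.csv", "plocal:/tmp/databases/comm", "Node"
    )
    with open("nodes_etl.json", "w") as fh:
        json.dump(config, fh, indent=2)
```

The generated file would then be passed to the oetl.sh script that ships in OrientDB's bin directory. Edges need an additional "edge" transformer, which is not shown here.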

Related

Migrate from OrientDB to AWS Neptune

I need to migrate a database from OrientDB to Neptune. I have an exported JSON file from OrientDB that contains the schema (classes) and the records, and I now need to import this into Neptune. However, it seems that to import data into Neptune there must be one CSV file containing all the vertices and another file containing all the edges.
Are there any existing tools to help with this migration and converting to the required files/format?
If you are able to export the data as GraphML then you can use the GraphML2CSV tool. It will create a CSV file for the nodes and another for the edges with the appropriate header rows.
Keep in mind that GraphML is a lossy format (it cannot describe complex Java types the way GraphSON can) but you would not be able to import those into Neptune either.
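If you end up writing the conversion yourself, the target format is simple: Neptune's bulk loader expects a vertex file whose header starts with ~id and ~label, and an edge file with ~id, ~from, ~to and ~label. Below is a minimal sketch of that output step. The shape of the input dicts ('id', 'label', 'props', 'from', 'to') is my own assumption, not the OrientDB export format, so you would first need to map your exported records onto it. Every property is written as a String column to keep the example short.

```python
import csv
import io


def to_neptune_csv(vertices, edges):
    """Return (vertex_csv, edge_csv) as strings in Neptune's CSV load format.

    vertices: dicts with 'id', 'label' and an optional 'props' dict.
    edges:    dicts with 'id', 'from', 'to' and 'label'.
    This input shape is hypothetical, chosen only for the illustration.
    """
    # Collect every property name so all rows share the same columns.
    prop_names = sorted({k for v in vertices for k in v.get("props", {})})

    v_buf = io.StringIO()
    writer = csv.writer(v_buf, lineterminator="\n")
    writer.writerow(["~id", "~label"] + [f"{p}:String" for p in prop_names])
    for v in vertices:
        props = v.get("props", {})
        writer.writerow(
            [v["id"], v["label"]] + [props.get(p, "") for p in prop_names]
        )

    e_buf = io.StringIO()
    writer = csv.writer(e_buf, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    for e in edges:
        writer.writerow([e["id"], e["from"], e["to"], e["label"]])

    return v_buf.getvalue(), e_buf.getvalue()
```

OrientDB record IDs such as #9:0 can be reused as the ~id values, as long as they are used consistently in both files.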

What is the common practice for storing user data and analyzing it with Spark/Hadoop?

I'm new to Spark. I'm a web developer by background and not familiar with big data.
Let's say I have a portal website. Users' behavior and actions are stored in 5 sharded MongoDB clusters.
How do I analyze this data with Spark?
Or can Spark read the data directly from any database (PostgreSQL/MongoDB/MySQL/...)?
I ask because most websites use a relational database as their back end.
Should I export all the data in the website's databases into HBase?
I store all the user logs in PostgreSQL. Is it practical to export the data into HBase or another database that Spark prefers?
It seems that copying the data to a new database will create a lot of duplicated data.
Does my big data setup need any framework other than Spark?
For analyzing the data in the website's databases,
I don't see why I would need HDFS, Mesos, ...
How can I make the Spark workers access the data in the PostgreSQL databases?
I only know how to read data from a text file,
and I have seen some code that loads data from hdfs://.
But I don't have an HDFS system now. Should I set one up for this purpose?
Spark is a distributed compute engine, so it expects the data to be accessible from all nodes. Here are some choices you might consider:
There seems to be a Spark-MongoDB connector. This post explains how to get it working.
Export the data out of MongoDB into Hadoop, and then use Spark to process the files. For this, you need to have a Hadoop cluster running.
If you are on Amazon, you can put the files in S3 and access them from Spark.
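For the PostgreSQL part of the question, Spark can also read a table over JDBC without any HDFS in between. The following is a minimal PySpark sketch of that route, not something taken from the answer above. The host, database, table and credentials are placeholders, and the PostgreSQL JDBC driver jar has to be available to Spark.

```python
def postgres_jdbc_options(host, database, table, user, password, port=5432):
    """Build the option dict for Spark's JDBC data source.

    All values are placeholders supplied by the caller.
    """
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def load_table(spark, options):
    """Read the table as a DataFrame using an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Usage, assuming pyspark is installed and the driver jar is on the classpath:
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("user-logs").getOrCreate()
#   opts = postgres_jdbc_options("db.example.com", "portal", "user_logs",
#                                "spark", "secret")
#   logs = load_table(spark, opts)
#   logs.groupBy("action").count().show()
```

By default this reads through a single connection. For large tables the JDBC source also accepts partitionColumn, lowerBound, upperBound and numPartitions options so that the workers read in parallel.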

Importing AccessDB and Oracle directly into MongoDB

I am receiving .dmp and .mdb files from a customer & need to get that data into MongoDB.
Is there any way to straight import these file types into Mongo?
The goal is to programmatically ingest these into mongo in any way I can. The only rule is that customer will not change their method of data delivery, meaning I'm stuck with the .dmp and .mdb files as a source.
Any assistance would be greatly appreciated.
Here are a few options/ideas:
Convert mdb to csv, then use mongoimport --type csv to import into MongoDB.
Use an ETL tool, e.g. Pentaho, Informatica, etc. This will give you much more flexibility for doing any necessary transformation/conversion of data.
Write a custom ETL tool, using libraries that know how to read mdb and dmp files.
You don't mention how you plan to use this data, how many tables are in the database, and how normalized the tables are. Depending on the specifics of your use case, it's very possible that loading the data from Access "as is" will not be a good choice since normalized schemas are not a good fit for MongoDB and MongoDB does not natively support joins. This is where an ETL tool can help, by extracting the source data and transforming it into an appropriate JSON structure.
MongoDB has released ODBC drivers. The MongoDB ODBC drivers connect MS Access directly to MongoDB through ODBC. Voila!
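To make the CSV route from the first option a bit more concrete, here is a small sketch that wraps the mongoimport call so it can be run for each exported table. It assumes the .mdb tables have already been converted to CSV files with a header row. The database, collection and file names are placeholders.

```python
import subprocess


def mongoimport_csv_command(db, collection, csv_file):
    """Build the mongoimport argument list for a CSV file with a header row."""
    return [
        "mongoimport",
        "--db", db,
        "--collection", collection,
        "--type", "csv",
        "--headerline",  # take the field names from the first CSV line
        "--file", csv_file,
    ]


def run_import(db, collection, csv_file):
    """Run mongoimport and raise if it exits with a non-zero status."""
    subprocess.run(mongoimport_csv_command(db, collection, csv_file), check=True)


# Usage, assuming mongoimport is on the PATH and customers.csv exists:
#   run_import("customer_data", "customers", "customers.csv")
```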

Connect Neo4J on an existing Postgresql database

I'm a new Neo4j user and I have played around with the Neo4j webadmin interface to create small databases and simple Cypher queries. Now I want to use Neo4j to create a graph from my existing database. It's a PostgreSQL database with millions of entries that all have the same structure (Neo4j is well suited to representing this data). My question is: how do I import this data, and what is the easiest way to do it? I have already seen that Cypher recognizes CSV files, but do I have to create a CSV file with my data, or is there another way to import it? Thank you for your help. Sam
One option is to export your postgres data to csv and apply LOAD CSV to import them into the graph.
Another way is writing a script in a language of choice (I'd vote for groovy here) that connects to Postgres using JDBC and connects to Neo4j and then applies the business logic to transform between the two.
A third option is using an ETL tool like Talend. It basically does the same as your custom script but provides a point & click interface to define the transformation, see http://neo4j.com/blog/fun-with-music-neo4j-and-talend/ for more details.
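For the first option, the Cypher statement itself is short. Here is a sketch that builds a LOAD CSV statement for one exported table. The label, key column and property names are placeholders standing in for your own schema, and the values are pasted into the statement without escaping, so only use it with names you control.

```python
def load_csv_cypher(file_url, label, key, properties):
    """Build a LOAD CSV statement that merges one node per CSV row.

    file_url:   URL of the CSV file, e.g. 'file:///people.csv'
    label:      node label to create (placeholder)
    key:        CSV column used to identify a node
    properties: other CSV columns to copy onto the node
    """
    statement = (
        f"LOAD CSV WITH HEADERS FROM '{file_url}' AS row "
        f"MERGE (n:{label} {{{key}: row.{key}}})"
    )
    assignments = ", ".join(f"n.{p} = row.{p}" for p in properties)
    if assignments:
        statement += f" SET {assignments}"
    return statement


if __name__ == "__main__":
    print(load_csv_cypher("file:///people.csv", "Person", "id", ["name"]))
```

Note that LOAD CSV reads every value as a string, so numeric columns need an explicit conversion such as toInteger(row.id) in the statement. With millions of rows it also helps to create an index on the key property before loading.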

Mongodb to redshift

We have a few collections in mongodb that we wish to transfer to redshift (on an automatic incremental daily basis).
How can we do it? Should we export the mongo to csv?
I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo but we found Redshift offered very large performance improvements for query. So first of all we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.
To store JSON data in Redshift first you need to create a SQL DDL to store the schema in Redshift i.e. a CREATE TABLE script.
You can use a tool like Variety to help as it can give you some insight into your Mongo schema. However it does struggle with big datasets - you might need to subsample your dataset.
Alternatively DDLgenerator can generate DDL from various sources including CSV or JSON. This also struggles with large datasets (well the dataset I was dealing with was 120GB).
So in theory you could use mongoexport to generate CSV or JSON from Mongo and then run it through DDLgenerator to get a DDL.
In practice I found using JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:
mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
head data.json > sample.json
ddlgenerator postgresql sample.json
Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
Next you upload the data into Redshift.
If you have exported JSON, you need to use Redshift's Copy from JSON feature. You need to define a JSONpath to do this.
For more information check out the Snowplow blog - they use JSONpaths to map the JSON on to a relational schema. See their blog post about why people might want to read JSON to Redshift.
Turning the JSON into columns allows much faster queries than other approaches such as using JSON_EXTRACT_PATH_TEXT.
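To give an idea of what the JSONPaths step involves, here is a minimal sketch that builds a JSONPaths file and the matching COPY statement. The field list, table name, S3 URLs and IAM role are all placeholders. The order of the paths has to match the column order of the target table.

```python
import json


def build_jsonpaths(fields):
    """Return the text of a Redshift JSONPaths file for the given fields.

    Each field is a dotted path into the JSON document, e.g. 'user.name'.
    """
    return json.dumps({"jsonpaths": [f"$.{f}" for f in fields]}, indent=2)


def build_copy_statement(table, data_url, jsonpaths_url, iam_role):
    """Return a COPY statement that loads JSON data using a JSONPaths file."""
    return (
        f"COPY {table} FROM '{data_url}' "
        f"IAM_ROLE '{iam_role}' "
        f"JSON '{jsonpaths_url}';"
    )


if __name__ == "__main__":
    print(build_jsonpaths(["id", "user.name", "created_at"]))
    print(
        build_copy_statement(
            "events",
            "s3://my-bucket/export/data.json",
            "s3://my-bucket/export/jsonpaths.json",
            "arn:aws:iam::123456789012:role/redshift-load",
        )
    )
```

The JSONPaths file is uploaded to S3 next to the data, and the COPY statement is then run from any SQL client connected to the cluster.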
For incremental backups, it depends if data is being added or data is changing. For analytics, it's normally the former. The approach I used is to export the analytic data once a day, then copy it into Redshift in an incremental fashion.
Here are some related resources although in the end I did not use them:
Spotify has an open-source project called Luigi - this code claims to upload JSON to Redshift but I haven't used it so I don't know if it works.
Amiato have a web page that says they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
Related Reddit question
Blog post about dealing with JSON arrays in Redshift
Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It'll take your mongo collections and flatten them into their own tables in redshift.
AWS Database Migration Service (DMS) has added support for MongoDB and Amazon DynamoDB, so I think that from now on the best option for migrating from MongoDB to Redshift is DMS.
MongoDB versions 2.6.x and 3.x as a database source
Document Mode and Table Mode supported
Supports change data capture (CDC)
Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
A few questions that would be helpful to know would be:
Is this an add-only, always-increasing incremental sync, i.e. data is only being added and never updated or removed, or is your Redshift instance only interested in additions?
Is it acceptable for the data to become inconsistent because deletes and updates at the source are not fed to the Redshift instance?
Does it need to be a daily incremental batch, or can it also happen in real time as the changes occur?
Depending on your situation, mongoexport may work for you, but you have to understand its shortcomings, which are described at http://docs.mongodb.org/manual/reference/program/mongoexport/ .
I had to tackle the same issue (not on a daily basis though).
As mentioned, you can use mongoexport to export the data, but keep in mind that Redshift doesn't support arrays, so if your collections contain arrays you'll find it a bit problematic.
My solution to this was to pipe the mongoexport output into a small utility program I wrote that transforms the mongoexport JSON rows into my desired CSV output.
Piping the output also allows you to make the process parallel.
mongoexport allows you to add a MongoDB query to the command, so if your collection data supports it you can spawn N different mongoexport processes, pipe their results into the other program and decrease the total runtime of the migration process.
Later on, I uploaded the files to S3, and performed a COPY into the relevant table.
This should be a pretty easy solution.
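The utility itself is not shown in the answer. As a rough idea of what such a filter could look like, here is a minimal sketch that reads mongoexport's default output (one JSON document per line) and writes CSV rows. The column list is a placeholder, and serializing arrays and nested documents as JSON text is just one possible way of dealing with them, not necessarily what the original program did.

```python
import csv
import io
import json


def export_lines_to_csv(lines, columns):
    """Convert mongoexport JSON lines into CSV text with the given columns.

    Missing fields become empty strings. Arrays and nested documents are
    written as JSON text, since Redshift has no array column type.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        doc = json.loads(line)
        row = []
        for col in columns:
            value = doc.get(col, "")
            if isinstance(value, (list, dict)):
                value = json.dumps(value)
            row.append(value)
        writer.writerow(row)
    return buf.getvalue()


if __name__ == "__main__":
    import sys

    # Usage: mongoexport ... | python to_csv.py > part.csv
    sys.stdout.write(export_lines_to_csv(sys.stdin, ["_id", "name", "tags"]))
```

Because the filter only reads from standard input, each of the N parallel mongoexport processes can be piped into its own copy of it.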
Stitch Data is the best tool I've ever seen for replicating incrementally from MongoDB to Redshift within a few clicks and minutes.
It automatically and dynamically detects DML and DDL changes on the tables being replicated.