Adding new data to the Neo4j graph database - import

I am importing a huge dataset of about 46K nodes into Neo4j using the import option. The dataset is dynamic, i.e. new entries keep getting added to it now and then, so re-running the entire import every time is a waste of resources. I tried using the Neo4j REST client for Python to send queries that create the new data points, but as the number of new data points grows, the time taken exceeds that of importing the 46K nodes. Is there any alternative way to add these data points, or do I have to redo the entire import?

First of all - 46k nodes is rather tiny.
The easiest way to import data into Neo4j is LOAD CSV together with PERIODIC COMMIT. http://neo4j.com/developer/guide-import-csv/ contains all the details.
Be sure to have indexes in place so an incremental update can quickly find the nodes that need to be changed.
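For the incremental part, here is a rough sketch with the official Neo4j Python driver (the bolt URI, credentials, label and property names below are placeholders, and older servers use {param} instead of $param in Cypher), using MERGE so existing nodes are matched via the index and only new ones are created:

    from neo4j import GraphDatabase

    # Placeholders: adjust the URI, credentials, label and property names to your graph.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Run once beforehand so MERGE can find existing nodes through the index:
    #   CREATE INDEX ON :Node(id)
    UPSERT = """
    MERGE (n:Node {id: $id})
    ON CREATE SET n.name = $name
    ON MATCH SET n.name = $name
    """

    def upsert_batch(rows):
        # Send all new/changed data points in one transaction instead of one HTTP call each.
        with driver.session() as session:
            with session.begin_transaction() as tx:
                for row in rows:
                    tx.run(UPSERT, id=row["id"], name=row["name"])
                tx.commit()

    upsert_batch([{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}])

Batching many MERGE statements into a single transaction avoids the per-request overhead that makes the REST-client approach slow; the initial bulk load can still be the LOAD CSV statement from the guide above.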

Related

When you create a free form report in MicroStrategy, is it possible to do automatic mapping?

When you finish the free form query in MicroStrategy, the next step is to map the columns.
Is there any way to do this automatically? At least get a list of the columns with their names.
Thanks!
Sadly, this isn't possible. You will have to map all columns manually.
While this functionality isn't possible with freeform reporting specifically, MicroStrategy Data Import will allow you to create Data Import cubes. These cubes can be configured as live connections, meaning they execute against the selected data source every time they are used, rather than being the typical snapshot cube. A Data Import from a database can be sourced from a database query, which effectively lets you write your own SQL, with the end result being a report whose columns you did not have to map manually.

How to import datasets as a CSV file to Power BI using the REST API?

I want to automate the import process in Power BI, but I can't find how to publish a CSV file as a dataset.
I'm using a C# solution for this.
Is there a way to do that?
You can't directly import CSV files into a published dataset in the Power BI Service. The AddRowsAPIEnabled property of datasets published from Power BI Desktop is false, i.e. this API is disabled. Currently the only way to enable it is to create a push dataset programmatically using the REST API (or create a streaming dataset from the site). You can then push rows to it: read the CSV file and push batches of rows, using C# or some other language, even PowerShell, and you can build reports on this dataset. However, there are a lot of limitations, and you should take care of cleaning up the dataset to avoid reaching the 5 million row limit (you can't delete "some" of the rows, only truncate the whole dataset), or use the basicFIFO retention policy, which lowers the limit to 200k rows.
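For illustration, a sketch of the push-dataset approach in Python rather than C#. It assumes you have already acquired an Azure AD access token for the Power BI REST API, and the dataset, table and column names are made up:

    import csv
    import requests

    API = "https://api.powerbi.com/v1.0/myorg"
    HEADERS = {"Authorization": "Bearer <access_token>"}  # token acquisition not shown

    # 1. Create a push dataset (done once). basicFIFO keeps only the newest rows.
    dataset_def = {
        "name": "SalesFromCsv",                # hypothetical dataset name
        "defaultMode": "Push",
        "tables": [{
            "name": "Sales",
            "columns": [
                {"name": "Date", "dataType": "DateTime"},
                {"name": "Amount", "dataType": "Double"},
            ],
        }],
    }
    resp = requests.post(f"{API}/datasets?defaultRetentionPolicy=basicFIFO",
                         headers=HEADERS, json=dataset_def)
    dataset_id = resp.json()["id"]

    # 2. Read the CSV and push its rows in batches.
    with open("sales.csv", newline="") as f:
        rows = [{"Date": r["Date"], "Amount": float(r["Amount"])} for r in csv.DictReader(f)]

    for i in range(0, len(rows), 10000):       # the rows endpoint accepts batches, not huge payloads
        requests.post(f"{API}/datasets/{dataset_id}/tables/Sales/rows",
                      headers=HEADERS, json={"rows": rows[i:i + 10000]})

The same two calls (create the dataset, then post rows) can be made from C# with HttpClient or the Power BI client SDK.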
However, a better solution would be to automate the import of these CSV files into some database and make the report read the data from there. For example, import the files into Azure SQL Database or Databricks and use that as the data source for your report. You can then schedule the refresh of the dataset (if you use Import mode) or use DirectQuery.
After a recent Power BI update, it is now possible to import the dataset without importing the whole report.
So what I do is import the new dataset and update the parameters that I set up for the CSV file source (stored in a data lake).

Using Postgres to replace CSV files (pandas to load data)

I have been saving files as .csv for over a year now and connecting those files to Tableau Desktop for visualization for some end-users (who use Tableau Reader to view the data).
I think I have settled on migrating to PostgreSQL, and I will be using the pandas to_sql function to fill it up.
I get 9 different files each day and I process each of them (I currently consolidate them into monthly files in .csv.bz2 format) by adding columns, calculations, replacing information, etc.
I create two massive CSV files out of those processed files using pd.concat and pd.merge, which Tableau is connected to. These files are literally overwritten every day when new data is added, which is time consuming.
Is it okay to still do my file joins and concatenation with pandas and export the output data to postgres? This will be my first time using a real database and I am more comfortable with pandas compared to learning SQL syntax and creating views or tables. I just want to avoid overwriting the same csv files over and over (and some other csv problems I run into).
Don't worry too much about normalization. A properly normalized database will usually be more efficient and easier to handle than a non-normalized one. On the other hand, if you dump non-normalized CSV data into a database, your import functions will be a lot more complicated if you do proper normalization. I would recommend taking one step at a time: start by just loading the processed CSV files into Postgres. I am pretty sure all processing after that will be a lot easier and quicker than doing it with CSV files (just make sure you set up the right indexes). Once you get used to using the database, you can start doing more of the processing there.
Just remember, one thing a database is really good at is picking out the subset of data you want to work on. Try as much as possible to avoid pulling huge amounts of data out of the database when you only intend to work on a subset of it.
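A minimal sketch of that first step with pandas plus SQLAlchemy, where the connection string, table and column names are placeholders:

    import pandas as pd
    from sqlalchemy import create_engine, text

    # Placeholder connection string; requires a Postgres driver such as psycopg2.
    engine = create_engine("postgresql://user:password@localhost:5432/reports")

    # Append each processed daily file to a table instead of rewriting a giant CSV.
    df = pd.read_csv("processed_2024-01-15.csv.bz2", parse_dates=["date"])
    df.to_sql("daily_data", engine, if_exists="append", index=False)

    # An index makes "give me just this subset" queries fast (run once).
    with engine.begin() as conn:
        conn.execute(text("CREATE INDEX IF NOT EXISTS idx_daily_date ON daily_data (date)"))

    # Pull only the slice you need rather than the whole table.
    subset = pd.read_sql("SELECT * FROM daily_data WHERE date >= '2024-01-01'", engine)

to_sql also accepts a chunksize argument if the daily files get large enough that a single insert is unwieldy.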

Incrementally updating/adding data on HDFS

In my application there are 4 tables, and each table has more than 1 million rows.
Currently my Java-based reporting engine joins all the tables and fetches the data to show in the reports.
Now I want to introduce Hadoop using Sqoop. I have installed Hadoop 2.2 and Sqoop 1.9.
I have done a small POC to import the data into HDFS. The problem is that every time it creates a new data file.
My need is:
There would be a scheduler that runs once a day, and it will:
Pick the data from all four tables and load it into HDFS using Sqoop.
Pig will do some transformation and joining of the data and will prepare the concrete denormalized data.
Sqoop will again export this data into a separate reporting table.
I have a few questions around this:
Do I need to import the whole data from the DB to HDFS on every Sqoop import call?
In the master table some data is updated and some data is new, so how can I handle that when I merge the data while loading into HDFS?
At the time of export, do I need to export the whole data again to the reporting table? If yes, how would I do that?
Please help me out in this case, and please suggest a better solution if you have one.
Sqoop supports incremental and delta imports. Check the Sqoop documentation for more details.
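For illustration only: the flags below come from the Sqoop 1.x command-line client (the Sqoop2 line uses a job-based interface instead), and the connection string, table and column names are invented. A daily scheduler could invoke something like:

    import subprocess

    # --check-column/--last-value tell Sqoop which rows are new or changed since
    # the previous run, so only the delta is imported instead of the whole table.
    last_value = "2024-01-15 00:00:00"   # in practice, carry this over from the previous run

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/reports",
        "--username", "etl", "--password-file", "/user/etl/.pw",
        "--table", "master_table",
        "--incremental", "lastmodified",     # use "append" for insert-only tables
        "--check-column", "updated_at",
        "--last-value", last_value,
        "--merge-key", "id",                 # folds updated rows into the existing HDFS files
        "--target-dir", "/data/master_table",
    ], check=True)

A saved Sqoop job (sqoop job --create ...) can also remember the last value between runs, so the scheduler does not have to track it itself.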

Import data to Cassandra and create the primary key

I've got some CSV data to import into Cassandra. This could work with the COPY command. The problem is that the CSV doesn't provide a unique ID for the data, so I need to create a timeuuid on import.
Is it possible to do this via the COPY command, or do I need to write an external script for the import?
I would write a quick script to do it; the COPY command can really only handle small amounts of data anyway. Try the new Python driver. I find it quite fast to set up loading scripts with, especially if you need any sort of minor modification of the data before it is loaded.
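A minimal sketch with the DataStax Python driver, where the keyspace, table and column names are placeholders and the timeuuid is generated client-side with uuid.uuid1():

    import csv
    import uuid

    from cassandra.cluster import Cluster

    # Placeholder cluster address, keyspace and table.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("mykeyspace")

    insert = session.prepare(
        "INSERT INTO events (id, name, value) VALUES (?, ?, ?)"   # id is a timeuuid column
    )

    with open("data.csv", newline="") as f:
        for row in csv.DictReader(f):
            # uuid1() produces a time-based (version 1) UUID, which the driver maps to
            # a timeuuid column, so each CSV row gets its own generated ID.
            session.execute(insert, (uuid.uuid1(), row["name"], float(row["value"])))

    cluster.shutdown()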
If you have a really big set of data bulk-loading is still the way to go.