Neo4j Import tool - Console output meaning

What does the console output of the Neo4j import tool mean?
Example lines:
[INPUT--------------PROPERTIES(2)======|WRITER: W:71.] 3M
[INPUT------|PREPARE(|RELATIO||] 49M
[Relationship --> Relationship + counts-------]282M
When I try to import a large dataset through this tool, it seems that at 248M the import hangs in the 'Calculate dense nodes' step. What exactly does 'calculating dense nodes' do?

The import stages are:
Nodes
Prepare node index
Calculate dense nodes
Node --> Relationship Sparse
Relationship --> Relationship Sparse
Node counts
Relationship counts
As for interpreting the statistics, I believe Mattias Persson wrote the relevant section in the Neo4j manual. Copying it here, for the record:
10.1.2.5. Output and statistics
While an import is running through its different stages, some statistics and figures are printed in the console. The general interpretation of that output is to look at the horizontal line, which is divided up into sections, each section representing one type of work going on in parallel with the other sections. The wider a section is, the more time is spent there relative to the other sections, the widest being the bottleneck, also marked with *. If a section has a double line, instead of just a single line, it means that multiple threads are executing the work in that section. To the far right a number is displayed telling how many entities (nodes or relationships) have been processed by that stage.
As an example:
[*>:20,25 MB/s-----------|PREPARE(3)==========|RELATIONSHIP(2)===========] 16M
Would be interpreted as:
> data being read, and perhaps parsed, at 20,25 MB/s, data that is being passed on to …
PREPARE preparing the data for …
RELATIONSHIP creating actual relationship records and …
v writing the relationships to the store. This step is not visible in this example, because it is so cheap compared to the other sections.
Observing the section sizes can give hints about where performance can be improved. In the example above, the bottleneck is the data read section (marked with >), which might indicate that the disk is being slow, or is poorly handling simultaneous read and write operations (since the last section often revolves around writing to disk).

After some offline discussion, it seems that most, if not all, "missing" nodes were due to one line in the CSV file having a property value that started with a quotation mark (") but had no closing quote. This resulted in the parser reading until the next quote, i.e. across new-lines, thinking it was still reading the same property value for that node.
It would be great to have some sort of detection for such missing quotes, but that's not straightforward, given that it might interfere with nodes/relationships whose values legitimately span multiple lines.
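Outside the import tool itself, a crude check can still flag a likely runaway quote. Here is a minimal Python sketch along those lines; the file name and line-span threshold are made up, and a value that legitimately spans a few lines will not be flagged as long as it stays under the threshold:

MAX_SPAN = 5  # hypothetical: quoted values spanning more lines than this are suspicious

def find_runaway_quote(path, max_span=MAX_SPAN):
    in_quotes = False
    open_line = None  # line number where the current quoted value started
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            for ch in line:
                if ch == '"':
                    # A doubled quote ("") toggles twice, so it does not change
                    # whether we are inside a quoted value.
                    in_quotes = not in_quotes
                    open_line = lineno if in_quotes else None
            if in_quotes and lineno - open_line >= max_span:
                print(f"possible missing closing quote near line {open_line}")
                return open_line
    return None

find_runaway_quote("nodes.csv")  # hypothetical input file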


Is it a good idea to have a big set/list as a column in ScyllaDB?

Is it a good idea to have a table in ScyllaDB with a set-typed column containing a couple of thousand elements, e.g. 5,000 elements?
The Scylla documentation states:
Collections are meant for storing/denormalizing a relatively small amount of data. They work well for things like “the phone numbers of a given user”, “labels applied to an email”, etc. But when items are expected to grow unbounded (“all messages sent by a user”, “events registered by a sensor”…), then collections are not appropriate, and a specific table (with clustering columns) should be used. ~ [source]
My column is much bigger than "the phone numbers of a given user", but much smaller than "all messages sent by a user" (the set column is going to be frozen, if that matters), so I am unsure what to do.
If your set is frozen, you can be a little more relaxed about it. This is because ScyllaDB will not have to break it into components and re-create it as often as it does with non-frozen sets.
So if you're sure the frozen set won't be larger than a megabyte or so, it will be fine. For simple read/write queries it will be treated as a blob.
The main downside of having a large individual cell - a frozen set, a string, or even an unfrozen set - is that the CQL API does not give you an efficient way to read or write only part of that cell. For example, every time you want to access your set, Scylla will need to read it entirely into memory. This takes time and effort. Even worse, it also increases the latency of other requests, because Scylla's scheduling is cooperative and does not switch tasks in the middle of handling a single cell, which is assumed to be fairly small.
Whether 5,000 elements specifically is too much also depends on the size of each element: 5,000 elements of 10 bytes each total 50 KB, but at 100 bytes each they total 500 KB. A 500 KB cell will certainly increase tail latency noticeably, though this may or may not matter for your application. If you can't think of a data model that avoids large collections, you can certainly try the one you have in mind and check whether the performance is acceptable to you.
In any case, if your use case involves unbounded collections - i.e., 5,000 elements is not a hard limit but some sort of average - and some rows actually end up with a million elements, you're in for a world of pain :-( You can start to see huge latencies (a single 1-million-cell row delays many other requests waiting in line) and, in extreme cases, even allocation failures. So you will somehow need to avoid this problem. Avoiding it isn't always easy: Scylla doesn't have a feature that prevents your 5,000-element set from growing into a million-element set (see https://github.com/scylladb/scylladb/issues/10070).
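If it helps to see the two layouts side by side, here is a minimal sketch using the Python cassandra-driver; the keyspace, table, and column names are made up for illustration:

from cassandra.cluster import Cluster

# Assumes a ScyllaDB (or Cassandra-compatible) node listening on localhost.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
# Option A: a frozen set, read and written as one blob-like cell.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users_frozen (
        user_id uuid PRIMARY KEY,
        labels  frozen<set<text>>
    )
""")
# Option B: a clustering column, which tolerates unbounded growth and lets you
# read or write individual elements instead of the whole collection.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.user_labels (
        user_id uuid,
        label   text,
        PRIMARY KEY (user_id, label)
    )
""")
cluster.shutdown()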

Google Dataprep (Cloud Dataprep by Trifacta) tip: jobs will not be able to run if they are too large

During my Cloud Dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in performing a number of unions or joins with these sets, Dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that Dataprep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably generates Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). Dataflow will then perform a parallel do (ParDo) where it takes each element from a PCollection, hands it to a worker node, and applies the side-input data during the transformation.
So I am pretty sure that if the reference sets get too big (which can happen with joins), the underlying code will take an element from dataset A and pass it to a function with side input B... but if side input B is very big, it won't fit into worker memory. Take a look at the Stackdriver logs for your job to investigate whether this is the case. If you see 'GC (Allocation Failure)' in your logs, that is a sign of not enough memory.
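To make the side-input pattern concrete, here is a minimal Apache Beam Python sketch of what such generated code roughly looks like; Dataprep writes its own pipeline, and the bucket paths and the placeholder join logic here are invented:

import apache_beam as beam
from apache_beam.pvalue import AsList

with beam.Pipeline() as p:
    main = p | "ReadMain" >> beam.io.ReadFromText("gs://my-bucket/file_a.csv")
    reference = p | "ReadReference" >> beam.io.ReadFromText("gs://my-bucket/file_b.csv")
    # The entire reference PCollection is materialized for each worker as a side
    # input, which is why a very large reference dataset can exhaust worker memory.
    enriched = main | "Enrich" >> beam.Map(
        lambda row, ref_rows: (row, len(ref_rows)),  # placeholder "join" logic
        ref_rows=AsList(reference),
    )
    enriched | "Write" >> beam.io.WriteToText("gs://my-bucket/output/enriched")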
You can try the following: suppose you have two CSV files to read in and process, where file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of join, it will very quickly outgrow the worker memory and fail. If you can, pre-process the data so that one of the files is in the MB range and let only the other file stay large.
If your data structures don't lend themselves to that option, you could do what the systems engineers suggested: split one file up into many small chunks and then feed them to the recipe iteratively against the other, larger file.
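A rough sketch of that chunking step in plain Python; the file names and chunk size are made up:

import csv

CHUNK_ROWS = 500_000  # hypothetical chunk size; tune to what your jobs can handle

def write_chunk(rows, header, part):
    with open(f"file_b_part{part:03d}.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)

with open("file_b.csv", newline="") as src:
    reader = csv.reader(src)
    header = next(reader)
    chunk, part = [], 0
    for row in reader:
        chunk.append(row)
        if len(chunk) >= CHUNK_ROWS:
            write_chunk(chunk, header, part)
            chunk, part = [], part + 1
    if chunk:
        write_chunk(chunk, header, part)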
Another option to test is specifying a larger machine type for the workers. You can step up the machine type incrementally to see whether the job finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud Dataflow.
Hopefully these guys fix the problem soon; they don't make it easy to ask them questions, that's for sure.

Neo4j's MERGE command on big datasets

Currently, I am working on a project of implementing a Neo4j (V2.2.0) database in the field of web-analytics. After loading some samples, I'm trying to load a big data set (>1GB, >4M lines). The problem I am facing, is that the usage of the MERGE command takes exponentially more time as the data size grows. Online sources are ambiguous on what the best way is to load big sets of data when not every line has to be loaded as a node, and I would like some clarity on the subject. To emphasize, in this situation I am just loading the nodes; relations are the next step.
Basically, there are three methods:
i) Set a uniqueness constraint for a property, and create all nodes. This method was used mainly before the MERGE command was introduced.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
followed by
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
CREATE (:Book {isbn: row.isbn, title: row.title, etc})
In my experience, this will return an error if a duplicate is found, which stops the query.
ii) Merging the nodes with all their properties.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (:Book {isbn: row.isbn, title: row.title, etc})
I have tried loading my set in this manner, but after letting the process run for over 36 hours and coming to a grinding halt, I figured there should be a better alternative, as ~200K of my eventual ~750K nodes were loaded.
iii) Merging nodes based on one property, and setting the rest after that.
USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
etc
I am running a test now (~20K nodes) to see if switching from method ii to iii will improve execution time, as a smaller sample gave conflicting results. Are there methods which I am overlooking that could improve execution time? If I am not mistaken, the batch inserter only works with the CREATE command, and not the MERGE command.
I have permitted Neo4j to use 4GB of RAM, and judging from my task manager this is enough (uses just over 3GB).
Method iii) should be the fastest solution since you MERGE against a single property. Do you create the uniqueness constraint before you do the MERGE? Without an index (constraint or normal index), the process will take a long time with a growing number of nodes.
CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE
Followed by:
USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
This should work; you can also increase the PERIODIC COMMIT batch size.
I can add a few hundred thousand nodes within minutes this way.
In general, make sure you have indexes in place. Merge a node first on the basis of the properties that are indexed (to exploit fast lookup) and then modify that node's properties as needed with SET.
Beyond that, both of your approaches are going through the transaction layer. If you need to jam a lot of data into the DB really quickly, you probably don't want to use transactions to do that, because they're giving you functionality you might not need, and they require overhead that's slowing you down. So a larger solution would be to not insert data with LOAD CSV but go another route entirely.
If you're using the 2.2 series of Neo4j, you can go for the batch inserter via Java, or the neo4j-import tool (sadly not available prior to 2.2). What they both have in common is that they don't use transactions.
Finally, whichever way you go, you should read Michael Hunger's article on importing data into Neo4j, as it provides a good conceptual discussion of what's happening and why you need to skip transactions if you're going to load huge piles of data into Neo4j.

Why are zero byte files written to GCS when running a pipeline?

Our job/pipeline is writing the results of a ParDo transformation back out to GCS, i.e. using TextIO.Write.to("gs://...").
We've noticed that when the job/pipeline completes, it leaves numerous 0 byte files in the output bucket.
The input to the pipeline is from multiple files from GCS, so I'm assuming the results are sharded, which is fine.
But why do we get empty files?
It is likely that these empty shards are the result of an intermediate pipeline step which turned out to be somewhat sparse, so some pre-partitioned shards had no records in them.
E.g. if there was a GroupByKey right before the TextIO.Write and, say, the keyspace was sharded into ranges [00, 01), [01, 02), ..., [fe, ff) (255 shards total), but all actual keys emitted from the input of this GroupByKey were in the range [34, 81) and [a3, b5), then 255 output files will be produced, but most of them will turn out empty. (this is a hypothetical partitioning scheme, just to give you the idea)
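For intuition, here is a small Beam Python sketch (the pipeline in the question uses the Java SDK's TextIO.Write, and the toy data here is invented) that forces more output shards than there are keys, producing mostly empty files:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create([("a", 1), ("a", 2), ("b", 3)])
     | "Group" >> beam.GroupByKey()
     | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {sorted(kv[1])}")
     # Only two keys exist, so 48 of the 50 requested shards come out empty,
     # e.g. out-00007-of-00050 will be a zero-byte file.
     | "Write" >> beam.io.WriteToText("out", num_shards=50))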
The rest of my answer will be in the form of Q&A.
Why produce empty files at all? If there's nothing to output, don't create the file!
It's true that it would be technically possible to avoid producing them, e.g. by opening each file lazily, only when the first element is written to it. AFAIK we normally don't do this because empty output files are usually not an issue, and it is easier to understand an empty file than the absence of a file: it would be pretty confusing if, say, only the first of 50 shards turned out non-empty and you got a single output file named 00001-of-000050: you'd wonder what happened to the other 49.
But why not add a post-processing step to delete the empty files? In principle we could add a post-processing step of deleting the empty outputs and renaming the rest (to be consistent with the xxxxx-of-yyyyy filepattern) if empty outputs became a big issue.
Does existence of empty shards signal a problem in my pipeline?
A lot of empty shards might mean that the system-chosen sharding was suboptimal/uneven and we should have split the computation into fewer, more uniform shards. If this is a problem for you, could you give more details about your pipeline's output, e.g.: your screenshot shows that the non-empty outputs are also pretty small: do they contain just a handful of records? (if so, it may be difficult to achieve uniform sharding without knowing the data in advance)
But the shards of my original input are not empty, doesn't sharding of output mirror sharding of input? If your pipeline has GroupByKey (or derived) operations, there will be intermediate steps where the number of shards in input and output are different: e.g. an operation may consume 30 shards of input but produce 50 shards of output, or vice versa. Different number of shards in input and output is also possible in some other cases not involving GroupByKey.
TL;DR If your overall output is correct, it's not a bug, but tell us if it is a problem for you :)

One big and wide table, or many smaller ones, for statistics data

I'm writing the simplest analytics system for my company. I have about 100 different event types that should be collected across tens of projects. We are not interested in cross-project analytics queries, but events have similar types across all projects. I use PostgreSQL as the primary storage for this system. Now I need to decide which architecture is preferable.
The first architecture is one very big table (in terms of row count) per project that contains data for all types of events. It would have about 20 or more columns, many of them nullable. Maybe partitioning would be used to split this table by event type, but the table would still be just as wide.
The second architecture is many tables (fairly big in terms of row count but not as wide), with one table per event type.
I am going to retrieve analytics data from these tables using various join queries (self joins in the case of the first architecture). Which one is preferable, and what are the pitfalls of each?
UPD: All events have about 10 common attributes. The remaining attributes vary from one event type to another.
In the past, I've had similar situations. With Postgres you have a bunch of options.
There are many options depending on how your data is input into the system (all at once / a little at a time), the volume of data per project (hundreds of data points vs. millions of data points), and the querying pattern (i.e., querying after the data is all in, querying nightly, or reports running constantly throughout). One other factor is whether new project types (with new data point types) are likely to crop up.
First, in your "first architecture" the first question that comes up for me is: are all the "data points" the same data type (or at least very similar)? Are some text and others numeric? Are some integers and others floats? If so, you're likely to run into issues with rolling up your data without either building a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK, especially if you don't expect a bunch of new project types coming down the pike anytime soon; otherwise you'll be constantly modifying the DB schema, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is to have a combination of 1 and 2. Basically have 1 table to hold the 10 common attributes and use either 1 or 2 to hold the additional attributes. This would have an advantage, especially if the additional data wasn't that frequently used, or was non-numeric.
Lastly, you could use one of PostgreSQL's "document store" style datatypes. You could store this information in arrays, hstores, or json. Now, this will be fairly inefficient if you're doing a ton of aggregate functions, as you might be left calculating the aggregates outside of Postgres or, at a minimum, running an inefficient query. You could store the 10 common fields as normal columns and the additional ones as hstore or json.
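As a concrete illustration of combining architectures 1 and 2 with a document column, here is a minimal sketch using psycopg2; the connection string, the common columns shown, and the duration_ms field are made up, and jsonb is used here in place of hstore/json:

import psycopg2

conn = psycopg2.connect("dbname=analytics user=postgres")  # hypothetical DSN
cur = conn.cursor()
# Shared attributes as real columns, event-specific attributes in a jsonb document.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id    bigserial   PRIMARY KEY,
        project_id  int         NOT NULL,
        event_type  text        NOT NULL,
        occurred_at timestamptz NOT NULL,
        -- ... the remaining common attributes would go here ...
        extra       jsonb       NOT NULL DEFAULT '{}'
    )
""")
# Aggregating over a jsonb field works, but costs an extraction and a cast per
# row, which is the inefficiency mentioned above.
cur.execute("""
    SELECT event_type, avg((extra->>'duration_ms')::numeric)
    FROM events
    WHERE project_id = %s
    GROUP BY event_type
""", (42,))
print(cur.fetchall())
conn.commit()
cur.close()
conn.close()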
I didn't ask, but it would be nice to know whether each event within a project has more than one data point (i.e., are you logging changes, or just updating data?). If your overall table has fewer than 100,000 rows, it's likely best to focus on what's easier to maintain and program rather than on performance, as small amounts of data are pretty quick regardless of how they're stored.