I uploaded the dataset to Google Cloud Storage. Next, I opened the flow in Dataprep and added the dataset to it. When I created the first recipe (with no steps added yet), the dataset had approximately half of its original rows: 36,234 instead of 62,948.
I would like to know what could be causing this problem. Some missing configuration?
Thank you very much in advance
Here are a couple of thoughts...
Data Sampling
Keep in mind that what's shown in the Dataprep editor is typically a sample of the data, not the full dataset (unless it's very small). If the full file was small enough to load, you should see the "Full Data" label where the sample is typically shown.
In other cases, what you're actually looking at is a sample, which will also be indicated there.
It's very beneficial to have an idea of how Dataprep's sampling works if you haven't reviewed the documentation already:
https://cloud.google.com/dataprep/docs/html/Overview-of-Sampling_90112099
Compressed Sources
Another issue I've noticed occasionally is when loading compressed CSVs. In this case, I've had the interface tell me that I'm looking at the "Full Data", but the number of rows is incorrect. However, whenever this has happened, the job itself does process the full number of rows.
Related
During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in the number of unions or joins performed with these sets, Dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that Dataprep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). Dataflow will then perform a Parallel Do (ParDo), where it takes each element from a PCollection, hands it to a worker node, and applies the side-input data for the transformation.
So I am pretty sure that if the reference sets get too big (which can happen with joins), the underlying code will take an element from dataset A and pass it to a function with side-input B, but if side-input B is very big, it won't fit into the worker's memory. Take a look at the Stackdriver logs for your job to investigate whether this is the case. If you see 'GC (Allocation Failure)' in your logs, that is a sign of not enough memory.
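To make the side-input idea concrete, here is a minimal Apache Beam (Python) sketch of that pattern. To be clear, this is not the code Dataprep actually generates (that code isn't visible to us); the field names and data are made up purely to show why the reference set ends up materialized in every worker's memory:

    # Illustrative only: the side-input pattern described above, not Dataprep's
    # actual generated code. Data and field names are invented for the example.
    import apache_beam as beam
    from apache_beam.pvalue import AsDict

    def enrich(row, ref):
        # `ref` is the side input: it is materialized and handed to every worker,
        # which is why a very large reference set can exhaust worker memory.
        out = dict(row)
        out['country_name'] = ref.get(row['country_code'], 'unknown')
        return out

    with beam.Pipeline() as p:  # DirectRunner by default
        main = p | 'Main' >> beam.Create([
            {'id': 1, 'country_code': 'NL'},
            {'id': 2, 'country_code': 'BE'},
        ])
        reference = p | 'Reference' >> beam.Create([('NL', 'Netherlands'),
                                                    ('BE', 'Belgium')])
        (main
         | 'Enrich' >> beam.Map(enrich, ref=AsDict(reference))  # side input
         | 'Print' >> beam.Map(print))

If the PCollection behind AsDict is large, every worker has to hold that whole dict in memory, which matches the 'GC (Allocation Failure)' symptom described above.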
You can try this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job that performs some type of join, it will very quickly outgrow the worker memory and fail. If you can, pre-process the data so that one of the files stays in the MB range and only let the other file grow.
If your data structures don't lend themselves to that option, you could do what the systems engineers suggested: split one file up into many small chunks and feed them through the recipe iteratively against the other, larger file.
Another option to test is specifying a larger machine type for the workers. You can increase it step by step to see whether the job finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port to Google Cloud DataFlow.
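If you do go the hand-written Beam route, the worker machine type mentioned above becomes an ordinary Dataflow pipeline option. A rough sketch in Python, with placeholder project, region and bucket values (the pipeline body here is just a stand-in):

    # Sketch only: run the same pipeline locally first, then on Dataflow with
    # larger (high-memory) workers. Project/region/bucket values are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run(opts):
        with beam.Pipeline(options=opts) as p:
            (p
             | beam.Create([1, 2, 3])
             | beam.Map(lambda x: x * 2)
             | beam.Map(print))

    # 1) Local test run:
    run(PipelineOptions(['--runner=DirectRunner']))

    # 2) Dataflow run with bigger workers, to give side inputs more headroom:
    run(PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-gcp-project',            # placeholder
        '--region=us-central1',                # placeholder
        '--temp_location=gs://my-bucket/tmp',  # placeholder
        '--worker_machine_type=n1-highmem-8',  # more memory per worker
    ]))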
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.
I would like to share one of my findings regarding dataprep product limitations.
I was constructing flows in which I needed to combine a number of JSON files before further processing. The flows are then combined through reference datasets at the end.
After a significant struggle I noticed that when the total number of JSON files used as input stays below roughly 15, a Dataflow job can be started.
However, going above this limit causes a failure without any explanation.
It would be great if there is somebody who can give more insight into this issue:
* Why is there such a limitation?
* Or is it some other problem that just makes it look like a limitation?
* Is there a quick way to identify the sources of these types of issues/bugs in dataprep?
* Is there a workaround to increase the number of input files?
Cheers,
Bram
I had an issue with starting jobs in Dataprep and it was solved by disabling the "profile results" option on the run page.
I'm a bit confused after reading this doc.
The doc says:
The fragments of the file must be uploaded sequentially in order. Uploading fragments out of order will result in an error.
Does that mean that, for one file divided into fragments #1 through #10 in order, I can only upload fragment 2 after I finish uploading fragment 1? If so, why is it possible to have multiple nextExpectedRanges? I mean, if you upload the fragments one by one, you can be sure that the previous fragments have already been uploaded.
According to the doc, byte range size has to be a multiple of 320 KB. Does that imply that the total file size has to be a multiple of 320 KB also?
There are currently some limitations that necessitate this sequencing requirement; however, the long-term goal is to remove it. The API reflects this by supporting multiple nextExpectedRanges, but it does not currently leverage them.
No, multiples of 320 KiB are just the ideal size. You can choose other sizes, and you can mix them. So for your scenario you could use all 320 KiB chunks, except for the last one, which would be whatever size is needed to reach the overall size of your file.
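Putting the two answers together: fragments are sent sequentially, 320 KiB is the recommended fragment size, and only the final fragment needs to be whatever remainder is left. A minimal Python sketch of that loop; the upload_url is assumed to come from an upload session created elsewhere, and retry/nextExpectedRanges handling is omitted:

    # Sketch only: upload a file in sequential 320 KiB fragments using
    # Content-Range headers. `upload_url` comes from an upload session created
    # elsewhere; retries and nextExpectedRanges handling are omitted.
    import os
    import requests

    CHUNK = 320 * 1024  # 320 KiB; only the last fragment may be smaller

    def upload_in_fragments(path, upload_url):
        total = os.path.getsize(path)
        offset = 0
        with open(path, 'rb') as f:
            while offset < total:
                data = f.read(CHUNK)
                end = offset + len(data) - 1
                headers = {'Content-Range': f'bytes {offset}-{end}/{total}'}
                resp = requests.put(upload_url, headers=headers, data=data)
                resp.raise_for_status()
                offset = end + 1
        return resp  # the final response normally describes the completed file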
This question is strictly DQS-performance related.
The ‘customers’ table I need to clean has 40,000,000 rows. I created a matching policy using a subset (no issues there, I just used the top 10,000 rows).
Now when I want to do a data quality project, I can't take the entire table in one project; it just won't respond. The most I managed to handle was 400,000 rows at a time, and even then it takes almost 2 hours. And it's not the best solution, because I have to build the project on a view where id is between 1 and 400,000.
Any solution to this guys?
I am also wondering: where's the bottleneck? Is it CPU or disk?
Regards.
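For scale, the workaround described above amounts to slicing the 40,000,000-row table into 100 separate 400,000-row id ranges. A small, purely illustrative Python sketch of generating those view filters, assuming contiguous integer ids starting at 1:

    # Illustrative only: compute the id ranges for 400,000-row batches of a
    # 40,000,000-row table, assuming contiguous integer ids starting at 1.
    TOTAL_ROWS = 40_000_000
    BATCH_SIZE = 400_000

    ranges = [(start, min(start + BATCH_SIZE - 1, TOTAL_ROWS))
              for start in range(1, TOTAL_ROWS + 1, BATCH_SIZE)]

    print(len(ranges))  # 100 batches
    print(ranges[0])    # (1, 400000)  -> view filter: id between 1 and 400000
    print(ranges[-1])   # (39600001, 40000000)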
In my model I'm using BehaviorSpace to carry out a number of runs, with variables changing for each run and the output being stored in a *.csv for later analysis. The model runs fine for the first few iterations, but quickly slows as the data grows. My question is: will file-flush, when used in BehaviorSpace, help with this? Or is there a way around it?
Cheers
Simon
Make sure you are using table format output and that spreadsheet format output is disabled. At http://ccl.northwestern.edu/netlogo/docs/behaviorspace.html we read:
Note however that spreadsheet data is not written to the results file until the experiment finishes. Since spreadsheet data is stored in memory until the experiment is done, very large experiments could run out of memory. So you should disable spreadsheet output unless you really want it.
Note also:
doing runs in parallel will multiply the experiment's memory requirements accordingly. You may need to increase NetLogo's memory ceiling (see this FAQ entry).
where the linked FAQ entry is http://ccl.northwestern.edu/netlogo/docs/faq.html#howbig
Using file-flush will not help. It flushes any buffered data to disk, but only for a file you opened yourself with file-open, and anyway, the buffer associated with a file is fixed-size, not something that grows over time. file-flush is really only useful if you're reading from the same file from another process during a run.