I would like to share one of my findings regarding dataprep product limitations.
I was constructing flows in which I needed to combine a number of JSON files before further processing. The flows are then combined through reference datasets at the end.
After a significant struggle I noticed that when the total number of json files used as input is lower than around 15, a dataflow job could be started.
However, going above this limit would cause a failure without any explanation.
It would be great if there is somebody who can give more insight into this issue:
* Why is there such a limitation?
* Is it another problem that could cause me to think there is a limitation?
* Is there a quick way to identify the sources of these types of issues/bugs in dataprep?
* Is there a workaround to increase the number of input files?
Cheers,
Bram
I did have an issue with starting jobs in dataprep and it was solved by
disabling the "profile results" option in the run page.
I uploaded the dataset into the storage of Google Cloud AI. Next, I opened the flow in Dataprep and added the dataset to it. When I created the first recipe (without any steps yet), the dataset had approximately half of its original rows, that is, 36 234 instead of 62 948.
I would like to know what could be causing this problem. Some missing configuration?
Thank you very much in advance
Here are a couple thoughts . . .
Data Sampling
Keep in mind that what's shown in the Dataprep editor is typically a sample of the data, not the full data (unless it's very small). If the full file was small enough to load, you should see the "Full Data" label up where the sample is typically shown.
In other cases, what you're actually looking at is a sample, which will also be indicated.
It's very beneficial to have an idea of how Dataprep's sampling works if you haven't reviewed the documentation already:
https://cloud.google.com/dataprep/docs/html/Overview-of-Sampling_90112099
Compressed Sources:
Another issue I've noticed occasionally is when loading compressed CSVs. In this case, I've had the interface tell me that I'm looking at the "Full Data", but the number of rows is incorrect. However, any time this has happened the job does actually process the full number of rows.
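If you want to double-check the real row count outside of Dataprep, a small local script can do it. This is just a minimal sketch, assuming the file is gzip-compressed, UTF-8 encoded, and has a single header row; the function name is made up for illustration.

```python
import csv
import gzip


def count_csv_rows(path, has_header=True):
    """Count the data rows in a gzip-compressed CSV file."""
    # newline="" lets the csv module handle quoted fields that contain line breaks
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        total = sum(1 for _ in csv.reader(handle))
    # Subtract the header row if there is one
    return max(total - 1, 0) if has_header else total
```

Comparing this number with what the editor shows (and with the job output) tells you whether rows are really missing or whether only the displayed count is off.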
During my cloud dataprep adventures I have come across yet another very annoying bug.
The problem occurs when creating complex flow structures which need to be connected through reference datasets. If a certain limit is crossed in the number of unions or joins performed with these sets, Dataflow is unable to start a job.
I have had a lot of contact with support and they are working on the issue:
"Our Systems Engineer Team was able to determine the root cause resulting into the failed job. They mentioned that the job is too large. That means that the recipe (combined from all datasets) is too big, and Dataflow rejects it. Our engineering team is still investigating approaches to address this.
A workaround is to split the job into two smaller jobs. The first run the flow for the data enrichment, and then use the output as input in the other flow. While it is not ideal, this would be a working solution for the time being."
I ran into the same problem and have a fairly educated guess as to the answer. Keep in mind that Dataprep simply takes all your GUI-based inputs and translates them into Apache Beam code. When you pass in a reference dataset, it probably writes some Beam code that turns the reference dataset into a side input (https://beam.apache.org/documentation/programming-guide/). Dataflow will then perform a Parallel Do (ParDo), where it takes each element from a PCollection, sends it to a worker node, and applies the side-input data for the transformation.
So I am pretty sure if the reference sets get too big (which can happen with Joins), the underlying code will take an element from dataset A, pass it to a function with side-input B...but if side-input B is very big, it won't be able to fit into the worker memory. Take a look at the Stackdriver logs for your job to investigate if this is the case. If you see 'GC (Allocation Failure)' in your logs this is a sign of not enough memory.
You can try doing this: suppose you have two CSV files to read in and process, file A is 4 GB and file B is also 4 GB. If you kick off a job to perform some type of Join, it will very quickly outgrow the worker memory and puke. If you CAN, see if you can pre-process in a way where one of the files is in the MB range and just grow the other file.
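To make the "one small file, one big file" idea concrete, here is a rough sketch in plain Python (no Beam involved). The small file is loaded completely into a dictionary, which plays the role of the side input, and the large file is streamed one row at a time. This only illustrates the pattern and is not what Dataprep actually generates; the function name and the idea of joining on a single key column are my own assumptions.

```python
import csv


def join_with_small_table(large_path, small_path, key):
    """Stream the large CSV and enrich each row from the small CSV.

    Only the small file is held in memory, similar in spirit to a side input.
    """
    with open(small_path, newline="", encoding="utf-8") as small_file:
        # If a key appears more than once in the small file, the last row wins
        lookup = {row[key]: row for row in csv.DictReader(small_file)}

    with open(large_path, newline="", encoding="utf-8") as large_file:
        for row in csv.DictReader(large_file):
            match = lookup.get(row[key])
            if match is not None:  # inner join: rows without a match are skipped
                # Columns from the large file override same-named lookup columns
                yield {**match, **row}
```

Memory use here depends on the size of the small file only, which is why shrinking one side of the join matters so much.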
If your data structures don't lend themselves to that option, you could do what the Sys Engs suggested: split one file up into many small chunks and then feed them to the recipe iteratively against the other, larger file.
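Splitting a file into chunks can be done locally before uploading. Below is a minimal sketch, assuming a CSV with one header row that should be repeated in every chunk; the function name and the `chunk_0000.csv` naming scheme are hypothetical.

```python
import csv
import os


def split_csv(path, out_dir, rows_per_chunk):
    """Split a CSV into numbered chunk files, repeating the header in each."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out_file = None
    writer = None
    with open(path, newline="", encoding="utf-8") as source:
        reader = csv.reader(source)
        header = next(reader, None)
        if header is None:  # empty input file: nothing to split
            return chunk_paths
        for index, row in enumerate(reader):
            if index % rows_per_chunk == 0:
                # Start a new chunk file and write the header first
                if out_file is not None:
                    out_file.close()
                name = f"chunk_{len(chunk_paths):04d}.csv"
                chunk_path = os.path.join(out_dir, name)
                out_file = open(chunk_path, "w", newline="", encoding="utf-8")
                writer = csv.writer(out_file)
                writer.writerow(header)
                chunk_paths.append(chunk_path)
            writer.writerow(row)
    if out_file is not None:
        out_file.close()
    return chunk_paths
```

The returned list of paths can then be used to run the flow once per chunk.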
Another option to test is specifying the compute type for the workers. You can iteratively grow the compute type larger and larger to see if it finally pushes through.
The other option is to code it all up yourself in Apache Beam, test locally, then port it to Google Cloud Dataflow.
Hopefully these guys fix the problem soon, they don't make it easy to ask them questions, that's for sure.
I am using MiniZinc and Gecode to solve a minimization problem in a distributed fashion. I have multiple distributed servers that solve the same model with identical input and I want all the servers to get the same solution.
The problem is that the model has multiple solutions, which periodically causes servers to come up with different solutions independently. It does not matter which solution is chosen, as long as it is identical among all servers. I am also using the "-p" argument with Gecode to use multiple threads (if it is relevant).
Is there a way that I could address this issue?
For example, I was thinking about outputting all the solutions and then sorting them alphanumerically on each server.
Thanks!
If the search strategy in the model does not contain randomisation, then, assuming all versions are the same, a single-threaded execution of Gecode should always return the same answer for the same model and instance data. It does not matter if it's on a different node. Using single-threaded execution is the easiest way of ensuring that the same solution is found on all nodes.
If, however, you want to use multiple threads, no such guarantee can be made. Due to the concurrency of the program, the execution path can be different on every run and a different solution might be found each time.
Your suggestion of sorting the solution is possible, but will come at a price. There are two ways of doing this. You can either find all solutions, using the -a flag, and sort them afterwards or you can change your model to force the solution to be the first solution if you would sort them. This second option can be achieved by changing the search strategy. Both these solutions can be very costly and might (more than) exponentially increase the runtime.
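The first approach (collect the solutions, then pick one deterministically) could be sketched as follows. This is only an illustration under a few assumptions: the solver output has been captured as text, solutions are separated by MiniZinc's default `----------` line, the output item prints the variables in a fixed order, and the text contains exactly the candidate solutions you want to choose among. The function name is hypothetical.

```python
def pick_canonical_solution(solver_output):
    """Return the lexicographically smallest solution block, or None."""
    solutions = []
    current = []
    for line in solver_output.splitlines():
        stripped = line.strip()
        if stripped == "----------":  # separator that ends one solution
            solutions.append("\n".join(current))
            current = []
        elif stripped.startswith("====="):  # status line, e.g. search complete
            continue
        elif stripped:
            current.append(stripped)
    # Plain string comparison: "x = 10;" sorts before "x = 2;", which is
    # fine here because all that matters is that every server agrees.
    return min(solutions) if solutions else None
```

Since every server applies the same ordering to the same set of solutions, they all end up with the same choice, regardless of the order in which the solver found them.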
If you are concerned about runtime at all, then I suggest you take Patrick Trentin's advice and run the model on a master node and distribute the solution. This will be the most efficient in computational time and most likely as efficient in runtime.
I have a dashboard where I have kept all the filters used in the dashboard as global filters, and I have set the most-used filters as context filters.
The problem is that computing the filters takes about 1-2 minutes. How can I reduce this time?
I have about 2 million rows of extracted data from Oracle, with Tableau 9.3.
Adding to Aron's point, you can also use custom SQL to select only the dimensions and measures which you are going to use for the dashboard. I have worked on big data and it used to take around 5-7 mins to load the dashboard. I finally ended up using custom SQL and removing unnecessary filters and parameters. :)
There are several things you can look at to guide performance optimization, but the details matter.
Custom SQL can help or hurt performance (more often hurt because it prevents some query optimizations). Context filters can help or hurt depending on user behavior. Extracts usually help, especially when aggregated.
An extremely good place to start is the following white paper by Alan Eldridge:
http://www.tableau.com/learn/whitepapers/designing-efficient-workbooks
I'm trying to merge two datasets from two time periods, time 1 & 2, to make a combined repeated measures dataset. There are some observations in time 1 which do not appear in time 2, as the observations are for participants who dropped out after time 1.
When I use the append command in Stata, it appears to drop the observations from time 1 that don't have corresponding data at time 2. It does, however, append observations for new participants who joined at time 2.
I would like to keep the time 1 data of those participants who dropped out, so that I can still use that information in the combined dataset.
How can I tell Stata not to automatically drop these participants?
Thanks,
Steve
Perhaps the best way of interesting people in advising you on your problems is to respect those who answer your questions. You have been repeatedly advised, even as recently as yesterday, to review https://stackoverflow.com/help/someone-answers and provide the feedback that is reflected in the reputation scores of those who take the time to help you.
In any event, append does not work as you describe it. If you take the time to work out a small reproducible example, by creating two small datasets, appending them, and then listing the results, you may find the roots of your misunderstanding. Or you will at least be able to provide substantive information from which others can work, should someone be interested in helping you.