Dataprep jobs running for over 72 hours since 6/20 update. Job status reads complete but not published - google-cloud-dataprep

I have been running daily Dataprep jobs, and since the update last week, approximately half of my jobs are now hanging and not being published. They appear as in-progress jobs, although when I go to the actual job page, the job appears to be complete. There is no publishing action, and the publishing target does not appear to be updated. Some jobs have now been going on for over 72 hours since Friday.
I've seen traces of other users having the same issue online but have not seen any sort of response or recognition from either Google or Trifacta.
I have tried restarting the jobs without success, and it appears that there is no way to cancel the hanging jobs because, from Google's perspective, the jobs themselves were successful, just not published. This problem appears both on my jobs that publish to BigQuery and on jobs that publish to Google Cloud Storage, for both manual and scheduled jobs.

This may impact only jobs that were pushed during the upgrade, and it should be rather cosmetic in nature. Please note that you won't get charged.
Did the exact same job work before, with no changes? If so, please contact support and provide the IDs of a previously successful job and a now-failing job as references so it can be investigated further.
Cheers,
Sebastian

I have come across the same problem! The output of the jobs is placed in a temp folder in Cloud Storage, with the output mostly consisting of multiple files without headers...

It is also creating huge issues here. Instead of the normal output file, it places multiple parts of it in a temp folder, without headers. This makes new scheduled jobs that rely on these outputs useless, because they do not load the new output.
If you manually merge the files in the temp folder, add headers (in the case of CSV), and place them in the correct folder, the output can be recreated manually (for CSV); a sketch of automating that merge is below.
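For what it's worth, here is a minimal sketch of automating that merge server-side with a GCS compose call, assuming the google-cloud-storage Java client; the bucket, object, and column names are all hypothetical:

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;

public class MergeTempOutput {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    String bucket = "my-output-bucket"; // hypothetical bucket

    // The temp parts lack headers, so upload a one-line header object first.
    BlobInfo header = BlobInfo.newBuilder(BlobId.of(bucket, "tmp/header.csv")).build();
    storage.create(header, "col_a,col_b,col_c\n".getBytes(StandardCharsets.UTF_8));

    // GCS 'compose' concatenates up to 32 objects server-side, so the parts
    // never have to be downloaded and re-uploaded.
    Storage.ComposeRequest request = Storage.ComposeRequest.newBuilder()
        .addSource("tmp/header.csv")
        .addSource("tmp/part-00000") // hypothetical temp part names
        .addSource("tmp/part-00001")
        .setTarget(BlobInfo.newBuilder(BlobId.of(bucket, "output/result.csv")).build())
        .build();
    storage.compose(request);
  }
}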
Also no response from Google yet.

We're seeing the exact same thing, down to the destinations and job types... it's almost like Dataprep is losing track of the underlying Dataflow job and not finishing up on its completion (that's why you see the temp files: they are the raw output, and Dataprep normally handles the formatting of the final output file separately).
Someone was kind enough to already post this on the issue tracker, so please go star it and add any additional details that may be helpful to the Dataprep team:
https://issuetracker.google.com/issues/135865374

Related

Workflow platform for managing the processing of incoming files

In general, I have a single workflow that I want to be able to monitor. The workflow should start whenever new files arrive or alternatively at certain scheduled times, i.e. I want to be able to insert new "jobs" to the workflow as they come, and process the files by going through multiple different tasks and steps. I want to be able to monitor each file going through the tasks.
Queuing and load distribution for each task might be managed by Celery, but that hasn't been decided yet either.
I've looked at Apache Airflow, and as far as I currently understand, it is geared more toward monitoring many different workflows, where each workflow mostly runs from start to end, rather than having new files added to the beginning of the flow before the previous run has ended.
Cadence workflow seems like it could do what I need, but it also seems like overkill.
I'm not expecting a specific final solution here, but I would appreciate suggestions to more such solutions that I can look into and can fit the above.
Luigi - https://luigi.readthedocs.io/en/stable/
Extremely lightweight and fast compared to Airflow.

How to use Cron to run commands when jobs in a particular folder complete?

I am running a long simulation on our cluster. I submit dozens of jobs first, each job placed on "hold" until its predecessor completes, so that the simulation can be extended to the period that I want.
Due to the limit on the total number of jobs we can submit, I have to submit new jobs every day, once the previous jobs have completed.
I feel it is time-consuming to do this every day, so I wonder if cron could:
monitor whether all the jobs launched in a particular folder have completed on the cluster
if yes, execute the commands written in a job.sh file to submit more jobs within that particular folder
I am also happy to use methods other than cron.
Thank you.
You may also be interested in trying BeyondCron, which is currently available for early adopters. Using conditions, you should be able to solve your problem.

Google Cloud Dataflow: while in PubSub streaming mode, TextIO.Read uses massive amounts of vCPU time

I'm using Google Cloud Platform to transfer data from an Azure server to a BigQuery table (working nice and smoothly, functionally speaking).
The pipeline looks like this: [diagram: Dataflow streaming pipeline]
The 'FetchMetadata' part of the pipeline is a simple TextIO.Read implementation where I read a 66-line .csv file with metadata from a GCP Storage bucket:
PCollection<String> metaLine = p.apply(TextIO.Read.named("FetchMetadata")
.from("gs://my-bucket"));
When I use my pipeline in batch mode this works like a charm: first the metadata file is loaded into the pipeline in less than a second of vCPU time, and then the data itself is loaded. Now, when running in streaming mode, I would love to replicate that behaviour to some extent, but when I use the same code there is a problem: after running the pipeline for just 15 minutes (actual time), the TextIO.Read block has used a whopping 4 hours of vCPU time. For a pipeline that will be permanently running for a low-budget project, this is unacceptable.
So my question: is it possible to change the code so that the file is periodically re-read (if the file changes, I want the pipeline to pick it up, say with hourly updates) rather than continuously, as it is doing right now?
I've found some documentation that mentions TextIO.Read.Bound, which looks like a good place to start, but as far as I can tell it doesn't solve my periodic-update problem.
I was stuck in a similar situation. The way I solved this problem is a bit different. I would like the community's insights into this solution.
I had files being updated every hour in a GCS bucket. I followed the blog post about Scheduling Dataflow Jobs from App Engine or Google Cloud Function.
I had the App Engine endpoint configured to receive the object change notifications from the GCS bucket which contained the files to be processed. For every file that was created (an update is also a create operation in an object store), the App Engine application would submit a job to Google Dataflow. The job would read the lines from the file (the file name arrives in the HTTP request body) and publish them to a Google PubSub topic; a sketch is below.
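A rough sketch of that per-file ingest job, assuming the Apache Beam Java SDK; the option name, topic, and paths are hypothetical (inputFile is a ValueProvider so the same pipeline can later be staged as a template and launched once per file):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

public class FileToPubsub {
  // Runtime parameter: which file to ingest.
  public interface Options extends PipelineOptions {
    ValueProvider<String> getInputFile();
    void setInputFile(ValueProvider<String> value);
  }

  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline p = Pipeline.create(options);
    p.apply("ReadFile", TextIO.read().from(options.getInputFile()))
     .apply("PublishLines", PubsubIO.writeStrings()
         .to("projects/my-project/topics/ingest")); // hypothetical topic
    p.run();
  }
}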
A streaming pipeline subscribed to the Google PubSub topic would then process the messages and output the relevant rows to BigQuery. This way, the streaming pipeline ran at the minimum worker count when idle, the ingest of the files happened through a batch pipeline, and the streaming pipeline scaled with the volume of publications on the Google PubSub topic.
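And a matching sketch of the streaming side, again assuming the Beam Java SDK; the parsing DoFn, topic, and table are hypothetical, and the BigQuery table is assumed to already exist:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class PubsubToBigQuery {
  // Hypothetical parser: one CSV line in, one TableRow out.
  static class LineToRowFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      String[] fields = c.element().split(",");
      c.output(new TableRow().set("col_a", fields[0]).set("col_b", fields[1]));
    }
  }

  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);
    p.apply("ReadLines", PubsubIO.readStrings()
         .fromTopic("projects/my-project/topics/ingest")) // hypothetical topic
     .apply("ToTableRow", ParDo.of(new LineToRowFn()))
     .apply("WriteToBQ", BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.my_table") // hypothetical, pre-existing table
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    p.run();
  }
}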
In the tutorial for submitting jobs to Google Dataflow, the jar is executed on the underlying terminal. I modified the code to submit the job to Google Dataflow using templates, which can be executed with parameters. This way, the job submission becomes very lightweight while still creating a job for every new file uploaded to the GCS bucket. Please refer to this link for details about executing Google Dataflow job templates.
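The launch step itself can then be a single API call. Here is a minimal sketch, assuming the generated google-api-services-dataflow client; building the authenticated Dataflow client is omitted, and the project, bucket, and template names are hypothetical:

import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.LaunchTemplateParameters;
import com.google.api.services.dataflow.model.LaunchTemplateResponse;
import java.util.Collections;

public class TemplateLauncher {
  // 'dataflow' is an authenticated client; 'fileName' comes from the
  // object change notification.
  static String launchIngestJob(Dataflow dataflow, String fileName) throws Exception {
    LaunchTemplateParameters params = new LaunchTemplateParameters()
        .setJobName("ingest-" + System.currentTimeMillis())
        .setParameters(Collections.singletonMap(
            "inputFile", "gs://my-bucket/" + fileName)); // hypothetical bucket
    LaunchTemplateResponse response = dataflow.projects().templates()
        .launch("my-project", params) // hypothetical project ID
        .setGcsPath("gs://my-bucket/templates/file-to-pubsub") // staged template
        .execute();
    return response.getJob().getId();
  }
}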
Note: Please mention in the comments if the answer needs to be modified for the code snippets of the dataflow job template and app engine application and I will update the answer accordingly.

How to Implement Queue Based Workflow System?

I'm working on a document management system. An example workflow would be something like this:
A document is emailed to the system
The system does a number of preparatory actions to the document
Document is presented to a user for further processing
Afterwards, document is sent to Quality Assurance
Afterwards, the system does a number or post-processing actions to the document
Document is considered completely processed and disseminated (e.g. emailed back to whoever emailed the document to the system, etc.)
Since the volume of my input will vary (but will usually be high), I am very concerned about scalability.
For example, say the system has already downloaded the email attachments. If the attachments are PDF documents, the system needs to split the PDF into individual pages, then convert each page into multiple-size thumbnails, etc. I plan to have a cron job check (say, every minute) whether there are any PDF documents that need to be processed. Using a flagging system (e.g. "PDF Document Ready to be Processed"), I can check the database for all PDF documents that are flagged for processing. Once the PDF processing is done, the flag can be updated to say "PDF Processing Done."
However, since the processing of each PDF document is very time consuming, I am concerned that when the next cron job is executed, that cron job will also try to process the PDFs that the previous cron job is still processing.
A possible solution is to immediately flag the PDF documents with "PDF Document Currently Being Processed." That way, when the next cron job is executed, it will exclude the ones already being processed.
Thus, each step in the workflow will probably have 3 flags:
PDF Document Ready to be Processed
PDF Document Currently Being Processed
PDF Processing Done
Same for QA:
Document Ready for QA
Document Currently Being QAd
Document QA Done
Is this a good approach? Is there a better approach? Would I have these flags as a single column of the "PDF Document" table in the database? Or should the flags be their own table (especially if a document can have multiple flags set)?
I'd like to solicit suggestions on how to implement such a system.
To solve your concern about concurrent processing on the same document, you can use many scheduler packages to help you manage this aspect. http://www.quartz-scheduler.org/ is one I've used with great success.
To address your problem, I'd have the 3 states: received, queued, and processed (similar to what you suggest).
I'd have a scheduled recurring job which polls the database looking for received PDFs and, for each one, queues a job to process it and marks the PDF as queued; see the sketch below. If you ensure this happens in the same transaction and utilize optimistic locking, there is no risk that another job could come along and re-read it as received.
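A minimal sketch of that claim step in plain JDBC; the table and column names are hypothetical. The guarded UPDATE only succeeds while the row is still in the received state, so two concurrent pollers can never both claim the same document:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PdfClaimer {
  // Returns the id of a claimed document, or null if nothing was claimable.
  public static Long claimNextPdf(Connection conn) throws Exception {
    conn.setAutoCommit(false);
    try (PreparedStatement select = conn.prepareStatement(
             "SELECT id FROM pdf_document WHERE status = 'RECEIVED' LIMIT 1");
         ResultSet rs = select.executeQuery()) {
      if (!rs.next()) { conn.rollback(); return null; }
      long id = rs.getLong(1);
      try (PreparedStatement claim = conn.prepareStatement(
               "UPDATE pdf_document SET status = 'QUEUED' "
             + "WHERE id = ? AND status = 'RECEIVED'")) {
        claim.setLong(1, id);
        boolean claimed = claim.executeUpdate() == 1; // 0 => another poller won the race
        conn.commit();
        return claimed ? id : null;
      }
    }
  }
}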
Quartz uses a thread pool, with many configuration options, and is great for deferred, resource-intensive processing (I use it for image thumbnailing in a server setting).
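Wiring the poll up in Quartz is short; a sketch using the Quartz 2.x builder API (the job body is left as a stub):

import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class PollScheduler {
  public static class PollJob implements Job {
    @Override
    public void execute(JobExecutionContext context) {
      // Poll the database and claim work here (e.g. via claimNextPdf above).
    }
  }

  public static void main(String[] args) throws SchedulerException {
    Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
    JobDetail job = JobBuilder.newJob(PollJob.class).withIdentity("pollPdfs").build();
    Trigger trigger = TriggerBuilder.newTrigger()
        .startNow()
        .withSchedule(SimpleScheduleBuilder.repeatMinutelyForever())
        .build();
    scheduler.scheduleJob(job, trigger);
    scheduler.start();
  }
}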
To take a step back, there are some great workflow packages in the java world which can handle most of what you want to do, including the deferred pdf processing. Take a look at jbpm or drools flow, these are two great, if complex, packages.
UPDATE: Drools Flow has been merged into JBPM. For this particular problem it may be a bit of "killing a mosquito with a bazooka" situation, but it's a great workflow package.
The solution kind of depends on what technologies you are using to implement this system: is the pre-/post-processing done by the same software/language as the emailing software? Additionally, are they running in separate processes?
If you have distributed components, you could do much worse than investigating an AMQP solution like RabbitMQ, as this takes care of putting each job into a queue and making sure that only one of your consumers takes each job (we'd model each thumbnailing job as an individual task); a sketch is below.
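A minimal publisher sketch with the RabbitMQ Java client; the queue name and job payload are hypothetical:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;
import java.nio.charset.StandardCharsets;

public class ThumbnailJobPublisher {
  public static void main(String[] args) throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost("localhost"); // assumes a local broker
    try (Connection conn = factory.newConnection();
         Channel channel = conn.createChannel()) {
      // Durable queue so jobs survive a broker restart; the broker hands
      // each message to exactly one consumer.
      channel.queueDeclare("thumbnail-jobs", true, false, false, null);
      String job = "{\"documentId\": 42, \"page\": 1}"; // hypothetical payload
      channel.basicPublish("", "thumbnail-jobs",
          MessageProperties.PERSISTENT_TEXT_PLAIN,
          job.getBytes(StandardCharsets.UTF_8));
    }
  }
}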
If, however, the entire system is implemented in one language and runs inside a single process, there are some simpler systems you can use:
Resque is a good solution for Ruby
Java would work well with a LinkedBlockingQueue (see the sketch after this list)
I'm sure C# will have some way of creating a queue of jobs (disclaimer: I know nothing of C#)
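For the single-process Java case, a sketch of the LinkedBlockingQueue approach; take() blocks until a job is available, so each job is consumed exactly once:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class InProcessJobQueue {
  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<Runnable> jobs = new LinkedBlockingQueue<>();

    // Single worker thread draining the queue.
    Thread worker = new Thread(() -> {
      try {
        while (true) {
          jobs.take().run();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    worker.setDaemon(true);
    worker.start();

    // Producer side: enqueue thumbnailing work as it arrives.
    jobs.put(() -> System.out.println("thumbnailing page 1..."));
    jobs.put(() -> System.out.println("thumbnailing page 2..."));
    Thread.sleep(500); // give the daemon worker time to drain the queue
  }
}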

MS CRM recursive workflow and performance

I'm about to write a workflow in CRM that calls itself every day, i.e. a recursive workflow.
It will run on half a million entities each day and deactivate each record that has not been updated in the past 3 days.
I'm worried about performance. Has anyone else done this?
I haven't personally implemented anything like this, but that's 500,000 records that are floating around in the DB that the async service has to keep track of, which is going to tax your hardware. In addition, CRM keeps track of recursive workflow instances. I don't have the exact specs in front of me, but if a workflow calls itself a set number of times within a certain timeframe, CRM will kill the workflow.
Could you just write a console app that asks the Crm Service for records that haven't been updated in three days, and then deactivate them? Run it as a scheduled task once a day, and then your CRM system doesn't have the burden of keeping track of all those running workflow instances.
EDIT: Ah, I see now you might have been thinking of one workflow that runs on all the records as opposed to workflows running on each record. benjynito's advice makes sense if you go this route, although I still think a scheduled task would be more appropriate than using workflow.
You'll want to make sure your workflow is running in non-peak hours. Assuming you have an on-premise installation you should be able to get away with that. If you're using a hosted instance, you might be worried about one organization running the workflow while another organization is using the system. Use the timeout and maybe a custom workflow activity, if necessary, to force the start time to a certain period.
I'm assuming you'll be as efficient as possible in figuring out which records to deactivate. (i.e. Query Expression would only bring back the records you'll be deactivating).
The built-in infinite loop-protection offered by CRM shouldn't kill your workflow instances. It stops after a call depth of 8, but it resets to 1 if no calls are made for an hour. So the fact that you're doing this once a day should make you OK on the recursive workflow front.