How to run part of a stream without having to execute the whole stream? - spss-modeler

I use the select node to filter out some of the entries in the dataset, and then run a model.
I want to save time and not go through all the rows again when I rerun the model. I tried the Run From Here option, but the execution still goes through all the rows. I also tried the SQL pushback (purple nodes) button, but to no avail.
Is it possible to only go through the selected rows when running a model, without having to export the newly filtered database?

If this makes any sense to you, I guess it might work:
Once you have specified the required options for streams and connected the required nodes, you can run the stream by running the data through nodes in the stream. Run part of a data stream by right-clicking any non-terminal node and clicking Run From Here on the pop-up menu. Doing so causes only those operations after the selected node to be performed.

Can you provide a snapshot of your data?
Do you use more than one column as a target in your time series?
Maybe you can change the selection, click a node downstream of the model nugget, and use Run From Here from there?
What's the error message or does it just rerun the model?

Related

Issue with Copy Activity Metadata

The Copy Data activity, which used to show the number of rows written, isn't showing it any more.
Is there any option in the copy activity to make sure it reflects the number of rows written?
Whether it is a debug run of the pipeline or a triggered pipeline run, you can check the output of the Copy Data activity to confirm whether the data read is equal to the data written.
Let's say it is a triggered pipeline run. Navigate to the Monitor section and click on the pipeline run.
The activity run view opens. There you can check, from the activity's output, whether the data read is equal to the data written (the output includes details such as rows read and rows written).
NOTE: The above is for a blob-to-blob copy. For other sources and sinks there will be similar activity output data containing the required information (like rows read and rows written); an Azure SQL Database to blob copy, for example, exposes the same details.

Azure Data Factory ForEach is seemingly not running data flow in parallel

In Azure Data Factory I am using a Lookup activity to get a list of files to download, then passing it to a ForEach where a data flow processes each file.
I do not have 'Sequential' mode turned on, so I would assume the data flows should run in parallel. However, their runtimes are not the same; there is an almost constant gap between them (the first data flow ran 4 mins, the second 6, the third 8, and so on). It seems as if the second data flow waits for the first one to finish and then uses its cluster to process the file.
Is that intended behavior? If so, what is a workaround? I have a TTL set on the cluster, but that did not help much. I am currently working on creating a list of files first and using that instead of a ForEach, but I am not sure whether I will see an increase in efficiency.
I have not been able to solve the issue of the parallel data flows not executing in parallel; however, I have managed to change the solution in a way that increases performance.
What was there before: a Lookup activity that got the list of files to process, passed on to a ForEach loop containing a data flow activity.
What I am testing now: a data flow activity that gets the list of files and saves it to a text file in ADLS, followed by the data flow that was previously inside the ForEach loop, with its source changed to use "List of files" and pointed at that list.
The result was an increase in efficiency (using the same cluster, 40 files took around 80 mins with ForEach and only 2-3 mins with List of files); however, debugging is not easy now that everything is in one data flow.
You can overwrite the file-list file on each run, or use dynamic expressions and name the file after the pipeline run ID or something else.

Should mill tasks with non-file products produce something analogous to PathRef?

I'm using mill to build a pipeline that
1. cleans up a bunch of CSV files (producing new files),
2. loads them into a database,
3. does more work in the database (creates views, etc.), and
4. runs queries to extract some files.
Should the tasks associated with steps 2 and 3 be producing something analogous to PathRef? If so, what? They aren't producing a file on the disk but nevertheless should not be repeated unless the inputs change. Similarly, tasks associated with step 3 should run if tasks in step 2 are run again.
I see in the documentation for targets that you can return a case class and that re-evaluation depends on the .hashCode of the target's return value. But I'm not sure what to do with that information.
And a related question: Does mill hash the code in each task? It seems to be doing the right thing if I change the code for one task but not others.
A (cached) task in mill is re-run when the build file (build.sc or its dependencies/includes) or the inputs/dependencies of that task change. Whenever you construct a PathRef, a checksum of the path's content is calculated and used as its hashCode. This makes it possible to detect changes and only act if anything has changed.
Of course there are exceptions, e.g. input tasks (created with T.input or T.sources) and commands (created with T.command) will always run.
It is in general a good idea to return something from a task. A simple String or Int will do, e.g. to show it in the shell with mill show myTask or to post-process it later. Although I think a task running something in an external database should be implemented as a command or an input task (which might check, when running, whether it really has something to do), you can also implement it as a cached task. But please keep in mind that the state will only be correct if no other process/user changes the database in between.
That aside, you could return a current database schema/data version or a last-change date in step 2. Make sure it changes whenever the database was modified. Each change of that return value would trigger step 3 and other dependent tasks. (Of course, step 3 needs to depend on step 2.) As a bonus, you could return the same (previous/old) value in case you did not change the database, which would then avoid any later work in dependent tasks.
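For what it's worth, here is a minimal build.sc sketch of that pattern (assuming mill 0.x syntax; loadCsvInto, queryDataVersion and runDdl are hypothetical placeholders for your own database code, not mill APIs):

import mill._

object pipeline extends Module {

  // Step 1: clean the CSVs. Returning a PathRef means downstream tasks
  // only re-run when the checksum of the cleaned files changes.
  def cleanedCsvs = T {
    val out = T.dest / "cleaned"
    os.makeDir.all(out)
    // ... read the raw CSVs and write cleaned copies under `out` ...
    PathRef(out)
  }

  // Step 2: load into the database. There is no file product, so return a
  // small value that changes whenever the database contents change; its
  // hashCode is what mill uses to decide whether dependents must re-run.
  def loadDb = T {
    loadCsvInto(cleanedCsvs().path)  // placeholder: your JDBC load
    queryDataVersion()               // placeholder: e.g. max(updated_at) as a String
  }

  // Step 3: depends on loadDb()'s return value, so it re-runs only when
  // that version string changes.
  def createViews = T {
    runDdl("create or replace view ...")  // placeholder DDL
    loadDb()                              // hand the version on to further dependents
  }

  // trivial placeholders so the sketch compiles; replace with real database code
  def loadCsvInto(dir: os.Path): Unit = ()
  def queryDataVersion(): String = "v0"
  def runDdl(sql: String): Unit = ()
}

If other processes can also modify the database, step 2 would be better modelled as an input task (T.input) that re-queries that version on every evaluation, as mentioned above.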

Logic App Blob Trigger for a group of blobs

I'm creating a Logic App that has to process all blobs in a certain container. I would like to periodically check whether there are any new blobs and, if yes, start a run. I tried using the "When a blob is added or modified" trigger. However, if at the time of checking there are several new blobs, several new runs are initiated. Is there a way to initiate only one run when one or more blobs are added/modified?
I experimented with the "Number of blobs to return from the trigger" and also with the split-on setting, but I haven't found a way yet.
If you want to trigger on multiple blob files, yes, you have to use When a blob is added or modified. From the connector description:
This operation triggers a flow when one or more blobs are added or modified in a container.
You must also set the maxFileCount. As you have already found, the result is split into separate runs; this is because the splitOn setting is on by default. If you want the result delivered as a whole in a single run, you need to set it to OFF.
Then the result should be what you want.

How to determine which SSAS Cube is processing now?

There is a problem where several users can process the same cube simultaneously and, as a result, processing of the cube fails. So I need to check whether a certain cube is being processed at the current moment.
I don't think you can prevent a cube from being processed if someone else is already processing it. What you can do to "help" is run an MDX query to check the last time the cube was processed:
SELECT CUBE_NAME, LAST_DATA_UPDATE FROM $System.MDSCHEMA_CUBES
or check the sysprocesses table on the related SQL Server instance to see if processing is running:
select spid, ecid, blocked, cmd, loginame, db_name(dbid) Db, nt_username, net_library, hostname, physical_io,
login_time, last_batch, cpu, status, open_tran, program_name
from master.dbo.sysprocesses
where spid > 50
and loginame <> 'sa'
and program_name like '%Analysis%'
order by physical_io desc
go
Use this code to select running processes (execute it against the OLAP/Analysis Services instance):
select *
from $system.discover_Sessions
where session_Status = 1
And use this command to cancel a running process. Please change the SPID to the running session's SESSION_SPID, as in the example:
<Cancel xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
<SPID>92436</SPID>
<CancelAssociated>1</CancelAssociated>
</Cancel>
Probably a better approach than the ones already listed is to use SQL Server Profiler to watch activity on the Analysis Services server. As stated already, the currently popular answer has two flaws: the first option only shows the LAST time the cube was processed, and the second option only shows that something is running, not what is running; and what if your cube is not processed from a SQL Server source but from a different data source?
Utilizing SQL Server Profiler will tell you not only whether something is processing but also the details of what is processing. Most of the events you can filter out. Watch Progress Report Current events if you want real-time information; it's usually too much of a fire hose of data to get real information out of, but you will at least know that a process is going on. Watch only the Progress Report Begin and End events to get better information, such as what is currently being processed, even down to the partition level. Other events with good information include Command Begin/End and Query Begin/End.
You will see a job running in Task Manager called "MSDARCH" if a cube is processing. Not sure how you can tell which one though.