Watson Conversation: What EXACTLY is Watson training when it is "training"?

When the "try it out" box in the Watson Conversation Tool shows "training"
(for instance: when it is getting the violet status "training" label text, when for instance assigning a new intent to a user input text):
What exactly is Watson training then? What is Watson doing at that very moment? Does this have an impact on the Workspace Data and JSON Dump Files created from this workspace?
I am asking because I am wondering about the following:
When I have a workspace "A",
do a lot of training on it,
dump this workspace to a JSON file and
use this JSON file to upload into a newly created workspace "B":
Do I have to retrain workspace "B" AGAIN (because all the training data is lost), or will the newly created workspace "B" have all the "trained" knowledge from its original source workspace "A"?
Is "training" something that is reflected in the workspace's JSON dump file?

The training creates a model that understands what each of your intents means, so that Watson can take a question it has never seen before and map it to an intent (or recognize that it is not related to any of them).
Each workspace is independent, so you need to create the model again if you move the workspace to a different one. After that, the training only regenerates if you change an intent or entity in the workspace.
You can use the workspace API to query the workspace, and it will tell you whether the training has completed.
See: https://stackoverflow.com/a/46785752/1167890
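For illustration, here is a minimal sketch of such a status check in plain Java (the service URL, version date, workspace ID, and API key are placeholders; adjust them to your own instance and authentication scheme):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class WorkspaceStatusCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder values: substitute your own service URL, workspace ID,
        // API version date and credentials.
        String baseUrl = "https://gateway.watsonplatform.net/assistant/api";
        String workspaceId = "YOUR_WORKSPACE_ID";
        String version = "2018-09-20";
        String auth = Base64.getEncoder()
                .encodeToString("apikey:YOUR_API_KEY".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v1/workspaces/" + workspaceId
                        + "?version=" + version))
                .header("Authorization", "Basic " + auth)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The returned workspace JSON contains a "status" field:
        // "Training" means the model is still being rebuilt,
        // "Available" means training has finished.
        System.out.println(response.body());
    }
}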

Related

Best Practice to Store Simulation Results

Dear AnyLogic community,
I am struggling to find the right approach for storing my simulation results. I have created datasets that keep track of every value I am interested in. They live in Main (see below).
My aim is to do a parameter variation experiment. In every run, I change the value of p_nDrones (see below).
After the experiment, I would like to store all the datasets in one excel sheet.
However, when I do the parameter variation experiment and afterwards check the log of the dataset (datasets_log), the changed values do not even show up (2 is the value I set in the normal simulation).
Now my question: do I need to create another type of dataset if I want to track the values that are produced in the experiments? Why are they not stored after executing the experiment?
I would really appreciate it if someone could share the best way to set up this export of experiment results. I would like to store the whole time series for every dataset.
Thank you!
The best option would be to write the outputs to an external file at the end of each model run.
If you want to use Excel, you can, although I personally would not advise it, even though it has a nice excelFile.writeDataSet() function.
I would rather write the data to a text file: you have much more control over the writing and over the file itself, it is thread-safe, and the output is usable in many more tools than Microsoft Excel.
See my example below:
Set up parameters of type TextFile in your model that you will write the data to at the end of the run. Here I used the model's "On destroy" code to write out the data from the datasets, as sketched below.
Here you can immediately see the benefit of using the text file: you can add the number of drones we are simulating (or a scenario name, or any other parameter) as an extra column, whereas with Excel this would be a pain.
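A rough sketch of that "On destroy" code (the names ds_utilization and p_outputFile are placeholders for your own dataset and TextFile parameter; p_nDrones is the parameter from the question):

// Main -> "On destroy": append one row per dataset point to the shared text file.
// p_outputFile is a parameter of type TextFile supplied by the experiment,
// p_nDrones is the parameter varied per run, ds_utilization is one of the datasets.
for (int i = 0; i < ds_utilization.size(); i++) {
    p_outputFile.println(p_nDrones + "\t" + ds_utilization.getX(i) + "\t" + ds_utilization.getY(i));
}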
Now you can pass the specific text file the model should use by adding it to the parameter variation experiment and providing it to the model through its parameters.
You will see that I also set up some headers for the text file in the Initial experiment setup section, and at the very end of the experiment I close the text files in the After experiment section so that they can be used afterwards.
Here is the result if you simply right-click on the text files and open them in Excel. (Excel will always have a purpose, even if it is just to open text files ;-) )
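For completeness, a sketch of the experiment-side code, again with placeholder names (outputFile would be a Text File element on the Parameter Variation experiment, handed to Main through its p_outputFile parameter):

// Parameter Variation experiment -> "Initial experiment setup":
// write the header line once, before any run starts.
outputFile.println("nDrones\ttime\tvalue");

// Parameter Variation experiment -> "After experiment":
// close the file so it is flushed and can be opened in other tools.
outputFile.close();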

Is it possible to merge Azure Data Factory data flows?

I have two separate Data flows in Azure Data Factory, and I want to combine them into a single Data flow.
There is a technique for copying elements from one Data flow to another, as described in this video: https://www.youtube.com/watch?v=3_1I4XdoBKQ
This does not work for Source or Sink stages, though. The Script elements do not contain the Dataset that the Source or Sink is connected to, and if you try to copy them, the designer window closes and the Data flow is corrupted. The details are in the JSON, but I have tried copying and pasting into the JSON and that doesn't work either - the source appears on the canvas, but is not usable.
Does anyone know if there is a technique for doing this, other than just manually recreating the objects on the canvas?
Thanks Leon for confirming that this isn't supported. Here is my workaround process:
1. Open the Data Flow that will receive the merged code.
2. Open the Data Flow that contains the code to merge in.
3. Go through the to-be-merged flow and change the names of any transformations that clash with the names of transformations in the target flow.
4. Manually create, in the target flow, any Sources that do not already exist.
5. Copy the entire script out of the to-be-merged flow into a text editor.
6. Remove the Sources and Sinks.
7. Copy the remaining transformations to the clipboard and paste them into the target flow's script editor.
8. Manually create the Sinks, remembering to set all properties such as "Allow Update".
Be warned that if you make a mistake and paste in something incorrect, the flow editor window will close and the flow will be unusable. The only way to recover it is to refresh and discard all changes since you last published, so don't do this if you have other unpublished changes that you don't want to lose!
I have already established a practice in our team that no mappings are done in Sinks. All mappings are done in Derived Column transformations, and any column name ambiguity is resolved in a Select transformation, so the Sink is always just auto-map. That makes operations like this simpler.
It should be possible to keep the Source definitions in Step 6, remove the Source elements from the target script, and paste the new Sources in to replace them, but that's a little more complex and error-prone.

Can a Mapping Data Flow use a parameterized Parquet dataset?

Thanks for coming in.
I am trying to develop a Mapping Data Flow in an Azure Synapse workspace (so I believe this also applies to ADFv2) that takes a Delta input and transforms it straight into Parquet-formatted output. The relevant detail is that the Parquet dataset points to ADLS Gen2 with a parameterized file system and folder, as opposed to a hard-coded file system and folder, because hard-coding would require creating too many datasets; there are too many folders of interest in the Data Lake.
The Mapping Data Flow:
When I try to use it as a Source in my Mapping Data Flows, the debug configuration (as well as the parent pipeline configuration) duly asks for my input on those parameters, which I am happy to enter.
Then, as soon as I try to debug or run the pipeline, I get this error in less than a second:
{
"Message": "ErrorCode=InvalidTemplate, ErrorMessage=The expression 'body('DataFlowDebugExpressionResolver')?.50_DeltaToParquet_xxxxxxxxx?.ParquetCurrent.directory' is not valid: the string character '_' at position '43' is not expected."
}
RunId: xxx-xxxxxx-xxxxxx
This error message is not specific enough for me to know where I should look.
I tried replacing the parameterized Parquet dataset with a hard-coded one, and it works perfectly in both debug and pipeline-run modes. However, this does not get me what I need, which is the ability to reuse my Parquet dataset instead of having to create a specific dataset for each Data Lake folder.
There are also no spaces in the Data Lake file system. Please refer to these parameters, which look a lot like my production environment:
File System: prodfs001
Directory: synapse/workspace01/parquet/dim_mydim
Thanks in advance to all of you, folks!
The directory name synapse/workspace01/parquet/dim_mydim has an _ in dim_mydim. Can you try replacing the underscore, or use dimmydim instead, to test whether it works?

How do I use Cloud Dataprep to convert my Excel file to a target format regularly?

I'd like to convert my Excel file to the proper format using Google Cloud Dataprep. How do I save my conversion flow and use it as a template? For example, if there are two Excel files named A and B and I create a flow to merge these two, how can I use that same flow the next time to merge two other files named C.xlsx and D.xlsx?
You can copy and reuse recipes (using the right-click or ... context menus and selecting Make a copy > Without inputs), or you can swap the input dataset for the original recipe and select your new file without having to recreate the recipe.
If your goal is automation, this is a bit more difficult when your source is an Excel file (as Excel files are only an allowed format when using the uploader).
If you're able to have the data output in a CSV and uploaded to Cloud Storage, it opens up additional opportunities to schedule and parameterize your process.

Change name of component

To create a new dev stream from an existing stream, I first created a snapshot of the existing stream and from this snapshot I created a new prod stream.
(similar to a ClearCase UCM rebase of a baseline from a parent Stream to a child Stream)
All of the new stream's components are the same as before, so 'dev-stream' and 'prod-stream' have the same components (the components have the same name and point to the same baseline).
Should a copy of the components not have been created instead, with the new baseline?
Here is a screenshot of how my component appears in RTC for both 'dev-stream' and 'prod-stream':
The baseline should not contain the word "prod", as this is a dev stream.
The problem is circled in red in the screenshot: how and why has the word 'prod' appeared in the component name? Can 'prod' be removed from the name?
The component must be the same when you add a snapshot to a new stream: same name and same baseline name. (This is very similar to a ClearCase UCM rebase, where you would find the same baseline name used as a foundation baseline for the sub-stream.)
The idea behind a stream is to list what you need in order to work: this is called a configuration, as in "scm" (which can stand for "source configuration management", not just "source code management").
The fact that your new stream starts working with a baseline named with "prod" in it has no bearing on the kind of development you are about to do in said new stream.
It is just a "starting point" (like "foundation baselines" in ClearCase are). Again, no copy or rename involved here.
In your previous question, you mentioned having the current stream as 'dev-stream', but that has no influence on the name of the baselines already delivered in that first Stream (whatever its name is). Those baselines keep their name, and if you snapshot them and reuse that snapshot in a new stream, you will get the exact same baseline name.
But the name of the first baseline you are using as a starting point doesn't matter, as long as its content allows you to start a separate development effort, isolated in its own stream.
Any baseline you create and deliver on said new stream will be displayed on it, and you will no longer see that first baseline name.