Input two datasets into Orange Python Script widget?

Is there a way to input two data tables (test & train sets) into Orange3's Python Script widget?

Yes. Since Orange 3.6.0 the Python Script widget accepts multiple inputs of the same type: connect several data sources (for example, your train and test File widgets) to the widget's "Data" input and access them through the "in_datas" variable, which will be a list of data tables.
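For example, inside the Python Script widget's editor a minimal sketch looks something like the lines below (which table ends up first in in_datas depends on the order in which you connected the links, so verify that in your own workflow):
# in_datas is a list of all tables connected to the "Data" input
train = in_datas[0]
test = in_datas[1]
print(len(train), "training rows,", len(test), "test rows")
# send one of the tables on to downstream widgets via the standard output variable
out_data = test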

Related

Azure Data Factory - run script on parquet files and output as parquet files

In Azure Data Factory I have a pipeline, created from the built-in copy data task, that copies data from 12 entities (campaign, lead, contact, etc.) from Dynamics CRM (using a linked service) and outputs the contents as parquet files in a storage account. This is run every day, into a folder structure based on the date. The output structure in the container looks something like this:
Raw/CRM/2022/05/28/campaign.parquet
Raw/CRM/2022/05/28/lead.parquet
Raw/CRM/2022/05/29/campaign.parquet
Raw/CRM/2022/05/29/lead.parquet
That's just an example, but there is a folder structure for every year/month/day that the pipeline runs, and a parquet file for each of the 12 entities I'm retrieving.
This involved creating a pipeline, a dataset for the source and a dataset for the target. I modified the pipeline to add the pipeline's run date/time as a column in the parquet files, called RowStartDate (which I'll need in the next stage of processing).
My next step is to process the data into a staging area, which I'd like to output to a different folder in my container. My plan was to create 12 scripts (one for campaigns, one for leads, one for contacts, etc.) that essentially do the following:
access all of the correct files, using a wildcard path along the lines of: Raw/CRM/*/*/*/campaign.parquet
select the columns that I need
rename the column headings
in some cases, just take the most recent data (using RowStartDate)
in some cases, create a slowly changing dimension, ensuring every row has a RowEndDate
I made some progress figuring out how to do this in SQL, by running a query using OPENROWSET with wildcards in the path as per above - but I don't think I can use my SQL script in ADF to move/process the data into a separate folder in my container.
My question is, how can I do this (preferably in ADF pipelines):
for each of my 12 entities, access each occurrence in the container with some sort of Raw/CRM/*/*/*/campaign.parquet statement
Process it as per the logic I've described above - a script of some sort
Output the contents back to a different folder in my container (each script would produce 1 output)
I've tried:
Using Azure Data Factory, but when I tell it which dataset to use, I point it to the dataset I created in my original pipeline; this dataset has all 12 entities in it, and the data flow activity produces the error "No value provided for Parameter 'cw_fileName'". I don't see any place when configuring the data flow to specify a parameter (it's not under source settings, source options, projection, optimize or inspect).
Using Azure Data Factory, I tried to add a script, but in trying to connect to my SQL script in Synapse, I don't know my Service Principal key for the Synapse workspace.
Using a Databricks notebook, I tried to mount my container but got an error along the lines of "adding a secret to a Databricks scope doesn't work in Standard tier", so I couldn't proceed.
Using Synapse, but as expected, it wants things in SQL, whereas I'm trying to keep things in a container for now.
Could anybody point me in the right direction? What's the best approach that I should take? And if it's one that I've described above, how do I go about getting past the issue I've described?
Pass the data flow's dataset parameter values from the pipeline's Data Flow activity settings: when a dataset used by the data flow is parameterized (as with cw_fileName here), the parameter value is supplied in the Data Flow activity's settings in the pipeline, not inside the data flow editor itself.

Exporting the results of an Anylogic experiment

I have built my model and run the experiment. I cannot seem to find where the data is stored.
I now need to conduct several runs and compare the results. I am using normally distributed repair times, so the results should vary between runs without modifying parameters.
How can I keep the results of each run and then present them all in the same data set?
There are two main options for getting data out of your simulation:
Using the internal AnyLogic database
Using external files like Excel or txt
Step 1: Set up your objects
Internal Database
Create an empty table with the columns you require
External object
Set up either an Excel or text file using the objects provided by AnyLogic in the Connectivity palette
Step 2: Saving your data
For both cases you need to write your data to the object of your choosing, either as the data gets generated or at the end of the simulation run.
Using the internal DB
The best option is to write data using the following command
insertInto(table_name)
.columns(column_name)
.values(value);
This will insert a new line into the database table that you created; you can save multiple values to multiple columns by adding comma-separated entries to the columns and values parameters.
e.g.
insertInto(temperature_output_table)
.columns(scenario_name, time, temperature)
.values("scenario1", 10.5, 102);
External files
2.1) Using Excel
filename.setCellValue(value, sheetName, row, column);
or even better you can write out an entire dataset
excelFile.writeDataSet(dataset, sheetName, row, column);
2.2) Using a text file
fileName.println("value" + "\t" + " value 2");
You can use whatever separator you want: "\t" for tab-separated, "," for comma-separated, and so on.
Step 3: Finish and export data
Internal Database
At the end of a simulation run, you can simply export the data
See help here https://anylogic.help/anylogic/connectivity/export-excel.html#exporting-data-to-ms-excel-workbook
P.S. It is possible to automate this with some effort
External file
For the Excel file you need to call .writeFile() to finish.
For both objects, you need to call .close() for them to be closed and the contents saved to disk.
FYI
The Excel file object also has the option to save on termination.
Read more on using Excel here - https://anylogic.help/anylogic/connectivity/excel-file.html#writing-to-excel-file
And on text files here: https://anylogic.help/anylogic/connectivity/text-file.html#replicated
There is also an example model.

Creating Custom Jupyter Widgets

I'm trying to create a custom Jupyter widget that takes a pandas.DataFrame as an input and simply renders a modified HTML version of the dataframe as an output. I'm stuck at the start in terms of defining a dataframe as the input for the widget.
I have tried to follow the online examples, and I think I would be fine with most string inputs to a widget, but I'm lost when trying a dataframe as an input.
I'd just like to be able to pass a dataframe into my custom widget and validate that it is a dataframe.
You can do this using jp_proxy_widget. In fact it is almost implemented in this notebook:
https://nbviewer.jupyter.org/github/AaronWatters/jp_doodle/blob/master/notebooks/misc/In%20place%20html%20table%20update%20demo.ipynb
The implementation is more complex than you requested because it supports in-place updates of the table.
Please see https://github.com/AaronWatters/jp_proxy_widget
The example notebook is from https://github.com/AaronWatters/jp_doodle
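As a minimal sketch of the basic idea (assuming pandas and jp_proxy_widget are installed; the helper name show_dataframe is just for illustration):
import jp_proxy_widget
import pandas as pd
def show_dataframe(df):
    # validate the input before rendering anything
    if not isinstance(df, pd.DataFrame):
        raise TypeError("expected a pandas.DataFrame, got %r" % type(df))
    widget = jp_proxy_widget.JSProxyWidget()
    # inject the dataframe's HTML rendering into the widget's DOM element
    widget.element.html(df.to_html())
    return widget
# in a notebook cell, the returned widget is displayed when it is the last expression
show_dataframe(pd.DataFrame({"a": [1, 2], "b": [3, 4]}))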

Assigning bins to records in CHAID model

I built a custom CHAID tree in SPSS modeler. I would like to assign the particular terminal nodes to all of the records in the dataset. How would I go about doing this from within the software?
Assuming that you used the regular node called CHAID: inside the diamond icon (the generated CHAID model), if you select the rule identifier option on the configuration tab, the output will add another variable called $RI-XXX that classifies all the records into their terminal nodes. Just check that option and then put a Table node after it, and all the records will be classified.
You just need to apply the algorithm to whatever data set you need; you only need the inputs to be the same (type and, possibly, storage).
The diamond contains the algorithm, and you can disconnect it and connect it to whatever you want.
http://beyondthearc.com/blog/wp-content/uploads/2015/02/spss.png

Using Talend Open Studio DI to extract a value from a unique 1st row before continuing to process columns

I have a number of Excel files where there is a line of text (and a blank row) above the header row for the table.
What would be the best way to process the file so I can extract the text from that row AND include it as a column when appending multiple files? Is it possible without having to process each file twice?
Example
This file was created on machine A on 01/02/2013
Task|Quantity|ErrorRate
0102|4550|6 per minute
0103|4004|5 per minute
And end up with the data from multiple similar files
Task|Quantity|ErrorRate|Machine|Date
0102|4550|6 per minute|machine A|01/02/2013
0103|4004|5 per minute|machine A|01/02/2013
0467|1264|2 per minute|machine D|02/02/2013
I put together a small, crude sample of how it can be done. I call it crude because (a) it is not dynamic: you can add more files to process, but you need to know how many files in advance of building your job, and (b) it shows the basic concept but would require more work to suit your needs. For example, in my test files I simply have "MachineA" or "MachineB" in the first line. You will need to parse that data out to obtain the machine name and the date.
But here is how my sample works. Each Excel file is set up as two inputs. For the header, the tFileInputExcel is configured to read only the first line, while the body tFileInputExcel is configured to start reading at line 4.
In the tMap they are combined (not joined) into the output schema. This is done for the Machine A and Machine B Excel files, then those tMap outputs are combined with a tUnite for the final output.
As you can see in the tLogRow output, the data is combined and includes the header info.
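For what it's worth, the same idea can also be sketched outside Talend with a few lines of pandas, just to illustrate the concept (the folder path and the parsing of the first line are assumptions you would adapt to your real files):
import glob
import pandas as pd
frames = []
for path in glob.glob("input/*.xlsx"):  # hypothetical folder of similar Excel files
    # line 1 holds the metadata sentence, e.g. "This file was created on machine A on 01/02/2013"
    meta = str(pd.read_excel(path, header=None, nrows=1).iloc[0, 0])
    parts = meta.split(" on ")          # crude parse; adjust to the real sentence structure
    machine, date = parts[1], parts[2]
    # line 2 is blank and line 3 is the header row, so skip the first two lines
    body = pd.read_excel(path, skiprows=2)
    body["Machine"] = machine
    body["Date"] = date
    frames.append(body)
combined = pd.concat(frames, ignore_index=True)   # the equivalent of the tUnite step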