proc fastclus to calculate new seeds for proc cluster

I am using FASTCLUS in SAS and want to use its final seeds as the initial seeds for PROC CLUSTER/FASTCLUS. Please let me know which option is available for that in SAS.

If you want PROC FASTCLUS to recalculate the cluster seeds at each iteration, you can use the DRIFT option. You can read about the options in PROC FASTCLUS here:
http://www.math.wpi.edu/saspdf/stat/chap27.pdf

Related

how to insert the data from delta table to a variable in order to apply drools rule on them

I am using Spark with Scala, receiving streaming data from Event Hubs and storing it in a Delta table. In order to apply drools rules to the data, I need to pass it through variables... I am stuck at the point where I have to get the data from the Delta table into a variable.
It really depends on what data you need to pass to those drools rules and what you need to return. You can either:
Use a user-defined function: you define a function that will receive one or more parameters (the column values of specific rows). (more examples)
Use the map function of the Dataset / DataFrame class to process the whole Row (doc, and examples)
Delta Tables can be read into DataFrames. A variable can be assigned to point to the DataFrame.
df = spark.read.format("delta").load("some/delta/path")
Once the Delta Table is read, you can apply your custom transformations:
transformed_df = df.transform(first_transform).transform(second_transform)
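For illustration, here is a minimal PySpark sketch of the user-defined-function option above; the evaluate_rule function and the amount column are hypothetical stand-ins for your drools call and your actual schema:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# Read the Delta table into a DataFrame and keep a reference to it in a variable
df = spark.read.format("delta").load("some/delta/path")

# Hypothetical wrapper around your rule logic; in practice this is where you
# would call out to your drools rule for a single column value or row
def evaluate_rule(amount):
    return amount is not None and amount > 100

rule_udf = udf(evaluate_rule, BooleanType())

# Apply the rule row by row and keep the result as a new column
result_df = df.withColumn("rule_passed", rule_udf(df["amount"]))
result_df.show()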
Hope this helps point you in the right direction.

Azure Data Factory - Insert Sql Row for Each File Found

I need a data factory that will:
check an Azure blob container for csv files
for each csv file
insert a row into an Azure Sql table, giving filename as a column value
There's just a single csv file in the blob container and this file contains five rows.
So far I have the following actions:
Within the for-each action I have a copy action. I gave this a source of a dynamic dataset with the filename set as a parameter from @item().name. However, as a result 5 rows were inserted into the target table whereas I was expecting just one.
The for-each loop executes just once, but I don't know how to use a data source that is a variable (or variables) holding the filename and timestamp.
You are headed in the right direction, but within the For each you just need a Stored Procedure Activity that will insert the FileName (and whatever other metadata you have available) into Azure DB Table.
Here is an example of the stored procedure in the DB:
CREATE PROCEDURE Log.PopulateFileLog (@FileName varchar(100))
AS
INSERT INTO Log.CvsRxFileLog
SELECT
    @FileName AS FileName,
    GETDATE() AS ETL_Timestamp
EDIT:
You could also execute the insert directly with a Lookup Activity within the For Each:
EDIT 2:
This shows how to do it without a For Each.
NOTE: This is the most cost-effective method, especially when dealing with hundreds or thousands of files on a recurring basis!
1st, copy the output JSON array from your Lookup/Get Metadata activity using a Copy Data activity with a source of Azure SQL DB and a sink of a Blob Storage CSV file.
SOURCE:
SINK:
2nd, create another Copy Data activity with a source of the Blob Storage JSON file and a sink of Azure SQL DB.
SOURCE:
SINK:
MAPPING:
In essence, you save the entire JSON output to a file in Blob storage, then copy that file as a JSON file type into Azure SQL DB. This way you only have 3 activities to run, even if you are inserting from a dataset that has 500 items in it.
Of course there is always more than one way to do things, but I don't think you need a For Each activity for this task. Activities like Lookup, Get Metadata and Filter output their results as JSON which can be passed around. This JSON can contain one or many items and can be passed to a Stored Procedure. An example pattern:
This is the sort of ELT pattern common with early ADF gen 2 (prior to Mapping Data Flows) which makes use of resources already present in your architecture. You should remember that you are charged per activity execution in ADF (e.g. multiple iterations in an unnecessary For Each loop), and that generally compute in Azure is expensive while storage is cheap, so bear this in mind when implementing patterns in ADF. If you build the pattern above you have two types of compute: the compute behind your Azure SQL DB and the Azure Integration Runtime. If you add a Data Flow to that, you will have a third type of compute operating concurrently with the other two, so personally I only add these under certain conditions.
An example implementation of the above pattern:
Note the expression I am passing into my example logging proc:
@string(activity('Filter1').output.Value)
Data Flows is perfectly fine if you want a low-code approach and do not have compute resource already available to do this processing. In your case you already have an Azure SQL DB which is quite capable with JSON processing, eg via the OPENJSON, JSON_VALUE and JSON_QUERY functions.
You mention not wanting to deploy additional code, which I understand, but then where did your original SQL table come from? If you are absolutely against deploying additional code, you could simply call the sp_executesql stored proc via the Stored Proc activity and use a dynamic SQL statement which inserts your record, something like this:
@concat('INSERT INTO dbo.myLog ( logRecord ) SELECT ''', string(activity('Filter1').output), ''' ')
Shred the JSON either in your stored proc or later, eg
SELECT y.[key] AS name, y.[value] AS [fileName]
FROM dbo.myLog
CROSS APPLY OPENJSON( logRecord ) x
CROSS APPLY OPENJSON( x.[value] ) y
WHERE logId = 16
AND y.[key] = 'name';

How to save Data factory stored procedure output

Whenever I execute a stored procedure in the ADFv2, it gives me an output as
{
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (Australia Southeast)",
"executionDuration": 34
}
even though I have set 2 output variables in the procedure. Is there any way to map the output of the stored procedure in ADFv2? So far I can map the output of all the other activities, but not of stored procedures.
You could use a lookup activity to get the result.
Please reference this post. https://social.msdn.microsoft.com/Forums/azure/en-US/82e84ec4-fc40-4bd3-b6d5-b742f3cd1a33/adf-v2-how-to-check-if-stored-procedure-output-is-empty?forum=AzureDataFactory
Update by Gagan:
Instead of getting the output of the SP (which is not possible in ADFv2 right now), I stored the output in a table and then applied a Lookup / ForEach over that table to get the value.
A stored procedure call in Data Factory (v2) does not capture the result data set, so you cannot use the Stored Procedure activity to get the result data set and refer to it in subsequent activities.
The workaround is to use a Lookup activity to call the exact same stored procedure, as the Lookup will get you the result data set from the stored procedure. Replace your Stored Procedure activity with a Lookup and it will work.

Redshift to dask DataFrame

Does anyone have a nice neat and stable way to achieve the equivalent of:
pandas.read_sql(sql, con, chunksize=None)
and/or
pandas.read_sql_table(table_name, con, schema=None, chunksize=None)
connected to Redshift with SQLAlchemy & psycopg2, directly into a dask DataFrame?
The solution should be able to handle large amounts of data
You might consider the read_sql_table function in dask.dataframe.
http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_sql_table
df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
                       npartitions=10, index_col='id')
This relies on the pandas.read_sql_table function internally, so should be able to operate with the same restrictions, except that now you're asked to provide a number of partitions and an index column.
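If you are pointing this at Redshift rather than SQLite, the same call should work with a SQLAlchemy/psycopg2 connection URI, since Redshift speaks the PostgreSQL protocol. A rough sketch, where the host, credentials, table name and index column are all placeholders:

import dask.dataframe as dd

# Hypothetical Redshift connection URI; Redshift is reachable through the
# postgresql+psycopg2 dialect in SQLAlchemy
uri = "postgresql+psycopg2://user:password@my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb"

# index_col should be a numeric or datetime column that dask can use to split
# the table into npartitions roughly equal ranges
df = dd.read_sql_table("accounts", uri, index_col="id", npartitions=10)

print(df.head())  # triggers a small query against the first partition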

How can I ensure Redshift Unload Copy columns are in correct order?

I am trying to use the UnloadCopyUtility to migrate an instance to an encrypted instance, but some of the tables fail because it is trying to insert the values into the wrong columns. Is there a way I can ensure the columns are mapped to the values correctly? I can adjust the Python script locally if need be.
I feel this should be possible in the UnloadCopy utility as well.
But here I'm trying to give a more generic solution without the UnloadCopy utility, so that it may be helpful to others as an alternative.
In the unload command you can specify the columns, like C1,C2,C3,...
Use the same column sequence in the copy command while loading the data into Redshift.
Unload command example:
unload ('select C1,C2,C3,... from venue') to 's3://mybucket/tickit/unload/venue_' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' parallel off;
Copy command example with the specific column sequence of the above unloaded files:
copy table (C1,C2,C3,...) from 's3://<your-bucket-name>/load/key_prefix' credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' options;