How to use a dynamic comma-separated String value as input for a List()? - scala

I'm building a Spark Scala application that dynamically lists all tables in a SQL Server database and then loads them into Apache Kudu.
I'm building a dynamic string variable that tracks the primary key columns for each table. The primary keys are comma-separated within the variable. The following is an example of my variable value:
PrimaryKeys=storeId,storeNum,custId
The following is a required function that I must give a List[String] as input (hard-coding the primary keys like this is definitely not correct):
setRangePartitionColumns(List("storeId","storeNum","custId").asJava
If I just use the PrimaryKeys variable for the List input (like the following), it only works for a single column (and would fail in this example with three comma-separated values):
setRangePartitionColumns(List(PrimaryKeys).asJava)
The following is another example, but using a Seq(). I'm supposed to put the same primary key column names in the same format below. Manually typing the column names works fine, but I cannot figure out how to dynamically input the variable values:
kuduContext.createTable(tableName, df.schema, Seq(PrimaryKey), kuduTableOptions)
Any idea how I can parse the variable PrimaryKey dynamically and feed it into either function, regardless of the number of comma-separated values included?
Any assistance is greatly appreciated.
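For reference, a minimal sketch of the parsing being asked about (assuming PrimaryKeys holds a plain comma-separated String like the example above, and introducing a new primaryKeyCols value) could look like this:

import scala.collection.JavaConverters._   // use scala.jdk.CollectionConverters._ on Scala 2.13+

// Split the comma-separated variable into a Scala List, trimming stray whitespace
val primaryKeyCols: List[String] = PrimaryKeys.split(",").map(_.trim).toList

// Java list for setRangePartitionColumns
setRangePartitionColumns(primaryKeyCols.asJava)

// A Scala List is already a Seq, so it can be passed to createTable directly
kuduContext.createTable(tableName, df.schema, primaryKeyCols, kuduTableOptions)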

Related

How to pass special character as parameter in Azure Data Factory?

I am trying to parametrize the Column Delimiter field in a CSV dataset in Azure Data Factory.
(https://i.stack.imgur.com/JGkD5.png)
This unfortunately doesn't work when I pass a special character as a parameter.
When I hardcode the special character in the column delimiter field all works as expected.
This works
However, when I have \u0006 as a parameter in SQL DB (varchar(10) type)
(https://i.stack.imgur.com/GyTdr.png)
and I pass it in the pipeline
(https://i.stack.imgur.com/CUeRf.png)
The Copy Data activity doesn't detect this special character as a delimiter.
My guess is that when I use a parameter it passes \u0006 as a string, but I can't find anywhere how to bypass that.
I tried to pass \u0006 to the column delimiter as dynamic content. It was not treated as the column delimiter; all data was shown as a single column.
Therefore, I tried passing the equivalent symbol of \u0006, ACK, as a dynamic value to the column delimiter, and it worked. I converted \u0006 into the special character using a SQL script. Below are the steps to do this.
File delimiters are stored in a table.
To convert this column into the equivalent character, \u is removed from the value and the remaining digits are cast to an integer (which works here because 0006 contains only decimal digits). The nchar() function is then applied to that integer.
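-- right(file_delimiter, 4) keeps the four digits after \u, cast('0006' as int) yields 6,
-- and nchar(6) returns the actual ACK control character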
select nchar(cast(right(file_delimiter,4) as int)) as file_delimiter from t5
The above SQL query is used in a Lookup activity in ADF.
When this value is passed as dynamic content to the column delimiter of the dataset, the values are properly delimited.
Once the pipeline is run, the data is copied successfully.

Extract table name from more than one schema length in Data Factory Expression

I need to extract the table name from schema.table_name, and I have more than one schema whose length is unknown, e.g. Finance.Reporting or Setup.System. I want to extract Reporting and System from these strings using an expression in Data Factory.
You can use the split() function to split the string on the delimiter, which returns an array, and then take the second value from it.
Note: Array index starts from 0.
@split('Finance.Reporting','.')[1]
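Applied to the second example from the question, @split('Setup.System','.')[1] would return System in the same way.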

MD5/SHA Field Dataset in Data Fusion

I need to concatenate a few string values in order to obtain the SHA-256 hash of the combined string. I've seen Data Fusion has a plugin to do the job:
The documentation, however, is very poor and nothing I've tried seems to work. I created a table in BQ with the string fields I need to concatenate, but the output is the same as the input. Can anyone provide an example of how to use this plugin?
EDIT
Below is the example.
This is what the workflow looks like:
For testing purposes, I added one column with the following string:
2022-01-01T00:00:00+01:00
And here's the output:
You can use Wrangler to concatenate the string values.
I tried your scenario, adding a Wrangler to the pipeline:
Joining 2 Columns:
I named the column new_col, using , as delimiter:
Output:
What you described can be achieved with two Wrangler steps:
The first Wrangler will do what @angela-b described. Use the merge directive to create a new column with the concatenation of two columns. Example directive that joins columns a and b using , as the delimiter and stores the result in column a_b:
merge a b a_b ,
The second Wrangler will use the hash directive which will hash the column in place using a specified algorithm. Example of a directive that hashes column a_b using MD5:
hash :a_b 'MD5' true
Remember to set the last parameter encode to true so that you get a string output instead of a byte array.
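Note that the question asks for SHA-256 rather than MD5; assuming the hash directive accepts the standard algorithm names, the same pattern should apply, e.g. hash :a_b 'SHA-256' true, but check the directive's documentation for the exact list of supported algorithms.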

How to pass an array for column pattern matching in mapping dataflow derived column from CSV file through pipeline?

I have a mapping data flow with a derived column, where I want to use a column pattern for matching against an array of columns using in().
The data flow is executed in a pipeline, where I set the parameter $owColList_md5 based on a variable that I've populated from a single-line CSV file containing a comma-separated string.
If I have a single column name in the CSV file/variable enclosed in single quotes and have the "Expression" checkbox ticked, it works.
The problem is getting it to work with multiple columns. There seem to be parsing problems when the variable contains multiple items each enclosed in single quotes, or potentially with the comma separating them. This often causes errors when executing the data flow, with messages like "store is not defined", etc.
I've tried having ''col1'',''col2'' and "col1","col2" (2x single quotes and double quotes) in the CSV file. I've also tried having the file without quotes, and replacing the comma with escaped quotes (using ) in the derived column pattern expression, with no luck.
How do you populate this array in the derived column from the data flow parameter, which in turn comes from the comma-separated string of column names in the CSV file / pipeline variable?
While array types are not supported as data flow parameters, passing in a comma-separated string can work if you use the instr() function to match.
Say you have two columns, col1 and col2. Pass in a parameter with value '"col1","col2"'.
Then use instr($<yourparamname>, '"' + name + '"') > 0 to see if the column name exists within the string you pass in. Note: you do not need the double quotes, but they can be useful if you have column names that are substrings of other column names, such as id1 and id11.
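For example, with the $owColList_md5 parameter from the question, the column pattern condition would be something like instr($owColList_md5, '"' + name + '"') > 0.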
Hope this helps!

SSRS multi value parameter - can't get it to work

First off, this is my first attempt at a multi-select. I've done a lot of searching but I can't find an answer that works for me.
I have a PostgreSQL query which has bg.revision_key in (_revision_key), which holds the parameter. As a side note, we've named all our parameters in the queries with the underscore and they all work; they are single-select in SSRS.
In my SSRS report I have a parameter called Revision Key Segment, which is the multi-select parameter. I've ticked Allow multiple values, and in Available Values I have the value field pointing to revision_key in the dataset.
In my dataset parameter options I have Parameter Value [@revision_key]
In my shared dataset I also have my parameter set to allow multiple values.
For some reason I can't seem to get the multi-select to work, so I must be missing something somewhere, but I've run out of ideas.
Unlike with SQL Server, when you connect to a database using an ODBC connection, the parameter support is different. You cannot use named parameters and instead have to use the ? syntax.
In order to accommodate multiple values you can concatenate them into a single string and use a like statement to search them. However, this is inefficient. Another approach is to use a function to split the values into an in-line table.
In PostgreSQL you can use an expression like this:
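-- ? is the single ODBC parameter holding the comma-separated values;
-- regexp_split_to_table splits it into one row per value, which is then cast to int for the join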
inner join (select CAST(regexp_split_to_table(?, ',') AS int) as filter) as my on my.filter = key_column
Then in the dataset properties, under the parameters tab, use an expression like this to concatenate the values:
=Join(Parameters!Keys.Value, ",")
In other words, the report concatenates the selected values into a comma-separated list, and the database splits them into a table of integers and inner joins on those values. For example, if the user selects revision keys 3, 7, and 12, the Join expression passes the string 3,7,12, and regexp_split_to_table turns it back into three integer rows that match against bg.revision_key.