How to pass a schema file as a macro to the BigQuery sink in Data Fusion - google-cloud-data-fusion

I am creating a Data Fusion pipeline to load CSV data from GCS into BigQuery. For my use case, I need to create a property macro and provide its value at runtime. I need to understand how to pass the schema file as a macro to the BigQuery sink.
If I simply pass the JSON schema file path as the macro value, I get the following error:
java.lang.IllegalArgumentException: Invalid schema: Use JsonReader.setLenient(true) to accept malformed JSON at line 1 column 1

There is currently no way to use the contents of a file as a macro value, though there is a JIRA open for something like this (https://issues.cask.co/browse/CDAP-15424). The schema contents themselves are expected to be set as the macro value. The UI currently doesn't handle these types of macro values very well (https://issues.cask.co/browse/CDAP-15423), so I would suggest setting it through the REST endpoint (https://docs.cdap.io/cdap/6.0.0/en/reference-manual/http-restful-api/preferences.html#H2290), where the app name is the pipeline name.
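For reference, the value you set should be the schema JSON itself, not a path to it. A CDAP/Data Fusion schema is an Avro-style record, so the preference value would look roughly like the following (the field names are purely illustrative; the record name etlSchemaBody is what the pipeline UI normally generates):
{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "name", "type": ["string", "null"] }
  ]
}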
Alternatively, you can make your pipeline a little more generic by writing an Action plugin that looks something like:
@Override
public void run(ActionContext context) throws Exception {
  String schema = readFileContents();
  context.getArguments().setArgument(key, schema);
}
The plugin would be the first stage in your pipeline, and would allow subsequent stages in your pipeline to use ${key} as a macro that would be replaced with the actual schema.
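A slightly fuller sketch of such an Action plugin, assuming the schema file path is itself passed in as a runtime argument; the plugin name and argument names are illustrative, and the setArgument call simply mirrors the snippet above (the exact SettableArguments method, and details like overriding configurePipeline, may differ between CDAP versions):
import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.annotation.Plugin;
import io.cdap.cdap.etl.api.action.Action;
import io.cdap.cdap.etl.api.action.ActionContext;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

@Plugin(type = Action.PLUGIN_TYPE)
@Name("SchemaFileReader") // illustrative plugin name
public class SchemaFileReader extends Action {
  @Override
  public void run(ActionContext context) throws Exception {
    // "schema.path" is an assumed runtime argument holding the schema file location
    String path = context.getArguments().get("schema.path");
    String schema = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    // Later stages can now reference ${schema} and have it replaced with the file contents
    context.getArguments().setArgument("schema", schema);
  }
}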

If you are using a BatchSink, you can read the argument inside
@Override
public void prepareRun(BatchSinkContext context) {
with something like:
String token =
    Objects.requireNonNull(
        context.getArguments().get("token"),
        "Argument Setter has failed in initializing the \"token\" argument.");
HTTPSinkConfig.setToken(token);
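Applied to the schema use case, the sink's prepareRun could pull the argument the Action stage set and parse it into a CDAP Schema, along roughly these lines (a sketch; "schema" is the assumed argument name from the Action example above):
@Override
public void prepareRun(BatchSinkContext context) {
  // Fail fast if the upstream Action stage did not populate the argument
  String schemaJson = java.util.Objects.requireNonNull(
      context.getArguments().get("schema"),
      "The action stage has not set the \"schema\" argument.");
  try {
    // Parse into a CDAP Schema object (io.cdap.cdap.api.data.schema.Schema)
    io.cdap.cdap.api.data.schema.Schema schema =
        io.cdap.cdap.api.data.schema.Schema.parseJson(schemaJson);
    // ... use the parsed schema to configure the sink's output ...
  } catch (java.io.IOException e) {
    throw new IllegalArgumentException("Invalid schema: " + schemaJson, e);
  }
}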

How to cache a value returned by a Future?

For my Flutter app I'm trying to access a list of notebooks. This list is parsed from a file by the API. I would like to avoid reading and parsing the file every time I call my notebooks getter, so I would like something like this:
if the file has already been parsed once, return the previously parsed List<Notebook>;
else, read and parse the file, save the List<Notebook>, and return it.
I imagine something like this:
List<Notebook> get notebooks {
  if (_notebooks != null) {
    return _notebooks;
  } else {
    return await _api.getNotebooks();
  }
}
This code is not correct because I would need to mark the getter async and return a Future, which I don't want to do. Do you know any solution for this problem, like caching values? I also want to avoid initializing this list at app startup; it should only be parsed the first time the value is needed.
You can create a singleton and store the data in a property once you have read it; from then on you can use the value from that property.
You can refer to this question to create a singleton:
How do you build a Singleton in Dart?

Default or prevent ADF Pipeline Activity Parameters

How do you specify that an activity should not be parameterised in an exported ARM template, or ensure the parameter default value is whatever is already specified?
I have an ADF pipeline which contains a WebActivity. This WebActivity URL is set by an expression which concatenates some text with some pipeline parameters:
@concat(pipeline().parameters.URL,'path/',pipeline().parameters.ANOTHER_PARAMETER,'/morePath/', pipeline().parameters.YET_ANOTHER_PARAMETER,'/lastBitOfPath')
When I export the ADF template through the UI, some parameters are added which look like PIPELINE_NAME_properties_0_typeProperties; they are of type String but are blank. These appear to correspond to the WebActivity URL fields in various activities.
If I then import that ARM template and parameter file into a new Data Factory, the WebActivity URL is blank. This means I need to override the parameter as normal, fine, but why? I don't need a new parameter to specify a value that is already set by parameters... How do I ensure that this activity is imported with the same expression? It seems mad that using a WebActivity means you have to parameterise the expression. I want ARM Template > Export ARM Template to export what I've got, not add redundant parameters that I do not need.
I have also tried editing the pipeline JSON to add a default and defaultValue attribute for the URL activity, but they are removed and have no effect.
It seems the reason for this is that the parameterization template has been modified to include:
"Microsoft.DataFactory/factories/pipelines": {
"properties": {
...
"activities": [{
"typeProperties": {
"url": "-::string"
}
}
]
}
},
This removes the default for the url property of every activity.
https://learn.microsoft.com/en-gb/azure/data-factory/continuous-integration-deployment#use-custom-parameters-with-the-resource-manager-template
This applies to all activities, so it seems the only alternative is to specify
"url": "=::string"
This will parameterise the URL (so any existing parameterisation will continue to function) but keep the original value as the default. Care must then be taken to override any other activity url properties that I do not wish to change.
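In other words, the relevant entry in the parameterization template becomes (the same fragment as above, with = in place of -):
"Microsoft.DataFactory/factories/pipelines": {
    "properties": {
        "activities": [{
            "typeProperties": {
                "url": "=::string"
            }
        }]
    }
},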

DB2:LUW:Windows(OS) Unable to Call Java UDF

I am following the steps outlined at https://community.ibm.com/community/user/hybriddatamanagement/viewdocument/generating-universally-unique-ident?CommunityKey=ea909850-39ea-4ac4-9512-8e2eb37ea09a&tab=librarydocuments to call a Java UDF from a Db2 database.
import java.util.UUID; // for UUID class

public class UUIDUDF
{
    public static String randomUUID()
    {
        return UUID.randomUUID().toString();
    }
}
I was able to generate the JAR and I called
call sqlj.install_jar('file:///C:/Users/XXX/Desktop/UDF/UUIDUDF.jar', 'UUIDUDFJAR')
and I can see the JAR deployed at
"C:\ProgramData\IBM\DB2\DB2COPY1\function\jar\DB2ADMIN"
I restarted the database manager using the db2stop and db2start commands. When I call the function, I get the error
"Java‬‎ ‪stored‬‎ ‪procedure‬‎ ‪or‬‎ ‪user‬‎-‪defined‬‎ ‪function‬‎
‪‬‎"‪DB2ADMIN.RANDOMUUID"‬‎,‪‬‎ ‪specific‬‎ ‪name‬‎
‪‬‎"‪SQL200817125101637"‬‎ ‪could‬‎ ‪not‬‎ ‪load‬‎ ‪Java‬‎ ‪class‬‎
‪‬‎"‪C:\PROGRAMDATA\IBM\DB2\DB2COPY1"‬‎,‪‬‎ ‪reason‬‎ ‪code‬‎
‪‬‎"‪"‬‎.‪‬‎.‪‬‎ ‪SQLCODE‬‎=‪‬‎-‪4304‬‎,‪‬‎ ‪SQLSTATE‬‎=‪42724‬‎,‪‬‎
‪DRIVER‬‎=‪4‬‎.‪19‬‎.‪56"...
I created the function using the code below:
CREATE OR REPLACE FUNCTION RANDOMUUID()
RETURNS VARCHAR(36)
LANGUAGE JAVA
PARAMETER STYLE JAVA
NOT DETERMINISTIC NO EXTERNAL ACTION NO SQL
EXTERNAL NAME 'UUIDUDFJAR:UUIDUDF.randomUUID' ;
But when I generate the DDL for my function in the Db2 instance, the JAR reference is missing from the EXTERNAL NAME clause (EXTERNAL NAME 'UUIDUDF.randomUUID'):
CREATE FUNCTION "DB2ADMIN"."RANDOMUUID" ()
RETURNS VARCHAR(36 OCTETS)
SPECIFIC "DB2ADMIN"."SQL200817125101637"
NO SQL
NO EXTERNAL ACTION
CALLED ON NULL INPUT
DISALLOW PARALLEL
LANGUAGE JAVA
EXTERNAL NAME 'UUIDUDF.randomUUID'
FENCED THREADSAFE
PARAMETER STYLE JAVA
NOT SECURED;
Could you please help me understand what I am missing here?
Thank you,
Pavan.

How to write streaming data to S3?

I want to write RDD[String] to Amazon S3 in Spark Streaming using Scala. These are basically JSON strings. I am not sure how to do it efficiently.
I found this post, in which the library spark-s3 is used. The idea is to create a SparkContext and then an SQLContext. After that, the author of the post does something like this:
myDstream.foreachRDD { rdd =>
  rdd.toDF().write
    .format("com.knoldus.spark.s3")
    .option("accessKey", "s3_access_key")
    .option("secretKey", "s3_secret_key")
    .option("bucket", "bucket_name")
    .option("fileType", "json")
    .save("sample.json")
}
What other options are there besides spark-s3? Is it possible to append streaming data to a file on S3?
Files on S3 cannot be appended. An "append" on S3 means replacing the existing object with a new object that contains the additional data.
You should take a look at the mode method of DataFrameWriter in the Spark documentation:
public DataFrameWriter mode(SaveMode saveMode)
Specifies the behavior when data or table already exists. Options include:
- SaveMode.Overwrite: overwrite the existing data.
- SaveMode.Append: append the data.
- SaveMode.Ignore: ignore the operation (i.e. no-op).
- SaveMode.ErrorIfExists: default option, throw an exception at runtime.
You can try something like this with the Append save mode:
rdd.toDF().write
  .format("json")
  .mode(SaveMode.Append)
  .save("s3://iiiii/ttttt.json")
Spark Append:
Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
Basically, you can choose which output format you want by passing it to the format method:
public DataFrameWriter format(java.lang.String source)
Specifies the underlying output data source. Built-in options include "parquet", "json", etc.
e.g. as Parquet:
df.write().format("parquet").save("yourfile.parquet")
or as JSON:
df.write().format("json").save("yourfile.json")
Edit: Added details about s3 credentials:
There are two different options for setting credentials, and we can see this in SparkHadoopUtil.scala:
with environment variables (System.getenv("AWS_ACCESS_KEY_ID")) or with a spark.hadoop.foo property:
SparkHadoopUtil.scala:
if (key.startsWith("spark.hadoop.")) {
  hadoopConf.set(key.substring("spark.hadoop.".length), value)
}
So you need to get the Hadoop configuration via javaSparkContext.hadoopConfiguration() or scalaSparkContext.hadoopConfiguration and set:
hadoopConfiguration.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConfiguration.set("fs.s3.awsSecretAccessKey", mySecretKey)

Getting properties from AEM multifieldpanel dialog stops working when a second entry is added

I have created an AEM Dialog which prompts the user for a set of links and labels.
These links and labels are stored in a jcr node and are used to generate a menu.
To avoid having to create a custom xtype, I am using the acs-commons multifieldpanel solution, which enables me to nest children under the fieldConfig node.
This works great with only one Label/Link pair, but when I add a second one, the property cannot be fetched anymore: instead of a String, it returns the String hashcode.
The property generated by the multifieldpanel in the jcr node is of type String and is filled correctly when inspecting in CRXDE. The problem occurs when I try to fetch the value from within a Sightly HTML file.
Code
Dialog:
Definitions.js:
"use strict";
use(function () {
var CONST = {
PROP_URLS: "definitions",
};
var json = granite.resource.properties[CONST.PROP_URLS];
log.error(json);
return {
urls: json
};
});
Log output
1 element in multifieldpanel
jcr node variable content
definitions: {"listText": "facebook", "listPath": "/content/en"}
log output
{"linkText":"facebook","linkPath":"/content/en"}
Multiple elements in multifieldpanel
jcr node variable content
definitions: {"listText": "facebook", "listPath": "/content/en"},{"listText": "google", "listPath": "/content/en"}
log output
[Ljava.lang.String;@7b086b97
Conclusion
Once the multifieldpanel stores multiple components, accessing the property returns the String hashcode instead of the value of the property.
A colleague has pointed out that I should use the MultiFieldPanelFunctions class to access the properties, but we are using HTML + Sightly + JS and are trying to avoid .jsp files at all costs. In JavaScript, this function is not available. Does anyone have any idea how to solve this issue?
That is because when there is a single item in the multifield it returns a String, whereas it returns a String[] when more than one item is configured.
Use the following syntax to read the property as a String array always:
var json = granite.resource.properties[CONST.PROP_URLS] || [];
Additionally, you can use a TypeHint to make sure your dialog always saves the value as a String[], whether one item or multiple items are configured.
Don't forget that the use() function in JS is compiled into Java bytecode, and if you are reading Java "primitives", make sure you convert them to JS types. It's part of the Rhino subtleties.
On another note, I tend not to use the granite.* objects because they are not documented anywhere; I use the Sightly global objects instead: https://docs.adobe.com/content/docs/en/aem/6-0/develop/sightly/global-objects.html
To access properties, I use properties.get("key")
Hope this helps.