How can I differentiate the data coming from multiple HTTP Client origins in StreamSets?

I have six pipelines, each with an HTTP Client origin connected to an SDC RPC destination. My plan is to build another pipeline with an SDC RPC origin whose destination is Hive tables.
My question is: after the records arrive at the SDC RPC origin, how can I differentiate the data, given that each HTTP pipeline pulls the data for one particular table?
Any examples or online resources will be appreciated.

Add an Expression Evaluator to each of the six HTTP -> SDC RPC pipelines, with each one setting a different value in a record header attribute. Your SDC RPC origin pipeline can then look at that header attribute to figure out which pipeline a record came from.
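For example, here is a sketch (the attribute name sourceTable and the value orders are made up for illustration). In the pipeline whose HTTP Client origin pulls the orders table, configure the Expression Evaluator with:

Header Attribute: sourceTable
Header Attribute Expression: orders

In the SDC RPC origin pipeline, a Stream Selector can then route records to the right Hive destination with one condition per table:

${record:attribute('sourceTable') == 'orders'}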

Related

Timeout issue for http connector and web activity on adf

We have tried loading data through a Copy activity using a REST API with JSON data, but columns that have no data in their first row are getting skipped. We have also tried the REST API with CSV data, but it throws an error. We tried a Web activity, but its payload limit is 4 MB, so it fails with a timeout issue. We also tried an HTTP endpoint, but its payload limit is 0.5 MB, so it fails with a timeout issue as well.
In the Mapping settings, toggle on the advanced editor and set the collection reference so that the values of the nested JSON data are cross-applied. Below is the approach.
The REST connector is used in the source dataset, pointing at the source JSON API.
Then a sink dataset is created for the Azure SQL database. When the pipeline is run at this point, a few columns are not copied to the database.
Therefore, in the Mapping settings of the copy activity:
1. Schema is imported
2. Advanced editor is turned on
3. Collection reference is given.
When the pipeline is run after the above changes, all columns are copied to the SQL database.
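For reference, a sketch of what that mapping looks like in the copy activity's JSON definition (TabularTranslator and collectionReference are ADF's property names; the JSON paths and column names here are purely illustrative):

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "path": "$['id']" }, "sink": { "name": "id" } },
        { "source": { "path": "$['value']" }, "sink": { "name": "value" } }
    ],
    "collectionReference": "$['items']"
}

collectionReference tells the copy activity which array to cross-apply, so each element of that array becomes one row and its nested fields map to the sink columns.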

Azure Data Factory - REST Pagination rules

I'm trying to pull data from HubSpot into my SQL Server database through an Azure Data Factory pipeline using a REST dataset. I'm having trouble setting up the right pagination rules. I've already spent a day on Google and MS guides, but I find it hard to get it working properly.
This is the source API. I am able to connect and pull the first set of 20 rows. The response body returns an offset, which can be passed back in the next request via vidoffset=.
I need to feed the vid-offset value from the response body into the next HTTP request, and the process needs to stop when has-more returns 'false'.
I tried to reproduce the same in my environment and I got the below results:
First, I created a linked service with this URL: https://api.hubapi.com/contacts/v1/lists/all/contacts/all?hapikey=demo&vidOffset
Then I created the pagination rules: an end-condition rule on $.has-more, plus the absolute URL for the request.
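As a sketch, the pagination rules on the REST copy source might look like this in JSON. The QueryParameters and EndCondition rule names come from ADF's REST pagination support; treat the exact HubSpot field names and the Const:false end-condition value as assumptions to verify against your response body:

"paginationRules": {
    "QueryParameters.vidOffset": "$.vid-offset",
    "EndCondition:$.has-more": "Const:false"
}

This copies vid-offset from each response body into the vidOffset query parameter of the next request, and stops paginating once has-more comes back false.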
For demo purposes, I used a storage account as the sink.
The pipeline run was successful.
For more information, refer to the Microsoft documentation.

Is there a way to prevent Spring Cloud Gateway from reordering query parameters?

Spring Cloud Gateway appears to be reordering my query parameters to put duplicate parameters together.
I'm trying to route some requests to one of our endpoints on to a third-party system. These requests include some query parameters that need to be in a specific order (including some duplicate parameters), or the third-party system returns a 500 error. Even when the initial request arrives with the parameters in the proper order, Spring Cloud Gateway reorders them, moving each duplicate up next to the first instance of that parameter.
Example:
http://some-url.com/a/path/here?foo=bar&anotherParam=paramValue2&aThirdParam=paramValue3&foo=bar
Becomes:
http://some-url.com/a/path/here?foo=bar&foo=bar&anotherParam=paramValue2&aThirdParam=paramValue3
Here the last parameter was moved next to the first one because they have the same name.
The actual request output I need is for the query parameters to be passed through without change.
The issue lies in the UriComponentsBuilder, which is used in RouteToRequestUrlFilter.
UriComponentsBuilder.fromUri(uri) builds up a map of query params. Because this is a LinkedMultiValueMap, duplicate parameters are grouped under a single key, which is the reordering you see.
Note that RFC 3986 contains the following:
The query component contains non-hierarchical data that, along with data in the path component (Section 3.3), serves to identify a resource within the scope of the URI’s scheme and naming authority (if any).
Therefore I don't think there needs to be a fix in Spring Cloud Gateway itself.
In order to fix this in your gateway, you'll need to add a custom filter which kicks in after the RouteToRequestUrlFilter, by setting its order to RouteToRequestUrlFilter.ROUTE_TO_URL_FILTER_ORDER + 1.
Take a look at the RouteToRequestUrlFilter to see how the exchange is adapted to target the downstream URI.
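A minimal sketch of such a filter (the class name is made up; it assumes the raw query string of the incoming request is exactly what the downstream system should receive). It deliberately splices the query back in as a plain string rather than going through UriComponentsBuilder, since that builder is what reorders the parameters:

import java.net.URI;

import org.springframework.cloud.gateway.filter.GatewayFilterChain;
import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.cloud.gateway.filter.RouteToRequestUrlFilter;
import org.springframework.cloud.gateway.support.ServerWebExchangeUtils;
import org.springframework.core.Ordered;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Component
public class PreserveQueryOrderFilter implements GlobalFilter, Ordered {

    @Override
    public int getOrder() {
        // Run right after RouteToRequestUrlFilter has produced the downstream URI.
        return RouteToRequestUrlFilter.ROUTE_TO_URL_FILTER_ORDER + 1;
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        URI routedUri = exchange.getAttribute(ServerWebExchangeUtils.GATEWAY_REQUEST_URL_ATTR);
        String rawQuery = exchange.getRequest().getURI().getRawQuery();
        if (routedUri != null && rawQuery != null && !rawQuery.isEmpty()) {
            // Drop the (reordered) query produced by the routing filter and
            // splice the original, already-encoded query string back in verbatim.
            String routed = routedUri.toString();
            int queryStart = routed.indexOf('?');
            String base = queryStart >= 0 ? routed.substring(0, queryStart) : routed;
            exchange.getAttributes().put(
                    ServerWebExchangeUtils.GATEWAY_REQUEST_URL_ATTR,
                    URI.create(base + "?" + rawQuery));
        }
        return chain.filter(exchange);
    }
}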
Hope that helps! :)

REST API containing POST and PUT/PATCH calling a compute server generating results files

The server application I'm implementing generates calculation results and stores them in result files in directories on the server, for example customer/project/scenario/resultfiles. I want to design and implement a resilient REST implementation to retrieve the result files for display in the client browser; to delete result files, customers, etc.; and to create result files within a scenario for calculation parameters sent to the server. Possibly it should also support sensitivity analysis, generating result files within a scenario by varying calculation parameters.
I can use GET to retrieve these files using a URL with a query string such as appname/?customerId=xxx&projectId=xxx, and DELETE on the directory structure and files, also using query strings. What I'm unclear about is the best REST approach to call functions implementing the various calculations on the server.
Perhaps this should be a POST for the initial calculation in a scenario as this is creating the results files? Maybe a PUT or a PATCH for the sensitivity analysis or other partial recalculations as this is modifying results in an existing scenario?
There's a fair bit of online discussion about PUT vs PATCH vs POST for database-related activities. I could work up a REST approach based on what I've read about REST database interactions, but if there's already standard practice for doing calculations through a REST API, I'd rather use that.
Perhaps this should be a POST for the initial calculation in a scenario as this is creating the results files? Maybe a PUT or a PATCH for the sensitivity analysis or other partial recalculations as this is modifying results in an existing scenario?
You can always just use POST. If we were using HTML representations of resources to guide the client through the protocol, we'd be doing that by following links and submitting forms. In HTML, submitting forms is limited to GET and POST.
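For instance, a sketch of the POST-only version (the paths, IDs, and payload fields are illustrative, not a prescribed layout):

POST /customers/42/projects/7/scenarios/3/results HTTP/1.1
Content-Type: application/json

{ "parameters": { "windSpeed": 12.5, "iterations": 1000 } }

HTTP/1.1 201 Created
Location: /customers/42/projects/7/scenarios/3/results/9

The calculation runs on the server, and the new result file is just another resource the client can later GET or DELETE.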
PUT and PATCH have more tightly constrained semantics than POST. Specifically, they are methods that request that the server make its representation match the client's representation (for PUT, we send the entire replacement representation; for PATCH, we send just the changes made by the client).
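Sketched as messages (again illustrative), PUT replaces the whole representation:

PUT /scenarios/3/parameters HTTP/1.1
Content-Type: application/json

{ "windSpeed": 14.0, "iterations": 1000 }

while PATCH sends only the delta, here as an RFC 6902 JSON Patch:

PATCH /scenarios/3/parameters HTTP/1.1
Content-Type: application/json-patch+json

[ { "op": "replace", "path": "/windSpeed", "value": 14.0 } ]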
Technically, there's nothing wrong with the server not accepting the offered edits as is:
A successful PUT of a given representation would suggest that a subsequent GET on that same target resource will result in an equivalent representation being sent in a 200 (OK) response. However, there is no guarantee that such a state change will be observable, since the target resource might be acted upon by other user agents in parallel, or might be subject to dynamic processing by the origin server, before any subsequent GET is received. A successful response only implies that the user agent's intent was achieved at the time of its processing by the origin server.
So the server could accept the client's edits, and then immediately apply additional edits of its own.

Dynamic routing to IO sink in Apache Beam

Looking at the example for ClickHouseIO for Apache Beam, the name of the output table is hard-coded:
pipeline
    .apply(...)
    .apply(
        ClickHouseIO.<POJO>write("jdbc:clickhouse:localhost:8123/default", "my_table"));
Is there a way to dynamically route a record to a table based on its content?
I.e. if the record contains table=1, it is routed to my_table_1, table=2 to my_table_2 etc.
Unfortunately, ClickHouseIO is still in development and does not support this. BigQueryIO does support dynamic destinations, so it is possible with Beam in general.
The limitation in the current ClickHouseIO is around transforming data to match the destination table's schema. As a workaround, if your destination tables are known at pipeline creation time, you could create one ClickHouseIO per table and then use the data to route each record to the correct instance of the IO, as sketched below.
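A minimal sketch of that workaround, assuming two known tables and a hypothetical getTable() accessor on the POJO that returns the table number:

import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Split the stream by table number, then give each split its own ClickHouseIO.
PCollection<POJO> records = ...; // output of the earlier transforms

int numTables = 2;
PCollectionList<POJO> byTable = records.apply(
    Partition.of(numTables, (POJO row, int n) -> row.getTable() - 1));

for (int i = 0; i < numTables; i++) {
    byTable.get(i).apply(
        ClickHouseIO.<POJO>write(
            "jdbc:clickhouse:localhost:8123/default", "my_table_" + (i + 1)));
}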
You might want to file a feature request in the Beam bug tracker for this.