Spark notebook can't find table that exists in Synapse dedicated pool - Scala

I have a really strange issue that I'm having a difficult time getting to the bottom of. I'm working on a solution that allows a user to enter parameters into a pipeline, which in turn calls a notebook and passes the parameter along; Scala is our language of choice. I've set the parameters cell and am using string interpolation to pass the parameter, which is the table name of a dataset that a researcher/analyst has created in our dedicated pool. See the following code:
val datain = "table1"
val datain:DataFrame = spark.read.sqlanalytics(s"dedicatedp1.datacha.$datain")
Which results in the following error:
com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: The specified table does not exist. Please provide a valid table.
The table does exist and this code has worked previously, so, other than a possible security issue that I'm working with the platform team to investigate, I'm curious whether the community has any other thoughts on what may be causing this issue.
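For what it's worth, one way to narrow this kind of failure down is to build the three-part name explicitly, log it, and wrap the read so the connector's error is captured next to the exact name that was used; that quickly rules out casing or whitespace problems in the value handed over by the pipeline. A minimal Scala sketch, assuming the same Synapse notebook environment in which spark.read.sqlanalytics already resolves (qualifiedName and result are just illustrative names):
import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.DataFrame

// Parameters cell: the pipeline overrides this value at run time.
val datain = "table1"

// Build and log the three-part name before the read so typos are visible.
val qualifiedName = s"dedicatedp1.datacha.$datain"
println(s"Reading from: '$qualifiedName'")

// Wrap the read so a failure reports the exact name that was resolved.
val result: Try[DataFrame] = Try(spark.read.sqlanalytics(qualifiedName))

result match {
  case Success(df) => println(s"Row count: ${df.count()}")
  case Failure(e)  => println(s"Read of '$qualifiedName' failed: ${e.getMessage}")
}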

Related

Using VSTS.Feed() in Power BI to access OData

I am trying to use the VSTS.Feed() function in Power BI to read WorkItemSnapshot data. There are multiple problems. If I build the entire URL into a single string and call VSTS.Feed() with that, I get the correct information in Power BI Desktop, but it will not refresh in Power BI online. I have been told to use the (undocumented) Query parameter, as shown below, but it is clear that this parameter is ignored. I can see that the select parameter is ignored on smaller projects, because all columns are returned. I can see that the filter parameter is ignored because the query fails on larger projects.
Does anyone have a working example of using the Query parameter with VSTS.Feed()?
let
BaseURL = "https://server.analytics.visualstudio.com/DefaultCollection/project/_odata/WorkItemSnapshot",
Select = "DateSK,WorkItemId,State,WorkItemType",
Filter = "WorkItemType eq Bug and State ne Closed and State ne Removed and DateSK ge 20180517 and DateSK le 20180615",
Source = VSTS.Feed(BaseURL, [Query=[select=#"Select",filter=#"Filter"]])
in
Source
Update:
With the query above, the message I get is shown below. As I said earlier, it is clearly not using the Filter parameter, and I'm assuming it is not using the Select parameter, either. I can't query everything because there is too much data, and I can't use a filter because I can't figure out a way to get the Options parameter to work. With VSTS.AccountContents, the options parameter works well, but those API endpoints don't use $ in parameter names.
Error: Query result contains 36,788,023 rows and it exceeds maximum allowed size of 300,000. Please reduce the number of records by applying additional filters
Details:
DataSourceKind=Visual Studio Team Services
ActivityId=881f7988-9863-4e03-8375-0489028f28f3
Url=https://server.analytics.visualstudio.com/DefaultCollection/Project/_odata/WorkItemSnapshot
error=Record
The query that started this whole line of questioning is simply one with a variable for a start date.
let
startDate = DateTimeZone.ToText(Date.AddDays(DateTimeZone.UtcNow(), -45), "yyyyMMdd"),
URL = "https://server.analytics.visualstudio.com/DefaultCollection/project/_odata/WorkItemSnapshot?$select=DateSK,WorkItemId,State,WorkItemType&$filter=WorkItemType eq 'Bug' and State ne 'Closed' and State ne 'Removed' and DateSK gt " & startDate,
Source = VSTS.Feed(URL)
in
Source
While this query mostly works in Power BI desktop (the select clause is ignored), the message I get when the data source is refreshed online is:
You can't schedule refresh for this dataset because one or more sources currently don't support refresh.
Discover Data Sources
Query contains unknown or unsupported data sources.
The documentation for VSTS.Feed() contradicts itself, saying both
The VSTS.Feed function has the same arguments, options and return value format as OData.Feed.
and
'VSTS.Feed' provides a subset of the Arguments and Options available through 'OData.Feed'.
To summarize, I know that I can't combine data sources in Power BI. Does VSTS.Feed() support the options parameter? If so, how do I pass a Filter and Select clause to it?
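One way to check whether the $select/$filter combination itself is valid, independently of Power BI, is to call the Analytics OData endpoint directly with a personal access token. Below is a rough Scala sketch of such a check, reusing the URL pieces from the question; the VSTS_PAT environment variable is a placeholder, and it assumes basic authentication with a PAT is enabled for the account:
import java.net.{HttpURLConnection, URL, URLEncoder}
import java.util.Base64
import scala.io.Source

object ODataCheck {
  def main(args: Array[String]): Unit = {
    val pat = sys.env("VSTS_PAT") // placeholder: personal access token

    // Same entity set and options as in the question; percent-encode the values
    // so spaces and quotes survive the trip.
    val base   = "https://server.analytics.visualstudio.com/DefaultCollection/project/_odata/WorkItemSnapshot"
    val select = "DateSK,WorkItemId,State,WorkItemType"
    val filter = "WorkItemType eq 'Bug' and State ne 'Closed' and State ne 'Removed' and DateSK ge 20180517 and DateSK le 20180615"
    def enc(s: String) = URLEncoder.encode(s, "UTF-8").replace("+", "%20")
    val url = s"$base?$$select=${enc(select)}&$$filter=${enc(filter)}"

    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    val token = Base64.getEncoder.encodeToString(s":$pat".getBytes("UTF-8"))
    conn.setRequestProperty("Authorization", s"Basic $token")
    conn.setRequestProperty("Accept", "application/json")

    val code = conn.getResponseCode
    val body = Source.fromInputStream(
      if (code < 400) conn.getInputStream else conn.getErrorStream).mkString
    println(s"HTTP $code")
    println(body.take(500)) // a peek at the payload or the OData error message
  }
}
If this request comes back with the filtered rows, the OData options themselves are fine and the problem is confined to how VSTS.Feed() forwards them.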
To get WorkItemSnapshot via VSTS.Feed(), please refer to the query below:
let
Source = OData.Feed("https://account.analytics.visualstudio.com/project/_odata/v1.0-preview", null, [Implementation="2.0"]),
WorkItemSnapshot_table = Source{[Name="WorkItemSnapshot",Signature="table"]}[Data]
in
WorkItemSnapshot_table
Note: the URL format should be https://account.analytics.visualstudio.com/project/_odata/v1.0-preview, or https://account.analytics.visualstudio.com/_odata/v1.0-preview.
You can also refer to the documents below:
Connect to VSTS using the Power BI OData feed
Connect using Power Query and Visual Studio Team Services (VSTS) functions

Informatica SQ returns different result

I am trying to pull data from DB2 via Informatica. I have a SQ query that pulls a few fields based on joins across 4 different tables.
When I run the query directly in the database, it returns the expected result; however, when I run it in Informatica with the debugger, I see something else.
Please note that the data in all the columns matches perfectly, except for one single column.
The weird thing is, this is a calculated field from the table based on a CASE statement:
CASE WHEN Column1='3' THEN 'N' ELSE 'Y' END.
Since this is a calculated field returning a single-character string, I have connected it from the source to the SQ using a port with a length of 1 character.
This returns 'Y' when executed in the database, but when I copy and paste the same query into the SQ in Informatica and run it, I get the value 'E', which should never be possible as I expect only an N or a Y. I have verified the column order and that it is in the right place. This is very strange; is something going wrong because of the CASE statement?
Save yourself the hassle: put an Expression transformation after the Source Qualifier, calculate the port value there, and then forget about it.
I think I got the issue. We use Informatica PowerExchange to connect to an AS400 system (DB2), and it seems that when we set a flag in AS400 and pass it to Informatica via PowerExchange, it gets converted to binary; to solve this, there needs to be an entry in the PowerExchange configuration file.
Unfortunately, I myself was not aware that it could be related to PowerExchange rather than PowerCenter itself.
Thanks for your assistance! Below is the KB article about it.
https://kb.informatica.com/solution/4/Pages/17498.aspx
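For anyone curious what "converted to binary" can look like in practice, here is a tiny Scala illustration of how a single flag byte reads differently under an EBCDIC code page (as used on AS400) versus a Latin-1 one. This only illustrates the code-page-mismatch idea and is not a reproduction of the exact 'E' seen above:
import java.nio.charset.Charset

object FlagDecode extends App {
  val ebcdic = Charset.forName("IBM037")     // EBCDIC code page commonly used on AS400 / DB2 for i
  val latin1 = Charset.forName("ISO-8859-1")

  // 'Y' encoded on the EBCDIC side is the single byte 0xE8.
  val flagBytes = "Y".getBytes(ebcdic)
  println(flagBytes.map(b => f"0x${b & 0xFF}%02X").mkString(" ")) // 0xE8

  // Decoded with the right code page it is 'Y' again ...
  println(new String(flagBytes, ebcdic)) // Y
  // ... decoded as Latin-1 the very same byte is a completely different character.
  println(new String(flagBytes, latin1)) // è
}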

PostgreSQL function failed with "relation with OID xxxxx does not exist"

I am trying to extend an item profile table by parsing its part_number column further down into properties. It works fine outside a function.
ALTER TABLE tbl_item_info
ADD prop1 varchar(2),
ADD prop2 varchar(1),
ADD prop3 numeric(4,3);
UPDATE tbl_item_info
SET prop1 = substr(part_num,5,2)
, prop2 = substr(part_num,7,1)
, prop3 = to_number( substr(part_num,8,5) , '9G999')
WHERE ARRAY[left(part_num,3)] <@ ARRAY['NTX','EXC'];
But when I try to put the statements into a function, it always fails with the error "relation with OID xxxxx does not exist", pointing to the UPDATE statement.
I have no clue what it is trying to say. Any idea why?
I wish I had a definitive answer, but this seems to be related to a known bug in PostgreSQL as described here:
https://github.com/greenplum-db/gpdb/issues/1094
Bear in mind that the Greenplum implementation of PostgreSQL is proprietary to Dell EMC; however, the core code issue is likely the same for all major PostgreSQL distributions. I am still researching this to determine if there is a good resolution to the problem. The database in which I experienced a markedly similar error is not the Greenplum implementation of PostgreSQL. The error was thrown when I called the pg_relation_filepath() function in a query on an OID that was dynamically obtained from a record in the pg_class table and that should have had an associated external file in a subdirectory of the ./base/ path. The error that was thrown was:
ERROR: relation "pg_toast_34474_index" does not exist
The point here is that for a toast entity to exist, it is supposed to be tied to another relation and acts as a reference to additional files created out on the storage media to accommodate additional data that does not fit into the owning relation's top level file - in this case, most likely a table. But when I search for the owning relation's oid (34474), the owner doesn't exist. Since the owner doesn't exist I think the logic assumes that the toast entity doesn't either, even though it has a record in the pg_class table.
This is as close as I can get to a root cause for now. Although the above link suggests that code to improve the issue was released in version 8.3, my database has been upgraded from version 8.1 to version 9.4.7, so it appears that even though the code may have improved between those two versions to prevent new occurrences of the problem, if the problem was created before the database was upgraded, the newer code does not know how to reassemble the tinker toys left behind from issues created by this apparent bug before the fix was implemented.
At present I am investigating if a PLPGSQL function can wrap and trap the error for all relations so I can identify which ones have the problem (as well as to solve my original problem of determining which relation is hosted in a specific file that the server.postmaster log tells me it is unable to read from - hopefully it is just an index I can drop/create).
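In case it is useful to anyone hitting the same thing: instead of a PL/pgSQL function, the same "wrap and trap" scan can be done from a client. Below is a rough Scala sketch over JDBC (connection string and credentials are placeholders; it assumes the standard PostgreSQL JDBC driver on the classpath) which simply records every relation for which pg_relation_filepath() throws:
import java.sql.DriverManager
import scala.util.{Try, Failure}

object BrokenRelationScan {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details; adjust to the environment.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/mydb", "postgres", "secret")

    // Collect every pg_class entry first, then probe them one by one.
    val rels = scala.collection.mutable.ListBuffer.empty[(Long, String)]
    val st0 = conn.createStatement()
    val rs = st0.executeQuery("SELECT oid, relname FROM pg_class ORDER BY oid")
    while (rs.next()) rels += ((rs.getLong(1), rs.getString(2)))
    st0.close()

    // Autocommit (the JDBC default) keeps each probe in its own transaction,
    // so one failing relation does not abort the rest of the scan.
    for ((oid, name) <- rels) {
      val probe = Try {
        val st = conn.createStatement()
        try st.executeQuery(s"SELECT pg_relation_filepath($oid::oid)").close()
        finally st.close()
      }
      probe match {
        case Failure(e) => println(s"BROKEN: oid=$oid relname=$name -> ${e.getMessage}")
        case _          => () // healthy relation, nothing to report
      }
    }
    conn.close()
  }
}
Anything it prints can then be examined individually (and, if it really is just an index, dropped and recreated).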
I found this issue on server version 13.7. It was not present on server version 14.3.
It happened when I changed the signature (parameters) of the stored procedure:
SQL Error [42883]: ERROR: function with OID 894070 does not exist
I removed the old procedure and created a new one.
But when I called a function which used that procedure, it triggered the error.
To fix it, I recreated the function which used the changed object.
So, as a general rule:
look at where the error happens, recreate the object that triggers the error, and recompile the code which uses it.
Hope it helps.

Oracle DB link - WHERE clause evaluation

I have a DB2 data source and an Oracle 12c target.
The Oracle database has a DB link to the DB2 defined, which is working in general.
Now I have a huge table in the DB2 which has a timestamp column (let's call it ROW_CHANGED) for row changes. I want to retrieve rows which have changed after a particular time.
Running
SELECT * FROM lib.tbl WHERE ROW_CHANGED >'2016-08-01 10:00:00'
on the DB2 returns exactly 1 row after about 90 seconds, which is fine.
Now I try the same query from the Oracle side via the DB link:
SELECT * FROM lib.tbl@dblink_name WHERE ROW_CHANGED > TO_TIMESTAMP('2016-08-01 10:00:00')
This runs for hours and ends up in a timeout.
I read some Oracle docs and found distributed query optimization tips, but most of them refer to joining a local table to a remote table, which is not my case.
In my desperation, I have tried the DRIVING_SITE hint, without effect.
Now I wonder when the WHERE part of the query will be evaluated. Since I have to use Oracle syntax and not DB2 syntax for the query, is it possible that Oracle will first copy the full table and apply the WHERE clause afterwards? I did some research but did not find anything which would help me in this direction.
ROW_CHANGED is a hidden column in the DB2 table, if that matters.
Thanks for any hint in advance.
Update
Thanks @all for the help. I'll share what did the trick for me.
First of all, I used TO_TIMESTAMP since the DB2 column is also a timestamp (not a date) and I expected to circumvent implicit conversions this way.
Without the explicit conversion I ran into ORA-28534: Heterogeneous Services preprocessing error, and I have no hope of touching the DB config within a reasonable time.
The explain plan, by the way, did not reveal much. It showed a FULL hint and no conversion on the predicates. Oddly, it showed the ROW_CHANGED column as DATE; I wonder why.
I tried Justin's suggestion to use a bind variable; however, I got ORA-28534 again. The next thing I did was to wrap it into a PL/SQL block (it will run in a stored procedure later anyway).
declare
  v_tmstmp TIMESTAMP := '01.08.16 10:00:00';
begin
  INSERT INTO ORAUSER.TMP_TBL (SRC_PK, ROW_CHANGED)
  SELECT SRC_PK, ROW_CHANGED
  FROM lib.tbl@dblink_name
  WHERE ROW_CHANGED > v_tmstmp;
end;
This executed in the same time as in DB2 itself. The date format is DD.MM.YY here since that is the default, unfortunately.
When changing the variable assignment to
v_tmstmp TIMESTAMP := TO_TIMESTAMP('01.08.16 10:00:00','DD.MM.YY HH24:MI:SS');
I got the same problem as before.
Meanwhile, the DB2 operators created an index on the ROW_CHANGED column, which I had requested earlier that day. This seems to have solved the problem in general. Even my original query finishes in no time now.
If you are actually using an Oracle-specific conversion function like to_timestamp, that forces the predicate to be evaluated on the Oracle side. Oracle isn't going to know how to convert a built-in function like to_timestamp into an exactly equivalent function call in DB2.
If you used a bind variable, that would be more likely to get evaluated on the DB2 side. But that may be complicated by the data type mapping between the different databases; there may not be a perfect mapping between one engine's date and another engine's timestamp data type. If this were a numeric column, a bind variable would be almost certain to get pushed. In this case, it probably involves playing around a bit to figure out exactly what data type to use for your variable that works for your framework, Oracle, and DB2.
If using a bind variable doesn't work, you can force the predicate to be evaluated on the remote server using the dbms_hs_passthrough package. That lets you send a query verbatim to the remote server which allows you to do things like use functions defined in your DB2 database. That's a bit of overkill in this situation, hopefully, but it's nice to have the hammer as your backup if the simpler solution doesn't work quickly enough.
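If it helps to see the bind-variable idea outside PL/SQL, this is roughly what it looks like from a client over JDBC; the connection details below are placeholders and the table/db-link names are the ones from the question. The point is only that the timestamp is bound as a parameter rather than built with TO_TIMESTAMP inside the SQL text, which gives Oracle a chance to ship the comparison to the DB2 side:
import java.sql.{DriverManager, Timestamp}

object DbLinkBindVariable {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details for the Oracle side.
    val conn = DriverManager.getConnection(
      "jdbc:oracle:thin:@//orahost:1521/orcl", "orauser", "secret")

    // Bind variable (?) instead of TO_TIMESTAMP('...') in the statement text.
    val sql =
      """SELECT SRC_PK, ROW_CHANGED
        |FROM lib.tbl@dblink_name
        |WHERE ROW_CHANGED > ?""".stripMargin

    val ps = conn.prepareStatement(sql)
    ps.setTimestamp(1, Timestamp.valueOf("2016-08-01 10:00:00"))

    val rs = ps.executeQuery()
    while (rs.next())
      println(s"${rs.getObject("SRC_PK")} changed at ${rs.getTimestamp("ROW_CHANGED")}")

    rs.close(); ps.close(); conn.close()
  }
}
Whether the predicate actually gets pushed to the remote side still depends on the gateway's type mapping, as described above.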

Issue with a numeric field in SSIS dtsx package

I have an SSIS dtsx package which is used to load data from a remote MAS DB server using a DSN-based connection. We load data from many tables into their replica tables in SQL Server. Everything was working fine until we made some changes to a table in MAS. The dtsx has been failing with the following error:
Error: 0xC02090F8 at Data Flow Task, Import Data, DataReader Source
[28866]: The value was too large to fit in the output column
"UDF_TREAD_DEPTH" (29160).
Actually I believe it might be related to a single table field "UDF_TREAD_DEPTH" which is a decimal field. This field is shown in the DataReader source as "numeric [DT_NUMERIC]" with Length:0, Precision:4 & Scale:2.
In the past we had simple data in the format xx.xx. Now, after the issue, I see that we have data like xx.xx, xxx, etc.; however, the data type still didn't change after I refreshed the DataReader source.
I believe the precision should be updated to 5 for the data we have, based on this description.
I'm unable to change the data type, as shown in the attached screenshot (Data Source Output column.png). When I debug this dtsx package, it errors while loading the DataReader Source. If I'm nailing it right, how can I fix it? If there are any other possibilities, kindly let me know.
Have you tried editing the source with the Advanced Editor? (Right-click and select "Show Advanced Editor...".) You can navigate to the Input and Output Properties section (generally the last tab), go into the output columns section (for OLE DB, click the + next to OLE DB Source Output, then the + next to Output Columns, then highlight the column name you want to change) and change the properties of the column in question (look for Data Type Properties and change Precision and Scale as needed). If you are not able to do that, you can try deleting the source and replacing it with a new source to the same data (i.e. recreating the object will requery the connection for column properties).
I got the data updated with the xxx.xx mask, so 100 became 100.00, and this helped the DataReader in SSIS infer the type correctly.
In addition, I also found another easy way of doing so which didn't require any cast/convert function:
UDF_TREAD_DEPTH * 1.00 as UDF_TREAD_DEPTH
This also allowed the DataReader to infer the type (i.e. precision & scale) correctly.
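For reference, the precision and scale a value such as 100.00 actually needs are easy to confirm outside SSIS, which makes it clear why the original DT_NUMERIC with precision 4 and scale 2 stopped fitting. A small Scala check on the JVM:
import java.math.BigDecimal

object PrecisionCheck extends App {
  // xx.xx style values fit NUMERIC(4,2); xxx.xx values need precision 5.
  val before = new BigDecimal("99.99")
  val after  = new BigDecimal("100.00")

  println(s"${before.toPlainString}: precision=${before.precision}, scale=${before.scale}") // precision=4, scale=2
  println(s"${after.toPlainString}: precision=${after.precision}, scale=${after.scale}")    // precision=5, scale=2
}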