BigQuery - create table via UI from cloud storage results in integer error - unicode

I am trying to test out BigQuery but am getting stuck on creating a table from data stored in Google Cloud Storage. I have reduced the data down to just one value, but the error still does not make sense.
I have a text file I uploaded to Google Cloud Storage with just one integer value in it: 177790884
I am trying to create a table via the BigQuery web UI, and go through the wizard. When I get to the schema definition section, I enter...
ID:INTEGER
The load always fails with...
Errors:
File: 0 / Line:1 / Field:1: Invalid argument: 177790884 (error code: invalid)
Too many errors encountered. Limit is: 0. (error code: invalid)
Job ID trusty-hangar-120519:job_LREZ5lA8QNdGoG2usU4Q1jeMvvU
Start Time Jan 30, 2016, 12:43:31 AM
End Time Jan 30, 2016, 12:43:34 AM
Destination Table trusty-hangar-120519:.onevalue
Source Format CSV
Allow Jagged Rows true
Ignore Unknown Values true
Source URI gs:///onevalue.txt
Schema
ID: INTEGER
If I load with a schema of ID:STRING it works fine. The number 177790884 is not larger than a 64-bit signed int, so I am really unsure what is going on.
Thanks,
Craig

Your input file likely contains a UTF-8 byte order mark (3 "invisible" bytes at the beginning of the file that indicate the encoding) that can cause BigQuery's CSV parser to fail.
https://en.wikipedia.org/wiki/Byte_order_mark
I'd suggest Googling for a platform-specific method to view and remove the byte order mark. (A hex editor would do.)

The issue is definitely with the file's encoding. I was able to reproduce the error.
I then "fixed" it by saving the "problematic" file as ANSI (just for a test), and now it loads successfully.

What could be causing this 'invalid host' error on kdb query?

I get an odd error when trying to query too many dates from a date-partitioned historical database:
q)eod: h"select from eod where date within 2018.01.01 2018.04.22"
'/tablepath/2018.04.04/eod/somecolumn: invalid host
q)eod: h"select from eod where date within 2018.01.17 2018.04.20"
'/tablepath/2018.04.20/eod/othercolumn: invalid host
q)eod: h"select from eod where date within 2018.01.18 2018.04.20"
q)
Note that both dates mentioned in the error messages are within the date range that we manage to extract in the end, and that it fails on a different column each time. This seems to indicate it's something to do with the size of the table being pulled, but when we check the size of the largest table we managed to get:
q)(-22!eod) % 1024 * 1024
646.9043
q)count eod
2872546
we find that it's not particularly large by either memory size or number of rows.
Googling for "invalid host" errors doesn't seem to turn up anything relevant, and I'm not seeing anything in the kdb docs about size limits that would be relevant. Anyone got any ideas?
Edit:
When loading the table in a session and making the queries directly, we get what appears to be the same error, but with a different message. For instance:
q)jj: select from eod where date within 2018.01.01 2018.04.22
Too many compressed files open
k){0!(?).#[x;0;p1[;y;z]]}
'./2018.04.04/eod/settlecab: No such file or directory
.
?
(+`exch`date`class..
q.Q))
Note that the file ./2018.04.04/eod/settlecab does in fact exist and contains data.
I have no problem loading the data for just the date mentioned in the error, and the column mentioned has meaningful values:
q)jj: select from eod where date=2018.04.04
q)select count i by settlecab from jj
settlecab| x
---------| -----
0 | 41573
1 | 2269
The key point seems to be the Too many compressed files open message, but what can I do about this?
Edit for Summary/Solutions:
The table in question had many columns, all stored in a compressed format. When issuing a query against too many dates at once, kdb would try to mmap all of those columns at once, running into a limit on how many compressed files could be open at once.
Once I understood the problem, several solutions were available:
I could pull only certain columns from the database, reducing the number of files that kdb needed to keep open,
I could force kdb to pull all the data into memory by adding a dummy where clause to the query, such as (null column) | not null column (hacky, but it works),
I could upgrade the kdb version and lift the OS limits (not practical in my case).
I still have no idea why this resulted in an invalid host error when querying the database remotely.
First off, can we just clarify the database structure you're working with? It seems from the file paths returned in your errors that you've got a date-partitioned database. Did you mean a non-segmented database when you said non-partitioned in your original query?
In terms of a fix for your issue, have you tried loading your database into a session, and making those queries directly? If so do you get the same issues?
If that seems to be working alright, the problem might lie with how you're defining your database handle. How is h defined in your original example?
It might also be worth trying to select individual dates from your database, to try and isolate the problem, and to determine if it lies with your on-disk data. Try specifically querying the dates that are mentioned in your errors.
You could also try performing your original queries with a subset of columns, again to try and pinpoint where your issue is coming from.
Let us know if you get any further with this.
Joseph

boolean field in redshift copy

I am producing a comma-separated file in S3 that needs to be copied to a staging table in a redshift database using the postgres COPY command.
It has one boolean field. With every sensible way I can think of to represent the boolean value in the file, redshift copy complains, usually with "Unknown boolean format".
I'm going to give up and change the staging table field to a smallint so that I can proceed with the copy and translate the value on the load from staging to the final redshift table, but I'm curious if anyone knows the correct incantation.
A zero or one works just fine for us.
Check your loads carefully; it may well be another issue that's 'pushing' invalid data into your boolean column.
For instance, we had all kinds of crazy characters embedded in our data that would cause errors like that. I eventually settled on using the US (unit separator) character for the record separator.
Check to make sure you're excluding the headers during the COPY command.
I ran into the same problem, but adding the ignoreheader 1 option (ignores 1 header line during import) solved the issue.
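For illustration, here is a minimal sketch combining the 0/1 representation with ignoreheader 1; the table name, S3 path, and IAM role below are hypothetical placeholders, not from the question:
-- Hypothetical staging table with a boolean column
CREATE TABLE staging_events (
    event_id  INTEGER,
    is_active BOOLEAN
);
-- The CSV's first line is a header and booleans are written as 0/1, e.g.:
--   event_id,is_active
--   1,1
--   2,0
COPY staging_events
FROM 's3://my-bucket/staging/events.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV
IGNOREHEADER 1;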

String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence:

Team,
I am using Redshift version 8.0.2. While loading data using the COPY command, I get an error: "String contains invalid or unsupported UTF8 codepoints, Bad UTF8 hex sequence: bf (error 3)".
It seems COPY is trying to load the UTF-8 byte "bf" into a VARCHAR field. According to Amazon Redshift, error code 3 is defined as below:
error code 3:
The UTF-8 single-byte character is out of range. The starting byte must not be 254, 255
or any character between 128 and 191 (inclusive).
Amazon recommends this as the solution: replace the character with a valid UTF-8 code sequence or remove the character.
Could you please help me with how to replace the character with a valid UTF-8 code?
When I checked the database properties in pgAdmin, it shows the encoding as UTF-8.
Please guide me on how to replace the character in the input delimited file.
Thanks...
I've run into this issue in RedShift while loading TPC-DS datasets for experiments.
Here is the documentation and forum chatter I found via AWS: https://forums.aws.amazon.com/ann.jspa?annID=2090
And here are the explicit commands you can use to solve data conversion errors: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-acceptinvchars
You can explicitly replace the invalid UTF-8 characters or disregard them altogether during the COPY phase by stating ACCEPTINVCHARS.
Try this:
copy table from 's3://my-bucket/my-path'
credentials 'aws_iam_role=<your role arn>'
ACCEPTINVCHARS
delimiter '|' region 'us-region-1';
Warnings:
Load into table 'table' completed, 500000 record(s) loaded successfully.
Load into table 'table' completed, 4510 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.
0 rows affected
COPY executed successfully
Execution time: 33.51s
Sounds like the encoding of your file might not be UTF-8. You might try this technique that we use sometimes:
cat myfile.tsv | iconv -c -f ISO-8859-1 -t utf8 > myfile_utf8.tsv
For many people loading CSVs into databases, they get their files from someone using Excel or they have access to Excel. If so, this problem is quickly solved by either:
Saving the file out of Excel using Save As and selecting the CSV UTF-8 (Comma delimited) (*.csv) format, by requesting/training those giving you the files to use this export format. Note that many people by default export to CSV using the CSV (Comma delimited) (*.csv) format, and there is a difference.
Loading the CSV into Excel and then immediately doing a Save As to the UTF-8 CSV format.
Of course this wouldn't work for files unusable by Excel, i.e. larger than 1 million rows, etc. Then I would use the iconv suggestion by mike_pdb.
I noticed that an Athena external table is able to parse data which the Redshift COPY command is unable to load. We can use the alternative approach below when encountering "String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence: 8b (error 3)".
Follow the steps below if you want to load data into Redshift database db2, table table2.
Have a Glue crawler IAM role ready which has access to S3.
Run the crawler.
Validate the table and database in Athena created by the Glue crawler, say external database db1_ext, table table1_ext.
Log in to Redshift and create a link to the Glue Catalog by creating a Redshift schema (db1_schema) using the command below.
CREATE EXTERNAL SCHEMA db1_schema
FROM DATA CATALOG
DATABASE 'db1_ext'
IAM_ROLE 'arn:aws:iam:::role/my-redshift-cluster-role';
Load from external table
INSERT INTO db2.table2 (SELECT * FROM db1_schema.table1_ext)

SSIS Convert Between Unicode and Non-Unicode Error

I have an SSIS package where I am using an OLE DB source linking to a SQL Server 2005 table. All columns except a date column are NVARCHAR(255). I am using an Excel destination and a SQL statement to create the sheet in the Excel workbook; the SQL is in the Excel connection manager (effectively a create table statement that creates a sheet) and is derived from the mapping of the columns from the DB.
No matter what I have done I keep getting this unicode --> non-unicode conversion error between my source and destination. I tried a conversion to string [DT_STR] between source and destination, removed it, changed the SQL table from VARCHAR to NVARCHAR, and still get this flippin' error.
Because I am creating the sheet in Excel with a SQL statement, I do not see any way to actually pre-define what the data types of the columns will be in the Excel sheet. I imagine it would be default metadata, but I do not know.
So, between my SQL table source and the creation of my Excel sheet with this SSIS SQL statement, how can I stop this error coming up?
My error is:
Error at Data Flow Task [OLE DB Source [1]]: Column "MyColumn" cannot convert between unicode and non-unicode string data types.
And for all nvarchar columns.
Appreciate any help
Thanks
Andrew
The steps below worked for me:
Right-click on the source task.
Click on "Show Advanced Editor".
Go to the "Input and Output Properties" tab.
Select the output column for which you are getting the error.
Its data type will be "string [DT_STR]".
Change that data type to "Unicode string [DT_WSTR]".
Save and close.
Add Data Conversion transformations to convert string columns from non-Unicode (DT_STR) to Unicode (DT_WSTR) strings.
You need to do this for all the string columns...
The missing piece here is the Data Conversion object. It should sit between the OLE DB Source and the Destination object.
First, add a Data Conversion block into your data flow diagram.
Open the Data Conversion block and tick the column for which the error is showing. Below it, change the data type to Unicode string [DT_WSTR] (or whatever data type is expected) and save.
Go to the destination block, go to Mapping in it, map the newly created element to its corresponding address, and save.
Right-click your project in the Solution Explorer and select Properties. Select Configuration Properties and then Debugging. There, set the Run64BitRuntime option to False (as Excel does not handle 64-bit applications very well).
Instead of adding the earlier suggested Data Conversion you can cast the nvarchar column to a varchar column. This saves you an unnecessary step and has higher performance than the alternative.
In the SELECT of your SQL statement, replace date with CAST(date AS varchar([size])). For some reason this does not yet change the output data type. To do that, do the following (a sketch of such a source query follows at the end of this answer):
Right click your OLE DB Source step and open the advanced editor.
Go to Input and Output Properties
Select Output Columns
Select your column
Under Data Type Properties change DataType to string [DT_STR]
Change Length to the length you specified in your CAST statement
After doing this your source data will be output as a varchar and your error will disappear.
Source
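As a rough sketch of what such a source query could look like (the table and column names here are made up; only the CAST pattern comes from the steps above):
SELECT
    CAST(MyColumn AS varchar(255))      AS MyColumn,
    CAST(AnotherColumn AS varchar(255)) AS AnotherColumn,
    SomeDateColumn
FROM dbo.MyTable;
-- After this, set the matching output columns to string [DT_STR]
-- with length 255 in the Advanced Editor, as described in the steps above.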
I have been having the same issue and tried everything written here, but it was still giving me the same error.
It turned out to be a NULL value in the column which I was trying to convert.
Removing the NULL value solved my issue.
Cheers,
Ahmed
No one seems to mention this, but converting varchar to nvarchar in the source query also solves the issue (see the sketch below).
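For example, a sketch of that approach with a hypothetical table and column (neither appears in the original question):
SELECT
    CAST(CustomerName AS nvarchar(255)) AS CustomerName
FROM dbo.Customers;
-- nvarchar is read by SSIS as Unicode string [DT_WSTR],
-- which matches the Unicode type the Excel destination expects.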
In the above example I kept losing the values; I think that delaying the validation will allow the new data types to be saved as part of the metadata.
On the connection manager for 'Excel Connection Manager', set DelayValidation to False from the Properties.
Then on the data flow destination task for Excel, set ValidateExternalMetadata to False, again from the Properties.
This will now allow you to right-click on the Excel Destination task and go to Advanced Editor for Excel Destination --> far right tab, Input and Output Properties. In the External Columns folder section you will now be able to change the Data Type and Length values of the problematic columns, and this can be saved.
Good Luck!
I experienced this condition when I had the Oracle 12 32-bit client installed, connected to an Oracle 12 server running on Windows.
Although both the Oracle source and the SQL Server destination are NOT Unicode, I kept getting this message, as if the Oracle columns were Unicode.
I solved the problem by inserting a Data Conversion box, selecting type DT_STR (not Unicode) for varchar2 fields and DT_WSTR (Unicode) for numeric fields, and then dropping the 'Copy of' prefix from the output field names.
Note that I kept getting the error because I had connected the source box arrow to the conversion box BEFORE setting the conversion types. So I had to switch the source box, and this cleared all the errors in the destination box.
When creating the table in SQL Server, make your table columns NVARCHAR instead of VARCHAR.
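A minimal sketch of that, with made-up table and column names:
-- NVARCHAR columns surface in SSIS as Unicode string [DT_WSTR],
-- so no Data Conversion step is needed.
CREATE TABLE dbo.CustomerStaging (
    CustomerId   INT,
    CustomerName NVARCHAR(255),
    City         NVARCHAR(100)
);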
I think people are missing this. In my case I had 100 character columns to convert between Oracle and MS SQL. All this stuff about Data Conversion and the Advanced Editor is incredibly tedious if you have 100 separate character columns to assign. Plus, SSIS being SSIS, it will sometimes reset all your 100 Advanced Editor changes even if you set ValidateExternalMetadata to false, which is incredibly obnoxious. I wouldn't mind doing the Data Conversion if there were some value to it, but 20 years ago ETL tools used to take Oracle character columns to MS SQL character columns without fussing. What Bakalolo and Zafer say is the answer if you have a lot of character columns and you can live with nvarchar: just declare all your output MS SQL columns as nvarchar and your data task will automatically assign your Oracle fields to MS SQL fields with no manual overrides. I have also found that the new Oracle Source (2021) doesn't complain about a Unicode conversion to varchar in MS SQL. A colleague just told me that the SSIS wizard (it may be only in VS 2019+) to assign Oracle character to MS SQL varchar will do the assignments automatically with no overrides, but I haven't tried that personally.
2022 update - I think this applies to packages created in VS 2019 and later. An ADO.NET task reading a varchar MS SQL table going to an OLE DB (and ADO.NET, I think) MS SQL varchar destination will throw the Unicode error. If you switch the input task to an OLE DB task reading the MS SQL varchar table, you won't have to do the Advanced Editor overrides for the varchar fields. If you don't want to do Advanced Editor overrides (who does?), try different tasks and more OLE DB tasks.
I just encountered the same issue; I solved it in my SQL request by using CONVERT directly:
CONVERT(NVARCHAR(50),'') AS MyVarName
I needed to put an empty (or fixed-size) string into the Excel file. The conversion forces the type of MyVarName from DT_STR to DT_WSTR (Unicode).
I know this is a very old post, but I ran into the same issue and found that I had to manually select the conversion component's output alias as the mapping in the Excel destination component. Since the names from the OLE DB Source match the Excel column names, it was mapping to the OLE DB columns and not to the output alias, such as the SourceID column from the OLE DB component being named Copy of SourceID after conversion. I don't see the original question saying they specifically selected the new alias name, just that they mapped to DB columns. Serge Voloshenko's post comes the closest but also does not mention making sure the mapping happens. To a new SSIS user this might be overlooked.

Issue with a numeric field in SSIS dtsx package

I have an SSIS dtsx package which is used to load data from a remote MAS db server using a DSN-based connection. We load data from many tables into their replica tables in SQL Server. Everything was working fine until we made some changes to a table in MAS. The dtsx has been failing with the following error:
Error: 0xC02090F8 at Data Flow Task, Import Data, DataReader Source
[28866]: The value was too large to fit in the output column
"UDF_TREAD_DEPTH" (29160).
Actually, I believe it might be related to a single table field, "UDF_TREAD_DEPTH", which is a decimal field. This field is shown in the DataReader source as "numeric [DT_NUMERIC]" with Length: 0, Precision: 4 and Scale: 2.
In the past we had simple data in the format xx.xx. Now, after the issue appeared, I see that we have data like xx.xx, xxx, etc.; however, the data type still didn't change after I refreshed the DataReader source.
I believe the precision should be updated to 5 for the data we have, based on this description.
I'm unable to change the data type as visible in the attached screen (Data Source Output column.png). When I debug this dtsx package, it errors out while loading the DataReader Source. If I'm nailing it right, how can I fix it? If there are any other possibilities, kindly let me know.
Have you tried to edit the source with the Advanced Editor? (Right-click and select "Show Advanced Editor...".) You can navigate to the Input and Output Properties section (generally the last tab), go into the Output Columns section (for OLE DB, click the + next to OLE DB Source Output, then the + next to Output Columns, then highlight the column name you want to change) and change the properties of the column in question (look for Data Type Properties and change Precision and Scale as needed). If you are not able to do that, you can try deleting the source and replacing it with a new source to the same data (i.e. the recreation of this object will re-query the connection for column properties).
I got the data to be updated with the xxx.xx mask so 100 became 100.00. And this helped the DataReader in SSIS infer the type correctly.
In addition, I also found another easy way of doing this which didn't require support for any cast/convert function:
UDF_TREAD_DEPTH * 1.00 as UDF_TREAD_DEPTH
This also allowed the DataReader to infer the type (i.e. precision & scale) correctly.
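For illustration, two hedged variants of the source query; the table name is made up, and the explicit CAST variant only applies if the MAS ODBC driver supports it (the answer above suggests it may not):
-- Multiplication trick from above: lets the DataReader infer a wider precision and scale
SELECT UDF_TREAD_DEPTH * 1.00 AS UDF_TREAD_DEPTH
FROM SOME_MAS_TABLE;
-- Explicit cast to precision 5, scale 2 (only if the driver supports CAST)
SELECT CAST(UDF_TREAD_DEPTH AS DECIMAL(5,2)) AS UDF_TREAD_DEPTH
FROM SOME_MAS_TABLE;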