How can I truncate data to fit into a field using SQL*Loader? (ORA-12899) - substring

Using Oracle SQL*Loader, I am trying to load a column that was a variable length string (lob) in another database into a varchar2(4000) column in Oracle. We have strings much longer than 4000 characters, but everyone has agreed that these strings can and should be truncated in the migration (we've looked at the data that goes beyond 4000 characters, it's not meaningful). To do so, I specified the column this way in the control file:
COMMENTS CHAR(65535) "SUBSTR(:COMMENTS, 1, 4000)",
However, SQL*Loader still rejects any row where this record is longer than 4000 characters in the data file:
Record 6484: Rejected - Error on table LOG_COMMENT, column COMMENTS.
ORA-12899: value too large for column COMMENTS (actual: 11477, maximum: 4000)
Record 31994: Rejected - Error on table LOG_COMMENT, column COMMENTS.
ORA-12899: value too large for column COMMENTS (actual: 16212, maximum: 4000)
Record 44063: Rejected - Error on table LOG_COMMENT, column COMMENTS.
ORA-12899: value too large for column COMMENTS (actual: 62433, maximum: 4000)
I tried taking a much smaller substring and still got the same error. How can I change my control file to truncate string data longer than 4000 characters into a varchar2(4000) column?

Check to make sure your data ENCODING and Oracle ENCODING are not conflict.
In this case, use CHARACTERSET option when loading.

by all accounts
COMMENTS CHAR(65535) "SUBSTR(:COMMENTS, 1, 4000)",
is the correct syntax.
using sqlldr 11.2.0.1 it works successfully for me up until the point where the input record column is > 4000 where i get a
ORA-01461: can bind a LONG value only for insert into a LONG column
if i switch to a directpath load then i get the smae error as you.
ORA-12899: value too large for column COMMENTS (actual: 4005, maximum: 4000)
in the end i have split it into a 2 stage load.. i now have a staging table with a column of type CLOB which i load with
COMMENTS CHAR(2000000000)
which then gets inserted to eth main table with a
insert into propertable
select dbms_lob.substr(comments,1,4000)
from staging_table;
hope thats helpful

Related

What's the impact of TOAST on performance? (adding a hundred varchar columns)

Consider a table with the following data:
id bigint Auto Increment
name character varying(255) NULL
category character varying(255) NULL
english character varying(255) NULL
french character varying(255) NULL
pivot character varying(255) NULL
credits character varying(255) NULL
hash character varying(20) NULL
The english column contains data of the following size (in bytes): max 116, min 5, average 42, median: 40.
The number of rows in the table is around 30,000 and will hardly change.
The new 107 columns will be translations of the English.
Will adding 107 columns hurt performance?
The Postgres site says the maximum number of columns on a Postgres table is
250-1600 depending on column types
and
The maximum number of columns for a table is further reduced as the tuple being stored must fit in a single 8192-byte heap page
Will the data fall under that limit?
Size of the largest row
What is the actual storage size of the table's rows? pg_column_size is the
Number of bytes used to store a particular value (possibly compressed)
SELECT id, pg_column_size(t.*) FROM table as t ORDER BY pg_column_size DESC
-- Some stats derived from the query:
-- Min 87 bytes
-- Max 514 bytes
-- Average 216 bytes
-- Median: 209 bytes
But no compression is actually happening here, because:
When a row that is to be stored is "too wide" (the threshold for that is 2KB by default), the TOAST mechanism first attempts to compress any wide field values. If that isn't enough to get the row under 2KB, it breaks up the wide field values into chunks that get stored in the associated TOAST table. Each original field value is replaced by a small pointer that shows where to find this "out of line" data in the TOAST table. TOAST will attempt to squeeze the user-table row down to 2KB in this way, but as long as it can get below 8KB, that's good enough and the row can be stored successfully.
Compression would start to kick in once the table gets bigger and those new columns are added.
It's unclear to me what the compression ratio would be for such data?
I wonder how effective it'll be on lots of short multilingual sentences. Also, tried to find the exact name of the compression algorithm used by Postgres: the docs say "the LZ family of compression techniques", but which one – LZ77? LZ78? A twist on one of them?
The best way to find out how much compression will achieve here is certainly to try… once I've got the translations. But I'd rather get an idea of it beforehand, as I won't get all the data at once.
TOAST'ed?
If the size of the table goes beyond the page size limit, then Posgres will rely on TOAST not just to compress but also to split the data for "out-of-line" rows.
I understand this will increase fetch times for those rows that don't fit… But what's the impact of TOAST on performance? Is it negligible for such a use case?
Bottom-line
At the end of the day…
Is adding those 107 columns a good idea, or should I use a different approach?
If fine, how important is it to be fetching only those columns the user needs? (No user will need all of them.)
Or am I approaching this the wrong way, i.e. is it a case of premature optimization where I'd have been better off just adding the columns and only investigate later if faced with problems?
Using Postgres 9.6. Upgrading is an option if needed.
The best way to find out how much compression will achieve here is certainly to try… once I've got the translations. But I'd rather get an idea of it beforehand, as I won't get all the data at once.
I'd just copy the English version into each of the 107 columns. That should be good enough to get some useful findings. You might worry that the repetition would cause the compression to be idiosyncratic; but each value is compressed in isolation so won't "know" it is identical to some other value.
It's unclear to me what the compression ratio would be for such data?
Not very much. For example, the paragraph of yours I quoted first doesn't get any benefit from compression (when I copied it into 107 other columns). Short segments of ordinary text do not have enough repetition in them to be very compressible. Translating them to other languages is unlikely to change this.
If fine, how important is it to be fetching only those columns the user needs? (No user will need all of them.)
This question has a very clear answer. You should absolutely select only what you need. Assembling a row from 100+ toasted columns, just to throw most of them away, will slow you down.
I don't know if this falls under "premature optimization" so much as falling under poor design. In one way or another you will need some method of know which of the 108 versions you need. But what happens when you need to add the 108th translation, or you delete say the 93rd. So use this information to form a key to a translation table. Something like Translation_Test (for_ref_in bigint, language text, translation text). Then access the necessary text (including perhaps the English version) from that table.

Is there a way to set the max width of a column when displaying JSONB results in psql?

I have a problem that is somewhat similar to this question: Is there a way to set the max width of a column when displaying query results in psql?. I have a number of tables in Postgres with large JSONB documents. When I use psql from Emacs, it grinds to a halt trying to display fields with these documents. Ideally, I just want to see the first X characters of a document when I select * from a given table. I tried:
\pset columns 20
To no avail. Is there some permutation of columns and format that could achieve this? At present I am manually casting columns like so:
cast("PAYLOAD" as varchar(50))
This works, but it does mean I need to remember to cast before selecting which is a major pain.

Does varchar's length have any effect on performance

We have discussions with our development staff over the use of VARCHAR columns as they define every varchar fileds as varchar(255),varchar(500),... and much bigger than the maximum length of the field,
does varchar's length have any effect on performance in db2? We have find that it is recommended to use char instead of varchar for column of 30 bytes or less and our concern is about varchar fileds that are greater than 30 bytes.
Allowing excessive column length is not a good idea. If you allow, let’s say, a FirstName column to have maximum length 500, you may find quite a long irrelevant story there eventually, because why not if it’s allowed :)
As for performance implications.
The only problem may be here, if Extended row size is turned on for the database (you simply can’t create too “wide” table otherwise), and the total length of the row exceeds the tablespace page size. Some varchar column value gets out from the data page, and more IO will be needed to access such a row in future. You should keep in mind such a behavior. And the probability of such events is higher in case of uncontrolled varchar columns length.
This can have an performance hit with ORGANIZE BY COLUMN tables. There is a limit in the total declared width that can be processed within the Columnar Data Engine, if this limit is breached in a query plan, the remainder of the query will be processed in the Row Data Engine.

Keep leading zeros when joining data sources in tableau

I am trying to create a data source in Tableau (10.0) where I am joining a table from SQL with an Excel file. The join happens on a site id but when reading the id from the excel source, Tableau strips the leading zeros (and SQL keeps leading zeros). I see this example
to add the leading zeros back as a new, calculated field. But the join is still dropping rows because the id is not properly formatted when making the join.
How do I get the excel data source to read the column with the leading zeros so I can do the join?
Launch Excel and choose to open a new blank workbook.
Click the Data tab and select From Text.
Browse to the saved CSV file and select Import.
Ensure that Delimited is selected and click Next.
Leave Tab as the delimiter and click Next.
Select the column containing the data with leading zeros and click
Text.
Repeat for each column which contains leading zeros.
Click Finish.
Click OK.
Never heard of or used tableau, but it sounds as though something (jet/ace database driver being used to read excel file?) is determining the column to be numeric and parsing the data as numbers, losing leading zeroes
If your attempts at putting them back are giving you grief, I'd recommend trying the other direction instead; get sqlserver to convert its strings to numbers. Number matching should be more reliable than String matching, so long as the two systems don't handle rounding differently :)
If your Excel file was read in from a CSV and the Site ID is showing "Number Stored as Text", I think you can solve your problem by telling Tableau on the Data Source entry that the field is actually a string. On the preview data source view, change the "#" (designating number) to string so that both the SQL source and the Excel source are both strings before doing the join.
This typically has to do with the way Excel stores values as mentioned above. I would play around with the number formatting for the Site ID column in Excel itself, not Tableau, and changed that two "Text" in Excel. You can verify if Tableau will read it properly with the leading 0s by exporting your excel file to csv and looking in the csv files to see if the leading 0s are still there.

How do you insert part of an exsisting column record into a new table in postgreSQL?

I'm in the process of formatting a database and have found a column I'd like to format. It has 3 types of information for every record in the column. For example, a record in my history column shows as (American, born Estonia. 19011974). What I want to do is put this data into new individual columns to make them atomic. I want to extract data such as 'American' into a country column, 'Estonia' into a born column and '1901' into a born column and '1974' into a death column. UPDATE: However, some of the columns hold nullls, for example, another record in the same column might be (German, 19242004), so a normal regular expression wouldn't work for all data would it? Any help is appreciated!
What PostgreSQL statements would I use to obtain specific parts of this data from the individual records? I understand it would be insert and have already came up with:
INSERT INTO historian (id,url)
SELECT object_id, url FROM maintable;
that statement allowed me to get those values into new columns, but those were atomic already so I could easily transition them. Thanks for any help! :)