This is really strange: this code works when I replace the \n and \r characters with an empty string. But when I use a space instead, whether " ", "\s", "[\s]", or "-" (I've tried everything), the string then exceeds the column length according to Redshift. So stl_load_errors says the DDL length was exceeded, but when I grab the text from the dataframe, or even from the stl_load_errors table, it shows it is only 1024 characters. The field is defined as varchar(1026).
Works:
rootTable.withColumn("firstfield",substring(regexp_replace("firstvalue","[\\r\\n]", ""), 1,1026)) \
.withColumn("secondfield",substring(regexp_replace("secondvalue","[\\r\\n]", ""), 1,1026))
Does not work:
rootTable.withColumn("firstfield",substring(regexp_replace("firstvalue","[\\r\\n]", " "), 1,1026)) \
.withColumn("secondfield",substring(regexp_replace("secondvalue","[\\r\\n]", " "), 1,1026))
Am I confusing characters with bytes here, i.e. the Redshift column is 1026 bytes, not 1026 characters?
The data sample has some \n and \r characters in it: ...Non-applicable\nSubstances...
If I change the substring length to around 1014, it inserts OK, with no DDL length exceeded error.
Thanks
EDIT:
It seems like this is due to special characters, or characters larger than 1 byte, being present in the text. I will share when I find a solution.
After a lot of research I cannot find an elegant way to handle this. Removing the non-ASCII characters is the only way; with non-ASCII characters the byte length is unpredictable, and that is what causes the DDL length exceeded errors in Redshift in this case.
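Here is a quick plain-Python check (with a made-up string, just to illustrate) of how the character count and the UTF-8 byte count diverge once multi-byte characters appear:

# Hypothetical illustration: Redshift VARCHAR(1026) limits bytes,
# while PySpark's substring() limits characters.
s = "é" * 1026                      # 1026 characters, each 2 bytes in UTF-8
print(len(s))                       # 1026  -> passes the substring(1, 1026) truncation
print(len(s.encode("utf-8")))       # 2052  -> exceeds a 1026-byte Redshift column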
The PySpark substring method doesn't handle it properly for me, since it counts characters rather than bytes. If I were dealing with binary data I guess it would, but converting back and forth and trying to replace each value with a 1-byte value seems like way too much work for the need in this case.
def ascii_ignore(x):
    # Drop non-ASCII characters so the character count and byte count match
    return x.encode('ascii', 'ignore').decode('ascii') if x else None
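For completeness, a minimal sketch of how that helper could be wired in as a UDF (untested; the dataframe and column names are just the ones from above):

from pyspark.sql.functions import udf, substring
from pyspark.sql.types import StringType

# Sketch only: register the helper above as a UDF and apply it before truncating
ascii_ignore_udf = udf(ascii_ignore, StringType())
cleaned = rootTable.withColumn(
    "firstfield", substring(ascii_ignore_udf("firstvalue"), 1, 1026)
).withColumn(
    "secondfield", substring(ascii_ignore_udf("secondvalue"), 1, 1026)
)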
Related
I have a Glue job in which I am reading a table from Salesforce using SOQL:
df = (
spark.read.format("com.springml.spark.salesforce")
.option("soql", sql)
.option("queryAll", "true")
.option("sfObject", sf_table)
.option("bulk", bulk)
.option("pkChunking", pkChunking)
.option("version", "51.0")
.option("timeout", "99999999")
.option("username", login)
.option("password", password)
.load()
)
and whenever there is a combination of double-quotes and commas in the string it messes up my table schema, like so:
in source:

Column A | Column B           | Column C
000AB    | "text with, comma" | 123XX

read from SF in df:

Column A | Column B    | Column C
000AB    | "text with  | comma"
Is there any option to avoid such cases where this comma is treated as a delimiter? I tried various options but nothing worked. And SOQL doesn't accept REPLACE or SUBSTRING functions; its text manipulation functions are, well, basically nonexistent.
All the information I'm giving needs to be tested. I do not have the same environment, so it is difficult for me to try anything, but here is what I found.
When you check the official docs, you find that there is a field called metadataConfig. The documentation for this field can be found here: https://resources.docs.salesforce.com/sfdc/pdf/bi_dev_guide_ext_data_format.pdf
On page 2, under the CSV format, it says:
If a field value contains a control character or a new line the field value must be contained within double quotes (or your fieldsEscapedBy value). The default control characters (fieldsDelimitedBy, fieldsEnclosedBy, fieldsEscapedBy, or linesTerminatedBy) are comma and double quote. For example, "Director of Operations, Western Region".
which sounds a lot like your current problem.
By default the values are comma and double quote, so I do not understand why it is failing. But apparently your output keeps the double quotes, so maybe it is only considering single quotes.
You should try to enforce the format and add this to your code:
.option("metadataConfig", '{"fieldsEnclosedBy": "\"", "fieldsDelimitedBy": ","}')
# or something similar - I couldn't test this, so you will need to try it yourself
I am facing an issue when reading and parsing a CSV file. Some records have a newline character "escaped" by a \, and those records are not quoted. The file might look like this:
Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;
I've tried to read it using sc.textFile("file.csv") and using sqlContext.read.format("..databricks..").option("escape/delimiter/...").load("file.csv")
However, no matter how I read it, a record/line/row is created whenever "\ \n" is reached. So instead of getting 2 records from the file above, I am getting three:
[Line1field1,Line1field2.1,null] (3 fields)
[Line1field2.2,Line1field3,null] (3 fields)
[Line2FIeld1,Line2field2,Line2field3;] (3 fields)
The expected result is:
[Line1field1,Line1field2.1 Line1field2.2,Line1field3] (3 fields)
[Line2FIeld1,Line2field2,Line2field3] (3 fields)
(How the newline symbol is saved in the record is not that important; the main issue is getting the correct set of records/lines.)
Any ideas of how to do that, without modifying the original file and preferably without any post- or re-processing? (For example, reading the file, filtering out any lines with fewer fields than expected, and then concatenating them could be a solution, but it is not at all optimal.)
My hope was to use Databricks' CSV parser to set the escape character to \ (which is supposed to be the default), but that didn't work [I got an error saying java.io.IOException: EOF whilst processing escape sequence].
Should I somehow extend the parser and edit something, creating my own parser? What would be the best solution?
Thanks!
EDIT: Forgot to mention, I'm using Spark 1.6.
The wholeTextFiles API should be a rescuer in your case. It reads files as key-value pairs: the key is the path of the file and the value is the whole text of the file. You will have to do some replacing and splitting to get the desired output, though:
val rdd = sparkSession.sparkContext.wholeTextFiles("path to the file")
  .flatMap(x => x._2.replace("\\\n", "").replace(";\n", "\n").split("\n"))
  .map(x => x.split(";"))
The RDD output is:
[Line1field1,Line1field2.1 Line1field2.2,Line1field3]
[Line2FIeld1,Line2field2,Line2field3]
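If you prefer to stay in PySpark, roughly the same idea might look like this (untested sketch; the path is a placeholder and, as with wholeTextFiles in general, each file has to fit in memory):

# Rough PySpark equivalent of the Scala snippet above (untested sketch)
rdd = (sc.wholeTextFiles("path to the file")
         .flatMap(lambda kv: kv[1].replace("\\\n", "").replace(";\n", "\n").split("\n"))
         .map(lambda line: line.split(";"))
         .filter(lambda fields: fields != [""]))   # drop a trailing empty record, if any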
I am facing an issue while converting Unicode data into national characters.
When I convert the Unicode data into national using the NATIONAL-OF function, a junk character like # is appended after the string.
E.g.:
Ws-unicode pic X(200)
Ws-national pic N(600)
--let the value in Ws-unicode be これらの変更は, coming from the Java end.
move function national-of ( Ws-unicode ,1208 ) to Ws-national.
--after converting, the value is like これらの変更は #.
I do not want the extra # character added after conversion.
Please help me find a possible solution. I have tried to replace N'#' with a space using an INSPECT clause.
It worked well but failed in a specific scenario: if the input from the user end contains a #, that genuine # is also converted to a space.
Below is a snippet of code I used to convert EBCDIC to UTF. Before I was capturing string lengths, I was also getting # symbols:
STRING
    FUNCTION DISPLAY-OF (
        FUNCTION NATIONAL-OF (
            WS-EBCDIC-STRING(1:WS-XML-EBCDIC-LENGTH)
            WS-EBCDIC-CCSID
        )
        WS-UTF8-CCSID
    )
    DELIMITED BY SIZE
    INTO WS-UTF8-STRING
    WITH POINTER WS-XML-UTF8-LENGTH
END-STRING
SUBTRACT 1 FROM WS-XML-UTF8-LENGTH
What this code does is STRING the UTF-8 representation of the EBCDIC string into another variable. The WITH POINTER clause will capture the new length of the string + 1 (+1 because the pointer is positioned at the next position after the string ends), which is why 1 is subtracted afterwards.
Using this method, you should be able to know exactly how long the second string is and use it with that exact length.
That should remove the unwanted #s.
EDIT:
One thing I forgot to mention: in my case, the # signs were actually EBCDIC low-values when viewing the actual hex on the mainframe.
Use INSPECT with REVERSE and stop after the first occurrence of #.
I want to replace "\\" in a bytestring sequence (Data.ByteString) with "\", but due to the internal escaping of "\" it won't work.
Consider the following example:
The original bytestring:
"\159\DEL*\150\222/\129vr\205\241=mA\192\184"
After storing it in, and re-reading it from, a database I obtain the following bytestring:
"\"\\159\\DEL*\\150\\222/\\129vr\\205\\241=mA\\192\\184\""
Imagine that the bytestring is used as a cryptographic key, which is now a wrong key due to the invalid characters in the sequence.
This problem actually arises from the wrong database representation (varchar instead of bytea), because the value is otherwise considered an invalid UTF-8 sequence.
I have tried to replace the invalid characters using some sort of split-modify-concat approach, but all I get is something without any backslash inside the sequence, because I can't insert a single backslash into a bytestring.
I would really appreciate your help.
Perhaps using read will work for you:
import qualified Data.ByteString.Char8 as BS
bad = BS.pack "\"\\159\\DEL*\\150\\222/\\129vr\\205\\241=mA\\192\\184\""
good = read (BS.unpack bad) :: BS.ByteString
-- returns: "\159\DEL*\150\222/\129vr\205\241=mA\192\184"
You can also use readMaybe instead for safer parsing.
Possibly you want the PostgreSQL expression
substring(ByteString from e'^\\"(.*)\\"$')::bytea
That will give a bytea result that can be used in queries or in ALTER TABLE DDL.
I need to pad out a line of text with a varying but large amount of whitespace. I can figure out a janky way of doing a loop and adding whitespace to $foo, then splicing that into the text, but it is not an elegant solution.
I need a little more info. Are you just appending to some text or do you need to insert it?
Either way, one easy way to get repetition is Perl's 'x' operator, e.g.
" " x 20000
will give you 20K spaces.
If you have an existing string ($s, say) and you want to pad it out to 20K, try:
$s .= (" " x (20000 - length($s)))
BTW, Perl has an extensive set of operators - well worth studying if you're serious about the language.
UPDATE: The question as originally asked (it has since been edited) asked about 20K spaces, not a "lot of whitespace", hence the 20K in my answer.
If you always want the string to be a certain length you can use sprintf:
For example, to pad out $var with whitespace so it is 20,000 characters long, use:
$var = sprintf("%-20000s",$var);
Use the 'x' operator:
print ' ' x 20000;