I'm trying to copy a large data set from PostgreSQL to ScyllaDB, which is supposed to be compatible with Cassandra.
This is what I'm trying:
psql <db_name> -c "COPY (SELECT row_number() OVER () as id, * FROM ds.my_data_set LIMIT 20) TO stdout WITH (FORMAT csv, HEADER, DELIMITER ';');" \
| \
CQLSH_HOST=172.17.0.3 cqlsh -e 'COPY test.mytable (id, "Ist Einpöster", [....]) FROM STDIN WITH DELIMITER = $$;$$ AND HEADER = TRUE;'
I get an obscure error without a stack trace:
:1:'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)
My data and column names, including the ones already in the table created in ScyllaDB, contain German text. It's not ASCII, but I haven't found anywhere to set the encoding, and everywhere I looked it seemed to be using UTF-8 already. I also tried a fix suggested elsewhere, found the relevant code in the vicinity of line 1135, and changed it in my local cqlsh (using vim $(which cqlsh)), but it had no effect.
I'm using cqlsh 5.0.1, installed using pip. (weirdly it was pip install cqlsh==5.0.4)
I also tried the cqlsh from the docker image that I used to install ScyllaDB, and it has the exact same error.
<Update>
As suggested, I piped the data to a file:
psql <db_name> -c "COPY (SELECT row_number() OVER (), * FROM ds.my_data_set ds) TO stdout WITH (FORMAT csv, HEADER);" | head -n 1 > test.csv
I thinned it down to the first row (CSV header). Piping it to cqlsh made it cry with the same error. Then, using python3.5 interactive shell, I did this:
>>> with open('test.csv', 'rb') as fp:
...     data = fp.read()
...
>>> data
b'row_number,..... Ist Einp\xc3\xb6ster ........'
So there we are, \xc3 in the flesh. Is it UTF-8?
>>> data.decode('utf-8')
'row_number,....... Ist Einpöster ........'
Yes, it's utf-8. So how does the error happen?
>>> data.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 336: ordinal not in range(128)
Same error text, so it's probably Python as well, but without a stack trace, I have no idea where this is happening, and default encodings are utf-8. I tried overriding the default with utf-8 but nothing changed. Still, somewhere, something is trying to decode a stream using ASCII.
This is the locale on the server/client:
LANG=
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
Someone on Slack suggested the answer to the question "UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)".
Once I added the last two lines of that answer at the beginning of cqlsh.py, it got past the decoding issue, but the same column was reported as invalid with another error:
:1:Invalid column name Ist Einpöster
side note:
I lost interest in this test at this point, and I'm just trying to not have an unanswered question, so please excuse the wait time. As I was trying it out as an analytical engine, coupled with Spark, as a data source for Tableau, I found "better" alternatives, like Vertica and ClickHouse. "Better" because both of them have limitations.
</Update>
How can I complete this import?
What was it?
The query passed in as an argument contained the column list, which included that column with a non-ASCII character. At some point, cqlsh parsed it as ASCII and not UTF-8, which led to this error.
How was it fixed?
First attempt was to add these 2 lines in cqlsh:
reload(sys)
sys.setdefaultencoding('utf-8')
but that still made the script unable to work with that column.
Second attempt was to simply pass the query from a file. If you can't, know that bash supports process substitution, so instead of this:
cqlsh -f path/to/query.cql
you can have
cqlsh -f <(echo "COPY .... FROM STDIN;")
And it's all great, except that it doesn't work either. cqlsh understands stdin as "interactive", from a prompt, and not piped in. The result is that it doesn't import anything. One could just create a file, and load it from the file, but that's an extra step that might take minutes or hours, depending on the data size.
Thankfully, most Unix-like systems have virtual files like '/dev/stdin', so the above command is equivalent to this:
cqlsh -f <(echo "COPY .... FROM '/dev/stdin';")
except that cqlsh now thinks that you actually have a file, and it reads it like a file, so you can pipe your data and be happy.
This would probably work, but for some reason I got the last kick:
cqlsh.sql:2:Failed to import 15 rows: InvalidRequest - Error from server: code=2200 [Invalid query] message="Batch too large", will retry later, attempt 4 of 5
I think it's funny that 15 rows are too much for a distributed storage engine. And it's likely that it's again some limitation from the engine related to unicode and just a wrong error message. Or I'm wrong. Nevertheless, the initial question was answered, with some BIG help from the guys in Slack.
I don't see that you ever got an answer to this. UTF-8 should be the default.
Did you try --encoding?
Docs: https://docs.scylladb.com/getting-started/cqlsh/
If you didn't get an answer here, would you wish to ask it on our slack channel?
I would try to eliminate all the extra complexity you have in there first. Try to dump a few rows into a CSV, and then load it into Scylla using COPY
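A minimal sketch of that first step, assuming Python 3 is available: it writes a tiny UTF-8 CSV that could then be fed to COPY. The column name is taken from the question; the function name `write_sample_csv`, the file name, and the row values are made up for illustration.

```python
import csv
import os
import tempfile


def write_sample_csv(path, header, rows, delimiter=";"):
    """Write a small CSV file, explicitly encoded as UTF-8."""
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as fp:
        writer = csv.writer(fp, delimiter=delimiter)
        writer.writerow(header)
        writer.writerows(rows)


# Hypothetical sample: two rows with a non-ASCII column name.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
write_sample_csv(sample_path, ["id", "Ist Einpöster"], [[1, "ja"], [2, "nein"]])
print(sample_path)
```

If a file this small loads cleanly, the encoding problem is more likely in how the query or the pipe is handled than in the data itself.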
Update: utf8: Print invalid UTF-8 character position
Add new validate_with_error_position function which returns -1 if data is a valid UTF-8 string or otherwise a byte position of first invalid character. The position is added to exception messages of all UTF-8 parsing errors in Scylla.
validate_with_error_position is done in two passes in order to preserve the same performance in common case when the string is valid.
https://github.com/scylladb/scylla/commit/ffd8c8c505b92a71df7e34d5196c7545f11cb12f
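To illustrate the contract described in that commit message (-1 for valid input, otherwise the byte position of the first invalid sequence), here is a rough Python sketch. It is not Scylla's implementation, which is C++ and uses two passes; it only borrows the function name and mimics the return values using Python's own decoder.

```python
def validate_with_error_position(data: bytes) -> int:
    """Return -1 if data is valid UTF-8, else the byte offset where decoding fails."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        return exc.start
    return -1


print(validate_with_error_position("Ist Einpöster".encode("utf-8")))
print(validate_with_error_position(b"abc\xffdef"))
```

The first call sees valid UTF-8, the second has an invalid byte at offset 3.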
I download an HTML page and its files via Wget on Windows:
wget -m -k -p -np --html-extension
That HTML content has a lot of URLs with special characters (example: Chp1).
There are two issues:
Inside the HTML content, URLs (including special characters) are turned into garbled text:
Expectation:
Chp1
Actual:
Chp1
The filename is garbled in the same way.
The second issue can be solved by adding --restrict-file-names=nocontrol.
How do I solve the first one? Is the Windows version of Wget the problem?
Obviously, inside HTML, it converts URLs with special characters to something...
Your problem comes from the fact that Windows will still treat your UTF-8 characters as Latin-1 characters, even with the --restrict-file-names=nocontrol command line argument.
GNU's site documents this bug here, and it is still unfortunately an issue for Windows users to this day. Your command would work inside a Linux environment however.
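The kind of garbling described here can be reproduced in a few lines of Python. This is only an illustration of UTF-8 bytes being read as Latin-1, as the answer describes; the sample character "ö" and both function names are assumptions, and the code page actually used on a given Windows machine may differ.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Simulate a program that treats UTF-8 bytes as Latin-1 characters."""
    return text.encode("utf-8").decode("latin-1")


def repair_latin1_mojibake(garbled: str) -> str:
    """Undo the mistake, provided no bytes were lost along the way."""
    return garbled.encode("latin-1").decode("utf-8")


garbled = misread_utf8_as_latin1("ö")
print(garbled)                          # two characters instead of one
print(repair_latin1_mojibake(garbled))  # back to the original character
```

Each non-ASCII character turns into two or more unrelated characters, which matches the "random words" appearance in the rewritten URLs and file names.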
I am trying to implement k-means in MATLAB. However, when I call csvread('Filename'); in my program, I get the warning "The encoding 'GB2312' is not supported." and the program can't read the CSV data. Can anybody tell me what is wrong?
data=csvread('ClusterSamples.csv');
plot(data(:,1),data(:,2),'r+');
[m,n]=size(data);
The character encoding is not supported.
If you're using Mac or Linux you can use the iconv(1) tool.
cp ClusterSamples.csv ClusterSamples.csv.old && \
iconv -f GB2312 -t UTF-8 < ClusterSamples.csv.old > ClusterSamples.csv
If not, you can use a text editor to change the character encoding and re-save the file.
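Where iconv is not available, the same conversion can be sketched in Python, which ships a gb2312 codec. The function name `convert_encoding` and the sample file contents below are made up for illustration; only the encodings come from the question and the iconv command above.

```python
import os
import tempfile


def convert_encoding(src_path, dst_path, src_encoding="gb2312", dst_encoding="utf-8"):
    """Re-encode a text file, leaving line endings untouched."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


# Hypothetical round trip on a throwaway file.
work_dir = tempfile.mkdtemp()
old_path = os.path.join(work_dir, "ClusterSamples.csv.old")
new_path = os.path.join(work_dir, "ClusterSamples.csv")
with open(old_path, "wb") as fp:
    fp.write("中文,1.0,2.0\n".encode("gb2312"))
convert_encoding(old_path, new_path)
```

As with the iconv command, the original file is kept under a different name so nothing is lost if the source encoding turns out to be something else.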
I am desperately trying to replace certain Unicode characters (graphemes) in a file using sed. However, I keep failing for some of them, namely the ones from these Unicode blocks:
\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF
I tried (in a sed config file loaded via the -f switch):
s/\p{InHigh_Surrogates}/###/ --> no effect at all
s/\\p\{InHigh_Surrogates\}/###_D-NON-UTF8_###/ -> error message 'Invalid content of \{\}'
Anybody got a suggestion? Also, I am not necessarily focused on using the blocks - but I also failed trying to define a character range of the form \xd800-\xdfff.
Thanks,
Thomas
Try using the -r flag for sed:
$ sed -r 's/\\p\{InHigh_Surrogates\}/###/g' file
###: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF
From man sed:
-r, --regexp-extended
use extended regular expressions in the script.
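If the goal is to replace the surrogate code points themselves rather than the literal text \p{...}, one alternative is to leave sed for this step. Below is a rough Python 3 sketch, not a sed solution: it assumes the file contains surrogates encoded as UTF-8-style byte sequences (for example ED A0 80 for U+D800), and the function name `replace_surrogates` is made up.

```python
import re

# Character class covering all three blocks from the question: U+D800 to U+DFFF.
SURROGATES = re.compile("[\ud800-\udfff]")


def replace_surrogates(raw: bytes, marker: str = "###") -> str:
    """Decode bytes as UTF-8 and replace every surrogate code point with marker."""
    # "surrogatepass" makes the codec accept encoded surrogates instead of raising.
    text = raw.decode("utf-8", errors="surrogatepass")
    return SURROGATES.sub(marker, text)


print(replace_surrogates(b"a\xed\xa0\x80b"))
```

The result is an ordinary string without surrogates, so it can be written back out as strict UTF-8.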
Update: 07/12/13
The script works through command line.
"------extra line" is to show an extra return key stroke in editor.
XAMPP: 1.8.2
Server: Apache 2.4
Issue:
I keep receiving the error "End of script output before headers: hello.pl" for a simple hello world perl script. I'm trying to execute the script via a web server "xampp".
Curious Note:
I can use another Perl script which will initially work. However when I make a simple change such as a space, return or comment "#", the script will no longer function. However if I remove the change and save it the script will work again.
Check List
Confirm correct path to perl
Output header (see perl code below)
Extra line at end of script (I heard this could resolve issue)
Confirmed correct privileges in httpd.conf
Transferred file via ftp in ASCII
Perl Script:
#!"C:\xampp\perl\bin\perl.exe"
print "Content-Type: text/html\n\n";
print "hello world";
------extra line
httpd.conf
<Directory "C:/xampp/htdocs">
Options Indexes FollowSymLinks Includes ExecCGI
AllowOverride All
Require all granted
</Directory>
Maybe your editor changes the line endings to Windows ones (CRLF).
CGI output needs to start with the header(s), followed by a blank line (two \n), and then the body wrapped in the right HTML markup (see "Why doesn't my Perl CGI program work on Windows?").
Check the actual characters in an editor that shows you the line endings (like Notepad++).
To the best of my knowledge, Windows itself ignores the shebang (#!) line, although Apache reads it by default to locate the interpreter.
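The line-ending check suggested above can also be done without an editor. A small Python sketch, assuming the script is read as raw bytes (the function name `count_line_endings` is made up):

```python
def count_line_endings(data: bytes) -> dict:
    """Count Windows (CRLF), Unix (LF) and bare CR line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair also contains one LF and one CR, so subtract them.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr}


sample = b'print "Content-Type: text/html\\n\\n";\r\nprint "hello world";\n'
print(count_line_endings(sample))
```

To inspect the real script, pass it the result of open(path, "rb").read(); a mix of nonzero counts suggests that an editor or an FTP transfer changed the line endings.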
The probable cause:
http://perl.baczynski.com/wtf/solved-mystery-perl-on-xampp-wont-run-modified-scripts
tl;dr: turn off your COMODO antivirus, or its sandbox feature.
Might be a known PHP bug (https://bugs.php.net/bug.php?id=66474). Try different versions of PHP?
SELinux is probably blocking this.
Try this:
setsebool -P httpd_enable_cgi 1
chcon -R -t httpd_sys_script_exec_t cgi-bin/your_script.cgi