Gsutil with rsync: Invalid Unicode Path Encountered - encoding

I am using gsutil with the "rsync" command to upload business-critical files to Google Storage as a backup. Unfortunately, most of the archives and filenames are Greek, for example "αντιγραφο.txt". With English filenames rsync is fine, but when gsutil tries to sync the Greek files, it hits an exception.
The command is:
gsutil -m rsync -d -r H:\Test gs://myserver.com/data
Building synchronization state...
Caught non-retryable exception while listing file://H:\Test: CommandException: Invalid Unicode path encountered
('H:\Test\\xe1\xed\xf4\xe9\xe3\xf1\xe1\xf6\xef (1).txt'). gsutil
cannot proceed with such files present. Please remove or rename this
file and try again. NOTE: the path printed above replaces the
problematic characters with a hex-encoded printable representation.
For more details (including how to convert to a gsutil-compatible
encoding) see gsutil help encoding.
CommandException: Caught non-retryable exception - aborting rsync
I tried to convert the filenames to UTF-8, but I can't find anything that works from the Windows cmd. I've searched many sites for iconv and native2ascii but couldn't locate anything useful. The server runs Windows 2012, so I cannot use "convmv" to convert the filenames to UTF-8. Is there another way to convert all filenames to UTF-8 in an automated manner before I upload them to the cloud? The archive is 600 GB, so I can't just zip it and send it; I also want this to run automatically through Task Scheduler.
Thank you very much!
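For anyone hitting this: the hex bytes in the error message are not UTF-8, they are the file name encoded in the Greek ANSI codepage (Windows-1253), which is what gsutil is choking on. A quick Python check, using the bytes from the error above, makes this visible:

# The bytes gsutil printed, decoded as the Greek ANSI codepage
raw = b"\xe1\xed\xf4\xe9\xe3\xf1\xe1\xf6\xef"
print(raw.decode("cp1253"))  # -> αντιγραφο (matches the example file name)

So the task is to get the names gsutil sees into UTF-8 rather than the ANSI codepage, which is the direction the error message's own reference to "gsutil help encoding" points at.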

Related

How does pgAdmin encode the file path in backups?

I'm trying to restore dump files from locations whose paths contain characters from languages other than English.
So here is what I did:
From inside pgAdmin I used the backup tool, and in the Filename input I provided a path through a real folder named "א":
C:\א\toc.dump
The actual file argument (-f file) was automatically converted into:
pg_dump.exe --file "C:\\0F04~1\\TOC~1.DUM"
My question is: what scheme does pgAdmin use to convert the file path argument?
How did it come up with 0F04~1 from א?
I'm asking because pg_restore does not support file paths containing non-English characters (from cmd):
pg_dump.exe --file "C:\\0F04~1\\TOC1.DUMP" .... WORKS OK!
pg_dump.exe --file "C:\\א\\TOC1.DUMP" ... Not Working!
pg_restore: [custom archiver] could not open input file "..."
As in this question: if I can find the scheme pgAdmin uses, I'll apply it from code.
My goal is to encode a path containing non-English characters from a batch script so that it works.
This is not something weird pgAdmin does; it is Windows itself, which generates a DOS-compatible 8.3 "short name" (the ~1 pattern) whenever a file name has to be represented in a DOS-like setting: when the name is longer than 8 characters, the extension is longer than 3, or the name contains characters the legacy codepage cannot represent.
In my hands the weird representation only appears in the logs and status messages. If I use the GUI file chooser, the file names look normal and the restore replays successfully.
If you really want to know what Windows is doing, that is probably a better question for Super User with a Windows tag. I don't know why you can't restore these files. Are you using the pgAdmin GUI file chooser, or trying to type the names directly into something?
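For the stated goal of driving this from a script, one option is to ask Windows for the short name directly instead of reverse-engineering the scheme. A sketch using the Win32 GetShortPathNameW API, under the assumption that 8.3 short-name generation is enabled on the volume (it is by default on the system drive):

# Sketch: obtain the DOS 8.3 "short name" for a path via the Win32
# GetShortPathNameW API, yielding an ASCII-only alias that can be
# passed to pg_restore. Windows only.
import ctypes

def short_path(long_path: str) -> str:
    buf = ctypes.create_unicode_buffer(260)
    if ctypes.windll.kernel32.GetShortPathNameW(long_path, buf, 260) == 0:
        raise ctypes.WinError()
    return buf.value

print(short_path(r"C:\א\toc.dump"))  # something like C:\0F04~1\TOC~1.DUM

Whatever this returns should be the same ASCII-only alias shown in the pgAdmin log, so it can be handed straight to pg_restore.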

How to clone/copy files whose names contain reserved characters from a server to local storage using Wget?

I cannot download some files from a server at my work because the file names contain reserved characters (an error not controlled by the company, caused by clients naming their uploaded attachments badly), and for some reason I get a 404 error even though the files exist on the server. I am using wget for this task.
This is the command that starts the download (list.txt contains one URL per line pointing from the server to the files in question, for example: https://example.com/files/122301/8+.pdf):
wget.exe -x -i "C:\clon\list.txt" -P "C:\clon\destino" -nv -o "C:\clon\log.txt"
I don't know the full purpose of the wget parameters beyond the source and destination paths and the log, but some files contain '}' or '+' in their names, and I think that is why they are not downloaded (I have 93% of all files downloaded).
Examples of files including these characters:
/FC04-6198}+.pdf
/8+.pdf
/PT05+2236.pdf
I tried adding the parameters "--content-disposition" and "--restrict-file-names", but nothing changed.
I am hoping for a way to handle the reserved characters so that these files can be downloaded.
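One thing worth trying (an assumption on my part, not a confirmed fix): some servers decode a literal '+' in a URL as a space, so the request may effectively be for "8 .pdf" instead of "8+.pdf", hence the 404. Percent-encoding the reserved characters in list.txt before handing it to wget sidesteps that. A small Python sketch, using the file paths from the question:

# Sketch: percent-encode reserved characters ('+' -> %2B, '}' -> %7D)
# in the path of each URL, leaving '/' intact. Assumes the URLs in
# list.txt are not already percent-encoded.
from urllib.parse import urlsplit, urlunsplit, quote

with open(r"C:\clon\list.txt", encoding="utf-8") as src, \
     open(r"C:\clon\list-encoded.txt", "w", encoding="utf-8") as dst:
    for line in src:
        url = urlsplit(line.strip())
        dst.write(urlunsplit(url._replace(path=quote(url.path, safe="/"))) + "\n")

Then point wget's -i at list-encoded.txt instead of list.txt.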

Possible issue with international characters in objects and/or paths when copying recursively

I've run into a weird problem after uploading a lot of images with gsutil: the uploaded files cannot be seen via the Google Cloud Console, and gsutil itself complains if I try to do a 'gsutil ls' on them. I am 99% sure it is related to the use of "å" or "Å" together with spaces in the directory name.
All uploads were done recursively from a root folder (a large image collection in multiple levels of subdirectories). If I try to upload the files again, gsutil skips them since they are already there, so the upload feature does something; it just isn't working the same way as list and download.
An example:
gsutil cp -R -n /Volumes/Photos/digitalfotografen.dk/2009/2009-05-30\ Søgården\ -\ bryllup/ gs://digitalfotografen/2009/
Skipping existing item: gs://digitalfotografen/2009/2009-05-30 Søgården - bryllup/Søgården 0128.CR2
...
OK - so the files are there, but browsing the directory through the Google Cloud Console shows "No results".
Also:
gsutil ls gs://digitalfotografen/2009/2009-06-27 Søgården - reklamefotos/20090627_IMG_0128.CR2
CommandException: "ls" command does not support "file://" URIs. Did you mean to use a gs:// URI?
I tried escaping the spaces and using quotation marks in different ways, with no luck.
Now, here is the interesting thing:
gsutil cp -R -n /Volumes/Photos/digitalfotografen.dk/2009/2009-05-30\ Søgården\ -\ bryllup/ gs://digitalfotografen/2009/
Copying file:///Volumes/Photos/digitalfotografen.dk/2009/2009-05-30 Søgården - bryllup/Søgården 0128.CR2 [Content-Type=application/octet-stream]...
Here I copied the folder specifically with escaped spaces on the source side, and now the files are uploaded again. This creates a second folder with the same name (at least it appears so in the Cloud Console) and the files are now visible in both folders.
We use three characters outside standard US-ASCII in the Danish character set ("æøå" and the capitals "ÆØÅ"), but the problem only seems to affect "å" and "Å"; the other two, alone or in combination, work fine. My hunch is that "å" and "Å" may translate into something entirely different that throws things off track when gsutil is allowed to handle the directory naming on its own based on the name of the root folder (doing a multi-level recursion), but works when the user specifies the escaped name of the root folder.
This may be a Python issue rather than a gsutil issue, but I am in no way qualified to identify it, since I have very close to zero programming knowledge outside a bit of hodgepodge shell scripting.
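A plausible explanation, offered as an educated guess rather than a diagnosis: the /Volumes path suggests macOS, whose HFS+ filesystem stores file names in decomposed Unicode form (NFD). Of the Danish letters, "æ" and "ø" have no decomposed form but "å" does, which matches the observation that only "å"/"Å" misbehaves: the composed and decomposed spellings are different byte sequences, so objects written under one form will not show up when listed under the other. A short Python illustration:

# Illustration: the composed (NFC) and decomposed (NFD) spellings of
# "Søgården" are different strings. Note that ø does not decompose,
# while å becomes a + combining ring above (U+030A).
import unicodedata

composed = "Søgården"
decomposed = unicodedata.normalize("NFD", composed)

print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 8 9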
We ran into a problem with gsutil in Ubuntu WSL on Windows 10.
The gsutil command works perfectly in the shell, but not when it is included in a shell script:
gsutil -m ls -lr gs://project.appspot.com/
Error:
CommandException: "ls" command does not support "file://" URLs. Did you mean to use a gs:// URL?
A workaround could be to call the script /usr/lib/google-cloud-sdk/platform/gsutil/gsutil directly instead of the /usr/bin/gsutil symlink:
/usr/lib/google-cloud-sdk/platform/gsutil/gsutil -m ls -lr gs://project.appspot.com/
I don't know why, but it works.
Thanks, Marion, for providing us with such an uncommon bug :-)
I know this is an old error, but nevertheless I had a similar issue to the one described above.
CommandException: "ls" command does not support "file://" URLs. Did you mean to use a gs:// URL?
I was using gsutil from Scala code:
import sys.process._

object Main {
  def main(args: Array[String]): Unit = {
    // List the top-level "client" folders in the bucket
    val clients = s"gsutil ls gs://<bucket name>".!!
    // Date 8 days back; note that `date` output ends with a newline
    val beforeDate: String = "date +%Y-%m-%d -d '-8 days'".!!
    val clientList = clients.split("\n").map(f => f.split('/').apply(1)).toList
    for (x <- clientList) {
      // stripLineEnd removes the trailing newline that broke the gs:// path
      val countImg = (s"gsutil -m ls gs://<bucket name>/$x/${beforeDate.stripLineEnd}" #| "wc -l").!!
      println(countImg)
    }
  }
}
What I found was that there was a line-end character in beforeDate; when I stripped that, the error went away. So the error occurs when there is a "special" character in the gs://... path. Be sure to strip variables of any "special" characters.
And all this happened just because I was too lazy to use java.time.LocalDate to generate the beforeDate variable. I hope this helps others who encounter the same error.

wiki dump encoding

I'm using WikiPrep to process the latest wiki dump enwiki-20121101-pages-articles.xml.bz2. Instead of "use Parse::MediaWikiDump;" I replaced it with "use MediaWiki::DumpFile::Compat;" and made the corresponding changes in the code. Then I ran
perl wikiprep.pl -f enwiki-20121101-pages-articles.xml.bz2
I got an error
enwiki-20121101-pages-articles.xml.bz2:1: parser error : Document is empty
BZh91AY&SY±H¦ÂOÿ~Ð`ÿÿÿ¿ÿÿÿ¿ÿÿÿÿÿÿÿÿÿÿ½ÿýþdß8õEnÞ¶zëJ¨Eà®mEÓP|f÷Ô
^
I guess there are some non-UTF-8 characters contained in the dump. So I ran
iconv -f utf8 -t utf8 enwiki-20121101-pages-articles.xml.bz2
And indeed, I got some errors
BZh91AY&SYiconv: illegal input sequence at position 10
So, my question is: what is the encoding format of the wiki dump, and if I wish to convert it to UTF-8, what should I do? Or how should I modify wikiprep.pl to avoid such problems?
Many thanks
-- [solved] I should decompress the file first.
You are running iconv on the compressed (bz2) version of the file, rather than the XML file itself. Uncompress it first.
(Posting borrible's answer so that this resolved question is not listed as unanswered.)
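As a footnote, "BZh91AY&SY" at the start of both error messages is the bzip2 stream header, which confirms the tools were reading compressed bytes. If you'd rather not write the uncompressed XML to disk first, a small Python sketch (file name as in the question) can run the UTF-8 check on the decompressed stream directly:

# Sketch: validate the dump as UTF-8 while decompressing on the fly,
# so the check sees the XML rather than the bzip2 container.
import bz2

path = "enwiki-20121101-pages-articles.xml.bz2"
lineno = 0
try:
    with bz2.open(path, "rt", encoding="utf-8", errors="strict") as f:
        for lineno, _ in enumerate(f, 1):
            pass
    print("whole dump decodes cleanly as UTF-8")
except UnicodeDecodeError as e:
    print(f"invalid UTF-8 shortly after line {lineno}: {e}")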

How to debug Postgres copy command failure

I have around 75k records which I am loading into a Postgres table using the COPY command, and it is failing with an exception:
ERROR: invalid byte sequence for encoding "UTF8": 0xbd
Now I need to find which line has this entry. Is there a way to do this? I am thinking along the lines of enabling some Postgres logging that might help, or any other solution.
Note: I am getting the issue with only one particular file; other files load without issues.
I always seem to get a line number in my error, no matter whether I use COPY or \copy and feed the file via redirection or -f.
ERROR: invalid byte sequence for encoding "UTF8": 0xa3
CONTEXT: COPY z, line 3
If there are only a couple of bad characters and you just want to strip them, you can use iconv (assuming you're on a Unix-like system):
iconv -c --from=utf8 --to=utf8 /tmp/badchars.txt > /tmp/stripped.txt
You could always run diff against the before and after versions if you wanted to see what was stripped out.
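If your client doesn't print the CONTEXT line with the line number, a short Python sketch like this one (the file name is a placeholder) will point at every offending byte rather than stripping it:

# Sketch: report each input line that is not valid UTF-8, with the
# offending byte and column, so the record can be repaired before COPY.
path = "copy_input.txt"

with open(path, "rb") as f:
    for lineno, raw in enumerate(f, 1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as e:
            print(f"line {lineno}, column {e.start + 1}: "
                  f"invalid byte 0x{raw[e.start]:02x}")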