Encoding Option in Scala

I have a data file which contains some Chinese data. I am not able to read/write the data properly. I have used the encoding/charset option while reading and writing, but with no luck. I need to set the encoding/charset option when reading and writing a CSV file.
I have tried the following two options:
.option("encoding", "utf-16")
.option("charset","UTF-16")
How should the encoding be set?

I have had some trouble reading files with Chinese text in Scala before, although not on the Spark platform. Are you sure the encoding used is UTF-16? You can open the file with Notepad or an equivalent editor to check. In my case, I finally succeeded in reading the files with the GB2312 encoding.
If that doesn't work, I would recommend trying a pure Scala or Java application (without Spark) to see whether reading/writing works with the UTF-16 encoding.
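For reference, here is a minimal sketch of how the encoding option is usually passed to the CSV reader and writer, plus a Spark-free sanity check. The paths are placeholders, and whether the writer honours the option depends on your Spark version:

import scala.io.Source
import org.apache.spark.sql.SparkSession

// Build (or reuse) a Spark session for the test.
val spark = SparkSession.builder().appName("encoding-test").getOrCreate()

// Read the CSV with an explicit encoding; try "GB2312" or "GBK" if UTF-16 turns out to be wrong.
val df = spark.read
  .option("header", "true")
  .option("encoding", "UTF-16")
  .csv("/path/to/input.csv")        // placeholder path

// Write it back with the same option (honoured for CSV by newer Spark versions).
df.write
  .option("header", "true")
  .option("encoding", "UTF-16")
  .csv("/path/to/output")           // placeholder path

// Spark-free sanity check: can plain Scala decode the file with this charset?
Source.fromFile("/path/to/input.csv", "UTF-16").getLines().take(5).foreach(println)

If the plain Scala read already produces garbage, the problem is the charset guess rather than Spark.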

Related

UFT: How to modify a file with encoding "UCS-2 LE BOM"

I'm having a little problem automating a test with UFT (VBScript). I need to open a CSV file, modify it, and then save it again. The problem is that when I open the file in Notepad++, it shows the encoding as "UCS-2 LE BOM". This file is then injected into our system for processing, and if I change the encoding to ANSI, the injection fails because the file seems to lose its column structure, and I'm not sure it is readable by the system anymore.
From what I understand, it's not possible to do this directly with VBScript, but does anyone have an idea how I could do it with PowerShell, for example? Is there a Notepad++ command line option to change the encoding of a file?
Thanks
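Whatever tool ends up doing the work, the key point is to read and write the file in the same UTF-16LE ("UCS-2 LE BOM") encoding rather than converting it to ANSI. As a point of reference, here is a minimal sketch in Scala (the language used elsewhere in this thread); the file names and the edit itself are placeholders:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Decode as UTF-16LE; the leading BOM becomes the character \uFEFF and is kept in the string.
val bytes   = Files.readAllBytes(Paths.get("input.csv"))           // placeholder file name
val content = new String(bytes, StandardCharsets.UTF_16LE)

// Apply whatever modification the test needs.
val modified = content.replace("OLD_VALUE", "NEW_VALUE")           // placeholder edit

// Write it back in the same encoding, so the BOM and the column structure survive.
Files.write(Paths.get("output.csv"), modified.getBytes(StandardCharsets.UTF_16LE))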

Convert EDI format to CSV using Scala Spark?

How can I convert an EDI format file to a CSV file using Spark or Scala?
You can use a tool like this to create a mapping from the EDI format to CSV and then generate code in that tool. That code can then be used to convert EDI to CSV in Spark.
For open source solutions, I think your best bet is EDI Reader from BerryWorks. I haven't tried it myself, but apparently this is what Hortonworks recommends, and I'd trust their judgement in the Big Data area. For the sake of disclosure, I'm not involved with either.
From there, it's still a matter of converting the EDI XML representation to CSV. Given that XML processing is not part of vanilla Spark, your options are again rather limited here. Try Databricks spark-xml, maybe?
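To sketch that last step: assuming the EDI file has already been turned into XML by a tool such as EDI Reader, spark-xml can load the XML and Spark can write it out as CSV. The row tag and paths below are assumptions, and the spark-xml package has to be on the classpath (e.g. via the --packages option of spark-submit):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("edi-xml-to-csv").getOrCreate()

// Load the XML produced by the EDI-to-XML step; rowTag marks the element
// that corresponds to one output row (placeholder name here).
val xmlDf = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "transaction")
  .load("/path/to/edi-output.xml")      // placeholder path

// Nested elements come back as struct/array columns and would need to be
// flattened (select/explode) before they fit into a flat CSV.
xmlDf.write
  .option("header", "true")
  .csv("/path/to/csv-output")           // placeholder path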

Postgres documentation in GNU Info format

Postgres is an open-source project and uses DocBook as the default format for its documentation. At first glance it looks like a tree of *.sgml files in the doc directory of the repository.
There are several pre-defined conversion output formats, but unfortunately Emacs' native format (GNU Info) is not among them.
Is it possible to get the Postgres documentation as a postgres.info.gz file?
That's basically nothing more than a text conversion problem. I believe the right solution here would be to write an XSL stylesheet that converts the XML in your SGML files to Texinfo source code, but here is the next best thing:
pandoc is a parser for different textual document file formats. It has a reader for DocBook and a writer for Texinfo. That should get you started.
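The rough pipeline would be DocBook source -> Texinfo -> Info -> gzip. Driven from Scala only to keep this thread's examples in one language, and with placeholder file names, a sketch might look like this (pandoc's DocBook reader expects XML, so the SGML sources may need converting to XML first):

import scala.sys.process._

// DocBook -> Texinfo -> Info -> gzip; requires pandoc, makeinfo and gzip on the PATH.
val steps = Seq(
  Seq("pandoc", "-f", "docbook", "-t", "texinfo", "-o", "postgres.texi", "postgres.xml"),
  Seq("makeinfo", "--no-split", "-o", "postgres.info", "postgres.texi"),
  Seq("gzip", "postgres.info")
)

steps.foreach { cmd =>
  val exit = cmd.!                                       // run each step in order
  require(exit == 0, s"step failed: ${cmd.mkString(" ")}")
}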

Can SAP detect encoding and line endings?

How to read ASCII files with mixed line endings (Windows and Unix) and UTF-16 Big Endian files in SAP?
Background: our ABAP application must read some of our configuration files. Most of them are ASCII files (normal text files) and one is Unicode Big Endian. So far, the files were read using ASCII mode and things were fine during our test.
However, the following happened at customers: the configuration files are located on a Linux terminal, so it has Unix Line Endings. People read the configuration files via FTP or similar and transport it to the Windows machine. On the Windows machine, they adapt some of the settings. Depending on the editor, our customers now have mixed line endings.
Those mixed line endings cause trouble when reading the file in ASCII mode in ABAP: the file is read up to the point where the line endings change, plus a bit more, but not the whole file.
I suggested reading the file in BINARY mode, removing all the CRs, and then replacing all the remaining LFs with CR LF. That worked fine - except for the UTF-16 BE file, for which this approach results in a mess. So the whole thing was reverted.
I'm not an ABAP developer, I just have to test this. With my background in other programming languages, I must assume there is a solution, and I am inclined to reject a "CAN'T FIX" resolution of this bug.
You can use CL_ABAP_FILE_UTILITIES=>CHECK_FOR_BOM to determine which encoding the file has, and then use the constants of class CL_ABAP_CHAR_UTILITIES to process it further.
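The logic behind that answer, sketched here in Scala rather than ABAP just to keep this thread's examples in one language: sniff the BOM to pick the encoding, decode the bytes with that encoding, and only then normalise the line endings on the decoded text. The file name and the no-BOM fallback are assumptions:

import java.nio.charset.{Charset, StandardCharsets}
import java.nio.file.{Files, Paths}

// Pick a charset from the BOM; files without a BOM are treated as plain single-byte text here.
def detectCharset(bytes: Array[Byte]): Charset =
  if (bytes.length >= 2 && bytes(0) == 0xFE.toByte && bytes(1) == 0xFF.toByte) StandardCharsets.UTF_16BE
  else if (bytes.length >= 2 && bytes(0) == 0xFF.toByte && bytes(1) == 0xFE.toByte) StandardCharsets.UTF_16LE
  else StandardCharsets.ISO_8859_1

val bytes = Files.readAllBytes(Paths.get("config.txt"))        // placeholder file name
val text  = new String(bytes, detectCharset(bytes)).stripPrefix("\uFEFF")

// Normalise mixed Windows/Unix endings on the decoded text, not on the raw bytes.
val normalised = text.replace("\r\n", "\n").replace("\n", "\r\n")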

Importing SPSS file in SAS - Discrepancies in Language

I am having trouble importing an SPSS file into SAS. The code I am using is:
proc import datafile = "C:\SAS\Germany.sav"
out=test
dbms = sav
replace;
run;
All the data are imported, but the problem is that some of the variable values come out slightly differently. For instance, in the SPSS file the value of variable "A" is "KÖL", but when imported into SAS it becomes "KÃ–L".
What I am thinking is that the problem might be that the .sav file contains some German words that SAS cannot interpret correctly.
Is there a command that loads a library or something in SAS so that it can understand language-specific values?
P.S. I have also found a similar post here: Importing Polish character file in SAS
but the answer is not really clear.
SAS by default is often installed using the standard Windows-Latin-1 code page, often (incorrectly) called "ASCII". SAS itself can handle any encoding, but if it defaults to Windows-Latin-1, it won't handle some Unicode translations.
If you're using SAS 9.3 or 9.4, and possibly earlier versions of v9, you probably have a Unicode version of SAS installed. Look in
\SasFoundation\9.x\nls\
In there you'll probably find "en" (if you're using it in English, anyway), which usually uses the default Windows-Latin-1 code page. You'll also find (if they were installed) Unicode-compatible versions. This is really just a configuration setting, but it's important enough to get right that they supply a pre-baked config file.
In my case I have a "u8" folder under nls, which I can then use to enable Unicode character encoding on my datasets and when I read in data.
One caveat: I don't know for sure how well the SPSS import engine handles Unicode/MBCS characters. This is a separate issue; if you run the Unicode version of SAS and it still has problems, that may be the cause, and you may need to either export your SPSS file differently or talk to SAS tech support.