spaCy convert and train: UTF-8 encoding CLI issues

I am training a NER model on a foreign language with a lot of Unicode characters in it. I create an IOB file and use the spacy convert CLI to make it spaCy-compatible so I can train on that set. However, the file is turned into US-ASCII format (with the Unicode characters escaped). I saw a thread here saying that spaCy handles this automatically, but how does this work at inference time? I also couldn't find where spaCy loads the data using ujson anymore.
So my question is: does spaCy handle this automatically? And what is the best way to feed my text to spaCy at inference time?

The escaped UTF-8 will be read back in correctly by the srsly library (which uses a fork of ujson internally). If you're worried, you can double-check with srsly.read_json("file.json"). At inference time, you provide Python strings (Python 3 str) as input.
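If you want to convince yourself, here is a minimal round-trip sketch (the file name and the sample text are placeholders):

import srsly

# Write JSON containing non-ASCII text; by default the characters are
# escaped to plain ASCII on disk, like the output of spacy convert.
srsly.write_json("sample.json", {"text": "dziękuję"})

# Reading it back decodes the \uXXXX escapes into real Unicode characters.
data = srsly.read_json("sample.json")
print(data["text"])  # dziękuję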

Related

Convert EDI format to CSV using Scala Spark?

How can I convert an EDI format file to a CSV file using Spark or Scala?
You can use a tool like this to create a mapping from EDI format to CSV and then generate code in that tool. That code can then be used to convert EDI to CSV in Spark.
For open-source solutions, I think your best bet is EDI Reader from BerryWorks. I haven't tried it myself, but apparently this is what Hortonworks recommends, and I'd trust their judgement in the Big Data area. For the sake of disclosure, I'm not involved with either.
From there, it's still a matter of converting the EDI XML representation to CSV. Given that XML processing is not part of vanilla Spark, your options are again rather limited here. Try Databricks spark-xml, maybe?
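If you do end up on spark-xml, a minimal sketch could look like the following (shown in PySpark, though the options are identical from Scala; the file name and the rowTag value are hypothetical, and the spark-xml package is assumed to be on the classpath, e.g. via --packages com.databricks:spark-xml_2.12:0.17.0):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("edi-to-csv").getOrCreate()

# "transaction" stands in for whatever element the EDI-to-XML step
# emits for a single record.
df = (spark.read
      .format("xml")
      .option("rowTag", "transaction")
      .load("edi_output.xml"))

# CSV cannot hold nested structs, so flatten or select leaf columns
# before writing.
df.write.option("header", "true").csv("edi_csv_out")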

Encoding Option in Scala

I have a data file which contains some Chinese data, and I am not able to read/write the data properly. I have used the encoding/charset option while reading and writing, but with no luck. I need to set the encoding/charset option while reading and writing a CSV file.
I have tried the following two options:
.option("encoding", "utf-16")
.option("charset","UTF-16")
How should the encoding be set?
I have had some trouble reading files with Chinese in Scala before, although not on the Spark platform. Are you sure the encoding used is UTF-16? You can open the file with Notepad or an equivalent editor to check. In my case, I finally succeeded in reading the files with the GB2312 encoding.
If it doesn't work, I would recommend trying a pure Scala or Java application (without Spark) to see if reading/writing works for the UTF-16 encoding.
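As a starting point, here is a minimal sketch that tries a few candidate encodings with Spark's CSV reader (shown in PySpark, though the options are the same from Scala; the file name and the encoding list are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("encoding-check").getOrCreate()

# GB2312 is the encoding that finally worked in my case.
for enc in ("UTF-16", "UTF-8", "GB2312"):
    df = (spark.read
          .option("encoding", enc)
          .option("header", "true")
          .csv("data.csv"))
    print(enc)
    df.show(3, truncate=False)  # eyeball whether the Chinese text is legible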

Importing SPSS file in SAS - Discrepancies in Language

I am having trouble importing an SPSS file into SAS. The code I am using is:
proc import datafile = "C:\SAS\Germany.sav"
    out = test
    dbms = sav
    replace;
run;
All the data are imported, but the problem is that some of the values of the variables come out with slightly different spellings. So, for instance, in the SPSS file the value of variable "A" is "KÖL", but when imported into SAS it becomes "KÃ–L".
What I am thinking is that the problem might be that the .sav file contains some German words that SAS cannot understand.
Is there a command that loads a library or something in SAS so that it can understand language-specific values?
P.S. I have also found a similar post here: Importing Polish character file in SAS, but the answer is not really clear.
SAS is, by default, often installed using the standard Windows-Latin-1 code page, often (incorrectly) called "ASCII". SAS itself can handle any encoding, but if it defaults to Windows-Latin-1, it won't handle some Unicode translations.
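You can see the mechanism in miniature with a quick sketch (Python here, purely for illustration; the sample value comes from the question):

# "Ö" encoded as UTF-8 is the two bytes 0xC3 0x96; decoding those bytes
# as Windows-1252 (Windows-Latin-1) yields "Ã" and "–".
s = "KÖL"
mangled = s.encode("utf-8").decode("cp1252")
print(mangled)  # KÃ–L, the same discrepancy seen after the import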
If you're using SAS 9.3 or 9.4, and possibly earlier versions of v9, you probably have a Unicode version of SAS installed. Look in
\SasFoundation\9.x\nls\
In there you'll probably find "en" (if you're using it in English, anyway), which usually uses the default Windows-Latin-1 code page. You'll also find (possibly, if they were installed) Unicode-compatible versions. This is really just a configuration setting, but it's important enough to get right that they supply a pre-baked config file.
In my case I have a "u8" folder under nls, which I can then use to enable Unicode character encoding on my datasets and when I read in data.
One caveat: I don't know for sure how well the SPSS import engine handles Unicode/MBCS characters. That is a separate issue; if you run the Unicode version of SAS and it still has problems, that may be the cause, and you may need to either export your SPSS file differently or talk to SAS tech support.

Generate a libsvm formatted data from text file

Firstly, I'm very poor at data pre-processing. I was looking for the WebKB data in libsvm format. After searching a lot over the internet, I came across this data, obtained after stemming and stop-word removal. The format is as follows:
Each line represents a vector; the first word on each line is the class name, followed by a list of words that form the features, delimited by spaces.
How do I convert such a text file to libsvm format? Is there any Weka or Matlab tool to construct it?
libshorttext 1.1 is a Python module with utilities for this purpose, plus many extra features. Try it; I think the scikit-learn package also has this functionality.
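If you go the scikit-learn route, a minimal sketch might look like this (the input file name is hypothetical, and using raw word counts as features is an assumption):

from sklearn.datasets import dump_svmlight_file
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

labels, docs = [], []
with open("webkb.txt", encoding="utf-8") as f:  # hypothetical input file
    for line in f:
        cls, _, words = line.strip().partition(" ")
        labels.append(cls)
        docs.append(words)

X = CountVectorizer().fit_transform(docs)  # sparse bag-of-words counts
y = LabelEncoder().fit_transform(labels)   # class names mapped to integers
dump_svmlight_file(X, y, "webkb.libsvm")   # writes libsvm/svmlight format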

XMP toolbox for Matlab

Has anyone ever heard of something that might facilitate working with XMP metadata in Matlab?
For instance, EXIF metadata can be read simply by using the exifread command -
output = exifread(filename);
I've found this thread, but it seems to be dead.
Currently I am thinking about the following options:
Writing a MEX file using the C++ XMP SDK
Calling Java routines using the Java XMP SDK
To summarize, the question is:
Do you have any idea how XMP can be read/written in Matlab?
XMP is just XML, so you can use any MATLAB XML toolbox. My personal favourite is xml_io_tools.
If you want to use the SDK to avoid having to manually interpret what the bits of the XML mean, then of your two options the Java one sounds preferable. Calling Java from MATLAB is straightforward, and you avoid the build hassle that MEX entails.
I have found the answer. The best way is to download ExifTool and any Matlab JSON parser. With this you can extract the metadata from any file format, including .DNG, .XMP, .JPEG, and .TIFF.
Step 1: Extract the info into a temporary JSON file using
system(['exiftool -struct -j ' fileName '>' tempFile]);
Step 2: Call the JSON parser on tempFile.
Step 3: You now have the data in a Matlab struct.
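Since exiftool does all the heavy lifting, the same approach works from outside Matlab too; here is a quick Python sketch of the identical exiftool-to-JSON pipeline (the file name and the tag looked up at the end are placeholders):

import json
import subprocess

# -j emits JSON and -struct preserves nested XMP structures.
out = subprocess.run(
    ["exiftool", "-struct", "-j", "photo.dng"],
    capture_output=True, check=True, text=True,
).stdout
metadata = json.loads(out)[0]  # exiftool returns a one-element array per file
print(metadata.get("XMPToolkit"))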