How do I convert an EDI format file to a CSV file using Spark or Scala?
You can use an EDI mapping tool to create a mapping from EDI format to CSV and then generate code from that mapping. That generated code can then be used to convert EDI to CSV in Spark.
For open-source solutions, I think your best bet is EDIReader from BerryWorks. I haven't tried it myself, but apparently this is what Hortonworks recommends, and I'd trust their judgement in the Big Data area. For the sake of disclosure, I'm not involved with either company.
From there, it's still a matter of converting the EDI XML representation to CSV. Given that XML processing is not part of vanilla Spark, your options are rather limited here. Try Databricks spark-xml, maybe?
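For illustration, a minimal sketch of that XML-to-CSV step, assuming the EDI has already been turned into XML (e.g. by EDIReader), that the spark-xml package is on the classpath, and that the row tag and file paths below are placeholders:

import org.apache.spark.sql.SparkSession

object EdiXmlToCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("EdiXmlToCsv").getOrCreate()

    // Each <transaction> element (a placeholder row tag) becomes one row.
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "transaction")
      .load("/data/edi-converted.xml")

    // Nested XML columns would need flattening (select/explode) first;
    // CSV can only hold flat rows.
    df.write
      .option("header", "true")
      .csv("/data/edi-as-csv")

    spark.stop()
  }
}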
I have a data file that contains some Chinese text. I am not able to read/write the data properly. I have used the encoding/charset option while reading and writing, but with no luck. I need to set the encoding/charset option while reading and writing the CSV file.
I have tried the following two options:
.option("encoding", "utf-16")
.option("charset","UTF-16")
How should the encoding be set?
I have had some trouble reading files with Chinese in Scala before, although not on the Spark platform. Are you sure the encoding used is UTF-16? You can open the file with Notepad or an equivalent editor to check. In my case, I finally succeeded in reading the files with the GB2312 encoding.
If that doesn't work, I would recommend trying a pure Scala or Java application (without Spark) to see if reading/writing works with the UTF-16 encoding.
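For reference, a minimal sketch of setting the encoding on both read and write in Spark (pasteable into spark-shell), using GB2312 as suggested above and placeholder paths. One caveat, if memory serves: older Spark releases ignore the encoding option on write and always emit UTF-8, so check your version:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CsvEncoding").getOrCreate()

// "charset" is an alias for "encoding" on the CSV reader.
val df = spark.read
  .option("header", "true")
  .option("encoding", "GB2312")
  .csv("/data/chinese-input.csv")

// Only newer Spark releases honor "encoding" on write; older ones emit UTF-8.
df.write
  .option("header", "true")
  .option("encoding", "GB2312")
  .csv("/data/chinese-output")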
I have a set of large XML files zipped together in a single file, and many such zip files. I was previously using MapReduce to parse the XML with a custom InputFormat and RecordReader, setting splittable=false and reading the zip and XML files.
I am new to Spark. Can someone show me how to prevent Spark from splitting the zip files and how to process multiple zips in parallel, as I was able to do in MR?
AFAIK, the answer to your question is provided here by @holden:
Please take a look! Thanks :)
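Since only the link is given above, here is a minimal sketch of the approach usually suggested: sc.binaryFiles hands each zip file to a single task as one unsplittable record, so Spark parallelizes across zip files rather than within them. The glob path and the UTF-8 assumption are placeholders:

import java.util.zip.ZipInputStream
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ZippedXml").getOrCreate()
val sc = spark.sparkContext

// binaryFiles yields one (path, stream) record per zip, so each zip is
// read whole by a single task; parallelism comes from having many zips.
val xmlDocs = sc.binaryFiles("/data/zips/*.zip").flatMap {
  case (_, stream) =>
    val zis = new ZipInputStream(stream.open())
    val docs = ArrayBuffer[String]()
    var entry = zis.getNextEntry
    while (entry != null) {
      if (!entry.isDirectory && entry.getName.endsWith(".xml")) {
        // Reading until EOF stops at the end of the current zip entry.
        docs += scala.io.Source.fromInputStream(zis, "UTF-8").mkString
      }
      entry = zis.getNextEntry
    }
    zis.close()
    docs
}

println(xmlDocs.count())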
I am getting zero values while using the xlsread command in MATLAB. I am using a real-world dataset taken from the UCI repository, which has both integer and float values.
[Train,textData,rawData] = xlsread('C:\Users\pooja\Documents\project\breastcancer.csv');
I have tried with the xls format too:
[Train,textData,rawData] = xlsread('C:\Users\pooja\Documents\project\breastcancer.xls');
Thanks in advance!
In the wide world of computers there are a lot of data formats, and you need to remember that they differ from one another. Software like MATLAB generally lets you open many types of data format, each of course with its own function.
You can guess that the function xmlread is for reading XML files; if you want to read CSV files, or any other type of file in the world, please (I think this is obvious) do not use xmlread! The same reasoning applies to xlsread and files that are not Excel spreadsheets.
Specifically, to open CSV files MATLAB has csvread. And please do not use csvread to open files that are not CSV.
Firstly, I'm very poor at data pre-processing. I was looking for the WebKB data in libsvm format. After searching a lot on the internet, I came across this data, obtained after stemming and stop-word removal. The format is as follows:
Each line represents a vector; the first word on each line is the class name, followed by a list of words, delimited by spaces, which form the features.
How do I convert such a text file to libsvm format? Is there a Weka or MATLAB tool to construct it?
LibShortText 1.1 is a Python module with utilities for this purpose, plus many extra features. Try it; I think scikit-learn also has this functionality.
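If you'd rather do the conversion yourself, here is a minimal sketch in plain Scala (file names are placeholders), assuming the format described above, "classname word1 word2 ...". It assigns 1-based integer ids to classes and words, then emits "label index:count" pairs in ascending index order, which is what libsvm expects:

import java.io.PrintWriter
import scala.io.Source

object ToLibsvm {
  def main(args: Array[String]): Unit = {
    // One record per line: "classname word1 word2 ..."
    val parsed = Source.fromFile("webkb.txt").getLines().toList
      .map(_.trim).filter(_.nonEmpty)
      .map(_.split("\\s+").toList)
      .collect { case cls :: words if words.nonEmpty => (cls, words) }

    // Stable 1-based ids for class labels and vocabulary words.
    val labels = parsed.map(_._1).distinct.zipWithIndex
      .map { case (c, i) => c -> (i + 1) }.toMap
    val vocab = parsed.flatMap(_._2).distinct.zipWithIndex
      .map { case (w, i) => w -> (i + 1) }.toMap

    val out = new PrintWriter("webkb.libsvm")
    for ((cls, words) <- parsed) {
      // Term counts, emitted in ascending feature-index order.
      val feats = words.groupBy(identity).map { case (w, ws) => (vocab(w), ws.size) }
      val body = feats.toSeq.sortBy(_._1).map { case (i, n) => s"$i:$n" }.mkString(" ")
      out.println(s"${labels(cls)} $body")
    }
    out.close()
  }
}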
Can I read an excel file without using any module?
I tried just reading it like a normal file and it printed binary characters; maybe because of the encoding?
Reading CSV files works normally, though.
Excel files are binary files, and the format of the pre-2007 ones is apparently quite hairy. I believe .xlsx files are actually zipped XML, so unzipping them should yield something human-readable, but I've never tried it. Why don't you want to use a module, though?
Some further reading, if you're interested:
http://joelonsoftware.com/items/2008/02/19.html
http://en.wikipedia.org/wiki/Office_Open_XML_file_formats
Can I read an excel file without using any module?
In theory yes. In practice no.
An Excel XLS file is a binary file within a binary file. The first step would be to parse the Excel BIFF data out of the OLE COM document container. This data isn't necessarily in sequential order.
Then you have to parse the Excel BIFF data, allowing for differences between versions, a shared string table with different encodings, and CONTINUE blocks that map large data records in a parser-unfriendly way.
The Excel XLSX format is a little easier, since it is a collection of XML files in a Zip container (see the sketch below). However, if you aren't using modules, then even that would be a pain.
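To illustrate the "XML files in a Zip container" point (in Scala rather than Perl, using only the JVM's built-in zip support; the file name is a placeholder):

import java.util.zip.ZipFile
import scala.jdk.CollectionConverters._

object PeekXlsx {
  def main(args: Array[String]): Unit = {
    // Open the .xlsx as a plain Zip archive; no spreadsheet module needed.
    val zip = new ZipFile("book1.xlsx")
    // Typical entries: [Content_Types].xml, xl/workbook.xml,
    // xl/worksheets/sheet1.xml, xl/sharedStrings.xml, ...
    for (entry <- zip.entries().asScala) {
      println(s"${entry.getName} (${entry.getSize} bytes)")
    }
    zip.close()
  }
}

Actually extracting cell values from xl/worksheets/sheet1.xml, and resolving them against the shared string table, still amounts to re-implementing a module.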
The Perl modules that deal with Excel files represent hundreds of man-hours of work. Expect to invest a similar amount of work to avoid them.
And why can't you use modules?
You can try figuring out what the format of an Excel spreadsheet looks like, write code for that, and then use it in your program. Maybe write it as a module and submit it to CPAN. Wait a second! There's already a module like that there!
The whole purpose of CPAN is to keep you from having to reinvent the wheel. You need to read an Excel spreadsheet, and someone has done the hard work of figuring out how to do this, and is giving it to you free of charge. A $40,000 value[1], and it's yours for free! The CPAN system makes installing modules fairly simple: you run the cpan command. There's no real reason to avoid modules that can save you hundreds of hours of work.
And what type of modules do you avoid? Is it all modules, or only modules that are not included in the standard distribution? I hate to think you don't use things like File::Copy or Data::Dumper just because they're modules, even though they're included by default in most Perl distributions.
[1] Imagine hiring a team to write code to convert an Excel file so it can be read by a Perl program. They'd have to figure out the ins and outs of the file format, code for all sorts of edge cases, and run it through all sorts of tests to make sure it really works. A rough estimate, if we don't include things like charts, embedded content, and remote data access, would be about 200 man-hours, and only because the format actually has been documented.