Spark - split csv file using scala

I have a csv file with the following schema:
(Id, OwnerUserId, CreationDate, ClosedDate, Score, Title, Body)
And I would like to split the data using:
val splitComma = file.map(x => x.split(","))
val splitComma = file.map(x => x.split(",(?![^<>]*</>)(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
Neither of them worked. Below is a sample of my csv file:
90,58,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for TortoiseSVN?,"<p>Are there any really good tutorials explaining branching and merging with Apache Subversion? </p>
<p>All the better if it's specific to TortoiseSVN client.</p>
"
120,83,2008-08-01T15:50:08Z,NA,21,ASP.NET Site Maps,"<p>Has anyone got experience creating <strong>SQL-based ASP.NET</strong> site-map providers?</p>
<p>I've got the default XML file <code>web.sitemap</code> working properly with my Menu and <strong>SiteMapPath</strong> controls, but I'll need a way for the users of my site to create and modify pages dynamically.</p>
<p>I need to tie page viewing permissions into the standard <code>ASP.NET</code> membership system as well.</p>
"
180,2089740,2008-08-01T18:42:19Z,NA,53,Function for creating color wheels,"<p>This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate <code>N</code> colors, that are as distinguishable as possible where <code>N</code> is a parameter.</p>
"
What's the best way to work with this?

You can't load CSVs with multi-line values (i.e. newlines within cells) using Spark: the underlying Hadoop InputFormat splits the file on newlines, disregarding the CSV's encapsulating double quotes, so there isn't much Spark can do about it (see the discussion here).
Unfortunately, that means you'll have to find some way of "cleaning" your data (e.g. replacing the embedded newlines with some placeholder) before writing it to disk or loading it with Spark.
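A rough sketch of that "clean first" idea, illustrative only (sc is assumed to be the SparkContext and the file name is made up): read each file as a single string, replace newlines that fall inside a quoted field, and only then split into records and fields.

// Illustrative sketch: collapse newlines inside quoted fields before splitting.
val records = sc.wholeTextFiles("questions.csv").flatMap { case (_, text) =>
  val sb = new StringBuilder
  var inQuotes = false
  text.foreach {
    case '"'              => inQuotes = !inQuotes; sb += '"'
    case '\n' if inQuotes => sb += ' '   // newline inside a quoted Body field
    case c                => sb += c
  }
  sb.toString.split("\n")               // one element per logical CSV row
}
// split on commas that sit outside double quotes
val fields = records.map(_.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1))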

Related

Migrating from itext2 to itext7

Years ago, I wrote a small app in itext2 to gather reports on a weekly basis and concatenate them into one PDF. The app used com.lowagie.text.pdf.PdfCopy to copy and merge the PDFs. And it worked fine. Performed exactly as expected.
A few weeks ago I looked into migrating the application to itext7. To that end, I used the copyPagesTo method of com.itextpdf.kernel.pdf.PdfDocument. When run on the same file set, this produces warnings like:
WARN PdfNameTree - Name "section.1" already exists in the name tree; old value will be replaced by the new one.
When I click on the link to "section.1" in the first document of the merged PDF, I am taken to "section.1" of the last document. Not what I expected and not what happens when using the itext2 app. In the PDFs produced by itext2, if I click on the link to "section.1" of the first document in the combined PDF, I am taken to section 1 of the first document.
There is a hint in the Javadocs for copyPagesTo, saying:
"If outlines destination names are the same in different documents, all such outlines will lead to a single location in the resultant document. In this case iText will log a warning. This can be avoided by renaming destinations names in the source document."
There is, however, no explanation of how this should be done. I find it odd that this should be necessary in itext7 when it wasn't in itext2.
Is there a simple way to get around this problem?
I've also tried the Sejda desktop app and it produces correct results, but I would prefer to automate the process through a batch script.
My guess is iText 2 didn't even know it might be a problem.
If iText can't deduplicate destination names, the procedure is roughly:
Follow /Catalog -> /Names -> /Dests in each document to find the destination name tree.
Deduplicate the names by adding suffixes. Remember that a name with a suffix added might be equal to an existing name in the same or another document. Be careful!
Now you can rewrite the destination name trees. Since you have only used suffixes, you can do this in place - the lexicographic ordering of the names is unaltered so the search tree structure is not broken.
Now, rewrite destination links in each PDF for the new names. For example any dictionary entry with key /Dest, or any /D in a /GoTo action.
Now, after all this preprocessing, the files will merge without name clashes.
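To make the renaming step concrete, here is a minimal sketch of that logic only (no PDF I/O; the names and structure are my own, not cpdf's or iText's): for every destination name in every document, append a numeric suffix whenever the name is already taken, and keep the old-to-new mapping so the name trees and the /Dest and /GoTo entries can be rewritten afterwards.

import scala.collection.mutable

// one map per source document: old destination name -> renamed destination name
def dedupeDestinations(perDocumentNames: Seq[Seq[String]]): Seq[Map[String, String]] = {
  val taken = mutable.Set[String]()
  perDocumentNames.map { names =>
    names.map { name =>
      var candidate = name
      var i = 0
      while (taken.contains(candidate)) { i += 1; candidate = s"$name-$i" }
      taken += candidate
      name -> candidate
    }.toMap
  }
}

// dedupeDestinations(Seq(Seq("section.1"), Seq("section.1")))
//   -> Seq(Map("section.1" -> "section.1"), Map("section.1" -> "section.1-1"))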
(I know all this because I've just implemented it for my own PDF software. It's slightly hairy stuff, but not intractable.)
If you like, I can provide a devel version of cpdf with this functionality, if you would like to test it.

Scala Spark ignoring first-line header and loading all data from 2nd line onwards

I have a Scala Spark notebook on an AWS EMR cluster that loads data from an AWS S3 bucket. Previously, I had standard code like the following:
var stack = spark.read.option("header", "true").csv("""s3://someDirHere/*""")
This loaded multiple directories of files (.txt.gz) into a Spark DataFrame object called stack.
Recently, new files were added to this directory. The content of the new files looks the same (I downloaded a couple of them and opened them using both Sublime Text and Notepad++, to check whether some invisible, non-unicode characters were disrupting the interpretation of the first line as a header). The new data files cause my code above to ignore the first header line and instead interpret the second line as the header. I have tried a few variations without luck; here are a few examples of things I tried:
var stack = spark.read.option("quote", "\"").option("header", "true").csv("""s3://someDirHere/*""") // header not detected
var stack = spark.read.option("escape", "\"").option("header", "true").csv("""s3://someDirHere/*""") // header not detected
var stack = spark.read.option("escape", "\"").option("quote", "\"").option("header", "true").csv("""s3://someDirHere/*""") // header not detected
I wish I could share the files but it contains confidential information. Just wondering if there are some ideas as to what I can try.
How many files are there? If there are too many to check manually, you could try to read them without the header option. Your expectation is that the header matches everywhere, right?
If that's truly the case, this should have a count of 1:
spark.read.csv("path").limit(1).dropDuplicates().count()
If not, you could see what different headers there are like this:
spark.read.csv("path").limit(1).dropDuplicates().show()
Remember it's important not to use the header option, so you can operate on it.
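If that check is awkward across many compressed files, a rough Scala alternative (the path is the one from the question, everything else is illustrative) is to read each file whole, keep only its first line, and count the distinct headers:

// Sketch: compare the first line of every file; more than one distinct value
// means at least one file has a different (or missing) header line.
val headers = spark.sparkContext
  .wholeTextFiles("s3://someDirHere/*")
  .map { case (_, text) => text.split("\n", 2)(0) }
  .distinct()
headers.collect().foreach(println)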

How to make a section optional when mapped to optional data in a Word OpenXml Part?

I'm using OpenXml SDK to generate word 2013 files. I'm running on a server (part of a server solution), so automation is not an option.
Basically I have an xml file that is output from a backend system. Here's a very simplified example:
<my:Data xmlns:my="https://schemas.mycorp.com">
  <my:Customer>
    <my:Details>
      <my:Name>Customer Template</my:Name>
    </my:Details>
    <my:Orders>
      <my:Count>2</my:Count>
      <my:OrderList>
        <my:Order>
          <my:Id>1</my:Id>
          <my:Date>19/04/2017 10:16:04</my:Date>
        </my:Order>
        <my:Order>
          <my:Id>2</my:Id>
          <my:Date>20/04/2017 10:16:04</my:Date>
        </my:Order>
      </my:OrderList>
    </my:Orders>
  </my:Customer>
</my:Data>
Then I use Word's XML Mapping pane to map this data to content controls:
I simply duplicate the word file, and write new Xml data when generating new files.
This is working as expected. When I update the xml part, it reflects the data from my backend.
However, there's a case that does not work: if a customer has no orders, the template content is kept in the document. The xml data is:
<my:Data xmlns:my="https://schemas.mycorp.com">
  <my:Customer>
    <my:Details>
      <my:Name>Some customer</my:Name>
    </my:Details>
    <my:Orders>
      <my:Count>0</my:Count>
      <my:OrderList>
      </my:OrderList>
    </my:Orders>
  </my:Customer>
</my:Data>
(see the empty order list).
In Word, the xml pane reflects the correct data (meaning no Order node):
But as you can see, the template content is still here.
Basically, I'd like to hide the order list when there's no order (or at least an empty table).
How can I do that?
PS: If it can help, I uploaded the word and xml files, and a small PowerShell script that injects the data: repro.zip
Thanks for sharing your files so we can better help you.
I had a difficult time trying to solve your problem with your existing Word Content Controls, XML files and the PowerShell script that added the XML to the Word document. I found what seemed to be Microsoft's VSTO example solution to your problem, but I couldn't get this to work cleanly.
I was however able to write a simple C# console application that generates a Word file based on your XML data. The OpenXML code to generate the Word file was generated code from the Open XML Productivity Tool. I then added some logic to read your XML file and generate the second table rows dynamically depending on how many orders there are in the data. I have uploaded the code for you to use if you are interested in this solution. Note: The xml data file should be in c:\temp and the generated word files will be in c:\temp also.
Another added bonus to this solution is if you were to add all of the customer data into one XML file, the application will create separate word files in your temp directory like so:
customer_<name1>.docx
customer_<name2>.docx
customer_<name3>.docx
etc.
Here is the document generated from the first xml file
Here is the document generated from the second xml file with the empty row
Hope this helps.
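For what it's worth, the decision of how many rows to emit is easy to illustrate outside of the answerer's C# code. This sketch is mine, not from the uploaded solution; it assumes the scala-xml library and reuses the c:\temp location mentioned above, and it just reads the order list so a generator can decide whether to emit table rows at all:

import scala.xml.XML

val data   = XML.loadFile("c:/temp/customer.xml")   // illustrative path
val name   = (data \ "Customer" \ "Details" \ "Name").text
val orders = data \ "Customer" \ "Orders" \ "OrderList" \ "Order"
// one (Id, Date) pair per order; an empty sequence means the order table
// (or at least its data rows) should simply not be generated
val rows = orders.map(o => ((o \ "Id").text, (o \ "Date").text))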

Trying to figure out what {s: ;} tags mean and where they come from

I am working on migrating posts from the RightNow infrastructure to another service called ZenDesk. I noticed that whenever users added files or even URL links, the xml data I pull from RightNow contains a lot of weird codes like this:
{s:3:""url"";s:45:""/files/56f5be6c1/MUG_presso.pdf"";s:4:""name"";s:27:""MUG presso.pdf"";s:4:""size"";s:5:""2.1MB"";}
It wasn't too hard to write something that parses them and makes normal urls and links, but I was just wondering if this is something specific to the RightNow service, or if it is a more general tag system. I tried googling for this but am getting some weird results, so I thought Stack Overflow might have someone who has run into this one.
So, anyone know what these {s ;} tags are called and if there are any particular tools to use to read them?
Any answers appreciated!
This resembles partial PHP serialized data, as returned by the serialize() call. It looks like someone may have turned each " into "", which could prevent it from parsing properly. If it's wrapped with text like this before the {s: section, it's almost definitely PHP.
a:6:{i:1;a:10:{s:
These letters/numbers mean things like "an array with six elements follows", "a string of length 20 follows", etc.
You can use any PHP instance with unserialize() to handle the data. If those double-quotes are indeed returned by the API, you might need to replace :"" and ""; with " before parsing.
Parsing modules exist for other languages like Python. You can find more information in this answer.
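If you only need to pull the values out without a PHP runtime (roughly what you describe already doing), a quick-and-dirty extraction works too. This Scala sketch is illustrative only and assumes the doubled-quote variant shown above:

val raw = """{s:3:""url"";s:45:""/files/56f5be6c1/MUG_presso.pdf"";s:4:""name"";s:27:""MUG presso.pdf"";s:4:""size"";s:5:""2.1MB"";}"""
// each field looks like s:<length>:""<value>"" -- grab the <value> part
val field  = "s:\\d+:\"\"(.*?)\"\"".r
val values = field.findAllMatchIn(raw).map(_.group(1)).toList
// values: List(url, /files/56f5be6c1/MUG_presso.pdf, name, MUG presso.pdf, size, 2.1MB)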

What is the best file parsing solution for converting files?

I am looking for the best solution for custom file parsing for our enterprise import routines. I want to basically change one file format into a standard file format and have one routine that imports that data into the database. I need to be able to create custom scripts for each client, since it's difficult to get customers to comply with a standard or template format. I have looked at PowerShell and IronPython to do this so far, but I am not sure this is the route I want to go. I have also looked at some tools such as Talend, a drag-and-drop style tool that may or may not give me the flexibility I want. We are a .NET shop and have created custom code to do this in the past, but I need something that is quicker to create than coding custom parsing functions each time we get a new file format in.
Depending on the complexity and variability of your work, you should consider an ETL tool like SSIS (SQL Server Integration Services).
Python is wonderful for this kind of thing. That's why we use it. Each new customer transfer is a new adventure, and Python gives us the flexibility to respond quickly.
Edit: All Python scripts that read files are "custom file parsers". Without an actual example of your file format, it's not sensible to provide a detailed example.
with open( "some file", "r" ) as source:
for line in source:
process( line )
That's about all there is to a "custom file parser". If you're parsing .csv or .xml files, then Python has modules for that. If you're parsing fixed-format files, you'd use string slicing operations. If you're parsing other files (X12? JSON? YAML?) you'll need appropriate parsers.
Tab-Delim.
from collections import namedtuple
RecordLayout = namedtuple('RecordLayout', ['field1', 'field2', 'field3', ...])
def process(aLine):
    record = RecordLayout(*aLine.split('\t'))   # unpack the tab-separated fields
    ...
Fixed Layout.
from collections import namedtuple
RecordLayout = namedtuple('RecordLayout', ['field1', 'field2', 'field3', ...])
def process(aLine):
    fields = (aLine[:10], aLine[10:20], aLine[20:30], ...)
    record = RecordLayout(*fields)   # unpack the fixed-width slices
    ...