How to replace double quotes with a newline character in spark scala - scala

I am new to Spark. I have a huge file which has data like this:
18765967790#18765967790#T#20130629#00#31#2981546 " "18765967790#18765967790#T#20130629#19#18#3240165 " "18765967790#18765967790#T#20130629#18#18#1362836
13478756094#13478756094#T#20130629#31#26#2880701 " "13478756094#13478756094#T#20130629#19#18#1230206 " "13478756094#13478756094#T#20130629#00#00#1631440
40072066693#40072066693#T#20130629#79#18#1270246 " "40072066693#40072066693#T#20130629#79#18#3276502 " "40072066693#40072066693#T#20130629#19#07#3321860
I am trying to replace " " with a newline character so that my output looks like this:
18765967790#18765967790#T#20130629#00#31#2981546
18765967790#18765967790#T#20130629#19#18#3240165
18765967790#18765967790#T#20130629#18#18#1362836
13478756094#13478756094#T#20130629#31#26#2880701
13478756094#13478756094#T#20130629#19#18#1230206
13478756094#13478756094#T#20130629#00#00#1631440
40072066693#40072066693#T#20130629#79#18#1270246
40072066693#40072066693#T#20130629#79#18#3276502
40072066693#40072066693#T#20130629#19#07#3321860
I have tried with-
val fact1 = sc.textFile("s3://abc.txt").map(x=>x.replaceAll("\"","\n"))
But this doesn't seem to be working. Can someone tell me what I am missing?
Edit 1: My final output will be a DataFrame with a schema imposed after splitting on the delimiter "#".
I am getting the output below:
scala> fact1.take(5).foreach(println)
18765967790#18765967790#T#20130629#00#31#2981546
18765967790#18765967790#T#20130629#19#18#3240165
18765967790#18765967790#T#20130629#18#18#1362836
13478756094#13478756094#T#20130629#31#26#2880701
13478756094#13478756094#T#20130629#19#18#1230206
13478756094#13478756094#T#20130629#00#00#1631440
40072066693#40072066693#T#20130629#79#18#1270246
40072066693#40072066693#T#20130629#79#18#3276502
40072066693#40072066693#T#20130629#19#07#3321860
I am getting extra blank lines, which makes it harder to create the DataFrame. This might seem simple here, but the file is huge and the rows containing " " are long. In the question I have shown only two occurrences of " " per row, but there can be more than 40-50 of them.

There is more than one quote between the records, so replacing each quote creates multiple line breaks. You either need to remove the extra quotes before the replace, or remove the empty lines after it:
.map(x=>x.replaceAll("\"","\n").replaceAll("(?m)^[ \t]*\r?\n", ""))
Reference: Remove all empty lines

You might be missing the implicit Encoders. Try the code below. Only the last expression in the lambda is returned, so the whole " " sequence is replaced in a single call:
spark.read.text("src/main/resources/doubleQuoteFile.txt").map(row => {
row.getString(0).replace("\" \"","\n") // replace each " " with a newline
})(org.apache.spark.sql.Encoders.STRING)
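One thing worth checking with either answer: a string that contains "\n" inside map is still a single element, so the number of records does not change. Since the goal is a DataFrame with one row per record, splitting each line and flattening the result may be closer to what is needed. The following is a minimal sketch, assuming the records themselves contain no whitespace or double quotes; the helper names splitRecords and fields are hypothetical, and the Spark calls are shown only as comments.

```scala
object SplitRecords {
  // Hypothetical helper: split one physical line into its logical records.
  // Assumes records are separated by runs of double quotes and whitespace,
  // such as the ` " "` sequences in the question.
  def splitRecords(line: String): Array[String] =
    line.split("[\"\\s]+").filter(_.nonEmpty)

  // Split one record into its '#'-delimited fields.
  // The limit -1 keeps trailing empty fields instead of dropping them.
  def fields(record: String): Array[String] = record.split("#", -1)
}

// With Spark (not run here), flatMap turns every logical record into its own element:
//   val records = sc.textFile("s3://abc.txt").flatMap(SplitRecords.splitRecords)
//   val rows    = records.map(SplitRecords.fields)
```

Because the split happens per input line, this should behave the same whether a line holds 2 separators or 50.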

Related

How to modify this code in Scala by using Brackets

I have a spark dataframe in Databricks, with an ID and 200 other columns (like a pivot view of data). I would like to unpivot these data to make a tall object with half of the columns, where I'll end up with 100 rows per id. I'm using the Stack function and using specific column names.
The question is this: I'm new to Scala and similar languages, and unfamiliar with best practices on how to use brackets when literals are spread over multiple rows as below. Can I replace the double quotes and + with something else?
%scala
val unPivotDF = hiveDF.select($"id",
expr("stack(100, " +
"'cat1', cat1, " +
"'cat2', cat2, " +
"'cat3', cat3, " +
//...
"'cat99', cat99, " +
"'cat100', cat100) as (Category,Value)"))
.where("Value is not null")
You can use """ to define multiline strings like:
"""
some string
over multiple lines
"""
In your case this will only work assuming that the string you're writing tolerates new lines.
Considering how repetitive it is, you could also generate the string with something like:
(1 to 100)
.map(i => s"'cat$i', cat$i")
.mkString(",")
(To be adapted by the reader to exact needs)
Edit: and to answer your initial question: brackets won't help in any way here.
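As a rough illustration of the generated-string approach, the whole stack expression can be assembled from the column count. This is a sketch only: the helper name stackExpr is hypothetical, and the "cat" prefix and the (Category,Value) aliases are taken from the question.

```scala
object StackExpr {
  // Builds "stack(n, 'cat1', cat1, ..., 'catN', catN) as (Category,Value)".
  def stackExpr(n: Int): String = {
    val pairs = (1 to n).map(i => s"'cat$i', cat$i").mkString(", ")
    s"stack($n, $pairs) as (Category,Value)"
  }
}

// Usage with the question's DataFrame (not run here):
//   hiveDF.select($"id", expr(StackExpr.stackExpr(100))).where("Value is not null")
```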

Issue with eval_in_page - Trying to interpolate an array

my @para_text = $mech->xpath('/html/body/form/table/tbody/tr[2]/td[3]/table/tbody/tr[2]/td/table/tbody/tr[3]/td/div/div/div', type => $mech->xpathResult('STRING_TYPE'));
#BELOW IS JUST TO MAKE SURE THE ABOVE CAPTURED THE CORRECT TEXT
print "Look here: @para_text";
$mech->click_button( id => "lnkHdrreplyall");
$mech->eval_in_page('document.getElementsByName("txtbdy")[0].value = "@para_text"');
In the last line of my code I need to put the contents of the @para_text array into a text box on a website. However, everything from "document" to the end of the line needs to be surrounded by ' ' to work. Obviously this doesn't allow interpolation, as that would require " ". Any ideas on what to do?
To define a string that itself contains double quotes and also interpolates variable values, you can use qq/ ... /, the alternative form of the double-quoted string, where you choose the delimiter yourself so that the double quote " is no longer special.
So you can write
$mech->eval_in_page(qq/document.getElementsByName("txtbdy")[0].value = "@para_text"/);

how to create comma separated value in progress openEdge

newbie question here.
I need to create a list, but my problem is: what is the best way to avoid starting with a comma?
eg:
output to /usr2/appsrv/test/Test.txt.
def var dTextList as char.
for each emp no-lock:
dTextList = dTextList + ", " + emp.Name.
end.
put unformatted dTextList skip.
output close.
then my end result is
, jack, joe, brad
what is the best way to get rid of the leading comma?
thank you
Here's one way:
ASSIGN
dTextList = dTextList + ", " WHEN dTextList > ""
dTextList = dTextList + emp.Name
.
This does it without any conditional logic:
for each emp no-lock:
csv = csv + emp.Name + ",".
end.
csv = right-trim( csv, "," ).
or you can do this:
for each emp no-lock:
csv = substitute( "&1,&2", csv, emp.Name ).
end.
csv = trim( csv, "," ).
Which also has the advantage of playing nicely with unknown values (the ? value).
TRIM() trims both sides, LEFT-TRIM() trims only leading characters, and RIGHT-TRIM() trims only trailing characters.
My vanilla list:
output to /usr2/appsrv/test/Test.txt.
def var dTextList as char no-undo.
for each emp no-lock:
dTextList = substitute( "&1, &2", dTextList, emp.Name ).
end.
put unformatted substring( dTextList, 3 ) skip.
output close.
substitute prevents unknown values from wiping out the list.
Keep the list delimiter check outside of the loop.
Generally leave the list delimiter prefixed unless the prefix really needs to go, as it does when outputting the list.
If you use delimited lists often, you may want to consider creating a list class to remove this noise from your code, so that you can simply add an item to a list and export a list without tinkering with these details every time.
I usually do
ASSIGN dTextList = dTextList + (if dTextList = '' then '' else ',') + emp.name.
My colleague came up with this:
dTextList = substitute ("&1&3&2", dTextList, emp.Name, min(dTextList,",")).
But it is cool to see the various ways to do this. Thank you for all the responses.
This results in no leading comma (delimiter) and no fiddling with trim/substring/etc
def var cDelim as char.
def var dTextList as char.
cDelim = ''.
for each emp no-lock:
dTextList = dTextList + cDelim + emp.Name.
cDelim = ','.
end.

Format a variable in iReport in a string with multiple fields

I have a text field that has the following expression:
$F{casNo} + " Total " + $P{chosenUom} + ": " + $V{total_COUNT}
casNo is a string, chosenUom is a string. total_COUNT is a sum variable of doubles. The total_COUNT variable displays, but it's got 8 - 10 decimal places (1.34324255234), all I need is something along the lines of 1.34.
Here's what I tried already:
$F{casNo} + " Total " + $P{chosenUom} + ": " + new DecimalFormat("0.00").format($V{total_COUNT}).toString()
Any help would be appreciated
For now I'm just doing basic math, but I'm hoping for a real solution, not a workaround
((int)($V{total_COUNT}*100.0))/100.0
You can format the number inline by using:
new DecimalFormat("###0.00").format(YOUR NUMBER)
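To see what such a pattern produces outside the report, here is a small standalone sketch. It is written in Scala rather than as a report expression, the helper name twoDecimals is hypothetical, and pinning the symbols to Locale.ROOT is an added assumption so that the decimal separator does not depend on the JVM's default locale.

```scala
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

object TwoDecimals {
  // Locale.ROOT symbols keep "." as the decimal separator on every machine.
  private val symbols = new DecimalFormatSymbols(Locale.ROOT)

  // Formats a value with at least one integer digit and exactly two decimals.
  def twoDecimals(value: Double): String =
    new DecimalFormat("0.00", symbols).format(value)
}

// TwoDecimals.twoDecimals(1.34324255234) returns "1.34"
```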
You might split the text field into two, one containing everything but the $V{total_COUNT}, and the second containing only $V{total_COUNT}, but with the Pattern property set to something like "#0.00".
You'd have to get a bit creative with layout, though, to prevent unwanted word-wrapping and spacing; for example, first text field could be wide and right-aligned, while text field containing the count could be left-aligned and wide enough to accommodate the formatted number.

How Do I Convert Pipe Delimited to Comma Delimited with Escaping

I am fairly new to Scala and I need to convert a pipe-delimited string to a comma-delimited one, with the values wrapped in quotes and any quotes escaped by "\".
In C# I would probably do it like this:
string st = "\"" + oldStr.Replace("\"", "\\\\\"").Replace("|", "\",\"") + "\""
I haven't validated that actually works but that is the basic idea behind what I am trying to do. Is there a way to do this easily in scala?
Similarly:
val st = "\"" + oldStr.replaceAll("\"", "\\\\\"").replaceAll("\\|", "\",\"") + "\""
Could also be:
val st = oldStr.replaceAll("\"","\\\\\"").split("\\|").mkString("\"","\",\"","\"")
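One caveat with the split variant: split("\\|") drops trailing empty strings, so an input ending in "|" loses its last field. A per-field sketch that avoids this is shown below. The function name pipeToQuotedCsv is hypothetical, and it uses the literal String.replace instead of the regex-based replaceAll so that only one level of escaping is needed.

```scala
object PipeToCsv {
  // Quote every pipe-separated field and escape embedded double quotes with a backslash.
  def pipeToQuotedCsv(line: String): String =
    line
      .split("\\|", -1) // limit -1 keeps trailing empty fields
      .map(field => "\"" + field.replace("\"", "\\\"") + "\"")
      .mkString(",")
}

// Usage: PipeToCsv.pipeToQuotedCsv(oldStr)
```

Like the answers above, this does not escape backslashes that are already present in the input; whether that matters depends on what consumes the output.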