val name = "Cory"
"""
|Hi! My name is " + name + " how are you?
""".stripMargin
The portion + name + doesn't get interpreted as code but as literal text. How can I print the value of a variable inside a multiline string?
If you're on 2.10 or later, you can use string interpolation:
scala> s"""
| |Hi! My name is $name how are you?
| """.stripMargin
res0: String =
"
Hi! My name is Cory how are you?
"
For 2.9 or earlier you're stuck with something like this:
scala> ("""
| |Hi! My name is """ + name + """ how are you?
| """).stripMargin
res1: String =
"
Hi! My name is Cory how are you?
"
Note that there are several flavors of string interpolation in Scala; s"..." is the simplest.
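For reference, a quick sketch of the other built-in interpolators (standard Scala, shown here only as an illustration):
val name = "Cory"
val height = 1.8
s"Hi! My name is $name"             // simple substitution
f"$name%s is $height%2.1f metres"   // printf-style, type-checked formatting
raw"Hi! My name is $name\n"         // no escape processing: \n stays literal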
I have a spark dataframe as below:
+------------------------------------------------------------------------+
| domains |
+------------------------------------------------------------------------+
|["0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", |
| "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", |
| "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9", |
| "ab192f73b9db26ec2aca2b776c4398d2","ff9cf0599ae553d227e3f1078957a5d3", |
| "aa717380213450746a656fe4ff4e4072","f3346928db1c6be0682eb9307e2edf38", |
| "806a006b5e0d220c2cf714789828ecf7","9f6f8502e71c325f2a6f332a76d4bebf", |
| "c0cb38016fb603e89b160e921eced896","56ad547c6292c92773963d6e6e7d5e39"] |
+------------------------------------------------------------------------+
The domains column contains a list. I want to convert it into an Array[String], e.g.:
Array("0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9",
"ab192f73b9db26ec2aca2b776c4398d2","ff9cf0599ae553d227e3f1078957a5d3",
"aa717380213450746a656fe4ff4e4072","f3346928db1c6be0682eb9307e2edf38",
"806a006b5e0d220c2cf714789828ecf7","9f6f8502e71c325f2a6f332a76d4bebf",
"c0cb38016fb603e89b160e921eced896","56ad547c6292c92773963d6e6e7d5e39")
I tried the following code but I am not getting the intended results:
DF.select("domains").as[String].collect()
Instead I get this:
[Ljava.lang.String;@7535f28 ...
Any ideas how I can achieve this?
You can first explode your domains column before collecting it, as follows:
import org.apache.spark.sql.functions.{col, explode}
val result: Array[String] = DF.select(explode(col("domains"))).as[String].collect()
You can then print your result array using the mkString method:
println(result.mkString("[", ", ", "]"))
Here you are already getting an Array[String], as expected.
[Ljava.lang.String;@7535f28 --> this is the type descriptor the JVM uses internally in bytecode: [ represents an array and Ljava.lang.String represents the class java.lang.String.
If you want to print the array values as a string, you can use the .mkString() method.
import spark.implicits._
val data = Seq((Seq("0b3642ab5be98c852890aff03b3f83d8","4d7a5a24426749f3f17dee69e13194a9", "9d0f74269019ad82ae82cc7a7f2b5d1b","0b113db8e20b2985d879a7aaa43cecf6", "d095db19bd909c1deb26e0a902d5ad92","f038deb6ade0f800dfcd3138d82ae9a9")))
val df = spark.sparkContext.parallelize(data).toDF("domains")
// df: org.apache.spark.sql.DataFrame = [domains: array<string>]
val array_values = df.select("domains").as[String].collect()
// array_values: Array[String] = Array([0b3642ab5be98c852890aff03b3f83d8, 4d7a5a24426749f3f17dee69e13194a9, 9d0f74269019ad82ae82cc7a7f2b5d1b, 0b113db8e20b2985d879a7aaa43cecf6, d095db19bd909c1deb26e0a902d5ad92, f038deb6ade0f800dfcd3138d82ae9a9])
val string_value = array_values.mkString(",")
print(string_value)
// [0b3642ab5be98c852890aff03b3f83d8, 4d7a5a24426749f3f17dee69e13194a9, 9d0f74269019ad82ae82cc7a7f2b5d1b, 0b113db8e20b2985d879a7aaa43cecf6, d095db19bd909c1deb26e0a902d5ad92, f038deb6ade0f800dfcd3138d82ae9a9]
You can see the same behavior if you create a plain array:
scala> val array_values : Array[String] = Array("value1", "value2")
array_values: Array[String] = Array(value1, value2)
scala> print(array_values)
[Ljava.lang.String;@70bf2681
scala> array_values.foreach(println)
value1
value2
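As a side note, if you want the row's array itself rather than exploded rows, one option is Row.getSeq (a sketch, assuming a single-row frame with an array<string> column as in the question):
// Read ordinal 0 of the first Row as a Seq[String], then convert to Array
val domains: Array[String] = df.select("domains").head().getSeq[String](0).toArray
println(domains.mkString("Array(", ", ", ")"))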
I need to clean a column of a DataFrame which contains trailing whitespace. Something like this:
'17063256 '
'17403492 '
'17390052 '
First, I tried to remove white spaces using trim:
df.withColumn("col1_cleansed", trim(col("col1")))
Then I thought it might be trailing "tabs", so I also tried:
df.withColumn("col1_cleansed", regexp_replace(col("col1"), "\t", ""))
However, neither of these two solutions seems to work.
What is the correct way to remove "tab" characters from a string column in Spark?
The trim and rtrim methods do seem to have problems handling general whitespace (by default they strip only the space character). To remove trailing whitespace, consider using regexp_replace with the regex pattern \\s+$ (with '$' representing end of string), as shown below:
val df = Seq(
"17063256 ", // space
"17403492 ", // tab
"17390052 " // space + tab
).toDF("c1")
df.withColumn("c1_trimmed", regexp_replace($"c1", "\\s+$", "")).show
// Output (prettified)
// +------------+----------+
// | c1|c1_trimmed|
// +------------+----------+
// | 17063256 | 17063256|
// | 17403492 | 17403492|
// |17390052 | 17390052|
// +------------+----------+
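As an aside, if your Spark version is 2.3 or later, the two-argument rtrim accepts the set of characters to strip, which avoids the regex entirely (a sketch under that version assumption):
import org.apache.spark.sql.functions.rtrim

// Strip any mix of spaces and tabs from the right-hand edge
df.withColumn("c1_trimmed", rtrim($"c1", " \t")).show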
Try the UDF below and adjust it to your needs.
val normalize = udf((in: String) => {
  import java.text.Normalizer.{normalize => jnormalize, _}
  // Trim, lowercase, then strip diacritical marks via NFD decomposition
  val cleaned = in.trim.toLowerCase
  val normalized = jnormalize(cleaned, Form.NFD)
    .replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{IsM}\\p{IsLm}\\p{IsSk}]+", "")
  // Drop possessives, transliterate a couple of special letters,
  // and collapse any remaining non-alphanumerics to a single space
  normalized.replaceAll("'s", "")
    .replaceAll("ß", "ss")
    .replaceAll("ø", "o")
    .replaceAll("[^a-zA-Z0-9-]+", " ")
})
df.withColumn("col1_cleansed", normalize(col("col1")))
You can also use regexp_replace to replace the whitespace with "":
df.withColumn("new", regexp_replace($"id", " ",""))
.show(false)
Output:
+------------------------------------------------------+----------+
|id |new |
+------------------------------------------------------+----------+
|'17063256 ' |'17063256'|
|'17403492 '|'17403492'|
|'17390052 ' |'17390052'|
+------------------------------------------------------+----------+
Another way to look at the problem: extract only the required portion from the column. This will work if you are expecting only alphanumeric values and nothing else.
Feel free to modify it to accept numbers only if required; a digits-only variant is sketched after the snippet.
df.withColumn("cleansed_col",regexp_extract(col("input"),"[a-z0-9]+",0))
How do I handle a delimiter that is present in the data when loading a file using a Spark RDD?
My data looks like below:
NAME|AGE|DEP
Suresh|32|BSC
"Sathish|Kannan"|30|BE
How can I convert this into 3 columns like below?
NAME AGE DEP
suresh 32 Bsc
Sathish|Kannan 30 BE
Please see how I tried to load the data:
scala> val rdd = sc.textFile("file:///test/Sample_dep_20.txt",2)
rdd: org.apache.spark.rdd.RDD[String] = hdfs://Hive/Sample_dep_20.txt MapPartitionsRDD[1] at textFile at <console>:27
rdd.collect.foreach(println)
101|"Sathish|Kannan"|BSC
102|Suresh|DEP
scala> val rdd2=rdd.map(x=>x.split("\""))
rdd2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:29
scala> val rdd3=rdd2.map(x=>
| {
| var strarr = scala.collection.mutable.ArrayBuffer[String]()
| for(v<-x)
| {
| if(v.startsWith("\"") && v.endsWith("\""))
| strarr +=v.replace("\"","")
| else if(v.contains(","))
| strarr ++=v.split(",")
| else
| strarr +=v
| }
| strarr
| }
| )
rdd3: org.apache.spark.rdd.RDD[scala.collection.mutable.ArrayBuffer[String]] = MapPartitionsRDD[3] at map at <console>:31
scala> rdd3.collect.foreach(println)
ArrayBuffer(101|, Sathish|Kannan, |BSC)
ArrayBuffer(102|Suresh|DEP)
Maybe you need to explicitly define " as the quote character (it is the default for the csv reader, but maybe not in your case?). Adding .option("quote","\"") to the options when reading your .csv file should work.
scala> val inputds = Seq("Suresh|32|BSC","\"Satish|Kannan\"|30|BE").toDS()
inputds: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val outputdf = spark.read.option("header",false).option("delimiter","|").option("quote","\"").csv(inputds)
outputdf: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]
scala> outputdf.show(false)
+-------------+---+---+
|_c0 |_c1|_c2|
+-------------+---+---+
|Suresh |32 |BSC|
|Satish|Kannan|30 |BE |
+-------------+---+---+
Defining the quote character makes DataFrameReader ignore delimiters found inside quoted strings; see the Spark API doc here.
EDIT
If you want to play hard and still use plain RDDs, then try modifying your split() function like this:
val rdd2=rdd.map(x=>x.split("\\|(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))
It uses positive look-ahead to ignore | delimiters found inside quotes, and saves you from doing string manipulations in your second .map.
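A quick way to convince yourself the regex works (a minimal sketch using the quoted line from the question):
val line = "\"Sathish|Kannan\"|30|BE"
// Split on | only when an even number of quotes follows, i.e. outside quotes
line.split("\\|(?=([^\"]*\"[^\"]*\")*[^\"]*$)").foreach(println)
// "Sathish|Kannan"
// 30
// BE
Note the quotes remain on the first field; your existing replace("\"", "") step would still be needed to strip them.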
I am a Scala beginner. I am trying to find the count of null values in a column of a table and add the column name and count as a key-value pair in a Map. The code below doesn't work as expected; please guide me on how I can modify it to make it work.
def nullCheck(databaseName: String, tableName: String) = {
  var map = scala.collection.mutable.Map[String, Int]()
  validationColumn = Array(col1, col2)
  for (i <- 0 to validationColumn.length) {
    val nullVal = spark.sql(s"select count(*) from $databaseName.$tableName where validationColumn(i) is NULL")
    if (nullval == 0)
      map(validationColumn(i)) = nullVal
    map
  }
}
The function should return the pairs ((col1, count), (col2, count)) as a Map.
This can be done by creating a dynamic SQL string and then mapping the result. Your approach reads the same data multiple times.
Here is the solution. I used an "example" DataFrame.
scala> val inputDf = Seq((Some("Sam"),None,200),(None,Some(31),30),(Some("John"),Some(25),25),(Some("Harry"),None,100)).toDF("name","age","not_imp_column")
scala> inputDf.show(false)
+-----+----+--------------+
|name |age |not_imp_column|
+-----+----+--------------+
|Sam |null|200 |
|null |31 |30 |
|John |25 |25 |
|Harry|null|100 |
+-----+----+--------------+
Our validation columns are name and age, where we shall count nulls. We put them in a List:
scala> val validationColumns = List("name","age")
And we create a SQL string that will drive this whole calculation:
scala> val sqlStr = "select " + validationColumns.map(x => "sum(" + x + "_count) AS " + x + "_sum" ).mkString(",") + " from (select " + validationColumns.map(x => "case when " + x + " = '$$' then 1 else 0 end AS " + x + "_count").mkString(",") + " from " +" (select" + validationColumns.map(x => " nvl( " + x +",'$$') as " + x).mkString(",") + " from example_table where " + validationColumns.map(x => x + " is null ").mkString("or ") + " ) layer1 ) layer2 "
It resolves to:
"select sum(name_count) AS name_sum,sum(age_count) AS age_sum from (select case when name = '$$' then 1 else 0 end AS name_count,case when age = '$$' then 1 else 0 end AS age_count from (select nvl( name,'$$') as name, nvl( age,'$$') as age from example_table where name is null or age is null ) layer1 ) layer2 "
Now we create a temporary view of our DataFrame:
inputDf.createOrReplaceTempView("example_table")
The only thing left to do is execute the SQL and create the Map, which is done by:
validationColumns zip spark.sql(sqlStr).collect.map(_.toSeq).flatten.toList toMap
And the result:
Map(name -> 1, age -> 2) // obviously you can make it type safe
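If you'd rather stay in the DataFrame API, here is a single-pass alternative (a hedged sketch using the same inputDf and validationColumns as above, not part of the original answer):
import org.apache.spark.sql.functions.{col, count, when}

// One aggregate row: for each column, count the rows where it is null
val row = inputDf
  .select(validationColumns.map(c => count(when(col(c).isNull, 1)).alias(c)): _*)
  .head()

val nullCounts: Map[String, Long] = row.getValuesMap[Long](validationColumns)
// nullCounts: Map(name -> 1, age -> 2)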
EDITED according to WayneH's grammar
Here's what I have in my grammar file:
grammar pfinder;
options {
language = Java;
}
sentence
: ((words | pronoun) SPACE)* ((words | pronoun) ('.' | '?'))
;
words
: WORDS {System.out.println($text);};
pronoun returns [String value]
: sfirst {$value = $sfirst.value; System.out.println($sfirst.text + '(' + $sfirst.value + ')');}
| ssecond {$value = $ssecond.value; System.out.println($ssecond.text + '(' + $ssecond.value + ')');}
| sthird {$value = $sthird.value; System.out.println($sthird.text + '(' + $sthird.value + ')');}
| pfirst {$value = $pfirst.value; System.out.println($pfirst.text + '(' + $pfirst.value + ')');}
| psecond {$value = $psecond.value; System.out.println($psecond.text + '(' + $psecond.value + ')');}
| pthird{$value = $pthird.value; System.out.println($pthird.text + '(' + $pthird.value + ')');};
sfirst returns [String value] : ('i' | 'me' | 'my' | 'mine') {$value = "s1";};
ssecond returns [String value] : ('you' | 'your'| 'yours'| 'yourself') {$value = "s2";};
sthird returns [String value] : ('he' | 'she' | 'it' | 'his' | 'hers' | 'its' | 'him' | 'her' | 'himself' | 'herself') {$value = "s3";};
pfirst returns [String value] : ('we' | 'us' | 'our' | 'ours') {$value = "p1";};
psecond returns [String value] : ('yourselves') {$value = "p2";};
pthird returns [String value] : ('they'| 'them'| 'their'| 'theirs' | 'themselves') {$value = "p3";};
WORDS : LETTER*;// {$channel=HIDDEN;};
SPACE : (' ')?;
fragment LETTER : ('a'..'z' | 'A'..'Z');
And here's what I have in a Java test class:
import java.util.Scanner;
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import java.util.List;
public class test2 {
public static void main(String[] args) throws RecognitionException {
String s;
Scanner input = new Scanner(System.in);
System.out.println("Eter a Sentence: ");
s=input.nextLine().toLowerCase();
ANTLRStringStream in = new ANTLRStringStream(s);
pfinderLexer lexer = new pfinderLexer(in);
TokenStream tokenStream = new CommonTokenStream(lexer);
pfinderParser parser = new pfinderParser(tokenStream);
parser.pronoun();
}
}
What do I need to put in the test file so that it will display all the pronouns in a sentence and their respective values (s1, s2, ...)?
In case you are trying to do some sort of high-level analysis of spoken/written language, you might consider using some sort of natural language processing tool. For example, TagHelper Tools will tell you which elements are pronouns (and verbs, and nouns, and adverbs, and other esoteric grammatical constructs). (THT is the only tool of that sort that I'm familiar with, so don't take that as a particular endorsement of awesomeness).
Fragments don't create tokens, and placing them in parser rules will not give desirable results.
On my test box, this produced (I think!) the desired result:
program :
PRONOUN+
;
PRONOUN :
'i' | 'me' | 'my' | 'mine'
| 'you' | 'your'| 'yours'| 'yourself'
| 'he' | 'she' | 'it' | 'his' | 'hers' | 'its' | 'him' | 'her' | 'himself' | 'herself'
| 'we' | 'us' | 'our' | 'ours'
| 'yourselves'
| 'they'| 'them'| 'their'| 'theirs' | 'themselves'
;
WS : ' ' { $channel = HIDDEN; };
WORD : ('A'..'Z'|'a'..'z')+ { $channel = HIDDEN; };
In Antlrworks, a sample "i kicked you" returned the tree structure: program -> [i, you].
I feel compelled to point out that ANTLR is overkill for stripping the pronouns out of a sentence; consider using a regular expression. Also note that this grammar is case sensitive (only lowercase pronouns match). Expanding WORD to consume everything except your dictionary of PRONOUNs (punctuation, etc.) may be a bit tedious and will require sanitization of input.
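To illustrate that point, a regular-expression sketch (written in Scala here purely for illustration; the pronoun list is abbreviated):
// Tokenize into letter runs, then keep only words in the pronoun dictionary
val pronouns = Set("i", "me", "my", "mine", "you", "your", "yours",
  "he", "she", "it", "we", "us", "they", "them")

def findPronouns(sentence: String): List[String] =
  "[a-zA-Z]+".r.findAllIn(sentence.toLowerCase).filter(pronouns).toList

// findPronouns("I kicked you and them") == List("i", "you", "them")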
--- Edit: in response to the OP's follow-up:
I have altered the original grammar to make parsing easier. The new grammar is:
grammar pfinder;
options {
backtrack=true;
output = AST;
}
tokens {
PROGRAM;
}
program :
(WORD* p+=PRONOUN+ WORD*)*
-> ^(PROGRAM $p*)
;
PRONOUN :
'i' | 'me' | 'my' | 'mine'
| 'you' | 'your'| 'yours'| 'yourself'
| 'he' | 'she' | 'it' | 'his' | 'hers' | 'its' | 'him' | 'her' | 'himself' | 'herself'
| 'we' | 'us' | 'our' | 'ours' | 'yourselves'
| 'they'| 'them'| 'their'| 'theirs' | 'themselves'
;
WS : ' ' { $channel = HIDDEN; };
WORD : ('A'..'Z'|'a'..'z')+;
I'll explain the changes:
Backtracking is now required to solve the parser rule program. Perhaps there's a better way to write it that doesn't require backtracking, but this is the first thing that popped into my mind.
An imaginary token PROGRAM has been defined to group our pronouns.
Each matched pronoun is added to the ANTLR variable $p and rewritten into the AST under the imaginary token.
The interpreter code may now use a CommonTree to collect the matched pronouns.
The following is written in C# (I don't know Java) but I wrote it with the intent that you'll be able to read and understand it.
// Assumes the usual usings: System.Collections, Antlr.Runtime, Antlr.Runtime.Tree
static object[] ReadTokens( string text )
{
ArrayList results = new ArrayList();
pfinderLexer Lexer = new pfinderLexer(new Antlr.Runtime.ANTLRStringStream(text));
pfinderParser Parser = new pfinderParser(new CommonTokenStream(Lexer));
// syntaxTree is imaginary token {PROGRAM},
// its children are the pronouns collected by $p in grammar.
CommonTree syntaxTree = Parser.program().Tree as CommonTree;
if ( syntaxTree == null ) return null;
foreach ( object pronoun in syntaxTree.Children )
{
results.Add(pronoun.ToString());
}
return results.ToArray();
}
Calling ReadTokens("i kicked you and them") returns the array ["i", "you", "them"].
I think you need to learn more about lexer rules in ANTLR. Lexer rules start with an uppercase letter and generate tokens for the stream the parser will look at. Lexer fragment rules will not generate a token for the stream but will help other lexer rules generate tokens; look at your lexer rules WORDS and LETTER (LETTER is not a token but does help WORDS create a token).
Now, when a text literal is put into a parser rule (a rule whose name starts with a lowercase letter), that text literal is also a valid token that the lexer will identify and pass (at least when you use ANTLR; I have not used any similar tools, so I can't answer for them).
The next thing I noticed is that your 's' and 'pronoun' rules appear to be the same thing, so I commented out the 's' rule and put everything into the 'pronoun' rule.
The last thing is to learn how to put actions into the grammar; you have some in the 's' rule setting the return value. I made the pronoun rule return a string value so that, if you wanted the actions in your 'sentence' rule, you would easily be able to accomplish your "-i pronoun" comment/answer.
Since I do not know what your exact results should be, I played with your grammar, made some slight modifications, reorganized it (moving what I thought were parser rules to the top while keeping all lexer rules at the bottom), and put in some actions that I think will show you what you need. There could be several different ways to accomplish this, and I don't think my solution is perfect for every possible result you might want, but here is a grammar I was able to get working in ANTLRWorks:
grammar pfinder;
options {
language = Java;
}
sentence
: ((words | pronoun) SPACE)* ((words | pronoun) ('.' | '?'))
;
words
: WORDS {System.out.println($text);};
pronoun returns [String value]
: sfirst {$value = $sfirst.value; System.out.println($sfirst.text + '(' + $sfirst.value + ')');}
| ssecond {$value = $ssecond.value; System.out.println($ssecond.text + '(' + $ssecond.value + ')');}
| sthird {$value = $sthird.value; System.out.println($sthird.text + '(' + $sthird.value + ')');}
| pfirst {$value = $pfirst.value; System.out.println($pfirst.text + '(' + $pfirst.value + ')');}
| psecond {$value = $psecond.value; System.out.println($psecond.text + '(' + $psecond.value + ')');}
| pthird{$value = $pthird.value; System.out.println($pthird.text + '(' + $pthird.value + ')');};
//s returns [String value]
// : exp=sfirst {$value = "s1";}
// | exp=ssecond {$value = "s2";}
// | exp=sthird {$value = "s3";}
// | exp=pfirst {$value = "p1";}
// | exp=psecond {$value = "p2";}
// | exp=pthird {$value = "p3";}
// ;
sfirst returns [String value] : ('i' | 'me' | 'my' | 'mine') {$value = "s1";};
ssecond returns [String value] : ('you' | 'your'| 'yours'| 'yourself') {$value = "s2";};
sthird returns [String value] : ('he' | 'she' | 'it' | 'his' | 'hers' | 'its' | 'him' | 'her' | 'himself' | 'herself') {$value = "s3";};
pfirst returns [String value] : ('we' | 'us' | 'our' | 'ours') {$value = "p1";};
psecond returns [String value] : ('yourselves') {$value = "p2";};
pthird returns [String value] : ('they'| 'them'| 'their'| 'theirs' | 'themselves') {$value = "p3";};
WORDS : LETTER*;// {$channel=HIDDEN;};
SPACE : (' ')?;
fragment LETTER : ('a'..'z' | 'A'..'Z');
I think this grammar will show you how to accomplish what you are trying to do; it will require modification no matter what your end result is.
Good luck.
I think you only have to change one line in your test class,
parser.pronoun();
to:
parser.sentence();
You might want to change a few other things in the grammar as well:
SPACE : ' ';
sentence: (words | pronoun) (SPACE (words | pronoun))* ('.' | '?'); // then you might want to put a rule between sentence and words/pronoun.