Scala: read user input as single string

Scala: read user input as single string - scala

I need get all user input in stdin as single string in scala 2.12 (supposed the data would copy-pasted by single action), something like this:
please copy data:
word1
word2
word3
And I need get string with following data:
val str = "word1\nword2\nword3"
my current approach is not working, just hanging forever:
import scala.collection.JavaConverters._
val scanner: Iterator[String] = new Scanner(System.in).asScala
val sb = new StringBuilder
while (scanner.hasNext) {
sb.append(scanner.next())
}
val str = sb.toString()
Although this can print the input:
import scala.collection.JavaConverters._
val scanner: Iterator[String] = new Scanner(System.in).asScala
scanner foreach println
I'm looking for idiomatic way of doing the job

Try
LazyList
.continually(StdIn.readLine())
.takeWhile(_ != null)
.mkString("\n")
as inspired by https://stackoverflow.com/a/18924749/5205022
On my machine I could terminate the input with ^D.
In Scala 2.12 replace LazyList with Stream.

Related

How do you parse string from arraybuffer to double using Scala?

I'm trying to map string to double from an ArrayBuffer that I parsed through Playframework but I keep getting the following error:
Exception in thread "main" java.lang.NumberFormatException: For input string: ""0.04245800""
I'm not sure why it's doing this and I'm new to Scala coming from Python background.
import org.apache.http.client.methods.HttpGet
import play.api.libs.json._
import org.apache.http.impl.client.DefaultHttpClient
object main extends App {
val url = "https://api.binance.com/api/v1/aggTrades?symbol=ETHBTC"
val httpClient = new DefaultHttpClient()
val httpResponse = httpClient.execute(new HttpGet(url))
val entity = httpResponse.getEntity()
val content = ""
if (entity !=null) {
val inputStream = entity.getContent()
val result = io.Source.fromInputStream(inputStream).getLines.mkString
inputStream.close
println("REST API: " + url)
val json: JsValue = Json.parse(result)
var prices = (json\\"p")
println(prices.map(_.toString()).map(_.toDouble))
}
}

If you know for sure your list contains only strings you can cast them like this, and use the 'original' value to get the Double value from:
println(prices.map(_.as[JsString].value.toDouble))
As JsString is not a String you cannot call toDouble on that.
Just for completeness: If you are not certain your list contains only strings you should add an instanceof check or pattern matching.

the generation of parse tree of StanfordCoreNLP is stuck

When I use the StanfordCoreNLP to generate the parse using bigdata on Spark, one of the tasks had stuck for a long time. I looked for the error, it showed as follows:
at edu.stanford.nlp.ling.CoreLabel.(CoreLabel.java:68)
  at edu.stanford.nlp.ling.CoreLabel$CoreLabelFactory.newLabel(CoreLabel.java:248)
  at edu.stanford.nlp.trees.LabeledScoredTreeFactory.newLeaf(LabeledScoredTreeFactory.java:51)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:27)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
  at edu.stanford.nlp.parser.lexparser.Debinarizer.transformTreeHelper(Debinarizer.java:34)
the relevant codes I think are as follows:
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation
import edu.stanford.nlp.util.CoreMap
import scala.collection.JavaConversions._
object CoreNLP {
def transform(Content: String): String = {
val v = new CoreNLP
v.runEnglishAnnotators(Content);
v.runChineseAnnotators(Content)
}
}
class CoreNLP {
def runEnglishAnnotators(inputContent: String): String = {
var document = new Annotation(inputContent)
val props = new Properties
props.setProperty("annotators", "tokenize, ssplit, parse")
val coreNLP = new StanfordCoreNLP(props)
coreNLP.annotate(document)
parserOutput(document)
}
def runChineseAnnotators(inputContent: String): String = {
var document = new Annotation(inputContent)
val props = new Properties
val corenlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties")
corenlp.annotate(document)
parserOutput(document)
}
def parserOutput(document: Annotation): String = {
val sentences = document.get(classOf[SentencesAnnotation])
var result = ""
for (sentence: CoreMap <- sentences) {
val tree = sentence.get(classOf[TreeAnnotation])
//output the tree to file
result = result + "\n" + tree.toString
}
result
}
}
My classmate said the data used to test is recurse and thus the NLP is endlessly run. I don't know whether it's true.

If you add props.setProperty("parse.maxlen", "100"); to your code that will set the parser to not parse sentences longer than 100 tokens. That can help prevent crash issues. You should experiment with the best max sentence length for your application.

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read input file and compare with set "123,200,300" if match found, gives matching data
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong

Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use UDF in withCoulmn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()

For simple answer, why we are making it so complex? In this case we don't require UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD

What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.

Scala function does not return a value

I think I understand the rules of implicit returns but I can't figure out why splithead is not being set. This code is run via
val m = new TaxiModel(sc, file)
and then I expect
m.splithead
to give me an array strings. Note head is an array of strings.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
class TaxiModel(sc: SparkContext, dat: String) {
val rawData = sc.textFile(dat)
val head = rawData.take(10)
val splithead = head.slice(1,11).foreach(splitData)
def splitData(dat: String): Array[String] = {
val splits = dat.split("\",\"")
val split0 = splits(0).substring(1, splits(0).length)
val split8 = splits(8).substring(0, splits(8).length - 1)
Array(split0).union(splits.slice(1, 8)).union(Array(split8))
}
}

foreach just evaluates expression, and do not collect any data while iterating. You probably need map or flatMap (see docs here)
head.slice(1,11).map(splitData) // gives you Array[Array[String]]
head.slice(1,11).flatMap(splitData) // gives you Array[String]

Consider also a for comprehension (which desugars in this case into flatMap),
for (s <- head.slice(1,11)) yield splitData(s)
Note also that Scala strings are equipped with ordered collections methods, thus
splits(0).substring(1, splits(0).length)
proves equivalent to any of the following
splits(0).drop(1)
splits(0).tail

How to close enumerated file?

Say, in an action I have:
val linesEnu = {
val is = new java.io.FileInputStream(path)
val isr = new java.io.InputStreamReader(is, "UTF-8")
val br = new java.io.BufferedReader(isr)
import scala.collection.JavaConversions._
val rows: scala.collection.Iterator[String] = br.lines.iterator
Enumerator.enumerate(rows)
}
Ok.feed(linesEnu).as(HTML)
How to close readers/streams?

There is a onDoneEnumerating callback that functions like finally (will always be called whether or not the Enumerator fails). You can close the streams there.
val linesEnu = {
val is = new java.io.FileInputStream(path)
val isr = new java.io.InputStreamReader(is, "UTF-8")
val br = new java.io.BufferedReader(isr)
import scala.collection.JavaConversions._
val rows: scala.collection.Iterator[String] = br.lines.iterator
Enumerator.enumerate(rows).onDoneEnumerating {
is.close()
// ... Anything else you want to execute when the Enumerator finishes.
}
}

The IO tools provided by Enumerator give you this kind of resource management out of the box—e.g. if you create an enumerator with fromStream, the stream is guaranteed to get closed after running (even if you only read a single line, etc.).
So for example you could write the following:
import play.api.libs.iteratee._
val splitByNl = Enumeratee.grouped(
Traversable.splitOnceAt[Array[Byte], Byte](_ != '\n'.toByte) &>>
Iteratee.consume()
) compose Enumeratee.map(new String(_, "UTF-8"))
def fileLines(path: String): Enumerator[String] =
Enumerator.fromStream(new java.io.FileInputStream(path)).through(splitByNl)
It's a shame that the library doesn't provide a linesFromStream out of the box, but I personally would still prefer to use fromStream with hand-rolled splitting, etc. over using an iterator and providing my own resource management.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Scala: read user input as single string - scala

Try LazyList .continually(StdIn.readLine()) .takeWhile(_ != null) .mkString("\n") as inspired by https://stackoverflow.com/a/18924749/5205022 On my machine I could terminate the input with ^D. In Scala 2.12 replace LazyList with Stream.

Related

How do you parse string from arraybuffer to double using Scala?

the generation of parse tree of StanfordCoreNLP is stuck

String filter using Spark UDF

Scala function does not return a value

How to close enumerated file?

Categories

Resources