We have a CSV file with a column (name: Host) containing data like mb-web-scp-01, kl-mem-cpp-01.
After splitting on the dash, I need to create a new column (name: Host2) containing the second part (web and mem from the data above).
import scala.io.Source._
If you want to handle errors as options, you can do this (output from Ammonite REPL):
# "mb-web-scp-01".split("-").drop(1).headOption
res1: Option[String] = Some(web)
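To apply the same split across a whole file, a minimal sketch using scala.io.Source could look like the following; the file name and the assumption that Host is the first column are illustrative, not from the question:
import scala.io.Source

// A minimal sketch: read the CSV, derive Host2 from the Host column, and print the result.
// Assumes a header row and that Host is the first column (both are assumptions).
val lines  = Source.fromFile("hosts.csv").getLines().toList
val header = lines.head
val rows   = lines.tail

val withHost2 = (header + ",Host2") :: rows.map { row =>
  val host  = row.split(",")(0)
  val host2 = host.split("-").drop(1).headOption.getOrElse("")
  s"$row,$host2"
}

withHost2.foreach(println)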
I'm learning PureScript and trying to log a directory's contents.
module Main where
import Prelude
import Data.Traversable (traverse)
import Effect (Effect)
import Effect.Console (log)
import Node.FS.Sync (readdir)
fnames = readdir "."
main = do
  traverse (\a -> log $ show a) fnames
I want to get folder entries printed in console output.
I cannot get rid of (or pass through) the Effect that I get from Node.FS.Sync's readdir (I get Effect (Array String)), and traverse, log, and show cannot work with the Effect in front of Array String.
I get No type class instance was found for Data.Traversable.Traversable Effect.
Effect is a program, not a value. Effect (Array String) is a program that, when executed, will produce an Array String. You cannot get the Array String out of that program without executing it.
One way to execute this program is to make it part of a larger program, such as, for example, your main program. Like this:
main = do
  ns <- fnames
  traverse (\a -> log $ show a) ns
Of course, there is really no need to put it in a global variable fnames before making it part of the main program. You can include readdir "." directly:
main = do
  ns <- readdir "."
  traverse (\a -> log $ show a) ns
How can I perform word count of multiple files present in a directory using Apache Spark with Scala?
All the files have newline delimiter.
Output should be:
file1.txt,5
file2.txt,6 ...
I tried using below way:
val rdd= spark.sparkContext.wholeTextFiles("file:///C:/Datasets/DataFiles/")
val cnt=rdd.map(m =>( (m._1,m._2),1)).reduceByKey((a,b)=> a+b)
Output I'm getting:
((file:/C:/Datasets/DataFiles/file1.txt,apple
orange
bag
apple
orange),1)
((file:/C:/Datasets/DataFiles/file2.txt,car
bike
truck
car
bike
truck),1)
I tried sc.textFile() first, but it didn't give me the filename.
wholeTextFiles() returns key-value pairs in which the key is the filename, but I couldn't get the desired output.
You are on the right track, but you need to work your solution out a bit more.
The method sparkContext.wholeTextFiles(...) gives you one (file, contents) pair per file, so when you reduce it by key you get (file, 1), because that is how many whole-file contents values you have per key.
In order to count the words of each file, you need to break the contents of each file into those words so you can count them.
Let's do it here. Let's start by reading the file directory:
val files: RDD[(String, String)] = spark.sparkContext.wholeTextFiles("file:///C:/Datasets/DataFiles/")
That gives one row per file, alongside the full file contents. Now let's break the file contents into individual items. Given that your files seem to have one word per line, this is really easy using line breaks:
val wordsPerFile: RDD[(String, Array[String])] = files.mapValues(_.split("\n"))
Now we just need to count the number of items that are present in each of those Array[String]:
val wordCountPerFile: RDD[(String, Int)] = wordsPerFile.mapValues(_.size)
And that's basically it. It's worth mentioning, though, that the word counting is not being distributed at all (it's just using an Array[String]) because you are loading the whole contents of each file at once.
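Putting those steps together into one runnable snippet (the directory path is the one from the question; the final collect and println to get the file,count output are just illustrative):
import org.apache.spark.rdd.RDD

// One (file, contents) pair per file, split each file's contents on line breaks,
// then count the resulting words.
val files: RDD[(String, String)] =
  spark.sparkContext.wholeTextFiles("file:///C:/Datasets/DataFiles/")
val wordCountPerFile: RDD[(String, Int)] =
  files.mapValues(_.split("\n").length)

// Prints lines like "file:/C:/Datasets/DataFiles/file1.txt,5"
wordCountPerFile.collect().foreach { case (file, count) => println(s"$file,$count") }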
The code below does not add the double quotes, which is the default. I also tried adding # and a single quote using the quote option, with no success. I also used quoteMode with the ALL and NON_NUMERIC options; there was still no change in the output.
s2d.coalesce(64).write
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .save(fname)
Are there any other options I can try? I am using spark-csv 2.11 over spark 2.1.
Output it produces:
d4c354ef,2017-03-14 16:31:33,2017-03-14 16:31:46,104617772177,340618697
Output I am looking for:
"d4c354ef","2017-03-14 16:31:33","2017-03-14 16:31:46",104617772177,340618697
tl;dr Enable quoteAll option.
scala> Seq(("hello", 5)).toDF.write.option("quoteAll", true).csv("hello5.csv")
The above gives the following output:
$ cat hello5.csv/part-00000-a0ecb4c2-76a9-4e08-9c54-6a7922376fe6-c000.csv
"hello","5"
That assumes the quote is " (see CSVOptions)
That however won't give you "Double quotes around all non-numeric characters." Sorry.
You can see all the options in CSVOptions, which serves as the source of the options for the CSV reader and writer.
p.s. com.databricks.spark.csv is currently a mere alias for csv format. You can use both interchangeably, but the shorter csv is preferred.
p.s. Use option("header", false) (false as a boolean, not a String); that will make your code slightly more type-safe.
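For reference, here is roughly what the original writer could look like with both suggestions applied (a sketch only; s2d and fname come from the question):
// quoteAll enabled, boolean header option, and the shorter csv format.
s2d.coalesce(64).write
  .option("quoteAll", true)
  .option("header", false)
  .csv(fname)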
In Spark 2.1, where the old CSV library has been inlined, I do not see any option for what you want in the csv method of DataFrameWriter, as seen here.
So I guess you have to map over your data "manually" to determine which of the Row components are non-numbers and quote them accordingly. You could utilize a straightforward isNumeric helper function like this:
def isNumeric(s: String) = s.nonEmpty && s.forall(Character.isDigit)
As you map over your DataSet, quote the values where isNumeric is false.
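A rough sketch of that manual mapping, assuming s2d is the DataFrame from the question and that writing hand-built CSV lines as plain text is acceptable (this is illustrative, not a definitive implementation):
import spark.implicits._ // needed for the Dataset[String] encoder

def isNumeric(s: String) = s.nonEmpty && s.forall(Character.isDigit)

// Quote every non-numeric value and join the columns into a CSV line by hand.
val quoted = s2d.map { row =>
  row.toSeq
    .map(v => Option(v).map(_.toString).getOrElse(""))
    .map(s => if (isNumeric(s)) s else "\"" + s + "\"")
    .mkString(",")
}

quoted.coalesce(64).write.text(fname)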
Sample Dataset:
$, Claw "OnCreativity" (2012) [Himself]
$, Homo Nykytaiteen museo (1986) [Himself] <25>
Suuri illusioni (1985) [Guests] <22>
$, Steve E.R. Sluts (2003) (V) <12>
$hort, Too 2012 AVN Awards Show (2012) (TV) [Himself - Musical Guest]
2012 AVN Red Carpet Show (2012) (TV) [Himself]
5th Annual VH1 Hip Hop Honors (2008) (TV) [Himself]
American Pimp (1999) [Too $hort]
I have created a key-value pair RDD using the following code:
To split data: val actorTuple = actor.map(l => l.split("\t"))
To make KV pair: val actorKV = actorTuple.map(l => (l(0), l(l.length-1))).filter{case(x,y) => y != "" }
The Key-Value RDD output on console:
Array(($, Claw,"OnCreativity" (2012) [Himself]), ($, Homo,Nykytaiteen museo (1986) [Himself] <25>), ("",Suuri illusioni (1985) [Guests] <22>), ($, Steve,E.R. Sluts (2003) (V) <12>).......
But a lot of lines have "" as the key, i.e. blank (see the RDD output above), because of the nature of the dataset. So I want a function that copies the actor from the previous line to the current line if its key is empty.
How can this be done?
I'm new to Spark and Scala, but perhaps it would be simpler to change your parsing of the lines and first create a pair RDD with values of type list, e.g.:
($, Homo, (Nykytaiteen museo (1986) [Himself] <25>,Suuri illusioni (1985) [Guests] <22>) )
I don't know your data, but perhaps if a line doesn't begin with "$" you append onto the value list.
Then, depending on what you want to do, perhaps you could use flatMapValues(func) on the pair RDD described above. This applies a function that returns an iterator to each value of a pair RDD, and for each element returned it produces a key-value entry with the old key.
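A minimal sketch of flatMapValues, using a hypothetical hand-built pair RDD shaped like the example above (the data here is made up for illustration):
import org.apache.spark.rdd.RDD

// A hand-built pair RDD whose values are lists of titles, shaped like the example above.
val grouped: RDD[(String, List[String])] = sc.parallelize(Seq(
  ("$, Homo", List(
    "Nykytaiteen museo (1986) [Himself] <25>",
    "Suuri illusioni (1985) [Guests] <22>"))
))

// flatMapValues emits one (actor, title) pair per title, keeping the original key.
val flattened: RDD[(String, String)] = grouped.flatMapValues(titles => titles)
flattened.collect().foreach(println)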
ADDED:
What format is your input data ("Sample Dataset") in? Is it a text file or .tsv?
You probably want to load the whole file at once. That is, use .wholeTextFiles() rather than .textFile() to load your data. This is because your records are stored across more than one line in the file.
ADDED
I'm not going to download the file, but it seems to me that each record you are interested in begins with "$".
Spark can work with any of the Hadoop Input formats, so check those to see if there is one which will work for your sample data.
If not, you could write your own Hadoop InputFormat implementation that parses files into records split on this character instead of the default for TextFiles, which is the '\n' character.
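If a custom InputFormat feels heavy, a lighter alternative is Hadoop's textinputformat.record.delimiter setting with the stock TextInputFormat; here is a sketch, assuming each record starts on a line beginning with "$" (the path and the delimiter are illustrative assumptions):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Change the record separator instead of writing a full InputFormat.
// Note that the delimiter itself is stripped from the returned records.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\n$")

val records = sc
  .newAPIHadoopFile("actors.list", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }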
Continuing from the idea xyzzy gave, how about you try this after loading in the file as a string:
val actorFileSplit = actorsFile.split("\n\n")
val actorData = sc.parallelize(actorFileSplit)
val actorDataSplit = actorData.map(x => x.split("\t+", 2).toList).map(line => (line(0), line(1).split("\n\t+").toList))
To explain what I'm doing: I start by splitting the string every time we find two consecutive line breaks. Then I parallelize this into the SparkContext for the mapping functions. Next I split every entry into two parts, delimited by the first occurrence of one or more tabs. The first part should now be the actor and the second part should still be the string with the movie titles. The second part may once again be split at every newline followed by a number of tabs. This should create a list with all the titles for every actor. The final result is in the form:
actorDataSplit = [(String, [String])]
Good luck
I have written a short Scala program to read a large file, process it, and store the result in another file. The file contains about 60000 lines of numbers, and I need to extract from every third line only the first number. Eventually I save those numbers to a different file. Although they are numbers, I treat them as strings all the way through.
Here is the Scala code:
import scala.io.Source
import java.io.BufferedWriter
import java.io.FileWriter
object Analyze {
  def main(args: Array[String]) {
    val fname = "input.txt"
    val counters = Source.fromFile(fname).mkString.split("\\n").grouped(3)
      .map(_(2).split("\\s+")(0))
    val f = new BufferedWriter(new FileWriter("output1.txt"))
    f.write(counters.reduceLeft(_ + "\n" + _))
    f.close()
  }
}
I very much like Scala's capability for powerful one-liners. The one-liner in the above code reads the entire text from the file, splits it into lines, groups the lines into groups of 3, and then takes the third line from each group, splits it, and takes the first number.
Here is the equivalent Python script:
fname = 'input.txt'
with file(fname) as f:
    lines = f.read().splitlines()
linegroups = [lines[i:i+3] for i in range(0, len(lines), 3)]
nums = [linegroup[2].split()[0] for linegroup in linegroups]
with file('output2.txt', 'w') as f:
    f.write('\n'.join(nums))
Python is not capable of such one-liners. In the above script, the first line of code reads the file into a list of lines, the next one groups the lines into groups of 3, and the next one creates a list consisting of the first number of the last line of each group. It's very similar to the Scala code, only it runs much, much faster.
The Python script runs in a fraction of a second on my laptop, while the Scala program runs for 15 seconds! I commented out the code that saves the result to the file, and the duration fell to 5 seconds, which is still way too slow. I also don't understand why it takes so long to save the numbers to the file. When I dealt with larger files, the Python script ran for a few seconds, while the Scala program's running time was on the order of minutes, which made it unusable for analyzing my files.
I'd appreciate your advice on this issue.
Thanks
I took the liberty of cleaning up the code; this should run more efficiently by avoiding the initial mkString, not needing a regex to perform the whitespace split, and not pre-aggregating the results before writing them out. I also used methods that are more self-documenting:
val fname = "input.txt"
val lines = (Source fromFile fname).getLines
val counters =
  (lines grouped 3 withPartial false) map { _.last takeWhile (!_.isWhitespace) }
val f = new BufferedWriter(new FileWriter("output1.txt"))
f.write(counters mkString "\n")
f.close()
Warning, untested code
This is largely irrelevant though, depending on how you're profiling. If you're including the JVM startup time in your metrics, then all bets are off - no amount of code optimisation could help you there.
I'd normally also suggest pre-warming the JVM by running the routine a few hundred times before you time it, but this isn't so practical in the face of file I/O.
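If you want a rough in-process measurement that excludes JVM startup, something like the following sketch can help (the routine to time is a placeholder):
object Timed extends App {
  // Time only the processing work, excluding JVM startup and class loading.
  val start = System.nanoTime()

  // ... run the file-processing routine here (placeholder) ...

  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"processing took $elapsedMs%.1f ms")
}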
I timed the version provided by Kevin with minor edits (removed withPartial, since the Python version doesn't handle padding either):
import scala.io.Source
import java.io.BufferedWriter
import java.io.FileWriter
object A extends App {
  val fname = "input.txt"
  val lines = (Source fromFile fname).getLines
  val counters =
    (lines grouped 3) map { _.last takeWhile (!_.isWhitespace) }
  val f = new BufferedWriter(new FileWriter("output1.txt"))
  f.write(counters mkString "\n")
  f.close()
}
With 60,000 lines, here are the timings:
$ time scala -cp classes A
real 0m2.823s
$ time /usr/bin/python A.py
real 0m0.437s
With 900,000 lines:
$ time scala -cp classes A
real 0m5.226s
$ time /usr/bin/python A.py
real 0m3.319s
With 2,700,000 lines:
$ time scala -cp classes A
real 0m9.516s
$ time /usr/bin/python A.py
real 0m10.635s
The Scala version outperforms the Python version at that size, so it seems some of the long timing is due to JVM initialization and JIT compilation time.
In addition to @notan3xit's answer, you could also write counters to files without concatenating them first:
val f = new BufferedWriter(new FileWriter("output1.txt"))
f.write(counters.head.toString)
counters.tail.foreach(c => f.write("\n" + c.toString))
f.close()
Though you could do the same in Python...
Try this code for writing to the file:
val f = new java.io.PrintWriter(new java.io.File("output1.txt"))
f.write(counters.reduce(_ + "\n" + _))
f.close()
Much faster.