Split file text by newline in Scala

I want to read 100 numbers from a file which are stored in the following fashion:
Each number is on a different line. I am not sure which data structure should be used here, because later I will need to sum all these numbers and extract the first 10 digits of the sum.
I only managed to read the file; I want to split the text at newline separators and get each number as a list or array element:
import scala.io.Source

val source = Source.fromFile("pathtothefile")
val lines = source.getLines.mkString
I would be grateful for any advice on a data structure to be used here!
Update on approach:
val lines = Source.fromFile("path").getLines.toList

You almost have it there; just map to BigInt and you have a List[BigInt]:
val lines = Source.fromFile("path").getLines.map(BigInt(_)).toList
(and then you can use .sum to sum them all up, etc.)
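For example, a minimal end-to-end sketch of the stated goal, summing the numbers and taking the first 10 digits of the sum (the "path" placeholder stands in for your file location):
import scala.io.Source

val numbers: List[BigInt] =
  Source.fromFile("path").getLines.map(BigInt(_)).toList

val total: BigInt = numbers.sum
// First 10 digits of the sum, read off its decimal string form.
val firstTenDigits: String = total.toString.take(10)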

Related

how to return empty field in Spark

I am trying to check for incomplete records and identify the bad records in Spark.
E.g., a sample test.txt file in record format, columns separated by \t:
L1C1 L1C2 L1C3 L1C4
L2C1 L2C2 L2C3
L3C1 L3C2 L3C3 L3C4
scala> sc.textFile("test.txt").filter(_.split("\t").length < 4).collect.foreach(println)
L2C1 L2C2 L2C3
The second line is reported as having fewer columns.
How should I parse it without ignoring the empty column at the end of the second line?
Scala's String.split, like Java's, removes trailing empty substrings by default.
To keep all the substrings (including trailing empty ones), call split with a negative limit:
"L2C1\tL2C2\tL2C3\t".split("\t", -1)

How can I count words in multiple files in a directory using Spark and Scala

How can I perform a word count of multiple files present in a directory using Apache Spark with Scala?
All the files are newline-delimited.
The output should be:
file1.txt,5
file2.txt,6 ...
I tried the following:
val rdd= spark.sparkContext.wholeTextFiles("file:///C:/Datasets/DataFiles/")
val cnt = rdd.map(m => ((m._1, m._2), 1)).reduceByKey((a, b) => a + b)
The output I'm getting:
((file:/C:/Datasets/DataFiles/file1.txt,apple
orange
bag
apple
orange),1)
((file:/C:/Datasets/DataFiles/file2.txt,car
bike
truck
car
bike
truck),1)
I tried sc.textFile() first, but it didn't give me the filename.
wholeTextFiles() returns key-value pairs in which the key is the filename, but I couldn't get the desired output.
You are on the right track, but your solution needs a bit more work.
The method sparkContext.wholeTextFiles(...) gives you a (file, contents) pair, so when you reduce it by key you get a count of 1, because that is how many whole-file contents values you have per key.
In order to count the words of each file, you need to break the contents of each file into those words so you can count them.
Let's do it here, let's start reading the file directory:
val files: RDD[(String, String)] = spark.sparkContext.wholeTextFiles("file:///C:/Datasets/DataFiles/")
That gives one row per file, together with the full file contents. Now let's break the file contents into individual items. Given that your files seem to have one word per line, this is easy using line breaks:
val wordsPerFile: RDD[(String, Array[String])] = files.mapValues(_.split("\n"))
Now we just need to count the number of items that are present in each of those Array[String]:
val wordCountPerFile: RDD[(String, Int)] = wordsPerFile.mapValues(_.size)
And that's basically it. It's worth mentioning, though, that the word counting is not distributed at all (it just uses an Array[String]) because you are loading the whole contents of each file at once.
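Putting those steps together, a minimal end-to-end sketch under the same assumptions (one word per line; the directory path is the one from the question), printing the requested file,count output:
import org.apache.spark.rdd.RDD

val files: RDD[(String, String)] =
  spark.sparkContext.wholeTextFiles("file:///C:/Datasets/DataFiles/")

// One word per line, so counting words reduces to counting lines.
val wordCountPerFile: RDD[(String, Int)] =
  files.mapValues(_.split("\n").length)

wordCountPerFile.collect().foreach { case (file, count) =>
  println(s"$file,$count")
}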

How to copy the key of the previous line to the key field of the next line in a Key-Value Pair RDD

Sample Dataset:
$, Claw "OnCreativity" (2012) [Himself]
$, Homo Nykytaiteen museo (1986) [Himself] <25>
Suuri illusioni (1985) [Guests] <22>
$, Steve E.R. Sluts (2003) (V) <12>
$hort, Too 2012 AVN Awards Show (2012) (TV) [Himself - Musical Guest]
2012 AVN Red Carpet Show (2012) (TV) [Himself]
5th Annual VH1 Hip Hop Honors (2008) (TV) [Himself]
American Pimp (1999) [Too $hort]
I have created a Key-Value Pair RDD using the following code:
To split the data: val actorTuple = actor.map(l => l.split("\t"))
To make the KV pair: val actorKV = actorTuple.map(l => (l(0), l(l.length - 1))).filter { case (x, y) => y != "" }
The Key-Value RDD output on console:
Array(($, Claw,"OnCreativity" (2012) [Himself]), ($, Homo,Nykytaiteen museo (1986) [Himself] <25>), ("",Suuri illusioni (1985) [Guests] <22>), ($, Steve,E.R. Sluts (2003) (V) <12>).......
But a lot of lines have "" (blank) as the key (see the RDD output above) because of the nature of the dataset, so I want a function that copies the actor from the previous line into this line's key if it is empty.
How can this be done?
I'm new to Spark and Scala, but perhaps it would be simpler to change your parsing of the lines and first create a pair RDD with values of type list, e.g.:
($, Homo, (Nykytaiteen museo (1986) [Himself] <25>,Suuri illusioni (1985) [Guests] <22>) )
I don't know your data, but perhaps if a line doesn't begin with "$" you append onto the value list.
Then, depending on what you want to do, perhaps you could use flatMapValues(func) on the pair RDD described above. This applies a function that returns an iterator to each value of a pair RDD, and for each element returned, produces a key-value entry with the old key.
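A minimal sketch of that flatMapValues step, on a hypothetical pair RDD shaped like the example above:
import org.apache.spark.rdd.RDD

// Hypothetical (actor, titles) pair built to match the example above.
val grouped: RDD[(String, List[String])] = sc.parallelize(Seq(
  ("$, Homo", List(
    "Nykytaiteen museo (1986) [Himself] <25>",
    "Suuri illusioni (1985) [Guests] <22>"))))

// flatMapValues emits one (actor, title) pair per title, keeping the old key.
val flattened: RDD[(String, String)] = grouped.flatMapValues(identity)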
ADDED:
What format is your input data ("Sample Dataset") in? Is it a text file or .tsv?
You probably want to load the whole file at once. That is, use .wholeTextFiles() rather than .textFile() to load your data. This is because your records are stored across more than one line in the file.
ADDED:
I'm not going to download the file, but it seems to me that each record you are interested in begins with "$".
Spark can work with any of the Hadoop Input formats, so check those to see if there is one which will work for your sample data.
If not, you could write your own Hadoop InputFormat implementation that parses files into records split on this character instead of the default for TextFiles, which is the '\n' character.
Continuing from the idea xyzzy gave, how about you try this after loading the file in as a single string:
val actorFileSplit = actorsFile.split("\n\n")
val actorData = sc.parallelize(actorFileSplit)
val actorDataSplit = actorData.map(x => x.split("\t+", 2).toList).map(line => (line(0), line(1).split("\n\t+").toList))
To explain: I start by splitting the string every time we find two consecutive line breaks (an empty line). Then I parallelize this into the SparkContext for the mapping functions. Next, I split every entry into two parts at the first occurrence of a run of tabs (one or more). The first part should now be the actor, and the second part should still be the string with the movie titles. The second part may once again be split at every newline followed by a run of tabs. This should create a list of all the titles for every actor. The final result is of the form:
actorDataSplit = [(String, [String])]
Good luck
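For completeness, here is a minimal non-distributed sketch of the fill-forward the question literally asks for, on a plain Scala list (the sample rows mirror the dataset above; a real Spark version would additionally need to handle keys across partition boundaries):
// Carry the last non-empty key forward; assumes the first row has a key.
val rows: List[(String, String)] = List(
  ("$, Homo", "Nykytaiteen museo (1986) [Himself] <25>"),
  ("", "Suuri illusioni (1985) [Guests] <22>"))

val filled = rows.foldLeft(List.empty[(String, String)]) {
  case (acc, ("", title)) => (acc.head._1, title) :: acc
  case (acc, pair) => pair :: acc
}.reverse
// filled now pairs "Suuri illusioni ..." with the carried key "$, Homo".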

How to import column of numbers from mixed string numerical text file

Variations of this question have already been asked several times, for example here. However, I can't seem to get this to work for my data.
I have a text file with 3 columns. The first and third columns are floating point numbers; the middle column is strings. Really, I'm only interested in the first column.
Here's what I tried:
filename=fopen('heartbeatn1nn.txt');
A = textscan(filename,'%f','HeaderLines',0);
fclose(filename);
When I do this, A comes out as just a single number: the first element in the column. How do I get the whole column? I've also tried this with the '.tsv' file extension, with the same result.
Also tried:
filename=fopen('heartbeatn1nn.txt');
formatSpec = '%f';
sizeA = [1 Inf];
A = fscanf(filename,formatSpec,sizeA);
fclose(filename);
with same result.
Could the file size be a problem? I'm not sure how many rows there are, but probably quite a few, since the file size is 1.7M.
Assuming the columns in your text file are separated by single whitespace characters, your format specification should look like this:
A = textscan(filename,'%f %s %f');
A now contains the complete file content. To obtain the first column:
first_col = A{:,1};
Alternatively, you can tell textscan to skip the unneeded fields with the * option:
first_col = cell2mat( textscan(filename, '%f %*s %*f') );
This returns only the first column.

How to randomly select from a list of 47 names that are entered from a data file?

I have managed to read a numeric data file into a matrix, but have been unable to do so for any data that is not a number.
I have a list of 47 names and am supposed to generate a random name from the list. I tried to use the function textscan but wasn't getting anywhere. Also, how do I generate a random name from the list? All I have been able to do is generate a random number between 1 and 47.
I appreciate the replies. I should have said I need it in MATLAB, sorry.
Here is a sample list of data in my data file
name01
name02
name03
and the code to read it:
fid = fopen('names.dat','rt');
headerChars = fgetl(fid);
data = fscanf(fid,'%f,%f,%f,%f',[4 47]).';
fclose(fid);
The above is what I have to read the data file into a matrix, but it is only reading the first line. (Yes, it was modified from a previous post here on these forums :/)
Edit: As per the helpful comments from mtrw, and the fixed formatting of the sample data file, I've updated my answer with more detail.
With a single name (i.e. "Bob", "Bob Smith", or "Smith, Bob") on each line of the file, you can use the function TEXTSCAN by specifying '%s' as the format argument (to denote reading a string) and the newline character '\n' as the 'Delimiter' (the character that separates the strings in the file):
fid = fopen('namefile.txt','r');
names = textscan(fid,'%s','Delimiter','\n');
fclose(fid);
Then it's a matter of randomly picking one of the names. You can use the function RANDI to generate a random integer in the range from 1 to the number of names read from the file (found using the NUMEL function):
names = names{1}; %# Get the contents from the cell returned by TEXTSCAN
selectedName = names{randi(numel(names))};
Sounds like you're halfway home. Take that random number and use it as an index into the list.
For example, if you randomly generate the number 23, then fetch the 23rd entry in the list, which gives you a random name draw.
Use the RANDBETWEEN function to get a random number within your range. Use INDEX to get the actual cell value. For instance:
=INDEX(A1:A47, RANDBETWEEN(1, 47))
The above will work for your specific case of 47 names, assuming they're in column A. In general, you'd want something like:
=INDEX(MyNames, RANDBETWEEN(ROW(MyNames), ROW(MyNames) + ROWS(MyNames) - 1))
This assumes you've named your range of cells "MyNames" (for example, by selecting all the cells in your range and setting a name in the naming box). The above formula works by using the ROW function to return the top row of the MyNames array and the ROWS function to get the total rows in MyNames.