Reading float values from a CSV file in Scala?

I want to send (integer) values from a CSV file to a (Chisel) class.
I just can't read the values from the CSV file - I have already tried all the code snippets I found scattered around the internet. (The CSV file is in the format below.)
1321437196.0,
-2132416838.0,
1345437196.0
Code I am using:
val bufferedSource = io.Source.fromFile("sourceFile.csv")
val rows = Array.ofDim[Int](3)
var count = 0
for (line <- bufferedSource.getLines) {
  rows(count) = line.split(",").map(_.trim).toString.toInt
  count += 1
}
bufferedSource.close
println(rows.mkString(" "))
Output:
[Ljava.lang.String;#51f9ef45
[Ljava.lang.String;#2f42c90a
[Ljava.lang.String;#6d9bd75d
I understand the error message and have tried the various snippets mentioned in related questions (Printing array in Scala, Scala - printing arrays), but I just can't see where I am going wrong here. Just to point out, I don't want a Double value here but a converted signed Integer, hence the toInt.
Thanks, I'd appreciate any help with this!

"1.0".toInt won't work. Need to go from String to Float to Int.
val bufferedSource = io.Source.fromFile("sourceFile.csv")
val rows = bufferedSource.getLines
  .map(_.split(",").head.trim.toFloat.toInt)
  .toArray
bufferedSource.close
rows //res1: Array[Int] = Array(1321437184, -2132416896, 1345437184)
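If the exact integer values matter, here is a sketch of a variant (assuming every value fits in an Int) that goes through Double instead of Float, avoiding the precision loss visible in the result above:
import scala.io.Source

val bufferedSource = Source.fromFile("sourceFile.csv")
val rows = bufferedSource.getLines()
  .map(_.split(",").head.trim.toDouble.toInt) // Double represents these values exactly
  .toArray
bufferedSource.close()
// rows: Array[Int] = Array(1321437196, -2132416838, 1345437196)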

Change
line.split(",").map(_.trim).toString.toInt
to
line.split(",")(0).trim.toDouble.toInt
Going through toDouble handles the trailing ".0"; a plain toInt on "1321437196.0" throws a NumberFormatException.

val bufferedSource = io.Source.fromFile("sourceFile.csv")
val rows = bufferedSource.getLines
  .map(_.split(",").headOption.mkString.trim.toDouble.toInt)
  .toArray // materialise before closing, since getLines is lazy
bufferedSource.close

Related

How to explode a struct column with a prefix?

My goal is to explode a Spark struct column (i.e., take the fields from inside the struct and expose them as regular columns of the dataset - already done), but changing the inner field names by prepending an arbitrary string. One of the motivations is that my struct can contain columns that have the same name as columns outside of it - therefore, I need a way to differentiate them easily. Of course, I do not know beforehand what the columns inside my struct are.
Here is what I have so far:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
}
This does the job alright - I use this writing:
df.explodeStruct("myColumn")
It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.
As for prepending the prefix, my idea is to take the column and find out what its inner columns are. I browsed the documentation and could not find any method on the Column class that does that. I then changed my approach: take the schema of the DataFrame, filter the result by the name of the column, and extract the matching column from the resulting array. The problem is that the element I find has the type StructField - which, again, offers no way to extract its inner fields - whereas what I would really like is a StructType element, which exposes its fields (that is, the names of the inner columns, so I can iterate over them and use them in my select, prepending the prefix I want). I know of no way to convert a StructField to a StructType.
My last resort would be to parse the output of StructField.toString - which contains all the names and types of the inner columns - but that feels really dirty, and I'd rather avoid such a lowly approach.
Any elegant solution to this problem?
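(As an aside, a minimal sketch of the schema-based route described above, assuming the column really is a struct - a StructField's dataType can be matched against StructType to reach the inner field names:)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

def innerFieldNames(df: DataFrame, column: String): Seq[String] =
  df.schema(column).dataType match {
    case s: StructType => s.fieldNames.toSeq // names of the struct's inner columns
    case _             => Seq.empty          // not a struct column
  }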
Well, after reading my own question again, I figured out an elegant solution to the problem - I just needed to select all the columns the way I was already doing, and then compare the result back to the original dataframe to figure out which columns were new. Here is the final result - I also arranged it so that the exploded columns show up in the same place as the original struct column, so as not to break the flow of information:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = {
    val prefix = column + "_"
    val originalPosition = df.columns.indexOf(column)
    val dfWithAllColumns = df.select("*", column + ".*")
    val explodedColumns = dfWithAllColumns.columns diff df.columns
    val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)
    val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)
    df.select(finalColumnsList: _*)
  }
}
Of course, you can customize the prefix, the separator, and so on - but that is simple; anyone can tweak the parameters. The usage remains the same.
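For example, a hypothetical usage sketch (assuming a SparkSession named spark is in scope; the column and field names are made up for illustration):
import org.apache.spark.sql.functions.struct
import spark.implicits._

val df = Seq((1, 2, 3)).toDF("id", "a", "b")
  .withColumn("myColumn", struct($"a", $"b"))
  .drop("a", "b")
// columns before: id, myColumn
df.explodeStruct("myColumn")
// columns after: id, myColumn_a, myColumn_b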
In case anyone is interested, here is something similar for PySpark:
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def explode_struct(df: DataFrame, column: str) -> DataFrame:
    original_position = df.columns.index(column)
    new_columns = df.select(column + ".*").columns
    exploded_columns = [F.col(column + "." + c).alias(column + "_" + c) for c in new_columns]
    col_list = [F.col(c) for c in df.columns]
    col_list.pop(original_position)
    col_list[original_position:original_position] = exploded_columns
    return df.select(col_list)

How to loop over array and concat elements into one print statement or variable Spark Scala

I am trying to figure out how to concatenate the elements of my array, as I loop over them, into one variable or print statement. I need these printed to stdout in a certain format so another application (an Oozie job) can use them.
Here is the relevant part of what I have so far:
filterDF.registerTempTable("filterDF_table")
val filterDF_table_print = spark.sql("""
SELECT SUBSTRING(constraint,locate('(',constraint) + 1,locate(',',constraint) -locate('(',constraint) -1) as error_column,
SUBSTRING(constraint,1 ,locate('(',constraint) -1) as error_reason
FROM filterDF_table
""")
filterDF_table_print.rdd.map(row => {
  val row1 = row.getAs[String]("error_reason")
  val make = if (row1.toLowerCase == "patternmatchconstraint") "Invalid Length" else "error_reason"
  ("field", row(0), make)
}).collect().foreach(println)
This is great so far - it took me a while to get this far, and these are all the elements I need in my printed statement. They are just not in the format I am hoping for.
(field,FOO1,Invalid Length)
(field,FOO2,Invalid Length)
(field,FOO3,Invalid Length)
(field,FOO4,Invalid Length)
(field,FOO5,Invalid Length)
(field,FOO6,Invalid Length)
(field,FOO7,Invalid Length)
What I need for my next application to run properly is something like this.
OUTVAR:field,FOO1,Invalid Length
field,FOO2,Invalid Length
field,FOO3,Invalid Length
field,FOO4,Invalid Length
field,FOO5,Invalid Length
field,FOO6,Invalid Length
field,FOO7,Invalid Length
I am not so worried about the formatting and spacing at this point I can google around for that or ask another question if need be. Mainly I need to get this all into one printed statement to move forward.
Here is my suggested solution. I don't have the rest of your codebase, so I can't test it on my own machine, but here is my best attempt:
val res = filterDF_table_print.rdd.map(row => {
  val row1 = row.getAs[String]("error_reason")
  val make = if (row1.toLowerCase == "patternmatchconstraint") "Invalid Length" else "error_reason"
  ("field", row(0), make)
}).collect()
val toPrint = res.map{ case (x, y, z) => s"$x, $y, $z" }.mkString("\n")
println(toPrint)
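If the output also needs the OUTVAR: prefix and the comma-only separators shown in the question, a small tweak (sketch) would be:
val toPrint = res.map { case (x, y, z) => s"$x,$y,$z" }.mkString("\n")
println("OUTVAR:" + toPrint)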

How to mimic the function map.getOrElse with a CSV file

I have a CSV file that represents a Map[String, Int], and I am reading the file as follows:
def convI2N(vkey: Int): String = {
  val in = new Scanner("dictionaryNV.csv")
  loop.breakable {
    while (in.hasNext) {
      val nodekey = in.next(',')
      val value = in.next('\n')
      if (value == vkey.toString) {
        n = nodekey
        loop.break()
      }
    }
  }
  in.close
  n
}
The function returns the String for a given Int. The problem is that I must scan the whole file, and the file is too big, so the procedure is too slow. Someone told me that this is O(n) time complexity and recommended getting it down to O(log n). I suppose that Map.getOrElse is O(log n).
Can someone help me find a way to get better performance from this code?
As an additional comment, the dictionaryNV file is sorted by the Int values.
Maybe I could split the file by lines, or by sets of lines. The CSV has about 167000 tuples of [String, Int].
Or, put another way: how do you do some kind of binary search through a CSV in Scala?
If you are calling the convI2N function many times, the job will definitely be slow, because each call has to scan the big file. So if the function is called many times, it is recommended to load the data once into a temporary structure such as properties, a HashMap, or a collection of Tuple2, and look values up from there instead of re-reading the file.
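For example, a minimal sketch of that map-based approach (assuming the file fits in memory and each line has the form nodekey,value as in the question):
import scala.io.Source

val source = Source.fromFile("dictionaryNV.csv")
val valueToKey: Map[String, String] = source.getLines()
  .map(_.split(","))
  .collect { case Array(key, value) => value.trim -> key.trim } // index by the Int value
  .toMap
source.close()

def convI2N(vkey: Int): String =
  valueToKey.getOrElse(vkey.toString, "not found")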
You can try the following way, which should be faster than the Scanner approach.
Assuming that your CSV file is comma separated as
key1,value1
key2,value2
Using Source.fromFile can be your solution:
import scala.io.Source

def convI2N(vkey: Int): String = {
  var n = "not found"
  val source = Source.fromFile("<your path to dictionaryNV.csv>")
  val filtered = source
    .getLines()
    .map(line => line.split(","))
    .filter(sline => sline(1).trim.equalsIgnoreCase(vkey.toString)) // match on the Int value column
  for (str <- filtered) {
    n = str(0) // the String key of the matching row
  }
  source.close()
  n
}

How to make multiple outputs in Scala?

I have 1 input file with n lines. How can I create n output files from n lines?
All I know is:
for (line <- Source.fromFile(filePath).getLines) {
println(line)
}
If the question is how to write a file, this can be done in two ways.
1) By using PrintWriter
import java.io.{File, PrintWriter}

val writersample = new PrintWriter(new File("sample.txt"))
writersample.write("put the content you want to write")
// you can write as many lines as you want
writersample.close()
2) By using FileWriter
import java.io.{BufferedWriter, File, FileWriter}

val file = new File("sample.txt")
val bufferw = new BufferedWriter(new FileWriter(file))
bufferw.write("whatever you want to write here")
bufferw.close()
If you are looking to write n different files, you probably need to repeat the code in a loop, changing the filename each time.
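For example, a minimal sketch (assuming the output files can simply be numbered; output_<index>.txt is a made-up naming scheme) that writes each input line to its own file:
import java.io.{File, PrintWriter}
import scala.io.Source

val source = Source.fromFile(filePath)
for ((line, index) <- source.getLines().zipWithIndex) {
  val writer = new PrintWriter(new File(s"output_$index.txt"))
  writer.write(line) // one input line per output file
  writer.close()
}
source.close()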
The differences between the two approaches are described at https://coderanch.com/t/418148/certification/Information-PrintWriter-FileWriter
Please let me know if you are looking for a different answer than this.

Spark - create RDD of (label, features) pairs from CSV file

I have a CSV file and want to perform a simple LinearRegressionWithSGD on the data.
Sample data is shown below (the file has 99 rows in total, including the header), and the objective is to predict the y_3 variable:
y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8
2995.3846153846152,17.0,1800.0,0.0,1.0,0.0,12.0
2236.304347826087,17.0,1432.0,1.0,0.0,0.0,12.0
2001.9512195121952,35.0,1432.0,0.0,1.0,0.0,5.0
992.4324324324324,17.0,1430.0,1.0,0.0,0.0,12.0
4386.666666666667,26.0,1430.0,0.0,0.0,1.0,25.0
1335.9036144578313,17.0,1432.0,0.0,1.0,0.0,5.0
1097.560975609756,17.0,1100.0,0.0,1.0,0.0,5.0
3526.6666666666665,26.0,1432.0,0.0,1.0,0.0,12.0
506.8421052631579,17.0,1430.0,1.0,0.0,0.0,5.0
2095.890410958904,35.0,1430.0,1.0,0.0,0.0,12.0
720.0,35.0,1430.0,1.0,0.0,0.0,5.0
2416.5,17.0,1432.0,0.0,0.0,1.0,12.0
3306.6666666666665,35.0,1800.0,0.0,0.0,1.0,12.0
6105.974025974026,35.0,1800.0,1.0,0.0,0.0,25.0
1400.4624277456646,35.0,1800.0,1.0,0.0,0.0,5.0
1414.5454545454545,26.0,1430.0,1.0,0.0,0.0,12.0
5204.68085106383,26.0,1800.0,0.0,0.0,1.0,25.0
1812.2222222222222,17.0,1800.0,1.0,0.0,0.0,12.0
2763.5928143712576,35.0,1100.0,1.0,0.0,0.0,12.0
I already read the data with the following command:
val data = sc.textFile(datadir + "/data_2.csv");
When I try to create an RDD of (label, features) pairs with the following command:
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
I get an error, so I cannot continue to train a model. Any help?
P.S. I run Spark with Scala IDE on Windows 7 x64.
After a lot of effort I found the solution. The first problem was related to the header row and the second was related to the mapping function. Here is the complete solution:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// To read the file
val csv = sc.textFile(datadir + "/data_2.csv")
// To find the header
val header = csv.first
// To remove the header (compares first characters; works here because no data row starts with 'y')
val data = csv.filter(_(0) != header(0))
// To create an RDD of (label, features) pairs: the label is the first column,
// the features are all the remaining columns
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()
I hope it can save your time.
When you read in your file, the first line
y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8
is also read and transformed in your map function, so you end up calling toDouble on "y_3". You need to filter out the first row and do the learning using the remaining rows.
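A minimal sketch of that filtering step (assuming csv is the RDD read with sc.textFile, as in the answer above):
val header = csv.first()           // the header line "y_3,x_6,..."
val rows = csv.filter(_ != header) // keep only the data rows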