Reading a csv file and selecting three columns in Scala - scala

I need to read a CSV file and then write a new file containing three specified columns.
I know how to read a text file, but not a CSV file.
import scala.io.Source._
val lines = fromFile("file.txt").getLines

Or if you just want the first three columns, try this
val lines = fromFile("file.txt").
getLines.
map(_.split(",",4).take(3)).
toList

Assuming a collection of indices idx that refer to columns in the csv file, consider first
val idx = Array(1,3,4)
val xs = (1 to 10).toArray
and so we can fetch the 2nd, 4th and 5th columns (index 0 refers to the first column),
idx.map(xs)
Array(2, 4, 5)
We can apply this idea to the array obtained by splitting each line, as follows,
Source.fromFile("file.csv").getLines.map(_.split(",")).map(cols => idx.map(cols(_)))
This approach allows the indices of interest to be defined at runtime rather than hard-coded.
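The original question also asks for the selected columns to be written to a new file, which none of the snippets above do. A minimal sketch of that step, assuming a plain comma-separated file with no quoted fields; the object name `CsvColumns`, its method names, and the file paths are hypothetical:

```scala
import java.io.PrintWriter
import scala.io.Source

object CsvColumns {
  // Keep only the fields at the given 0-based positions, in that order.
  // Naive split: quoted fields containing commas are not handled, and a
  // row with too few fields throws ArrayIndexOutOfBoundsException.
  def selectColumns(line: String, idx: Seq[Int]): String = {
    val fields = line.split(",", -1) // limit -1 keeps trailing empty fields
    idx.map(i => fields(i)).mkString(",")
  }

  // Stream the input line by line and write the selected columns to outPath.
  def copyColumns(inPath: String, outPath: String, idx: Seq[Int]): Unit = {
    val source = Source.fromFile(inPath)
    val writer = new PrintWriter(outPath)
    try source.getLines().foreach(line => writer.println(selectColumns(line, idx)))
    finally {
      writer.close()
      source.close()
    }
  }
}
```

A call such as `CsvColumns.copyColumns("file.csv", "out.csv", Seq(1, 3, 4))` would then write the 2nd, 4th and 5th columns of each row. For files with quoted fields, a dedicated CSV parser is likely a better fit than `split`.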

Related

Convert csv file to map

I have a csv file containing a list of abbreviations and their full values such that the file looks like the below
original,mappedValue
bbc,britishBroadcastingCorporation
ch4,channel4
I want to convert this csv file into a Map such that it is of the form
val x:Map[String,String] = Map("bbc"->"britishBroadcastingCorporation", "ch4"->"channel4")
I have tried using the below:
Source.fromFile("pathToFile.csv").getLines().drop(1).map(_.split(","))
but this leaves me with an Iterator[Array[String]]
You are close: split gives you an array. You have to convert each array into a tuple and then collect the tuples into a map.
Source.fromFile("/home/agr/file.csv").getLines().drop(1).map(_.split(",")).map(a => a(0) -> a(1)).toMap
res4: scala.collection.immutable.Map[String,String] = Map(bbc -> britishBroadcastingCorporation, ch4 -> channel4)
In real life, you would also check for bad rows, either filtering out the splits whose length is less than 2 or putting them into a separate bin as bad data.
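The bad-row handling mentioned above could look roughly like this. It is only a sketch: the object name `CsvToMap` and the choice to return rejected lines alongside the map are assumptions, not part of the original answer.

```scala
object CsvToMap {
  // Returns the key -> value map plus the raw lines that had fewer than
  // two comma-separated fields. The first line is treated as a header.
  def parse(lines: Iterator[String]): (Map[String, String], List[String]) = {
    // Split each line once and keep the raw line for error reporting.
    val rows = lines.drop(1).map(line => line -> line.split(",", -1)).toList
    val (good, bad) = rows.partition { case (_, fields) => fields.length >= 2 }
    val mapping = good.map { case (_, fields) => fields(0) -> fields(1) }.toMap
    (mapping, bad.map(_._1))
  }
}
```

With a file, the input would be `Source.fromFile("pathToFile.csv").getLines()`. Returning the rejected lines rather than silently dropping them makes it possible to log or inspect them later.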

Spark - read text file, strip off first X and last Y rows using monotonically_increasing_id

I have to read in files from vendors that can potentially get pretty big (multiple GB). These files may have multiple header and footer rows that I want to strip off.
Reading the file in is easy:
val rawData = spark.read
.format("csv")
.option("delimiter","|")
.option("mode","PERMISSIVE")
.schema(schema)
.load("/path/to/file.csv")
I can add a simple row number using monotonically_increasing_id:
val withRN = rawData.withColumn("aIndex",monotonically_increasing_id())
That seems to work fine.
I can easily use that to strip off header rows:
val noHeader = withRN.filter($"aIndex".geq(2))
but how can I strip off footer rows?
I was thinking about getting the max of the index column, and using that as a filter, but I can't make that work.
val MaxRN = withRN.agg(max($"aIndex")).first.toString
val noFooter = noHeader.filter($"aIndex".leq(MaxRN))
That returns no rows, because MaxRN is a string.
If I try to convert it to a long, that fails:
noHeader.filter($"aIndex".leq(MaxRN.toLong))
java.lang.NumberFormatException: For input string: "[100000]"
How can I use that max value in a filter?
Is trying to use monotonically_increasing_id like this even a viable approach? Is it really deterministic?
This happens because first will return a Row. To access the first element of the row you must do:
val MaxRN = withRN.agg(max($"aIndex")).first.getLong(0)
Converting the row to a string gives you [100000], which is not a valid Long, so the toLong conversion fails.
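The question of whether monotonically_increasing_id is a viable row number is not answered above. The Spark documentation describes the generated IDs as monotonically increasing and unique but not consecutive, with the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Scala sketch below only models that documented layout (it does not call Spark; `IdLayout`, `modelId` and `idsFor` are hypothetical names) to show what the index column looks like once the data spans more than one partition:

```scala
object IdLayout {
  // Model of the documented layout: partition ID in the upper bits,
  // record number within the partition in the lower 33 bits.
  def modelId(partition: Int, rowInPartition: Long): Long =
    (partition.toLong << 33) + rowInPartition

  // IDs for a hypothetical dataset; each element of rowsPerPartition is
  // the number of rows held by one partition, in partition order.
  def idsFor(rowsPerPartition: Seq[Int]): Seq[Long] =
    for {
      (n, p) <- rowsPerPartition.zipWithIndex
      r <- 0 until n
    } yield modelId(p, r.toLong)
}
```

Under this model, two partitions of 3 and 2 rows get the IDs 0, 1, 2, 8589934592, 8589934593. The maximum ID is therefore not the row count, and a filter such as `aIndex <= max - Y` only removes exactly the last Y rows if they all sit in the final partition. If true consecutive row numbers are needed, `zipWithIndex` on the underlying RDD may be a safer alternative.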

PySpark list() in withColumn() only works once, then AssertionError: col should be Column

I have a DataFrame with 6 string columns named like 'Spclty1'...'Spclty6' and another 6 named like 'StartDt1'...'StartDt6'. I want to zip them and collapse them into a single column that looks like this:
[[Spclty1, StartDt1]...[Spclty6, StartDt6]]
I first tried collapsing just the 'Spclty' columns into a list like this:
DF = DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6')))
This worked the first time I executed it, giving me a new column called 'Spclty' containing rows such as ['014', '124', '547', '000', '000', '000'], as expected.
Then, I added a line to my script to do the same thing on a different set of 6 string columns, named 'StartDt1'...'StartDt6':
DF = DF.withColumn('StartDt', list(DF.select('StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6')))
This caused AssertionError: col should be Column.
After I ran out of things to try, I tried the original operation again (as a sanity check):
DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6'))).collect()
and got the assertion error as above.
So, it would be good to understand why it only worked the first time, but the main question is: what is the correct way to zip columns into a collection of dict-like elements in Spark?
.withColumn() expects a column object as second parameter and you are supplying a list.
Thanks. After reading a number of SO posts I figured out the syntax for passing a set of columns to the col parameter, using struct to create an output column that holds a list of values:
from pyspark.sql.functions import array, col, struct

DF_tmp = DF_tmp.withColumn('specialties', array([
    struct(
        *(col("Spclty{}".format(i)).alias("spclty_code"),
          col("StartDt{}".format(i)).alias("start_date"))
    )
    for i in range(1, 7)
]))
So, the col() and *col() constructs are what I was looking for, while the array([struct(...)]) approach lets me combine the 'Spclty' and 'StartDt' entries into a list of dict-like elements.

Reading and processing data in Spark: output is not delimited correctly

My stored output looks like this; it is a single column containing
\N|\N|\N|8931|\N|1
where | is supposed to be the column delimiter. So it should have 6 columns, but it only has one.
My code to generate this is
val distData = sc.textFile(inputFileAdl).repartition(partitions.toInt)
val x = new UdfWrapper(inputTempProp, "local")
val wrapper = sc.broadcast(x)
distData.map({s =>
wrapper.value.exec(s.toString)
}).toDF().write.parquet(outFolder)
Nothing inside of the map can be changed. wrapper.value.exec(s.toString) returns a delimited string (this cannot be changed). I want to write this delimited string to a parquet file, but have it correctly split into columns by a given delimiter. How can I accomplish this?
Current output: one column containing a delimited string
Expected output: six columns, split from the single delimited string
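No answer is recorded for this question. One possible approach, sketched here under the assumption that every record has exactly six pipe-separated fields, is to split each string into fields before converting to a DataFrame. The object name `PipeSplit` and its methods are hypothetical, and the code below is plain Scala without Spark:

```scala
object PipeSplit {
  // String.split takes a regular expression, so the pipe must be escaped;
  // an unescaped "|" is an empty alternation and splits between every
  // character. The limit -1 keeps trailing empty fields.
  def fields(line: String): Array[String] = line.split("\\|", -1)

  // Convert a line to a six-element tuple, or None if the field count differs.
  def toTuple(line: String): Option[(String, String, String, String, String, String)] =
    fields(line) match {
      case Array(a, b, c, d, e, f) => Some((a, b, c, d, e, f))
      case _                       => None
    }
}
```

In the job above, adding a step such as `.flatMap(PipeSplit.toTuple)` after the existing map and before `toDF()` would leave the body of the map untouched and should produce six columns, assuming `spark.implicits._` is in scope. Column names could then be supplied as arguments to `toDF`.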

How to Remove first few lines/header from multiple files using scala in spark

I was able to remove the first few lines of a single file using the code below:
scala> val file = sc.textFile("file:///root/path/file.csv")
Removing first 5 lines:
scala> val Data = file.mapPartitionsWithIndex{ (idx, iter) => if (idx == 0) iter.drop(5) else iter }
The problem is: Suppose that I have multiple files with the same columns, and I want to load all of them into rdd, removing the first few lines of each file.
Is this actually possible?
I'd appreciate any help. Thanks in advance!
Let's assume there are 2 files.
ravis-MacBook-Pro:files raviramadoss$ cat file.csv
first_file_first_record
first_file_second_record
first_file_third_record
first_file_fourth_record
first_file_fifth_record
first_file_sixth_record
ravis-MacBook-Pro:files raviramadoss$ cat file_2.csv
second_file_first_record
second_file_second_record
second_file_third_record
second_file_fourth_record
second_file_fifth_record
second_file_sixth_record
second_file_seventh_record
second_file_eight_record
Scala Code
sc.wholeTextFiles("/Users/raviramadoss/files").flatMap( _._2.linesIterator.drop(5) ).collect()
Output:
res41: Array[String] = Array(first_file_sixth_record, second_file_sixth_record, second_file_seventh_record, second_file_eight_record)
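The per-file logic in that one-liner can be checked without a cluster. The sketch below mirrors the (path, content) pairs that wholeTextFiles produces with an ordinary Seq; `DropPerFile` and `dropHeaders` are hypothetical names used only for illustration:

```scala
object DropPerFile {
  // Drop the first n lines of each file's content independently.
  // `files` stands in for the (path, content) pairs from wholeTextFiles.
  def dropHeaders(files: Seq[(String, String)], n: Int): Seq[String] =
    files.flatMap { case (_, content) => content.linesIterator.drop(n) }
}
```

Because each file is processed as one record, the drop applies to every file rather than only to the first partition. Note that wholeTextFiles reads each file fully into memory as a single string, so this approach is better suited to many small files than to a few very large ones.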
In Spark/Hadoop, if you give the input path as the directory containing all the files, textFile loads every file in that directory into a single RDD.
Note, however, that the idx == 0 check in mapPartitionsWithIndex matches only the first partition of that RDD, so it removes lines from the first file only. To remove the first few lines of every file, each file has to be handled separately, as in the wholeTextFiles code above.