Convert csv file to map - scala

I have a csv file containing a list of abbreviations and their full values such that the file looks like the below
original,mappedValue
bbc,britishBroadcastingCorporation
ch4,channel4
I want to convert this csv file into a Map such that it is of the form
val x:Map[String,String] = Map("bbc"->"britishBroadcastingCorporation", "ch4"->"channel4")
I have tried using the below:
Source.fromFile("pathToFile.csv").getLines().drop(1).map(_.split(","))
but this leaves me with an Iterator[Array[String]]

You are close , split provides an array. You have to convert it into a tuple and then to a map
Source.fromFile("/home/agr/file.csv").getLines().drop(1).map(csv=> (csv.split(",")(0),csv.split(",")(1))).toMap
res4: scala.collection.immutable.Map[String,String] = Map(bbc -> britishBroadcastingCorporation, ch4 -> channel4)
In real life , you will check for existance of bad rows and filtering out the array splits whose length is less than 2 or may be put that into another bin as bad data etc.

Related

Spark - read text file, string off first X and last Y rows using monotonically_increasing_id

I have to read in files from vendors, that can get potentially pretty big (multiple GB). These files may have multiple header and footer rows I want to strip off.
Reading the file in is easy:
val rawData = spark.read
.format("csv")
.option("delimiter","|")
.option("mode","PERMISSIVE")
.schema(schema)
.load("/path/to/file.csv")
I can add a simple row number using monotonically_increasing_id:
val withRN = rawData.withColumn("aIndex",monotonically_increasing_id())
That seems to work fine.
I can easily use that to strip off header rows:
val noHeader = withRN.filter($"aIndex".geq(2))
but how can I strip off footer rows?
I was thinking about getting the max of the index column, and using that as a filter, but I can't make that work.
val MaxRN = withRN.agg(max($"aIndex")).first.toString
val noFooter = noHeader.filter($"aIndex".leq(MaxRN))
That returns no rows, because MaxRN is a string.
If I try to convert it to a long, that fails:
noHeader.filter($"aIndex".leq(MaxRN.toLong))
java.lang.NumberFormatException: For input string: "[100000]"
How can I use that max value in a filter?
Is trying to use monotonically_increasing_id like this even a viable approach? Is it really deterministic?
This happens because first will return a Row. To access the first element of the row you must do:
val MaxRN = withRN.agg(max($"aIndex")).first.getLong(0)
By converting the row to string you will get [100000] and of course this is not a valid Long that's why the casting is failing.

How to read with Spark constantly updating HDFS directory and split output to multiple HDFS files based on String (row)?

Elaborated scenario -> HDFS directory which is "fed" with new log data of multiple types of bank accounts activity.
Each row represents a random activity type, and each row (String) contains the text "ActivityType=<TheTypeHere>".
In Spark-Scala, what's the best approach to read the input file/s in the HDFS directory and output multiple HDFS files, where each ActivityType is written to its own new file?
Adapted first answer to the statement:
The location of the "key" string is random within the parent String,
the only thing that is guaranteed is that it contains that sub-string,
in this case "ActivityType" followed by some val.
The question is really about this. Here goes:
// SO Question
val rdd = sc.textFile("/FileStore/tables/activitySO.txt")
val rdd2 = rdd.map(x => (x.slice (x.indexOfSlice("ActivityType=<")+14, x.indexOfSlice(">", (x.indexOfSlice("ActivityType=<")+14))), x))
val df = rdd2.toDF("K", "V")
df.write.partitionBy("K").text("SO_QUESTION2")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
3,4,4,ActivityType=<ACT_002>,A,1,2
ABC,ActivityType=<ACT_0033>
DEF,ActivityType=<ACT_0033>
Output is 3 files whereby the key is e.g. not ActivityType=, but rather ACT_001, etc. The key data is not stripped, it is still there in the String. You can modify that if you want as well as output location and format.
You can use MultipleOutputFormat for this.Convert rdd into key value pairs such that ActivityType is the key.Spark will create different files for different keys.You can decide based on the key where to place the files and what their names will be.
You can do something like this using RDDs whereby I assume you have variable length files and then converting to DFs:
val rdd = sc.textFile("/FileStore/tables/activity.txt")
val rdd2 = rdd.map(_.split(","))
.keyBy(_(0))
val rdd3 = rdd2.map(x => (x._1, x._2.mkString(",")))
val df = rdd3.toDF("K", "V")
//df.show(false)
df.write.partitionBy("K").text("SO_QUESTION")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
ActivityType=<ACT_002>,A,1,2
ActivityType=<ACT_003>,ABC
I get then as output 3 files, in this case 1 for each record. A bit hard to show as did it in Databricks.
You can adjust your output format and location, etc. partitionBy is the key here.

Read and processing data in spark output is not deliminated correctly

So my stored output looks like this, it is one column with
\N|\N|\N|8931|\N|1
Where | is suppose to be the deliminated column. So it should have 6 columns, but it only has one.
My code to generate this is
val distData = sc.textFile(inputFileAdl).repartition(partitions.toInt)
val x = new UdfWrapper(inputTempProp, "local")
val wrapper = sc.broadcast(x)
distData.map({s =>
wrapper.value.exec(s.toString)
}).toDF().write.parquet(outFolder)
Nothing inside of the map can be changed. wrapper.value.exec(s.toString) returns a deliminated string(This cannot be changed). I want to write this deliminated string to a parquet file, but have it be correctly deliminated by a given deliminator. How can I accomplish this?
So current output - One column which is a deliminated string
Exepcted out - Six columns from the single deliminated string

split the file into multiple files based on a string in spark scala

I have a text file with the below data having no particular format
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
I want the output as two files as below :
Based on string abc, I want to split the file.
file 1:
abc*123 *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~
hig*0109*10052200*Rq~
file 2:
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~
And the file names should be IT name(the line starts with k7) so file1 name should be IT_1234 second file name should be IT_8876.
There is this little dirty trick that I used for a project :
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")
You can set the delimiter of your spark context for reading files. So you could do something like this :
val delimit = "abc"
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)
val df = sc.textFile("your_original_file.txt")
.map(x => (delimit ++ x))
.toDF("delimit_column")
.filter(col("delimit_column") !== delimit)
Then you can map each element of your DataFrame (or RDD) to be written to a file.
It's a dirty method but it might help you !
Have a good day
PS : The filter at the end is to drop the first line which is empty with the concatenated delimiter
You can benefit from sparkContext's wholeTextFiles function to read the file. Then parse it to separate the strings ( here I have used #### as distinct combination of characters that won't repeat in the text)
val rdd = sc.wholeTextFiles("path to the file")
.flatMap(tuple => tuple._2.replace("\r\nabc", "####abc").split("####")).collect()
And then loop the array to save the texts to output
for(str <- rdd){
//saving codes here
}

Reading a csv file and selecting three columns in Scala

I need to read a csv file and then to make a new file having the specified 3 columns ..
I am aware of reading a text file but not csv file .
import scala.io.Source._
val lines = fromFile("file.txt").getLines
Or if you just want the first three columns, try this
val lines = fromFile("file.txt").
getLines.
map(_.split(",",4).take(3)).
toList
Assuming a collection of indices idx that refer to columns in the csv file, consider first
val idx = Array(1,3,4)
val xs = (1 to 10).toArray
and so we can fetch the 2nd, 4th and 5th columns (index 0 refers to the first column),
idx.map(xs)
Array(2, 4, 5)
We can apply this idea onn each array from splitting each line as follows,
Source.fromFile("file.csv").getLines.map(_.split(",").map(idx))
This approach allows for defining the indices of interest at runtime (non hard-coding).