Spark/Scala read hadoop file - scala

In a pig script I saved a table using PigStorage('|').
I have in the corresponding hadoop folder files like
part-r-00000
etc.
What is the best way to load it in Spark/Scala? In this table I have 3 fields: Int, String, Float.
I tried:
val text = sc.hadoopFile("file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
But then I would need to somehow split each line. Is there a better way to do it?
If I were coding in Python, I would create a DataFrame indexed by the first field, whose columns are the values found in the string field and whose coefficients are the float values. But I need to use Scala to use the PCA module, and Spark's DataFrames don't seem that close to Python's.
Thanks for the insight

PigStorage creates a text file without schema information, so you need to do that work yourself. Something like:
val csv = sc.textFile("file") // or the directory containing the part-r-* files
val data = csv.map { line =>
  val vals = line.split('|')   // note: split("|") would treat | as a regex alternation and split on every character
  (vals(0).toInt, vals(1), vals(2).toDouble)
}
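If you ultimately want a DataFrame (for example to feed Spark ML's PCA), a minimal sketch on top of the RDD above, assuming a SQLContext (or a SparkSession on Spark 2+) is in scope; the column names are placeholders:
import sqlContext.implicits._   // or spark.implicits._ on Spark 2+
val df = data.toDF("id", "word", "coef")   // data is the RDD of (Int, String, Double) built above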

Related

How to read a CSV file in Scala

I have a CSV file and I want to read it and store it in a case class. As I understand it, a CSV is a comma-separated values file, but in my case some of the data already contains commas, and that creates a new column for every comma. The problem is how to split the data correctly.
1st data
04/20/2021 16:20(1st column) Here a bunch of basic techniques that suit most businesses, and easy-to-follow steps that can help you create a strategy for your social media marketing goals.(2nd column)
2nd data
11-07-2021 12:15(1st column) Focus on attracting real followers who are genuinely interested in your content, and make the most of your social media marketing efforts.(2nd column)
import scala.io.Source

var i = 0
var length = 0
val data = Source.fromFile(file)
for (line <- data.getLines) {
  val cols = line.split(",").map(_.trim)
  length = cols.length
  while (i < length) {
    //println(cols(i))
    i = i + 1
  }
  i = 0
}
If you are reading a complex CSV file then the ideal solution is to use an existing library. Here is a link to the ScalaDex search results for CSV.
ScalaDex CSV Search
However, based on the comments, it appears that you might actually want to read data stored in a Google Sheet. If that is the case, you can take advantage of the fact that you have some flexibility in how you save the data to a text file. When I want to read data from a Google Sheet in Scala, my first approach is to save the file in a format that isn't hard to parse. If the fields have embedded commas but no tabs, which is common, I save the file as a TSV and parse it with split("\t").
A simple bit of code that only uses the standard library might look like the following:
val source = scala.io.Source.fromFile("data.tsv")
val data = source.getLines.map(_.split("\t")).toArray   // Array[Array[String]], one inner array per line
source.close()
After this, data will be an Array[Array[String]] with your data in it that you can process as you desire.
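If you then want the rows in a case class, as the question asks, one way to continue is a pattern match over each inner array; the field names below are invented for illustration:
case class Post(postedAt: String, text: String)   // hypothetical field names
val posts = data.collect { case Array(ts, body) => Post(ts, body) }   // keeps only rows with exactly two columns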
Of course, if your data includes both tabs and commas then you'll really want to use one of those more robust external libraries.
You could use the univocity CSV parser for better performance. You can also use it to write CSV files.
Univocity parsers
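For example, a small sketch with univocity-parsers; the settings shown are assumptions, so check the library's documentation for your version:
import java.io.FileReader
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
settings.setHeaderExtractionEnabled(false)   // set to true if the file has a header row
val parser = new CsvParser(settings)
val rows = parser.parseAll(new FileReader("data.csv"))   // java.util.List[Array[String]]; quoted commas are handled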

Saving pyspark dataframe with complicated schema in plain text for testing

How do I make clean test data for PySpark? I have figured something out that seems pretty good, but parts seem a little awkward, so I'm posting.
Let's say I have a dataframe df with a complicated schema and a small number of rows. I want test data checked into my repo, and I don't want a binary file. At this point I'm not sure of the best way to proceed, but I'm thinking I have a file like
test_fn.py
and it has this in it
schema_str='struct<eventTimestamp:timestamp,list_data:array<struct<valueA:string,valueB:string,valueC:boolean>>>'
to get the schema in text format, using the df.schema.simpleString() function. Then, to get the rows, I do
lns = [row.json_txt for row in df.select((F.to_json(F.struct('*'))).alias('json_txt')).collect()]
Now I put those lines in my test_fn.py file, or I could have a .json file in the repo.
Now, to run the test, I have to make a dataframe with the correct schema and data from this text. It seems the only way Spark will parse the simple string is if I create a dataframe with it; that is, I can't pass that simple string to the from_json function. So this is a little awkward, which is why I thought I'd post:
schema2 = spark.createDataFrame(data=[], schema=schema_str).schema
lns = # say I read the lns back from above
df_txt = spark.createDataFrame(data=lns, schema=T.StringType())
I see df_txt just has one column called 'value'
df_json = df_txt.select(F.from_json('value', schema=schema2).alias('xx'))
sel = ['xx.%s' % nm for nm in df_json.select('xx').schema.fields[0].dataType.fieldNames()]
df2 = df_json.select(*sel)
Now df2 should be the same as df, which I see is the case using the deepdiff module.

Spark: empty JSON files when reading from a directory

I'm reading from a path say /json//myfiles_.json
I'm then flattening the JSON using explode. This causes an error since I have some empty files. How do I tell it to ignore empty files or somehow filter them out?
I can detect individual files by checking whether the head is empty, but I need to do this across the collection of files that the wildcard path pulls into the dataframe.
So the answer seems to be that I need to provide a schema explicitly, because it can't infer one from an empty file, as you would expect!
e.g.
val schemadf = sqlContext.read.json(schemapath) //infer schema from file with data or do manually
val schema = schemadf.schema
val raw = sqlContext.read.schema(schema).json(monthfile)
val prep = raw.withColumn("MyArray", explode($"MyArray"))
.select($"ID", $"name", $"CreatedAt")
display(prep)
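Alternatively, if you don't want to depend on a sample file at all, you can build the schema by hand with StructType. A rough sketch, where the field types and the array element's fields are guesses based on the columns used above:
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("ID", StringType),
  StructField("name", StringType),
  StructField("CreatedAt", TimestampType),
  StructField("MyArray", ArrayType(StructType(Seq(
    StructField("someField", StringType)   // hypothetical element field
  ))))
))
val raw = sqlContext.read.schema(schema).json(monthfile)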

How to perform row and column filter operations in a csv file in scala

I am writing Scala scripts. I need to perform row filter operations, such as greater-than and less-than comparisons, on a CSV file. I have tried using the filter option in the script but am unable to get the results. Please let me know how to perform a filter operation on the CSV file. The sample data has been attached here for reference. Thanks in advance.
for (line <- bufferedSource.getLines) {
  cols += line.split(",").filter(csv => csv(1).toInt > 10000)
}
Instead of resorting to a for loop, use map. This code snippet should work:
bufferedSource.getLines.map(row => row.split(",")).filter(cols => cols(1).toInt > 10000).toList
Also, it's a better approach to use a case class for the CSV you are filtering to make your code more readable.
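For example, a hedged sketch assuming a hypothetical two-column layout (name, salary) with no header row:
case class Employee(name: String, salary: Int)   // invented columns for illustration

val rows = bufferedSource.getLines
  .map(_.split(",").map(_.trim))
  .collect { case Array(name, salary) => Employee(name, salary.toInt) }   // keeps only well-formed two-column rows
  .filter(_.salary > 10000)
  .toList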

Spark: RDD.saveAsTextFile when using a pair of (K,Collection[V])

I have a dataset of employees and their leave-records. Every record (of type EmployeeRecord) contains EmpID (of type String) and other fields. I read the records from a file and then transform into PairRDDFunctions:
val empRecords = sc.textFile(args(0))
....
val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)
At this point, 'empsGroupedByEmpID' is of type RDD[(String, Iterable[EmployeeRecord])]. I transform this into PairRDDFunctions:
val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)
Then I process the records as per the logic of the application. Finally, I get an RDD of type RDD[Iterable[EmployeeRecord]]:
val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>
When I try to write the contents of this RDD to a text file using the available API thus:
finalRecords.saveAsTextFile("./path/to/save")
I find that in the file every record begins with an ArrayBuffer(...). What I need is a file with one EmployeeRecord per line. Is that not possible? Am I missing something?
I have spotted the missing API. It is well...flatMap! :-)
By using flatMap with identity, I can get rid of the Iterable and 'unpack' the contents, like so:
finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")
That solves the problem I have been having.
I also found this post suggesting the same thing. I wish I had seen it a bit earlier.
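If you also want control over how each line is rendered, rather than relying on the case class's default toString, you can format inside a map first; a sketch, assuming EmployeeRecord is a case class:
finalRecords.flatMap(identity)
  .map(rec => rec.productIterator.mkString("|"))   // pipe-separated fields; productIterator assumes a case class
  .saveAsTextFile("./path/to/file")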