Scala/Spark: use text in the program directly without saving it to a file - scala

My code saves the val s into result.txt and then reads that file back in.
I want to know whether there is a way to run the code directly, without saving to another file and reading it back.
I tried val textFile = sc.parallelize(s), but the next part fails with the error: value contains is not a member of Char.
import java.io._

val s = R.capture("lines")
val resultPath = "/home/user"
val pw = new PrintWriter(new File(f"$resultPath%s/result.txt"))
pw.write(s)
pw.close()

// old method: save into a file and read it back
//val textFile = sc.textFile(f"$resultPath%s/result.txt")

val textFile = sc.parallelize(s)
val rows = textFile.map { line =>
  !(line contains "[, 1]")   // this is the line where the compiler reports the error
  val fields = line.split("[^\\d.]+")
  (fields(0), fields(1).toDouble)
}

I would have to say the problem you are having is that the variable s is a String and you are calling parallelize on a String instead of a collection, so sc.parallelize(s) gives you an RDD[Char]. When you run the map function it therefore iterates over each character of the String, and Char has no contains method.
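One way around this (a minimal sketch, assuming s holds the same newline-separated text that would otherwise be written to result.txt, and reusing the regex from the question) is to split the String into lines before parallelizing:
// Splitting first gives an RDD[String] instead of an RDD[Char]
val textFile = sc.parallelize(s.split("\n").toSeq)

val rows = textFile
  .filter(line => !(line contains "[, 1]"))   // keep only the data lines
  .map { line =>
    val fields = line.split("[^\\d.]+")
    (fields(0), fields(1).toDouble)
  }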

Related

How to store values into a dataframe from a List using Scala for handling nested JSON data

I have the below code, where I am pulling data from an API and storing it into a JSON file; later I will load it into an Oracle table. Also, the value of the ID column is the column name under velocityStatEntries. With the code below I am able to print the data in completedEntries.values, but I need help putting it into one DataFrame and combining it with embedded_df.
import java.io.{File, FileWriter}
import org.apache.spark.sql.functions.{col, explode}

// Read the API response and write it to a JSON file
val inputStream = scala.io.Source.fromInputStream(connection.getInputStream).mkString
val fileWriter1 = new FileWriter(new File(filename))
fileWriter1.write(inputStream)
fileWriter1.close()

val json_df = spark.read.option("multiLine", true).json(filename)
val embedded_df = json_df.select(explode(col("sprints")) as "x").select("x.*")

// Each ID is a column name under velocityStatEntries
val list_df = json_df.select("velocityStatEntries.*").columns.toList
for (i <- list_df) {
  val completed_df = json_df.select(s"velocityStatEntries.$i.completed.value")
  completed_df.show()
}
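This excerpt has no accepted answer, but one possible way to gather the completed values into a single DataFrame is to build one small DataFrame per ID and union them. This is only a sketch, assuming json_df and list_df are as defined above and that each velocityStatEntries.<id>.completed.value resolves to a single column; the lit/union combination is my own, not from the question:
import org.apache.spark.sql.functions.{col, lit}

// One (id, completed_value) DataFrame per entry, then union them together
val completedAll = list_df
  .map { id =>
    json_df.select(
      lit(id).as("id"),
      col(s"velocityStatEntries.$id.completed.value").as("completed_value"))
  }
  .reduce(_ union _)

completedAll.show()
// completedAll could then be joined with embedded_df if the two share a key column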

How to convert a text file string into a dictionary (Map) in one variable and extract a value by passing a key in Spark Scala?

I am reading a text file from the local file system. I want to convert the String to a dictionary (Map), store it in one variable, and extract a value by passing a key. I am new to Spark and Scala.
scala>val file = sc.textFile("file:///test/prod_details.txt");
scala> file.foreach(println)
{"00000006-0000-0000": "AWS", "00000009-0000-0000": "JIRA", "00000010-0000-0000-0000": "BigData", "00000011-0000-0000-0000": "CVS"}
scala> val rowRDD=file.map(_.split(","))
Expected result: if I pass the key "00000010-0000-0000-0000", the function should return the value "BigData".
Since your file is in JSON format and is not big, you can read it with the Spark JSON reader and then extract the keys and values:
val df = session.read.json("path to file")
val keys = df.columns
val values = df.collect().last.toSeq
val map = keys.zip(values).toMap
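A quick usage check with the keys from the question (assuming the map built above; since Row.toSeq yields Any, the values come back as Any and may need a cast):
map("00000010-0000-0000-0000")        // expected: BigData
map.get("00000011-0000-0000-0000")    // Some(CVS); get returns an Option for possibly missing keys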

Code for counting the number of records in a file in HDFS

I have already read a file from HDFS using FileSystem, and I need to count the number of records in the file. Can you help me count the records for the code below?
import org.apache.hadoop.fs.FSDataInputStream
import org.apache.commons.io.IOUtils  // assuming Apache Commons IO

val inputStream: FSDataInputStream = fileSystem.open(dataFile)
val data = IOUtils.toString(inputStream, "UTF-8")
inputStream.close()
I am assuming that by record count you mean the number of lines.
You can use java.io.BufferedReader to read the input stream line by line, incrementing a counter variable:
import java.io.BufferedReader
import java.io.InputStreamReader
import org.apache.hadoop.fs.FSDataInputStream

var count = 0
val inputStream: FSDataInputStream = fileSystem.open(dataFile)
val reader: BufferedReader = new BufferedReader(new InputStreamReader(inputStream))
var line: String = reader.readLine()
while (line != null) {
  count += 1
  line = reader.readLine()
}
reader.close()
Alternatively, you can use reader.lines().count() to get the line count, but then you will not be able to reuse the input stream to read the actual data, since the stream is consumed.
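A minimal sketch of that alternative (same fileSystem and dataFile as above; BufferedReader.lines() returns a Java Stream, so the count comes back as a Long):
import java.io.{BufferedReader, InputStreamReader}

val in = fileSystem.open(dataFile)
val reader = new BufferedReader(new InputStreamReader(in))
val count: Long = reader.lines().count()   // consumes the reader; reopen the file to read the data afterwards
reader.close()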

Spark: read a CSV file from S3 using Scala

I am writing a Spark job that reads a text file using Scala. The following works fine on my local machine:
import scala.io.Source

val myFile = "myLocalPath/myFile.csv"
for (line <- Source.fromFile(myFile).getLines()) {
  val data = line.split(",")
  myHashMap.put(data(0), data(1).toDouble)
}
Then I tried to make it work on AWS with the following, but it did not seem to read the entire file properly. What is the proper way to read such a text file from S3? Thanks a lot!
val credentials = new BasicAWSCredentials("myKey", "mySecretKey")
val s3Client = new AmazonS3Client(credentials)
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myFile.csv"))
val reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()))
var line = ""
// Note: this Java-style idiom does not work in Scala, because an assignment
// evaluates to Unit, so the condition never tests the line that was just read.
while ((line = reader.readLine()) != null) {
  val data = line.split(",")
  myHashMap.put(data(0), data(1).toDouble)
  println(line)
}
I think I got it to work like below:
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"))
val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
for (line <- myData) {
  val data = line.split(",")
  myMap.put(data(0), data(1).toDouble)
}
println(" my map : " + myMap.toString())
Read in the CSV file with val data = sc.textFile("s3://myBucket/myFile.csv"). That will give you an RDD[String]. Get that into a map:
val myHashMap = data.collect()
  .map { line =>
    val substrings = line.split(",")   // split on comma, since the file is a CSV
    (substrings(0), substrings(1).toDouble)
  }
  .toMap
You can then use sc.broadcast to broadcast your map, so that it is readily available on all your worker nodes, as sketched below.
(Note that you can of course also use the Databricks "spark-csv" package to read in the CSV file if you prefer.)
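A minimal sketch of that broadcast step (myHashMap as built above; someKeysRdd is a hypothetical RDD of keys, not from the question):
// Broadcast the lookup map once; each task then reads its local copy
val broadcastMap = sc.broadcast(myHashMap)

val looked = someKeysRdd.map(key => broadcastMap.value.getOrElse(key, 0.0))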
This can be achieved even without importing the Amazon S3 libraries, using SparkContext textFile. Use the code below:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration

val s3Login = "s3://AccessKey:Securitykey@Externalbucket"
val filePath = s3Login + "/Myfolder/myscv.csv"
for (line <- sc.textFile(filePath).collect()) {
  val data = line.split(",")
  val value1 = data(0)
  val value2 = data(1).toDouble
}
In the above code, sc.textFile reads the file into an RDD of lines, and collect brings those lines to the driver. Inside the loop each line is split on "," into the array data, and you can then access the values from that array by index.
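If you would rather not put the access and secret keys in the path itself, the same read can be configured through the Hadoop configuration instead; a sketch, assuming the s3a connector (hadoop-aws) is available and the key and bucket names are placeholders:
// Supply S3 credentials via the Hadoop configuration instead of the URL
sc.hadoopConfiguration.set("fs.s3a.access.key", "AccessKey")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "Securitykey")

val pairs = sc.textFile("s3a://Externalbucket/Myfolder/myscv.csv").map { line =>
  val data = line.split(",")
  (data(0), data(1).toDouble)
}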

Write and read raw byte arrays in Spark - using SequenceFile

How do you write RDD[Array[Byte]] to a file using Apache Spark and read it back again?
A common problem seems to be getting a weird cannot-cast exception from BytesWritable to NullWritable. Another common problem is that BytesWritable.getBytes does not give you just your bytes: it returns the whole backing buffer, i.e. your bytes followed by a run of padding zeros on the end. You have to use copyBytes instead.
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.rdd.RDD

val rdd: RDD[Array[Byte]] = ???

// To write (codecOpt is an optional compression codec, Option[Class[_ <: CompressionCodec]])
rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
  .saveAsSequenceFile("/output/path", codecOpt)

// To read
val loaded: RDD[Array[Byte]] = sc.sequenceFile[NullWritable, BytesWritable]("/input/path")
  .map(_._2.copyBytes())
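To see the padding concretely, here is a tiny check (my own illustration, not from the answer):
import org.apache.hadoop.io.BytesWritable

val bw = new BytesWritable("foo".getBytes)
bw.setCapacity(16)        // simulate a reused, over-allocated backing buffer
bw.getBytes.length        // 16: the whole backing array, zero-padded past the real data
bw.copyBytes().length     // 3: exactly the bytes that were stored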
Here is a snippet with all the required imports that you can run from spark-shell, as requested by @Choix:
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.NullWritable
val path = "/tmp/path"
val rdd = sc.parallelize(List("foo"))
val bytesRdd = rdd.map{str => (NullWritable.get, new BytesWritable(str.getBytes) ) }
bytesRdd.saveAsSequenceFile(path)
val recovered = sc.sequenceFile[NullWritable, BytesWritable]("/tmp/path").map(_._2.copyBytes())
val recoveredAsString = recovered.map( new String(_) )
recoveredAsString.collect()
// result is: Array[String] = Array(foo)