Hello, I'm trying to read a JSON file and sort it with a template so that things end up in a specific order. Can I get some pointers on how to do it? - scala

I found out how to read from it, but I can't seem to find the information I need on how to order it using my own template and then write it to a different JSON file. I'm using Scala.

Usually, to transform data from one JSON file to another, you will need to parse it into some data structures in memory (case classes, Scala collections, etc.), transform them, and serialize them back to a file.
Circe is one of the most inefficient JSON parsers, especially when files need to be parsed. Its core parser works only with strings, which requires reading the whole file into RAM and converting it to a string from encoded bytes (usually UTF-8); even its alternative Jawn parser reads the whole file into a byte array, then converts it to a string, and only then starts parsing. Its formatter also has a lot of overhead: the whole output is serialized to a string or byte buffer before you can start writing it to a file.
A much better option is the circe-jackson integration, or better yet jackson-module-scala: both support reading from a FileInputStream and writing to a FileOutputStream (a sketch with jackson-module-scala is shown after the jsoniter-scala example below).
The most efficient Scala parser and serializer that can be used for buffered reading/writing from/to files is jsoniter-scala, and an example of parse-transform-serialize code with it is below.
Let we have a following content of the JSON file:
{
  "name": "John",
  "devices": [
    {
      "id": 1,
      "model": "HTC One X"
    }
  ]
}
And we are going to transform it to:
{
  "name": "John",
  "devices": [
    {
      "id": 1,
      "model": "HTC One X"
    },
    {
      "id": 2,
      "model": "iPhone X"
    }
  ]
}
Here is how we can do it with jsoniter-scala:
libraryDependencies ++= Seq(
  "com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-core" % "0.29.2" % Compile,
  "com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-macros" % "0.29.2" % Provided // required only at compile time
)
// import required packages
import java.io._
import com.github.plokhotnyuk.jsoniter_scala.macros._
import com.github.plokhotnyuk.jsoniter_scala.core._

// define your model that mimics the JSON format
case class Device(id: Int, model: String)
case class User(name: String, devices: Seq[Device])

// create a codec for the type that corresponds to the root of the JSON
implicit val codec = JsonCodecMaker.make[User](CodecMakerConfig())

// read & parse JSON from the file into your data structures
val user = {
  val fis = new FileInputStream("/tmp/input.json")
  try readFromStream(fis)
  finally fis.close()
}

// transform your data
val newUser = user.copy(devices = user.devices :+ Device(id = 2, model = "iPhone X"))

// write your transformed data to a JSON file
val fos = new FileOutputStream("/tmp/output.json")
try writeToStream(newUser, fos)
finally fos.close()
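For comparison, here is a rough sketch of the same transformation with the jackson-module-scala option mentioned above. The mapper setup is the standard Jackson API; the case classes and file paths are the same assumptions as in the jsoniter-scala example:
// a sketch with jackson-module-scala (same case classes and file paths as above)
import java.io.File
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class Device(id: Int, model: String)
case class User(name: String, devices: Seq[Device])

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule) // enables case class and Scala collection support

// read & parse JSON from the input file
val user = mapper.readValue(new File("/tmp/input.json"), classOf[User])
// transform
val newUser = user.copy(devices = user.devices :+ Device(id = 2, model = "iPhone X"))
// write the transformed data to the output file
mapper.writeValue(new File("/tmp/output.json"), newUser)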

Your question is very abstract, but here's a good library for JSON parsing and manipulation in Scala:
https://github.com/circe/circe
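As a rough sketch of what that might look like for the example data above (assuming the circe-core, circe-generic, and circe-parser modules are on the classpath):
import io.circe.parser.decode
import io.circe.generic.auto._ // derives encoders/decoders for the case classes
import io.circe.syntax._       // provides .asJson

case class Device(id: Int, model: String)
case class User(name: String, devices: Seq[Device])

val input = """{"name":"John","devices":[{"id":1,"model":"HTC One X"}]}"""

// Either[io.circe.Error, String] holding the transformed, pretty-printed JSON
val output = decode[User](input).map { user =>
  user.copy(devices = user.devices :+ Device(id = 2, model = "iPhone X")).asJson.spaces2
}
With circe-generic the fields are emitted in the declaration order of the case class, which is one simple way to control the order of keys in the output.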

Related

Create a gatling custom feeder for large json data files

I am new to Gatling and Scala, and I am trying to create a test that has a custom 'feeder' which would allow each load-test thread to use (and reuse) one of about 250 JSON data files as a POST payload.
Each post payload file has 1000 records of this form:
[{
  "zip": "66221-2115",
  "recordId": "18378e10-e046-4ad3-9293-0847f8a05b2f",
  "firstName": "ANGELA",
  "lastName": "MADEUP",
  "city": "Springfield",
  "street": "123 Fake St",
  "state": "KS",
  "email": "AMADEUP@GMAIL.COM"
},
...
]
(files are about 250kB each)
Ideally, I would like to read them in at the start of the test kind of like this:
int fileCount = 3;
ClassLoader classLoader = getClass().getClassLoader();
List<File> files = new ArrayList<>();
for (int i = 0; i <= fileCount; i++) {
    String fileName = String.format("identityMatching/address_data_%d.json", i);
    File file = new File(classLoader.getResource(fileName).getFile());
    files.add(file);
}
and then get the file contents with something like:
FileUtils.readFileToString(files.get(1), StandardCharsets.UTF_8)
I am now fiddling with getting this code working in Scala, but I am wondering a couple of things:
1) Can I make this code into a feeder so that I can use it like a CSV feeder?
2) When should I load the JSON from the files into memory? At the start of the test, or when each thread needs the data?
I haven't received any answers, so I will post what I have learned.
1) I was able to use a feeder with the file names in it (not the file content).
2) I think that the best approach for reading the data in is:
.body(RawFileBody(jsonMessage))
RawFileBody(path: Expression[String]), where path is the location of a file that will be uploaded as-is
(from https://gatling.io/docs/current/http/http_request)
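For anyone looking for the concrete shape of this, here is a rough sketch of how a file-name feeder plus RawFileBody can fit together in a Gatling simulation (Gatling 3.x Scala DSL assumed; the base URL, endpoint, and injection profile are made up for illustration):
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class IdentityMatchingSimulation extends Simulation {

  val httpProtocol = http.baseUrl("https://example.com") // hypothetical base URL

  // feeder holding only the file names, not the file contents
  val fileCount = 250
  val payloadFeeder = (0 until fileCount)
    .map(i => Map("payloadFile" -> s"identityMatching/address_data_$i.json"))
    .toArray
    .circular

  val scn = scenario("identity matching")
    .feed(payloadFeeder)
    .exec(
      http("post payload")
        .post("/identityMatching") // hypothetical endpoint
        .body(RawFileBody("${payloadFile}")) // the file is read and sent as-is
        .asJson
    )

  setUp(scn.inject(atOnceUsers(10))).protocols(httpProtocol)
}
This way the feeder only carries file names, and Gatling resolves and reads each file when it builds the request body.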

Converting a string that represents a list into an actual list in Jython?

I have a string in Jython that represents a JSON array of objects:
[{"datetime": 1570216445000, "type": "test"},{"datetime": 1570216455000, "type": "test2"}]
If I try to iterate over this though, it just iterates over each character. How can I make it iterate over the actual list so I can get each JSON object out?
Background info: this script is being run in Apache NiFi; below is the code that the string originates from:
from org.apache.commons.io import IOUtils
...
def process(self, inputStream):
    text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
You can parse JSON in Jython just like you would in Python.
Sample Code:
import json
# Sample JSON text
text = '[{"datetime": 1570216445000, "type": "test"},{"datetime": 1570216455000, "type": "test2"}]'
# Parse the JSON text
obj = json.loads(text)
# 'obj' is a list of dictionaries
print obj[0]['type']
print obj[1]['type']
Output:
> jython json_string_to_object.py
test
test2

Spark: run an external process in parallel

Is it possible with Spark to "wrap" and run an external process managing its input and output?
The process is a normal C/C++ application that usually runs from the command line. It accepts a plain text file as input and generates another plain text file as output. As I need to integrate the flow of this application with something bigger (also in Spark), I was wondering if there is a way to do this.
The process can easily be run in parallel (at the moment I use GNU Parallel) just by splitting its input into (for example) 10 part files, running 10 instances of it, and re-joining the 10 output part files into one file.
The simplest thing you can do is to write a simple wrapper which takes data from standard input, writes it to a file, executes the external program, and sends the results to standard output. After that, all you have to do is use the pipe method:
rdd.pipe("your_wrapper")
The only serious consideration is IO performance. If possible, it would be better to adjust the program you want to call so it can read and write data directly, without going through disk.
Alternatively, you can use mapPartitions combined with process and standard IO tools to write to a local file, call your program, and read the output.
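A minimal sketch of the pipe approach, assuming a hypothetical wrapper script /opt/tools/your_wrapper.sh that reads lines from stdin, writes them to a temporary file, runs the C/C++ program on it, and prints the program's output file to stdout:
import org.apache.spark.{SparkConf, SparkContext}

object PipeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipe-example"))

    // 10 partitions roughly correspond to the 10 parallel instances run with GNU Parallel
    val input = sc.textFile("hdfs:///data/input.txt", 10)

    // each partition's lines are streamed to the wrapper's stdin;
    // whatever the wrapper prints to stdout becomes the lines of the result RDD
    val output = input.pipe("/opt/tools/your_wrapper.sh")

    output.saveAsTextFile("hdfs:///data/output")
    sc.stop()
  }
}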
If you end up here based on the question title from a Google search, but you don't have the OP restriction that the external program needs to read from a file--i.e., if your external program can read from stdin--here is a solution. For my use case, I needed to call an external decryption program for each input file.
import org.apache.commons.io.IOUtils
import sys.process._
import scala.collection.mutable.ArrayBuffer

val showSampleRows = true

val bfRdd = sc.binaryFiles("/some/files/*,/more/files/*")
val rdd = bfRdd.flatMap { case (file, pds) =>  // pds is a PortableDataStream
  val rows = new ArrayBuffer[Array[String]]()
  var errors = List[String]()
  val io = new ProcessIO(
    in => {  // "in" is an OutputStream; write the encrypted contents of the
             // input file (pds) to this stream
      IOUtils.copy(pds.open(), in)  // open() returns a DataInputStream
      in.close
    },
    out => {  // "out" is an InputStream; read the decrypted data off this stream.
              // Even though this runs in another thread, we can write to rows, since it
              // is part of the closure for this function
      for (line <- scala.io.Source.fromInputStream(out).getLines) {
        // ...decode line here... for my data, it was pipe-delimited
        rows += line.split('|')
      }
      out.close
    },
    err => {  // "err" is an InputStream; read any errors off this stream
              // errors is part of the closure for this function
      errors = scala.io.Source.fromInputStream(err).getLines.toList
      err.close
    }
  )
  val cmd = List("/my/decryption/program", "--decrypt")
  val exitValue = cmd.run(io).exitValue  // blocks until the subprocess finishes
  println(s"-- Results for file $file:")
  if (exitValue != 0) {
    // TBD write to a string accumulator instead, so the driver can output errors
    // string accumulator from @zero323: https://stackoverflow.com/a/31496694/215945
    println(s"exit code: $exitValue")
    errors.foreach(println)
  } else {
    // TBD, you'll probably want to move this code to the driver, otherwise
    // unless you're using the shell, you won't see this output
    // because it will be sent to stdout of the executor
    println(s"row count: ${rows.size}")
    if (showSampleRows) {
      println("6 sample rows:")
      rows.slice(0, 6).foreach(row => println("  " + row.mkString("|")))
    }
  }
  rows
}
scala> :paste "test.scala"
Loading test.scala...
...
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[62] at flatMap at <console>:294
scala> rdd.count // action, causes Spark code to actually run
-- Results for file hdfs://path/to/encrypted/file1: // this file had errors
exit code: 255
ERROR: Error decrypting
my_decryption_program: Bad header data[0]
-- Results for file hdfs://path/to/encrypted/file2:
row count: 416638
sample rows:
<...first row shown here ...>
...
<...sixth row shown here ...>
...
res43: Long = 843039
References:
https://www.scala-lang.org/api/current/scala/sys/process/ProcessIO.html
https://alvinalexander.com/scala/how-to-use-closures-in-scala-fp-examples#using-closures-with-other-data-types

How to bundle many files in S3 using Spark

I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text containing between 0 (empty) and 100KB of text (95th percentile, although there are a few files that are up to several MBs).
Using Spark and Scala (I'm new to both and want to learn), I would like to save "daily bundles" (8000 of them), each containing whatever number of files were found for that day. Ideally I would like to store the original filenames as well as their content. The output should reside in S3 as well and be compressed, in some format that is suitable for input in further Spark steps and experiments.
One idea was to store bundles as a bunch of JSON objects (one per line and '\n'-separated), e.g.
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"}
{id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set that up elegantly.
Using the Spark shell, for example, I saw that it was very easy to read the files:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
// which will take forever
But how do I "intercept" the reader to provide the file name?
Or perhaps I should get an RDD of all the files, split by day, and in a reduce step write out K=filename, V=fileContent?
You can use this approach.
First, you can get a Buffer/List of S3 paths:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest

def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]

  // S3 client and list-objects request
  var s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  var listObjectsRequest = new ListObjectsRequest()

  // your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)
  // your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)

  // adding s3:// to the paths and adding them to a list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())

  // removing the base directory name
  files.remove(0)

  // converting to a Scala collection
  files.asScala
}
Now pass this list to the following piece of code (note: sc here is the SparkContext; since sc.textFile returns an RDD[String], the accumulated result is an RDD rather than a DataFrame):
import org.apache.spark.rdd.RDD

var df: RDD[String] = null
for (file <- files) {
  val fileRdd = sc.textFile(file)
  if (df != null) {
    df = df.union(fileRdd)
  } else {
    df = fileRdd
  }
}
Now you have a final unified RDD, i.e. df.
Optionally, you can also repartition it into a single big RDD:
val bigRdd = df.repartition(1)
Repartitioning always works :D
Have you tried something along the lines of sc.wholeTextFiles?
It creates an RDD where the key is the file name and the value is the content of the whole file as a String. You can then map this so the key is the file date, and then groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html
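A rough sketch of that idea in the Spark shell (where sc is the SparkContext), assuming the path layout from the question, s3n://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz; how each daily bundle is serialized afterwards is left open:
// key = full file path, value = whole file content as a String
val perFile = sc.wholeTextFiles("s3n://mybucket/path/txt/*/*/*/*.txt.gz")

val byDay = perFile
  .map { case (path, content) =>
    // e.g. ".../txt/1996/04/09/foo.txt.gz" -> "1996/04/09"
    val day = path.split("/").takeRight(4).dropRight(1).mkString("/")
    (day, (path, content))
  }
  .groupByKey() // one record per day: (day, Iterable[(fileName, fileContent)])
Note that, as the next answer points out, listing and reading 20 million small objects directly from S3 may itself be the bottleneck.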
At your scale, an elegant solution would be a stretch.
I would recommend against using sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz"), as it takes forever. What you can do is use AWS DistCp or something similar to move the files into HDFS. Once they are in HDFS, Spark is quite fast at ingesting the information in whatever way suits you.
Note that most of these processes require some sort of file list, so you'll need to generate that somehow. For 20 million files, creating this file list will be a bottleneck. I'd recommend creating a file that gets appended with the file path every time a file is uploaded to S3.
The same goes for the output: put it into HDFS and then move it to S3 (although a direct copy might be equally efficient).

store (binary) file - play framework using scala in heroku

I'm trying to store user-uploaded images in my application, which is written in Scala with Play Framework 2.2.x.
I've deployed my app on Heroku.
Heroku does not allow me to save files in the file system.
So I've tried to store the files in the database.
Here is the code that I use for storing an image:
def updateImage(id: Long, image: Array[Byte]) = {
  val selected = getById(id)
  DB.withConnection { implicit c =>
    SQL("update subcategory set image={image} where id = {id}")
      .on('id -> id, 'image -> image)
      .executeUpdate()
  }
  selected
}
And here is the code that I use to retrieve my image:
def getImageById(id: Long): Array[Byte] = DB.withConnection { implicit c =>
  val all = SQL("select image from subcategory where id = {id}").on('id -> id)().map {
    case Row(image: Array[Byte]) => image
    case Row(Some(image: Array[Byte])) => image
    case Row(image: java.sql.Blob) => image.getBytes(0, image.length().toInt)
  }
  all.head
}
The problem is: when I use an H2 database and a blob column, I get a "MatchError" exception.
When I use PostgreSQL and a bytea column, I get no error, but when I retrieve the image it's in hex format and some of the bytes at the beginning of the array are missing.
According to the PostgreSQL documentation, bytea stores the length of the array in the four bytes at the beginning of the array. These are stripped when you read the row, so that's why they seem to be "missing" when you compare the data in Scala with the data in the DB.
You will have to set the response's content-type to the appropriate value if you want the web browser to display the image correctly, as otherwise it does not know it is receiving image data. The Ok.sendFile helper does it for you. Otherwise you will have to do it by hand:
def getPicture = Action {
  SimpleResult(
    header = ResponseHeader(200),
    body = Enumerator(pictureByteArray)
  ).as(pictureContentType)
}
In the example above, pictureByteArray is the Array[Byte] containing the picture data from your database, and pictureContentType is a string with the appropriate content type (for example, image/jpeg).
This is all quite well explained in the Play documentation.
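If the bytes are already in memory, a shorter equivalent of the action above is possible. This is a sketch, assuming the same pictureByteArray and pictureContentType values, and relying on Play's built-in Writeable for Array[Byte]:
// minimal sketch: Play serializes the Array[Byte] body directly
def getPicture = Action {
  Ok(pictureByteArray).as(pictureContentType)
}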