java heap space error when converting csv to json but no error with d3.csv() - scala

Platform being used: Apache Zeppelin
Language: scala, javascript
I use d3js to read a csv file of size ~40MB and it works perfectly fine with the below code:
<script type="text/javascript">
d3.csv("test.csv", function(data) {
// data is JSON array. Do something with data;
console.log(data);
});
</script>
Now, the idea is to avoid d3js and instead construct the JSON array in Scala, then access this variable in the JavaScript code through z.angularBind(). Both of the snippets below work for smaller files, but give a java heap space error for the 40MB CSV file. What I am unable to understand is: when d3.csv() can do the job without any heap space error, why can't these two snippets?
Edited Code 1: Using Scala's BufferedReader with org.json
import java.io.BufferedReader;
import java.io.FileReader;
import org.json._

var br = new BufferedReader(new FileReader("/root/test.csv"))
var contentLine = br.readLine();
var keys = contentLine.split(",")   // first line holds the column names
contentLine = br.readLine();
var ja = new JSONArray();
while (contentLine != null) {
  var splits = contentLine.split(",")
  var jo = new JSONObject()         // one JSONObject per CSV row
  for (i <- 0 to splits.length - 1) {
    jo.put(keys(i), splits(i));
  }
  ja.put(jo);
  contentLine = br.readLine();
}
//z.angularBind("ja",ja.toString()) //ja can be accessed now in javascript (EDITED-10/11/15)
Edited Code 2:
I thought the heap space issue might go away if I used Apache Spark to construct the JSON array, as in the code below, but this one gives a heap space error too:
def myf(keys: Array[String], value: String): String = {
  var splits = value.split(",")
  var jo = new JSONObject()
  for (i <- 0 to splits.length - 1) {
    jo.put(keys(i), splits(i));
  }
  return jo.toString()
}
val csv = sc.textFile("/root/test.csv")
val firstrow = csv.first
val header = firstrow.split(",")
val data = csv.filter(x => x != firstrow)
var g = data.map(value => myf(header, value)).collect() // collect() pulls every row back to the driver
// EDITED BELOW 2 LINES-10/11/15
//var ja= g.mkString("[", ",", "]")
//z.angularBind("ja",ja) //ja can be accessed now in javascript

You are creating JSON objects. They are not native to Java/Scala and will therefore take up more space in that environment. What does z.angularBind() really do?
Also, what is the heap size of your JavaScript environment (see https://www.quora.com/What-is-the-maximum-size-of-a-JavaScript-object-in-browser-memory for Chrome) and of your Java environment (see "How is the default java heap size determined?")?
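If the goal is just a JSON string for z.angularBind(), one way to cut the per-row overhead is to skip org.json entirely and build the string directly. This is only a rough sketch, not the asker's code, and it assumes the CSV has no embedded commas, quotes, or escaping needs:
import scala.io.Source
// Sketch: emit each row as a JSON string instead of a JSONObject,
// so the only large object kept on the heap is the final string.
val lines = Source.fromFile("/root/test.csv").getLines()
val keys = lines.next().split(",")
val rows = lines.map { line =>
  val splits = line.split(",")
  keys.zip(splits)
    .map { case (k, v) => "\"" + k + "\":\"" + v + "\"" }
    .mkString("{", ",", "}")
}
val ja = rows.mkString("[", ",", "]") // the whole JSON array as one string
// z.angularBind("ja", ja)            // bind as in the original code
Even then, a ~40MB CSV expands to a noticeably larger JSON string, so raising the interpreter's JVM heap (for example via spark.driver.memory, or Zeppelin's memory settings) may still be necessary.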
Update: Removed the original part of the answer where I misunderstood the question

Related

Apache Spark Data Generator Function on Databricks Not working

I am trying to execute the Data Generator function provided by Microsoft to test streaming data to Event Hubs.
Unfortunately, I keep on getting the error
Processing failure: No such file or directory
When I try and execute the function:
%scala
DummyDataGenerator.start(15)
Can someone take a look at the code and help decipher why I'm getting the error:
class DummyDataGenerator:
streamDirectory = "/FileStore/tables/flight"
None # suppress output
I'm not sure how the above cell gets called into the function DummyDataGenerator
%scala
import scala.util.Random
import java.io._
import java.time._

// Notebook #2 has to set this to 8, we are setting
// it to 200 to "restore" the default behavior.
spark.conf.set("spark.sql.shuffle.partitions", 200)

// Make the username available to all other languages.
// WARNING: use of the "current" username is unpredictable
// when multiple users are collaborating and should be replaced
// with the notebook ID instead.
val username = com.databricks.logging.AttributionContext.current.tags(com.databricks.logging.BaseTagDefinitions.TAG_USER);
spark.conf.set("com.databricks.training.username", username)

object DummyDataGenerator extends Runnable {
  var runner : Thread = null;
  val className = getClass().getName()
  val streamDirectory = s"dbfs:/tmp/$username/new-flights"
  val airlines = Array( ("American", 0.17), ("Delta", 0.12), ("Frontier", 0.14), ("Hawaiian", 0.13), ("JetBlue", 0.15), ("United", 0.11), ("Southwest", 0.18) )
  val reasons = Array("Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft")
  val rand = new Random(System.currentTimeMillis())
  var maxDuration = 3 * 60 * 1000 // default to three minutes

  def clean() {
    System.out.println("Removing old files for dummy data generator.")
    dbutils.fs.rm(streamDirectory, true)
    if (dbutils.fs.mkdirs(streamDirectory) == false) {
      throw new RuntimeException("Unable to create temp directory.")
    }
  }

  def run() {
    val date = LocalDate.now()
    val start = System.currentTimeMillis()
    while (System.currentTimeMillis() - start < maxDuration) {
      try {
        val dir = s"/dbfs/tmp/$username/new-flights"
        val tempFile = File.createTempFile("flights-", "", new File(dir)).getAbsolutePath()+".csv"
        val writer = new PrintWriter(tempFile)
        for (airline <- airlines) {
          val flightNumber = rand.nextInt(1000)+1000
          val deptTime = rand.nextInt(10)+10
          val departureTime = LocalDateTime.now().plusHours(-deptTime)
          val (name, odds) = airline
          val reason = Random.shuffle(reasons.toList).head
          val test = rand.nextDouble()
          val delay = if (test < odds)
            rand.nextInt(60)+(30*odds)
          else rand.nextInt(10)-5
          println(s"- Flight #$flightNumber by $name at $departureTime delayed $delay minutes due to $reason")
          writer.println(s""" "$flightNumber","$departureTime","$delay","$reason","$name" """.trim)
        }
        writer.close()
        // wait a couple of seconds
        //Thread.sleep(rand.nextInt(5000))
      } catch {
        case e: Exception => {
          printf("* Processing failure: %s%n", e.getMessage())
          return;
        }
      }
    }
    println("No more flights!")
  }

  def start(minutes:Int = 5) {
    maxDuration = minutes * 60 * 1000
    if (runner != null) {
      println("Stopping dummy data generator.")
      runner.interrupt();
      runner.join();
    }
    println(s"Running dummy data generator for $minutes minutes.")
    runner = new Thread(this);
    runner.run();
  }

  def stop() {
    start(0)
  }
}
DummyDataGenerator.clean()
displayHTML("Imported streaming logic...") // suppress output
You should be able to use the Databricks Labs Data Generator on the Databricks community edition. I'm providing the instructions below.
Running Databricks Labs Data Generator on the community edition
The Databricks Labs Data Generator is a Pyspark library so the code to generate the data needs to be Python. But you should be able to create a view on the generated data and consume it from Scala if that's your preferred language.
You can install the framework on the Databricks community edition by creating a notebook with the cell
%pip install git+https://github.com/databrickslabs/dbldatagen
Once it's installed, you can use the library to define a data generation spec and, by calling build, generate a Spark dataframe from it.
The following example shows generation of batch data similar to the data set you are trying to generate. It should be placed in a separate notebook cell.
Note: here we generate 10 million records to illustrate the ability to create larger data sets; the library can generate datasets much larger than that.
%python
import dbldatagen as dg

num_rows = 10 * 1000000  # number of rows to generate
num_partitions = 8 # number of Spark dataframe partitions
delay_reasons = ["Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft"]
# will have implied column `id` for ordinal of row
flightdata_defn = (dg.DataGenerator(spark, name="flight_delay_data", rows=num_rows, partitions=num_partitions)
.withColumn("flightNumber", "int", minValue=1000, uniqueValues=10000, random=True)
.withColumn("airline", "string", minValue=1, maxValue=500, prefix="airline", random=True, distribution="normal")
.withColumn("original_departure", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
.withColumn("delay_minutes", "int", minValue=20, maxValue=600, distribution=dg.distributions.Gamma(1.0, 2.0))
.withColumn("delayed_departure", "timestamp", expr="cast(original_departure as bigint) + (delay_minutes * 60) ", baseColumn=["original_departure", "delay_minutes"])
.withColumn("reason", "string", values=delay_reasons, random=True)
)
df_flight_data = flightdata_defn.build()
display(df_flight_data)
You can find information on how to generate streaming data in the online documentation at https://databrickslabs.github.io/dbldatagen/public_docs/using_streaming_data.html
You can create a named temporary view over the data so that you can access it from SQL or Scala using one of two methods (a short Scala access sketch follows the two options):
1: use createOrReplaceTempView
df_flight_data.createOrReplaceTempView("delays")
2: use options for build. In this case the name passed to the DataGenerator initializer will be the name of the view, i.e.
df_flight_data = flightdata_defn.build(withTempView=True)
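Once the view exists, reading it from Scala is just a lookup by name. A minimal sketch, assuming the view is called delays as in option 1:
%scala
// Minimal sketch: consume the temp view created from the Python cell.
val delaysDf = spark.table("delays")   // or spark.sql("SELECT * FROM delays")
display(delaysDf.limit(10))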
This code will not work on the community edition because of this line:
val dir = s"/dbfs/tmp/$username/new-flights"
as there is no DBFS fuse mount on the Databricks community edition (it's supported only on full Databricks). It's potentially possible to make it work by:
Changing that directory to a local directory, such as /tmp
Adding code (after writer.close()) to list the flights-* files in that local directory and using dbutils.fs.mv to move them into streamDirectory, as sketched below
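A rough sketch of what that modification to run() could look like, assuming a hypothetical local scratch directory /tmp/new-flights (any local path would do):
// Sketch only, not the original code: write locally, then move to DBFS.
val localDir = "/tmp/new-flights"                       // hypothetical local scratch directory
new java.io.File(localDir).mkdirs()
val tempFile = java.io.File.createTempFile("flights-", ".csv", new java.io.File(localDir))
val writer = new java.io.PrintWriter(tempFile)
// ... write the generated rows exactly as before ...
writer.close()
// After closing the writer, move every flights-* file into the DBFS stream directory.
new java.io.File(localDir)
  .listFiles()
  .filter(_.getName.startsWith("flights-"))
  .foreach { f =>
    dbutils.fs.mv("file:" + f.getAbsolutePath, streamDirectory + "/" + f.getName)
  }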

Download all the files from a s3 bucket using scala

I tried the code below, which downloads one file successfully, but I am unable to download the whole list of files.
client.getObject(
new GetObjectRequest(bucketName, "TestFolder/TestSubfolder/Psalm/P.txt"),
new File("test.txt"))
Thanks in advance
Update
I tried the code below, but it returns a list of directories; I want a list of files instead.
val listObjectsRequest = new ListObjectsRequest().
withBucketName("tivo-hadoop-dev").
withPrefix("prefix").
withDelimiter("/")
client.listObjects(listObjectsRequest).getCommonPrefixes
It's a simple thing, but I struggled quite a bit before arriving at the answer below.
I found Java code, converted it to Scala, and it worked.
import java.io.File
import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{GetObjectRequest, ListObjectsRequest}

val client = new AmazonS3Client(credentials)
val listObjectsRequest = new ListObjectsRequest().
  withBucketName("bucket-name").
  withPrefix("path/of/dir").
  withDelimiter("/")
var objects = client.listObjects(listObjectsRequest)
var hasMore = true
do {
  // getObjectSummaries() returns the object keys (files), not the common prefixes (directories)
  for (objectSummary <- objects.getObjectSummaries().asScala) {
    var key = objectSummary.getKey()
    println(key)
    var arr = key.split("/")
    var file_name = arr(arr.length - 1)
    client.getObject(
      new GetObjectRequest("bucket-name", key),
      new File("some/path/" + file_name))
  }
  hasMore = objects.isTruncated()
  if (hasMore) objects = client.listNextBatchOfObjects(objects)
} while (hasMore)
The code below is fast and useful, especially when you want to download all objects into a specific local directory. It keeps the files under the exact same S3 prefix hierarchy.
import java.io.File
import com.amazonaws.services.s3.model.{ObjectListing, S3ObjectSummary}
import com.amazonaws.services.s3.transfer.{MultipleFileDownload, TransferManager, TransferManagerBuilder}

val xferMgrForAws: TransferManager = TransferManagerBuilder.standard().withS3Client(awsS3Client).build()
val objectListing: ObjectListing = awsS3Client.listObjects(awsBucketName, prefix)
val summaries: java.util.List[S3ObjectSummary] = objectListing.getObjectSummaries()
if (summaries.size() > 0) {
  // downloadDirectory recreates the S3 prefix hierarchy under localDirPath
  val xfer: MultipleFileDownload = xferMgrForAws.downloadDirectory(awsBucketName, prefix, new File(localDirPath))
  xfer.waitForCompletion()
  println("All files downloaded successfully!")
} else {
  println("No object present in the bucket!")
}

Gatling & Scala : How to split values in loop?

I want to split some values in a loop. I used the split method in a check and it works for me. But there are more than 25 values of two different types.
So I am trying to implement a loop in Scala and struggling.
Consider the following scenario:
import scala.concurrent.duration._
import io.gatling.core.Predef._
import io.gatling.http.Predef._
class testSimulation extends Simulation {
val httpProtocol = http
.baseURL("https://website.com")
.doNotTrackHeader("1")
.disableCaching
val uri1 = "https://website.com"
val scn = scenario("EditAttribute")
.exec(http("LogIn")
.post(uri1 + "/web/guest/")
.headers(headers_0)
.exec(http("getPopupData")
.post("/website/getPopupData")
.check(jsonPath("$.data[0].pid").transform(_.split('#').toSeq).saveAs("pID"))) // Saving splited value
.exec(http("Listing")
.post("/website/listing")
.check(jsonPath("$.data[*].AdId").findAll.saveAs("aID")) // All values are collected in vector
// .check(jsonPath("$.data[*].AdId").transform(_.split('#').toSeq).saveAs("aID")) // Split method Not working for batch
// .check(jsonPath("$.data[*].AdId").findAll.saveAs("aID")) // To verify the length of array (vector)
.check(jsonPath("$.data[0].RcId").findAll.saveAs("rID")))
.exec(http("UpdatedDataListing")
.post("/website/search")
.formParam("entityTypeId", "${pId(0)}") // passing splited value, perfectly done
.formParam("action_id", "${aId(0)},${aId(1)},${aId(2)},..and so on) // need to pass splitted values which is not happening
.formParam("userId", "${rID}")
// To verify values on console (What value I m getting after splitting)...
.exec( session => {
val abc = session("pID").as[Seq[String]]
val xyz = session("aID").as[Seq[String]]
println("Separated pId ===> " +abc(0)) // output - first splitted value
println("Separated pId ===> " +abc(1)) // split separater
println("Separated pId ===> " +abc(2)) // second splitted value
println("Length ===> " +abc.length) // output - 3
println("Length ===> " +xyz.length) // output - 25
session
}
)
.exec(http("logOut")
.get("https://" + uri1 + "/logout")
.headers(headers_0))
setUp(scn.inject(atOnceUsers(1))).protocols(httpProtocol)
}
I want to implement a loop which splits all (25) values in the session. I do not want to hard-code them.
I am a newbie to Scala and Gatling as well.
Since it is a session function, the snippet below should give you a direction to continue; use split just like you would in Java:
exec { session =>
  var requestIdValue = new scala.util.Random().nextInt(Integer.MAX_VALUE).toString();
  var length = jobsQue.length
  try {
    var reportElement = jobsQue.pop()
    jobData = reportElement.getData;
    xml = Configuration.XML.replaceAll("requestIdValue", requestIdValue);
    println(s"For Request Id : $requestIdValue .Data Value from feeder is : $jobData Current size of jobsQue : $length");
  } catch {
    case e: NoSuchElementException => print("Error")
  }
  session.setAll(
    "xmlRequest" -> xml)
}
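To address the original question of passing all 25 values without hard-coding ${aId(0)}, ${aId(1)}, and so on, one option is to join the whole saved Seq in a session function and reference the joined attribute afterwards. A minimal sketch, assuming the Gatling 2.x session API and the aID / pID / rID attributes saved above (the attribute name aIDs is made up for illustration):
// Minimal sketch: join every saved aID value into one attribute, then use it.
.exec { session =>
  val allAIds = session("aID").as[Seq[String]]     // the vector saved by findAll
  session.set("aIDs", allAIds.mkString(","))       // e.g. "12,34,56,..."
}
.exec(http("UpdatedDataListing")
  .post("/website/search")
  .formParam("entityTypeId", "${pID(0)}")
  .formParam("action_id", "${aIDs}")               // all 25 values, comma separated
  .formParam("userId", "${rID}"))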

Scala : How to use variable in for loop outside loop block

How can I create a DataFrame from all my JSON files, when after reading each file I need to add the file name as a field in the DataFrame? It seems a variable declared inside the for loop is not recognized outside the loop. How can I overcome this issue?
for (jsonfilename <- fileArray) {
  var df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
// trying to create temp table from the dataframe created in the loop
tblLanding.registerTempTable("LandingTable") // ERROR here, can't resolve tblLanding
Thanks in advance
Hossain
I think you are new to programming itself.
Anyway, here you go.
Basically, you declare the variable with its type and initialise it before the loop.
var df: DataFrame = null
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
df.registerTempTable("LandingTable") // df is now visible outside the loop
Update
OK, you are completely new to programming, even loops.
Suppose fileArray has the values [1.json, 2.json, 3.json, 4.json].
The loop then actually creates 4 dataframes, by reading 4 json files.
Which one do you want to register as a temp table?
If all of them:
var df: DataFrame = null
var count = 0
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
  tblLanding.registerTempTable(s"LandingTable_$count") // register the frame that includes source_file_name
  count += 1 // Scala has no ++ operator
}
And the reason for df being empty before this update is that your fileArray is empty or Spark failed to read the file. Print it and check.
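For example, a quick sanity check before the loop (just a sketch):
// Sketch: confirm fileArray actually contains the expected paths before looping.
println(s"fileArray has ${fileArray.length} entries: ${fileArray.mkString(", ")}")
require(fileArray.nonEmpty, "fileArray is empty - nothing to read")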
To query any of the registered landing tables:
val df2 = hivecontext.sql("SELECT * FROM LandingTable_0")
Update
The question has changed to making a single DataFrame from all the json files.
var dataFrame: DataFrame = null
for (jsonfilename <- fileArray) {
  val eachDataFrame = hivecontext.read.json(jsonfilename)
  if (dataFrame == null)
    dataFrame = eachDataFrame
  else
    dataFrame = eachDataFrame.unionAll(dataFrame)
}
dataFrame.registerTempTable("LandingTable")
Ensure that fileArray is not empty and that all json files in fileArray have the same schema.
// Create list of dataframes with source-file-names
val dfList = fileArray.map { filename =>
  hivecontext.read.json(filename)
    .withColumn("source_file_name", lit(filename))
}
// union the dataframes (assuming all are same schema)
val df = dfList.reduce(_ unionAll _) // or use union if spark 2.x
// register as table
df.registerTempTable("LandingTable")

MongoDB 3.2 C# driver version 2.2.3.3 Gridfs Download large files more than 2gb

I am uploading files using the following code:
using (var s = File.OpenRead(@"C:\2gbDataTest.zip"))
{
var t = Task.Run<ObjectId>(() =>
{
return fs.UploadFromStreamAsync("2gbDataTest.zip", s);
});
return t.Result;
}
//works for the files below 2gb
var t1 = fs.DownloadAsBytesAsync(id);
Task.WaitAll(t1);
var bytes = t1.Result;
I am getting an error.
I am new to MongoDB and C#. Can anyone please show me how to download files greater than 2GB in size?
You are hitting the limit on how large a byte array (kept in memory) can be, so your only option is to download to a Stream instead, as you already do when uploading. Something like (with a valid destination):
IGridFSBucket fs;
ObjectId id;
FileStream destination;
await fs.DownloadToStreamAsync(id, destination);
// Just writing complete code for others; this will work.
// Thanks to "Adam Comerford"
var fs = new GridFSBucket(database);
using (var newFs = new FileStream(filePathToDownload, FileMode.Create))
{
    // id is the file's ObjectId
    var t1 = fs.DownloadToStreamAsync(id, newFs);
    Task.WaitAll(t1);
    newFs.Flush();
    newFs.Close();
}