Spark out of memory over multiple iterations - Scala

I have a Spark job (running on Spark 1.3.1) that has to iterate over several keys (about 42) and process each one. Here is the structure of the program:
Get the key from a map
Fetch the data matching the key from Hive (Hadoop/YARN underneath) as a data frame
Process the data
Write the results to Hive
When I run this for one key, everything works fine. When I run it with 42 keys, I get an out-of-memory exception around the 12th iteration. Is there a way I can clean up memory between iterations? Help appreciated.
Here is the high-level code that I am working with.
public abstract class SparkRunnableModel {
    public static SparkContext sc = null;
    public static JavaSparkContext jsc = null;
    public static HiveContext hiveContext = null;
    public static SQLContext sqlContext = null;

    protected SparkRunnableModel(String appName) {
        // get the system properties to set up the model
        // Getting a Java Spark context object by using the constants
        SparkConf conf = new SparkConf().setAppName(appName);
        sc = new SparkContext(conf);
        jsc = new JavaSparkContext(sc);
        // Creating a Hive context object from the Spark context
        hiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
        // SQL context
        sqlContext = new SQLContext(sc);
    }

    public abstract void processModel(Properties properties) throws Exception;
}
class ModelRunnerMain(model: String) extends SparkRunnableModel(model: String) with Serializable {

  override def processModel(properties: Properties) = {
    val dataLoader = DataLoader.getDataLoader(properties)
    // loads the keys data frame from a keys table in Hive and converts it to a list
    val keysList = dataLoader.loadSeriesData()
    for (key <- keysList) {
      runModelForKey(key, dataLoader)
    }
  }

  def runModelForKey(key: String, dataLoader: DataLoader) = {
    // loads a data frame from a table (~50 cols x 800 rows) using "select * from table where key='<key>'"
    val keyDataFrame = dataLoader.loadKeyData()
    // filter this data frame into two data frames
    ...
    // join them to transpose
    ...
    // convert the data frame into an RDD
    ...
    // run map on the RDD to add a bunch of new columns
    ...
  }
}
My data frame size is under a megabyte, but I create several data frames from it by selecting, joining, etc. I assume all of these get garbage collected once the iteration is done.
Here is the configuration I am running with:
spark.eventLog.enabled:true
spark.broadcast.port:7086
spark.driver.memory:12g
spark.shuffle.spill:false
spark.serializer:org.apache.spark.serializer.KryoSerializer
spark.storage.memoryFraction:0.7
spark.executor.cores:8
spark.io.compression.codec:lzf
spark.shuffle.consolidateFiles:true
spark.shuffle.service.enabled:true
spark.master:yarn-client
spark.executor.instances:8
spark.shuffle.service.port:7337
spark.rdd.compress:true
spark.executor.memory:48g
spark.executor.id:
spark.sql.shuffle.partitions:700
spark.cores.max:56
Here is the exception I am getting.
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.util.io.ByteArrayChunkOutputStream.allocateNewChunkIfNeeded(ByteArrayChunkOutputStream.scala:66)
at org.apache.spark.util.io.ByteArrayChunkOutputStream.write(ByteArrayChunkOutputStream.scala:55)
at com.ning.compress.lzf.ChunkEncoder.encodeAndWriteChunk(ChunkEncoder.java:264)
at com.ning.compress.lzf.LZFOutputStream.writeCompressedBlock(LZFOutputStream.java:266)
at com.ning.compress.lzf.LZFOutputStream.write(LZFOutputStream.java:124)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:124)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:202)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:101)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:839)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1042)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply(DAGScheduler.scala:1039)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15$$anonfun$apply$1.apply(DAGScheduler.scala:1039)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15.apply(DAGScheduler.scala:1039)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$15.apply(DAGScheduler.scala:1038)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1038)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1390)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)

Using checkpoint() or localCheckpoint() to cut the Spark lineage can improve the performance of the application across iterations.
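For illustration, here is a minimal sketch of that idea, reusing names from the question (keysList, dataLoader, keyDataFrame, hiveContext); the checkpoint directory is a made-up path, and DataFrame.checkpoint() only exists from Spark 2.1 onwards, so on 1.3.x the underlying RDD is checkpointed instead:
// One-time setup: give Spark a reliable directory for checkpoint files
sc.setCheckpointDir("hdfs:///tmp/model-checkpoints")    // hypothetical path

for (key <- keysList) {
  val keyDataFrame = dataLoader.loadKeyData()           // as in the question

  // On Spark 2.1+ the lineage can be cut directly: keyDataFrame.checkpoint()
  // On Spark 1.3.x, checkpoint the underlying RDD and rebuild the DataFrame from it:
  val rdd = keyDataFrame.rdd
  rdd.checkpoint()
  rdd.count()                                           // an action is needed so the checkpoint is written
  val truncated = hiveContext.createDataFrame(rdd, keyDataFrame.schema)

  // ... filter/join/map on `truncated` and write the results to Hive ...

  // If anything was cached during this iteration, release it before the next key
  truncated.unpersist()
}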

Related

Apache Spark Data Generator Function on Databricks Not working

I am trying to execute the Data Generator function provided by Microsoft to test streaming data to Event Hubs.
Unfortunately, I keep getting the error
Processing failure: No such file or directory
When I try to execute the function:
%scala
DummyDataGenerator.start(15)
Can someone take a look at the code and help decipher why I'm getting the error:
class DummyDataGenerator:
    streamDirectory = "/FileStore/tables/flight"

None # suppress output
I'm not sure how the above cell gets called by the DummyDataGenerator function below.
%scala
import scala.util.Random
import java.io._
import java.time._

// Notebook #2 has to set this to 8, we are setting
// it to 200 to "restore" the default behavior.
spark.conf.set("spark.sql.shuffle.partitions", 200)

// Make the username available to all other languages.
// WARNING: use of the "current" username is unpredictable
// when multiple users are collaborating and should be replaced
// with the notebook ID instead.
val username = com.databricks.logging.AttributionContext.current.tags(com.databricks.logging.BaseTagDefinitions.TAG_USER);
spark.conf.set("com.databricks.training.username", username)

object DummyDataGenerator extends Runnable {
  var runner : Thread = null;
  val className = getClass().getName()
  val streamDirectory = s"dbfs:/tmp/$username/new-flights"
  val airlines = Array( ("American", 0.17), ("Delta", 0.12), ("Frontier", 0.14), ("Hawaiian", 0.13), ("JetBlue", 0.15), ("United", 0.11), ("Southwest", 0.18) )
  val reasons = Array("Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft")
  val rand = new Random(System.currentTimeMillis())
  var maxDuration = 3 * 60 * 1000 // default to three minutes

  def clean() {
    System.out.println("Removing old files for dummy data generator.")
    dbutils.fs.rm(streamDirectory, true)
    if (dbutils.fs.mkdirs(streamDirectory) == false) {
      throw new RuntimeException("Unable to create temp directory.")
    }
  }

  def run() {
    val date = LocalDate.now()
    val start = System.currentTimeMillis()
    while (System.currentTimeMillis() - start < maxDuration) {
      try {
        val dir = s"/dbfs/tmp/$username/new-flights"
        val tempFile = File.createTempFile("flights-", "", new File(dir)).getAbsolutePath()+".csv"
        val writer = new PrintWriter(tempFile)
        for (airline <- airlines) {
          val flightNumber = rand.nextInt(1000)+1000
          val deptTime = rand.nextInt(10)+10
          val departureTime = LocalDateTime.now().plusHours(-deptTime)
          val (name, odds) = airline
          val reason = Random.shuffle(reasons.toList).head
          val test = rand.nextDouble()
          val delay = if (test < odds)
            rand.nextInt(60)+(30*odds)
          else rand.nextInt(10)-5
          println(s"- Flight #$flightNumber by $name at $departureTime delayed $delay minutes due to $reason")
          writer.println(s""" "$flightNumber","$departureTime","$delay","$reason","$name" """.trim)
        }
        writer.close()
        // wait a couple of seconds
        //Thread.sleep(rand.nextInt(5000))
      } catch {
        case e: Exception => {
          printf("* Processing failure: %s%n", e.getMessage())
          return;
        }
      }
    }
    println("No more flights!")
  }

  def start(minutes:Int = 5) {
    maxDuration = minutes * 60 * 1000
    if (runner != null) {
      println("Stopping dummy data generator.")
      runner.interrupt();
      runner.join();
    }
    println(s"Running dummy data generator for $minutes minutes.")
    runner = new Thread(this);
    runner.run();
  }

  def stop() {
    start(0)
  }
}

DummyDataGenerator.clean()
displayHTML("Imported streaming logic...") // suppress output
You should be able to use the Databricks Labs Data Generator on the Databricks Community Edition. I'm providing the instructions below:
Running Databricks Labs Data Generator on the community edition
The Databricks Labs Data Generator is a PySpark library, so the code that generates the data needs to be Python. But you should be able to create a view over the generated data and consume it from Scala if that's your preferred language.
You can install the framework on the Databricks Community Edition by creating a notebook cell with:
%pip install git+https://github.com/databrickslabs/dbldatagen
Once it's installed, you can use the library to define a data generation spec and, by calling build, generate a Spark dataframe from it.
The following example shows the generation of batch data similar to the data set you are trying to generate. It should be placed in a separate notebook cell.
Note - here we generate 10 million records to illustrate the ability to create larger data sets; it can be used to generate datasets much larger than that.
%python
import dbldatagen as dg

num_rows = 10 * 1000000  # number of rows to generate
num_partitions = 8       # number of Spark dataframe partitions
delay_reasons = ["Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft"]

# will have implied column `id` for ordinal of row
flightdata_defn = (dg.DataGenerator(spark, name="flight_delay_data", rows=num_rows, partitions=num_partitions)
                   .withColumn("flightNumber", "int", minValue=1000, uniqueValues=10000, random=True)
                   .withColumn("airline", "string", minValue=1, maxValue=500, prefix="airline", random=True, distribution="normal")
                   .withColumn("original_departure", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
                   .withColumn("delay_minutes", "int", minValue=20, maxValue=600, distribution=dg.distributions.Gamma(1.0, 2.0))
                   .withColumn("delayed_departure", "timestamp", expr="cast(original_departure as bigint) + (delay_minutes * 60) ", baseColumn=["original_departure", "delay_minutes"])
                   .withColumn("reason", "string", values=delay_reasons, random=True)
                   )
df_flight_data = flightdata_defn.build()
display(df_flight_data)
You can find information on how to generate streaming data in the online documentation at https://databrickslabs.github.io/dbldatagen/public_docs/using_streaming_data.html
You can create a named temporary view over the data so that you can access it from SQL or Scala, using one of two methods (a short Scala usage sketch follows below):
1: use createOrReplaceTempView
df_flight_data.createOrReplaceTempView("delays")
2: use the options for build. In this case the name passed to the data generator initializer will be the name of the view, i.e.
df_flight_data = flightdata_defn.build(withTempView=True)
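For example, a short Scala cell that consumes the generated data through the view registered in option 1 (the view name delays and the reason column come from the snippets above):
%scala
// Read the temp view created by the Python cell and use it from Scala
val delays = spark.table("delays")

delays.printSchema()
delays.groupBy("reason").count().show()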
This code will not work on the community edition because of this line:
val dir = s"/dbfs/tmp/$username/new-flights"
as there is no DBFS FUSE mount on the Databricks Community Edition (it's supported only on the full Databricks platform). It's potentially possible to make it work by (a sketch of both changes follows below):
changing that directory to a local directory, like /tmp or similar, and
adding code (after writer.close()) to list the flights-* files in that local directory and use dbutils.fs.mv to move them into streamDirectory
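A rough, untested sketch of those two changes inside run(); the local staging directory /tmp/new-flights-local is just an illustrative choice:
// Inside run(): write to a local directory instead of the /dbfs FUSE path
val dir = "/tmp/new-flights-local"                  // hypothetical local staging directory
new File(dir).mkdirs()
val tempFile = File.createTempFile("flights-", "", new File(dir)).getAbsolutePath() + ".csv"
val writer = new PrintWriter(tempFile)
// ... write the rows exactly as before ...
writer.close()

// After closing the writer, move the staged CSV files into the DBFS stream directory
new File(dir).listFiles()
  .filter(f => f.getName.startsWith("flights-") && f.getName.endsWith(".csv"))
  .foreach { f =>
    dbutils.fs.mv("file:" + f.getAbsolutePath, s"$streamDirectory/${f.getName}")
  }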

minBy does not return any result in Apache Storm

I have built a Storm topology that retrieves data from Kafka, and I would like to add an aggregation that computes the minimum of one of the fields for each batch. I tried using the maxBy function on the stream; however, it does not display any results, even though data is flowing through the system and the output function works with other aggregations. How can this be implemented differently, or what can be fixed in the current implementation?
Here is my current implementation:
val tridentTopology = new TridentTopology()
val stream = tridentTopology.newStream("kafka_spout", new KafkaTridentSpoutOpaque(spoutConfig))
  .map(new ParserMapFunction, new Fields("created_at", "id", "text", "source", "timestamp_ms",
    "user.id", "user.name", "user.location", "user.url", "user.description", "user.followers_count",
    "user.friends_count", "user.favorite_count", "user.lang", "entities.hashtags"))
  .maxBy("user.followers_count")
  .map(new OutputFunction)
My custom output function:
import scala.collection.JavaConverters._

class OutputFunction extends MapFunction {
  override def execute(input: TridentTuple): Values = {
    val values = input.getValues.asScala.toList.toString
    println(s"TWEET: $values")
    new Values(values)
  }
}

In Spark, writing a dataset into a database takes a long, fixed amount of time before the save operation starts

I ran the spark-submit command shown below, which loads the datasets from the DB, processes them, and in the final stage pushes multiple datasets into an Oracle DB.
./spark-submit --class com.sample.Transformation --conf spark.sql.shuffle.partitions=5001
--num-executors=40 --executor-cores=1 --executor-memory=5G
--jars /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-api-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/drools-core-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/drools-compiler-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-maven-support-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-internal-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/xstream-1.4.10.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-commons-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/ecj-4.4.2.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/mvel2-2.4.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-project-datamodel-commons-7.7.0.Final.jar,
/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-project-datamodel-api-7.7.0.Final.jar
--driver-class-path /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar
--master spark://10.180.181.41:7077 "/scratch/rmbbuild/spark_ormb/POC-jar/Transformation-0.0.1-SNAPSHOT.jar"
> /scratch/rmbbuild/spark_ormb/POC-jar/logs/logs12.txt
But there is a consistently long delay before each dataset is written into the DB, and I don't know why this much time is consumed before the write process starts.
I am attaching a screenshot which clearly highlights the problem I am facing.
Please go through the screenshot before suggesting a solution.
Spark Dashboard Stages Screenshot:
In the screenshot I have highlighted the roughly 10 minutes that are consumed before every dataset write into the DB.
I even changed the batch size to 100000, as follows:
outputDataSetforsummary.write().mode("append").format("jdbc").option("url", connection)
.option("batchSize", "100000").option("dbtable", CI_TXN_DTL).save();
So, can anyone explain why this pre-write time is consumed every time, and how to avoid it?
I am attaching the code for a fuller description of the program.
public static void main(String[] args) {
    // SparkConf conf = new SparkConf().setAppName("Transformation").setMaster("local");
    SparkConf conf = new SparkConf().setAppName("Transformation").setMaster("spark://xx.xx.xx.xx:7077");
    String connection = "jdbc:oracle:thin:ABC/abc#//xx.x.x.x:1521/ABC";

    // Create Spark Context
    SparkContext context = new SparkContext(conf);
    // Create Spark Session
    SparkSession sparkSession = new SparkSession(context);

    Dataset<Row> txnDf = sparkSession.read().format("jdbc").option("url", connection).option("dbtable", CI_TXN_DETAIL_STG).load();
    // Dataset<Row> txnDf = sparkSession.read().format("jdbc").option("url", connection).option("dbtable", "CI_TXN_DETAIL_STG").load();
    Dataset<Row> newTxnDf = txnDf.drop(ACCT_ID);
    Dataset<Row> accountDf = sparkSession.read().format("jdbc").option("url", connection).option("dbtable", CI_ACCT_NBR).load();
    // Dataset<Row> accountDf = sparkSession.read().format("jdbc").option("url", connection).option("dbtable", "CI_ACCT_NBR").load();

    Dataset<Row> joined = newTxnDf.join(accountDf, newTxnDf.col(ACCT_NBR).equalTo(accountDf.col(ACCT_NBR))
            .and(newTxnDf.col(ACCT_NBR_TYPE_CD).equalTo(accountDf.col(ACCT_NBR_TYPE_CD))), "inner");
    Dataset<Row> finalJoined = joined.drop(accountDf.col(ACCT_NBR_TYPE_CD)).drop(accountDf.col(ACCT_NBR))
            .drop(accountDf.col(VERSION)).drop(accountDf.col(PRIM_SW));

    initializeProductDerivationCache(sparkSession, connection);

    ClassTag<List<String>> evidenceForDivision = scala.reflect.ClassTag$.MODULE$.apply(List.class);
    Broadcast<List<String>> broadcastVarForDiv = context.broadcast(divisionList, evidenceForDivision);
    ClassTag<List<String>> evidenceForCurrency = scala.reflect.ClassTag$.MODULE$.apply(List.class);
    Broadcast<List<String>> broadcastVarForCurrency = context.broadcast(currencySet, evidenceForCurrency);
    ClassTag<List<String>> evidenceForUserID = scala.reflect.ClassTag$.MODULE$.apply(List.class);
    Broadcast<List<String>> broadcastVarForUserID = context.broadcast(userIdList, evidenceForUserID);

    Encoder<RuleParamsBean> encoder = Encoders.bean(RuleParamsBean.class);
    Dataset<RuleParamsBean> ds = new Dataset<RuleParamsBean>(sparkSession, finalJoined.logicalPlan(), encoder);
    Dataset<RuleParamsBean> validateDataset = ds.map(ruleParamsBean -> validateTransaction(ruleParamsBean, broadcastVarForDiv.value(), broadcastVarForCurrency.value(),
            broadcastVarForUserID.value()), encoder);
    Dataset<RuleParamsBean> filteredDS = validateDataset.filter(validateDataset.col(BO_STATUS_CD).notEqual(TFMAppConstants.TXN_INVALID));
    // For formatting the data to be inserted in table --> Dataset<Row> finalvalidateDataset = validateDataset.select("ACCT_ID");

    Encoder<TxnDetailOutput> txndetailencoder = Encoders.bean(TxnDetailOutput.class);
    Dataset<TxnDetailOutput> txndetailDS = validateDataset.map(ruleParamsBean -> outputfortxndetail(ruleParamsBean), txndetailencoder);

    KieServices ks = KieServices.Factory.get();
    KieContainer kContainer = ks.getKieClasspathContainer();
    ClassTag<KieBase> classTagTest = scala.reflect.ClassTag$.MODULE$.apply(KieBase.class);
    Broadcast<KieBase> broadcastRules = context.broadcast(kContainer.getKieBase(KIE_BASE), classTagTest);

    Encoder<PritmRuleOutput> outputEncoder = Encoders.bean(PritmRuleOutput.class);
    Dataset<PritmRuleOutput> outputDataSet = filteredDS.flatMap(rulesParamBean -> droolprocesMap(broadcastRules.value(), rulesParamBean), outputEncoder);

    Dataset<Row> piParamDS1 = outputDataSet.select(PRICEITEM_PARM_GRP_VAL);
    Dataset<Row> piParamDS = piParamDS1.withColumnRenamed(PRICEITEM_PARM_GRP_VAL, PARM_STR);
    priceItemParamGrpValueCache.createOrReplaceTempView("temp1");
    Dataset<Row> piParamDSS = piParamDS.where(queryToFiltertheDuplicateParamVal);
    Dataset<Row> priceItemParamsGrpDS = piParamDSS.select(PARM_STR).distinct().withColumn(PRICEITEM_PARM_GRP_ID, functions.monotonically_increasing_id());
    Dataset<Row> finalpriceItemParamsGrpDS = priceItemParamsGrpDS.withColumn(PARM_COUNT, functions.size(functions.split(priceItemParamsGrpDS.col(PARM_STR), TOKENIZER)));
    finalpriceItemParamsGrpDS.persist(StorageLevel.MEMORY_ONLY());
    finalpriceItemParamsGrpDS.distinct().write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_PRICEITEM_PARM_GRP_K).option("batchSize", "1000").save();

    Dataset<Row> PritmOutput = outputDataSet.join(priceItemParamsGrpDS, outputDataSet.col(PRICEITEM_PARM_GRP_VAL).equalTo(priceItemParamsGrpDS.col(PARM_STR)), "inner");
    Dataset<Row> samplePritmOutput = PritmOutput.drop(outputDataSet.col(PRICEITEM_PARM_GRP_ID))
            .drop(priceItemParamsGrpDS.col(PARM_STR));
    priceItemParamsGrpDS.createOrReplaceTempView(PARM_STR);
    Dataset<Row> priceItemParamsGroupTable = sparkSession.sql(FETCH_QUERY_TO_SPLIT);
    Dataset<Row> finalpriceItemParamsGroupTable = priceItemParamsGroupTable.selectExpr("PRICEITEM_PARM_GRP_ID", "split(col, '=')[0] as PRICEITEM_PARM_CD ", "split(col, '=')[1] as PRICEITEM_PARM_VAL");
    finalpriceItemParamsGroupTable.persist(StorageLevel.MEMORY_ONLY());
    finalpriceItemParamsGroupTable.distinct().write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_PRICEITEM_PARM_GRP).option("batchSize", "1000").save();
}
Each write-to-DB action reloads the whole data and redoes the joins again and again.
Please add validateDataset.persist(StorageLevel.MEMORY_ONLY()). (You should choose between memory-only, disk-only, or memory-and-disk yourself, depending on whether your data frame fits in memory.)
For example:
Dataset<RuleParamsBean> validateDataset = ds.map(ruleParamsBean -> validateTransaction(ruleParamsBean,broadcastVarForDiv.value(),broadcastVarForCurrency.value(),broadcastVarForUserID.value()),encoder)
.persist(StorageLevel.MEMORY_ONLY());
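As a usage note on the persist call above: persist is lazy, so the cache is only filled by the first action that touches the Dataset, and it should be released once the last write is done. A minimal sketch of that lifecycle, in Scala syntax (the Java Dataset API is analogous):
// `validateDataset` is the Dataset from the example above
validateDataset.persist(StorageLevel.MEMORY_ONLY)   // lazy: only marks the Dataset for caching

validateDataset.count()       // optional: force materialization up front instead of on the first write

// ... run the JDBC writes / joins that reuse validateDataset ...

validateDataset.unpersist()   // free the cached blocks once the last write has finished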

BigQuery: Repeated record added outside of an array

Error:
Exception in thread "main" java.lang.RuntimeException
{"errors":[{"debugInfo":"generic::failed_precondition: Repeated record added outside of an array.","reason":"invalid"}],"index":0}
Language: Scala
Gradle bigquery dependency:
compile "com.google.apis:google-api-services-bigquery:v2-rev326-1.22.0"
The code to generate table schema:
import scala.collection.JavaConversions._

val orderTableSchema = new TableSchema()
orderTableSchema.setFields(Seq(
  new TableFieldSchema().setName("revisionId").setType("STRING").setMode("REQUIRED"),
  new TableFieldSchema().setName("executions").setType("RECORD").setMode("REPEATED").setFields(Seq(
    new TableFieldSchema().setName("ts").setType("INTEGER"),
    new TableFieldSchema().setName("price").setType("FLOAT"),
    new TableFieldSchema().setName("qty").setType("INTEGER")
  ))
))
The table is created successfully with correct schema for executions column as shown in BigQuery web ui:
revisionId STRING REQUIRED
executions RECORD REPEATED
executions.ts INTEGER NULLABLE
executions.price FLOAT NULLABLE
executions.qty INTEGER NULLABLE
The code to insert data which fails:
import scala.collection.JavaConversions._
val row = new TableRow()
  .set("revisionId", "revision1")
  .set("executions", Seq(
    new TableRow().set("ts", 1L).set("price", 100.01).set("qty", 1000)
  ))

val content = new TableDataInsertAllRequest().setRows(Seq(
  new TableDataInsertAllRequest.Rows().setJson(row)
))

val insertAll = bigQuery.tabledata().insertAll(projectId, datasetId, "order", content)
val insertResponse = insertAll.execute()

if (insertResponse.getInsertErrors != null) {
  insertResponse.getInsertErrors.foreach(println)
  // this prints:
  // {"errors":[{"debugInfo":"generic::failed_precondition: Repeated record added outside of an array.","reason":"invalid"}],"index":0}
  // throw to just terminate early for demo
  throw new RuntimeException()
}
Found the problem.
It's down to the Scala-to-Java collection conversion mess: set() takes a plain Object, so the implicit conversion never kicks in and the request carries a Scala Seq that the client does not serialize as a JSON array. I had to add the conversion explicitly using JavaConversions.seqAsJavaList, and it started working:
val row = new TableRow()
  .set("revisionId", s"revisionId$v")
  .set("executions", JavaConversions.seqAsJavaList(Seq(
    new TableRow().set("ts", 1L).set("price", 100.01).set("qty", 1000),
    new TableRow().set("ts", 2L).set("price", 100.02).set("qty", 2000),
    new TableRow().set("ts", 3L).set("price", 100.03).set("qty", 3000)
  )))
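As a side note, the JavaConversions implicits are deprecated in newer Scala versions; the explicit JavaConverters / asJava form achieves the same thing, for example:
import scala.collection.JavaConverters._

val row = new TableRow()
  .set("revisionId", "revision1")
  .set("executions", Seq(
    new TableRow().set("ts", 1L).set("price", 100.01).set("qty", 1000)
  ).asJava)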

Processing multiple files as independent RDD's in parallel

I have a scenario where a certain number of operations, including a group by, have to be applied on a number of small (~300MB each) files. The operation looks like this:
df.groupBy(....).agg(....)
Now, to process multiple files, I can use a wildcard "/**/*.csv"; however, that creates a single RDD and partitions it for the operations. Looking at the operations, though, there is a group by involved, which causes a lot of shuffling that is unnecessary if the files are mutually exclusive.
What I am looking for is a way to create independent RDDs per file and operate on them independently.
This is more an idea than a full solution, and I haven't tested it yet.
You can start by extracting your data processing pipeline into a function:
def pipeline(f: String, n: Int) = {
  sqlContext
    .read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load(f)
    .repartition(n)
    .groupBy(...)
    .agg(...)
    .cache // Cache so we can force computation later
}
If your files are small, you can adjust the n parameter to use as small a number of partitions as possible to fit the data from a single file and avoid shuffling. This means you are limiting concurrency, but we'll get back to that issue later.
val n: Int = ???
Next you have to obtain a list of input files. This step depends on the data source, but most of the time it is more or less straightforward:
val files: Array[String] = ???
Next you can map the above list using the pipeline function:
val rdds = files.map(f => pipeline(f, n))
Since we limit concurrency at the level of a single file, we want to compensate by submitting multiple jobs. Let's add a simple helper which forces evaluation and wraps it in a Future:
import scala.concurrent._
import ExecutionContext.Implicits.global

def pipelineToFuture(df: org.apache.spark.sql.DataFrame) = Future {
  df.rdd.foreach(_ => ()) // Force computation
  df
}
Finally, we can use the above helper on the rdds:
val result = Future.sequence(
  rdds.map(rdd => pipelineToFuture(rdd)).toList
)
Depending on your requirements you can add onComplete callbacks or use reactive streams to collect the results.
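For example, a minimal sketch of collecting the results (Await is used here purely for brevity; the onComplete variant avoids blocking the driver thread):
import scala.concurrent.Await
import scala.concurrent.duration._

// Block until every per-file job has finished, then work with the cached DataFrames
val finished: List[org.apache.spark.sql.DataFrame] = Await.result(result, 2.hours)
finished.foreach(df => df.show(10))

// Or, without blocking:
result.onComplete {
  case scala.util.Success(dfs) => println(s"Processed ${dfs.size} files")
  case scala.util.Failure(e)   => println(s"At least one file failed: ${e.getMessage}")
}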
If you have many files, and each file is small (you say 300MB above, which I would count as small for Spark), you could try using SparkContext.wholeTextFiles, which will create an RDD where each record is an entire file.
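A minimal sketch of that approach, with a made-up input path and placeholder parsing logic; wholeTextFiles yields (path, fileContent) pairs, so each file is processed entirely within one task:
// Each record is (filePath, wholeFileContent), so a single file never spans tasks
val files = sc.wholeTextFiles("/data/input/*.csv")   // hypothetical path

val perFileCounts = files.map { case (path, content) =>
  // Parse and aggregate one file entirely inside one task; no shuffle is needed
  val rows = content.split("\n").drop(1)             // e.g. drop a header line
  (path, rows.length)
}

perFileCounts.collect().foreach(println)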
This way we can write multiple RDDs in parallel:
public class ParallelWriteSevice implements IApplicationEventListener {

    private static final IprogramLogger logger = programLoggerFactory.getLogger(ParallelWriteSevice.class);

    private static ExecutorService executorService = null;
    private static List<Future<Boolean>> futures = new ArrayList<Future<Boolean>>();

    public static void submit(Callable callable) {
        if (executorService == null) {
            executorService = Executors.newFixedThreadPool(15); // Based on target tables increase this
        }
        futures.add(executorService.submit(callable));
    }

    public static boolean isWriteSucess() {
        boolean writeFailureOccured = false;
        try {
            for (Future<Boolean> future : futures) {
                try {
                    Boolean writeStatus = future.get();
                    if (writeStatus == false) {
                        writeFailureOccured = true;
                    }
                } catch (Exception e) {
                    logger.error("Error - Scheduled write failed " + e.getMessage(), e);
                    writeFailureOccured = true;
                }
            }
        } finally {
            resetFutures();
            if (executorService != null)
                executorService.shutdown();
            executorService = null;
        }
        return !writeFailureOccured;
    }

    private static void resetFutures() {
        logger.error("resetFutures called");
        //futures.clear();
    }
}
}