Does anyone know a way to create a large RDD from the sequence of RDDs in a DStream for a specific batch interval?
For example, in the code below:
def createLargeRDD() {
  val sparkConf = new SparkConf().setAppName("Test").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(3))
  val DStream = KafkaUtilHelper.RetrieveDStream(ssc)
  DStream.transform { rdd =>
    /* Form an RDD with all of the RDDs that were put into the
       DStream variable above for the 3-second batch interval */
    rdd
  }
}
So RDDs are being added to that DStream variable every 3 seconds. Is there a way I can aggregate all of the RDDs that are in the DStream for that 3-second time period into one large RDD, and save that RDD to HBase or some other external source?
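Something like the sketch below captures what I am after; saveToHBase is just a placeholder for whatever external write ends up being used, not an existing helper:
DStream.foreachRDD { rdd =>
  // each invocation receives the single RDD holding everything received
  // during one 3-second batch interval
  if (!rdd.isEmpty()) {
    saveToHBase(rdd) // hypothetical helper that writes the batch to HBase / another sink
  }
}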
Environment
Scala
Apache Spark: Spark 2.2.1
EMR on AWS: emr-5.12.1
Content
I have one large DataFrame, like below:
val df = spark.read.option("basePath", "s3://some_bucket/").json("s3://some_bucket/group_id=*/")
There are ~1 TB of JSON files at s3://some_bucket, and they include 5000 partitions of group_id.
I want to execute a conversion using Spark SQL, and the conversion differs for each group_id.
The Spark code is like below:
// Create view
val df = spark.read.option("basePath", "s3://data_lake/").json("s3://data_lake/group_id=*/")
df.createOrReplaceTempView("lakeView")
// one of the queries looks like this:
// SELECT
// col1 as userId,
// col2 as userName,
// .....
// FROM
// lakeView
// WHERE
// group_id = xxx;
val queries: Seq[String] = getGroupIdMapping
// ** Want to know better ways **
queries.par.foreach { query =>
  val convertedDF: DataFrame = spark.sql(query)
  convertedDF.write.save("s3://another_bucket/")
}
par parallelizes across Runtime.getRuntime.availableProcessors threads, which equals the number of cores on the driver.
But this seems awkward and not efficient enough, because it has nothing to do with Spark's parallelization.
I really want to do something like groupBy in scala.collection.Seq.
This is not valid Spark code, just the idea:
df.groupBy(groupId).foreach((groupId, parDF) => {
  parDF.createOrReplaceTempView("lakeView")
  val convertedDF: DataFrame = spark.sql(queryByGroupId)
  convertedDF.write.save("s3://another_bucket")
})
1) First of all, if your data is already stored in separate files per group id, there is no reason to mix it all together and then group by id with Spark.
It's much simpler and more efficient to load only the relevant files for each group id, as in the small sketch below.
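For example, something along these lines (reusing the s3://data_lake layout from the question; the groupId value is just an illustration) reads only one group's files:
// Load only the files belonging to a single group_id instead of the whole lake
val groupId = 42 // example value
val groupDF = spark.read.json(s"s3://data_lake/group_id=$groupId/")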
2) Spark itself parallelizes the computation, so in most cases there is no need for external parallelization.
But if you feel that Spark doesn't utilize all resources, you can consider two cases:
a) If each individual computation takes less than a few seconds, the task scheduling overhead is comparable to the task execution time, so it's possible to get a boost by running a few tasks in parallel.
b) If the computation takes a significant amount of time but resources are still underutilized, then most probably you should increase the number of partitions for your dataset.
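For point b), a minimal illustration (400 is an arbitrary number, not a recommendation):
// More partitions -> more tasks that can run concurrently on the executors
val repartitionedDF = df.repartition(400)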
3) If you finally decide to run several tasks in parallel, it can be achieved this way:
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}

val parallelism = 10
val executor = Executors.newFixedThreadPool(parallelism)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)

val tasks: Seq[String] = ???
val results: Seq[Future[Int]] = tasks.map { query =>
  Future {
    // spark stuff here
    0
  }(ec)
}

val allDone: Future[Seq[Int]] = Future.sequence(results)
// wait for results
Await.result(allDone, scala.concurrent.duration.Duration.Inf)
executor.shutdown() // otherwise the JVM will probably not exit
I have a .csv file that I read into an RDD:
val dataH = sc.textFile(filepath).map(line => line.split(",").map(elem => elem.trim))
I would like to iterate over this RDD in order and compare adjacent elements; the comparison depends on only one column of the data structure. It is not possible to iterate over RDDs directly, so the idea is to first convert that column of the RDD to either a Dataset or a DataFrame.
You can convert an RDD to a Dataset like this (which doesn't work if my structure is RDD[Array[String]]):
val sc = new SparkContext(conf)
val sqc = new SQLContext(sc)
import sqc.implicits._
val lines = sqc.createDataset(dataH)
How do I obtain just the one column that I am interested in from dataH and thereafter create a dataset just from it?
I am using Spark 1.6.0.
You can just map your Array to the desired index, e.g.:
dataH.map(arr => arr(0)).toDF("col1")
Or, more safely (this avoids an exception in case the index is out of bounds; lift(0) returns an Option, and orNull turns a missing element into a null in the resulting column):
dataH.map(arr => arr.lift(0).orNull).toDF("col1")
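If you specifically need a Dataset (as in the createDataset snippet from the question), the same mapped RDD works; a small sketch, assuming the sqc and the imported implicits from above:
// Dataset[String] built from just the first column of each row
val col1DS = sqc.createDataset(dataH.map(arr => arr(0)))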
I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD to HDFS, and later also read that RDD back in a Spark program. Is it possible to do that? And if so, how?
It is possible.
On an RDD you have the saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can later parse them.
Reading can be done with the textFile function from SparkContext and then a .map to eliminate the parentheses.
So:
Version 1:
rdd.saveAsTextFile("hdfs:///test1/")
// later, in another program
val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map { x =>
  // remove the surrounding "(" and ")" and parse the Long / String parts
  val s = x.stripPrefix("(").stripSuffix(")")
  (s.takeWhile(_ != ',').toLong, s.dropWhile(_ != ',').drop(1))
}
Version 2:
rdd.saveAsObjectFile("hdfs:///test1/")
// later, in another program - watch, you have tuples out of the box :)
val newRdds = sparkContext.objectFile[(Long, String)]("hdfs:///test1/")
I would recommend using a DataFrame if your RDD is in a tabular format. A DataFrame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case.
A DataFrame carries additional metadata because of its tabular format, which allows Spark to run certain optimizations on the finalized query.
An RDD, on the other hand, is a Resilient Distributed Dataset that is more of a black box or core abstraction over data, which Spark cannot optimize in the same way.
However, you can go back and forth between the two: a DataFrame can be turned into an RDD via its rdd method, and an RDD can be turned into a DataFrame (if the RDD is in a tabular format) via the toDF method.
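For the (Long, String) RDD from the question, the round trip could look roughly like this (assuming the SQLContext set up in the example below and its implicits):
import sqlContext.implicits._
// RDD[(Long, String)] -> DataFrame with named columns
val tupleDF = rdd.toDF("id", "value")
// and back: DataFrame -> RDD[Row]
val backToRdd = tupleDF.rdd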
The following is an example that creates a DataFrame, stores it in CSV and Parquet format in HDFS, and reads it back:
val conf = new SparkConf().setAppName("Spark-HDFS-Read-Write")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // needed for toDF on a local Seq

val hdfs = "hdfs:///"
val df = Seq((1, "Name1")).toDF("id", "name")

// Writing file in CSV format
df.write.format("com.databricks.spark.csv").mode("overwrite").save(hdfs + "user/hdfs/employee/details.csv")

// Writing file in PARQUET format
df.write.format("parquet").mode("overwrite").save(hdfs + "user/hdfs/employee/details")

// Reading CSV files from HDFS
val dfIncsv = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").load(hdfs + "user/hdfs/employee/details.csv")

// Reading PARQUET files from HDFS
val dfInParquet = sqlContext.read.parquet(hdfs + "user/hdfs/employee/details")
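Side note: on Spark 2.x the CSV data source is built in, so the com.databricks.spark.csv package is not needed; roughly, assuming a SparkSession named spark:
// Spark 2.x built-in CSV source
df.write.mode("overwrite").csv(hdfs + "user/hdfs/employee/details.csv")
val dfInCsv = spark.read.option("inferSchema", "true").csv(hdfs + "user/hdfs/employee/details.csv")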
I am trying to partition input files based on accountId, but this partitioning should be done only if the DataFrame contains more than 1000 records for that accountId. The accountId is a dynamic integer that is not known in advance. Consider the code below:
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.textFileStream("input")
lines.print()
lines.foreachRDD { rdd =>
  val count = rdd.count()
  if (count > 0) {
    val df = sqlContext.read.json(rdd)
    val filteredDF = df.filter(df("accountId") === "3")
    if (filteredDF.count() > 1000) {
      df.write.partitionBy("accountId").format("json").save("output")
    }
  }
}
ssc.start()
ssc.awaitTermination()
But the above code partitions on all accountIds, which is not what I need.
I want to find the count for each accountId in the DataFrame.
If the records for an accountId exceed 1000, then write that accountId's partition to the output source.
For example, if the input file has 1500 records for accountId=1 and 10 records for accountId=2, then partition and write the filtered DataFrame for accountId=1 to the output source, and keep the accountId=2 records in memory.
How can I achieve this with Spark Streaming?
Shouldn't you be doing
filteredDF.write.partitionBy("accountId").format("json").save("output")
instead of
df.write.partitionBy("accountId").format("json").save("output")
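If the intent is to write out every accountId whose count in the batch exceeds 1000 (rather than a hard-coded one), a rough sketch of the write side could look like this; it reuses the sqlContext from the question and does not cover keeping the small accounts in memory:
import org.apache.spark.sql.functions.{broadcast, col}

lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val df = sqlContext.read.json(rdd)
    // per-accountId record counts for this batch
    val bigAccounts = df.groupBy("accountId").count()
      .filter(col("count") > 1000)
      .select("accountId")
    // keep only rows whose accountId is over the threshold, then partition on write
    val toWrite = df.join(broadcast(bigAccounts), "accountId")
    if (toWrite.head(1).nonEmpty) {
      toWrite.write.mode("append").partitionBy("accountId").format("json").save("output")
    }
  }
}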
Below is my use case; I am using Apache Spark.
1) I have around 2500 Parquet files on HDFS; the file size varies from file to file.
2) I need to process each Parquet file, build a new DataFrame, and write the new DataFrame out in ORC file format.
3) My Spark driver program is like this.
I iterate over each file, process a single Parquet file to create a new DataFrame, and write the new DataFrame out as ORC; below is the code snippet.
val fs = FileSystem.get(new Configuration())
val parquetDFMap = fs.listStatus(new Path(inputFilePath)).map { folder =>
  (folder.getPath.toString, sqlContext.read.parquet(folder.getPath.toString))
}
parquetDFMap.foreach { dfMap =>
  val parquetFileName = dfMap._1
  val parqFileDataFrame = dfMap._2
  for (column <- parqFileDataFrame.columns) {
    val rows = parqFileDataFrame.select(column)
      .mapPartitions(lines => lines.filter(filterRowsWithNullValues(_))
        .map(row => buildRowRecords(row, masterStructArr.toArray, valuesArr)))
    val newDataFrame: DataFrame = parqFileDataFrame.sqlContext.createDataFrame(rows, StructType(masterStructArr))
    newDataFrame.write.mode(SaveMode.Append).format("orc").save(orcOutPutFilePath + tableName)
  }
}
The problem with this design is that I can process only one Parquet file at a time; parallelism applies only within the creation of a new DataFrame and within writing it out in ORC format. So if any step, such as creating a new DataFrame or writing it out as ORC, takes a long time to complete, the other queued Parquet files are stuck until the current operation finishes.
Can you please help me with a better approach or design for this use case?
Can you create a single DataFrame for all the Parquet files instead of one DataFrame for each file?
val df = sqlContext.read.parquet(inputFilePath)
df.map(row => convertToORc(row))
I was able to parallelize the Parquet file processing by using parquetDFMap.par.foreach instead of the plain parquetDFMap.foreach.
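For reference, a rough sketch of that parallel variant (the per-file body is the same as in the question):
// .par turns the Array of (path, DataFrame) pairs into a parallel collection,
// so several per-file Spark jobs are submitted concurrently from the driver
parquetDFMap.par.foreach { case (parquetFileName, parqFileDataFrame) =>
  // ... same per-column processing and ORC write as above ...
}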