I want to generate the 'csv' files as per below logic for the table in cassandra.
val df = sc.parallelize(Seq(("a",1,"abc#gmail.com"), ("b",2,"def#gmail.com"),("a",1,"xyz#gmail.com"),("a",2,"abc#gmail.com"))).toDF("col1","col2","emailId")
I want to generate the 'csv' files as per below logic.
Since there are 3 distinct 'emailid's' I need to generate 3 distinct 'csv' files.
Three csv files for below 3 different queries.
select * from table where emailId='abc#gmail.com'
select * from table where emailId='def#gmail.com'
select * from table where emailId='xyz#gmail.com'
How can I do this. Can anyone please help me on this.
Version:
Spark 1.6.2
Scala 2.10
Create a distinct list of the emails then iterate over them. When iterating, filter for only the emails that match and save the dataframe to Cassandra.
import sql.implicits._
val emailData = sc.parallelize(Seq(("a",1,"abc#gmail.com"), ("b",2,"def#gmail.com"),("a",1,"xyz#gmail.com"),("a",2,"abc#gmail.com"))).toDF("col1","col2","emailId")
val distinctEmails = emailData.select("emailId").distinct().as[String].collect
for (email <- distinctEmails){
val subsetEmailsDF = emailData.filter($"emailId" === email).coalesce(1)
//... Save the subset dataframe to cassandra
}
Note: coalesce(1) sends all the data to one node. This can create memory issues if the dataframe is too large.
Related
We have an AWS S3 bucket with millions of documents in a complex hierarchy, and a CSV file with (among other data) links to a subset of those files, I estimate this file will be about 1000 to 10.000 rows. I need to join the data from the CSV file with the contents of the documents for further processing in Spark. In case it matters, we're using Scala and Spark 2.4.4 on an Amazon EMR 6.0.0 cluster.
I can think of two ways to do this. First is to add a transformation on the CSV DataFrame that adds the content as a new column:
val df = spark.read.format("csv").load("<csv file>")
val attempt1 = df.withColumn("raw_content", spark.sparkContext.textFile($"document_url"))
or variations thereof (for example, wrapping it in a udf) don't seem to work, I think because sparkContext.textFile returns an RDD, so I'm not sure it's even supposed to work this way? Even if I get it working, is the best way to keep it performant in Spark?
An alternative I tried to think of is to use spark.sparkContext.wholeTextFiles upfront and then join the two dataframes together:
val df = spark.read.format("csv").load("<csv file>")
val contents = spark.sparkContext.wholeTextFiles("<s3 bucket>").toDF("document_url", "raw_content");
val attempt2 = df.join(contents, df("document_url") === contents("document_url"), "left")
but wholeTextFiles doesn't go into subdirectories and the needed paths are hard to predict, and I'm also unsure of the performance impact of trying to build an RDD of the entire bucket of millions of files if I only need a small fraction of it, since the S3 API probably doesn't make it very fast to list all the objects in the bucket.
Any ideas? Thanks!
I did figure out a solution in the end:
val df = spark.read.format("csv").load("<csv file>")
val allS3Links = df.map(row => row.getAs[String]("document_url")).collect()
val joined = allS3Links.mkString(",")
val contentsDF = spark.sparkContext.wholeTextFiles(joined).toDF("document_url", "raw_content");
The downside to this solution is that it pulls all the urls to the driver, but it's workable in my case (100,000 * ~100 char length strings is not that much) and maybe even unavoidable.
I need to write Cassandra Partitions as parquet file. Since I cannot share and use sparkSession in foreach function. Firstly, I call collect method to collect all data in driver program then I write parquet file to HDFS, as below.
Thanks to this link https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
I am able to get my partitioned rows. I want to write partitioned rows into seperated parquet file, whenever a partition is read from cassandra table. I also tried sparkSQLContext that method writes task results as temporary. I think, after all the tasks are done. I will see parquet files.
Is there any convenient method for this?
val keyedTable : CassandraTableScanRDD[(Tuple2[Int, Date], MyCassandraTable)] = getTableAsKeyed()
keyedTable.groupByKey
.collect
.foreach(f => {
import sparkSession.implicits._
val items = f._2.toList
val key = f._1
val baseHDFS = "hdfs://mycluster/parquet_test/"
val ds = sparkSession.sqlContext.createDataset(items)
ds.write
.option("compression", "gzip")
.parquet(baseHDFS + key._1 + "/" + key._2)
})
Why not use Spark SQL everywhere & use built-in functionality of the Parquet to write data by partitions, instead of creating a directory hierarchy yourself?
Something like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("table", "keyspace").load()
data.write
.option("compression", "gzip")
.partitionBy("col1", "col2")
.parquet(baseHDFS)
In this case, it will create a separate directory for every value of col & col2 as nested directories, with name like this: ${column}=${value}. Then when you read, you may force to read only specific value.
I am new to spark. I have some json data that comes as an HttpResponse. I'll need to store this data in hive tables. Every HttpGet request returns a json which will be a single row in the table. Due to this, I am having to write single rows as files in the hive table directory.
But I feel having too many small files will reduce the speed and efficiency. So is there a way I can recursively add new rows to the Dataframe and write it to the hive table directory all at once. I feel this will also reduce the runtime of my spark code.
Example:
for(i <- 1 to 10){
newDF = hiveContext.read.json("path")
df = df.union(newDF)
}
df.write()
I understand that the dataframes are immutable. Is there a way to achieve this?
Any help would be appreciated. Thank you.
You are mostly on the right track, what you want to do is to obtain multiple single records as a Seq[DataFrame], and then reduce the Seq[DataFrame] to a single DataFrame by unioning them.
Going from the code you provided:
val BatchSize = 100
val HiveTableName = "table"
(0 until BatchSize).
map(_ => hiveContext.read.json("path")).
reduce(_ union _).
write.insertInto(HiveTableName)
Alternatively, if you want to perform the HTTP requests as you go, we can do that too. Let's assume you have a function that does the HTTP request and converts it into a DataFrame:
def obtainRecord(...): DataFrame = ???
You can do something along the lines of:
val HiveTableName = "table"
val OtherHiveTableName = "other_table"
val jsonArray = ???
val batched: DataFrame =
jsonArray.
map { parameter =>
obtainRecord(parameter)
}.
reduce(_ union _)
batched.write.insertInto(HiveTableName)
batched.select($"...").write.insertInto(OtherHiveTableName)
You are clearly misusing Spark. Apache Spark is analytical system, not a database API. There is no benefit of using Spark to modify Hive database like this. It will only bring a severe performance penalty without benefiting from any of the Spark features, including distributed processing.
Instead you should use Hive client directly to perform transactional operations.
If you can batch-download all of the data (for example with a script using curl or some other program) and store it in a file first (or many files, spark can load an entire directory at once) you can then load that file(or files) all at once into spark to do your processing. I would also check to see it the webapi as any endpoints to fetch all the data you need instead of just one record at a time.
We have customer data in a Hive table and sales data in another Hive table, which has data in TB's. We are trying to pull the sales data for multiple customers and save it to a file.
What we tried so far:
We tired with left outer join between customer and sales tables, but because of the huge sales data it is not working.
val data = customer.join(sales,"customer.id" = "sales.customerID",leftouter)
So the alternative is to pull the data form sales table based on specific customer region list and see if this region data has the customer data, if data exist save it in other dataframe and load the data to the same dataframe for all the regions.
My question here is, whether the multiple insert of data for the dataframe is supported in spark.
If the sales dataframe is larger than the customer dataframe then you could simply switch the order of the dataframes in the join operation.
val data = sales.join(customer,"customer.id" === "sales.customerID", "left_outer")
You could also add a hint for Spark to broadcast the smaller dataframe, though I belive it needs to be smaller than 2GB:
import org.apache.spark.sql.functions.broadcast
val data = sales.join(broadcast(customer),"customer.id" === "sales.customerID", "leftouter")
To use the other approach and iterativly merge dataframes is also possible. For this purpose you can use the union method (Spark 2.0+) or unionAll (older versions). This method will append a dataframe to another. In the case where you have a list of dataframes that you want to merge with each other you can use union together with reduce:
val dataframes = Seq(df1, df2, df3)
dataframes.reduce(_ union _)
I using hive through Spark. I have a Insert into partitioned table query in my spark code. The input data is in 200+gb. When Spark is writing to a partitioned table, it is spitting very small files(files in kb's). so now the output partitioned table folder have 5000+ small kb files. I want to merge these in to few large MB files, may be about few 200mb files. I tired using hive merge settings, but they don't seem to work.
'val result7A = hiveContext.sql("set hive.exec.dynamic.partition=true")
val result7B = hiveContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val result7C = hiveContext.sql("SET hive.merge.size.per.task=256000000")
val result7D = hiveContext.sql("SET hive.merge.mapfiles=true")
val result7E = hiveContext.sql("SET hive.merge.mapredfiles=true")
val result7F = hiveContext.sql("SET hive.merge.sparkfiles = true")
val result7G = hiveContext.sql("set hive.aux.jars.path=c:\\Applications\\json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar")
val result8 = hiveContext.sql("INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table")'
The above hive settings work in a mapreduce hive execution and spits out files of specified size. Is there any option to do this Spark or Scala?
I had the same issue. Solution was to add DISTRIBUTE BY clause with the partition columns. This ensures that data for one partition goes to single reducer. Example in your case:
INSERT INTO TABLE partition_table PARTITION (date) select a,b,c from partition_json_table DISTRIBUTE BY date
You may want to try using the DataFrame.coalesce method; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion). So using the number of records you are inserting and the typical size of each record, you can estimate how many partitions to coalesce to if you want files of ~200MB.
The dataframe repartition(1) method works in this case.