I am new to Scala and HDFS. I need to dump my data into HDFS. The data is in the form of a Spark DataFrame, but I want to write it as a CSV in HDFS.
Can someone please share some basic boilerplate code for starters?
Thanks
If your data is flat, then the following should work.
// assuming the DataFrame and the target HDFS path are supplied elsewhere
val df: DataFrame = ???
val filePath: String = ???
// join each row's values with commas and save as plain text
// (going through .rdd keeps this working on Spark 2.x as well, where DataFrame.map returns a Dataset)
df.rdd.map(_.mkString(",")).saveAsTextFile(filePath)
However, the issue with the question is that we need to see what your data looks like. For example, if it has nested structs, then saving it as a CSV isn't clearly defined.
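For flat data there is also the built-in CSV data source; a minimal sketch, assuming Spark 2.x and reusing df and filePath from above (the header option is an assumption about what you want):
// write the DataFrame as CSV files directly into HDFS
df.write
  .option("header", "true")
  .csv(filePath)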
I have to compare a DataFrame with another one of the same schema read from a specific path, but there may be no files in that path, so I've thought I should compare it against an empty DataFrame with the same columns as the original.
So I am trying to create a DataFrame with the schema of another DataFrame that contains a lot of columns, but I can't find a solution for this. I have been reading the following posts, but none of them helps me:
How to create an empty DataFrame with a specified schema?
How to create an empty DataFrame? Why "ValueError: RDD is empty"?
How to create an empty dataFrame in Spark
How can I do it in Scala? Or is it better to take another approach?
originalDF.limit(0) will return an empty dataframe with the same schema.
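For example, a minimal sketch, where originalDF stands in for the DataFrame whose schema you want to reuse and spark is the active SparkSession:
import org.apache.spark.sql.{DataFrame, Row}

// zero rows, identical schema
val emptyDF: DataFrame = originalDF.limit(0)

// an equivalent construction built directly from the schema
val emptyDF2: DataFrame =
  spark.createDataFrame(spark.sparkContext.emptyRDD[Row], originalDF.schema)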
I need to write Cassandra partitions as parquet files. Since I cannot share and use the SparkSession inside a foreach function, I first call collect to gather all the data in the driver program and then write the parquet files to HDFS, as below.
Thanks to this link https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
I am able to get my partitioned rows. I want to write the partitioned rows into separate parquet files, one for each partition read from the Cassandra table. I also tried sparkSQLContext, but that method writes the task results as temporary files; I think I will only see the parquet files after all the tasks are done.
Is there any convenient method for this?
val keyedTable: CassandraTableScanRDD[(Tuple2[Int, Date], MyCassandraTable)] = getTableAsKeyed()

keyedTable.groupByKey
  .collect
  .foreach(f => {
    import sparkSession.implicits._
    val items = f._2.toList
    val key = f._1
    val baseHDFS = "hdfs://mycluster/parquet_test/"
    val ds = sparkSession.sqlContext.createDataset(items)

    ds.write
      .option("compression", "gzip")
      .parquet(baseHDFS + key._1 + "/" + key._2)
  })
Why not use Spark SQL everywhere and use the built-in functionality of Parquet to write data by partitions, instead of creating a directory hierarchy yourself?
Something like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("table", "keyspace").load()
data.write
  .option("compression", "gzip")
  .partitionBy("col1", "col2")
  .parquet(baseHDFS)
In this case, it will create a separate nested directory for every value of col1 & col2, with names like this: ${column}=${value}. Then, when you read, you can restrict the read to only specific values.
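For example, reading back only one combination of values prunes the matching directories (a sketch; the literal values are assumptions):
import spark.implicits._

// only the matching ${column}=${value} directories are scanned (partition pruning)
val subset = spark.read.parquet(baseHDFS)
  .filter($"col1" === 10 && $"col2" === "2020-01-01")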
I am using Spark to read multiple parquet files into a single RDD, using standard wildcard path conventions. In other words, I'm doing something like this:
val myRdd = spark.read.parquet("s3://my-bucket/my-folder/**/*.parquet")
However, sometimes these Parquet files will have different schemas. When I'm doing my transforms on the RDD, I can try to differentiate between them in the map functions by looking for the existence (or absence) of certain columns. However, a surefire way to know which schema a given row in the RDD uses - and the way I'm asking about specifically here - is to know which file path I'm looking at.
Is there any way, on an RDD level, to tell which specific parquet file the current row came from? So imagine my code looks something like this, currently (this is a simplified example):
val mapFunction = new MapFunction[Row, (String, Row)] {
  override def call(row: Row): (String, Row) = myJob.transform(row)
}

val pairRdd = myRdd.map(mapFunction, Encoders.kryo[(String, Row)])
Within the myJob.transform( ) code, I'm decorating the result with other values, converting it to a pair RDD, and doing some other transforms as well.
I make use of the row.getAs( ... ) method to look up particular column values, and that's a really useful method. I'm wondering if there are any similar methods (e.g. row.getInputFile( ) or something like that) to get the name of the specific file that I'm currently operating on?
Since I'm passing in wildcards to read multiple parquet files into a single RDD, I don't have any insight into which file I'm operating on. If nothing else, I'd love a way to decorate the RDD rows with the input file name. Is this possible?
You can add a new column for the file name as shown below:
import org.apache.spark.sql.functions._

val myDF = spark.read.parquet("s3://my-bucket/my-folder/**/*.parquet")
  .withColumn("inputFile", input_file_name())
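Downstream, the originating file path is then available like any other column; a small sketch using the row.getAs pattern from the question:
// read the source path off a row, the same way as any other column
val sourceFile = myDF.head().getAs[String]("inputFile")

// or count rows per source file
val countsPerFile = myDF.groupBy("inputFile").count()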
Is it possible to convert an RDD[CassandraRow] to an RDD[String]? If so, is there any disadvantage of working against the converted RDD?
You can use sqlContext to read data from a Cassandra table; it returns a DataFrame. When you read a text file using sparkContext, it returns an RDD, which you can then convert to a DataFrame.
If your text files are CSV, Spark 2.0 supports the CSV data source, which returns a DataFrame by default. Please see https://spark.apache.org/releases/spark-release-2-0-0.html#new-features and https://github.com/databricks/spark-csv/issues/
Update:
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
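As for the direct conversion asked about, it is possible; a minimal sketch, assuming the spark-cassandra-connector is on the classpath and using hypothetical keyspace, table, and column names (sc is the SparkContext):
import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

// hypothetical keyspace and table names
val cassandraRows = sc.cassandraTable("my_keyspace", "my_table")

// render each whole row as a String
val asStrings: RDD[String] = cassandraRows.map(_.toString)

// or extract a single (hypothetical) text column
val names: RDD[String] = cassandraRows.map(_.getString("name"))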
I want to use datasets instead of dataframes.
I'm reading a parquet file and want to infer the types directly:
val df: Dataset[Row] = spark.read.parquet(path)
I don't want a Dataset[Row] but a typed Dataset.
I know I can do something like:
val df = spark.read.parquet(path).as[myCaseClass]
But my data has many columns! So if I can avoid writing a case class, it would be great!
Why do you want to work with a Dataset? I think it's because you will get not only the schema for free (which you have with the resulting DataFrame anyway) but also a type-safe schema.
You need to have an Encoder for your dataset and to have it you need a type that would represent your dataset and hence the schema.
Either you select your columns down to a reasonable number and use as[MyCaseClass], or you accept what DataFrame offers.
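A minimal sketch of the first option, assuming the parquet file contains (hypothetical) columns id and name and that spark is the active SparkSession:
import org.apache.spark.sql.Dataset

// only the columns covered by the case class are selected and typed
case class MyCaseClass(id: Long, name: String)

import spark.implicits._
val ds: Dataset[MyCaseClass] = spark.read.parquet(path)
  .select("id", "name")
  .as[MyCaseClass]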