Scala - Writing dataframe to a file as binary

I have a Hive table stored as Parquet, with a column Content that stores various documents as base64-encoded strings.
I need to read that column and write it out to files in HDFS, so that the base64 content is converted back into a document for each row.
val profileDF = sqlContext.read.parquet("/hdfspath/profiles/");
profileDF.registerTempTable("profiles")
val contentsDF = sqlContext.sql("select unbase64(contents) as contents from profiles where file_name = 'file1'")
Now contentsDF holds the binary form of a document in each row, and I need to write that out to a file. I have tried different options but couldn't get the dataframe content back into a file.
I'd appreciate any help with this.

I would suggest saving it as Parquet:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/sql/DataFrameWriter.html#parquet(java.lang.String)
Or converting it to an RDD and using saveAsObjectFile:
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/rdd/RDD.html#saveAsObjectFile(java.lang.String)
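A rough sketch of both suggestions, using the contentsDF from the question; the output paths here are placeholders:
// Option 1: write the binary column back out as Parquet.
contentsDF.write.parquet("/hdfspath/profiles_unbase64_parquet/")
// Option 2: drop down to an RDD of Rows and save it as a Hadoop object file.
contentsDF.rdd.saveAsObjectFile("/hdfspath/profiles_unbase64_objects/")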

Related

Save pyspark dataframe entries as separate html files on s3

So, if I have a list of file locations on s3, I can build a dataframe with a column containing the contents of each file in a separate row by doing the following (for example):
s3_path_list = list(df.select('path').toPandas()['path'])
df2 = spark.read.format("binaryFile").load(s3_path_list,'path')
which returns:
df2: pyspark.sql.dataframe.DataFrame
path:string
modificationTime:timestamp
length:long
content:binary
What is the inverse of this operation?
Specifically... I have plotly generating html content stored as a string in an additional 'plot_string' column.
df3: pyspark.sql.dataframe.DataFrame
save_path:string
plot_string:string
How would I go about efficiently saving off each 'plot_string' entry as an html file at some s3 location specified in the 'save_path' column?
Clearly some form of df.write can be used to save off the dataframe (bucketed or partitioned) as parquet, csv, text table, etc... but I can't seem to find any straightforward method to perform a simple parallel write operation without a udf that initializes separate boto clients for each file... which, for large datasets, is a bottleneck (as well as being inelegant). Any help is appreciated.
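Not an answer from the original thread, just a sketch of one way to avoid a per-file client, written in Scala to match the rest of this page: write from foreachPartition through the Hadoop FileSystem API, assuming the cluster's s3a filesystem is already configured and using the save_path/plot_string column names from the question.
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
df3.select("save_path", "plot_string").rdd.foreachPartition { rows =>
  val conf = new Configuration() // one Hadoop configuration per partition
  rows.foreach { row =>
    val target = new Path(row.getString(0)) // e.g. s3a://bucket/key.html
    val fs = target.getFileSystem(conf)     // FileSystem instances are cached per scheme/authority
    val out = fs.create(target, true)       // overwrite if the object already exists
    out.write(row.getString(1).getBytes(StandardCharsets.UTF_8))
    out.close()
  }
}
Because FileSystem instances are cached, each executor reuses one s3a client per bucket rather than creating a new client for every file.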

How to specify schema for the folder structure when reading parquet file into a dataframe [duplicate]

This question already has an answer here: Reading partition columns without partition column names.
I have to read parquet files that are stored in the following folder structure:
/yyyy/mm/dd/ (e.g. 2021/01/31)
If I read the files like this, it works:
unPartitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet")
Unfortunately, the folder structure is not stored in the typical partitioned format /yyyy=2021/mm=01/dd=31/ and I don't have the luxury of converting it to that format.
I was wondering if there is a way I can provide Spark a hint as to the folder structure so that it would make "2021/01/31" available as yyyy, mm, dd in my dataframe.
I have another set of files, which are stored in the /yyyy=aaaa/mm=bb/dd=cc format and the following code works:
partitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/")
Things I have tried
I have specified the schema, but it just returned nulls
from pyspark.sql.types import StructType, StructField, LongType, TimestampType
customSchema = StructType([
StructField("yyyy",LongType(),True),
StructField("mm",LongType(),True),
StructField("dd",LongType(),True),
StructField("id",LongType(),True),
StructField("a",LongType(),True),
StructField("b",LongType(),True),
StructField("c",TimestampType(),True)])
partitionDF = spark.read.option("mergeSchema", "true").schema(customSchema).parquet("abfss://xxx#abc.dfs.core.windows.net/Address/")
display(partitionDF)
The above returns no data! If I change the path to "abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet", then I get data, but the yyyy, mm, dd columns are empty.
Another option would be to load the folder path as a column, but I can't seem to find a way to do that.
TIA
Databricks N00B!
I suggest you load the data without the partition folders, as you mentioned:
unPartitionedDF = spark.read.option("mergeSchema", "true").parquet("abfss://xxx#abc.dfs.core.windows.net/Address/*/*/*/*.parquet")
Then add a column containing the full file path using the input_file_name function:
import pyspark.sql.functions as F
unPartitionedDF = unPartitionedDF.withColumn('file_path', F.input_file_name())
Then you could split the values of the new file_path column into three separate columns.
df = unPartitionedDF.withColumn('year', F.split(F.col('file_path'), '/').getItem(3)) \
    .withColumn('month', F.split(F.col('file_path'), '/').getItem(4)) \
    .withColumn('day', F.split(F.col('file_path'), '/').getItem(5))
The index passed to getItem depends on the exact folder structure you have.
I hope this resolves your problem.

how to assign column names available in csv file as header to orc file

I have column names in one .csv file and want to assign them as column headers to a DataFrame in Scala. Since it is a generic script, I don't want to hard-code the names in the script but rather pass them in from the csv file.
You can do it like this:
import org.apache.spark.sql.DataFrame
// read just the header row of the csv to get the column names
val columns = spark.read.option("header","true").csv("path_to_csv").schema.fieldNames
val df: DataFrame = ??? // your existing dataframe
df.toDF(columns:_*).write.format("orc").save("your_orc_dir")
in pyspark:
columns = spark.read.option("header","true").csv("path_to_csv").columns
df.toDF(columns).write.format("orc").save("your_orc_dir")
But storing the data schema separately from the data is a bad idea.

issue: inserting data into hive creates small part files

I am processing more than 1,000,000 records from a JSON file. I read the file line by line and extract the required key values (the JSON structure is mixed, not fixed, so I parse it and generate the required JSON elements), build a JSON string similar to the json_string variable below, and push it into a Hive table. The data is stored properly, but the hadoop apps/hive/warehouse/jsondb.myjson_table folder contains small part files: every insert query creates a new part file (.1 to .20 KB). Because of that, even a simple query on Hive takes more than 30 minutes. Below is sample code for my logic; it iterates multiple times as new records are inserted into Hive.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
import spark.implicits._ // needed for Seq(json_string).toDS below
var json_string = """{"name":"yogesh_wagh","education":"phd" }"""
val df = spark.read.json(Seq(json_string).toDS)
//df.write.format("orc").saveAsTable("bds_data1.newversion");
df.write.mode("append").format("orc").insertInto("bds_data1.newversion");
I have also tried adding Hive properties to merge the files, but it doesn't work.
I have also tried creating a table from the existing table to combine the small part files into 256 MB files.
Please share sample code for inserting multiple records and appending records to a part file.
I think each of those individual inserts is creating a new part file.
You could create a dataset/dataframe of these JSON strings and then save it to the Hive table in one go.
You could also merge the existing small files using the Hive DDL statement ALTER TABLE table_name CONCATENATE;
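A minimal sketch of the first suggestion, batching the generated JSON strings into a single dataframe write instead of one insert per record; the second record here is purely illustrative:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionZipsExample").enableHiveSupport().getOrCreate()
import spark.implicits._
// Accumulate the generated JSON strings first, then write them in one batch.
val jsonStrings: Seq[String] = Seq(
  """{"name":"yogesh_wagh","education":"phd" }""",
  """{"name":"another_person","education":"msc" }""" // illustrative extra record
)
val df = spark.read.json(jsonStrings.toDS)
df.write.mode("append").format("orc").insertInto("bds_data1.newversion")
A single append like this produces one set of part files per batch rather than one tiny file per record; any leftover small files can still be compacted with the CONCATENATE statement above.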

Generate a CSV file with combination of unique field and XML using spark dataframe

I am reading an XML file into a Spark DataFrame using com.databricks.spark.xml and trying to generate a CSV file as output.
My Input is like below
<id>1234</id>
<dtl>
<name>harish</name>
<age>21</age>
<class>II</class>
</dtl>
My output should be a CSV file that combines the id with the rest of the XML as a single field, like:
id, xml
1234,<dtl><name>harish</name><age>21</age><class>II</class></dtl>
Is there a way to achieve output in the above format?
Your help is very much appreciated.
Create a plain RDD by loading the XML as a text file using sc.textFile(), without parsing it.
Extract the id manually with a regex or XPath, and slice the remaining XML out of the string from the start of your tag to its end.
Once that's done you will have your data as pairs like (id, "xml").
I hope this tactical solution helps you...
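A rough sketch along those lines; the paths and regexes are illustrative, and sc.wholeTextFiles is used so that each file's tags arrive as one string (plain sc.textFile would split the sample input across lines):
val idPattern  = "<id>(.*?)</id>".r
val dtlPattern = "(?s)<dtl>.*?</dtl>".r
val pairs = sc.wholeTextFiles("/path/to/xml/").flatMap { case (_, content) =>
  for {
    id  <- idPattern.findFirstMatchIn(content).map(_.group(1))
    dtl <- dtlPattern.findFirstIn(content)
  } yield (id, dtl.replaceAll(">\\s+<", "><")) // drop whitespace between tags so the block fits on one CSV line
}
// Write out "id,xml" lines in the requested layout.
pairs.map { case (id, xml) => s"$id,$xml" }.saveAsTextFile("/path/to/output_csv")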