Read PDF file in Apache Spark DataFrames - Scala

We can read an Avro file using the code below:
val df = spark.read.format("com.databricks.spark.avro").load(path)
Is it possible to read PDF files into Spark DataFrames?

You cannot read a PDF directly into a DataFrame because Spark cannot infer the columns of the DataFrame (a PDF has no standard schema). If you want to get data out of a PDF, first convert it to CSV or Parquet, then read that file and create a DataFrame from it, since those formats have a defined schema.
Visit this GitBook to learn more about the available read formats you can use to load data as a DataFrame:
DataFrameReader — Loading Data From External Data Sources
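If you do need text out of PDFs in a Spark job, one option (not covered above, so treat everything here as an assumption) is to extract the text with a standalone PDF library such as Apache PDFBox and only then build a DataFrame from the extracted strings. A minimal sketch, assuming PDFBox 2.x is on the classpath and the files are small enough to parse one by one:

import java.io.File
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pdf-to-df").getOrCreate()
import spark.implicits._

// Hypothetical helper: extract the full text of one PDF as a String.
def pdfText(path: String): String = {
  val doc = PDDocument.load(new File(path))
  try new PDFTextStripper().getText(doc) finally doc.close()
}

// Example paths; in practice you would list them from a local dir or copy them out of HDFS.
val pdfPaths = Seq("/data/a.pdf", "/data/b.pdf")

// One row per PDF: (path, extracted text). From here you can parse the text
// into real columns and write it out as CSV or Parquet, as suggested above.
val df = pdfPaths.map(p => (p, pdfText(p))).toDF("path", "text")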

Related

Read hdfs data to Spark DF without mentioning file type

Is there any approach to read HDFS data into a Spark DataFrame without explicitly mentioning the file type?
spark.read.format("auto_detect").option("header", "true").load(inputPath)
We can achieve the above requirement by using scala.sys.process._ or Python subprocess(cmd) and splitting the extension of any part file. But can we achieve this without using any subprocess or sys.process?
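One way to avoid shelling out is to inspect the file names with the Hadoop FileSystem API that Spark already ships with and dispatch to the matching reader. A minimal sketch, assuming all part files under the directory share one format and that the file extension is a reliable indicator:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()

// Pick a reader based on the extension of the first data file under inputPath.
// Stays inside the JVM; no subprocess or sys.process involved.
def readAuto(inputPath: String): DataFrame = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val ext = fs.listStatus(new Path(inputPath))
    .map(_.getPath.getName)
    .find(_.contains("."))                // first file that has an extension
    .map(_.split("\\.").last.toLowerCase)
    .getOrElse("parquet")                 // assumed default when nothing matches
  ext match {
    case "csv"  => spark.read.option("header", "true").csv(inputPath)
    case "json" => spark.read.json(inputPath)
    case "orc"  => spark.read.orc(inputPath)
    case _      => spark.read.parquet(inputPath)
  }
}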

Pyspark & HDFS: Add new dataframe column to existing parquet files in hdfs

Let me first start with my scenario:
I have a huge dataframe stored in HDFS. I load the dataframe in a spark session
and create a new column without changing any of the existing content. After this, I want to store the dataframe to the original directory in HDFS.
Now, I know I can practically do this with Spark's write operation in the fashion df.write.parquet("my_df_path", mode="overwrite"). Since the data is immense, I'm investigating whether there is, so to speak, a column-wise append mode or method that does not write the complete dataframe back, only the difference to the stored data. The final target is to save both memory and computational effort for the HDFS system.
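Parquet files are immutable and Spark has no column-wise append, so the straightforward (if heavy) route is the full rewrite mentioned above. A minimal Scala sketch of that, with hypothetical paths; it writes to a temporary directory first because overwriting the very path you are still lazily reading from is unsafe, and the directories can be swapped afterwards:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().getOrCreate()

// Hypothetical paths.
val srcPath = "hdfs:///data/my_df_path"
val tmpPath = "hdfs:///data/my_df_path_tmp"

// Load, add the new column, and write the full dataset back out.
val df = spark.read.parquet(srcPath)
df.withColumn("new_col", lit("some_value"))
  .write
  .mode("overwrite")
  .parquet(tmpPath)
// Afterwards, swap tmpPath into place (e.g. with the Hadoop FileSystem rename API).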

how to save output from cassandra table using spark

I want to save the output/rows read from a Cassandra table to a file in either CSV or JSON format. Using Spark 1.6.3:
scala> val results = sqlContext.sql("select * from myks.mytable")
scala> results.write.option("header", "true").save("/tmp/xx.csv") // writes to the cfs:// filesystem
I am not able to find an option to write to the OS as a CSV or JSON format file.
Appreciate any help!
Use the Spark Cassandra connector to read data from the Cassandra table into Spark:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md
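A minimal sketch of that approach for Spark 1.6, assuming the spark-cassandra-connector package is on the classpath and sc is the shell's SparkContext; prefixing the output path with file:// is what sends the write to the local OS filesystem instead of cfs:// (CSV output on 1.6 would additionally need the spark-csv package, so JSON is shown here):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read the Cassandra table through the connector's DataFrame source.
val results = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "myks", "table" -> "mytable"))
  .load()

// coalesce(1) keeps the output to a single part file; file:// forces the local filesystem.
results.coalesce(1)
  .write
  .json("file:///tmp/mytable_json")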

Storing & reading custom metadata in parquet files using Spark / Scala

I know parquet files store metadata, but is it possible to add custom metadata to a parquet file using Scala (preferably) with Spark?
The idea is that I store many similarly structured parquet files in Hadoop storage, but each has a uniquely named source (a String field, also present as a column in the parquet file). However, I'd like to access this information without the overhead of actually reading the parquet, and possibly even remove this redundant column from the parquet.
I really don't want to put this info in a filename, so my best option right now is just to read the first row of each parquet file and take the source column as a String field.
It works, but I was just wondering if there is a better way.
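For reference, a minimal sketch of that workaround; because Parquet is columnar, selecting only the (hypothetical) source column means the rest of the file is never read, and head() stops after a single row:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical file path; "source" is the column carrying the origin name.
val source: String = spark.read
  .parquet("hdfs:///data/some_file.parquet")
  .select("source")
  .head()
  .getString(0)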

Write and append Spark streaming data to a text file in HDFS

I am writing Spark Scala code in which I read a continuous stream from an MQTT server.
I am running my job in YARN cluster mode. I want to save and append this stream to a single text file in HDFS.
I will be receiving data every second, so I need it appended to a single text file in HDFS.
Can anyone help?
Use a DataFrame and write with mode Append.
This will append the data every time a new record arrives.
import org.apache.spark.sql.SaveMode

val sqlContext = new org.apache.spark.sql.SQLContext(context)
import sqlContext.implicits._

stream.map(_.value).foreachRDD(rdd => {
  rdd.foreach(println)
  if (!rdd.isEmpty()) {
    // Convert the RDD to a single-column DataFrame and append it to the output path.
    rdd.toDF("value").coalesce(1).write.mode(SaveMode.Append).save("C:/data/spark/")
    // rdd.saveAsTextFile("C:/data/spark/")
  }
})
@Amrutha J Raj:
rdd.toDF("value").coalesce(1).write.mode(SaveMode.Append).json("C:/data/spark/")
This means the RDD is converted to a DataFrame. We used coalesce(1) so only one file is produced; without it Spark may generate multiple files, so this restricts the output to a single one. The write mode is Append, so new records are appended to the existing data, in JSON format.