How can I read multiple parquet files in Spark Scala?

Below are some folders, which may keep getting updated over time. Each contains multiple .parquet files. How can I read them into a Spark DataFrame in Scala?
"id=200393/date=2019-03-25"
"id=200393/date=2019-03-26"
"id=200393/date=2019-03-27"
"id=200393/date=2019-03-28"
"id=200393/date=2019-03-29" and so on ...
Note: there could be 100 date folders, and I need to pick only specific ones (let's say the 25th, 26th and 28th).
Is there a better way than the code below?
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
val spark = SparkSession.builder.appName("ScalaCodeTest").master("yarn").getOrCreate()
val parquetFiles = List("id=200393/date=2019-03-25", "id=200393/date=2019-03-26", "id=200393/date=2019-03-28")
spark.read.format("parquet").load(parquetFiles: _*)
The above code works, but I want to do something like the following:
val parquetFiles = List()
parquetFiles(0) = "id=200393/date=2019-03-25"
parquetFiles(1) = "id=200393/date=2019-03-26"
parquetFiles(2) = "id=200393/date=2019-03-28"
spark.read.format("parquet").load(parquetFiles: _*)

You can read all the folders in the directory id=200393 like this:
val df = spark.read.parquet("id=200393/*")
If you want to select only some dates, for example only September 2019:
val df = spark.read.parquet("id=200393/date=2019-09-*")
If you need a few specific days, you can put them in a list:
val days = List("2019-09-02", "2019-09-03")
val paths = days.map(day => s"id=200393/date=$day")
val df = spark.read.parquet(paths: _*)
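If the path list really needs to be built up element by element, as in the question, here is a minimal sketch using a mutable buffer (the ListBuffer approach is only an illustration, not from the original post):
import scala.collection.mutable.ListBuffer
// collect the partition paths to read, e.g. depending on runtime conditions
val parquetPaths = ListBuffer[String]()
parquetPaths += "id=200393/date=2019-03-25"
parquetPaths += "id=200393/date=2019-03-26"
parquetPaths += "id=200393/date=2019-03-28"
// load takes a varargs of paths, so expand the collected paths with : _*
val df = spark.read.format("parquet").load(parquetPaths.toList: _*)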

If you want to keep the partition columns, set basePath to the directory that partition discovery should start from. With basePath set to "id=200393/" the 'date' column is kept (to keep 'id' as well, point basePath at the parent directory of id=200393):
val df = spark.read
  .option("basePath", "id=200393/")
  .parquet("id=200393/date=*")

Related

Why difference when importing csv with spark

I have this CSV file, payments.csv, and for some particular rows the timestamp changes by itself after the import. Screenshots of the first 3 lines are attached for easier understanding.
import spark.implicits._
import org.apache.spark.sql.functions.{col,when,to_date,row_number,date_add,expr}
import org.apache.spark.sql.expressions.{Window}
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
//Importing the csv
val df = spark.read.option("header","true").option("inferSchema","true").csv("payment.csv")
val df2 = df.filter($"payment_id" === 21112)
df2.show()
// grab the timestamp value (row 0, column 5) of the filtered result
val time_value = df2.collect()(0)(5)
println(time_value)
I am clueless about this as of now.
Screenshots: (not reproduced here)
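A common first diagnostic (a suggestion, not part of the original post) is to re-read the file without schema inference so the timestamps stay as raw strings, then compare them with the parsed values:
// read the same file with inferSchema disabled: timestamps remain plain strings
val rawDf = spark.read.option("header", "true").csv("payment.csv")
rawDf.filter(col("payment_id") === "21112").show(false)
// compare the raw strings with the types Spark inferred for the parsed DataFrame
df.printSchema()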

Scala RDD get earliest date by group

I have a case class RDD in Scala and need to find the earliest date by each group (patientID).
Here is the input:
patientID date
000000047-01 2008-03-21T21:00:00Z
000000047-01 2007-10-24T19:45:00Z
000000485-01 2011-06-17T21:00:00Z
000000485-01 2006-02-22T18:45:00Z
The expected should be:
patientID date
000000047-01 2007-10-24T19:45:00Z
000000485-01 2006-02-22T18:45:00Z
I tried something like the following, but it didn't work:
val out = medication.groupBy(x => x.patientID).sortBy(x => x.date).take(1)
Okay! If I understand your question correctly, you want the earliest record from every group. If that's the case, here is a solution I put together.
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}
val dataDF = Seq(
  ("000000047-01", "2008-03-21T21:00:00Z"),
  ("000000047-01", "2007-10-24T19:45:00Z"),
  ("000000485-01", "2011-06-17T21:00:00Z"),
  ("000000485-01", "2006-02-22T18:45:00Z"))
val dfWithSchema = dataDF.toDF("patientId", "date")
// rank the dates within each patient; ISO-8601 strings sort chronologically
val winSpec = Window.partitionBy("patientId").orderBy("date")
val rank_df = dfWithSchema.withColumn("rank", rank().over(winSpec)).orderBy(col("patientId"))
// keep only the earliest date per patient
val result = rank_df.select(col("patientId"), col("date")).where(col("rank") === 1)
result.show()
Please ignore the steps for creating the DataFrame if you already have a schema defined for your data.
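Since the question is about an RDD of a case class, here is a hedged sketch of a pure-RDD alternative using reduceByKey (the case class shape is assumed from the columns shown above):
// assumed shape, based on the question's columns
case class Medication(patientID: String, date: String)
// medication: RDD[Medication], as in the question's attempt
// ISO-8601 timestamps compare correctly as strings, so a string minimum works here
val earliest = medication
  .map(m => (m.patientID, m.date))
  .reduceByKey((a, b) => if (a <= b) a else b)
earliest.collect().foreach(println)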

How to use Spark-Scala to download a CSV file from the web?

How to use Spark-Scala to download a CSV file from the web and load the file into a spark-csv DataFrame?
Currently I depend on curl in a shell command to get my CSV file.
Here is the syntax I want to enhance:
/* fb_csv.scala
This script should load FB prices from Yahoo.
Demo:
spark-shell -i fb_csv.scala
*/
// I should get prices:
import sys.process._
"/usr/bin/curl -o /tmp/fb.csv http://ichart.finance.yahoo.com/table.csv?s=FB"!
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val fb_df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("/tmp/fb.csv")
fb_df.head(9)
I want to enhance the above script so it is pure Scala with no shell syntax inside.
// download the CSV over HTTP and split it into non-empty lines
val content = scala.io.Source.fromURL("http://ichart.finance.yahoo.com/table.csv?s=FB").mkString
val list = content.split("\n").filter(_ != "")
// note: this gives a single-column DataFrame of raw lines; the header row is not parsed
import sqlContext.implicits._ // needed for toDF when not running in the spark-shell
val rdd = sc.parallelize(list)
val df = rdd.toDF
Found a better answer in Process CSV from REST API into Spark.
Here you go:
import scala.io.Source._
import org.apache.spark.sql.{Dataset, SparkSession}
import spark.implicits._ // needed for toDS; spark is the SparkSession (available in the spark-shell)
val url = "http://ichart.finance.yahoo.com/table.csv?s=FB" // the URL from the question above
val res = fromURL(url).mkString.stripMargin.lines.toList
val csvData: Dataset[String] = spark.sparkContext.parallelize(res).toDS()
val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
frame.printSchema()

How to convert RDD of Avro's GenericData.Record to DataFrame?

Perhaps this question seems a bit abstract; here it is:
val originalAvroSchema : Schema = // read from a file
val rdd : RDD[GenericData.Record] = // From some streaming source
// Looking for a handy:
val df: DataFrame = rdd.toDF(schema)
I explored spark-avro, but it only supports reading from a file, not converting an existing RDD.
import com.databricks.spark.avro._
val sqlContext = new SQLContext(sc)
val rdd : RDD[MyAvroRecord] = ...
val df = rdd.toAvroDF(sqlContext)
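If that helper is not available in your spark-avro version, here is a manual sketch that converts the records to Rows. It assumes Spark's bundled spark-avro module (Spark 2.4+) for the schema conversion and flat (non-nested) records; the variable names rdd and originalAvroSchema are from the question:
import org.apache.avro.util.Utf8
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Row}
// convert the Avro schema into a Spark SQL StructType
val sqlSchema = SchemaConverters.toSqlType(originalAvroSchema).dataType.asInstanceOf[StructType]
// map each GenericData.Record to a Row (Avro Utf8 values become plain Strings)
val rowRdd: RDD[Row] = rdd.map { record =>
  val values = sqlSchema.fieldNames.map { name =>
    record.get(name) match {
      case s: Utf8 => s.toString
      case other   => other
    }
  }
  Row(values: _*)
}
// spark: SparkSession (on older versions, sqlContext.createDataFrame works the same way)
val df: DataFrame = spark.createDataFrame(rowRdd, sqlSchema)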

How to get files name with spark sc.textFile?

I am reading a directory of files using the following code:
val data = sc.textFile("/mySource/dir1/*")
Now my data RDD contains all the rows of all the files in the directory (right?).
I want to add a column to each row with the source file's name. How can I do that?
The other option I tried is wholeTextFiles, but I keep getting out-of-memory exceptions.
5 servers, 24 cores and 24 GB each (executor-cores 5, executor-memory 5G).
Any ideas?
You can use this code. I have tested it with Spark 1.4 and 1.5.
It gets the file name from the InputSplit and attaches it to each line via the iterator, using mapPartitionsWithInputSplit on the underlying NewHadoopRDD.
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.{NewHadoopRDD}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
val sc = new SparkContext(new SparkConf().setMaster("local"))
val fc = classOf[TextInputFormat]
val kc = classOf[LongWritable]
val vc = classOf[Text]
val path :String = "file:///home/user/test"
val text = sc.newAPIHadoopFile(path, fc, kc, vc, sc.hadoopConfiguration)
val linesWithFileNames = text.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    iterator.map(tup => (file.getPath, tup._2))
  })
linesWithFileNames.foreach(println)
I think it's pretty late to answer this question, but I found an easy way to do what you were looking for (the snippet below is PySpark; the same function exists in Scala):
Step 0: from pyspark.sql import functions as F
Step 1: Create the DataFrame from the RDD as usual. Let's call it df.
Step 2: Use input_file_name():
df.withColumn("INPUT_FILE", F.input_file_name())
This will add a column with the source file name to your DataFrame.
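Since the question is in Scala, here is the same idea as a short Scala sketch (reading the files through the DataFrame API; the column name is illustrative):
import org.apache.spark.sql.functions.input_file_name
// spark: SparkSession; reading via the DataFrame API lets Spark track the source file per row
val df = spark.read.text("/mySource/dir1/*")
val withFile = df.withColumn("INPUT_FILE", input_file_name())
withFile.show(false)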