I want to convert a complete Spark DataFrame to a byte array. I used the code below to convert a single column to an encrypted byte array, but I want to convert an entire DataFrame.
(x: String) => encrypt(connStr(ECCConstants.S_Key).toString, x, algorithm)

def encrypt(key: String, value: String, algo: String): String = {
  Security.addProvider(new org.bouncycastle.jce.provider.BouncyCastleProvider())
  val cipher: Cipher = Cipher.getInstance(algo) // e.g. "AES/ECB/PKCS5Padding"
  cipher.init(Cipher.ENCRYPT_MODE, keyToSpec(key, algo))
  Base64.encodeBase64String(cipher.doFinal(value.getBytes("UTF-8")))
}
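One way to cover the entire DataFrame rather than one column is to wrap encrypt in a UDF and fold it over every column. This is only a sketch built on the snippet above: connStr, ECCConstants.S_Key, algorithm and keyToSpec are assumed to come from the surrounding code, and the result is a DataFrame of Base64-encoded strings (one per cell) rather than a single byte array.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Assumes encrypt, connStr, ECCConstants and algorithm from the snippet above
val key = connStr(ECCConstants.S_Key).toString
val encryptUdf = udf((x: String) => encrypt(key, x, algorithm))

// Fold the UDF over every column, casting each value to string first
def encryptAllColumns(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (acc, c) =>
    acc.withColumn(c, encryptUdf(col(c).cast("string")))
  }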
I am reading a text file from the local file system. I want to convert the String to a Dictionary (Map), store it in one variable, and then extract a value by passing a key. I am new to Spark and Scala.
scala> val file = sc.textFile("file:///test/prod_details.txt")
scala> file.foreach(println)
{"00000006-0000-0000": "AWS", "00000009-0000-0000": "JIRA", "00000010-0000-0000-0000": "BigData", "00000011-0000-0000-0000": "CVS"}
scala> val rowRDD=file.map(_.split(","))
Expected result:
If I pass the key as "00000010-0000-0000-0000",
the function should return the value as BigData
Since your file is in JSON format and is not big, you can read it with Spark's JSON reader and then extract the keys (the columns) and the values:
val df = session.read.json("path to file")
val keys = df.columns
val values = df.collect().last.toSeq
val map = keys.zip(values).toMap
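With that Map in hand, the lookup from the question is a plain key access. A short illustration, using the key from the sample line above (the values come back as Any since they were collected from a Row):

val result = map("00000010-0000-0000-0000")     // "BigData"
val safe   = map.get("00000010-0000-0000-0000") // Some("BigData"), or None if the key is absent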
I got this exception while playing with Spark.
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Cannot up cast price from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "price")
- root class: "org.spark.code.executable.Main.Record"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
How can this exception be solved? Here is the code:
object Main {

  case class Record(transactionDate: Timestamp, product: String, price: Int, paymentType: String, name: String, city: String, state: String, country: String,
                    accountCreated: Timestamp, lastLogin: Timestamp, latitude: String, longitude: String)

  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "C:\\winutils\\")

    val schema = Encoders.product[Record].schema
    val df = SparkConfig.sparkSession.read
      .option("header", "true")
      .csv("SalesJan2009.csv")

    import SparkConfig.sparkSession.implicits._
    val ds = df.as[Record]
    //ds.groupByKey(body => body.state).count().show()

    import org.apache.spark.sql.expressions.scalalang.typed.{
      count => typedCount,
      sum => typedSum
    }

    ds.groupByKey(body => body.state)
      .agg(typedSum[Record](_.price).name("sum(price)"))
      .withColumnRenamed("value", "group")
      .alias("Summary by state")
      .show()
  }
}
You read the CSV file first and then tried to convert it to a Dataset that has a different schema. It's better to pass the schema created from the case class while reading the CSV file, as below:
val spark = SparkSession.builder()
.master("local")
.appName("test")
.getOrCreate()
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Record].schema
val ds = spark.read
.option("header", "true")
.schema(schema) // passing schema
.option("timestampFormat", "MM/dd/yyyy HH:mm") // passing timestamp format
.csv(path)// csv path
.as[Record] // convert to DS
The default timestampFormat is yyyy-MM-dd'T'HH:mm:ss.SSSXXX, so you also need to pass your custom timestampFormat.
Hope this helps
In my case, the problem was that I was using this:
case class OriginalData(ORDER_ID: Int, USER_ID: Int, ORDER_NUMBER: Int, ORDER_DOW: Int, ORDER_HOUR_OF_DAY: Int, DAYS_SINCE_PRIOR_ORDER: Double, ORDER_DETAIL: String)
However, the CSV file contained, for example, "Friday" in the ORDER_DOW column.
Having "Friday" where only integers representing days of the week should appear means that I need to clean the data. However, to be able to read the CSV file with spark.read.csv("data/jaimemontoya/01.csv"), I used the following case class, where ORDER_DOW is now a String instead of an Int:
case class OriginalData(ORDER_ID: Int, USER_ID: Int, ORDER_NUMBER: Int, ORDER_DOW: String, ORDER_HOUR_OF_DAY: Int, DAYS_SINCE_PRIOR_ORDER: Double, ORDER_DETAIL: String)
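For completeness, a read that matches the revised case class might look like the sketch below; the path is the one mentioned above, while the header option and the schema-from-Encoders approach are assumptions about the file and the desired typing:

import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().master("local").appName("orders").getOrCreate()
import spark.implicits._

// Build the schema from the revised case class so ORDER_DOW is read as a string
val schema = Encoders.product[OriginalData].schema

val orders = spark.read
  .option("header", "true")          // assumption: the file has a header row
  .schema(schema)
  .csv("data/jaimemontoya/01.csv")
  .as[OriginalData]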
Add this option on read:
.option("inferSchema", true)
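Placed in context, that option sits on the reader chain and asks Spark to sample the file and infer column types (so price can come back as an int instead of a string). A sketch, reusing the file from the question:

val df = SparkConfig.sparkSession.read
  .option("header", "true")
  .option("inferSchema", true) // let Spark infer int/timestamp columns instead of defaulting to string
  .csv("SalesJan2009.csv")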
I have an XML schema of the following structure.
I am interested in _EventStartDate and E; these values have to be converted into an Avro record at the end, with the date being the key and E being the value.
To convert this, I use the Databricks XML parser, create a temporary table, and then repartition it on the date.
import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._
val xmlDF =
sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "Events").load("/hdfsPath").cache
xmlDF.registerTempTable("event_tbl")
val xmlDF_Tbl = sqlContext.sql("SELECT `#EventStartDate` as eventDate,E as event FROM event_tbl")
val xmlDF_Tbl_Part = xmlDF_Tbl.repartition($"eventDate").rdd.map(s => (s(0),s(1).toString))
xmlDF_Tbl_Part.saveAsTextFile("path to Hdfs")
I get output in the following format: 2015,WrappedArray([E1], [E2])
I want output in the following format: 2015,E1###E2###E3 and so on.
The DF looks something like this:
[#BL: string, #EventStartDate: string, #MaterialNumber: bigint, #SerialNumber: bigint, #UTCOfs: bigint, E:array[struct<#C:string,#EID:bigint,#Hst:string,#L:bigint,#MID:string,#S:string,#Src:string,#T:string,#Tgt:string,#Usr:string,#VALUE:string>]]
Now, how can I convert this WrappedArray to a delimited string?
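One way to get the E1###E2###E3 shape is to join the array elements inside the map step instead of relying on toString. Below is a sketch built on the xmlDF_Tbl from above; the inner mkString("|") that flattens each E struct into a single token is an assumption about how one event should be rendered:

import org.apache.spark.sql.Row

val xmlDF_Tbl_Part = xmlDF_Tbl
  .repartition($"eventDate")
  .rdd
  .map { r =>
    // r(1) holds the array of E structs; render each struct and join them with ###
    val events = r.getSeq[Row](1).map(_.mkString("|")).mkString("###")
    (r(0), events)
  }

xmlDF_Tbl_Part.saveAsTextFile("path to Hdfs")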
I have txt files with semi-structured data that I have to write to Cassandra through the Spark-Cassandra connector. But first I want to parse them in plain Scala.
My code:
import java.io.File
import scala.io.Source

object parser extends App {

  val path = "somepath"
  val fileArray = (new java.io.File(path)).listFiles()

  for (file <- fileArray)
    for (line <- Source.fromFile(file).getLines())
      println(line) // placeholder: each line still needs to be parsed
}
So how can I parse each line and get the values to put into Cassandra? For example, a line contains (int, text, timestamp, int, text, char, int, text). Do I have to split each line on the delimiter " " and put the parts into a tuple, or convert each of them to a properly typed value?
What you could probably do is treat it as a CSV file with delimiter " " and let Spark do the parsing for you:
val spark = SparkSession.builder.config(conf).getOrCreate()
val dataFrame = spark.read.option("inferSchema", "true").option("delimiter", " ").csv(csvfilePath)
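From there, the write to Cassandra can go through the Spark-Cassandra connector's DataFrame source. A sketch, where the column, keyspace and table names are placeholders and the spark-cassandra-connector dependency is assumed to be on the classpath:

// Name the columns so they line up with the Cassandra table (placeholder names)
val named = dataFrame.toDF("id", "name", "created_at", "count", "note", "flag", "score", "comment")

named.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table")) // placeholders
  .mode("append")
  .save()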
I have textRDD: org.apache.spark.rdd.RDD[(String, String)]
I would like to convert it to a DataFrame. The columns correspond to the title and content of each page (row).
Use toDF(), provide the column names if you have them.
val textDF = textRDD.toDF("title", "content")
textDF: org.apache.spark.sql.DataFrame = [title: string, content: string]
or
val textDF = textRDD.toDF()
textDF: org.apache.spark.sql.DataFrame = [_1: string, _2: string]
The shell auto-imports the implicits (I am using version 1.5), but you may need import sqlContext.implicits._ in an application.
I usually do this like the following:
Create a case class like this:
case class DataFrameRecord(property1: String, property2: String)
Then you can use map to convert into the new structure using the case class:
rdd.map(p => DataFrameRecord(p._1, p._2)).toDF()
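Put together for the RDD[(String, String)] from the question, a minimal sketch (the field names title and content are illustrative, and sqlContext.implicits._ is the Spark 1.x import mentioned earlier):

import sqlContext.implicits._ // needed outside the shell for toDF()

case class DataFrameRecord(title: String, content: String)

val textDF = textRDD
  .map { case (title, content) => DataFrameRecord(title, content) }
  .toDF()

textDF.printSchema() // title: string, content: string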