Spark Scala - Convert String to BigDecimal (11,2)

I am trying to convert a String column to BigDecimal in Spark Scala, but in my HDFS directory the Parquet file stores the value with the default precision and scale of (38,18).
I have tried the conversion below. Here, row is my DataFrame.
linenumber = Try(BigDecimal(row.line_number)).getOrElse(BigDecimal(0))
Please suggest how I can convert the "row.line_number" value to BigDecimal(11,2).
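One way to pin the precision and scale is to cast the column to DecimalType(11, 2) at the DataFrame level rather than relying on the default mapping for Scala BigDecimal, which Spark stores as decimal(38,18). A minimal sketch, assuming the DataFrame is called df and the column is line_number:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Cast the string column to decimal(11,2); values that cannot be parsed become null
// and can be replaced with 0 afterwards if needed.
val converted = df.withColumn("line_number", col("line_number").cast(DecimalType(11, 2)))
At the row level, the existing Try(...) expression can keep two decimal places with setScale, e.g. BigDecimal(row.line_number).setScale(2, BigDecimal.RoundingMode.HALF_UP), but it is the DataFrame schema, and therefore the cast, that determines the (precision, scale) written to Parquet.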

Related

Using case class versus StructType in Spark Scala

When should I use StructType, and when should I use a case class?
I am trying to create a Spark Dataset.
I have an input CSV file; I am trying to create a DataFrame first and then convert it to a Dataset using df.as[].
Now, in order to generate the schema, should I use a StructType or a case class? Please help.
You don't have to use a StructType when reading your CSV file, but:
by default all fields will be Strings unless you specify the inferSchema option;
you'd have to name every field like this if you don't have a header:
sparkSession.read.csv("my/csv/path.csv").toDF("id","product","customer","time").as[Transaction]
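A rough sketch of both routes, assuming a hypothetical Transaction case class with the four String fields named in the toDF call above:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Hypothetical field names taken from the toDF(...) call; adjust to your data.
case class Transaction(id: String, product: String, customer: String, time: String)

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._

// Case-class route: without inferSchema every column stays a String, so the columns
// line up with the all-String case class and .as[Transaction] gives a typed Dataset.
val ds = sparkSession.read
  .csv("my/csv/path.csv")
  .toDF("id", "product", "customer", "time")
  .as[Transaction]

// StructType route: declare the schema up front instead of naming or inferring it.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("product", StringType),
  StructField("customer", StringType),
  StructField("time", StringType)
))
val df = sparkSession.read.schema(schema).csv("my/csv/path.csv")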

How do I use a from_json() dataframe in Spark?

I'm trying to create a dataset from a json-string within a dataframe in Databricks 3.5 (Spark 2.2.1). In the code block below 'jsonSchema' is a StructType with the correct layout for the json-string which is in the 'body' column of the dataframe.
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))
This returns a dataframe where the root object is
jsontostructs(CAST(body AS STRING)):struct
followed by the fields in the schema (looks correct). When I try another select on the newDF
val transform = newDF.select($"propertyNameInTheParsedJsonObject")
it throws the exception
org.apache.spark.sql.AnalysisException: cannot resolve '`columnName`' given
input columns: [jsontostructs(CAST(body AS STRING))];;
I'm apparently missing something. I hoped from_json would return a DataFrame I could manipulate further.
My ultimate objective is to cast the json-string within the oldDF body-column to a dataset.
from_json returns a struct (or array<struct<...>>) column, which means it is a nested object. If you've provided a meaningful name:
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")
and the schema describes a plain struct, you can use standard methods like
newDF.select($"parsed.propertyNameInTheParsedJsonObject")
otherwise please follow the instructions for accessing arrays.
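For the array case, a minimal sketch (assuming jsonSchema is an ArrayType wrapping the same struct) is to explode the parsed column before selecting fields:
import org.apache.spark.sql.functions.{explode, from_json}

// from_json with an ArrayType schema yields array<struct<...>>; explode turns each
// array element into its own row so nested fields can be selected by name.
val exploded = oldDF
  .select(explode(from_json($"body".cast("string"), jsonSchema)) as "parsed")
  .select($"parsed.propertyNameInTheParsedJsonObject")
From there, .as[...] with a matching case class would give the Dataset mentioned as the ultimate objective.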

How to read Hive table with column with JSON strings?

I have a Hive table with a column (Json_String String) that has about 1000 rows, where each row is a JSON document with the same structure. I am trying to read the JSON into a DataFrame as below:
val df = sqlContext.read.json("select Json_String from json_table")
but it throws the exception below:
java.io.IOException: No input paths specified in job
Is there any way to read all the rows into a DataFrame, as we do with JSON files using a wildcard?
val df = sqlContext.read.json("file:///home/*.json")
I think what you're asking for is to read the Hive table as usual and transform the JSON column using the from_json function.
from_json(e: Column, schema: StructType): Column Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string.
Given you use sqlContext in your code, I'm afraid that you use Spark < 2.1.0 which then does not offer from_json (which was added in 2.1.0).
The solution then is to use a custom user-defined function (UDF) to do the parsing yourself.
val df = sqlContext.read.json("select Json_String from json_table")
The above won't work since the json operator expects a path or paths to JSON files on disk (not the result of executing a query against a Hive table).
json(paths: String*): DataFrame Loads a JSON file (JSON Lines text format or newline-delimited JSON) and returns the result as a DataFrame.
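A rough sketch of the UDF route on Spark < 2.1.0, assuming a hypothetical Record case class that matches the JSON structure and using json4s (which ships with Spark) for the parsing:
import org.apache.spark.sql.functions.{col, udf}
import org.json4s._
import org.json4s.jackson.JsonMethods

// Hypothetical shape of each JSON document; adjust to the real structure.
case class Record(id: Long, name: String)

// Parse each JSON string inside a UDF; the case class returned by the UDF
// becomes a struct column in the result.
val parseJson = udf { json: String =>
  implicit val formats: Formats = DefaultFormats
  JsonMethods.parse(json).extract[Record]
}

// Read the Hive table through SQL (not read.json), then apply the UDF.
val df = sqlContext
  .sql("select Json_String from json_table")
  .withColumn("parsed", parseJson(col("Json_String")))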

Datatype conversion of Parquet using Spark SQL - dynamically, without specifying a column name explicitly

I am looking for a way to handle data type conversion dynamically. I am loading data into a Spark DataFrame using a Hive SQL query and then writing it to a Parquet file. Hive is unable to read some of the data types, and I want to convert the decimal datatypes to Double. Instead of specifying each column name separately, is there any way to handle the datatype dynamically? Say my DataFrame has 50 columns, of which 8 are decimals, and I need to convert all 8 of them to the Double datatype without specifying the column names. Can we do that directly?
There is no direct way to do this data type conversion; here are some options:
Either cast those columns in the Hive query,
or
create/use a case class with the data types you require, populate it with the data, and use it to generate the Parquet,
or
read the data types from the Hive query metadata and use dynamic code to achieve case one or case two.
There are two options:
1. Use the schema from the DataFrame and dynamically generate the query statement
2. Use the create table ... select * option with Spark SQL
This is already answered and this post has details, with code.
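A minimal sketch of option 1, assuming the DataFrame is called df: walk the schema and cast every DecimalType column to DoubleType, without naming any column explicitly.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, DoubleType}

// Fold over the schema, replacing each decimal column with a double-typed copy
// and leaving every other column untouched.
val converted = df.schema.fields.foldLeft(df) { (acc, field) =>
  field.dataType match {
    case _: DecimalType => acc.withColumn(field.name, col(field.name).cast(DoubleType))
    case _              => acc
  }
}

converted.write.parquet("/some/output/path")  // hypothetical output path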

Convert RDD[CassandraRow] to RDD[String]

Is it possible to convert RDD[CassandraRow] to RDD[String]? If so, is there any disadvantage to working with the converted RDD?
You can use sqlContext to read data from a Cassandra table; it returns a DataFrame. When you read a text file using sparkContext it returns an RDD, which you can then convert to a DataFrame.
If your text files are CSV, Spark 2.0 supports the csv data source, which returns a DataFrame by default. Please see https://spark.apache.org/releases/spark-release-2-0-0.html#new-features and https://github.com/databricks/spark-csv/issues/
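For the direct RDD-to-RDD conversion asked about, a rough sketch assuming the Spark Cassandra Connector and hypothetical keyspace and table names:
import com.datastax.spark.connector._

// Read the table as RDD[CassandraRow] via the connector, then map each row to a String.
val rowRdd = sc.cassandraTable("my_keyspace", "my_table")
val stringRdd: org.apache.spark.rdd.RDD[String] = rowRdd.map(_.toString)
The main disadvantage is that the typed, named column access of CassandraRow (getString, getInt, and so on) is lost once each row is flattened to a plain String.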
Update:
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html