can not cast values in spark scala dataframe - scala

I am trying to parse the data from numbers
Enviroment: DataBricks Scala 2.12 Spark 3.1
I had chosen columns that were incorrectly parsed as Strings the reason is that sometimes numbers were written with coma sometimes with dot.
I am trying to first replace all commas to dots parse it as floats, create schema with type of floating numbers and recreate the dataframe but it does not work.
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, FloatType};
import org.apache.spark.sql.{Row, SparkSession}
import sqlContext.implicits._
//temp is a dataframe with data that I included below
val jj = temp.collect().map(row=> Row(row.toSeq.map(it=> if(it==null) {null} else {it.asInstanceOf[String].replace( ",", ".").toFloat }) ))
val schemaa = temp.columns.map(colN=> (StructField(colN, FloatType, true)))
val newDatFrame = spark.createDataFrame(jj,schemaa)
Data screen
CSV
Podana aktywność,CRP(6 mcy),WBC(6 mcy),SUV (max) w miejscu zapalenia,SUV (max) tła,tumor to background ratio
218,72,"15,2",16,"1,8","8,888888889"
"199,7",200,"16,5","21,5","1,4","15,35714286"
270,42,"11,17","7,6","2,4","3,166666667"
200,226,"29,6",9,"2,8","3,214285714"
200,45,"13,85",17,"2,1","8,095238095"
300,null,"37,8","6,19","2,5","2,476"
290,175,"7,35",9,"2,4","3,75"
279,160,"8,36",13,2,"6,5"
202,24,10,"6,7","2,6","2,576923077"
334,"22,9","8,01",12,"2,4",5
"200,4",null,"25,56",7,"2,4","2,916666667"
198,102,"8,36","7,4","1,8","4,111111111"
"211,6","26,7","10,8","4,2","1,6","2,625"
205,null,null,"9,7","2,07","4,685990338"
326,300,18,14,"2,4","5,833333333"
270,null,null,15,"2,5",6
258,null,null,6,"2,5","2,4"
300,197,"13,5","12,5","2,6","4,807692308"
200,89,"20,9","4,8","1,7","2,823529412"
"201,7",28,null,11,"1,8","6,111111111"
198,9,13,9,2,"4,5"
264,null,"20,3",12,"2,5","4,8"
230,31,"13,3","4,8","1,8","2,666666667"
284,107,"9,92","5,8","1,49","3,89261745"
252,270,null,8,"1,56","5,128205128"
266,null,null,"10,4","1,95","5,333333333"
242,null,null,"14,7",2,"7,35"
259,null,null,"10,01","1,65","6,066666667"
224,null,null,"4,2","1,86","2,258064516"
306,148,10.3,11,1.9,"0,0002488406289"
294,null,5.54,"9,88","1,93","5,119170984"

You can map the columns using Spark SQL regexp_replace. collect is not needed and will not give a good performance. You might also want to use double instead of float because some entries have many decimal places.
val new_df = df.select(
df.columns.map(
c => regexp_replace(col(c), ",", ".").cast("double").as(c)
):_*
)

Related

Inferschema detecting column as string instead of double from parquet in pyspark

Problem -
I am reading a parquet file in pyspark using azure databricks. There are columns which lot of nulls and have decimal values, these columns are read as string instead of double.
Is there any way of inferring the proper data type in pyspark?
Code -
To read parquet file -
df_raw_data = sqlContext.read.parquet(data_filename[5:])
The output of this is a dataframe with more than 100 columns of which most of the columns are of the type double but the printSchema() shows it as string.
P.S -
I have a parquet file which can have dynamic columns hence defining struct for the dataframe does not work for me. I used to convert the spark dataframe to pandas and use convert_objects but that does not work as the parquet file is huge.
You can define the schema using StructType and then provide this schema in the schema option while loading the data.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
fileSchema = StructType([StructField('atm_id', StringType(),True),
StructField('atm_street_number', IntegerType(),True),
StructField('atm_zipcode', IntegerType(),True),
StructField('atm_lat', DoubleType(),True),
])
df_raw_data = spark.read \
.option("header",True) \
.option("format", "parquet") \
.schema(fileSchema) \
.load(data_filename[5:])

import data with a column of type Pig Map into spark Dataframe?

So I'm trying to import data that has a column of type Pig map into a spark dataframe, and I couldn't find anything on how do I explode the map data into 3 columns with names: street, city and state. I'm probably searching for the wrong thing. Right now I can import them into 3 columns using StructType and StructField options.
val schema = StructType(Array(
StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("address", StringType, true))) #this is the part that I need to explode
val data = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", ";")
.schema(schema)
.load("hdfs://localhost:8020/filename")
Example row of the data that I need to make 5 columns from:
328;Some Name;[street#streetname,city#Chicago,state#IL]
What do i need to do to explode the map into 3 columns so id have essentially a new dataframe with 5 columns ? I just started Spark and I've never used pig. I only figured out it was a pig map through searching the structure [key#value].
I'm using spark 1.6 by the way with scala. Thank you for any help.
I'm not too familiar with the pig format (there may even be libraries for it), but some good ol' fashioned string manipulation seems to work. In practice you may have to do some error checking, or you'll get index out of range errors.
val data = spark.createDataset(Seq(
(328, "Some Name", "[street#streetname,city#Chicago,state#IL]")
)).toDF("id", "name", "address")
data.as[(Long, String, String)].map(r => {
val addr = (r._3.substring(1, r._3.length - 1)).split(",")
val street = addr(0).split("#")(1)
val city = addr(1).split("#")(1)
val state = addr(2).split("#")(1)
(r._1, r._2, street, city, state)
}).toDF("id", "name", "street", "city", "state").show()
which results in
+---+---------+----------+-------+-----+
| id| name| street| city|state|
+---+---------+----------+-------+-----+
|328|Some Name|streetname|Chicago| IL|
+---+---------+----------+-------+-----+
I'm not 100% certain of the compatibility with spark 1.6, however. You may end up having to map the Dataframe (as opposed to Dataset, as I'm converting it with the .as[] call), and extract the individual value's from the Row object in your anonymous .map() function. The overall concept should be the same though.

Dataframe: how to groupBy/count then order by count in Scala

I have a dataframe that contains a thousands of rows, what I'm looking for is to group by and count a column and then order by the out put: what I did is somthing looks like :
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
val objHive = new HiveContext(sc)
val df = objHive.sql("select * from db.tb")
val df_count=df.groupBy("id").count().collect()
df_count.sort($"count".asc).show()
You can use sort or orderBy as below
val df_count = df.groupBy("id").count()
df_count.sort(desc("count")).show(false)
df_count.orderBy($"count".desc).show(false)
Don't use collect() since it brings the data to the driver as an Array.
Hope this helps!
//import the SparkSession which is the entry point for spark underlying API to access
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val pathOfFile="f:/alarms_files/"
//create session and hold it in spark variable
val spark=SparkSession.builder().appName("myApp").getOrCreate()
//read the file below API will return DataFrame of Row
var df=spark.read.format("csv").option("header","true").option("delimiter", "\t").load("file://"+pathOfFile+"db.tab")
//groupBY id column and take count of the column and order it by count of the column
df=df.groupBy(df("id")).agg(count("*").as("columnCount")).orderBy("columnCount")
//for projecting the dataFrame it will show only top 20 records
df.show
//for projecting more than 20 records eg:
df.show(50)

Loading a csv file in spark - Unwanted ' " ' value appearing in cell values

I have loaded a csv file, then stored that as a RDD and then as a dataframe:
Running this in a spark-shell (spark version 1.6, scala 2.10.5)
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType, StructField, StringType};
val schemaString="age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y"
val schema = StructType (schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val data = sc.textFile("bankproject.csv")
val rowRDD=data.map(_.split(";")).map(d=>Row(d(0),d(1),d(2),d(3),d(4),d(5),d(6),d(7),d(8),d(9),d(10),d(11),d(12),d(13),d(14),d(15),d(16)))
val bankDF=sqlContext.createDataFrame(rowRDD, schema)
bankDF.show()
Now the first and last cell of the all the rows in the dataframe has an additional double quotes ' " ' (see dataframe image) , what am I doing wrong here?

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I see I want to use these kind of functions:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Do I use the function sum wrongly?
Do Ι need to use first the function map? and if yes how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply aggregation function, Sum on your column
df.groupby('steps').sum().show()
Follow the Documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link also https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked but:
df.describe().show("columnName")
gives mean, count, stdtev stats on a column. I think it returns on all columns if you just do .show()
Using spark sql query..just incase if it helps anyone!
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import java.util.stream.Collectors
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF()
df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps").map(row => row.getAs("stepsSum").asInstanceOf[Long]).collect()(0)
println("steps sum = " + sum) //prints 28