Loading a csv file in spark - Unwanted ' " ' value appearing in cell values - scala

I have loaded a csv file, then stored that as a RDD and then as a dataframe:
Running this in a spark-shell (spark version 1.6, scala 2.10.5)
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType, StructField, StringType};
val schemaString="age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y"
val schema = StructType (schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
val data = sc.textFile("bankproject.csv")
val rowRDD=data.map(_.split(";")).map(d=>Row(d(0),d(1),d(2),d(3),d(4),d(5),d(6),d(7),d(8),d(9),d(10),d(11),d(12),d(13),d(14),d(15),d(16)))
val bankDF=sqlContext.createDataFrame(rowRDD, schema)
bankDF.show()
Now the first and last cell of the all the rows in the dataframe has an additional double quotes ' " ' (see dataframe image) , what am I doing wrong here?

Related

can not cast values in spark scala dataframe

I am trying to parse the data from numbers
Enviroment: DataBricks Scala 2.12 Spark 3.1
I had chosen columns that were incorrectly parsed as Strings the reason is that sometimes numbers were written with coma sometimes with dot.
I am trying to first replace all commas to dots parse it as floats, create schema with type of floating numbers and recreate the dataframe but it does not work.
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, FloatType};
import org.apache.spark.sql.{Row, SparkSession}
import sqlContext.implicits._
//temp is a dataframe with data that I included below
val jj = temp.collect().map(row=> Row(row.toSeq.map(it=> if(it==null) {null} else {it.asInstanceOf[String].replace( ",", ".").toFloat }) ))
val schemaa = temp.columns.map(colN=> (StructField(colN, FloatType, true)))
val newDatFrame = spark.createDataFrame(jj,schemaa)
Data screen
CSV
Podana aktywność,CRP(6 mcy),WBC(6 mcy),SUV (max) w miejscu zapalenia,SUV (max) tła,tumor to background ratio
218,72,"15,2",16,"1,8","8,888888889"
"199,7",200,"16,5","21,5","1,4","15,35714286"
270,42,"11,17","7,6","2,4","3,166666667"
200,226,"29,6",9,"2,8","3,214285714"
200,45,"13,85",17,"2,1","8,095238095"
300,null,"37,8","6,19","2,5","2,476"
290,175,"7,35",9,"2,4","3,75"
279,160,"8,36",13,2,"6,5"
202,24,10,"6,7","2,6","2,576923077"
334,"22,9","8,01",12,"2,4",5
"200,4",null,"25,56",7,"2,4","2,916666667"
198,102,"8,36","7,4","1,8","4,111111111"
"211,6","26,7","10,8","4,2","1,6","2,625"
205,null,null,"9,7","2,07","4,685990338"
326,300,18,14,"2,4","5,833333333"
270,null,null,15,"2,5",6
258,null,null,6,"2,5","2,4"
300,197,"13,5","12,5","2,6","4,807692308"
200,89,"20,9","4,8","1,7","2,823529412"
"201,7",28,null,11,"1,8","6,111111111"
198,9,13,9,2,"4,5"
264,null,"20,3",12,"2,5","4,8"
230,31,"13,3","4,8","1,8","2,666666667"
284,107,"9,92","5,8","1,49","3,89261745"
252,270,null,8,"1,56","5,128205128"
266,null,null,"10,4","1,95","5,333333333"
242,null,null,"14,7",2,"7,35"
259,null,null,"10,01","1,65","6,066666667"
224,null,null,"4,2","1,86","2,258064516"
306,148,10.3,11,1.9,"0,0002488406289"
294,null,5.54,"9,88","1,93","5,119170984"
You can map the columns using Spark SQL regexp_replace. collect is not needed and will not give a good performance. You might also want to use double instead of float because some entries have many decimal places.
val new_df = df.select(
df.columns.map(
c => regexp_replace(col(c), ",", ".").cast("double").as(c)
):_*
)

import data with a column of type Pig Map into spark Dataframe?

So I'm trying to import data that has a column of type Pig map into a spark dataframe, and I couldn't find anything on how do I explode the map data into 3 columns with names: street, city and state. I'm probably searching for the wrong thing. Right now I can import them into 3 columns using StructType and StructField options.
val schema = StructType(Array(
StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("address", StringType, true))) #this is the part that I need to explode
val data = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", ";")
.schema(schema)
.load("hdfs://localhost:8020/filename")
Example row of the data that I need to make 5 columns from:
328;Some Name;[street#streetname,city#Chicago,state#IL]
What do i need to do to explode the map into 3 columns so id have essentially a new dataframe with 5 columns ? I just started Spark and I've never used pig. I only figured out it was a pig map through searching the structure [key#value].
I'm using spark 1.6 by the way with scala. Thank you for any help.
I'm not too familiar with the pig format (there may even be libraries for it), but some good ol' fashioned string manipulation seems to work. In practice you may have to do some error checking, or you'll get index out of range errors.
val data = spark.createDataset(Seq(
(328, "Some Name", "[street#streetname,city#Chicago,state#IL]")
)).toDF("id", "name", "address")
data.as[(Long, String, String)].map(r => {
val addr = (r._3.substring(1, r._3.length - 1)).split(",")
val street = addr(0).split("#")(1)
val city = addr(1).split("#")(1)
val state = addr(2).split("#")(1)
(r._1, r._2, street, city, state)
}).toDF("id", "name", "street", "city", "state").show()
which results in
+---+---------+----------+-------+-----+
| id| name| street| city|state|
+---+---------+----------+-------+-----+
|328|Some Name|streetname|Chicago| IL|
+---+---------+----------+-------+-----+
I'm not 100% certain of the compatibility with spark 1.6, however. You may end up having to map the Dataframe (as opposed to Dataset, as I'm converting it with the .as[] call), and extract the individual value's from the Row object in your anonymous .map() function. The overall concept should be the same though.

I can't fit the FP-Growth model in spark

Please, can you help me ? I have an 80 CSV files dataset and a cluster of one master and 4 slaves. I want to read the CSV files in a dataframe and parallelize it on the four slaves. After that, I want to filter the dataframe with a group by. In my spark queries, the result contains columns "code_ccam" and "dossier" grouped by ("code_ccam","dossier"). I want to use the FP-Growth algorithm to detect sequences of "code_ccam" which are repeated by "folder". But when I use the FPGrowth.fit() command, I have the following error :
"error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Dataset[_]"
Here are my spark commands:
val df = spark.read.option("header", "true").csv("file:///home/ia/Projet-Spark-ace/Donnees/Fichiers CSV/*.csv")
import org.apache.spark.sql.functions.{concat, lit}
val df2 = df.withColumn("dossier", concat(col("num_immatriculation"), lit(""), col("date_acte"), lit(""), col("rang_naissance"), lit(""), col("date_naissance")))
val df3 = df2.drop("num_immatriculation").drop("date_acte").drop("rang_naissance").drop("date_naissance")
val df4 = df3.select("dossier","code_ccam").groupBy("dossier","code_ccam").count()
val transactions = df4.agg(collect_list("code_ccam").alias("codes_ccam")).rdd.map(x => x)
import org.apache.spark.ml.fpm.FPGrowth
val fpgrowth = new FPGrowth().setItemsCol("code_ccam").setMinSupport(0.5).setMinConfidence(0.6)
val model = fpgrowth.fit(transactions)
Tkank you very much. It worked. I replaced collect_list by collect_set.

columnSimilarities() back to Spark Data Frame

I have a second question around CosineSimilarity / ColumnSimilarities in Spark 2.1. I'm kinda new to scala and all the Spark environment and this is not really clear to me:
How can I get back the ColumnSimilarities for each combination of columns from the rowMatrix in spark. Here is what I tried:
Data:
import org.apache.spark.sql.{SQLContext, Row, DataFrame}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType}
import org.apache.spark.sql.functions._
// rdd
val rowsRdd: RDD[Row] = sc.parallelize(
Seq(
Row(2.0, 7.0, 1.0),
Row(3.5, 2.5, 0.0),
Row(7.0, 5.9, 0.0)
)
)
// Schema
val schema = new StructType()
.add(StructField("item_1", DoubleType, true))
.add(StructField("item_2", DoubleType, true))
.add(StructField("item_3", DoubleType, true))
// Data frame
val df = spark.createDataFrame(rowsRdd, schema)
Code:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
val rows = new VectorAssembler().setInputCols(df.columns).setOutputCol("vs")
.transform(df)
.select("vs")
.rdd
val items_mllib_vector = rows.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
.map(org.apache.spark.mllib.linalg.Vectors.fromML)
val mat = new RowMatrix(items_mllib_vector)
val simsPerfect = mat.columnSimilarities()
println("Pairwise similarities are: " + simsPerfect.entries.collect.mkString(", "))
Output:
Pairwise similarities are: MatrixEntry(0,2,0.24759378423606918), MatrixEntry(1,2,0.7376189553526812), MatrixEntry(0,1,0.8355316482961213)
So What I get is simsPerfect org.apache.spark.mllib.linalg.distributed.CoordinateMatrix of my Columns and similarities. How would I transform this back to a dataframe and get the right columns names with it?
My preferred output:
item_from | item_to | similarity
1 | 2 | 0.83 |
1 | 3 | 0.24 |
2 | 3 | 0.73 |
Thanks in advance
This approach also works without converting the row to String:
val transformedRDD = simsPerfect.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => (row,col,sim)}
val dff = sqlContext.createDataFrame(transformedRDD).toDF("item_from", "item_to", "sim")
where, I assume val sqlContext = new org.apache.spark.sql.SQLContext(sc) is defined already and sc is the SparkContext.
I found a solution for my problem:
//Transform result to rdd
val transformedRDD = simsPerfect.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => Array(row,col,sim).mkString(",")}
//Transform rdd[String] to rdd[Row]
val rdd2 = transformedRDD.map(a => Row(a))
// to DF
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = spark.createDataFrame(rdd2,dfschema)
//create new DF with schema
val newdf = rddToDF.select(expr("(split(value, ','))[0]").cast("string").as("item_from")
,expr("(split(value, ','))[1]").cast("string").as("item_to")
,expr("(split(value, ','))[2]").cast("string").as("sim"))
I'm sure there is another easier way to do this, but I'm happy that it works.

How to convert all column of dataframe to numeric spark scala?

I loaded a csv as dataframe. I would like to cast all columns to float, knowing that the file is to big to write all columns names:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
Given this DataFrame as example:
val df = sqlContext.createDataFrame(Seq(("0", 0),("1", 1),("2", 0))).toDF("id", "c0")
with schema:
StructType(
StructField(id,StringType,true),
StructField(c0,IntegerType,false))
You can loop over DF columns by .columns functions:
val castedDF = df.columns.foldLeft(df)((current, c) => current.withColumn(c, col(c).cast("float")))
So the new DF schema looks like:
StructType(
StructField(id,FloatType,true),
StructField(c0,FloatType,false))
EDIT:
If you wanna exclude some columns from casting, you could do something like (supposing we want to exclude the column id):
val exclude = Array("id")
val someCastedDF = (df.columns.toBuffer --= exclude).foldLeft(df)((current, c) =>
current.withColumn(c, col(c).cast("float")))
where exclude is an Array of all columns we want to exclude from casting.
So the schema of this new DF is:
StructType(
StructField(id,StringType,true),
StructField(c0,FloatType,false))
Please notice that maybe this is not the best solution to do it but it can be a starting point.