Save values in Spark - Scala

I am trying to read and write data from my local folder, but the output is not what I expect.
val data = sc.textFile("/user/cts367689/datagen.txt")
val a=data.map(line=>(line.split(",")(0).toInt+line.split(",")(4).toInt,line.split(",")(3),line.split(",")(2)))
a.saveAsTextFile("/user/cts367689/sparkoutput")
Output:
(526,female,avil)
(635,male,avil)
(983,male,paracetamol)
(342,female,paracetamol)
(158,female,avil)
How can I save the output as below? I need to remove the brackets.
Expected Result:
526,female,avil
635,male,avil
983,male,paracetamol
342,female,paracetamol
158,female,avil

val a = data.map(line =>
  (line.split(",")(0).toInt + line.split(",")(4).toInt) + "," +
  line.split(",")(3) + "," +
  line.split(",")(2)
)
Try this instead of returning the values in parentheses: wrapping them in () builds a tuple, and saveAsTextFile writes each tuple's toString, which includes the brackets. Building the string yourself gives you the plain comma-separated line.
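A slightly tidier variant of the same idea (just a sketch, assuming the same column layout as in the question) splits each line only once instead of five times:

// Split once, then build the comma-separated output string.
// Assumes column 0 and column 4 are integers, as in the question.
val a = data.map { line =>
  val f = line.split(",")
  s"${f(0).toInt + f(4).toInt},${f(3)},${f(2)}"
}
a.saveAsTextFile("/user/cts367689/sparkoutput")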

Spark can handle unstructured files, and that is what you are using here. For CSV (comma-separated values) files there are good libraries that do this for you; have a look at the spark-csv package.
For your question, the answer is shown below.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "false")
  .load("/user/cts367689/datagen.txt")

// With header = false, spark-csv names the columns C0, C1, ... ;
// select the ones you need by those generated names (or attach your own
// schema if you want to refer to them as id, gender, name).
df.select("C0", "C3", "C2").write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/user/cts367689/sparkoutput")

Use:
val a = data.map(line => line.split(",")(0).toInt + line.split(",")(4).toInt + "," + line.split(",")(3) + "," + line.split(",")(2))

Related

Load CSVs - unable to pass file paths from dataframe

The code below works fine:
val Path = Seq (
"dbfs:/mnt/testdata/2019/02/Calls2019-02-03.tsv",
"dbfs:/mnt/testdata/2019/02/Calls2019-02-02.tsv"
)
val Calls = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", "\t")
.schema(schema)
.load(Path: _*)
But I want to get the paths from a dataframe, and the code below is not working.
val tsvPath =
Seq(
FinalFileList
.select($"Path")
.filter($"FileDate">MaxStartTime)
.collect.mkString(",")
.replaceAll("[\\[\\]]","")
)
val Calls = spark.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", "\t")
.schema(schema)
.load(tsvPath: _*)
Error:
org.apache.spark.sql.AnalysisException: Path does not exist: dbfs:/mnt/testdata/2019/02/Calls2019-02-03.tsv,dbfs:/mnt/testdata/2019/02/Calls2019-02-02.tsv;
Looks like it is taking the path as "/mnt/file1.tsv, /mnt/file2.tsv" instead of "/mnt/file1.tsv","/mnt/file2.tsv"
I suspect your problem is here:
.collect.mkString(",")
.replaceAll("[\\[\\]]","")
.mkString combines all the collected values into a single string, so load sees one long path instead of several. One possible solution is to split again after replacing:
.collect.mkString(",")
.replaceAll("[\\[\\]]","")
.split(",")
Another would be to strip the brackets from each element instead of combining them into one string:
.collect.map(_.toString.replaceAll("[\\[\\]]",""))
In both cases you end up with a collection of path strings, so drop the outer Seq(...) wrapper and pass the result straight to load(... : _*). Use whichever is better suited to you.
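Putting that together, a minimal sketch of the corrected read (an assumption here: the Path column holds plain path strings, so getString(0) is used instead of stripping brackets from Row.toString):

// Collect the matching paths as an Array[String] and pass them all to load.
val tsvPath = FinalFileList
  .select($"Path")
  .filter($"FileDate" > MaxStartTime)
  .collect
  .map(_.getString(0))

val Calls = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .schema(schema)
  .load(tsvPath: _*)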

Create DataFrame / Dataset using Header and Data in two different directories

I am getting the input file as CSV. I get two directories: the first directory has one file with the header record, and the second directory has the data files. From these, I want to create a DataFrame/Dataset.
One way I can do this is to create a case class, split the data files by the delimiter, attach the schema, and create a DataFrame.
What I am looking for is to read the header file and the data file and create a DataFrame. I saw a solution using Databricks, but my organization has a restriction on using Databricks; below is the code I came across. Can one of you help me with a solution that does not use Databricks?
val headersDF = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
.read
.format("com.databricks.spark.csv")
.schema(schema)
.load("path to data.csv")
You can do it like this:
val schema=spark
.read
.format("csv")
.option("header","true")
.option("delimiter",",")
.load("C:\\spark\\programs\\empheaders.csv")
.schema
val data=spark
.read
.format("csv")
.schema(schema)
.option("delimiter",",")
.load("C:\\spark\\programs\\empdata.csv")
Because your header CSV file doesn't contain any data, there is no point in inferring a schema from it.
So just get the field names by reading it:
val headerRDD = sc.parallelize(Seq(("Name,Age,Sal"))) //Assume this line is in your Header CSV
val header = headerRDD.flatMap(_.split(",")).collect
//headerRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at parallelize at command-2903591155643047:1
//header: Array[String] = Array(Name, Age, Sal)
Then read the data CSV file.
Map each line to either a case class or a tuple, and convert the data to a DataFrame by passing in the header array (a case-class sketch is shown after the result below).
val dataRdd = sc.parallelize(Seq(("Tom,22,500000"),("Rick,40,1000000"))) //Assume these lines are in your data CSV file
val data = dataRdd.map(_.split(",")).map(x => (x(0),x(1).toInt,x(2).toDouble)).toDF(header: _*)
//dataRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[72] at parallelize at command-2903591155643048:1
//data: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 1 more field]
Result:
data.show()
+----+---+---------+
|Name|Age| Sal|
+----+---+---------+
| Tom| 22| 500000.0|
|Rick| 40|1000000.0|
+----+---+---------+
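The same mapping with a case class instead of a tuple, as mentioned above (only a sketch, assuming the Name,Age,Sal layout; with a case class the column names come from its fields, so the header array is not needed):

// May require: import spark.implicits._ outside a notebook environment.
case class Employee(name: String, age: Int, sal: Double)

val dataFromCC = dataRdd
  .map(_.split(","))
  .map(x => Employee(x(0), x(1).toInt, x(2).toDouble))
  .toDF()   // columns: name, age, sal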

Trying to create a DataFrame from a file with delimiter '|'

I want to load a text file which has the delimiter "|" into a DataFrame in Spark.
One way is to create the RDD and use toDF to create the DataFrame. However, I was wondering if I can create the DF directly.
As of now I am using the command below:
val productsDF = sqlContext.read.text("/user/danishdshadab786/paper2/products/")
For Spark 2.x
val df = spark.read.format("csv")
.option("delimiter", "|")
.load("/user/danishdshadab786/paper2/products/")
For Spark < 2.0
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.load("/user/danishdshadab786/paper2/products/")
You can add more options like option("header", "true") for reading headers in the same statement.
You can specify the delimiter in the 'read' options:
spark.read
.option("delimiter", "|")
.csv("/user/danishdshadab786/paper2/products/")
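If the products file has no header row, Spark 2.x will name the columns _c0, _c1, and so on. A small sketch of renaming a few of them right after the read (the target names below are only placeholders; you could also rename all columns at once with toDF, passing exactly one name per column):

val productsDF = spark.read
  .option("delimiter", "|")
  .csv("/user/danishdshadab786/paper2/products/")
  .withColumnRenamed("_c0", "product_id")     // hypothetical target names
  .withColumnRenamed("_c1", "product_name")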

Create dataframe with header using header and data file

I have two files, data.csv and headers.csv. I want to create a DataFrame with headers in Spark/Scala.
var data = spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(data_path)
Can you help me customize the above lines to do this?
You can read headers.csv using the above method and then use the schema of the headers DataFrame to read data.csv, as below:
val headersDF = sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
.read
.format("com.databricks.spark.csv")
.schema(schema)
.load("path to data.csv")
I hope the answer is helpful.

Spark: convert a CSV to RDD[Row]

I have a .csv file, which contains 258 columns in the following structure.
["label", "index_1", "index_2", ... , "index_257"]
Now I want to transform this .csv file into an RDD[Row]:
val data_csv = sc.textFile("~/test.csv")
val rowRDD = data_csv.map(_.split(",")).map(p => Row( p(0), p(1).trim, p(2).trim))
If I do the transformation this way, I have to write out all 258 columns explicitly. So I tried:
val rowRDD = data_csv.map(_.split(",")).map(p => Row( _ => p(_).trim))
and
val rowRDD = data_csv.map(_.split(",")).map(p => Row( x => p(x).trim))
But these two do not work either and report an error:
error: missing parameter type for expanded function ((x$2) => p(x$2).trim)
Can anyone tell me how to do this transformation? Thanks a lot.
You should use sqlContext instead of sparkContext, as:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("~/test.csv")
This will create a DataFrame; calling .rdd on df gives you an RDD[Row]:
val rdd = df.rdd
Rather than reading it as a textFile, read the CSV file with spark-csv.
In your case:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")                   // use the first line of all files as the header
  .option("inferSchema", "true")              // automatically infer data types
  .option("quote", "\"")                      // quote character
  .option("ignoreLeadingWhiteSpace", "true")  // trim whitespace before your data
  .load("cars.csv")
This loads the data as a DataFrame, and you can then easily convert it to an RDD.
Hope this helps!
Apart from the other answers, which are correct, the way to do what you're trying to do is to use Row.fromSeq inside the map function.
import org.apache.spark.sql.Row
val rdd = sc.parallelize(Array((1 to 258).toArray, (1 to 258).toArray))
  .map(Row.fromSeq(_))
This will turn your RDD elements into type Row:
Array[org.apache.spark.sql.Row] = Array([1,2,3,4,5,6,7,8,9,10...
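Applied to the file from the question, a minimal sketch (assuming comma-separated lines and trimming each field, as in the original attempt) would be:

import org.apache.spark.sql.Row

// Split each line, trim every field, and build a Row without listing all 258 columns.
// Note: split(",", -1) would keep trailing empty fields, if that matters for your data.
val rowRDD = data_csv
  .map(_.split(","))
  .map(fields => Row.fromSeq(fields.map(_.trim)))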