spark custom schema for csv file - pyspark

Can I define a schema including sub-columns, like the one below, in Spark for a CSV file, and join two files on the basis of KeyFields and NonKeyFields?
KeyFields: EmpId
NonKeyFields: DOB, FirstName, LastName, Contact, Loc1, Loc2, DOJ, Comments, Supervisor
My sample data is in the following format:
1242569,11-Sep-95,SANDEEP,KUMAR,9010765550,HYDERABAD,OFFSHORE,15-Jan-16,Passed Due,NAGALAKSHMI CHALLA

Yes, you can do it while reading the CSV file, by passing the schema to the reader:
df = sqlContext.read.load(<path of the file>, format="csv", schema=schema)
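For reference, a minimal PySpark sketch of that idea, assuming Spark 2+ and the column names from the sample row (the file paths are hypothetical): the CSV reader itself needs a flat schema, but you can group the columns into KeyFields/NonKeyFields structs after loading and join the two files on the key field.
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# CSV sources only accept a flat schema; nested "sub-columns" are built afterwards
fields = ["EmpId", "DOB", "FirstName", "LastName", "Contact",
          "Loc1", "Loc2", "DOJ", "Comments", "Supervisor"]
schema = StructType([StructField(f, StringType(), True) for f in fields])

df1 = spark.read.csv("employees_a.csv", schema=schema)  # hypothetical input files
df2 = spark.read.csv("employees_b.csv", schema=schema)

# Optional: expose the key/non-key grouping as struct columns
grouped = df1.select(
    struct("EmpId").alias("KeyFields"),
    struct(*fields[1:]).alias("NonKeyFields"))

# Join the two files on the key field; pass a list of names to on= if the
# join should also compare some of the non-key fields
joined = df1.join(df2, on="EmpId", how="inner")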

Related

Null columns in data frame

I am creating a data frame from a JSON file in HDFS, e.g. hdfs://localhost:9000/usr/sparkApp/finalTxtTweets/-1606454440000.txt/part-00000, in the Scala Spark shell. I only need the id and text parts of the JSON file, so I created a schema with val schema = new StructType().add("UserID", StringType, true).add("Text", StringType, true), and with that schema I create the data frame like this:
val df = spark.read.schema(schema).option("multiLine", true).option("mode", "PERMISSIVE").json("hdfs://localhost:9000/usr/sparkApp/finalTxtTweets/-1606454440000.txt/part-00000")
However, it returns a data frame with 2 columns (UserID and Text) but with null in every row.
This is what the JSON file from part-00000 looks like:
JObject(List((UserID,JInt(2308359613)), (Text,JString(RT #ONEEsports: The ONE Esports MPL Invitational is now live! Join us as 20 of the best teams from Indonesia, Singapore, Malaysia, Myanmar…))))
Does anyone have any idea how to correctly parse the JSON file? Or is there anything wrong with my code? Thank you very much for helping!

Count distinct users from RDD

I have a JSON file which I loaded into my program using textFile. I want to count the number of distinct users in my JSON data. I cannot convert it to a DataFrame or Dataset. I tried the following code, but it gave me a Java EOF error.
jsonFile = sc.textFile('some.json')
dd = jsonFile.filter(lambda x: x[1]).distinct().count()
# 2nd column is the user ID column
Sample data
{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,text":"Total bill for this horrible service? Over $8Gs","date":"2013-05-07 04:34:36"}
Use:
spark.read.json(Json_File, multiLine=True)
to read the JSON directly into a dataframe.
Try multiLine as True or False depending on your file's layout.
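For example, a minimal PySpark sketch along those lines, using the user_id field and the file name from the question (the sample has one JSON object per line, so multiLine is left as False here; switch it to True if a record spans several lines):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the JSON records into a dataframe
df = spark.read.json("some.json", multiLine=False)

# Count distinct users via the user_id field from the sample record
print(df.select("user_id").distinct().count())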

Scripts for generating csv files for Spark Cassandra data

I want to generate CSV files, as per the logic below, for a table in Cassandra.
val df = sc.parallelize(Seq(("a",1,"abc#gmail.com"), ("b",2,"def#gmail.com"),("a",1,"xyz#gmail.com"),("a",2,"abc#gmail.com"))).toDF("col1","col2","emailId")
Since there are 3 distinct emailIds, I need to generate 3 distinct CSV files, one for each of the 3 queries below:
select * from table where emailId='abc#gmail.com'
select * from table where emailId='def#gmail.com'
select * from table where emailId='xyz#gmail.com'
How can I do this? Can anyone please help me with this?
Version:
Spark 1.6.2
Scala 2.10
Create a distinct list of the emails, then iterate over it. On each iteration, filter for only the rows that match the current email and save that subset out as a CSV file.
import sqlContext.implicits._
val emailData = sc.parallelize(Seq(("a",1,"abc#gmail.com"), ("b",2,"def#gmail.com"),("a",1,"xyz#gmail.com"),("a",2,"abc#gmail.com"))).toDF("col1","col2","emailId")
val distinctEmails = emailData.select("emailId").distinct().as[String].collect
for (email <- distinctEmails){
val subsetEmailsDF = emailData.filter($"emailId" === email).coalesce(1)
//... save the subset dataframe as a CSV file here
}
Note: coalesce(1) sends all the data to one node. This can create memory issues if the dataframe is too large.

spark-mongo connector to HDFS in csv

I am using the Spark-Mongo connector (in the R language) to query collections. When I select fields and save the result as follows:
t2 <- sql(sqlContext, "select name,age from members");
saveDF(t2, "hdfs://server:8020/path/res")
It saves the result as Parquet files with JSON content, but I want simple plain text in HDFS. How can I write this dataframe in CSV format to HDFS?
I expect:
Peter,20
Mike,15
John,30
@Ross thanks, that was the solution:
write.df(dataframe, "hdfs://server:8000/path/hdfs", "com.databricks.spark.csv", "overwrite")

Convert csv.gz files into Parquet using Spark

I need to implement converting the csv.gz files in a folder, both in AWS S3 and HDFS, to Parquet files using Spark (Scala preferred). One of the columns of the data is a timestamp, and I only have a week's worth of data. The timestamp format is:
'yyyy-MM-dd hh:mm:ss'
The output that I want is a folder (or partition) for every day, containing the Parquet files for that specific date, so there would be 7 output folders or partitions.
I only have a faint idea of how to do this; only sc.textFile comes to mind. Is there a function in Spark that can convert to Parquet? How do I implement this in S3 and HDFS?
Thanks for your help.
If you look into the Spark DataFrame API and the Spark-CSV package, this will achieve the majority of what you're trying to do: reading the CSV file into a dataframe and then writing the dataframe out as Parquet will get you most of the way there. You'll still need to parse the timestamp and use the result to partition the data.
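As a rough sketch of that approach, shown here in PySpark and assuming Spark 2+ (where the CSV reader is built in); the Scala DataFrame API has the same methods, and the paths and the timestamp column name below are assumptions:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()

# .csv.gz files are decompressed transparently; the same call works for s3a:// or hdfs:// paths
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3a://my-bucket/input/*.csv.gz"))  # hypothetical input path

# Derive a date column from the 'yyyy-MM-dd hh:mm:ss' timestamp and partition on it,
# which gives one output folder per day (7 for a week of data)
out = df.withColumn("date", to_date("timestamp"))  # 'timestamp' column name is an assumption

(out.write
    .partitionBy("date")
    .mode("overwrite")
    .parquet("hdfs:///output/parquet"))  # hypothetical output path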
Old topic, but I think it is important to answer even old topics if they were not answered correctly.
In Spark version >= 2 the CSV package is already included; before that you need to add the Databricks CSV package to your job, e.g. "--packages com.databricks:spark-csv_2.10:1.5.0".
Example csv:
id,name,date
1,pete,2017-10-01 16:12
2,paul,2016-10-01 12:23
3,steve,2016-10-01 03:32
4,mary,2018-10-01 11:12
5,ann,2018-10-02 22:12
6,rudy,2018-10-03 11:11
7,mike,2018-10-04 10:10
First you need to create the Hive table so that the data written by Spark is compatible with the Hive schema (this might not be needed anymore in future versions).
Create the table:
create table part_parq_table (
id int,
name string
)
partitioned by (date string)
stored as parquet
After you've done that, you can easily read the CSV and save the dataframe to that table. The second step overwrites the date column with the date format "yyyy-MM-dd". For each value, a folder will be created containing the matching lines.
SCALA Spark-Shell example:
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
The first two lines are Hive configurations, which are needed to create partition folders that do not exist already.
import org.apache.spark.sql.functions.{col, substring}
var df=spark.read.format("csv").option("header","true").load("/tmp/test.csv")
df=df.withColumn("date",substring(col("date"),0,10))
df.show(false)
df.write.format("parquet").mode("append").insertInto("part_parq_table")
After the insert is done you can directly query the table, e.g. "select * from part_parq_table".
The folders will be created in the table folder, which on a default Cloudera setup is e.g. hdfs:///users/hive/warehouse/part_parq_table.
Hope that helps.
BR
Read the tab-delimited CSV file /user/hduser/wikipedia/pageviews-by-second-tsv:
"timestamp" "site" "requests"
"2015-03-16T00:09:55" "mobile" 1595
"2015-03-16T00:10:39" "mobile" 1544
The following code uses Spark 2.0:
import org.apache.spark.sql.types._
var wikiPageViewsBySecondsSchema = StructType(Array(StructField("timestamp", StringType, true),StructField("site", StringType, true),StructField("requests", LongType, true) ))
var wikiPageViewsBySecondsDF = spark.read.schema(wikiPageViewsBySecondsSchema).option("header", "true").option("delimiter", "\t").csv("/user/hduser/wikipedia/pageviews-by-second-tsv")
Convert the string timestamp to a timestamp:
wikiPageViewsBySecondsDF= wikiPageViewsBySecondsDF.withColumn("timestampTS", $"timestamp".cast("timestamp")).drop("timestamp")
or
wikiPageViewsBySecondsDF= wikiPageViewsBySecondsDF.select($"timestamp".cast("timestamp"), $"site", $"requests")
Write into a Parquet file:
wikiPageViewsBySecondsDF.write.parquet("/user/hduser/wikipedia/pageviews-by-second-parquet")