Spark: Read a CSV file into a map-like structure using Scala

I have a csv file of the format:
key, age, marks, feature_n
abc, 23, 84, 85.3
xyz, 25, 67, 70.2
Here the number of features can vary. In this example I have 3 features (age, marks and feature_n). I have to convert it into a Map[String,String] as below:
[key,value]
["abc","age:23,marks:84,feature_n:85.3"]
["xyz","age:25,marks:67,feature_n:70.2"]
I have to join the above data with another dataset A on column 'key' and append the 'value' to another column in dataset A. The csv file can be loaded into a dataframe with schema (schema defined by first row of the csv file).
val newRecords = sparkSession.read.option("header", "true").option("mode", "DROPMALFORMED").csv("/records.csv");
Post this I will join the dataframe newRecords with dataset A and append the 'value' to one of the columns of dataset A.
How can I iterate over each column for each row, excluding the column "key" and generate the string of format "age:23,marks:84,feature_n:85.3" from newRecords?
I can alter the format of csv file and have the data in JSON format if it helps.
I am fairly new to Scala and Spark.

I would suggest the following solution:
val updated: RDD[String] = newRecords.drop("key").rdd.map { el =>
  val a = el.toSeq
  "age:" + a.head + ",marks:" + a(1) + ",feature_n:" + a(2)
}
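Since the number of feature columns can vary, here is a more generic sketch (assuming every value can be read back as a string, and keeping the key so the result can be joined with dataset A later; the variable names featureCols and keyed are mine):
import org.apache.spark.rdd.RDD
// build "col:value" pairs for every column except "key",
// using whatever columns the csv header actually defines
val featureCols = newRecords.columns.filter(_ != "key")
val keyed: RDD[(String, String)] = newRecords.rdd.map { row =>
  val key = row.getAs[String]("key")
  val value = featureCols.map(c => s"$c:${row.getAs[String](c)}").mkString(",")
  (key, value)
}
The resulting pair RDD (or a dataframe built from it with toDF("key", "value")) can then be joined with dataset A on "key".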

Related

How to read JSON in data frame column

I'm reading an HDFS directory. First I infer the schema from the JSON files, then read them with it:
val schema = spark.read.json("/HDFS path").schema
val df = spark.read.schema(schema).json("/HDFS path")
Here I select only the PK and timestamp columns from the JSON file:
val df2 = df.select($"PK1", $"PK2", $"PK3", $"ts")
Then I use a window function to get the latest record per PK based on the timestamp:
val dfrank = df2.withColumn("rank", row_number().over(
    Window.partitionBy($"PK1", $"PK2", $"PK3").orderBy($"ts".desc)))
  .filter($"rank" === 1)
This window function gives me only the updated primary keys and the timestamp of the updated JSON.
Now I have to add one more column that holds the full JSON record for each updated PK and timestamp.
How can I do that?
I am trying the code below, but it puts the wrong JSON in every row instead of the updated JSON:
val df3= dfrank.withColumn("JSON",lit(dfrank.toJSON.first()))
Here, you convert the entire dataframe to JSON, pull its first row back to the driver with toJSON.first(), and then add a column containing that single row's JSON to every row of your dataframe. I don't think this is what you want.
From what I understand, you have a dataframe and for each row, you want to create a JSON column that contains all of its columns. You could create a struct with all your columns and then use to_json like this:
val df3 = dfrank.withColumn("JSON", to_json(struct(dfrank.columns.map(col): _*)))
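As a small self-contained illustration of what to_json(struct(...)) produces (a sketch with made-up column names and values, assuming a SparkSession named spark with its implicits in scope, not your actual schema):
import org.apache.spark.sql.functions.{col, struct, to_json}
import spark.implicits._

// hypothetical two-column dataframe, just to show the shape of the result
val demo = Seq(("pk1", "2021-01-01")).toDF("PK1", "ts")
val withJson = demo.withColumn("JSON", to_json(struct(demo.columns.map(col): _*)))
// the JSON column now holds {"PK1":"pk1","ts":"2021-01-01"} for this row
Every row gets a JSON string built from its own column values, which is what the line above does for dfrank.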

reading multiple csv files using pyspark

I have a requirement to read multiple CSV files in one go. These CSV files may have a variable number of columns, in any order. We need to read only specific columns from the CSV files. How do we do that? I have tried defining a custom schema, but then I get the wrong data in the columns.
For example, a CSV file has the header:
ID, Name, Address
How do I select only the ID and Address columns? If I define the schema as (ID, Address), I get the Name data in the Address column. I want to select only the ID and Address columns by their header names while reading.
Thanks,
Naveed
You can iterate over the files and create a final dataframe like:
from pyspark.sql import types as t

files = ['path/to/file1.csv', 'path/to/file2.csv', 'path/to/file3.csv', 'path/to/file4.csv']

# define the output dataframe's schema; column names and types should be correct
schema = t.StructType([
    t.StructField("a", t.StringType(), True),
    t.StructField("c", t.StringType(), True)
])

output_df = spark.createDataFrame([], schema)
for file in files:
    df = spark.read.csv(file, header=True)
    output_df = output_df.union(df.select('a', 'c'))

output_df.show()
output_df will contain your desired output.

How can I categorize all rows based on country and save them back to RDDs in the same format in Spark Scala?

I have data like this
vxbjxvsj^country:US;age:23;name:sri
jhddasjd^country:UK;age:24;name:abhi
vxbjxvsj^country:US;age:23;name:shree
jhddasjd^country:UK;age:;name:david
In Spark Scala I need to categorize the rows by country and save each category as an RDD (or file) in the same format.
These rows should go into one RDD or file named UK:
jhddasjd^country:UK;age:24;name:abhi
jhddasjd^country:UK;age:;name:david
and these rows should go into one RDD or file named US:
vxbjxvsj^country:US;age:23;name:sri
vxbjxvsj^country:US;age:23;name:shree
If you read the file as an RDD you will get an RDD[String], with each line as a string.
To filter, you need to split each line, extract the country field, and filter on it:
rdd.filter(r =>
  r.split(":")(1).split(";")(0).equalsIgnoreCase("US")
).saveAsTextFile("US")
This extracts the country field and keeps only the rows where it is "US".
If you want this to be dynamic, you can grab the list of unique countries first and perform the filter in a loop:
val countries = rdd.map(_.split(":")(1).split(";")(0)).distinct().collect()
countries.foreach(country => {
  rdd.filter(r =>
    r.split(":")(1).split(";")(0).equalsIgnoreCase(country)
  ).saveAsTextFile(s"output/${country}")
})
Hope this helps!

Scala Dataframe : How can I add a column to a Dataframe using a condition between two Dataframes?

I have two Dataframes:
- UsersDF: (column_name:type) => [ (name,String) (age,Int) (Like,Int) ]
- ClusterDF: => [(cluster,bigInt) (names,String)]
A row of clusterDF's names column is a list of user names separated by a space character.
Each user belongs to exactly one cluster.
I would like to add the cluster column to the usersDF Dataframe by looking at the names field. How can I do that?
Example:
row of clusterDF: 1, "A B C D"
row of userDF: "A", 23, 150
At the end of the process: row of userDF: "A", 23, 150, 1
The way to do it is to join the two dataframes. For this you need two steps. The first is to convert clusterDF to a usable form:
val fixed = clusterDF.withColumn("name", explode(split($"names", " ")))
We first split the names by space to get an array of names and then explode it to get a row for each value.
Now just join the two:
usersDF.join(fixed, "name")
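A minimal end-to-end sketch with the example rows from the question (dataframe and column names assumed, and spark.implicits._ in scope):
import org.apache.spark.sql.functions.{explode, split}

val usersDF = Seq(("A", 23, 150), ("E", 30, 80)).toDF("name", "age", "Like")
val clusterDF = Seq((1L, "A B C D")).toDF("cluster", "names")

val fixed = clusterDF.withColumn("name", explode(split($"names", " ")))
usersDF.join(fixed, "name").select("name", "age", "Like", "cluster").show()
// the row for "A" now carries cluster 1; "E" is in no cluster, so the inner join drops it
If users missing from clusterDF should be kept with a null cluster, a left join can be used instead.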

Tagging a HBase Table using Spark RDD in Scala

I am trying to add an extra "tag" column to an HBase table. Tagging is done on the basis of words present in the rows of the table. Say, for example, if "Dark" appears in a certain row, then its tag will be added as "Horror". I have read all the rows from the table into a Spark RDD and have matched them with the words based on which we would tag. A snippet of the code looks like this:
var hBaseRDD2=sc.newAPIHadoopRDD(conf,classOf[TableInputFormat],classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val transformedRDD = hBaseRDD2.map(tuple => {
(Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieName"))),
Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieSummary"))),
Bytes.toString(tuple._2.getValue(Bytes.toBytes("Moviesdata"),Bytes.toBytes("MovieActor")))
)
})
Here, "moviesdata" is the columnfamily of the HBase table and "MovieName"&"MovieSummary" & "MovieActor" are column names. "transformedRDD" in the above snippet is of type RDD[String,String,String]. It has been converted into type RDD[String] by:
val arrayRDD: RDD[String] = transformedRDD.map(x => (x._1 + " " + x._2 + " " + x._3))
From this, all words have been extracted by doing this:
val words = arrayRDD.map(x => x.split(" "))
The words we are looking for in the HBase table rows are in a CSV file. One of its columns, let's say the "synonyms" column, has the words we look for. Another column in the CSV is a "target_tag" column, which has the words that should be tagged onto the row for which there is a match.
Read the csv by:
val csv = sc.textFile("/tag/moviestagdata.csv")
reading the synonyms column: (synonyms column is the second column, therefore "p(1)" in the below snippet)
val synonyms = csv.map(_.split(",")).map( p=>p(1))
reading the target_tag column: (target_tag is the 3rd column)
val targettag = csv.map(_.split(",")).map(p=>p(2))
Some rows in synonyms and targettag have more than one string, separated by "###". The snippet to separate them is this:
val splitsyno = synonyms.map(x => x.split("###"))
val splittarget = targettag.map(x=>x.split("###"))
Now, to match each string from "splitsyno", we need to traverse every row, and a row might contain many strings, so to collect every string into a set I did this (an empty set was created beforehand):
splitsyno.map(x => x.foreach(y => set += y))
To match every string with those in "words" created up above, I did this:
val check = words.exists(set contains _)
Now, the problem I am facing is that I don't know which strings from which rows of the CSV match which strings from which rows of the HBase table. I need this in order to find the corresponding target string and to know which HBase row to add it to. How should I get this done? Any help would be highly appreciated.
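One possible way to keep the synonym-to-tag association (a sketch, assuming the tag CSV is small enough to collect to the driver, and assuming the whole target_tag cell applies to any of its synonyms):
// build a synonym -> target_tag lookup on the driver and broadcast it
val synToTag: Map[String, String] = csv
  .map(_.split(","))
  .flatMap(p => p(1).split("###").map(syn => (syn, p(2))))
  .collect()
  .toMap
val synToTagB = sc.broadcast(synToTag)

// for every HBase row, keep the original fields plus the tags whose synonyms occur among its words
val tagged = transformedRDD.map { case (name, summary, actor) =>
  val rowWords = (name + " " + summary + " " + actor).split(" ").toSet
  val tags = synToTagB.value.collect { case (syn, tag) if rowWords.contains(syn) => tag }.toSet
  (name, summary, actor, tags)
}
This avoids the shared mutable set, which would not be updated correctly inside an RDD transformation anyway, and it keeps track of which tag belongs to which matched row.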