How to create key-value pairs from textFile - scala

I want to use a member of an RDD element as a key. How can I do this?
This is my data:
2 1
4 1
1 2
6 3
7 3
7 6
6 7
3 7
I want to create key/value pairs such that the key is an element and the value is the next element.
I wrote this code:
def main(args: Array[String]) {
  System.setProperty("hadoop.home.dir", "C:\\spark-1.5.1-bin-hadoop2.6\\winutil")
  val conf = new SparkConf().setAppName("test").setMaster("local[4]")
  val sc = new SparkContext(conf)
  val lines = sc.textFile("followers.txt")
    .flatMap{x => (x.indexOfSlice(x1), x.indexOfSlice(x2))}
}
but it is not correct and it won't determine the index of the elements; each line contains two numbers.

Maybe I'm misunderstanding your question, but if you are simply looking to split your data into key-value pairs, you just need to do this:
val lines = sc.textFile("followers.txt").map(s => {
  val substrings = s.split(" ")
  (substrings(0), substrings(1))
})
Does this solve your problem?
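On the sample data this yields string pairs like ("2","1") and ("4","1"). If numeric keys and values are needed, a minimal variant of the same idea (a sketch assuming the same followers.txt format) could be:
val pairs = sc.textFile("followers.txt").map { s =>
  val substrings = s.split(" ")
  // parse both fields so each pair is (Int, Int) instead of (String, String)
  (substrings(0).toInt, substrings(1).toInt)
}
pairs.take(3).foreach(println) // expected: (2,1), (4,1), (1,2)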

Related

Finding average of values against key using RDD in Spark

I have created an RDD in which the first column is the key and the rest of the columns are values for that key. Every row has a unique key. I want to find the average of the values for every key. I created key-value pairs and tried the following code, but it is not producing the desired results. My code is here:
val rows = 10
val cols = 6
val partitions = 4
lazy val li1 = List.fill(rows,cols)(math.random)
lazy val li2 = (1 to rows).toList
lazy val li = (li1, li2).zipped.map(_ :: _)
val conf = new SparkConf().setAppName("First spark").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(li,partitions)
val gr = rdd.map( x => (x(0) , x.drop(1)))
val gr1 = gr.values.reduce((x,y) => x.zip(y).map(x => x._1 +x._2 )).foldLeft(0)(_+_)
gr1.take(3).foreach(println)
I want the result to be displayed like:
1 => 1.1,
2 => 2.7
and so on for all keys.
First, I am not sure what this line is doing:
lazy val li = (li1, li2).zipped.map(_ :: _)
Instead, you could do this:
lazy val li = li2 zip li1
This will create a List of tuples of type (Int, List[Double]).
The solution to find the average of the values for each key could then be:
rdd.map{ x => (x._1, x._2.fold(0.0)(_ + _)/x._2.length) }.collect.foreach(x => println(x._1+" => "+x._2))
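Putting the pieces together, a minimal end-to-end sketch (reusing the rows, cols and partitions values from the question and the li2 zip li1 suggestion above):
val rows = 10
val cols = 6
val partitions = 4
lazy val li1 = List.fill(rows, cols)(math.random)
lazy val li2 = (1 to rows).toList
// pair each key with its list of values: List[(Int, List[Double])]
lazy val li = li2 zip li1
val rdd = sc.parallelize(li, partitions)
// average the value list for every key and print "key => average"
rdd.map { x => (x._1, x._2.fold(0.0)(_ + _) / x._2.length) }
  .collect
  .foreach(x => println(x._1 + " => " + x._2))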

appending data to hbase table from a spark rdd using scala

I am trying to add data to an HBase table. I have done the following so far:
def convert(a: Int, s: String): Tuple2[ImmutableBytesWritable, Put] = {
  val p = new Put(a.toString.getBytes())
  p.add(Bytes.toBytes("ColumnFamily"), Bytes.toBytes("col_2"), s.toString.getBytes()) // a.toString.getBytes()
  println("the value of a is: " + a)
  new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(Bytes.toBytes(a)), p)
}
new PairRDDFunctions(newrddtohbaseLambda.map(x=>convert(x, randomstring))).saveAsHadoopDataset(jobConfig)
newrddtohbaseLambda is this:
val x = 12
val y = 15
val z = 25
val newarray = Array(x,y,z)
val newrddtohbaseLambda = sc.parallelize(newarray)
"randomstring" is this
val randomstring = "abc, xyz, dfg"
Now, what this does is add abc,xyz,dfg to rows 12, 15 and 25 after deleting the values already present in those rows. I want the existing value to stay and to have abc,xyz,dfg appended instead of replacing it. How can I get this done? Any help would be appreciated.
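One way to append to the existing cell value rather than overwrite it is HBase's Append mutation, issued through a plain HBase connection per partition instead of saveAsHadoopDataset. This is only a sketch, assuming an HBase 1.x client API; the table name "mytable" is a placeholder:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Append, ConnectionFactory}
import org.apache.hadoop.hbase.util.Bytes

// for each row key, append the bytes of randomstring to the current
// value of ColumnFamily:col_2 instead of replacing it
newrddtohbaseLambda.foreachPartition { keys =>
  val hbaseConf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(hbaseConf)
  val table = connection.getTable(TableName.valueOf("mytable")) // placeholder table name
  keys.foreach { a =>
    val append = new Append(Bytes.toBytes(a.toString))
    append.add(Bytes.toBytes("ColumnFamily"), Bytes.toBytes("col_2"), Bytes.toBytes(randomstring))
    table.append(append)
  }
  table.close()
  connection.close()
}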

how to use split in spark scala?

I have an input file that looks something like:
1: 3 5 7
3: 6 9
2: 5
......
I hope to get two lists:
the first list is made up of the numbers before ":", and the second list is made up of the numbers after ":".
The two lists in the above example are:
1 3 2
3 5 7 6 9 5
I just wrote code like the following:
val rdd = sc.textFile("input.txt");
val s = rdd.map(_.split(":"));
But I do not know how to implement the following things. Thanks.
I would use flatMap!
So,
val rdd = sc.textFile("input.txt")
val s = rdd.map(_.split(": ")) // I recommend adding a space after the colon
val before_colon = s.map(x => x(0))
val after_colon = s.flatMap(x => x(1).split(" "))
Now you have two RDDs, one with the items from before the colon, and one with the items after the colon!
If it is possible for the part of the text before the colon to have multiple numbers, as in an example like 1 2 3: 4 5 6, I would write val before_colon = s.flatMap(x => x(0).split(" ")) instead.
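For instance, on the sample input from the question, collecting the two RDDs gives the expected lists:
// assuming input.txt contains the three sample lines "1: 3 5 7", "3: 6 9", "2: 5"
before_colon.collect()  // the numbers before the colon: 1, 3, 2 (as Strings)
after_colon.collect()   // the numbers after the colon: 3, 5, 7, 6, 9, 5 (as Strings)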

How to convert values from a file to a Map in spark-scala?

I have my values in a file as comma-separated. Now, I want this data to be converted into key-value pairs (a Map). I know that we can split the values and store them in an Array like below.
val prop_file = sc.textFile("/prop_file.txt")
prop_file.map(_.split(",").map(s => Array(s)))
Is there any way to store the data as a Map in Spark-Scala?
Assuming that each line of your file contains comma-separated keys and values, with each key immediately followed by its value, for example:
A,1,B,2
C,3,D,4
E,5,F,6
Something like this can be done:
val file = sc.textFile("/prop_file.txt")
val words = file.flatMap(x => createDataMap(x))
And here is your function, createDataMap:
def createDataMap(data: String): Map[String, String] = {
  val array = data.split(",")
  // Creating the Map of values
  val dataMap = Map[String, String](
    array(0) -> array(1),
    array(2) -> array(3)
  )
  dataMap
}
Next, for retrieving the keys and values from the RDD you can leverage the following operations:
// This will print all elements of the RDD
words.foreach(f => println(f))
// Or you can filter the elements too
words.filter(f => f._1.equals("A"))
Sumit, I have used the below code to retrieve the value for a particular key and it worked.
val words = file.flatMap(x => createDataMap(x)).collectAsMap
val valueofA = words("A")
print(valueofA)
This gives me 1 as a result

Want to parse a file and reformat it to create a pairRDD in Spark through Scala

I have a dataset in a file in the form:
1: 1664968
2: 3 747213 1664968 1691047 4095634 5535664
3: 9 77935 79583 84707 564578 594898 681805 681886 835470 880698
4: 145
5: 8 57544 58089 60048 65880 284186 313376
6: 8
I need to transform this to something like below, using Spark and Scala, as part of preprocessing the data:
1 1664968
2 3
2 747213
2 1664968
2 4095634
2 5535664
3 9
3 77935
3 79583
3 84707
And so on....
Can anyone provide input on how this can be done?
The length of the original rows in the file varies, as shown in the dataset example above.
I am not sure how to go about doing this transformation.
I tried something like below, which gives me a pair of the key and the first element after the colon.
But I am not sure how to iterate over the entire data and generate the pairs as needed.
def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf().setAppName("Graphx").setMaster("local"))
  val rawLinks = sc.textFile("src/main/resources/links-simple-sorted-top100.txt")
  rawLinks.take(5).foreach(println)

  val formattedLinks = rawLinks.map { rows =>
    val fields = rows.split(":")
    val fromVertex = fields(0)
    val toVerticesArray = fields(1).split(" ")
    (fromVertex, toVerticesArray(1))
  }

  val topFive = formattedLinks.take(5)
  topFive.foreach(println)
}
val rdd = sc.parallelize(List("1: 1664968", "2: 3 747213 1664968 1691047 4095634 5535664"))

val keyValues = rdd.flatMap(line => {
  val Array(key, values) = line.split(":", 2)
  for (value <- values.trim.split("""\s+"""))
    yield (key, value.trim)
})

keyValues.collect
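For the two sample lines above, the expected result is:
// Array((1,1664968), (2,3), (2,747213), (2,1664968), (2,1691047), (2,4095634), (2,5535664))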
Split each row into 2 parts and map over the variable number of columns:
def transform(s: String): Array[String] = {
  val Array(head, tail) = s.split(":", 2)
  tail.trim.split("""\s+""").map(x => s"$head $x")
}
> transform("2: 3 747213 1664968 1691047 4095634 5535664")
// Array(2 3, 2 747213, 2 1664968, 2 1691047, 2 4095634, 2 5535664)
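Applied to the whole RDD (a sketch reusing the sample rdd from the previous answer), flatMap flattens the per-line arrays into the desired output lines:
// each output element is a "key value" string
val reformatted = rdd.flatMap(transform)
reformatted.collect
// Array(1 1664968, 2 3, 2 747213, 2 1664968, 2 1691047, 2 4095634, 2 5535664)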