I want to split the key in map reduce and create a new key-value pair for each part.
Current doc file:
[(u'ab,xy,sc,dr', u'doc1')]
I want to split the key, pairing each piece with the value, as:
[(u'ab', u'doc1'), (u'xy', u'doc1'), (u'sc', u'doc1'), (u'dr', u'doc1')]
Any help is much appreciated!
Thanks
def process(record):
    # Split the comma-separated key and emit one (key, doc) pair per piece
    for key in record[0].split(','):
        yield key, record[1]
rdd = sc.parallelize([(u'ab,xy,sc,dr', u'doc1')])
rdd.flatMap(process).collect()
will result in
[(u'ab', u'doc1'), (u'xy', u'doc1'), (u'sc', u'doc1'), (u'dr', u'doc1')]
I have a dataframe with a map column. I want to collect the non-null keys into a new column:
You can use map_filter to drop the null entries and then use map_keys, which returns the remaining keys as an array.
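A minimal Scala sketch, assuming Spark 3.0+ and hypothetical column names (props for the map column, non_null_keys for the new column), and reading "non-null keys" as keys whose values are not null:
import org.apache.spark.sql.functions.{col, map_filter, map_keys}

// Drop entries whose value is null, then collect the surviving keys into an array column.
val withKeys = df.withColumn(
  "non_null_keys",
  map_keys(map_filter(col("props"), (k, v) => v.isNotNull))
)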
I have a config defined which contains a list of columns for each table to be used as a dedup key.
For example:
config 1:
val lst = List(section_xid, learner_xid)
These are the columns that need to be used as dedup keys. The list is dynamic: some tables will have 1 value in it, some will have 2 or 3.
What I am trying to do is build a single key column from this list:
df.withColumn("dedup_key_sk", uuid(md5(concat($"lst(0)", $"lst(1)"))))
How do I make this dynamic so it works for any number of columns in the list?
I tried doing this:
df.withColumn("dedup_key_sk", concat(Seq($"col1", $"col2"):_*))
For this to work I had to convert the list to a DataFrame, with each value in the list in a separate column, and I was not able to figure that out.
I tried doing this, but it didn't work:
val res = sc.parallelize(List((lst))).toDF
Any input here will be appreciated. Thank you.
The list of strings can be mapped to a list of columns (using functions.col). This list of columns can then be used with concat:
val lst: List[String] = List("section_xid", "learner_xid")
df.withColumn("dedup_key_sk", concat(lst.map(col):_*)).show()
I am trying to read an HBase table using Scala and then add a new column with tags based on the content of the rows in the HBase table. I have read the table as a Spark RDD. I also have a hashmap, which is used as follows:
keys are to be matched against the entries of the Spark RDD (generated from the HBase table), and if a match is found, the value from the hashmap is to be added in a new column.
The function to write to the HBase table in a new column is this:
def convert(a: Int, s: String): Tuple2[ImmutableBytesWritable, Put] = {
  val p = new Put(a.toString.getBytes())
  p.add(Bytes.toBytes("columnfamily"), Bytes.toBytes("col_2"), s.toString.getBytes()) //a.toString.getBytes())
  println("the value of a is: " + a)
  new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(Bytes.toBytes(a)), p)
}
new PairRDDFunctions(newrddtohbaseLambda.map(x=>convert(x, ranjan))).saveAsHadoopDataset(jobConfig)
Then the code to read the strings from the hashmap, compare them, and write back is this:
csvhashmap.keys.foreach{i=> if (arrayRDD.zipWithIndex.foreach{case(a,j) => a.split(" ").exists(i contains _); p = j.toInt}==true){new PairRDDFunctions(convert(p,csvhashmap(i))).saveAsHadoopDataset(jobConfig)}}
Here csvhashmap is the hashmap described above, and arrayRDD is the RDD we are matching the strings against. When the above command is run, I get the following error:
error: type mismatch;
found : (org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Put)
required: org.apache.spark.rdd.RDD[(?, ?)]
How do I get rid of it? I have tried many things to change the data type, but each time I get some error. I have also checked the individual functions inside the above snippet and they are fine on their own; it is only when I integrate them that I get the above error. Any help would be appreciated.
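The error itself points at the shape of the fix: saveAsHadoopDataset needs an RDD of (key, value) pairs, while convert(p, csvhashmap(i)) produces a single Tuple2. A minimal, untested sketch of the type it expects, assuming sc and jobConfig are in scope:
// Wrap the single (ImmutableBytesWritable, Put) pair in an RDD before saving.
val pairRdd = sc.parallelize(Seq(convert(p, csvhashmap(i))))
new PairRDDFunctions(pairRdd).saveAsHadoopDataset(jobConfig)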
I have a log file with data like the following:
1,2008-10-23 16:05:05.0,\N,Donald,Becton,2275 Washburn Street,Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
2,2008-11-12 03:00:01.0,\N,Donna,Jones,3885 Elliott Street,San Francisco,CA,94171,4150835799,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
I need to create a pair RDD with the postal code as the key and a list of names (Last Name, First Name) in that postal code as the value.
I need to use mapValues and I did the following:
val namesByPCode = accountsdata.keyBy(line => line.split(',')(8)).mapValues(fields => (fields(0), (fields(4), fields(5)))).collect()
But I'm getting an error. Can someone tell me what is wrong with my statement?
keyBy doesn't change the value, so the value stays a single "unsplit" string. You want to first use map to perform the split (to get an RDD[Array[String]]), and then use keyBy and mapValues as you did on the split result:
val namesByPCode = accountsdata.map(_.split(","))
.keyBy(_(8))
.mapValues(fields => (fields(0), (fields(4), fields(5))))
.collect()
BTW - per your description, it sounds like you'd also want to call groupByKey on this result (before calling collect) if you want each zipcode to collapse into a single record with a list of names. keyBy doesn't perform any grouping; it just turns an RDD[V] into an RDD[(K, V)], leaving each record a single record (with potentially many records sharing the same key).
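A sketch of that variant, keeping the same field choices as above and adding groupByKey before collect:
val namesByPCode = accountsdata.map(_.split(","))
  .keyBy(_(8))
  .mapValues(fields => (fields(0), (fields(4), fields(5))))
  .groupByKey()
  .collect()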
We have a table in Cassandra 1.2.0 that has a varint key. When we search for keys we can see that they exist.
Table description:
CREATE TABLE u (
key varint PRIMARY KEY
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=1.000000 AND
replicate_on_write='true' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
Select key from u limit 10;
key
12040911
60619595
3220132
4602232
3997404
6312372
1128185
1507755
1778092
4701841
When I try and get the row for key 60619595 it works fine.
cqlsh:users> select key from u where key = 60619595;
key
60619595
cqlsh:users> select key from u where key = 3997404;
When I use pycassa to get the whole table, I can access the row:
import pycassa
from struct import *
from pycassa.types import *
from urlparse import urlparse
import operator
userspool = pycassa.ConnectionPool('users');
userscf = pycassa.ColumnFamily(userspool, 'u');
users = {}
u = list(userscf.get_range())
for r in u:
    users[r[0]] = r[1]

print users[3997404]
returns the correct result.
What am I doing wrong? I cannot see what the error is.
Any help would be appreciated,
Regards
Michael.
PS:
I should say that in pycassa when I try:
userscf.get(3997404)
File "test.py", line 10, in
userscf.get(3997404)
File "/usr/local/lib/python2.7/dist-packages/pycassa/columnfamily.py", line 655, in get
raise NotFoundException()
pycassa.cassandra.ttypes.NotFoundException: NotFoundException(_message=None)
It seems to happen with ints that are smaller than average.
You are mixing CQL and Thrift-based queries, which do not always mix well. CQL abstracts the underlying storage rows, whereas Thrift deals directly with them.
This is a problem we are having in our project. I should have added that
select key from u where key = 3997404;
cqlsh:users>
returns 0 results, even though when we select * from u in cqlsh, or get the whole table in pycassa, we see the row with the key 3997404.
Sorry for the confusion.
Regards
D.