I want to split the key in map reduce and create a new key-value pair for each part.
Current doc file:
[(u'ab,xy,sc,dr', u'doc1')]
I want to split the key, pairing each piece with the value, as:
[(u'ab', u'doc1'), (u'xy', u'doc1'), (u'sc', u'doc1'), (u'dr', u'doc1')]
Any help is much appreciated!
Thanks
def process(record):
    # Split the comma-separated key and emit one (key, doc) pair per piece
    for key in record[0].split(','):
        yield key, record[1]
rdd = sc.parallelize([(u'ab,xy,sc,dr', u'doc1')])
rdd.flatMap(process).collect()
will result in
[(u'ab', u'doc1'), (u'xy', u'doc1'), (u'sc', u'doc1'), (u'dr', u'doc1')]
I have a dataframe with a map column. I want to collect the non-null keys into a new column:
You can use map_filter to drop the null entries and then use map_keys, which returns the remaining keys as an array.
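A minimal Scala sketch, assuming Spark 3.0+ and hypothetical column names (props for the map column, non_null_keys for the new column), and reading "non-null keys" as keys whose values are not null:
import org.apache.spark.sql.functions.{col, map_filter, map_keys}

// Drop entries whose value is null, then collect the surviving keys into an array column.
val withKeys = df.withColumn(
  "non_null_keys",
  map_keys(map_filter(col("props"), (k, v) => v.isNotNull))
)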
I have a config defined which contains a list of columns for each table to be used as a dedup key.
For example:
config 1:
val lst = List(section_xid, learner_xid)
These are the columns that need to be used as dedup keys. The list is dynamic: some tables will have 1 value in it, some will have 2 or 3.
What I am trying to do is build a single key column from this list:
df.withColumn("dedup_key_sk", uuid(md5(concat($"lst(0)", $"lst(1)"))))
How do I make this dynamic so it works for any number of columns in the list?
I tried doing this:
df.withColumn("dedup_key_sk", concat(Seq($"col1", $"col2"):_*))
For this to work I had to convert the list to a DataFrame, with each value in the list in a separate column, and I was not able to figure that out.
I tried doing this, but it didn't work:
val res = sc.parallelize(List((lst))).toDF
Any input here will be appreciated. Thank you.
The list of strings can be mapped to a list of columns (using functions.col). This list of columns can then be used with concat:
val lst: List[String] = List("section_xid", "learner_xid")
df.withColumn("dedup_key_sk", concat(lst.map(col):_*)).show()
I am trying to read an HBase table using Scala and then add a new column with tags based on the content of the rows in the HBase table. I have read the table as a Spark RDD. I also have a hashmap, which is used as follows:
keys are to be matched against the entries of the Spark RDD (generated from the HBase table), and if a match is found, the value from the hashmap is to be added in a new column.
The function to write to the HBase table in a new column is this:
def convert(a: Int, s: String): Tuple2[ImmutableBytesWritable, Put] = {
  val p = new Put(a.toString.getBytes())
  p.add(Bytes.toBytes("columnfamily"), Bytes.toBytes("col_2"), s.toString.getBytes()) //a.toString.getBytes())
  println("the value of a is: " + a)
  new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(Bytes.toBytes(a)), p)
}
new PairRDDFunctions(newrddtohbaseLambda.map(x=>convert(x, ranjan))).saveAsHadoopDataset(jobConfig)
Then the code to read the strings from the hashmap, compare them, and write back is this:
csvhashmap.keys.foreach{i=> if (arrayRDD.zipWithIndex.foreach{case(a,j) => a.split(" ").exists(i contains _); p = j.toInt}==true){new PairRDDFunctions(convert(p,csvhashmap(i))).saveAsHadoopDataset(jobConfig)}}
Here csvhashmap is the hashmap described above, and arrayRDD is the RDD we are matching the strings against. When the above command is run, I get the following error:
error: type mismatch;
found : (org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Put)
required: org.apache.spark.rdd.RDD[(?, ?)]
How do I get rid of it? I have tried many things to change the data type, but each time I get some error. I have also checked the individual functions inside the above snippet and they are fine on their own; it is only when I integrate them that I get the above error. Any help would be appreciated.
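The error itself points at the shape of the fix: saveAsHadoopDataset needs an RDD of (key, value) pairs, while convert(p, csvhashmap(i)) produces a single Tuple2. A minimal, untested sketch of the type it expects, assuming sc and jobConfig are in scope:
// Wrap the single (ImmutableBytesWritable, Put) pair in an RDD before saving.
val pairRdd = sc.parallelize(Seq(convert(p, csvhashmap(i))))
new PairRDDFunctions(pairRdd).saveAsHadoopDataset(jobConfig)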
I have a log file with data like the following:
1,2008-10-23 16:05:05.0,\N,Donald,Becton,2275 Washburn Street,Oakland,CA,94660,5100032418,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
2,2008-11-12 03:00:01.0,\N,Donna,Jones,3885 Elliott Street,San Francisco,CA,94171,4150835799,2014-03-18 13:29:47.0,2014-03-18 13:29:47.0
I need to create a pair RDD with the postal code as the key and a list of names (Last Name, First Name) in that postal code as the value.
I need to use mapValues and I did the following:
val namesByPCode = accountsdata.keyBy(line => line.split(',')(8)).mapValues(fields => (fields(0), (fields(4), fields(5)))).collect()
But I'm getting an error. Can someone tell me what is wrong with my statement?
keyBy doesn't change the value, so the value stays a single "unsplit" string. You want to first use map to perform the split (to get an RDD[Array[String]]), and then use keyBy and mapValues as you did on the split result:
val namesByPCode = accountsdata.map(_.split(","))
.keyBy(_(8))
.mapValues(fields => (fields(0), (fields(4), fields(5))))
.collect()
BTW - per your description, it sounds like you'd also want to call groupByKey on this result (before calling collect) if you want each zipcode to collapse into a single record with a list of names. keyBy doesn't perform any grouping; it just turns an RDD[V] into an RDD[(K, V)], leaving each record a single record (with potentially many records sharing the same key).
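A sketch of that variant, keeping the same field choices as above and adding groupByKey before collect:
val namesByPCode = accountsdata.map(_.split(","))
  .keyBy(_(8))
  .mapValues(fields => (fields(0), (fields(4), fields(5))))
  .groupByKey()
  .collect()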
We have a table in Cassandra 1.2.0 that has a varint key. When we search for keys we can see that they exist.
Table description:
CREATE TABLE u (
key varint PRIMARY KEY
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=1.000000 AND
replicate_on_write='true' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
Select key from u limit 10;
key
12040911
60619595
3220132
4602232
3997404
6312372
1128185
1507755
1778092
4701841
When I try and get the row for key 60619595 it works fine.
cqlsh:users> select key from u where key = 60619595;
key
60619595
cqlsh:users> select key from u where key = 3997404;
When I use pycassa to get the whole table, I can access the row:
import pycassa
from struct import *
from pycassa.types import *
from urlparse import urlparse
import operator
userspool = pycassa.ConnectionPool('users');
userscf = pycassa.ColumnFamily(userspool, 'u');
users = {}
u = list(userscf.get_range())
for r in u:
    users[r[0]] = r[1]

print users[3997404]
returns the correct result.
What am I doing wrong? I cannot see what the error is.
Any help would be appreciated,
Regards
Michael.
PS:
I should say that in pycassa when I try:
userscf.get(3997404)
File "test.py", line 10, in
userscf.get(3997404)
File "/usr/local/lib/python2.7/dist-packages/pycassa/columnfamily.py", line 655, in get
raise NotFoundException()
pycassa.cassandra.ttypes.NotFoundException: NotFoundException(_message=None)
It seems to happen with ints that are smaller than average.
You are mixing CQL and Thrift-based queries, which do not always mix well. CQL abstracts the underlying storage rows, whereas Thrift deals directly with them.
This is a problem we are having in our project. I should have added that
select key from u where key = 3997404;
cqlsh:users>
returns 0 results, even though when we select * from u in cqlsh, or get the whole table in pycassa, we see the row with the key 3997404.
Sorry for the confusion.
Regards
D.