Scala: Creating an HBase table with pre-split regions based on row key - scala

I have three RegionServers. I want to evenly distribute an HBase table across the three RegionServers based on row keys which I have already identified (say, rowkey_100 and rowkey_200). It can be done from the hbase shell using:
create 'tableName', 'columnFamily', {SPLITS => ['rowkey_100','rowkey_200']}
If I am not mistaken, these 2 split points will create 3 regions, and the first 100 rows will go to the 1st RegionServer, the next 100 rows will be on the 2nd RegionServer, and the remaining rows on the last RegionServer. I want to do the same thing using Scala code. How can I specify this in Scala code to split the table into regions?

Below is a Scala snippet for creating an HBase table with splits:
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HColumnDescriptor, HTableDescriptor}

val admin = new HBaseAdmin(conf)
if (!admin.tableExists(myTable)) {
  val htd = new HTableDescriptor(myTable)
  val hcd = new HColumnDescriptor(myCF)
  // explicit split points that become the region boundaries
  val splits = Array[Array[Byte]](splitPoint1.getBytes, splitPoint2.getBytes)
  htd.addFamily(hcd)
  admin.createTable(htd, splits)
}
There are some predefined region split policies, but in case you want to create your own way of setting split points that span your rowkey range, you can create a simple function like the following:
def autoSplits(n: Int, range: Int = 256) = {
  val splitPoints = new Array[Array[Byte]](n)
  for (i <- 0 to n - 1) {
    // n evenly spaced single-byte split points across the key range
    splitPoints(i) = Array[Byte](((range / (n + 1)) * (i + 1)).asInstanceOf[Byte])
  }
  splitPoints
}
Just comment out the val splits = ... line and replace createTable's splits parameter with autoSplits(2) or autoSplits(4, 128), etc.
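For example, with the same admin and htd as in the first snippet, the call would become something like:
// same setup as above, but with generated split points
admin.createTable(htd, autoSplits(2))        // 2 split points -> 3 regions over a 0-255 byte range
// or, e.g.: admin.createTable(htd, autoSplits(4, 128))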

This Java code can help:
HTableDescriptor td = new HTableDescriptor(TableName.valueOf("tableName"));
HColumnDescriptor cf = new HColumnDescriptor("cf".getBytes());
td.addFamily(cf);
byte[][] splitKeys = new byte[][] {key1.getBytes(), key2.getBytes()};
HBaseAdmin dbAdmin = new HBaseAdmin(conf);
dbAdmin.createTable(td, splitKeys);
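Both snippets above use the older HBaseAdmin/HTableDescriptor API. If you are on HBase 2.x, where those classes are deprecated, a roughly equivalent Scala sketch using the builder API (assuming an already opened Connection named connection) might look like:
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, TableDescriptorBuilder}
import org.apache.hadoop.hbase.util.Bytes

// sketch only: table/family names and split points mirror the question
val td = TableDescriptorBuilder
  .newBuilder(TableName.valueOf("tableName"))
  .setColumnFamily(ColumnFamilyDescriptorBuilder.of("columnFamily"))
  .build()
val splits = Array(Bytes.toBytes("rowkey_100"), Bytes.toBytes("rowkey_200"))
connection.getAdmin.createTable(td, splits)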

Related

A scalable Graph method for finding cliques for complete connected components PySpark

I'm trying to split the GraphFrame connectedComponents output so that each component has a sub-group for each completely connected set, meaning all vertices are connected to each other. The following sketch will help demonstrate what I'm trying to achieve.
I'm using a NetworkX method in order to achieve it, as follows:
from pyspark.sql.functions import col, concat, lit, when
from pyspark.sql.types import LongType, StringType, StructField, StructType

def create_subgroups(edges, components, key_name='component'):
    # joining the edges to enrich component id
    sub_components = edges.join(components, [(edges.dst == components.id) | (edges.src == components.id)]).select('src', 'dst', key_name).drop_duplicates()
    # caching the table using a temp table
    sub_components = save_temp_table(sub_components, f'inner_sub_{key_name}s', zorder=[key_name])
    schema = StructType([
        StructField("index", LongType(), True),
        StructField("id", StringType(), True),
    ])
    # applying the pandas udf to enrich each vertex with the new component id
    sub_components = sub_components.groupby(key_name).applyInPandas(pd_create_subgroups, schema).where('id != "not_connected"').drop_duplicates()
    # joining the output and multiplying each vertex by the number of sub-groups found
    components = components.join(sub_components, 'id', 'left')
    components = components.withColumn(key_name, when(col('index').isNull(), col(key_name)).otherwise(concat(col(key_name), lit('_'), concat('index')))).drop('index')
    return components
import pandas as pd
import networkx as nx
from networkx.algorithms.clique import find_cliques

def pd_create_subgroups(pdf):
    # building the graph
    gnx = nx.from_pandas_edgelist(pdf, 'src', 'dst')
    # removing one-degree nodes
    outdeg = gnx.degree()
    to_remove = [n[0] for n in outdeg if n[1] == 1]
    gnx.remove_nodes_from(to_remove)
    bic = list(find_cliques(gnx))
    if len(bic) <= 2:
        return pd.DataFrame(data={"index": [-1], "id": ["not_connected"]})
    res = {
        "index": [],
        "id": []
    }
    ind = 0
    for i in bic:
        if len(i) < 3:
            continue
        for id in i:
            res['index'] = res['index'] + [ind]
            res['id'] = res['id'] + [id]
        ind += 1
    return pd.DataFrame(res)
# creating sub-components if necessary
subgroups = create_subgroups(edges,components, key_name = 'component')
My problem is that there's a very large component containing 80% of the vertices, causing very slow performance of the clusters. I've been trying to use labelPropagation to create smaller groups, but it didn't do the trick: it split the component in a way that isn't suitable, separating vertices that should have been in the same group.
Here's the cluster usage when it reaches the pandas_udf part
This issue was resolved by separating the vertices into N groups, pulling all edges for each vertex in the group, and calculating the sub-groups using the find_cliques method.

Faster way to get single cell value from Dataframe (using just transformation)

I have the following code, where I want to get a DataFrame dfDateFiltered from dfBackendInfo containing all rows with RowCreationTime greater than the timestamp "latestRowCreationTime":
val latestRowCreationTime = dfVersion.agg(max("BackendRowCreationTime")).first.getTimestamp(0)
val dfDateFiltered = dfBackendInfo.filter($"RowCreationTime" > latestRowCreationTime)
The problem I see is that the first line adds a job on the Databricks cluster, making it slower.
Is there any way I could filter more efficiently (for example, using just a transformation instead of an action)?
Below are the schemas of the 2 Dataframes:
case class Version(BuildVersion: String,
                   MainVersion: String,
                   Hotfix: String,
                   BackendRowCreationTime: Timestamp)

case class BackendInfo(SerialNumber: Integer,
                       NumberOfClients: Long,
                       BuildVersion: String,
                       MainVersion: String,
                       Hotfix: String,
                       RowCreationTime: Timestamp)
The code below worked:
val dfLatestRowCreationTime1 = dfVersion.agg(max($"BackendRowCreationTime").as("BackendRowCreationTime")).limit(1)
val latestRowCreationTime = dfLatestRowCreationTime1.withColumn("BackendRowCreationTime", when($"BackendRowCreationTime".isNull, DefaultTime).otherwise($"BackendRowCreationTime"))
val dfDateFiltered = dfBackendInfo.join(latestRowCreationTime, dfBackendInfo.col("RowCreationTime").gt(latestRowCreationTime.col("BackendRowCreationTime")))
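A slightly more compact sketch of the same transformation-only idea, assuming the same DataFrames and spark.implicits._ in scope as in the snippets above, could be:
import org.apache.spark.sql.functions.max

// one-row DataFrame with the latest creation time (its value is null if dfVersion is empty)
val latestTime = dfVersion.agg(max($"BackendRowCreationTime").as("BackendRowCreationTime"))
// cross join against that single row, then compare; a null max simply filters out every row
val dfDateFiltered = dfBackendInfo
  .crossJoin(latestTime)
  .filter($"RowCreationTime" > $"BackendRowCreationTime")
  .drop("BackendRowCreationTime")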

Split string twice and reduceByKey in Scala

I have a .csv file that I am trying to analyse using Spark. The .csv file contains, amongst others, a list of topics and their counts. The topics and their counts are separated by a ',' and all these topic+count pairs are in the same string, separated by ';', like so:
"topic_1,10;topic_2,12;topic_1,3"
As you can see, some topics are in the string multiple times.
I have an rdd containing key value pairs of some date and the topic strings [date, topicstring]
What I want to do is split the string at the ';' to get all the separate topics, then split those topics at the ',' and create a key value pair of the topic name and counts, which can be reduced by key. For the example above this would be
[date, ((topic_1, 13), (topic_2, 12))]
So I have been playing around in Spark a lot, as I am new to Scala. What I tried to do is:
val separateTopicsByDate = topicsByDate
  .mapValues(_.split(";").map({case(str) => (str)}))
  .mapValues({case(arr) => arr
    .filter(str => str.split(",").length > 1)
    .map({case(str) => (str.split(",")(0), str.split(",")(1).toInt)})
  })
The problem is that this returns an Array of tuples, which I cannot reduceByKey. When I split the string at ';' this returns an array. I tried mapping this to a tuple (as you can see from the map operation) but this does not work.
The complete code I used is
val rdd = sc.textFile("./data/segment/*.csv")

val topicsByDate = rdd
  .filter(line => line.split("\t").length > 23)
  .map({case(str) => (str.split("\t")(1), str.split("\t")(23))})
  .reduceByKey(_ + _)

val separateTopicsByDate = topicsByDate
  .mapValues(_.split(";").map({case(str) => (str)}))
  .mapValues({case(arr) => arr
    .filter(str => str.split(",").length > 1)
    .map({case(str) => (str.split(",")(0), str.split(",")(1).toInt)})
  })

separateTopicsByDate.take(2)
separateTopicsByDate.take(2)
This returns
res42: Array[(String, Array[(String, Int)])] = Array((20150219001500,Array((Cecilia Pedraza,91), (Mexico City,110), (Soviet Union,1019), (Dutch Warmbloods,1236), (Jose Luis Vaquero,1413), (National Equestrian Club,1636), (Lenin Park,1776), (Royal Dutch Sport Horse,2075), (North American,2104), (Western Hemisphere,2246), (Maydet Vega,2800), (Mirach Capital Group,110), (Subrata Roy,403), (New York,820), (Saransh Sharma,945), (Federal Bureau,1440), (San Francisco,1482), (Gregory Wuthrich,1635), (San Francisco,1652), (Dan Levine,2309), (Emily Flitter,2327), (K...
As you can see this is an array of tuples which I cannot use .reduceByKey(_ + _) on.
Is there a way to split the string in such a way that it can be reduced by key?
In case your RDD has rows like:
[date, "topic1,10;topic2,12;topic1,3"]
you can split the values and explode the row using flatMap into:
[date, ["topic1,10", "topic2,12", "topic1,3"]] ->
[date, "topic1,10"]
[date, "topic2,12"]
[date, "topic1,3"]
Then convert each row into [String,Integer] Tuple (rdd1 in the code):
["date_topic1",10]
["date_topic2",12]
["date_topic1",3]
and reduce by Key using addition (rdd2 in the code):
["date_topic1",13]
["date_topic2",12]
Then you separate dates from topics and combine topics with values, getting [String,String] Tuples like:
["date", "topic1,13"]
["date", "topic2,12"]
Finally you split the values into [topic,count] Tuples, prepare ["date", [(topic,count)]] pairs (rdd3 in the code) and reduce by Key (rdd4 in the code), getting:
["date", [(topic1, 13), (topic2, 12)]]
===
Below is a Java implementation as a sequence of four intermediate RDDs; you may refer to it for Scala development:
JavaPairRDD<String, String> rdd; // original data. contains [date, "topic1,10;topic2,12;topic1,3"]

JavaPairRDD<String, Integer> rdd1 = // contains
        // ["date_topic1",10]
        // ["date_topic2",12]
        // ["date_topic1",3]
        rdd.flatMapToPair(
                pair -> // pair = [date, "topic1,10;topic2,12;topic1,3"]
                {
                    List<Tuple2<String, Integer>> list = new ArrayList<Tuple2<String, Integer>>();
                    String k = pair._1; // date
                    String v = pair._2; // "topic,count;topic,count;topic,count"
                    String[] v_splits = v.split(";");
                    for (int i = 0; i < v_splits.length; i++) {
                        String[] v_split_topic_count = v_splits[i].split(","); // "topic,count"
                        list.add(new Tuple2<String, Integer>(k + "_" + v_split_topic_count[0],
                                Integer.parseInt(v_split_topic_count[1]))); // "date_topic,count"
                    }
                    return list.iterator();
                } // end call
        );

JavaPairRDD<String, Integer> rdd2 = // contains
        // ["date_topic1",13]
        // ["date_topic2",12]
        rdd1.reduceByKey((Integer i1, Integer i2) -> i1 + i2);

// lists (rather than iterators) are used as values so they stay serializable and reusable
JavaPairRDD<String, List<Tuple2<String, Integer>>> rdd3 = // contains
        // ["date", [(topic1,13)]]
        // ["date", [(topic2,12)]]
        rdd2.mapToPair(
                pair -> // ["date_topic1",13]
                {
                    String k = pair._1; // date_topic1
                    Integer v = pair._2; // 13
                    String[] dateTopicSplits = k.split("_", 2); // split only on the first '_' so topics may contain underscores
                    String new_k = dateTopicSplits[0]; // date
                    List<Tuple2<String, Integer>> list = new ArrayList<Tuple2<String, Integer>>();
                    list.add(new Tuple2<String, Integer>(dateTopicSplits[1], v)); // [(topic1,13)]
                    return new Tuple2<String, List<Tuple2<String, Integer>>>(new_k, list);
                }
        );

JavaPairRDD<String, List<Tuple2<String, Integer>>> rdd4 = // contains
        // ["date", [(topic1, 13), (topic2, 12)]]
        rdd3.reduceByKey(
                (List<Tuple2<String, Integer>> l1, List<Tuple2<String, Integer>> l2) ->
                {
                    List<Tuple2<String, Integer>> list = new ArrayList<Tuple2<String, Integer>>(l1);
                    list.addAll(l2);
                    return list;
                }
        );
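For reference, a rough Scala translation of the same four-step approach (a sketch, assuming topicsByDate: RDD[(String, String)] holding (date, topicstring) pairs as built in the question) could be:
// sketch of the flatMap + reduceByKey approach in Scala
val rdd1 = topicsByDate.flatMap { case (date, topicString) =>
  topicString.split(";")
    .map(_.split(","))
    .filter(_.length > 1)
    .map(arr => (s"${date}_${arr(0)}", arr(1).toInt))   // "date_topic" -> count
}
val rdd2 = rdd1.reduceByKey(_ + _)                       // sum counts per "date_topic"
val rdd3 = rdd2.map { case (dateTopic, count) =>
  val Array(date, topic) = dateTopic.split("_", 2)       // split only on the first '_'
  (date, Array((topic, count)))
}
val rdd4 = rdd3.reduceByKey(_ ++ _)                      // "date" -> [(topic, count), ...]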
UPD. This problem can actually be solved with a single map only: you split the row value (i.e. the topic string) by ';', which gives you [key, value] pairs as [topic, count], and you populate a hashmap with those pairs, adding up the counts. Finally you output the date key together with all the distinct topics accumulated in the hashmap and their values.
This way also seems to be more efficient, because the size of the hashmap will not be larger than the size of the original row, so the memory consumed by a mapper is limited by the size of the largest row, whereas in the flatMap solution memory must be large enough to fit all of the expanded rows.
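A Scala sketch of this single-map idea (again assuming topicsByDate: RDD[(String, String)] as in the question) might look like:
// sketch: aggregate counts per topic inside each row, without flatMap/reduceByKey
val separateTopicsByDate = topicsByDate.mapValues { topicString =>
  topicString.split(";")
    .map(_.split(","))
    .filter(_.length > 1)
    .map(arr => (arr(0), arr(1).toInt))
    .groupBy(_._1)                                   // local "hashmap" keyed by topic
    .map { case (topic, pairs) => (topic, pairs.map(_._2).sum) }
    .toArray
}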

Split rdd and Select elements

I am trying to capture a stream, transform the data, and then save it locally.
So far, streaming and writing work fine. However, the transformation only works halfway.
The stream I receive consists of 9 columns separated by "|". So I want to split it and, let's say, select columns 1, 3, and 5. What I have tried looks like this, but nothing really led to a result:
val indices = List(1,3,5)

linesFilter.window(Seconds(EVENT_PERIOD_SECONDS*WRITE_EVERY_N_SECONDS), Seconds(EVENT_PERIOD_SECONDS*WRITE_EVERY_N_SECONDS)).foreachRDD { (rdd, time) =>
  if (rdd.count() > 0) {
    rdd
      .map(_.split("\\|").slice(1,2))
      //.map(arr => (arr(0), arr(2))))
      //filter(x=> indices.contains(_(x)))) //selec(indices)
      //.zipWithIndex
      .coalesce(1,true)
      //the replacement is used so that I get a csv file at the end
      //.map(_.replace(DELIMITER_STREAM, DELIMITER_OUTPUT))
      //.map{_.mkString(DELIMITER_OUTPUT) }
      .saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))
  }
}
Does anyone have a tip on how to split an RDD and then only grab specific elements out of it?
Edit Input:
val lines = streamingContext.socketTextStream(HOST, PORT)
val linesFilter = lines
.map(_.toLowerCase)
.filter(_.split(DELIMITER_STREAM).length == 9)
The input stream looks like this:
536365|71053|white metal lantern|6|01-12-10 8:26|3,39|17850|united kingdom|2017-11-17 14:52:22
Thank you very much, everyone.
As you recommended, I modified my code like this:
private val DELIMITER_STREAM = "\\|"

val linesFilter = lines
  .map(_.toLowerCase)
  .filter(_.split(DELIMITER_STREAM).length == 9)
  .map(x => {
    val y = x.split(DELIMITER_STREAM)
    (y(0), y(1), y(3), y(4), y(5), y(6), y(7))
  })
and then in the RDD
if (rdd.count() > 0) {
  rdd
    .map(_.productIterator.mkString(DELIMITER_OUTPUT))
    .coalesce(1,true)
    .saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))
}
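If you would rather keep the indices list from the original attempt instead of naming each field, a sketch along the same lines (reusing DELIMITER_STREAM and DELIMITER_OUTPUT from above) could be:
// sketch: pick arbitrary columns by position
val indices = List(1, 3, 5)
val selectedColumns = lines
  .map(_.toLowerCase)
  .filter(_.split(DELIMITER_STREAM).length == 9)
  .map { x =>
    val y = x.split(DELIMITER_STREAM)
    indices.map(y(_)).mkString(DELIMITER_OUTPUT)
  }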

Processing multiple files separately in single spark submit job

I have following directory structure:
/data/modelA
/data/modelB
/data/modelC
..
Each of these files has data in the format (id, score). I have to do the following for each of them separately (a sketch of these steps follows the list):
1) group by score and sort the scores in descending order (DF_1: score, count)
2) from DF_1 compute the cumulative frequency for each sorted score group (DF_2: score, count, cumFreq)
3) from DF_2 select the cumulative frequencies that lie between 5-10 (DF_3: score, cumFreq)
4) from DF_3 select the minimum score (DF_4: score)
5) from the file select all ids which have a score greater than the score in DF_4, and save
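For a single model loaded into a DataFrame df with columns (id, score), the five steps might look roughly like this sketch (df, the column names, and the global window are illustrative assumptions, not code from the question):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, min, sum}

// df is assumed to hold one model's (id, score) rows; the window below is global, which is fine for a sketch
val df1 = df.groupBy("score").count().orderBy(desc("score"))                      // step 1
val w = Window.orderBy(desc("score")).rowsBetween(Window.unboundedPreceding, Window.currentRow)
val df2 = df1.withColumn("cumFreq", sum("count").over(w))                         // step 2
val df3 = df2.filter(col("cumFreq").between(5, 10)).select("score", "cumFreq")    // step 3
val df4 = df3.agg(min("score").as("minScore"))                                    // step 4
val result = df.join(df4, col("score") > col("minScore")).select("id")            // step 5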
I am able to do this by reading the directory with wholeTextFiles and creating a common dataframe for all the models, then using group by on the model.
Instead, I want to do something like this:
val scores_file = sc.wholeTextFiles("/data/*/")
val scores = scores_file.map { line =>
  //step 1
  //step 2
  //step 3
  //step 4
  //step 5 : save as line._1
}
This would help with dealing with each file separately and avoid the group by.
Assuming that your models are discrete values and you know them, you can define all the models in a list:
val models = List("modelA", "modelB", "modelC", ... )
and take the following approach:
models.foreach { model =>
  // each model directory is read on its own; the /data prefix matches the layout above
  val scoresPerModel = sc.textFile(s"/data/$model")
  scoresPerModel.map { line =>
    // business logic here
  }
}
If you don't know the models prior to computing the business logic, you have to read the directory using the Hadoop FileSystem API and extract the models from there:
import org.apache.hadoop.fs.{FileSystem, Path}

private val fs = {
  val conf = new org.apache.hadoop.conf.Configuration()
  FileSystem.get(conf)
}
// list the entries under the base path and keep the directory names as model names
val models = fs.listStatus(new Path(hdfsPath)).filter(_.isDirectory).map(_.getPath.getName)
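Putting the two parts together, a sketch for the layout in the question (assuming the base path is /data) could be:
// sketch: discover the model directories at runtime, then process each one separately
val models = fs.listStatus(new Path("/data")).filter(_.isDirectory).map(_.getPath.getName)
models.foreach { model =>
  val scoresPerModel = sc.textFile(s"/data/$model")
  // apply steps 1-5 from the question here, per model, and save the result under the model's name
}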