HBASE Bulk-Load in multiple region for a single table - scala

I am trying to Load data in HBase using BulkLoad. I am also using Scala and Spark to write the code. But every time data is loading on only one single region. I need to load this into multiple region. I have used below code -
Hbase Configuration:
def getConf: Configuration = {
val hbaseSitePath = "/etc/hbase/conf/hbase-site.xml"
val conf = HBaseConfiguration.create()
conf.addResource(new Path(hbaseSitePath))
conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 100)
conf
}
I can load 80GB of Data in only one single region using above mentioned configuration.
But when I am trying load the same amount of data in multiple region with below mentioned configuration getting exception
java.io.IOException: Trying to load more than 32 hfiles to one family
of one region
Updated Configuration -
def getConf: Configuration = {
val conf = HBaseConfiguration.create()
conf.addResource(new Path(hbaseSitePath))
conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 32)
conf.setLong("hbase.hregion.max.filesize", 107374182)
conf.set("hbase.regionserver.region.split.policy","org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy")
conf
}
For saving records I am using below code -
val kv = new KeyValue(Bytes.toBytes(key), columnFamily.getBytes(),
columnName.getBytes(), columnValue.getBytes())
(new ImmutableBytesWritable(Bytes.toBytes(key)), kv)
rdd.saveAsNewAPIHadoopFile(pathToHFile, classOf[ImmutableBytesWritable], classOf[KeyValue],
classOf[HFileOutputFormat2], conf) //Here rdd is the input
val loadFiles = new LoadIncrementalHFiles(conf)
loadFiles.doBulkLoad(new Path(pathToHFile), hTable)
Need Help on this.

You are getting issue because 32 is default value per region. You should define KeyPrefixRegionSplitPolicy to split your files and you can increase increase hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily as below
conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 1024)
Try also to increase file size as
conf.setLong("hbase.hregion.max.filesize", 107374182)

Related

MongoDB read operations take too long on spark, Is there any additional conf argument?

I want to query MongoDB collection with Spark using PySpark. But when I use "collect", "count", etc. with some filter conditions(for examle field > 5) the job doesn't appear for long time on Spark interface and takes too long to complete. But in small collections there is no problem, everything goes normal. I think Apache Spark tries to read all data.
My confs;
conf = SparkConf()\
.set('spark.executor.memory', '7g')\
.set('spark.executor.cores', '2')\
.set('spark.num.executors', '8')\
.set('spark.cores.max', '16')\
.set('spark.driver.memory', '10g')\
.set('spark.driver.port', '10001')\
.set('spark.driver.host', 'xx.xx.xx.xx')\
.set('spark.dynamicAllocation.enabled', "false")\
.set("spark.mongodb.read.connection.uri", "mongodb://XX.XX.XX.XX:27017")\
.set("spark.mongodb.read.database", "collection_1")\
.set("spark.mongodb.read.collection", "ADPacket")\
.set("spark.mongodb.write.connection.uri", "mongodb://XX.XX.XX.XX:27017")\
.set("spark.mongodb.write.database", "collection_1")\
.set("spark.mongodb.write.collection", "ADPacket")\
.set("spark.mongodb.write.operationType", "update")\
.set("spark.mongodb.readPreference.name", "secondaryPreferred")\
.setAppName('hello')\
.setMaster('spark://xx.xx.xx.xx:7077')
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
sqlContext = SQLContext(sc)
and I load the data;
df = sqlContext.read.format("mongodb").load()
df.createOrReplaceTempView("ADPacket")
I tried both below commands.
sqlContext.sql("SELECT * FROM ADPacket WHERE AssetConnectDeviceKey='xxxx' and TimeStamp<1655555989 LIMIT 500").collect()
df.filter((df["TimeStamp"]>1650050989) & (df["TimeStamp"]<1655555989)).count()
In addition AssetConnectDeviceKey and TimeStamp are indexes in MongoDB collection.

Synapse - Notebook not working from Pipeline

I have a notebook in Azure Synapse that reads parquet files into a data frame using the synapsesql function and then pushes the data frame contents into a table in the SQL Pool.
Executing the notebook manually is successful and the table is created and populated in the Synapse SQL pool.
When I try to call the same notebook from an Azure Synapse pipeline it returns successful however does not create the table. I am using the Synapse Notebook activity in the pipeline.
What could be the issue here?
I am getting deprecated warnings around the synapsesql function but don't know what is actually deprecated.
The code is below.
%%spark
val pEnvironment = "t"
val pFolderName = "TestFolder"
val pSourceDatabaseName = "TestDatabase"
val pSourceSchemaName = "TestSchema"
val pRootFolderName = "RootFolder"
val pServerName = pEnvironment + "synas01"
val pDatabaseName = pEnvironment + "syndsqlp01"
val pTableName = pSourceDatabaseName + "" + pSourceSchemaName + "" + pFolderName
// Import functions and Synapse connector
import org.apache.spark.sql.DataFrame
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.functions.
import org.apache.spark.sql.SqlAnalyticsConnector.
// Get list of "FileLocation" from control.FileLoadStatus
val fls:DataFrame = spark.read.
synapsesql(s"${pDatabaseName}.control.FileLoadStatus").
select("FileLocation","ProcessedDate")
// Read all parquet files in folder into data frame
// Add file name as column
val df:DataFrame = spark.read.
parquet(s"/source/${pRootFolderName}/${pFolderName}/").
withColumn("FileLocation", input_file_name())
// Join parquet file data frame to FileLoadStatus data frame
// Exclude rows in parquet file data frame where ProcessedDate is not null
val df2 = df.
join(fls,Seq("FileLocation"), "left").
where(fls("ProcessedDate").isNull)
// Write data frame to sql table
df2.write.
option(Constants.SERVER,s"${pServerName}.sql.azuresynapse.net").
synapsesql(s"${pDatabaseName}.xtr.${pTableName}",Constants.INTERNAL)
This case happens often and to get the output after pipeline execution. Follow the steps mentioned.
Pick up the Apache Spark application name from the output of pipeline
Navigate to Apache Spark Application under Monitor tab and search for the same application name .
These 4 tabs would be available there: Diagnostics,Logs,Input data,Output data
Go to Logs ad check 'stdout' for getting the required output.
https://www.youtube.com/watch?v=ydEXCVVGAiY
Check the above video link for detailed live procedure.

Generating a single output file for each processed input file in Apach Flink

I am using Scala and Apache Flink to build an ETL that reads all the files under a directory in my local file system periodically and write the result of processing each file in a single output file under another directory.
So an example of this is would be:
/dir/to/input/files/file1
/dir/to/intput/files/fil2
/dir/to/input/files/file3
and the output of the ETL would be exactly:
/dir/to/output/files/file1
/dir/to/output/files/file2
/dir/to/output/files/file3
I have tried various approaches including reducing the parallel processing to one when writing to the dataSink but I still can't achieve the required result.
This is my current code:
val path = "/path/to/input/files/"
val format = new TextInputFormat(new Path(path))
val socketStream = env.readFile(format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 10)
val wordsStream = socketStream.flatMap(value => value.split(",")).map(value => WordWithCount(value,1))
val keyValuePair = wordsStream.keyBy(_.word)
val countPair = keyValuePair.sum("count")
countPair.print()
countPair.writeAsText("/path/to/output/directory/"+
DateTime.now().getHourOfDay.toString
+
DateTime.now().getMinuteOfHour.toString
+
DateTime.now().getSecondOfMinute.toString
, FileSystem.WriteMode.NO_OVERWRITE)
// The first write method I trid:
val sink = new BucketingSink[WordWithCount]("/path/to/output/directory/")
sink.setBucketer(new DateTimeBucketer[WordWithCount]("yyyy-MM-dd--HHmm"))
// The second write method I trid:
val sink3 = new BucketingSink[WordWithCount]("/path/to/output/directory/")
sink3.setUseTruncate(false)
sink3.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"))
sink3.setWriter(new StringWriter[WordWithCount])
sink3.setBatchSize(3)
sink3.setPendingPrefix("file-")
sink3.setPendingSuffix(".txt")
Both writing methods fail in producing the wanted result.
Can some with experience with Apache Flink guide me to the write approach please.
I solved this issue importing the next dependencies to run on local machine:
hadoop-aws-2.7.3.jar
aws-java-sdk-s3-1.11.183.jar
aws-java-sdk-core-1.11.183.jar
aws-java-sdk-kms-1.11.183.jar
jackson-annotations-2.6.7.jar
jackson-core-2.6.7.jar
jackson-databind-2.6.7.jar
joda-time-2.8.1.jar
httpcore-4.4.4.jar
httpclient-4.5.3.jar
You can review it on :
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/aws.html
Section "Provide S3 FileSystem Dependency"

SparkSQL performance issue with collect method

We are currently facing a performance issue in sparksql written in scala language. Application flow is mentioned below.
Spark application reads a text file from input hdfs directory
Creates a data frame on top of the file using programmatically specifying schema. This dataframe will be an exact replication of the input file kept in memory. Will have around 18 columns in the dataframe
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
Creates a filtered dataframe from the first data frame constructed in step 2. This dataframe will contain unique account numbers with the help of distinct keyword.
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
Using the two dataframes constructed in step 2 & 3, we will get all the records which belong to one account number and do some Json parsing logic on top of the filtered data.
var filtrEqpDF =
eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
Finally the json parsed data will be put into Hbase table
Here we are facing performance issues while calling the collect method on top of the data frames. Because collect will fetch all the data into a single node and then do the processing, thus losing the parallel processing benefit.
Also in real scenario there will be 10 billion records of data which we can expect. Hence collecting all those records in to driver node will might crash the program itself due to memory or disk space limitations.
I don't think the take method can be used in our case which will fetch limited number of records at a time. We have to get all the unique account numbers from the whole data and hence I am not sure whether take method, which takes
limited records at a time, will suit our requirements
Appreciate any help to avoid calling collect methods and have some other best practises to follow. Code snippets/suggestions/git links will be very helpful if anyone have had faced similar issues
Code snippet
val eqpSchemaString = "acoountnumber ....."
val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)));
val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....)
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema);
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}
The approach usually take in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being the atomic unit of parallelism of Spark)
the function will open a connection to HBase (thus making it one per partition) and send all the contained values through this connection
This means the every executor opens a connection (which is not serializable but lives within the boundaries of the function, thus not needing to be sent across the network) and independently sends its contents to HBase, without any need to collect all data on the driver (or any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath). // Using DataFrameReader but your way works too
foreachPartition { rows =>
val conn = ??? // Create HBase connection
for (row <- rows) { // Loop over the iterator
val data = parseJson(row) // Your parsing logic
??? // Use 'conn' to save 'data'
}
}
You can ignore collect in your code if you have large set of data.
Collect Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Also this can cause the driver to run out of memory, though, because collect() fetches the entire RDD/DF to a single machine.
I have just edited your code, which should work for you.
var distAccNrsDF = eqpDF.select("accountnumber").distinct()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}

Using append command to update line chart in Lightning server

I want to visualize some streaming data using local Lightning server. I created a simple Scala test, which created a line chart that I can access at http://localhost:3000. The problem is that when I use viz.append(newdata) in order to update the existing line chart, this new data is not sent to the server and the graphic remains the same. If, however, I do lgn.lineStreaming(Array(Array(1.0)), then the new line chart is created.
So, what's the problem with updating the streaming line chart in Lightning?
import org.viz.lightning._
var viz: Visualization = _
//...
val lgn = Lightning(host="http://localhost:3000")
lgn.createSession("streamingtest")
// initialization
val series = Array.fill(1)(Array.fill(1)(r.nextInt(1)))
viz = lgn.lineStreaming(series)
// ...
// adding new data (THIS DATA IS NOT SENT TO THE SERVER)
val newdata: Map[String, Any] = Map("1" -> 1.0)
viz.append(newdata)