HBase Need to export data from one cluster and import it to another with slight modification in row key
As I have referred in above post, need to export the HBase data of table from one cluster and import it into the another cluster by changing row key based on our match pattern
In the "org.apache.hadoop.hbase.mapreduce.Import" we have option to change the ColumnFamily using the args "HBASE_IMPORTER_RENAME_CFS"
I have slightly modified the Import code to support row key change.My code is available in Pastebin
https://pastebin.com/ticgeBb0
Changed the row key using the below code.
private static Cell convertRowKv(Cell kv, Map<byte[], byte[]> rowkeyReplaceMap) {
if (rowkeyReplaceMap != null) {
byte[] oldrowkeyName = CellUtil.cloneRow(kv);
String oldrowkey = Bytes.toString(oldrowkeyName);
Set<byte[]> keys = rowkeyReplaceMap.keySet();
for (byte[] key : keys) {
if (oldrowkey.contains(Bytes.toString(key))) {
byte[] newrowkeyName = rowkeyReplaceMap.get(key);
ByteBuffer buffer = ByteBuffer.wrap(oldrowkeyName);
buffer.get(key);
ByteBuffer newbuffer = buffer.slice();
ByteBuffer bb = ByteBuffer.allocate(newrowkeyName.length + newbuffer.capacity());
byte[] newrowkey = bb.array();
kv = new KeyValue(newrowkey, // row buffer
0, // row offset
newrowkey.length, // row length
kv.getFamilyArray(), // CF buffer
kv.getFamilyOffset(), // CF offset
kv.getFamilyLength(), // CF length
kv.getQualifierArray(), // qualifier buffer
kv.getQualifierOffset(), // qualifier offset
kv.getQualifierLength(), // qualifier length
kv.getTimestamp(), // timestamp
KeyValue.Type.codeToType(kv.getTypeByte()), // KV
// Type
kv.getValueArray(), // value buffer
kv.getValueOffset(), // value offset
kv.getValueLength()); // value length
}
}
}
return kv;
}
Executed the Import
hbase org.apache.hadoop.hbase.mapreduce.ImportWithRowKeyChange -DHBASE_IMPORTER_RENAME_ROW=123:123456 import file:///home/nshsh/export/
The row key has been successfully changed. But while put the Cell in the HBase table, using
"org.apache.hadoop.hbase.client.Put.add(Cell)" we have check as
"the row of the kv is the same as the put as we are changing row key"
Here it fails.
Then I have commented the check in Put class and updated the hbase-client.jar. Also I have tried to write HBasePut which extends Put
public class HBasePut extends Put {
public HBasePut(byte[] row) {
super(row);
// TODO Auto-generated constructor stub
}
public Put add(Cell kv) throws IOException{
byte [] family = CellUtil.cloneFamily(kv);
System.err.print(Bytes.toString(family));
List<Cell> list = getCellList(family);
//Checking that the row of the kv is the same as the put
/*int res = Bytes.compareTo(this.row, 0, row.length,
kv.getRowArray(), kv.getRowOffset(), kv.getRowLength());
if (res != 0) {
throw new WrongRowIOException("The row in " + kv.toString() +
" doesn't match the original one " + Bytes.toStringBinary(this.row));
}*/
list.add(kv);
familyMap.put(family, list);
return this;
}
}
In the Mapreduce, the task always fails with the below exception
2020-07-24 13:37:15,105 WARN [htable-pool1-t1] hbase.HBaseConfiguration: Config option "hbase.regionserver.lease.period" is deprecated. Instead, use "hbase.client.scanner.timeout.period"
2020-07-24 13:37:15,122 INFO [LocalJobRunner Map Task Executor #0] client.AsyncProcess: , tableName=import
2020-07-24 13:37:15,178 INFO [htable-pool1-t1] client.AsyncProcess: #2, table=import, attempt=18/35 failed=7ops, last exception: org.apache.hadoop.hbase.client.WrongRowIOException: org.apache.hadoop.hbase.client.WrongRowIOException: The row in \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00/vfrt:con/1589541180643/Put/vlen=225448/seqid=0 doesn't match the original one 123_abcf
at org.apache.hadoop.hbase.client.Put.add(Put.java:330)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toPut(ProtobufUtil.java:574)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:744)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:720)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2168)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33656)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2196)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
at java.lang.Thread.run(Thread.java:745)
I don't know where the old Put Class has been referred in the task.
Can someone please help to fix this.
Related
In my code has an issue but I can't see what issue in this. Column names are same word by word and it is working, If I use 1 column in csv file but when I try out more then 2-3 column fields it is giving the error below. I have checked read lots of article so I can't fix the error. What can be happen with is this lines. DB already was created with similar fields.
private void DBaktar()
{
string SQLServerConnectionString = "Server =.\\SQLEXPRESS; Database = Qiti; User Id = sa; Password = 7731231xx!!;";
string CSVpath = #"D:\FTP\"; // CSV file Path
string CSVFileConnectionString = String.Format("Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};;Extended Properties=\"text;HDR=Yes;FMT=Delimited\";", CSVpath);
var AllFiles = new DirectoryInfo(CSVpath).GetFiles("*.CSV");
string File_Name = string.Empty;
foreach (var file in AllFiles)
{
try
{
DataTable dt = new DataTable();
using (OleDbConnection con = new OleDbConnection(CSVFileConnectionString))
{
con.Open();
var csvQuery = string.Format("select * from [{0}]", file.Name);
using (OleDbDataAdapter da = new OleDbDataAdapter(csvQuery, con))
{
da.Fill(dt);
}
}
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(SQLServerConnectionString))
{
bulkCopy.ColumnMappings.Add("LKod", "LKod");
bulkCopy.ColumnMappings.Add("info", "info");
bulkCopy.ColumnMappings.Add("Codex", "Codex");
bulkCopy.ColumnMappings.Add("LthNo", "LthNo");
bulkCopy.ColumnMappings.Add("Datein", "Datein");
bulkCopy.DestinationTableName = "U_Tik";
bulkCopy.BatchSize = 0;
bulkCopy.EnableStreaming = true;
bulkCopy.WriteToServer(dt);
bulkCopy.Close();
}
}
catch (Exception ex)
{
MessageBox.Show(ex.ToString());
}
}
Error exception;
The given ColumnName 'LKod' does not match up with any column in data
source.
ex.StackTrace;
at
System.Data.SqlClient.SqlBulkCopy.WriteRowSourceToServerCommon(Int32
columnCount) at
System.Data.SqlClient.SqlBulkCopy.WriteRowSourceToServerAsync(Int32
columnCount, CancellationToken ctoken) at
System.Data.SqlClient.SqlBulkCopy.WriteToServer(DataTable table,
DataRowState rowState) at
System.Data.SqlClient.SqlBulkCopy.WriteToServer(DataTable table)
Some information can be found here: https://sqlbulkcopy-tutorial.net/columnmapping-does-not-match
Cause
You didn't provide any ColumnMappings, and there is more column in the source than in the destination.
You provided an invalid column name for the source.
You provided an invalid column name for the destination.
Solution
ENSURE to provide a ColumnMappings
ENSURE all values for source column name are valid and case sensitive.
ENSURE all values for destination column name are valid and case sensitive.
MAKE the source case insensitive
I have found a solution and working 100% true.. The link below, I hope become a path who need that.
https://johnnycode.com/2013/08/19/using-c-sharp-sqlbulkcopy-to-import-csv-data-sql-server/
I've created a custom Magic command with the intention of generating a spark query programatically. Here's the relevant part of my class that implements the MagicCommandFunctionality:
MagicCommandOutcomeItem execute(MagicCommandExecutionParam magicCommandExecutionParam) {
// get the string that was entered:
String input = magicCommandExecutionParam.command.substring(MAGIC.length())
// use the input to generate a query
String generatedQuery = Interpreter.interpret(input)
MIMEContainer result = Text(generatedQuery);
return new MagicCommandOutput(MagicCommandOutcomeItem.Status.OK, result.getData().toString());
}
This works splendidly. It returns the command that I generated. (As text)
My question is -- how do I coerce the notebook into evaluating that value in the cell? My guess is that a SimpleEvaluationObject and TryResult are involved, but I can't find any examples of their use
Rather than creating the MagicCommandOutput I probably want the Kernel to create one for me. I see that the KernelMagicCommand has an execute method that would do that. Anyone have any ideas?
Okay, I found one way to do it. Here's my solution:
You can ask the current kernelManager for the kernel you're interested in,
then call PythonEntryPoint.evaluate. It seems to do the job!
#Override
MagicCommandOutcomeItem execute(MagicCommandExecutionParam magicCommandExecutionParam) {
String input = magicCommandExecutionParam.command.substring(MAGIC.length() + 1)
// this is the Scala code I want to evaluate:
String codeToExecute = <your code here>
KernelFunctionality kernel = KernelManager.get()
PythonEntryPoint pep = kernel.getPythonEntryPoint(SCALA_KERNEL)
pep.evaluate(codeToExecute)
pep.getShellMsg()
List<Message> messages = new ArrayList<>()
//until there are messages on iopub channel available collect them into response
while (true) {
String iopubMsg = pep.getIopubMsg()
if (iopubMsg == "null") break
try {
Message msg = parseMessage(iopubMsg) //(I didn't show this part)
messages.add(msg)
String commId = (String) msg.getContent().get("comm_id")
if (commId != null) {
kernel.addCommIdManagerMapping(commId, SCALA_KERNEL)
}
} catch (IOException e) {
log.error("There was an error: ${e.getMessage()}")
return new MagicKernelResponse(MagicCommandOutcomeItem.Status.ERROR, messages)
}
}
return new MagicKernelResponse(MagicCommandOutcomeItem.Status.OK, messages)
}
I need to group output of my Enumerator in different ZipEntries, based on specific property (providerId), original chartPreparations stream is ordered by providerId, so I can just keep reference to provider, and add a new entry when provider chages
Enumerator.outputStream(os => {
val currentProvider = new AtomicReference[String]()
// Step 1. Creating zipped output file
val zipOs = new ZipOutputStream(os, Charset.forName("UTF8"))
// Step 2. Processing chart preparation Enumerator
val chartProcessingTask = (chartPreparations) run Iteratee.foreach(cp => {
// Step 2.1. Write new entry if needed
if(currentProvider.get() == null || cp.providerId != currentProvider.get()) {
if (currentProvider.get() != null) {
zipOs.write("</body></html>".getBytes(Charset.forName("UTF8")))
}
currentProvider.set(cp.providerId)
zipOs.putNextEntry(new ZipEntry(cp.providerName + ".html"))
zipOs.write(HTML_HEADER)
}
// Step 2.2 Write chart preparation in HTML format
zipOs.write(toHTML(cp).getBytes(Charset.forName("UTF8")))
})
// Step 3. On Complete close stream
chartProcessingTask.onComplete(_ => zipOs.close())
})
Since current provider reference, is changing, during the output, I made it AtomicReference, so that I could handle references from different threads.
Can currentProvider just be a var Option[String], and why?
Hi all I have a to parse some files to load into a DataSet and I ran into an issue where the first row value is sometimes blank so when I parse the data the rows added to the columns are off because there is no value for the row[RouteCode].
Example Data
Columns are in the first line(Tab delimited) DataRows are in the following rows(Tab delimited)
RouteCode City EmailAddress FirstName
NULL MyCity My-Email MyFirstName
What I am seeing is all the Columns are added fine but each row added the first tab value is not detected so it shifts the columns(hope I am making sense) so in this case the city data is sitting in the RouteCode column and the last column somehow is getting the first row value (tab).
class TextToDataSet
{
public TextToDataSet()
{ }
/// <summary>
/// Converts a given delimited file into a dataset.
/// Assumes that the first line
/// of the text file contains the column names.
/// </summary>
/// <param name="File">The name of the file to open</param>
/// <param name="TableName">The name of the
/// Table to be made within the DataSet returned</param>
/// <param name="delimiter">The string to delimit by</param>
/// <returns></returns>
public static DataSet Convert(string File,
string TableName, string delimiter)
{
//The DataSet to Return
DataSet result = new DataSet();
//Open the file in a stream reader.
using (StreamReader s = new StreamReader(File))
{
//Split the first line into the columns
string[] columns = s.ReadLine().Split(delimiter.ToCharArray());
//Add the new DataTable to the RecordSet
result.Tables.Add(TableName);
//Cycle the colums, adding those that don't exist yet
//and sequencing the one that do.
foreach (string col in columns)
{
bool added = false;
string next = "";
int i = 0;
while (!added)
{
//Build the column name and remove any unwanted characters.
string columnname = col + next;
columnname = columnname.Replace("#", "");
columnname = columnname.Replace("'", "");
columnname = columnname.Replace("&", "");
//See if the column already exists
if (!result.Tables[TableName].Columns.Contains(columnname))
{
//if it doesn't then we add it here and mark it as added
result.Tables[TableName].Columns.Add(columnname);
added = true;
}
else
{
//if it did exist then we increment the sequencer and try again.
i++;
next = "_" + i;
}
}
}
//Read the rest of the data in the file.
string AllData = s.ReadToEnd();
//Split off each row at the Carriage Return/Line Feed
//Default line ending in most windows exports.
//You may have to edit this to match your particular file.
//This will work for Excel, Access, etc. default exports.
string[] rows = AllData.Split("\n".ToCharArray());
//Now add each row to the DataSet
foreach (string r in rows)
{
//Split the row at the delimiter.
string[] items = r.Split(delimiter.ToCharArray());
//Add the item
result.Tables[TableName].Rows.Add(items);
}
}
//Return the imported data.
return result;
}
}
}
If there aren't supposed to be any missing entries anywhere in the file (i.e there should always be something between the tabs) then you could use:
string[] columns = s.ReadLine().Split(delimiter.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
and then check that columns is not an empty array. If it is then read the next line and carry on processing:
while (columns.Length == 0)
{
// Row is empty so read the next line out of the file
columns = s.ReadLine().Split(delimiter.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
}
This will ensure that your data always starts with a filled row. However, it will break down if there is ever an empty entry further down the list.
If there could be empty entries then you'll probably have to check for all columns being empty:
while (columns.All(c => string.IsNullOrEmpty(c)))
{
// Row is empty so read the next line out of the file
columns = s.ReadLine().Split(delimiter.ToCharArray());
}
I've been using JPA to insert entities into a database but I've run up against a problem where I need to do an insert and get the primary key of the record last inserted.
Using PostgreSQL I would use an INSERT RETURNING statement which would return the record id, but with an entity manager doing all this, the only way I know is to use SELECT CURRVAL.
So the problem becomes, I have several data sources sending data into a message driven bean (usually 10-100 messages at once from each source) via OpenMQ and inside this MDB I persists this to PostgreSQL via the entity manager. It's at this point I think there will be a "race condition like" effect of having so many inserts that I won't necessarily get the last record id using SELECT CURRVAL.
My MDB persists 3 entity beans via an entity manager like below.
Any help on how to better do this much appreciated.
public void onMessage(Message msg) {
Integer agPK = 0;
Integer scanPK = 0;
Integer lookPK = 0;
Iterator iter = null;
List<Ag> agKeys = null;
List<Scan> scanKeys = null;
try {
iag = (IAgBean) (new InitialContext()).lookup(
"java:comp/env/ejb/AgBean");
TextMessage tmsg = (TextMessage) msg;
// insert this into table only if doesn't exists
Ag ag = new Ag(msg.getStringProperty("name"));
agKeys = (List) (iag.getPKs(ag));
iter = agKeys.iterator();
if (iter.hasNext()) {
agPK = ((Ag) iter.next()).getId();
}
else {
// no PK found so not in dbase, insert new
iag.addAg(ag);
agKeys = (List) (iag.getPKs(ag));
iter = agKeys.iterator();
if (iter.hasNext()) {
agPK = ((Ag) iter.next()).getId();
}
}
// insert this into table always
iscan = (IScanBean) (new InitialContext()).lookup(
"java:comp/env/ejb/ScanBean");
Scan scan = new Scan();
scan.setName(msg.getStringProperty("name"));
scan.setCode(msg.getIntProperty("code"));
iscan.addScan(scan);
scanKeys = (List) iscan.getPKs(scan);
iter = scanKeys.iterator();
if (iter.hasNext()) {
scanPK = ((Scan) iter.next()).getId();
}
// insert into this table the two primary keys above
ilook = (ILookBean) (new InitialContext()).lookup(
"java:comp/env/ejb/LookBean");
Look look = new Look();
if (agPK.intValue() != 0 && scanPK.intValue() != 0) {
look.setAgId(agPK);
look.setScanId(scanPK);
ilook.addLook(look);
}
// ...
The JPA spec requires that after persist, the entity be populated with a valid ID if an ID generation strategy is being used. You don't have to do anything.