I put a message on a queue, and as I understand it the queue manager sets the msgId value.
When I trace the message, I see that the msgId contains a null character, which creates a problem on the side I send the message to.
I checked the documentation, and it says:
A MsgId generated by the queue manager consists of a 4-byte product identifier (AMQ¬ or CSQ¬ in either ASCII or EBCDIC, where ¬ represents a blank character), followed by a product-specific implementation of a unique string. In IBM® MQ this contains the first 12 characters of the queue-manager name, and a value derived from the system clock.
This is my msgId:
Any idea why it creates a msgId with a null character, and how can I solve it?
MsgId is defined as a byte string, which allows any byte values to be included.
As you have already found, MsgId values generated by MQ use character data for portions of the byte string and add a binary value in the remaining bytes to create a unique identifier.
The binary portion is derived from the system clock and can be expected to contain arbitrary byte values.
If the receiving application has specific requirements for the format of the MsgId and the byte values it can contain, then the putting application will need to generate a MsgId that conforms to those requirements.
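For illustration, here is a minimal sketch of how a putting application could build a MsgId made entirely of printable ASCII. The 24-byte length matches the MQBYTE24 field, but the "MYAPP " prefix, the hex-encoded clock value, and the blank padding are assumptions for this example rather than an MQ convention; the real layout should follow whatever the receiving application requires.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MsgIdBuilder {
    static byte[] buildMsgId(String prefix) {
        // hex-encode a clock value so every byte stays printable ASCII
        String id = prefix + Long.toHexString(System.nanoTime()).toUpperCase();
        byte[] msgId = Arrays.copyOf(id.getBytes(StandardCharsets.US_ASCII), 24);
        // pad the remainder with blanks instead of leaving null bytes
        for (int i = id.length(); i < 24; i++) {
            msgId[i] = ' ';
        }
        return msgId;
    }

    public static void main(String[] args) {
        byte[] msgId = buildMsgId("MYAPP ");
        // assign this to the message's MsgId field before the put
        // (with the IBM MQ classes for Java that is the MQMessage.messageId field)
        System.out.println(new String(msgId, StandardCharsets.US_ASCII));
    }
}

Note that a value built this way is only unique if the clock-derived part never repeats; MQ's own generator exists precisely to guarantee uniqueness, which is why letting the queue manager set the MsgId is usually preferable when the receiver can tolerate arbitrary bytes.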
I am using a large German corpus, which I have cleaned of all special characters, numbers and punctuation marks.
Each line contains one sentence.
Running
fastText/./fasttext skipgram -input input.txt -output output.txt
-minCount 2 -minn 2 -maxn 8 -dim 300 -ws 5
returns a VSM with <\s> as the first entry.
As I understand it, there are whitespace characters left in the document that are being interpreted as a token.
Is that correct?
And how can I get rid of them and/or the <\s> in the VSM?
Thank you.
By convention, the fasttext tool converts any newline in the input file to the pseudo-word token '<\s>', which represents end-of-string ('EOS').
See the discussion in the Python binding Markdown docs:
https://github.com/facebookresearch/fastText/blob/main/python/README.md#important-preprocessing-data--encoding-conventions
The newline character is used to delimit lines of text. In particular,
the EOS token is appended to a line of text if a newline character is
encountered. The only exception is if the number of tokens exceeds the
MAX_LINE_SIZE constant as defined in the Dictionary header. This means
if you have text that is not separated by newlines, such as the fil9
dataset, it will be broken into chunks with MAX_LINE_SIZE of tokens
and the EOS token is not appended.
The length of a token is the number of UTF-8 characters by considering
the leading two bits of a byte to identify subsequent bytes of a
multi-byte sequence. Knowing this is especially important when
choosing the minimum and maximum length of subwords. Further, the EOS
token (as specified in the Dictionary header) is considered a
character and will not be broken into subwords.
(Though only mentioned in that doc about the Python bindings, it's definitely defined/implemented in the core C++ code, especially the dictionary.cc file.)
To eliminate that word-token, you'd have to strip all newlines from your input file.
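For example, one quick way to do that is to rewrite the corpus with every line break replaced by a space before training. A small sketch (the input.txt and cleaned.txt file names are just placeholders):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StripNewlines {
    public static void main(String[] args) throws Exception {
        String text = new String(Files.readAllBytes(Paths.get("input.txt")), StandardCharsets.UTF_8);
        // replace every run of line breaks with a single space so fastText never sees a newline
        String oneLine = text.replaceAll("\\R+", " ").trim();
        Files.write(Paths.get("cleaned.txt"), oneLine.getBytes(StandardCharsets.UTF_8));
    }
}

Keep in mind that, as the quoted passage notes, newline-free text is then broken into chunks of MAX_LINE_SIZE tokens, and you lose the one-sentence-per-line boundaries you had before.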
Using SSIS, we are extracting names and addresses from one system and providing them to a downstream system via a vendor-specific file format that only accepts UTF-8 and parses the data based on character positions, so it expects each row to be an exact length.
Many users have umlauts, apostrophes or accents in their names and addresses.
These characters do not translate well in UTF-8, showing up as xD3, xE1 and similar.
As one character is now replaced with three, the row length is incorrect and the upload fails.
Is there a way to represent characters with accents and umlauts in UTF-8?
We can change them in the source system, but that means the spelling is now technically incorrect.
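No answer is recorded here, but the length mismatch itself is easy to demonstrate: an accented letter is one character yet more than one byte in UTF-8, so a row padded by character count will not have the byte length the receiving parser expects. A small Java illustration (the name and the fixed width of 20 are arbitrary examples):

import java.nio.charset.StandardCharsets;

public class FixedWidthUtf8 {
    public static void main(String[] args) {
        String name = "Müller";
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        // 6 characters, but 7 bytes: 'ü' encodes as the two bytes 0xC3 0xBC
        System.out.println(name.length() + " chars, " + utf8.length + " bytes");

        // padding to a fixed *byte* length (20 here is just an example width)
        int width = 20;
        StringBuilder padded = new StringBuilder(name);
        while (padded.toString().getBytes(StandardCharsets.UTF_8).length < width) {
            padded.append(' ');
        }
        System.out.println("[" + padded + "] is "
                + padded.toString().getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}

Whether the downstream system counts bytes or characters determines which of those two lengths has to be held constant.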
I have a text file that looks like this:
shooting-stars 💫 "are cool"
I have a lexical analyzer that uses FileInputStream to read the characters one at a time, passing those characters to a switch statement that returns the corresponding lexeme.
In this case, 💫 represents assignment so this case passes:
case 'ð' :
return new Lexeme("ASSIGN");
For some reason, the file reader stops at that point, returning a null pointer even though it has yet to process the string (or whatever comes after the 💫). Any time it reads in an emoticon it does this. If there are no emoticons, it gets to the end of the file. Any ideas?
I suspect the problem is that the character 💫 (Unicode code point U+1F4AB) is outside the range of characters that Java represents internally as single char values. Instead, Java represents characters above U+FFFF as two char values known as a surrogate pair, in this case U+D83D followed by U+DCAB. (See this thread for more info and some links.)
It's hard to know exactly what's going on with the little bit of code that you presented, but my guess is that you are not handling this situation correctly. You will need to adjust your processing logic to deal with your emoticons arriving in two pieces.
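One way to make such characters arrive in one piece is to decode the file as UTF-8 and walk Unicode code points instead of individual bytes or chars. A sketch (the "ASSIGN" lexeme and the file name program.txt are placeholders based on the question):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CodePointScanner {
    public static void main(String[] args) throws IOException {
        String src = new String(Files.readAllBytes(Paths.get("program.txt")), StandardCharsets.UTF_8);
        int i = 0;
        while (i < src.length()) {
            int cp = src.codePointAt(i);   // a whole code point, even outside the BMP
            if (cp == 0x1F4AB) {           // U+1F4AB, the 💫 assignment operator
                System.out.println("ASSIGN");
            }
            i += Character.charCount(cp);  // advance by one or two char positions
        }
    }
}

The same idea applies if you keep a streaming reader: wrap the FileInputStream in an InputStreamReader with an explicit UTF-8 charset and assemble surrogate pairs before the switch, rather than matching on single bytes such as 'ð'.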
How do I create document info dictionary keys containing Unicode characters (typically Swedish characters, for instance U+00E4 ä, which is C3 A4 in UTF-8)? I would like to use the PdfStamper to enter my own metadata in the document info dictionary, but I can't get it to accept the Swedish characters.
Entering custom metadata using Acrobat works fine, and looking at the PDF in a text editor I can see that the characters get encoded as, for instance, #C3#A4 for the character mentioned above. So is there a way to achieve this programmatically using the iText PdfStamper?
regards
Mattias
PS. There is no problem having unicode characters in the info dictionary values, but the keys are a different story.
Please take a look at the NameObject example, and give it a try. You'll see that iText automatically escapes special characters in names.
iText follows the ISO 32000-1 specification, which states (7.3.5, Name Objects):
Beginning with PDF 1.2 a name object is an atomic symbol uniquely
defined by a sequence of any characters (8-bit values) except null
(character code 0). Uniquely defined means that any two name objects
made up of the same sequence of characters denote the same object.
Atomic means that a name has no internal structure; although it is
defined by a sequence of characters, those characters are not
considered elements of the name.
When writing a name in a PDF file, a SOLIDUS (2Fh) (/) shall be used to
introduce a name. The SOLIDUS is not part of the name but is a prefix indicating that what follows is a
sequence of characters representing the name in the PDF file and shall
follow these rules:
a) A NUMBER SIGN (23h) (#) in a name shall be written by using its
2-digit hexadecimal code (23), preceded by the NUMBER SIGN.
b) Any character in a name that is a regular character (other than
NUMBER SIGN) shall be written as itself or by using its 2-digit
hexadecimal code, preceded by the NUMBER SIGN.
c) Any character that is not a regular character shall be written
using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.
NOTE 1: There is not a unique encoding of names into the PDF file
because regular characters may be coded in either of two ways.
White space used as part of a name shall always be coded using the
2-digit hexadecimal notation and no white space may intervene between
the SOLIDUS and the encoded name.
Regular characters that are outside the range EXCLAMATION MARK(21h)
(!) to TILDE (7Eh) (~) should be written using the hexadecimal
notation.
The token SOLIDUS (a slash followed by no regular characters)
introduces a unique valid name defined by the empty sequence of
characters.
NOTE 2 The examples shown in Table 4 and containing # are not valid
literal names in PDF 1.0 or 1.1.
I'm not copy/pasting table 4, but I don't see any example that uses characters that consist of two bytes. Can you share a PDF that contains a name with a two-byte character that behaves in the way you desire? The PDF specification explicitly says that characters in the context of names are 8-bit values. You seem to be talking about 16-bit values...
Additional note: in the current implementation of iText, we only look at 8 bits:
c = (char)(chars[k] & 0xff);
We deliberately throw away all the higher bits when characters with more than 8 bits are passed.
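To make the quoted rules concrete, here is a rough illustration of the escaping they describe, applied byte by byte; this is a sketch of rules a) to c), not iText's actual implementation:

public class PdfNameEscaper {
    // Treat delimiters, white space, '#' and anything outside 21h-7Eh as
    // non-regular and write those bytes as #xx, as the quoted rules describe.
    static String escapeNameByte(int b) {
        boolean regular = b >= 0x21 && b <= 0x7E
                && b != '#' && "()<>[]{}/%".indexOf(b) < 0;
        return regular ? String.valueOf((char) b) : String.format("#%02x", b);
    }

    public static void main(String[] args) {
        StringBuilder name = new StringBuilder("/");
        for (byte b : "Special Character: ä".getBytes(java.nio.charset.StandardCharsets.ISO_8859_1)) {
            name.append(escapeNameByte(b & 0xff));
        }
        System.out.println(name);   // /Special#20Character:#20#e4
    }
}

Note that everything operates on single bytes (0 to 255); that is the sense in which name characters are 8-bit values.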
Actually, I think I have answered your question. Initially, I thought you were asking to add this character: http://www.fileformat.info/info/unicode/char/c3a4/index.htm
As it turns out, you only need "\u00e4" (ä). I've made a small code sample that demonstrates how one would add a custom entry to the DID containing this character: ChangeInfoDictionary.
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    Map<String, String> info = reader.getInfo();
    info.put("Special Character: \u00e4", "\u00e4");
    stamper.setMoreInfo(info);
    stamper.close();
    reader.close();
}
Granted, when you open the PDF in a PDF viewer, you don't necessarily see "Special Character: ä" as the key value, but that's a problem of the PDF viewer. When you open the PDF in a text editor, you clearly see:
/Special#20Character:#20#e4(ä)
Which means that iText has correctly escaped the special character.
However: as you pointed out in your comment, the character doesn't show up in Adobe Reader. Based on a PDF I created using Acrobat, I found a workaround by using the following code:
StringBuffer buf = new StringBuffer();
buf.append((char) 0xc3); // first byte of the UTF-8 encoding of "ä"
buf.append((char) 0xa4); // second byte of the UTF-8 encoding of "ä"
info.put(buf.toString(), "\u00e4");
Now the character is shown correctly. In other words: it's a matter of encoding...
Just wanted to share a little experiment in C# illustrating one rather effortless way of getting the special characters into the document info dictionary keys.
string inputString = "My key with åäö";
byte[] inputBytes = Encoding.UTF8.GetBytes(inputString);
string convertedString = Encoding.UTF7.GetString(inputBytes);
info.Add(convertedString, "My value with åäö");
(Here info is the Dictionary used for adding metadata.) Then just use the PdfStamper to get the info into the PDF. The metadata is stored correctly in the PDF and can be interpreted by Adobe Reader.
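The same byte-level trick can be written compactly in Java as well; a sketch, reusing the info map and stamper flow from the ChangeInfoDictionary example above:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class Utf8KeyDemo {
    public static void main(String[] args) {
        Map<String, String> info = new HashMap<String, String>();
        // re-encode the key so that each UTF-8 byte becomes its own char,
        // which is what the 0xc3/0xa4 StringBuffer workaround does by hand
        String key = new String("My key with åäö".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        info.put(key, "My value with åäö");
        // info would then be passed to stamper.setMoreInfo(info) as in the earlier example
        System.out.println(key);
    }
}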
I have an ETL process that regularly extracts code from an ODBC data source, manipulates it, and inserts it into my postgres database. One of the columns from this data source regularly has odd characters in it.
For the most part I can catch and convert all of the characters appropriately, but I have one character that exists in the ODBC data source, cannot be brought into postgres (all of the text after that character gets truncated), and I'm having a hard time identifying what the character is.
I can't even insert an example of the character directly into this post because it gets stripped out :/ The closest I can get is a screenshot of the character in TextMate (the only application in which I can actually see the character):
The character is the diamond between the 1 and 0. When my data comes in, everything after the 0 is truncated.
Is there a good way of identifying what this character is so I can figure out a way of stripping it out?
Per tripleee's comment on the original question post:
To identify the offending character, I looked at the hex values of the text.
There are a number of ways to do this, but the quickest for me was to dump the text into a utility application I have called HexFiend. Once the text was in and I highlighted the character, it showed the hex value "00".
A bit more investigation pointed towards the hex null value being used as a line terminator in C applications (which makes sense given the context of my project).
I've accounted for this null value in my ETL process so that it gets swapped out for a newline, and now everything is sunshine and daisies.
Thanks again for the help!
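For reference, both steps described above (spotting the odd character by its hex value and swapping the NUL for a newline) are short in code; a Java sketch with an illustrative value:

public class StripNulls {
    public static void main(String[] args) {
        String rawValue = "10\u0000everything after this was truncated"; // illustrative only
        // dump each character as hex to spot anything unusual, as described above
        for (char c : rawValue.toCharArray()) {
            System.out.printf("%02x ", (int) c);
        }
        System.out.println();
        // swap the NUL (hex 00) for a newline, mirroring the fix described above
        String cleaned = rawValue.replace('\u0000', '\n');
        System.out.println(cleaned);
    }
}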