This isn't Java-specific, but when I write OutputStream os = sock.getOutputStream();
is there a way to determine the stream's character encoding? Or do I have to know the encoding ahead of time to read it properly? This is for an arbitrary socket connection.
There are ways to detect text encoding; web browsers do it, for example.
Universal Encoding Detector is an implementation in Python that might give you some help.
Edit:
Here is one for java: http://jchardet.sourceforge.net/
Edit2:
Here is another SO question: How can I detect the encoding/codepage of a text file
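As a lightweight complement to a statistical detector like jchardet, a byte-order mark (BOM) is one of the few cases where the encoding is explicitly signalled at the start of the data. A minimal sketch (class name my own; a null result just means "unknown", not US-ASCII):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Returns the charset implied by a BOM at the start of the data, or null if none.
    // Most streams carry no BOM, so null is the common case.
    static Charset sniffBom(byte[] data) {
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE; // could also be UTF-32LE; a full sniffer checks further
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(sniffBom(utf8));                     // UTF-8
        System.out.println(sniffBom("plain".getBytes(StandardCharsets.US_ASCII))); // null
    }
}
```

Anything beyond a BOM check is heuristic, which is exactly what libraries like jchardet do for you.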
Streams do not have associated charsets. They just pass around arbitrary data. You have to know the data's charset ahead of time in order to interpret the data.
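In practice, then, you agree on a charset out of band and pass it explicitly when wrapping the stream. A sketch using in-memory streams to stand in for the socket (the second call shows the mojibake you get when the two sides disagree):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Encode with one charset, decode with another: only correct if both sides agree.
    static String roundTrip(String text, Charset writeCs, Charset readCs) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for sock.getOutputStream()
        Writer w = new OutputStreamWriter(sink, writeCs);         // charset chosen explicitly, never inferred
        w.write(text);
        w.close();
        BufferedReader r = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(sink.toByteArray()), readCs));
        return r.readLine();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("héllo", StandardCharsets.UTF_8, StandardCharsets.UTF_8));      // héllo
        System.out.println(roundTrip("héllo", StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)); // hÃ©llo
    }
}
```

The bytes on the wire are identical in both calls; only the reader's assumption changes, which is the whole point.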
I don't have any background in this subject.
To try to understand it better, I read:
Multihash
CIDv1: Multicodec prefix
From what I understand, the multihash identifies the algorithm used to hash the value (one way), so we can't go back (we can't decode the hash back into the value).
Questions
I don't understand, in simple words, what a multicodec is, and whether it's related to decoding the hash back into a value (which makes no sense).
What is the motivation for the multicodec prefix?
The multicodec is related to decoding the value the hash points to, if that makes it easier to understand. Don't worry, no magic hash decoding is happening ;). Remember we're making CIDs, and we can use CIDs to lookup content. However then we have the question of "how do we decode this data we just retrieved?", the multicodec solves that problem for us. Reading From Data to Data Structures might help clear up some confusion.
The multicodec prefix allows IPFS to evolve to support new and different encodings for the data that's actually put into IPFS. This refers to IPLD, and you can actually find the answer you're looking for under Links (with information about the codecs under Codecs):
For links we use a CID. A CID is an extension of multihash, in fact a multihash is part of a CID. We simply add a codec to a multihash that tells us what format the data is in (JSON, CBOR, Bitcoin, Ethereum, etc). This way, we can actually link between data in different formats and any link to data anyone ever gives us can be decoded so that it can become more than just a series of bytes.
CID is a standard that anyone can implement, even people that have no other interest in IPLD beyond the need for hash links to different data types can use it.
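To make the layout concrete, here is a sketch that pulls the pieces out of a raw binary CIDv1: a varint version, a varint multicodec, then the multihash (itself a varint hash-function code, a varint digest length, and the digest). The codec and hash codes below (0x71 = dag-cbor, 0x12 = sha2-256) are from the multicodec table; the digest bytes are dummies for illustration:

```java
public class CidSketch {
    // Decode an unsigned varint starting at pos[0]; advances pos[0] past it.
    static long readVarint(byte[] b, int[] pos) {
        long value = 0;
        int shift = 0;
        while ((b[pos[0]] & 0x80) != 0) {           // high bit set: more bytes follow
            value |= (long) (b[pos[0]++] & 0x7F) << shift;
            shift += 7;
        }
        value |= (long) (b[pos[0]++] & 0x7F) << shift;
        return value;
    }

    public static void main(String[] args) {
        byte[] digest = new byte[32];               // fake sha2-256 digest, just for shape
        byte[] cid = new byte[4 + digest.length];
        cid[0] = 0x01;                              // CID version 1
        cid[1] = 0x71;                              // multicodec: dag-cbor ("how to decode the content")
        cid[2] = 0x12;                              // multihash fn: sha2-256
        cid[3] = 0x20;                              // digest length: 32
        System.arraycopy(digest, 0, cid, 4, digest.length);

        int[] pos = {0};
        System.out.println("version = " + readVarint(cid, pos));                      // 1
        System.out.println("codec   = 0x" + Long.toHexString(readVarint(cid, pos)));  // 71
        System.out.println("hash fn = 0x" + Long.toHexString(readVarint(cid, pos)));  // 12
        System.out.println("length  = " + readVarint(cid, pos));                      // 32
    }
}
```

Note how the multicodec byte says nothing about the hash; it only tells the retriever what format the fetched bytes are in.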
I apologize in advance for sounding ignorant when I ask this question, but I'm not very good at conceptualizing encoding and decoding data.
As an example, I have access to a MIME encoded text with the following value:
=?GBK?B?xqw=?=
I know (or am pretty sure I know) that it's encoded using GB2312. Running it through online decoders tells me it's the word "sheet" in English. Is there a way to decode this into its source-language characters from PowerShell, so that I can put them into a third-party translator and read them in English? I feel like an idiot for asking, because I'm not even sure I'm asking in an intelligent way, given my general lack of understanding of the core pieces involved.
I've tried looking through the Encoding class, but from what I can tell it doesn't have anything that supports this format. Are there any other modules available that I'm not aware of that can facilitate this?
Thank you for any assistance someone may provide and appreciate being schooled on this topic.
The easiest way is to use either System.Web.HttpUtility.UrlDecode or the System.Net.Mail.Attachment class in the Microsoft framework. With the latter you can do:
$unicodeString = "=?GBK?B?xqw=?="
$attachment = [System.Net.Mail.Attachment]::CreateAttachmentFromString("", $unicodeString)
Write-Host $attachment.Name
It prints 片, which is "sheet" in Chinese, according to Google Translate.
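For reference, the same decode can be done by hand: strip the =?GBK?B?...?= wrapper, base64-decode the payload, and decode the resulting bytes as GBK. A sketch in Java (the PowerShell equivalents would be [Convert]::FromBase64String and [Text.Encoding]::GetEncoding('GBK'); the helper below handles only the B-encoding, not the Q-encoding):

```java
import java.nio.charset.Charset;
import java.util.Base64;

public class MimeWordDemo {
    // Decode a single RFC 2047 encoded-word of the form =?charset?B?base64?=
    static String decodeEncodedWord(String word) {
        String[] parts = word.split("\\?");          // ["=", "GBK", "B", "xqw=", "="]
        String charset = parts[1];                   // charset name carried in the word itself
        byte[] raw = Base64.getDecoder().decode(parts[3]);
        return new String(raw, Charset.forName(charset));
    }

    public static void main(String[] args) {
        System.out.println(decodeEncodedWord("=?GBK?B?xqw=?=")); // 片
    }
}
```

The key insight is that the encoded-word carries its own charset label, so no guessing is needed here.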
I have a more or less lossless recording of a modem session in Ogg. The main problem is that I don't know how to translate the data into some understandable form. Is there any modem tool capable of converting the data to some meaningful format?
If not, what's the best practice for finding out by brute force what the original content was before modulation? I mean something like generating possible data until the original file and the generated one match.
Thank you
I would ask this guy: http://www.whence.com/minimodem/
He has implemented the early protocols in minimodem, but he might be aware of other tools for more recent protocols.
My requirement is to compress an XML file into a binary format, transmit it, and decompress it (lightning fast) before I start parsing it.
There are quite a few binary XML protocols and tools available. I found EXI (Efficient XML Interchange) better than the others. I tried its open-source implementation, EXIficient, and found it good.
I heard about Google Protocol Buffers and Facebook's Thrift; can anyone tell me if these two can do the job I am looking for?
OR just let me know if there is anything better than EXI I should look for.
Also, there is a good XML parser, VTD-XML (haven't tried it myself, just googled and read some articles about it), that achieves better parsing performance than DOM, SAX, and StAX.
I want best of both worlds, best compression + best parsing performance, any suggestions?
One more thing regarding EXI: how can EXI claim to be fast at parsing a decoded XML file? Because it is still parsed by DOM, SAX, or StAX? I would have believed this if there were another binary parser for reading the decoded version. Correct me if I am wrong.
ALSO, is there any good C++ open-source implementation of the EXI format? A Java version is available (EXIficient), but I am not able to spot a C++ open-source implementation.
There is one by AgileDelta, but that's commercial.
You mention protocol buffers (protobuf); this is a binary format, but it has no direct relationship to XML. In particular, no member names (element names / attribute names / namespaces) are encoded - it is just the data (with numeric markers for identifiers).
As such, you cannot reconstruct arbitrary XML from a protobuf stream unless you already know how to map "field 3" etc.
However! If you have an object-model that works with both XML and protobuf, the transform is trivial; deserialize with either - serialize with either. How well this works depends on the implementation; for example, it is trivial with protobuf-net and is actually how I do the codegen (load the binary; write as XML; run the XML through an xslt layer to emit code).
If you actually just want to transfer object data (and XML is just a proposed implementation detail), then I thoroughly recommend protobuf; platform independent, a wide range of implementations, version-tolerant, very small output, and very fast processing at both read and write.
Nadeem,
These are very good questions. You might be new to the domain, but the same questions are frequently asked by XML veterans. I'll try to address each of them.
I heard about Google Protocol Buffers and Facebook's Thrift; can anyone tell me if these two can do the job I am looking for?
As mentioned by Marc, Protocol Buffers and Thrift are binary data formats, but they are not designed to transport XML data. E.g., they have no support for XML concepts like namespaces, attributes, etc., so mapping between XML and these binary formats would require a fair bit of work on your part.
OR just let me know if there is anything better than EXI I should look for.
EXI is likely your best bet. The W3C completed a pretty thorough analysis of XML format implementations and found the EXI implementation (Efficient XML) consistently achieved the best compactness and was one of the fastest. They also found it consistently achieved better compactness than GZIP compression and even packed binary formats like ASN.1 PER (see W3C EXI Evaluation). None of the other XML formats were able to do that. In the tests I've seen comparing EXI with Protocol Buffers, EXI was at least 2-4 times smaller.
I want best of both worlds, best compression + best parsing performance, any suggestions?
If it is an option, you might want to consider the commercial products. The W3C EXI tests mentioned above used Efficient XML, which is much faster than EXIficient (sometimes >10 times faster parsing and >20 times faster serializing). Your mileage may vary, so you should test it yourself if it is an option.
One more thing regarding EXI, how can EXI claim to be fast at parsing a decoded XML file?
The reason EXI can be smaller and faster to parse than XML is because EXI can be streamed directly to/from memory via the standard XML APIs without ever producing the data in an intermediate XML format. So, instead of serializing your data as XML via a standard API, compressing the XML, sending the compressed XML, decompressing the XML on the other end, then parsing it through one of the XML APIs, ... you can serialize your data directly as EXI via a standard XML API, send the EXI, then parse the EXI directly through one of the XML APIs on the other side. This is a fundamental difference between compression and EXI. EXI is not compression per-se -- it is a more efficient XML format that can be streamed directly to/from your application.
Hope this helps!
Compression is unified with the grammar system in the EXI format. The decoder API generally gives you a sequence of events, such as SAX events, when you let decoders process EXI streams; however, decoders do not internally convert EXI back into XML text to feed into another parser. Instead, the decoder does all the convoluted decompression/scanning work to yield an API event sequence such as SAX. Because EXI and XML are compatible at the event level, it is fairly straightforward to write out XML text given an event sequence.
I'm doing some simple socket programming in C#. I am attempting to authenticate a user by reading the username and password from the client console, sending the credentials to the server, and returning the authentication status from the server. Basic stuff. My question is, how do I ensure that the data is in a format that both the server and client expect?
For example, here's how I read the user credentials on the client:
Console.WriteLine("Enter username: ");
string username = Console.ReadLine();
Console.WriteLine("Enter password: ");
string password = Console.ReadLine();
StreamWriter clientSocketWriter = new StreamWriter(new NetworkStream(clientSocket));
clientSocketWriter.WriteLine(username + ":" + password);
clientSocketWriter.Flush();
Here I am delimiting the username and password with a colon (or some other symbol) on the client side. On the server I simply split the string using ":" as the token. This works, but it seems sort of... unsafe. Shouldn't there be some sort of delimiter token that is shared between client and server so I don't have to just hard-code it in like this?
It's a similar matter for the server response. If the authentication is successful, how do I send a response back in a format that the client expects? Would I simply send a "SUCCESS" or "AuthSuccessful=True/False" string? How would I ensure the client knows what format the server sends data in (other than just hard-coding it into the client)?
I guess what I am asking is how to design and implement an application-level protocol. I realize it is sort of unique to your application, but what is the typical approach that programmers generally use? Furthermore, how do you keep the format consistent? I would really appreciate some links to articles on this matter as well.
Rather than reinvent the wheel, why not code up an XML schema and send and receive XML "files"?
Your messages will certainly be longer, but with gigabit Ethernet and ADSL this hardly matters these days. What you do get is a protocol where all the issues of character sets and complex data structures have already been solved, plus an embarrassing choice of tools and libraries to support and ease your development.
I highly recommend using plain ASCII text if at all possible.
It makes bugs much easier to detect and fix.
Some common, machine-readable ASCII text protocols (roughly in order of complexity):
netstring
Tab Delimited Tables
Comma Separated Values (CSV) (strings that include both commas and double-quotes are a little awkward to handle correctly)
INI file format
property list format
JSON
YAML Ain't Markup Language
XML
The world is already complicated enough, so I try to use the least-complex protocol that would work.
Sending two user-generated strings from one machine to another -- netstrings is the simplest protocol on my list that would work for that, so I would pick netstrings.
(netstrings will work fine even if the user types in a few colons or semicolons or double-quotes or tabs -- unlike other formats that choke on certain commonly-typed characters).
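A netstring is just the payload's byte length, a colon, the bytes, and a trailing comma -- the length prefix is what makes user-typed delimiters harmless. A minimal sketch in Java (Netstring is my own class name; a real implementation would validate the trailing comma and handle framing errors):

```java
import java.nio.charset.StandardCharsets;

public class Netstring {
    // Encode: "<length>:<bytes>,"
    static String encode(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        return b.length + ":" + s + ",";
    }

    // Decode one netstring from the start of the input.
    // The length counts bytes; this sketch assumes ASCII-safe payloads for simplicity.
    static String decode(String netstring) {
        int colon = netstring.indexOf(':');
        int len = Integer.parseInt(netstring.substring(0, colon));
        return netstring.substring(colon + 1, colon + 1 + len);
    }

    public static void main(String[] args) {
        String wire = encode("user:pa:ss");   // colons in the payload are fine
        System.out.println(wire);             // 10:user:pa:ss,
        System.out.println(decode(wire));     // user:pa:ss
    }
}
```

Because the receiver reads the declared length first, no character in the payload can ever be mistaken for a delimiter.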
I agree that it would be nice if there existed some way to describe a protocol in a single shared file, such that both the server and the client could somehow "#include" or otherwise use that protocol.
Then when I fix a bug in the protocol, I could fix it in one place, recompile both the server and the client, and then things would Just Work -- rather than digging through a bunch of hard-wired constants on both sides.
Kind of like the way well-written C code and C++ code uses function prototypes in header files so that the code that calls the function on one side, and the function itself on the other side, can pass parameters in a way that both sides expect.
Tell me if you discover anything like that, OK?
Basically, you're looking for a standard. "The great thing about standards is that there are so many to choose from". Pick one and go with it, it's a lot easier than rolling your own. For this particular situation, look into Apache "basic" authentication, which joins the username and password and base64-encodes it, as one possibility.
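To make the Basic-auth suggestion concrete, it is nothing more than base64 over "username:password". A sketch in Java (credentials made up; note the server must split on the first colon only, since passwords may contain colons while usernames may not):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthDemo {
    static String encodeBasic(String user, String pass) {
        return Base64.getEncoder()
                .encodeToString((user + ":" + pass).getBytes(StandardCharsets.UTF_8));
    }

    // Split on the FIRST colon only: everything after it belongs to the password.
    static String[] decodeBasic(String token) {
        String decoded = new String(Base64.getDecoder().decode(token), StandardCharsets.UTF_8);
        int i = decoded.indexOf(':');
        return new String[]{decoded.substring(0, i), decoded.substring(i + 1)};
    }

    public static void main(String[] args) {
        String token = encodeBasic("alice", "s3:cret");  // made-up credentials
        System.out.println("Authorization: Basic " + token);
        String[] parts = decodeBasic(token);
        System.out.println(parts[0] + " / " + parts[1]); // alice / s3:cret
    }
}
```

Keep in mind base64 is encoding, not encryption -- Basic auth is only safe over a TLS connection.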
I have worked with two main approaches.
The first is an ASCII-based protocol.
An ASCII-based protocol is usually based on a set of text commands that terminate at some defined delimiter (like a carriage return or semicolon, or an XML or JSON document boundary). If your protocol is a command-based protocol where there is not a lot of data being transferred back and forth, then this is the best way to go.
FIND\r
DO_SOMETHING\r
It has the advantage of being easy to read and understand because it is text based.
The disadvantage (it may not be a problem, but it can be) is that an unknown number of bytes may be transferred back and forth between the client and the server. So if you need to know exactly how many bytes are being sent and received, this may not be the type of protocol you want.
The other type of protocol is binary-based, with fixed-size messages whose size is sent in the header. This has the advantage that the client knows exactly how much data to expect. It can also potentially save you bandwidth, depending on what you're sending across (although ASCII can save space too; it depends on your application's requirements). The disadvantage of a binary-based protocol is that it is difficult to understand just by looking at it, requiring you to constantly consult the documentation.
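The fixed-size-header idea can be sketched with a 4-byte big-endian length prefix, which DataOutputStream/DataInputStream handle directly (the in-memory streams below stand in for a socket):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FramingDemo {
    static void writeFrame(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);   // fixed 4-byte header: exact payload size
        out.write(payload);
    }

    static byte[] readFrame(DataInputStream in) throws IOException {
        int len = in.readInt();         // receiver knows exactly how much to read
        byte[] payload = new byte[len];
        in.readFully(payload);          // blocks until the whole frame has arrived
        return payload;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream(); // stands in for a socket
        DataOutputStream out = new DataOutputStream(wire);
        writeFrame(out, "LOGIN alice".getBytes("UTF-8"));
        writeFrame(out, "DO_SOMETHING".getBytes("UTF-8"));

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire.toByteArray()));
        System.out.println(new String(readFrame(in), "UTF-8")); // LOGIN alice
        System.out.println(new String(readFrame(in), "UTF-8")); // DO_SOMETHING
    }
}
```

This gives you the binary protocol's "exact byte count" property while keeping the payloads themselves human-readable text, which is one way of mixing the two strategies.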
In practice, I tend to mix both strategies in protocols I have defined based on my application's requirements.