.net - How do I get a consistent byte representation of strings in C# without manually specifying an encoding?

Tags : c#, .net, string, character-encoding

Top 5 Answers for .net - How do I get a consistent byte representation of strings in C# without manually specifying an encoding?

Votes: 99

Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!

Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in".
(And, of course, to be able to re-construct the string from the bytes.)

For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.

Just do this instead:

static byte[] GetBytes(string str) {
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

// Do NOT use on arbitrary bytes; only use on GetBytes's output on the SAME system
static string GetString(byte[] bytes) {
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

As long as your program (or any other program) doesn't try to interpret the bytes, which you didn't say you intend to do, there is nothing wrong with this approach! The bytes are simply the string's in-memory UTF-16 code units in the machine's byte order, which is why they should only be read back on the same system. Worrying about encodings just makes your life more complicated for no real reason.

Additional benefit to this approach: It doesn't matter if the string contains invalid characters (for example, unpaired surrogates), because you can still get the data and reconstruct the original string anyway!

It will be encoded and decoded just the same, because you are just looking at the bytes.

If you used a specific encoding, though, it would've given you trouble with encoding/decoding invalid characters.
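
To make the point about invalid characters concrete, here is a minimal sketch (an illustration added here, not code from the question). It assumes a string that holds an unpaired surrogate, copies its chars with the same Buffer.BlockCopy idea, and compares that with a UTF-8 round trip. The variable names are made up for the example, and it is written as top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Text;

// A lone high surrogate (0xD800) is not valid Unicode text,
// but a .NET string can still hold it.
string original = new string(new[] { 'A', (char)0xD800, 'Z' });

// Raw copy of the UTF-16 code units, then copy them straight back.
byte[] raw = new byte[original.Length * sizeof(char)];
Buffer.BlockCopy(original.ToCharArray(), 0, raw, 0, raw.Length);
char[] chars = new char[raw.Length / sizeof(char)];
Buffer.BlockCopy(raw, 0, chars, 0, raw.Length);
string viaRaw = new string(chars);

// Round trip through UTF-8: the default encoder substitutes U+FFFD
// for the unpaired surrogate, so the original char is lost.
string viaUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(original));

Console.WriteLine(viaRaw == original);  // True
Console.WriteLine(viaUtf8 == original); // False
```

The raw copy survives because nothing ever looks at what the chars mean; the UTF-8 path has to interpret them and cannot represent the lone surrogate.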

Votes: 81

It depends on the encoding of your string (ASCII, UTF-8, ...).

For example:

byte[] b1 = System.Text.Encoding.UTF8.GetBytes (myString);
byte[] b2 = System.Text.Encoding.ASCII.GetBytes (myString);

A small sample why encoding matters:

string pi = "\u03a0";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes (pi);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes (pi);

Console.WriteLine (ascii.Length); // Will print 1
Console.WriteLine (utf8.Length); // Will print 2
Console.WriteLine (System.Text.Encoding.ASCII.GetString (ascii)); // Will print '?'

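ASCII only covers code points 0 through 127, so it simply isn't equipped to deal with characters such as Π; the encoder substitutes '?' for anything it cannot represent.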

Internally, the .NET framework uses UTF-16 to represent strings, so if you simply want to get the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes (...).
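
As a rough illustration of that last point (a sketch added here, with made-up variable names, written as top-level statements and so assuming C# 9 or later): Encoding.Unicode is UTF-16 little-endian, so each char becomes two bytes, low byte first, and the bytes decode back to the same string.

```csharp
using System;
using System.Text;

string text = "\u03a0"; // Greek capital letter pi, U+03A0

// Encoding.Unicode is UTF-16 little-endian; GetBytes adds no byte order mark.
byte[] utf16 = Encoding.Unicode.GetBytes(text);
Console.WriteLine(utf16.Length);                 // 2
Console.WriteLine(BitConverter.ToString(utf16)); // A0-03

// Decoding with the same encoding restores the string.
string back = Encoding.Unicode.GetString(utf16);
Console.WriteLine(back == text);                 // True
```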

See Character Encoding in the .NET Framework (MSDN) for more information.

Votes: 78

The accepted answer is very, very complicated. Use the included .NET classes for this:

const string data = "A string with international characters: Norwegian: ÆØÅæøå, Chinese: 喂 谢谢";
var bytes = System.Text.Encoding.UTF8.GetBytes(data);
var decoded = System.Text.Encoding.UTF8.GetString(bytes);

Don't reinvent the wheel if you don't have to...

Votes: 70

using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Windows.Forms;

// Note: BinaryFormatter is obsolete in .NET 5 and later and is unsafe for untrusted data.
// Its output also includes serialization headers, not just the string's characters.
BinaryFormatter bf = new BinaryFormatter();
byte[] bytes;
MemoryStream ms = new MemoryStream();

string orig = "喂 Hello 谢谢 Thank You";
bf.Serialize(ms, orig);
ms.Seek(0, SeekOrigin.Begin);
bytes = ms.ToArray();

MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());
MessageBox.Show("Original string Length: " + orig.Length.ToString());

for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt

// Deserialize the untouched bytes back into a string
BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, SeekOrigin.Begin);
string sx = (string)bfx.Deserialize(msx);

MessageBox.Show("Still intact :" + sx);
MessageBox.Show("Deserialize string Length(still intact): " + sx.Length.ToString());

// Serialize again to confirm the byte count matches the first pass
BinaryFormatter bfy = new BinaryFormatter();
MemoryStream msy = new MemoryStream();
bfy.Serialize(msy, sx);
msy.Seek(0, SeekOrigin.Begin);
byte[] bytesy = msy.ToArray();

MessageBox.Show("Deserialize bytes Length(still intact): " + bytesy.Length.ToString());

Votes: 54

You need to take the encoding into account, because 1 character could be represented by 1 or more bytes (up to 4 in UTF-8 as currently specified; the original UTF-8 design allowed up to 6), and different encodings will treat these bytes differently.
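
A quick way to see this for yourself is to ask a few encodings how many bytes they need for the same character. This is only a sketch added for illustration: the sample characters are my own choice, and it is written as top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Text;

// 'A', Greek capital pi, a CJK character, and an emoji outside the BMP.
string[] samples = { "A", "\u03a0", "\u5582", "\U0001F600" };

foreach (string s in samples)
{
    int codePoint = char.ConvertToUtf32(s, 0);
    Console.WriteLine(
        $"U+{codePoint:X4}: " +
        $"UTF-8={Encoding.UTF8.GetByteCount(s)}, " +
        $"UTF-16={Encoding.Unicode.GetByteCount(s)}, " +
        $"UTF-32={Encoding.UTF32.GetByteCount(s)}");
}
// U+0041: UTF-8=1, UTF-16=2, UTF-32=4
// U+03A0: UTF-8=2, UTF-16=2, UTF-32=4
// U+5582: UTF-8=3, UTF-16=2, UTF-32=4
// U+1F600: UTF-8=4, UTF-16=4, UTF-32=4
```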

Joel Spolsky has an article on this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
