How to unescape a UTF-8 escaped byte array to an unescaped byte array without allocating a String

Question

I have a Span<byte> representing an escaped UTF-8 string, like:

Binary representation: byte[20] { 72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 32, 92, 117, 50, 48, 97, 99, 32, 33 }

Escaped representation: "Hello world \u20ac !"

Desired binary result: byte[17] { 72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 32, 226, 130, 172, 32, 33 }

I tried to transcode the escaped \u20ac with a GetString()/GetBytes() round trip: Encoding.UTF8.GetBytes(Encoding.UTF8.GetString())

But this does not unescape the input.

Is there any way to achieve the desired result?

// Not working solution
public void NotWorkingUnescape(ReadOnlySpan<byte> source, Span<byte> destination)
{
    var tmp = Encoding.UTF8.GetString(source);
    Encoding.UTF8.GetBytes(tmp, destination);
}

// Unknown solution
// UTF-8 escaped byte array -> UTF-8 unescaped byte array
public void FastUnescape(ReadOnlySpan<byte> source, Span<byte> destination)
{
    // ?
}

I tried the code you posted and it doesn't unescape anything. The result is the same as input. — silkfire, Dec 08 '19 at 20:06
You are right, I missed the point. Will update the question. — ycrumeyrolle, Dec 09 '19 at 09:17
The solution may come from the JsonReader parser: https://github.com/dotnet/runtime/blob/8fe5240a400898530a17f03b7ec544f54e538fcf/src/libraries/System.Text.Json/src/System/Text/Json/Reader/JsonReaderHelper.Unescaping.cs#L297 — ycrumeyrolle, Dec 09 '19 at 10:25

Julián · Answer 1 · 2019-12-08T21:54:23.853

Are you looking for a method that does all the work?

You could simply use this:

public void FastUnescape(ReadOnlySpan<byte> source, Span<byte> destination)
{
    Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(source), destination);
}

Or, to avoid an exception when the destination is too small:

public void FastUnescape(ReadOnlySpan<byte> source, Span<byte> destination)
{
    if (source.Length <= destination.Length)
    {
        Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(source), destination);
    }
}

Update:

There is another way to do the conversion without using Encoding.UTF8. Based on @JonSkeet's answer, you could implement the following:

public static void AnotherMethod(ReadOnlySpan<byte> source, Span<byte> destination)
{
    for (int i = 0; i < source.Length; i++)
    {
        destination[i] = (byte) (Convert.ToChar(source[i]));
    }
}

The problem with this code is that Convert.ToChar converts each byte to the equivalent Unicode character, not to a UTF-8 character, which is why & 0x7f is used in that answer to keep values in the ASCII range.

I did not test performance or behaviour with other special characters that you may want to escape; however, I obtained the same results.

Thanks for your code, but your solution is still allocating a `String`. My goal is to avoid this temporary allocation and manipulation. — ycrumeyrolle, Dec 08 '19 at 19:14
That temporary value is the conversion. You will always need it when using the `GetBytes` method; I don't see another way unless you use a totally different design. I have updated my answer. — Julián, Dec 08 '19 at 21:52
The updated solution looks fine for ASCII but, as you said, for UTF-8 this may not work as there is no 1-1 relationship. — ycrumeyrolle, Dec 09 '19 at 10:23
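
As the comments above note, a byte-for-byte copy cannot work because one six-byte \uXXXX escape maps to one, two, or three UTF-8 bytes. A minimal sketch of a manual approach that allocates no String could look like the following. The class and method names (Utf8Unescaper, Unescape, TryParseHex4, EncodeUtf8) are hypothetical and not part of any library. It assumes the input contains only \uXXXX escapes for code points in the Basic Multilingual Plane: surrogate pairs and other escapes such as \\ or \n are not handled, and this is not the JsonReaderHelper implementation linked above.

```csharp
using System;

public static class Utf8Unescaper
{
    // Hypothetical helper: copies source to destination, replacing each
    // \uXXXX escape with the UTF-8 encoding of that code point.
    // Returns the number of bytes written to destination.
    // Assumption: BMP code points only, no surrogate pairs, no other escapes.
    public static int Unescape(ReadOnlySpan<byte> source, Span<byte> destination)
    {
        int written = 0;
        for (int i = 0; i < source.Length; i++)
        {
            byte b = source[i];
            // An escape occupies six bytes: '\', 'u', and four hex digits.
            if (b == (byte)'\\'
                && i + 5 < source.Length
                && source[i + 1] == (byte)'u'
                && TryParseHex4(source.Slice(i + 2, 4), out int codePoint))
            {
                written += EncodeUtf8(codePoint, destination.Slice(written));
                i += 5; // skip 'u' and the four hex digits
            }
            else
            {
                // Ordinary byte, or a malformed escape: copy it unchanged.
                destination[written++] = b;
            }
        }
        return written;
    }

    // Parses exactly four ASCII hex digits into an integer.
    private static bool TryParseHex4(ReadOnlySpan<byte> hex, out int value)
    {
        value = 0;
        foreach (byte h in hex)
        {
            int digit;
            if (h >= '0' && h <= '9') digit = h - '0';
            else if (h >= 'a' && h <= 'f') digit = h - 'a' + 10;
            else if (h >= 'A' && h <= 'F') digit = h - 'A' + 10;
            else return false;
            value = (value << 4) | digit;
        }
        return true;
    }

    // Writes a code point in the range U+0000..U+FFFF as UTF-8.
    private static int EncodeUtf8(int cp, Span<byte> dest)
    {
        if (cp < 0x80)
        {
            dest[0] = (byte)cp;
            return 1;
        }
        if (cp < 0x800)
        {
            dest[0] = (byte)(0xC0 | (cp >> 6));
            dest[1] = (byte)(0x80 | (cp & 0x3F));
            return 2;
        }
        dest[0] = (byte)(0xE0 | (cp >> 12));
        dest[1] = (byte)(0x80 | ((cp >> 6) & 0x3F));
        dest[2] = (byte)(0x80 | (cp & 0x3F));
        return 3;
    }
}
```

Because each six-byte escape shrinks to at most three bytes, a destination as long as the source is always large enough under these assumptions. For the 20-byte input from the question, Unescape returns 17 and the destination starts with the desired bytes, with \u20ac encoded as 226, 130, 172.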
