Grauw’s blog

Safely dividing a UTF-8 string

May 29th, 2009

Rick DeNatale blogs about Safely dividing a UTF-8 String in Ruby. The goal is to split a string on a certain number of characters in a language with byte-oriented strings (such as Ruby 1.8 and PHP 5). As UTF-8 encodes certain characters as a sequence of bytes, one needs to make sure this split doesn’t happen mid-sequence.

While it works, his method is a bit crude and not as efficient as it could be. It iteratively splits the string and then does a ‘valid UTF-8’ test using some built-in Ruby functionality. This creates a number of objects and loops over them, only to discard them soon after. With a little more knowledge of UTF-8, this can be done more efficiently.

Basically, to check whether you can split a UTF-8 string at a certain point, look at the byte right after the split: it must not be within the range 0x80-0xBF (or 128-191 in decimal notation). If it is, move the split point one byte back (or forward) and check again.

UTF-8 is an ASCII-compatible encoding, so bytes < 0x80 represent regular ASCII characters. Bytes >= 0xC0 indicate the start of a multi-byte character. Breaking before either of these is safe. Bytes in the range 0x80-0xBF, however, are always the second, third or fourth byte of an encoded character, and thus you should not break before them or you will be breaking an encoded character mid-sequence. You can take a look at the UTF-8 Wikipedia page if you are interested in a little more detail.
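As a rough illustration of these byte ranges, here is a small Ruby sketch. The helper name utf8_continuation_byte? is made up for this example. It relies on the fact that the bytes 0x80-0xBF are exactly those with the bit pattern 10xxxxxx, so masking with 0xC0 and comparing against 0x80 is equivalent to the range check.

```ruby
# Hypothetical helper (not from the original post): true if the byte is a
# UTF-8 continuation byte, i.e. lies in 0x80..0xBF (bit pattern 10xxxxxx).
def utf8_continuation_byte?(byte)
  (byte & 0xC0) == 0x80
end

# "a" followed by "é", which UTF-8 encodes as the two bytes 0xC3 0xA9.
bytes = "a\xC3\xA9".bytes.to_a

# Byte offsets before which a split does not cut a character in half.
safe_offsets = (0...bytes.length).reject { |i| utf8_continuation_byte?(bytes[i]) }
```

Here offsets 0 and 1 come out as safe, while offset 2 is rejected because splitting there would separate 0xC3 from 0xA9.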

Now I am not actually familiar with Ruby, but this should be the code necessary to do a UTF-8-safe split in Ruby:

class String
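  # True if byte n is not a UTF-8 continuation byte, so splitting before
  # it is safe. Assumes Ruby 1.8, where self[n] returns the byte value.
  def utf8_safe_split_on_char?(n)
    self[n] < 0x80 || self[n] >= 0xC0
  end

  def utf8_safe_split(n)
    if length <= n
      [self, nil]
    else
      # Move back until the split no longer falls mid-sequence.
      until utf8_safe_split_on_char?(n)
        n = n - 1
      end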
      before = self[0, n]
      after = self[n..-1]
      [before, after.empty? ? nil : after]
    end
  end
end
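
The code above depends on the byte-oriented strings of Ruby 1.8. From Ruby 1.9 on, String#[] and String#length work on characters rather than bytes, so the same idea has to be spelled with the byte-level methods. The following is only a sketch under that assumption, for Ruby 1.9.3 or later. The method name utf8_safe_byte_split is invented for this example, and n is assumed to be a non-negative byte count.

```ruby
class String
  # Hypothetical byte-based variant of the split: returns [before, after],
  # with the split at or before byte offset n, never inside a character.
  def utf8_safe_byte_split(n)
    return [self, nil] if bytesize <= n
    # Step back while the byte at the split point is a continuation byte
    # (0x80..0xBF, i.e. bit pattern 10xxxxxx).
    n -= 1 while n > 0 && (getbyte(n) & 0xC0) == 0x80
    [byteslice(0, n), byteslice(n, bytesize - n)]
  end
end

# "héllo" is 6 bytes long; byte offset 2 falls in the middle of "é".
before, after = "h\u00E9llo".utf8_safe_byte_split(2)
```

In this example the split point moves back from byte 2 to byte 1, so before holds "h" and after holds "éllo".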

Grauw

Comments

Crude? Huh by Rick DeNatale at 2009-05-30 05:25

Laurens,

Actually my code isn’t as inefficient as you think, and is the result of carefully balancing several considerations. Ruby uses copy-on-write semantics for strings, so using the slice method (a.k.a. []) doesn’t do anything but make a string pointing to the right bytes in the original string. Nothing is copied.

Now, as it turns out, checking the character value is a bit more efficient than using unpack. I’d actually considered that, but I don’t like the ‘magic’ number, and I think that unpack is clearer and puts the burden of understanding UTF-8 on the standard library, for a small price in performance in a function which is far from the critical performance path.

It’s not a ‘magic’ number by GuyveR800 at 2009-05-31 23:51

It’s not a ‘magic’ number, it’s designed that way on purpose. It’s already a UTF-8-specific function, so why not make use of the specific standardized detection mechanism? Off-loading everything to an external library and expecting it to magically be optimal in every situation is not very realistic.

I’m not sure what you mean by the critical performance path comment. Splitting a string is a fairly low-level function, so who knows how performance-critical it can be in situations other than the one you had in mind.

Code is wrong? by Stephan Deibel at 2012-09-14 15:03

Doesn’t “self[n] < 0x80 || self[n] >= 0xC0” evaluate to true for all n? Should be && not || I think.

Re: Code is wrong? by Grauw at 2012-09-19 13:27

Er, no? A value can’t be both less than 128 and larger than 191 at the same time... :)