A list of all available normalization forms. See www.unicode.org/reports/tr15/tr15-29.html for more information about normalization.
The Unicode version that is supported by the implementation
Hangul character boundaries and properties
All the Unicode whitespace
The BOM (byte order mark) can also be seen as whitespace: it is a non-rendering character used to distinguish between little-endian and big-endian byte order. Byte order is not an issue in UTF-8, so the BOM must be ignored.
The default normalization used for operations that require normalization. It can be set to any of the normalizations in NORMALIZATION_FORMS.
Example:
ActiveSupport::Multibyte::Unicode.default_normalization_form = :c
Compose decomposed characters to the composed form.
# File lib/active_support/multibyte/unicode.rb, line 166
def compose_codepoints(codepoints)
  pos = 0
  eoa = codepoints.length - 1
  starter_pos = 0
  starter_char = codepoints[0]
  previous_combining_class = -1
  while pos < eoa
    pos += 1
    lindex = starter_char - HANGUL_LBASE
    # -- Hangul
    if 0 <= lindex and lindex < HANGUL_LCOUNT
      vindex = codepoints[starter_pos+1] - HANGUL_VBASE rescue vindex = -1
      if 0 <= vindex and vindex < HANGUL_VCOUNT
        tindex = codepoints[starter_pos+2] - HANGUL_TBASE rescue tindex = -1
        if 0 <= tindex and tindex < HANGUL_TCOUNT
          j = starter_pos + 2
          eoa -= 2
        else
          tindex = 0
          j = starter_pos + 1
          eoa -= 1
        end
        codepoints[starter_pos..j] = (lindex * HANGUL_VCOUNT + vindex) * HANGUL_TCOUNT + tindex + HANGUL_SBASE
      end
      starter_pos += 1
      starter_char = codepoints[starter_pos]
    # -- Other characters
    else
      current_char = codepoints[pos]
      current = database.codepoints[current_char]
      if current.combining_class > previous_combining_class
        if ref = database.composition_map[starter_char]
          composition = ref[current_char]
        else
          composition = nil
        end
        unless composition.nil?
          codepoints[starter_pos] = composition
          starter_char = composition
          codepoints.delete_at pos
          eoa -= 1
          pos -= 1
          previous_combining_class = -1
        else
          previous_combining_class = current.combining_class
        end
      else
        previous_combining_class = current.combining_class
      end
      if current.combining_class == 0
        starter_pos = pos
        starter_char = codepoints[pos]
      end
    end
  end
  codepoints
end
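The Hangul branch of compose_codepoints is plain arithmetic on jamo indices. The following is a minimal sketch of that one step in isolation. It assumes the Hangul constants given in the Unicode Standard; this page does not list the library's HANGUL_* values, so the numbers below are an assumption about what they contain, and HangulComposeSketch is a hypothetical name used only for illustration.

```ruby
# Sketch only: constants taken from the Unicode Standard's Hangul algorithm,
# assumed (not confirmed by this page) to match the library's HANGUL_* values.
module HangulComposeSketch
  SBASE  = 0xAC00 # first precomposed syllable
  LBASE  = 0x1100 # first leading consonant
  VBASE  = 0x1161 # first vowel
  TBASE  = 0x11A7 # one below the first trailing consonant
  LCOUNT = 19
  VCOUNT = 21
  TCOUNT = 28

  module_function

  # Combine a leading consonant, a vowel and an optional trailing consonant
  # into the codepoint of one precomposed syllable.
  def compose(l, v, t = nil)
    lindex = l - LBASE
    vindex = v - VBASE
    tindex = t.nil? ? 0 : t - TBASE
    unless (0...LCOUNT).cover?(lindex) && (0...VCOUNT).cover?(vindex) && (0...TCOUNT).cover?(tindex)
      raise ArgumentError, 'not a composable jamo sequence'
    end
    (lindex * VCOUNT + vindex) * TCOUNT + tindex + SBASE
  end
end

syllable = HangulComposeSketch.compose(0x1112, 0x1161, 0x11AB)
puts format('U+%04X', syllable) # prints U+D55C
```

The last expression in compose mirrors the assignment to codepoints[starter_pos..j] in the listing above.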
Decompose composed characters to the decomposed form.
# File lib/active_support/multibyte/unicode.rb, line 145
def decompose_codepoints(type, codepoints)
  codepoints.inject([]) do |decomposed, cp|
    # if it's a hangul syllable starter character
    if HANGUL_SBASE <= cp and cp < HANGUL_SLAST
      sindex = cp - HANGUL_SBASE
      ncp = [] # new codepoints
      ncp << HANGUL_LBASE + sindex / HANGUL_NCOUNT
      ncp << HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT
      tindex = sindex % HANGUL_TCOUNT
      ncp << (HANGUL_TBASE + tindex) unless tindex == 0
      decomposed.concat ncp
    # if the codepoint is decomposable with the current decomposition type
    elsif (ncp = database.codepoints[cp].decomp_mapping) and (!database.codepoints[cp].decomp_type || type == :compatability)
      decomposed.concat decompose_codepoints(type, ncp.dup)
    else
      decomposed << cp
    end
  end
end
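The Hangul branch of decompose_codepoints can be tried on its own. This is a rough sketch under the same assumption as the listing's constant names suggest: the values come from the Unicode Standard's Hangul algorithm and are not shown on this page, and HangulDecomposeSketch is a hypothetical module name.

```ruby
# Sketch only: constants from the Unicode Standard, assumed to match the
# library's HANGUL_* values.
module HangulDecomposeSketch
  SBASE  = 0xAC00
  LBASE  = 0x1100
  VBASE  = 0x1161
  TBASE  = 0x11A7
  TCOUNT = 28
  NCOUNT = 588    # VCOUNT * TCOUNT
  SCOUNT = 11_172 # LCOUNT * NCOUNT

  module_function

  # Split one precomposed syllable into its jamo codepoints. Codepoints
  # outside the syllable block are returned unchanged.
  def decompose(cp)
    sindex = cp - SBASE
    return [cp] unless (0...SCOUNT).cover?(sindex)

    l = LBASE + sindex / NCOUNT
    v = VBASE + (sindex % NCOUNT) / TCOUNT
    tindex = sindex % TCOUNT
    tindex.zero? ? [l, v] : [l, v, TBASE + tindex]
  end
end

p HangulDecomposeSketch.decompose(0xD55C) # => [4370, 4449, 4523]
```

A trailing consonant is only emitted when tindex is non-zero, which matches the `unless tindex == 0` guard in the listing.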
Reverse operation of g_unpack.
Example:
Unicode.g_pack(Unicode.g_unpack('क्षि')) # => 'क्षि'
# File lib/active_support/multibyte/unicode.rb, line 124
def g_pack(unpacked)
  (unpacked.flatten).pack('U*')
end
Unpack the string at grapheme boundaries. Returns a list of character lists.
Example:
Unicode.g_unpack('क्षि') # => [[2325, 2381], [2359], [2367]]
Unicode.g_unpack('Café') # => [[67], [97], [102], [233]]
# File lib/active_support/multibyte/unicode.rb, line 90
def g_unpack(string)
  codepoints = u_unpack(string)
  unpacked = []
  pos = 0
  marker = 0
  eoc = codepoints.length
  while(pos < eoc)
    pos += 1
    previous = codepoints[pos-1]
    current = codepoints[pos]
    if (
        # CR X LF
        ( previous == database.boundary[:cr] and current == database.boundary[:lf] ) or
        # L X (L|V|LV|LVT)
        ( database.boundary[:l] === previous and in_char_class?(current, [:l,:v,:lv,:lvt]) ) or
        # (LV|V) X (V|T)
        ( in_char_class?(previous, [:lv,:v]) and in_char_class?(current, [:v,:t]) ) or
        # (LVT|T) X (T)
        ( in_char_class?(previous, [:lvt,:t]) and database.boundary[:t] === current ) or
        # X Extend
        (database.boundary[:extend] === current)
      )
    else
      unpacked << codepoints[marker..pos-1]
      marker = pos
    end
  end
  unpacked
end
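To experiment with the unpack and pack round trip without loading the library, the sketch below uses Ruby's built-in String#each_grapheme_cluster (Ruby 2.5 or later) in place of the boundary rules above. This is a swapped-in technique, not the library's algorithm: core Ruby follows a newer Unicode version, so cluster boundaries for some scripts may differ from the g_unpack examples on this page. The method names ending in _sketch are hypothetical.

```ruby
# Sketch only: relies on Ruby core grapheme segmentation rather than the
# boundary table used by the library.
def g_unpack_sketch(string)
  string.each_grapheme_cluster.map { |cluster| cluster.unpack('U*') }
end

# Same idea as g_pack: flatten the clusters and re-encode as UTF-8.
def g_pack_sketch(unpacked)
  unpacked.flatten.pack('U*')
end

decomposed = "Cafe\u0301" # "e" followed by COMBINING ACUTE ACCENT
clusters = g_unpack_sketch(decomposed)
p clusters                              # => [[67], [97], [102], [101, 769]]
p g_pack_sketch(clusters) == decomposed # => true
```

The combining accent stays attached to its base letter, which corresponds to the "X Extend" rule in the listing.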
Detect whether the codepoint is in a certain character class. Returns true when it’s in the specified character class and false otherwise. Valid character classes are: :cr, :lf, :l, :v, :lv, :lvt and :t.
Primarily used by the grapheme cluster support.
# File lib/active_support/multibyte/unicode.rb, line 81
def in_char_class?(codepoint, classes)
  classes.detect { |c| database.boundary[c] === codepoint } ? true : false
end
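The lookup depends on `===`, which behaves as equality for an Integer and as a membership test for a Range. The sketch below shows that dispatch with a small, hypothetical boundary table; the real contents of database.boundary are not shown on this page, and the ranges here are an illustrative subset only.

```ruby
# Hypothetical stand-in for database.boundary (illustrative subset).
BOUNDARY_SKETCH = {
  cr: 0x000D,         # single codepoint, matched by Integer#===
  lf: 0x000A,
  l:  0x1100..0x115F, # ranges, matched by Range#===
  v:  0x1160..0x11A7,
  t:  0x11A8..0x11FF
}.freeze

# True when the codepoint belongs to at least one of the given classes.
def in_char_class_sketch?(codepoint, classes)
  classes.any? { |c| BOUNDARY_SKETCH[c] === codepoint }
end

p in_char_class_sketch?(0x000D, [:cr, :lf]) # => true
p in_char_class_sketch?(0x1161, [:l, :v])   # => true
p in_char_class_sketch?(0x0041, [:l, :v])   # => false
```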
Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.
string - The string to perform normalization on.
form - The form you want to normalize in. Should be one of the following: :c, :kc, :d, or :kd. Default is ActiveSupport::Multibyte::Unicode.default_normalization_form.
# File lib/active_support/multibyte/unicode.rb, line 282
def normalize(string, form=nil)
  form ||= @default_normalization_form
  # See http://www.unicode.org/reports/tr15, Table 1
  codepoints = u_unpack(string)
  case form
    when :d
      reorder_characters(decompose_codepoints(:canonical, codepoints))
    when :c
      compose_codepoints(reorder_characters(decompose_codepoints(:canonical, codepoints)))
    when :kd
      reorder_characters(decompose_codepoints(:compatability, codepoints))
    when :kc
      compose_codepoints(reorder_characters(decompose_codepoints(:compatability, codepoints)))
    else
      raise ArgumentError, "#{form} is not a valid normalization variant", caller
  end.pack('U*')
end
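The difference between the four forms is easiest to see on small inputs. The sketch below maps this page's form symbols onto Ruby's built-in String#unicode_normalize (Ruby 2.2 or later). It is a stand-in for illustration, not the library's implementation; FORM_MAP_SKETCH and normalize_sketch are hypothetical names.

```ruby
# Hypothetical mapping from the form symbols used on this page to the
# symbols accepted by Ruby core's String#unicode_normalize.
FORM_MAP_SKETCH = { c: :nfc, d: :nfd, kc: :nfkc, kd: :nfkd }.freeze

def normalize_sketch(string, form = :kc)
  core_form = FORM_MAP_SKETCH.fetch(form) do
    raise ArgumentError, "#{form} is not a valid normalization variant"
  end
  string.unicode_normalize(core_form)
end

# Canonical composition joins "e" + combining acute into one codepoint.
p normalize_sketch("e\u0301", :c).unpack('U*') # => [233]
# Canonical decomposition splits it again.
p normalize_sketch("\u00E9", :d).unpack('U*')  # => [101, 769]
# Compatibility forms also replace the "fi" ligature with plain letters.
p normalize_sketch("\uFB01", :kc)              # => "fi"
```

The canonical forms (:c and :d) leave the ligature untouched, which is why the compatibility forms are the ones that change how text compares.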
Re-order codepoints so the string becomes canonical.
# File lib/active_support/multibyte/unicode.rb, line 129
def reorder_characters(codepoints)
  length = codepoints.length- 1
  pos = 0
  while pos < length do
    cp1, cp2 = database.codepoints[codepoints[pos]], database.codepoints[codepoints[pos+1]]
    if (cp1.combining_class > cp2.combining_class) && (cp2.combining_class > 0)
      codepoints[pos..pos+1] = cp2.code, cp1.code
      pos += (pos > 0 ? -1 : 1)
    else
      pos += 1
    end
  end
  codepoints
end
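Canonical ordering sorts each run of combining marks by combining class while leaving starters (class 0) in place. The sketch below shows the idea with repeated adjacent swaps. It assumes a tiny hand-written combining-class table instead of the library's database, and reorder_sketch is a hypothetical name; the two class values used are the ones the Unicode Character Database assigns to these marks.

```ruby
# Illustrative subset of combining classes; anything missing counts as 0.
COMBINING_CLASS_SKETCH = {
  0x0301 => 230, # COMBINING ACUTE ACCENT (above)
  0x0323 => 220  # COMBINING DOT BELOW
}.freeze

def combining_class_sketch(codepoint)
  COMBINING_CLASS_SKETCH.fetch(codepoint, 0)
end

# Swap adjacent marks that are out of order until a full pass makes no swap.
# A mark never moves across a starter because of the `b > 0` condition.
def reorder_sketch(codepoints)
  result = codepoints.dup
  loop do
    swapped = false
    (0...result.length - 1).each do |i|
      a = combining_class_sketch(result[i])
      b = combining_class_sketch(result[i + 1])
      next unless a > b && b > 0

      result[i], result[i + 1] = result[i + 1], result[i]
      swapped = true
    end
    break unless swapped
  end
  result
end

p reorder_sketch([0x61, 0x0301, 0x0323]) # => [97, 803, 769]
```

The swap condition is the same comparison used in the listing; only the traversal strategy differs.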
Replaces all ISO-8859-1 or CP1252 characters with their UTF-8 equivalents, resulting in a valid UTF-8 string.
Passing true as the second argument (force) will forcibly tidy all bytes, assuming that the string’s encoding is entirely CP1252 or ISO-8859-1.
# File lib/active_support/multibyte/unicode.rb, line 227
def tidy_bytes(string, force = false)
  if force
    return string.unpack("C*").map do |b|
      tidy_byte(b)
    end.flatten.compact.pack("C*").unpack("U*").pack("U*")
  end

  bytes = string.unpack("C*")
  conts_expected = 0
  last_lead = 0

  bytes.each_index do |i|

    byte = bytes[i]
    is_cont = byte > 127 && byte < 192
    is_lead = byte > 191 && byte < 245
    is_unused = byte > 240
    is_restricted = byte > 244

    # Impossible or highly unlikely byte? Clean it.
    if is_unused || is_restricted
      bytes[i] = tidy_byte(byte)
    elsif is_cont
      # Not expecting continuation byte? Clean up. Otherwise, now expect one less.
      conts_expected == 0 ? bytes[i] = tidy_byte(byte) : conts_expected -= 1
    else
      if conts_expected > 0
        # Expected continuation, but got ASCII or leading? Clean backwards up to
        # the leading byte.
        (1..(i - last_lead)).each {|j| bytes[i - j] = tidy_byte(bytes[i - j])}
        conts_expected = 0
      end
      if is_lead
        # Final byte is leading? Clean it.
        if i == bytes.length - 1
          bytes[i] = tidy_byte(bytes.last)
        else
          # Valid leading byte? Expect continuations determined by position of
          # first zero bit, with max of 3.
          conts_expected = byte < 224 ? 1 : byte < 240 ? 2 : 3
          last_lead = i
        end
      end
    end
  end
  bytes.empty? ? "" : bytes.flatten.compact.pack("C*").unpack("U*").pack("U*")
end
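For a rough feel of what the forced mode does, the sketch below uses Ruby's core transcoding. It is a deliberate simplification and not the library's method: it treats the whole string as either valid UTF-8 or entirely CP1252, whereas tidy_bytes repairs individual bytes inside an otherwise valid string. tidy_sketch is a hypothetical name, and bytes that CP1252 leaves undefined may raise a conversion error here.

```ruby
# Sketch only: all-or-nothing reinterpretation, unlike the per-byte repair
# performed by tidy_bytes.
def tidy_sketch(string)
  utf8 = string.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Not valid UTF-8: reinterpret every byte as CP1252 and transcode.
  string.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
end

latin = "Caf\xE9".b # 0xE9 is "é" in ISO-8859-1 and CP1252
p tidy_sketch(latin)                 # => "Café"
p tidy_sketch(latin).valid_encoding? # => true
```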
Unpack the string at codepoint boundaries. Raises an EncodingError when the encoding of the string isn’t valid UTF-8.
Example:
Unicode.u_unpack('Café') # => [67, 97, 102, 233]
# File lib/active_support/multibyte/unicode.rb, line 68
def u_unpack(string)
  begin
    string.unpack 'U*'
  rescue ArgumentError
    raise EncodingError, 'malformed UTF-8 character'
  end
end
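The underlying String#unpack call raises ArgumentError on malformed input, which is what u_unpack translates into EncodingError. The sketch below shows that core behaviour from the caller's side, using a hypothetical helper, codepoints_or_nil, that returns nil instead of raising.

```ruby
# Hypothetical helper: codepoints for valid UTF-8, nil for malformed input.
def codepoints_or_nil(string)
  string.unpack('U*')
rescue ArgumentError
  nil
end

p codepoints_or_nil("Caf\u00E9") # => [67, 97, 102, 233]
p codepoints_or_nil("Caf\xE9")   # => nil (a lone 0xE9 byte is not valid UTF-8)
```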