What is a rune in Golang

Go: Runes

In the modern world you cannot work only with strings made up exclusively of ASCII characters. Non-standard symbols, non-Latin scripts, and emoji are everywhere. To work with such Unicode characters, Go provides the rune data type:

package main

import (
	"fmt"
)

func main() {
	emoji := []rune("привет😀")
	for i := 0; i < len(emoji); i++ {
		fmt.Println(emoji[i], string(emoji[i])) // print the code point and its string representation
	}
}

1087 п
1088 р
1080 и
1074 в
1077 е
1090 т
128512 😀

rune is an alias for int32. Like byte, it exists so that code can distinguish these values from the plain built-in integer type. Each rune holds a Unicode code point. A string converts freely to []byte and to []rune, but these two types cannot be converted to each other directly:

s := "hey😀"
rs := []rune([]byte(s)) // cannot convert ([]byte)(s) (type []byte) to type []rune
bs := []byte([]rune(s)) // cannot convert ([]rune)(s) (type []rune) to type []byte

Go has syntactic sugar for iterating over a string: a for range loop over a string decodes it one rune at a time, so the iteration is over Unicode code points, just as it is over the []rune slice in this example:

package main

import (
	"fmt"
)

func main() {
	emoji := []rune("cool😀")
	for _, ch := range emoji {
		fmt.Println(ch, string(ch)) // print the code point and its string representation
	}
}

99 c
111 o
111 o
108 l
128512 😀

Exercise

Implement the function isASCII(s string) bool, which returns true if the string s consists only of ASCII characters.
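
One possible approach to this exercise, offered as a sketch rather than the reference solution: range over the string and compare each rune with unicode.MaxASCII. The test strings in main are illustrative assumptions.

```go
package main

import (
	"fmt"
	"unicode"
)

// isASCII reports whether every rune in s is an ASCII character.
// A for range loop decodes s one rune at a time; invalid bytes decode
// to U+FFFD, which is above unicode.MaxASCII and so also yields false.
func isASCII(s string) bool {
	for _, r := range s {
		if r > unicode.MaxASCII {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isASCII("hello"))  // true
	fmt.Println(isASCII("привет")) // false
	fmt.Println(isASCII(""))       // true
}
```

An empty string contains no non-ASCII characters, so this sketch returns true for it; whether the exercise's tests expect that is not stated in the lesson.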

Strings, bytes, runes and characters in Go

The previous blog post explained how slices work in Go, using a number of examples to illustrate the mechanism behind their implementation. Building on that background, this post discusses strings in Go. At first, strings might seem too simple a topic for a blog post, but to use them well requires understanding not only how they work, but also the difference between a byte, a character, and a rune, the difference between Unicode and UTF-8, the difference between a string and a string literal, and other even more subtle distinctions.

One way to approach this topic is to think of it as an answer to the frequently asked question, “When I index a Go string at position n, why don’t I get the nth character?” As you’ll see, this question leads us to many details about how text works in the modern world.

An excellent introduction to some of these issues, independent of Go, is Joel Spolsky’s famous blog post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Many of the points he raises will be echoed here.

What is a string?

Let’s start with some basics.

In Go, a string is in effect a read-only slice of bytes. If you’re at all uncertain about what a slice of bytes is or how it works, please read the previous blog post; we’ll assume here that you have.

It’s important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

Here is a string literal (more about those soon) that uses the \xNN notation to define a string constant holding some peculiar byte values. (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

Printing strings

Because some of the bytes in our sample string are not valid ASCII, not even valid UTF-8, printing the string directly will produce ugly output. The simple print statement

fmt.Println(sample)

produces this mess (whose exact appearance varies with the environment):

��=� ⌘

To find out what that string really holds, we need to take it apart and examine the pieces. There are several ways to do this. The most obvious is to loop over its contents and pull out the bytes individually, as in this for loop:

for i := 0; i < len(sample); i++ {
	fmt.Printf("%x ", sample[i])
}

As implied up front, indexing a string accesses individual bytes, not characters. We’ll return to that topic in detail below. For now, let’s stick with just the bytes. This is the output from the byte-by-byte loop:

bd b2 3d bc 20 e2 8c 98 

Notice how the individual bytes match the hexadecimal escapes that defined the string.

A shorter way to generate presentable output for a messy string is to use the %x (hexadecimal) format verb of fmt.Printf . It just dumps out the sequential bytes of the string as hexadecimal digits, two per byte.

fmt.Printf("%x\n", sample)

Compare its output to that above:

bdb23dbc20e28c98 

A nice trick is to use the “space” flag in that format, putting a space between the % and the x . Compare the format string used here to the one above,

fmt.Printf("% x\n", sample)

and notice how the bytes come out with spaces between, making the result a little less imposing:

bd b2 3d bc 20 e2 8c 98 

There’s more. The %q (quoted) verb will escape any non-printable byte sequences in a string so the output is unambiguous.

fmt.Printf("%q\n", sample)

This technique is handy when much of the string is intelligible as text but there are peculiarities to root out; it produces:

"\xbd\xb2=\xbc ⌘" 

If we squint at that, we can see that buried in the noise is one ASCII equals sign, along with a regular space, and at the end appears the well-known Swedish “Place of Interest” symbol. That symbol has Unicode value U+2318, encoded as UTF-8 by the bytes after the space (hex value 20 ): e2 8c 98 .

If we are unfamiliar or confused by strange values in the string, we can use the “plus” flag to the %q verb. This flag causes the output to escape not only non-printable sequences, but also any non-ASCII bytes, all while interpreting UTF-8. The result is that it exposes the Unicode values of properly formatted UTF-8 that represents non-ASCII data in the string:

fmt.Printf("%+q\n", sample)

With that format, the Unicode value of the Swedish symbol shows up as a \u escape:

"\xbd\xb2=\xbc \u2318" 

These printing techniques are good to know when debugging the contents of strings, and will be handy in the discussion that follows. It’s worth pointing out as well that all these methods behave exactly the same for byte slices as they do for strings.

Here’s the full set of printing options we’ve listed, presented as a complete program:

// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import "fmt"

func main() {
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

	fmt.Println("Println:")
	fmt.Println(sample)

	fmt.Println("Byte loop:")
	for i := 0; i < len(sample); i++ {
		fmt.Printf("%x ", sample[i])
	}
	fmt.Printf("\n")

	fmt.Println("Printf with %x:")
	fmt.Printf("%x\n", sample)

	fmt.Println("Printf with % x:")
	fmt.Printf("% x\n", sample)

	fmt.Println("Printf with %q:")
	fmt.Printf("%q\n", sample)

	fmt.Println("Printf with %+q:")
	fmt.Printf("%+q\n", sample)
}

[Exercise: Modify the examples above to use a slice of bytes instead of a string. Hint: Use a conversion to create the slice.]

[Exercise: Loop over the string using the %q format on each byte. What does the output tell you?]

UTF-8 and string literals

As we saw, indexing a string yields its bytes, not its characters: a string is just a bunch of bytes. That means that when we store a character value in a string, we store its byte-at-a-time representation. Let’s look at a more controlled example to see how that happens.

Here’s a simple program that prints a string constant with a single character three different ways, once as a plain string, once as an ASCII-only quoted string, and once as individual bytes in hexadecimal. To avoid any confusion, we create a “raw string”, enclosed by back quotes, so it can contain only literal text. (Regular strings, enclosed by double quotes, can contain escape sequences as we showed above.)

package main

import "fmt"

func main() {
	const placeOfInterest = `⌘`

	fmt.Printf("plain string: ")
	fmt.Printf("%s", placeOfInterest)
	fmt.Printf("\n")

	fmt.Printf("quoted string: ")
	fmt.Printf("%+q", placeOfInterest)
	fmt.Printf("\n")

	fmt.Printf("hex bytes: ")
	for i := 0; i < len(placeOfInterest); i++ {
		fmt.Printf("%x ", placeOfInterest[i])
	}
	fmt.Printf("\n")
}

The output is:

plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98

which reminds us that the Unicode character value U+2318, the “Place of Interest” symbol ⌘, is represented by the bytes e2 8c 98 , and that those bytes are the UTF-8 encoding of the hexadecimal value 2318.

It may be obvious or it may be subtle, depending on your familiarity with UTF-8, but it’s worth taking a moment to explain how the UTF-8 representation of the string was created. The simple fact is: it was created when the source code was written.

Source code in Go is defined to be UTF-8 text; no other representation is allowed. That implies that when, in the source code, we write the text

`⌘`

the text editor used to create the program places the UTF-8 encoding of the symbol ⌘ into the source text. When we print out the hexadecimal bytes, we’re just dumping the data the editor placed in the file.

In short, Go source code is UTF-8, so the source code for the string literal is UTF-8 text. If that string literal contains no escape sequences, which a raw string cannot, the constructed string will hold exactly the source text between the quotes. Thus by definition and by construction the raw string will always contain a valid UTF-8 representation of its contents. Similarly, unless it contains UTF-8-breaking escapes like those from the previous section, a regular string literal will also always contain valid UTF-8.

Some people think Go strings are always UTF-8, but they are not: only string literals are UTF-8. As we showed in the previous section, string values can contain arbitrary bytes; as we showed in this one, string literals always contain UTF-8 text as long as they have no byte-level escapes.

To summarize, strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.
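
As a rough illustration of this summary, the sketch below checks three strings with utf8.ValidString from the standard unicode/utf8 package. The helper name describe and the third test string are assumptions made for the example, not part of the original post.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe returns the ASCII-escaped form of s together with whether
// s is well-formed UTF-8.
func describe(s string) string {
	return fmt.Sprintf("%+q valid UTF-8: %t", s, utf8.ValidString(s))
}

func main() {
	// A raw string literal holds exactly the UTF-8 source text.
	fmt.Println(describe(`⌘`))

	// A regular literal with byte-level escapes can break UTF-8.
	fmt.Println(describe("\xbd\xb2=\xbc ⌘"))

	// A string built from arbitrary bytes carries no guarantee at all.
	fmt.Println(describe(string([]byte{0xff, 0xfe, 'a'})))
}
```

Only the first string reports true; the other two contain bytes that are not part of any valid UTF-8 sequence.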

Code points, characters, and runes

We’ve been very careful so far in how we use the words “byte” and “character”. That’s partly because strings hold bytes, and partly because the idea of “character” is a little hard to define. The Unicode standard uses the term “code point” to refer to the item represented by a single value. The code point U+2318, with hexadecimal value 2318, represents the symbol ⌘. (For lots more information about that code point, see its Unicode page.)

To pick a more prosaic example, the Unicode code point U+0061 is the lower case Latin letter ‘A’: a.

But what about the lower case grave-accented letter ‘A’, à? That’s a character, and it’s also a code point (U+00E0), but it has other representations. For example we can use the “combining” grave accent code point, U+0300, and attach it to the lower case letter a, U+0061, to create the same character à. In general, a character may be represented by a number of different sequences of code points, and therefore different sequences of UTF-8 bytes.
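
The two representations of à can be compared directly. This is a minimal sketch; the helper codePoints is a hypothetical name introduced for the example.

```go
package main

import "fmt"

// codePoints returns the code points of s formatted as U+XXXX values.
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	precomposed := "\u00e0" // à as a single code point
	combined := "a\u0300"   // a followed by the combining grave accent

	fmt.Println(codePoints(precomposed), len(precomposed)) // [U+00E0] 2
	fmt.Println(codePoints(combined), len(combined))       // [U+0061 U+0300] 3
	fmt.Println(precomposed == combined)                   // false
}
```

Both strings render as the same character, yet they differ in code points, in byte length, and under ==, which compares bytes.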

The concept of character in computing is therefore ambiguous, or at least confusing, so we use it with care. To make things dependable, there are normalization techniques that guarantee that a given character is always represented by the same code points, but that subject takes us too far off the topic for now. A later blog post will explain how the Go libraries address normalization.

“Code point” is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as “code point”, with one interesting addition.

The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go. The type and value of the expression

'⌘'

is rune with integer value 0x2318.
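
The type and value of a rune constant such as '⌘' can be checked with a few fmt verbs. In this sketch, describeRune is a hypothetical helper; note that %T prints int32 because rune is only an alias.

```go
package main

import "fmt"

// describeRune formats a rune's Go type, its integer value in hex,
// and its Unicode notation.
func describeRune(r rune) string {
	return fmt.Sprintf("%T %#x %#U", r, r, r)
}

func main() {
	fmt.Println(describeRune('⌘')) // int32 0x2318 U+2318 '⌘'
	fmt.Println(describeRune('a')) // int32 0x61 U+0061 'a'
}
```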

To summarize, here are the salient points:

  • Go source code is always UTF-8.
  • A string holds arbitrary bytes.
  • A string literal, absent byte-level escapes, always holds valid UTF-8 sequences.
  • Those sequences represent Unicode code points, called runes.
  • No guarantee is made in Go that characters in strings are normalized.

Range loops

Besides the axiomatic detail that Go source code is UTF-8, there’s really only one way that Go treats UTF-8 specially, and that is when using a for range loop on a string.

We’ve seen what happens with a regular for loop. A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value. Here’s an example using yet another handy Printf format, %#U , which shows the code point’s Unicode value and its printed representation:

package main

import "fmt"

func main() {
	const nihongo = "日本語"
	for index, runeValue := range nihongo {
		fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
	}
}

The output shows how each code point occupies multiple bytes:

U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6

[Exercise: Put an invalid UTF-8 byte sequence into the string. (How?) What happens to the iterations of the loop?]
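
For readers who want to check their answer to this exercise, here is one way to try it: a \xff escape puts an invalid byte between two valid characters. The helper runeStarts is a hypothetical name used only for this sketch.

```go
package main

import "fmt"

// runeStarts records, for each iteration of a for range loop over s,
// the byte index and the decoded rune as "index:U+XXXX".
func runeStarts(s string) []string {
	var out []string
	for i, r := range s {
		out = append(out, fmt.Sprintf("%d:%U", i, r))
	}
	return out
}

func main() {
	// \xff can never appear in well-formed UTF-8.
	fmt.Println(runeStarts("日\xff本")) // [0:U+65E5 3:U+FFFD 4:U+672C]
}
```

The loop does not stop at the bad byte: it yields the replacement character U+FFFD, advances by a single byte, and continues decoding.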

Libraries

Go’s standard library provides strong support for interpreting UTF-8 text. If a for range loop isn’t sufficient for your purposes, chances are the facility you need is provided by a package in the library.

The most important such package is unicode/utf8 , which contains helper routines to validate, disassemble, and reassemble UTF-8 strings. Here is a program equivalent to the for range example above, but using the DecodeRuneInString function from that package to do the work. The return values from the function are the rune and its width in UTF-8-encoded bytes.

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	const nihongo = "日本語"
	for i, w := 0, 0; i < len(nihongo); i += w {
		runeValue, width := utf8.DecodeRuneInString(nihongo[i:])
		fmt.Printf("%#U starts at byte position %d\n", runeValue, i)
		w = width
	}
}

Run it to see that it performs the same. The for range loop and DecodeRuneInString are defined to produce exactly the same iteration sequence.

Look at the documentation for the unicode/utf8 package to see what other facilities it provides.
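
As a small taste of those facilities, the sketch below uses a few other functions from unicode/utf8: RuneCountInString, ValidString, DecodeLastRuneInString, and RuneLen. The helper summarize is a hypothetical name introduced for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// summarize returns the length of s in bytes, its length in runes,
// and whether it is well-formed UTF-8.
func summarize(s string) (bytes, runes int, valid bool) {
	return len(s), utf8.RuneCountInString(s), utf8.ValidString(s)
}

func main() {
	const nihongo = "日本語"

	fmt.Println(summarize(nihongo)) // 9 3 true

	// Decode from the end of the string instead of the start.
	r, size := utf8.DecodeLastRuneInString(nihongo)
	fmt.Printf("%c %d\n", r, size) // 語 3

	// RuneLen reports how many bytes a rune needs when encoded.
	fmt.Println(utf8.RuneLen('a'), utf8.RuneLen('日')) // 1 3
}
```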

Conclusion

To answer the question posed at the beginning: Strings are built from bytes so indexing them yields bytes, not characters. A string might not even hold characters. In fact, the definition of “character” is ambiguous and it would be a mistake to try to resolve the ambiguity by defining that strings are made of characters.
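
To make the answer concrete, this sketch contrasts indexing a string, which yields a byte, with walking it rune by rune. The helper nthRune is a hypothetical name, and it counts code points, which, as noted, are not always what a reader would call characters.

```go
package main

import "fmt"

// nthRune returns the n-th code point of s (counting from 0), decoding
// UTF-8 as it goes. The boolean is false when s has too few runes.
func nthRune(s string, n int) (rune, bool) {
	for _, r := range s {
		if n == 0 {
			return r, true
		}
		n--
	}
	return 0, false
}

func main() {
	s := "h\u00e9llo" // héllo; é is encoded as the two bytes c3 a9

	fmt.Println(s[1]) // 195, the first byte of é, not the letter itself

	r, ok := nthRune(s, 1)
	fmt.Printf("%c %v\n", r, ok) // é true
}
```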

There’s much more to say about Unicode, UTF-8, and the world of multilingual text processing, but it can wait for another post. For now, we hope you have a better understanding of how Go strings behave and that, although they may contain arbitrary bytes, UTF-8 is a central part of their design.

What is a rune?

What is a rune in Go? I’ve been googling, but the Go documentation says only one line: rune is an alias for int32. But how come integers are used all around, as in swapping cases? The following is a swapcase function. What do all the <= and - operations mean?

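func SwapRune(r rune) rune {
	switch {
	case 'a' <= r && r <= 'z':
		return r - 'a' + 'A'
	case 'A' <= r && r <= 'Z':
		return r - 'A' + 'a'
	default:
		return r
	}
}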

Most of them are from http://play.golang.org/p/H6wjLZj6lW

func SwapCase(str string) string {
	return strings.Map(SwapRune, str)
}

I understand this is mapping rune to string so that it can return the swapped string. But I do not understand how exactly rune or byte works here.

asked Oct 11, 2013 at 5:14
user2671513

Sidenote: This doesn’t do what younger readers might want it to do for the English word “café” and others, let alone other languages. Go has libraries with decent support for actually useful variants of this kind of transformation.

Aug 24, 2018 at 16:06

In case anyone wants to know where the word “rune” came from: en.wikipedia.org/wiki/Runic_(Unicode_block)

Sep 20, 2018 at 18:51

A []rune can be set to a boolean, numeric, or string type. See stackoverflow.com/a/62739051/12817546.

– user12817546
Jul 9, 2020 at 7:51

10 Answers

Rune literals are just 32-bit integer values (however they’re untyped constants, so their type can change). They represent unicode codepoints. For example, the rune literal ‘a’ is actually the number 97 .

Therefore your program is pretty much equivalent to:

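package main

import "fmt"

func SwapRune(r rune) rune {
	switch {
	case 97 <= r && r <= 122:
		return r - 32
	case 65 <= r && r <= 90:
		return r + 32
	default:
		return r
	}
}

func main() {
	fmt.Println(SwapRune('a'))
}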

It should be obvious, if you were to look at the Unicode mapping, which is identical to ASCII in that range. Furthermore, 32 is in fact the offset between the uppercase and lowercase codepoint of the character. So by adding 32 to ‘A’ , you get ‘a’ and vice versa.

answered Oct 11, 2013 at 5:58

This obviously works only for ASCII characters and not for accented characters such as ‘ä’, let alone more complicated cases like the ‘ı’ (U+0131). Go has special functions to map to lower case such as unicode.ToLower(r rune) rune .

Oct 11, 2013 at 6:06

And to add to @topskip’s correct answer with a SwapCase function that works for all codepoints and not just a-z: func SwapRune(r rune) rune { if unicode.IsUpper(r) { r = unicode.ToLower(r) } else { r = unicode.ToUpper(r) }; return r }

Oct 11, 2013 at 6:33
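
For reference, here is a runnable sketch that combines the Unicode-aware SwapRune from the comment above with strings.Map. The input string is an illustrative assumption, and the result is still a per-code-point mapping, so it does not handle locale-specific casing rules.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// SwapRune flips the case of r using the Unicode tables rather than
// ASCII arithmetic. Runes without case are returned unchanged.
func SwapRune(r rune) rune {
	if unicode.IsUpper(r) {
		return unicode.ToLower(r)
	}
	return unicode.ToUpper(r)
}

// SwapCase applies SwapRune to every rune of str.
func SwapCase(str string) string {
	return strings.Map(SwapRune, str)
}

func main() {
	fmt.Println(SwapCase("Hello, Caf\u00e9")) // hELLO, cAFÉ
}
```
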
Runes are int32 values. That’s the entire answer. They’re not “mapped”.
Oct 11, 2013 at 11:38
So runes are similar to C chars?
Feb 23, 2017 at 16:54

@KennyWorden Runes are 32-bit which means one rune can hold any unicode character. However, c chars I believe are typically only 8-bit which means one char can only represent a character in the extended-ascii range

Jul 30, 2019 at 9:28

rune is a type. It occupies 32 bits and is meant to represent a Unicode code point. As an analogy, the English character set encoded in ASCII has 128 code points and thus fits inside a byte (8 bits). From this (erroneous) assumption, C treated characters as bytes ( char ) and strings as sequences of characters ( char* ).

But guess what. There are many other symbols invented by humans other than the ‘abcde..’ symbols. And there are so many that we need 32 bit to encode them.

In golang then a string is a sequence of bytes . However, since multiple bytes can represent a rune code-point, a string value can also contain runes. So, it can be converted to a []rune , or vice versa.

The unicode package http://golang.org/pkg/unicode/ can give a taste of the richness of the challenge.

answered Oct 11, 2013 at 19:28

With the recent Unicode 6.3, there are over 110,000 symbols defined. This requires at least 21-bit representation of each code point, so a rune is like int32 and has plenty of bits.

Oct 12, 2013 at 12:08
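
The 21-bit figure in the comment above can be checked against unicode.MaxRune, the largest valid code point. This is a minimal sketch; bitsNeeded is a hypothetical helper built on math/bits.

```go
package main

import (
	"fmt"
	"math/bits"
	"unicode"
)

// bitsNeeded returns the minimum number of bits required to hold r,
// treating r as a non-negative code point.
func bitsNeeded(r rune) int {
	return bits.Len32(uint32(r))
}

func main() {
	fmt.Printf("%#x\n", unicode.MaxRune)     // 0x10ffff
	fmt.Println(bitsNeeded(unicode.MaxRune)) // 21
	fmt.Println(bitsNeeded('a'))             // 7
}
```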

You say “a string is a sequence of rune s”. I don’t think that’s true? Go blog: “a string is just a bunch of bytes”; Go lang spec: “A string value is a (possibly empty) sequence of bytes”

May 16, 2016 at 22:52
I’m still confused, so is string an array of runes or an array of bytes? Are they interchangeable?
Aug 11, 2017 at 9:30

@prvn That’s wrong. It’s like saying an image is not a sequence of bytes, it’s a sequence of pixels. But, actually, underneath, it’s a series of bytes. A string is a series of bytes, not runes. Please read the spec.

Aug 26, 2018 at 15:30

@prvn But, you can’t say not bytes . Then, you might say: “Strings are made up of runes and runes made up of bytes”. Something like that. Then again, it’s not completely true.

Aug 27, 2018 at 9:20

I have tried to keep my language simple so that a layman understands rune .

A rune is a character. That’s it.

It is a single character. It’s a character from any alphabet from any language from anywhere in the world.

To get a string we use

double-quotes "" 
back-ticks `` 

A string is different than a rune. In runes we use

single-quotes '' 

Now a rune is also an alias for int32 . Uh What?

The reason rune is an alias for int32 is because we see that with coding schemes such as below

each character maps to some number, and so it’s the number that we are storing. For example, a maps to 97, and when we store that number it’s just the number; that’s why rune is an alias for int32. But it is not just any number. It is a number with 32 ‘zeros and ones’, or 4 bytes. (Note: UTF-8 is a 4-byte encoding scheme)

How do runes relate to strings?

A string is a collection of runes. In the following code:

package main

import (
	"fmt"
)

func main() {
	fmt.Println([]byte("Hello"))
}

We try to convert a string to a stream of bytes. The output is:

[72 101 108 108 111] 

We can see that each of the bytes that makes up that string is a rune.

answered Sep 16, 2017 at 12:04
Suhail Gupta

Strictly speaking, it is not correct that a string is a collection of runes. A string is a byte slice, conventionally encoded as UTF-8. Each character in a string takes 1 to 4 bytes, while each rune takes 4 bytes. You can convert between string and []rune, but they are different.

Jul 31, 2018 at 10:08

Rune is not a character, a rune represents a unicode codepoint. And a codepoint doesn’t necessarily point to one character.

Oct 10, 2018 at 14:18

Worth adding that “a rune is also an alias for int32” is true, but that doesn’t make it useful for poor-man’s compression. If you hit something like 55296, the string conversion goes astray: Go Playground

Nov 24, 2019 at 0:55
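
The 55296 case from the comment above can be reproduced in a few lines. 55296 is 0xD800, a surrogate value that is not a valid Unicode code point, so converting it to a string produces the replacement character instead. The helper roundTrip is a hypothetical name for this sketch.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// roundTrip converts r to a string and decodes the first rune back.
func roundTrip(r rune) rune {
	back, _ := utf8.DecodeRuneInString(string(r))
	return back
}

func main() {
	fmt.Println(roundTrip(97))         // 97
	fmt.Println(roundTrip(55296))      // 65533 (U+FFFD), not 55296
	fmt.Println(utf8.ValidRune(55296)) // false
}
```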

Note: UTF-8 is not a 4-byte encoding scheme; I believe you’re thinking about Unicode code points (which are 32 bits). The beauty of UTF-8 is that each character takes as few bytes as needed, or, in other words, each character has a variable size. Characters up to 127 (i.e. ASCII) are just encoded in a single byte. All characters on the old ANSI code pages will take 2 bytes. And so forth, up to 6 bytes (for some complex emojis with variants, for instance). That means that “Hello” just takes 5 bytes, in ASCII and UTF-8.

Jul 10 at 21:38
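
The variable width described in the comment above can be measured with utf8.RuneLen. One caveat worth stating: in Go's unicode/utf8 package a single code point encodes to at most utf8.UTFMax, that is 4, bytes; longer byte sequences arise when one visible character is built from several code points. The helper widths and the sample string are assumptions for this sketch.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// widths returns the UTF-8 encoded length in bytes of each rune in s.
func widths(s string) []int {
	var out []int
	for _, r := range s {
		out = append(out, utf8.RuneLen(r))
	}
	return out
}

func main() {
	// a, Ö (U+00D6), ⌘ (U+2318), and the emoji U+1F600.
	fmt.Println(widths("a\u00d6\u2318\U0001F600")) // [1 2 3 4]
	fmt.Println(utf8.UTFMax)                       // 4
	fmt.Println(len("Hello"))                      // 5
}
```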

(Got a feeling that the above answers still didn’t state the differences & relationships between string and []rune very clearly, so I would try to add another answer with an example.)

As @Strangework ‘s answer said, string and []rune are quite different.

Differences — string & []rune :

  • A string value is a read-only byte slice, and a string literal is encoded in UTF-8. Each character in a string takes 1 to 4 bytes, while each rune takes 4 bytes.
  • For string , both len() and index are based on bytes.
  • For []rune , both len() and index are based on rune (or int32).

Relationships — string & []rune :

  • When you convert from string to []rune , each utf-8 char in that string becomes a rune .
  • Similarly, in the reverse conversion, when converting from []rune to string , each rune becomes a utf-8 char in the string .

Tips:

  • You can convert between string and []rune , but still they are different, in both type & overall size.

(I would add an example to show that more clearly.)

Code

string_rune_compare.go:

// Compare a string with its []rune conversion.
package main

import "fmt"

// stringAndRuneCompare prints the type, length, and elements of a string
// and of the []rune built from it.
func stringAndRuneCompare() {
	// string
	s := "hello你好"

	fmt.Printf("%s, type: %T, len: %d\n", s, s, len(s))
	fmt.Printf("s[%d]: %v, type: %T\n", 0, s[0], s[0])
	li := len(s) - 1 // last index
	fmt.Printf("s[%d]: %v, type: %T\n\n", li, s[li], s[li])

	// []rune
	rs := []rune(s)
	fmt.Printf("%v, type: %T, len: %d\n", rs, rs, len(rs))
}

func main() {
	stringAndRuneCompare()
}

Execute:

go run string_rune_compare.go

Output:

hello你好, type: string, len: 11
s[0]: 104, type: uint8
s[10]: 189, type: uint8

[104 101 108 108 111 20320 22909], type: []int32, len: 7

Explanation:

  • The string hello你好 has length 11, because the first 5 chars each take 1 byte only, while the last 2 Chinese chars each take 3 bytes.
    • Thus, total bytes = 5 * 1 + 2 * 3 = 11
    • Since len() on string is based on bytes, thus the first line printed len: 11
    • Since index on string is also based on bytes, thus the following 2 lines print values of type uint8 (since byte is an alias type of uint8 , in go).
    • Since len() on []rune is based on rune, thus the last line printed len: 7 .
    • If you operate []rune via index, it will access base on rune.
      Since each rune is from a utf8 char in the original string, thus you can also say both len() and index operation on []rune are based on utf8 chars.

    answered Jul 31, 2018 at 10:47

    “For string, both len() and index are based on bytes.” Could you explain that a little more? When I do fmt.Println("hello你好"[0]) it returns the actual UTF-8 code point instead of bytes.

    Oct 13, 2018 at 11:32

    @Julian Please take a look at the output of the program in the answer: for s[0] it prints s[0]: 104, type: uint8 . The type is uint8 , which means it is a byte. For ASCII characters like h , UTF-8 also uses a single byte, so the code point is the same as that byte; but Chinese characters like 你 use 3 bytes.

    Oct 13, 2018 at 18:06
    Clarifying example. I quoted you here stackoverflow.com/a/62739051/12817546.
    – user12817546
    Jul 7, 2020 at 8:49

    I do not have enough reputation to post a comment to fabrizioM’s answer, so I will have to post it here instead.

    Fabrizio’s answer is largely correct, and he certainly captured the essence of the problem — though there is a distinction which must be made.

    A string is NOT necessarily a sequence of runes. It is a wrapper over a ‘slice of bytes’, a slice being a wrapper over a Go array. What difference does this make?

    A rune type is necessarily a 32-bit value, meaning a sequence of values of rune types would necessarily have some number of bits x*32. Strings, being a sequence of bytes, instead have a length of x*8 bits. If all strings were actually in Unicode, this difference would have no impact. Since strings are slices of bytes, however, Go can use ASCII or any other arbitrary byte encoding.

    String literals, however, are required to be written into the source encoded in UTF-8.

    What is rune in Golang?

    Strings are always defined using characters or bytes. In Golang, strings are always made of bytes. Go uses UTF-8 encoding, so any valid Unicode code point can be represented.

    What is Golang rune?

    A character is defined using “code points” in Unicode. Go language introduced a new term for this code point called rune.

    rune is also an alias of type int32, which is wide enough to hold any Unicode code point. Some interesting points about runes and strings:

    • Strings are made of bytes and they can contain valid characters that can be represented using runes.
    • We can use the []rune(s) conversion to turn a string into a slice of runes.
    • For ASCII characters, the rune value will be the same as the byte value.

    Finding rune of a character in Go

    Let’s look at a program to print rune of a character.

    package main

    import (
    	"fmt"
    )

    func main() {
    	r := 'a'
    	fmt.Println(r) // 97
    }

    Golang String to rune

    Let’s print the rune values of a string’s characters using the []rune conversion.

    package main

    import (
    	"fmt"
    )

    func main() {
    	s := "Golang"
    	s_rune := []rune(s)
    	fmt.Println(s_rune) // [71 111 108 97 110 103]
    }

    The integer output looks the same if we use the []byte conversion instead. So what is the difference between byte and rune? Let’s look into that in the next section.

    Understanding difference between byte and rune

    Let’s print byte array and rune of a string having non-ascii characters.

    package main

    import (
    	"fmt"
    )

    func main() {
    	s := "GÖ"
    	s_rune := []rune(s)
    	s_byte := []byte(s)
    	fmt.Println(s_rune) // [71 214]
    	fmt.Println(s_byte) // [71 195 150]
    }

    The Unicode character Ö has the rune value 214, but its UTF-8 encoding takes two bytes.

    Reference: GoDocs on Rune
