The twisted road through right-to-left language support


I saw Moriel Schottlender give a talk about this topic at linux.conf.au 2016 in Geelong last month and asked her to submit an article about it. When you watch her talk video and read her article, you'll see why I couldn't take proper notes to write about the talk, so I'm glad she contributed her story directly to us instead. —Rikki Endsley

 

English is written left to right. Hebrew is written right to left. We know that. Browsers—for the most part—know that too, just like they know that the default directionality of a web page is left-to-right (LTR), and that if there is a setting that explicitly defines the direction to right-to-left, the page should flip like a mirror. Browsers are smart like that. Mostly.

But even browsers have problems when deciding what to do when languages are mixed up, and that, my friends, is a recipe for really weird issues when typing and viewing bidirectional text.

Bidirectionality of characters and strings

Before I delve into some interesting examples of mixed-up directionality problems, I should first go over how browsers consider directionality at all.

I already said that English is recognized as an "LTR" language (Left-to-Right), and Hebrew, Arabic, Urdu (and some others) as RTL languages (Right-to-Left). These are fairly clear, and if you type a string that consists of these languages on their own, the situation is more or less okay (but I'll go over some issues with that later).

But not all characters in strings are equal.

Hebrew and English (and a couple of other languages) are of the "strong" directionality types, the ones that not only have direction but also affect their surroundings. Some characters have "weak" directionality, in that, although they have directionality internally, they don't affect characters around them. And some characters are merely neutral, which means they get their directionality by their surroundings. Oh, and there are also some characters that may (and do) flip around visually depending on the text they're in.

Don't worry. I'm going to explain eeeeeeeeeeeeeeverything. Well, I'm going to try, so just keep reading.

Character Directionality Types

Unicode, which is the encoding system most common online, defines character type directionality for groups of characters as either strong, weak, or neutral. These types control how these characters are presented inside a string.
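To make those three groups concrete, here is a minimal Python sketch that asks the Unicode database for a character's bidirectional class and sorts it into the buckets used in this article. The function name `directionality_type` and the way the classes are grouped into sets are illustrative assumptions; only `unicodedata.bidirectional` and `unicodedata.name` come from the standard library.

```python
import unicodedata

# Bidi classes from the Unicode Character Database, grouped the way
# this article groups them (the grouping itself is an illustration).
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def directionality_type(ch):
    """Return 'strong', 'weak', 'neutral', or 'other' for one character."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    if bidi_class in NEUTRAL:
        return "neutral"
    return "other"  # explicit formatting characters and the like


# "\u05e9" is the Hebrew letter shin.
for ch in ["a", "\u05e9", "7", ":", " ", "("]:
    print(unicodedata.name(ch), unicodedata.bidirectional(ch), directionality_type(ch))
```

Running it on a letter, a digit, a colon, a space, and a parenthesis should show one example of each group side by side.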

In the beginning days of the Internet—way, way back, when dinosaurs roamed the earth and half of you who are reading this post were probably in diapers—the Internet assumed pretty much everything is left-to-right.

I remember building web pages in raw HTML that most of us would cringe at today. There were no sites, really—only a collection of static HTML pages that, more often than not, included horrendous tags such as <blink> and <marquee> and featured pages in which one-font-served-all and the backgrounds were tiled. Ah, the good ol' days.

In those days, Hebrew was, in fact, typed backwards. If I wanted to write the Hebrew word "שלום", which starts with the Hebrew letter "ש", I would have to type it backwards, starting with the letter "ם", and produce "םולש", because the letters would appear sequentially from left to right. This might be doable when typing one or two words, but if you had an entire paragraph or an entire article, it could get annoying fast.

There were several tools you could download in those ancient days that would take your text and flip it. 'Cause that's how we rolled back then.
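I don't have the source of any of those old tools, so the following is only a guess at the core idea: reverse the characters of each line so that text stored left to right reads correctly right to left. The name `flip_line` is hypothetical.

```python
def flip_line(text):
    """Reverse the characters of each line, keeping the lines in place."""
    return "\n".join(line[::-1] for line in text.split("\n"))


# The word typed backwards (final mem first) ...
typed_backwards = "\u05dd\u05d5\u05dc\u05e9"
# ... comes out in its natural order (shin first) after flipping.
natural_order = flip_line(typed_backwards)
print(natural_order == "\u05e9\u05dc\u05d5\u05dd")  # True
```

A naive reversal like this would also flip any numbers or English words in the line, which hints at why a per-character notion of direction was needed.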

Luckily, Unicode came in and defined directionality, and although Unicode still has problems, RTL users can at least type their language normally, rather than learn to write backwards. That helps.

Strong types

Strong types are character sets that have explicit directionality. Hebrew is right-to-left, always. English is left-to-right, always.** When I type in either of those character-sets, my characters would appear in sequence, one after the other, according to the directionality. This is how the word "Hello" appears from left to right, whereas the word "שלום" appears from right to left.

Strong types also set the directionality of the space they're in, meaning that if I inserted any characters that have weak or neutral directionality in the middle of the sentence you're reading now (and I have already done that), they will assume the direction of the strongly typed string—in this case, English. So, strong type isn't just about the character itself, but also its surroundings.

Weak types

Weak types are fun. These are sequences of characters that might have a direction, but it doesn't affect their surroundings, and may be adjusted based on their surrounding text. In this group are characters such as numbers, plus and minus signs, colon, comma, and period, as well as some control characters.

According to the Unicode bidirectionality algorithm specifications, weak types resolve according to the previous characters.

Neutral types

Neutral types are the funnest. Neutral characters are character types that can be either right-to-left or left-to-right, so they completely depend on what string surrounds them. These include things such as new-line characters, tabs, and white-space.

According to the Unicode bidirectionality algorithm specifications, neutral types resolve their directionality according to the surrounding text.
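As a rough sketch of what "resolve according to the surrounding text" means, the function below looks for the nearest strong character on each side of a neutral one. If both sides agree, the neutral takes that direction; otherwise it falls back to the base direction. This is a heavy simplification of the real rules (it skips over numbers and other weak characters instead of handling them), and the names `strong_direction` and `resolve_neutral` are made up for this example.

```python
import unicodedata


def strong_direction(ch):
    """Return 'ltr' or 'rtl' for strong characters, else None."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "ltr"
    if bidi_class in ("R", "AL"):
        return "rtl"
    return None


def resolve_neutral(text, index, base="ltr"):
    """Direction the neutral character at `index` would take (simplified)."""
    # Nearest strong character before and after; fall back to the base
    # direction when the text runs out on that side.
    before = next((d for d in map(strong_direction, reversed(text[:index])) if d), base)
    after = next((d for d in map(strong_direction, text[index + 1:]) if d), base)
    return before if before == after else base


print(resolve_neutral("ab cd", 2, base="rtl"))  # ltr
```

A space between two English words stays LTR even in an RTL box, while a space sitting between an English word and a Hebrew word follows the box.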

Implicit level types: When what you type is not quite what you get

So we have strong types, weak types, and neutral types, but that's not where our directionality double-take ends. In fact, the real doozies are characters that are resolved differently (as in, they take literally different shapes) in either RTL or LTR.

Yes, you read that right: They actually literally and quite visibly look different when written inside an LTR string versus inside an RTL string.

The best examples for this are parentheses and (my personal best friend) the bracket. These symbols are, in fact, icons that represent direction already. The button on your keyboard that has "(" on it is not quite that, but rather a symbol of "open parentheses." In English (which is left-to-right) the symbol is naturally ( to open parentheses, and ) to close them. But in Hebrew and Arabic and the other RTL languages, the "open parentheses" symbol is the reverse ), because the string is right to left. So this symbol would appear on your screen either ( or ) depending where you typed it.
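Unicode records this "flips in RTL" behavior as a per-character property, and Python's standard library exposes it through `unicodedata.mirrored`. A small check, with `is_mirrored` as a hypothetical wrapper name:

```python
import unicodedata


def is_mirrored(ch):
    """True if the character is drawn mirrored in right-to-left text."""
    return bool(unicodedata.mirrored(ch))


for ch in "([<a.":
    print(unicodedata.name(ch), is_mirrored(ch))
```

Parentheses, brackets, and angle brackets should report `True`, while letters and periods report `False`.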

I know, right?

Mishmashing both ways

In general, if one uses only one direction in a document (specifically online), the problems are not as noticeable, because the strongly typed text surrounds all other weak and implicit-level character types, making them its own type by default.

The issues come up when we have to mix languages and directions, or use an RTL language inside a block that is meant for LTR. This happens a lot online: if there is no explicit dir="rtl" anywhere in the HTML document, the document defaults to LTR directionality. The directionality of the page (whether set with dir="rtl" or dir="ltr", or by omitting the dir attribute altogether and relying on its default fallback to LTR) is considered to explicitly set the directionality of the expected text. So, any characters of ambiguous directionality will take on the direction that was set by that attribute.
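The fallback described above can be modeled in a few lines. This sketch takes the dir values found on an element's ancestors (from the root down to the element itself, with `None` where the attribute is missing) and returns the direction that ends up applying. The function name and the list-based input are assumptions for illustration, and it deliberately ignores dir="auto".

```python
def effective_direction(dir_chain):
    """Return 'ltr' or 'rtl' for the last element in the chain.

    dir_chain lists dir attribute values from the root element down to
    the target element; None means the attribute is absent.
    """
    # The nearest explicit value wins, so walk from the element upward.
    for value in reversed(dir_chain):
        if value is not None and value.lower() in ("ltr", "rtl"):
            return value.lower()
    # Nothing set anywhere: the document falls back to left-to-right.
    return "ltr"


print(effective_direction([None, None]))         # ltr
print(effective_direction(["rtl", None, None]))  # rtl
print(effective_direction(["rtl", "ltr"]))       # ltr
```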

If, say, I try to type an RTL language inside a textbox in a page that has dir='ltr', I can run into a lot of annoying problems with punctuation, the positions of segments of the sentence, and mixing languages of a strong type. The same happens the other way around, if I try to type an LTR language (say, English) inside an RTL-set textbox.

It can get so confusing, that, quite often, as I try to figure out how to type LTR text into an RTL box and see how my text actually organizes itself, my state of mind is pretty much blown.

The good, the bad, and the ugly

So, obviously, the creation of Unicode was much superior to the reverse-typing (and the need to use multiple individual fonts) that existed before it. Browsers tend to follow the Unicode rules (though apps that do their own rendering sometimes don't, but that's a different issue.) And this Unicode directionality algorithm gives us a lot of really Good Things to work with when typing different directions, but it also has Bad Things, and occasionally, even really Ugly Things.

Good things

There are, indeed, a bunch of good things that happen due to Unicode's bidirectionality algorithm. As I've already mentioned, the fact RTL users can type their language normally (and not backwards) is already a good thing (and I know from experience because I used the system when it didn't have that nice feature.)

Another benefit of the bidirectionality algorithm is that we can use numbers (which are weakly typed LTR) inside RTL text. So, for instance, consider this text:

ניפגש ב09:35 בחוף הים

Literally, this means "we will meet at 09:35 at the beach." Notice, though, that even without any directionality fixes, the numbers 09 and 35 are left-to-right as they should be, because that's how numbers are read—but I didn't really need to manually reverse my typing when I wrote this sentence. The browser did it for me.
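To see the pieces the browser is juggling, this sketch groups a string into runs of characters that share a bidirectional class. `bidi_runs` is a hypothetical helper; the class names come straight from `unicodedata.bidirectional`.

```python
import unicodedata
from itertools import groupby


def bidi_runs(text):
    """Group consecutive characters that share a bidirectional class."""
    return [
        (bidi_class, "".join(chars))
        for bidi_class, chars in groupby(text, key=unicodedata.bidirectional)
    ]


# "\u05d1" is the Hebrew letter bet, the prefix attached to the time.
runs = bidi_runs("\u05d109:35")
print([bidi_class for bidi_class, _ in runs])  # ['R', 'EN', 'CS', 'EN']
```

The Hebrew prefix is a strong right-to-left run (R), the digits are European numbers (EN), and the colon is a common separator (CS). As far as I understand the algorithm, a single separator between two numbers is absorbed into the number, which is why 09:35 stays together as one left-to-right unit.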

Here's a nice exercise, though. Select that sentence. When you do, you can see exactly what piece has what directionality. Which leads me to ...

Bad things

Selections

Selections are a major part of the problem of bidirectional text. As you can see from the example of the "good thing" (that I don't need to reverse typing), there is also a bad side, which is how to select my text. Selection can be logical or visual. This is also true of cursor movement, which I will go over in a second.

Visual selection is simply that—visual—which means that the selection treats the segment of text as if it's one continuous block, regardless of directions.

Logical selection means the text is divided to its bidirectional pieces. That means that if I start my selection at the beginning of an RTL text (at the right), and drag my mouse toward its end (to the left), the selection will split when I reach the number part, because the numbers are left-to-right.

This is, indeed, logical, because it goes from logical start to logical end, and because the text is bidirectional, those two pointers are different for each of the sections. This makes a lot of sense, but it can be confusing.

Cursor Movement

Similarly, the cursor can also move either logically or visually. This can be a little confusing, and sometimes this behavior is inconsistent across platforms. Most of the time, though, the movement is logical.

So, here's a quick test of where this behavior can become really weird. Consider the following sentence. It is inside a textbox, so you can select it and move your cursor within it properly.

 

Try to select the text from the start (left) to the end (right). See what happens when you hover over the Hebrew words?

Now, if you move your cursor inside the given textbox, it will (in Chrome and Firefox on Windows, for example) move visually and not logically. That is, you can just move from end to start as if there are no two different languages there.

But try to copy/paste this string into Notepad (or equivalent simple software) and move the cursor from start to end. Usually, those editors would move the cursor logically, which, to be fair, makes more sense than visual movement.

It also shows you how RTL behavior can be somewhat unpredictable; some programs do it this way, some that way. Some browsers will go visual, some logical, and there are some CSS rules that can override those decisions, too, so it may change on a website-to-website basis.

Nice, eh?

Punctuation marks

Well, that was a textbox that was "LTR" to begin with. What happens, though, if I write a Hebrew sentence in an LTR box, or the other way around—an English sentence in an RTL textbox? That's when our lovely friends—the weakly typed punctuation marks—come out to play.

Whoops, where's the final period?

Here's the reverse version:

Where'd that final period go?
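Here is a small sketch of why the period wanders off. Trailing punctuation has no strong character after it, so it follows the direction of the box rather than the direction of the sentence. The helper names below are hypothetical, and the check only looks at trailing punctuation and spaces (not trailing numbers, which behave differently).

```python
import unicodedata

# Classes treated here as "punctuation or space" at the end of a sentence.
PUNCT_CLASSES = {"CS", "ON", "WS", "ES", "ET"}


def last_strong_direction(text):
    """Direction of the last strong character, or None if there is none."""
    for ch in reversed(text):
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def trailing_punctuation(text):
    """Return the run of punctuation and spaces at the end of the text."""
    end = len(text)
    while end > 0 and unicodedata.bidirectional(text[end - 1]) in PUNCT_CLASSES:
        end -= 1
    return text[end:]


def final_mark_jumps(text, box_dir):
    """True if the closing punctuation will land on the 'wrong' side."""
    direction = last_strong_direction(text)
    return bool(trailing_punctuation(text)) and direction is not None and direction != box_dir


print(final_mark_jumps("Hello.", "rtl"))  # True
print(final_mark_jumps("Hello.", "ltr"))  # False
```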

Two languages together, Kumbaya

Here's something even better, though, that relates to both selections and cursor movement (and rendering, and usage, and ...).

The above examples featured a strong type (English or Hebrew) mixed with weak types (numbers) and neutral types (white space). But what if I create a string that has two opposite strong types mixed with neutral-type white space and weak-type punctuation?

Go ahead, try to select that sentence from beginning to end:

Or the reverse:

(Hat tip to Amir Aharoni for this one)

Let's go over what goes on in that horrific textbox for a minute. First of all, part of the problem in the first textbox is that the textbox was forced RTL, and because most of the text in it was English, it broke in weird places. Here's the sentence when it is forced to be LTR:

Remember that English is strongly typed for LTR but עברית is strongly typed for RTL. When mixing English ועברית together you may get some surprising results.

Notice, though, that the textbox problem also happened just the same in the reverse case, where the box was LTR and the sentence was mostly RTL.

With a forced-RTL textbox (and majority of text strongly typed for LTR) the spaces took the directionality of the text they were surrounded by, which is LTR. Then we had a strongly-typed RTL word in Hebrew, which made the space inside it turn RTL, but the surrounding white space (the one between the RTL word and LTR sentence) was still affected by the surrounding text, which is LTR.

If you're still with me here, this may help drive the point home. Essentially, you had this:

[ENGLISH_CHUNK 3] hebrew [ENGLISH_CHUNK 2] hebrew [ENGLISH_CHUNK 1]

The entire sentence structure was right-to-left, but the small English segment was left-to-right. Overall "chunk" direction was RTL. Each chunk had its own internal direction. When you read it, it looks all jumbled—because it is.
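The chunk picture can be written down directly. In this sketch the chunks are listed in the order they were typed, and the function returns them in the order they sit on screen from left to right. `screen_order` is a made-up name, and the model only covers chunk placement, not what happens inside each chunk.

```python
def screen_order(chunks, box_dir="ltr"):
    """Arrange logically ordered chunks in left-to-right screen order."""
    ordered = list(chunks)
    if box_dir == "rtl":
        # In an RTL box the first chunk sits at the far right.
        ordered.reverse()
    return ordered


typed = ["ENGLISH_CHUNK 1", "hebrew", "ENGLISH_CHUNK 2", "hebrew", "ENGLISH_CHUNK 3"]
print(" ".join(screen_order(typed, "rtl")))
# ENGLISH_CHUNK 3 hebrew ENGLISH_CHUNK 2 hebrew ENGLISH_CHUNK 1
```

Each English chunk still reads left to right internally, which is exactly why the eye has to jump around to follow the sentence.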

And that happened exactly the same (only in reverse) in the second textbox. With LTR instead of RTL, and vice versa.

I know. I... know.

Ugly things

Now we move to the ugly area, the things that are not just difficult behavior, but are also producing visually different results. Remember those weak typed and implicit-level types? That's where these come in, and they, I tell you, they have a blast confusing us thoroughly.

White spaces

White spaces are neutral types, which means their direction is defined by the text they live in. The spaces in the sentence you're reading right now are implicitly LTR, because they are inside an English text. The white spaces here: במשפט הזה יש רווחים ואלה מוגדרים ימין לשמאל are implicitly RTL because they're inside Hebrew, even though the page itself is LTR.

This is good, but it also produces some weird results. Consider the situation in which I have a set of numbers inside a text. The numbers are separated by whitespace, and the whitespace is defined by the surrounding text. But numbers themselves are "weak" typed, which means they do not affect their own surroundings (even though they are internally LTR). The whitespaces would have to take their directionality from whatever words surround the entire segment of numbers.

This sounds weird? The behavior is even weirder. See this, for instance:

I purposefully encapsulated those numbers in an LTR text, and so the whitespaces that separate these are still LTR. What do you think would happen, though, if I replace those English words with Hebrew (RTL) ones? Well, this example is exactly the same sentence and sequence of numbers, in the same exact order, with the single difference that "Start" and "End" were replaced by their respective Hebrew words.

The numbers are reversed! The numbers... are... Head spinning yet? This might be weird, but it makes sense; the spaces are now encapsulated in an RTL text, which means they are now RTL. The space in RTL sentences is right-to-left, so the grouping of numbers goes from the right to the left.
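For readers who want to trace this by hand, below is a stripped-down sketch of the reordering steps. It handles only letters, European digits, and whitespace, skips most of the real algorithm, and uses hypothetical stand-in words and numbers, so treat it as a toy model under those assumptions rather than a reference implementation. The name `visual_order` is mine.

```python
import re
import unicodedata

LEVELS_IN_LTR = {"L": 0, "R": 1, "EN": 2}
LEVELS_IN_RTL = {"L": 2, "R": 1, "EN": 2}


def visual_order(text, base="ltr"):
    """Reorder text from typing order to left-to-right screen order."""
    base_type = "L" if base == "ltr" else "R"
    types, last_strong = [], base_type
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "AL":
            cls = "R"
        if cls in ("L", "R"):
            last_strong = cls
        elif cls == "EN":
            if last_strong == "L":
                cls = "L"  # digits after Latin text behave like Latin text
        else:
            cls = "N"      # whitespace (and anything else) is neutral here
        types.append(cls)

    # Neutral runs take the direction both neighbors agree on, with
    # digits counting as right-to-left; otherwise the base direction.
    n, i = len(types), 0
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = types[i - 1] if i > 0 else base_type
        after = types[j] if j < n else base_type
        before = "R" if before == "EN" else before
        after = "R" if after == "EN" else after
        types[i:j] = [before if before == after else base_type] * (j - i)
        i = j

    table = LEVELS_IN_LTR if base == "ltr" else LEVELS_IN_RTL
    levels = [table[t] for t in types]

    # Reverse runs from the highest level down to level 1.
    chars = list(text)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
    return "".join(chars)


print(re.findall(r"\d+", visual_order("Start 12 34 End")))   # ['12', '34']
print(re.findall(r"\d+", visual_order("התחלה 12 34 סוף")))  # ['34', '12']
```

Each number keeps its digits in order, but between Hebrew words the group as a whole is laid out right to left, so the number typed first ends up rightmost on screen.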

But I think your head isn't spinning fast enough just yet. What would happen if we added spaces inside the number grouping itself? I mean, the numbers are internally LTR, but the space is RTL, so we will add a space to break the group and ... and the group will go spinning?

Try it. Add spaces to the number groups below.

See it? SEEEEEEEEEE it?

Yeah. Exactly.

Parentheses and Brackets

As I discussed earlier in this post, brackets and parentheses are, in fact, representing "start-of" and "end-of," which means that depending on where they are inserted, they may appear in different directions on your screen.

So, if I press the button that has a nice little [ on it on my keyboard (below the { and near the P), I will get different results in LTR and RTL.

This means that this code:

LTR:
<span dir="ltr">[</span>
RTL:
<span dir="rtl">[</span>

Becomes this when rendered: LTR: [ RTL: ] Yes, I clicked the same button. Yes, I'm sure. You're welcome to go over the source.
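The key point is that the stored character never changes; only the glyph drawn for it does. This sketch models that with a tiny hand-written table of mirror pairs (a small subset chosen for illustration, not the full Unicode mirroring data) and a hypothetical `displayed_glyph` function.

```python
# Hand-picked subset of mirrored pairs, for illustration only.
MIRROR_PAIRS = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
    "<": ">", ">": "<",
}


def displayed_glyph(ch, direction):
    """Return the glyph a reader sees for `ch` in the given direction."""
    if direction == "rtl":
        return MIRROR_PAIRS.get(ch, ch)
    return ch


stored = "["
print(hex(ord(stored)), displayed_glyph(stored, "ltr"), displayed_glyph(stored, "rtl"))
# 0x5b [ ]
```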

More than being a weird thing, this effect makes it incredibly frustrating when, inside an RTL textbox, there's a need to add some HTML <tags>. And, yes, this happens in Wikipedia, and in the RTL Wikipedias too.

Try adding a <span style="font-size: 2em"> to some segment of the text below. Good luck, stay sane, and remember to breathe. If you feel especially adventurous, you could also try to insert some wikitext, like a link to a page "Somewhere" (English link) with a Hebrew caption.

Want to go even wilder? Add some English text after the Hebrew one, and try to set some <a href="https://opensource.com/something.html"> </a> starting from the Hebrew string, and ending at the English one.

Type it all, don't cheat and copy/paste. Try it for real. Go ahead, play. Experiment. Go RTL crazy.

Online text editors and Wikimedia's VisualEditor

Now that we've gone through a slew of horrible (er, interesting) challenges with working with right-to-left text online, we can see how those can affect the development work of an online text editor. At the Wikimedia Foundation, we've been working on VisualEditor, a WYSIWYG system for editing Wikipedia articles. Not only does it work with converting HTML to Wikipedia's "wikitext" syntax, but it also has to handle multiple languages in multiple directionalities, platforms, browsers and localization environments. Basically, we need to support all the cases we discussed above, and then some. How hard can that be?

As a text editor, VisualEditor expects users to type into it, and that they do. They also do that in multiple languages, and, more often than not, in mixed languages inside the same article. Mixing languages is extremely common, especially in Wikipedia, when there's a need to provide the original script of a word taken from another language, or a city name in its native script, etc.

But as we saw, typing can be tricky, especially when we mix directions. We have to make sure we allow the users to type while seeing the result they will get in the page logically. We also have to make sure that their typing makes sense, and that if there is a need to describe a specific span of text as a different direction, they can do that easily. We have to make sure their input is interpreted correctly, that RTL appears properly in the ContentEditable screen, and then renders properly in the article that is saved.
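One simple way to mark a span of text as a different direction is to wrap it in an element with an explicit dir attribute, like the span examples earlier in this article. The sketch below is not how VisualEditor does it (I am not describing its internals); `wrap_with_direction` is a hypothetical helper that only shows the shape of the idea.

```python
import html


def wrap_with_direction(text, direction):
    """Wrap text in a span that states its direction explicitly."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    # Escape the text so characters such as < and & stay literal.
    return '<span dir="{}">{}</span>'.format(direction, html.escape(text))


print(wrap_with_direction("a<b", "rtl"))
# <span dir="rtl">a&lt;b</span>
```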

Also, as you can see from my above example with the [ character, there's a difference between the HTML code and the resulting rendering. That is, I typed [ but got ] and [ appeared in the code, but ] appeared in my resulting rendered markup. Which should happen inside VisualEditor? WYSIWYG is quite different when what you type is expected to be flipped.

These things aren't impossible to deal with, but they are quite challenging and they often require decision-making about what a user should expect. Most applications online (and offline) have problems dealing with LTR/RTL typing, making these strategic decisions even more complicated. The behavior needs to be designed according to what we think is the best way to do it, and not what the RTL users expect, because as you can see from the current behavior, RTL users usually expect horrendous behavior.

It's the good kind of challenge, though. The kind a lot of people care about finding a good way to fix.

But wait, there's more

There are a bunch of other issues with bidirectional text, some of which are problems that exist in published software and apps online and make RTLers' lives rather annoying. I may write about that at some point, and share my RTL frustration. If you're interested in how those challenges translate to everyday life, you can also visit http://rtl.wtf and witness for yourself what RTL users experience online regularly.

In this article, I went over issues with RTL strings inside LTR boxes, problems with characters of ambiguous directionality, with selections and cursor movements and general "huh"isms. There are, of course, more RTL hardships, but this post was meant to serve as a sort of introduction to the main and most common bidirectionality issues.

I hope you've enjoyed it. At least, I hope you now understand what the programmers (and RTL users!) must deal with.

Sidebars:

Languages and scripts

In this article, I use the term "Language" to refer to English and Hebrew letters. In fact, I should be using the term "Script" to refer to the letters and characters themselves. The difference comes mostly from the fact that, although Hebrew and English are languages, they each use characters that may be used in other languages. For instance, English uses Latin script, and Hebrew script can be used in the Yiddish language as well.

So, take into account that this is the case, and that the actual letters that are used and are LTR or RTL are really "script" and not quite the language, because the browser doesn't really care what words you literally type using these scripts.
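A quick way to see that direction belongs to the characters rather than to the language: the sketch below guesses a character's script from the first word of its Unicode name. This is a rough heuristic (as far as I know, Python's standard library has no direct script lookup), and `script_guess` is a hypothetical name.

```python
import unicodedata


def script_guess(ch):
    """Guess the script from the first word of the Unicode character name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


# "\u05e9" is the Hebrew letter shin, used in both Hebrew and Yiddish words.
for ch in ["a", "\u05e9"]:
    print(script_guess(ch), unicodedata.bidirectional(ch))
# LATIN L
# HEBREW R
```

Whether that shin appears in a Hebrew word or a Yiddish one, it is the same character with the same right-to-left class.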

For the sake of simplicity and to try and reduce confusion, however, I made a tactical decision to group it all up to the most familiar terminology of "Language." (Thanks MatmaRex for pointing out I should at least mention this difference.)


Moriel is a physicist-turned Software Engineer who speaks and thinks right-to-left. She earned a B.Sc in Physics from City College of New York in 2011, with her research focused on “modeling the kinematic equation of a loosely bound spring bouncing down an inclined plane”, or, in plain English, finding the equation of the Slinky.

1 Comment

Wow. Doubly wow! And more: RTL WOW!

IMHO, problems arise from mixing writing systems. If I write a question: "¿Que pasa?", people immediately recognize not only because of the extra interrogation symbol at the start, but mainly because of the surrounding double quotes. This is traditional language usage AFAIK and helps to interpret words more easily in their right context.

Thus, there's no need for overcomplicated AI to discover what is being talked about; we just need to use delimiters (and perhaps some sort of added convention to annotate the language of a text segment) and life will be joyful again.

Except for some problems which are really hairy, like using Hindu numbers with Hebraic. Personally, I think people should go one way or the other... e.g. "Easy as 123" should be something like "321 sa ysaE" when written in Hebraic. There. Simple. Easy. Consistent. But no, it had to be "123 sa ysaE"... (actually the right way to write must be 3-2-1, so I'm forcing things a bit to make my point). This cannot be fixed, but at least, inside a Hebraic context, the complicated behavior can somewhat be predictable... the problem only arises IMHO when text is written in both languages/directions without explicit markings (e.g. using quotes).
Now, for another aspect, it is important to separate a symbol denomination from its function. For example, as shown in the video, the symbols "(" and ")" pose a problem related to how to interpret them in LTR and RTL systems. But the terminologies discussed really don't solve the problem: before/backwards, before/after, start/end... all have to do with position. So, they won't solve this conundrum. We have to use the idea, the desired function. Like in "open/close". These words are more generic and don't indicate any kind of orientation. Which is the LTR open parenthesis? It's "(". And the RTL one? It's ")". When reading a text in both systems, don't say or use "left parenthesis", use "open"... and the software needs to understand (as in searching for a closing match) what is being meant. A special problem is brought by keys which do have a directional nature like the arrows. This is illustrated in the video by the difference between just moving to the right versus selecting text to the right. We also should have (IMHO) different keys: advance and backup. An economic way of doing it is redefining the working of the arrows in text processing contexts, so that they work the same way both in moving and in extending a selection. But what if someone really wants to go to the right? I guess something like AltGr+Right could be used. It's important to note that we cannot use the "advance to the next character" key as the same key used to go to the right. A key used as "advance" will produce different results depending on context. Of course, there will be further problems to address, like in the case of rubber band selection.
Regarding HTML editing, we have a rehash of the same problem, which again can be solved using quotes to escape a string as being represented in a different system. Regarding HTML syntax, well, I believe it cannot go both ways: either it is written LTR or RTL. For convenience, I suppose we could have tags to mark up a section of HTML code to be shown/edited according to a direction. If the standard will require a long time to have these marks (if ever!), some editing programs might use their own internal markers.
Finally, let me say I found the subject important and would like to praise Moriel for a very careful and high-quality preparation. It certainly underlined the importance of the theme. I hope someone fixes that in Linux.
Finally, some observations:
1. Emoticons / emojis will have to be rethought... D: really is not the same as :D
2. This is just the start of the problem. What about RTL and downward (vertical), like traditional Japanese?
3. If you really write mixed English and Hebrew, I guess at some point some kind of AI will ask "Is this English or Hebrew?" should the doubt arise. It may well insert the tags RTL or LTR in the text for you. So there.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.