Accent Folding and Unicode Transliteration

The article Accent Folding for Auto-Complete, written by Carlos Bueno, discusses some ambiguities introduced by Unicode character sets. Go on and read it before continuing here.

Done? Fine.

One of the article’s theses is: You need a unified representation for your textual data, since this will make searches against this data more accurate. A unified representation should, for example, remove accents, effectively mapping á, à and â to a.

This is called accent folding.

Accent folding using String::plain_ascii()

Gyro-PHP supports this in core already: There is the function plain_ascii() on the GyroString class. If $removewhitespace is set to true, which is the default, this function will replace some accented characters with their ASCII counterparts. This works with UTF-8 only, though.

String::plain_ascii() will additionally convert everything to lower case. It also removes whitespace and punctuation characters and replaces them with the given separator, which is usually ‘-’. If you pass an empty string as separator, everything that is neither a letter nor a number is removed completely. This unifies the source even more. Think, for example, of the Cologne airport, which is called “Köln Bonn”. Since camel casing is very trendy at the moment, some people may try “KölnBonn” instead, while others may use “Köln-Bonn”. Using String::plain_ascii() with an empty separator will map all these variants to “koelnbonn”.

Note, however, that German umlauts are translated to a combination of characters: Ü turns into ue, ä turns into ae, etc. In the above example, “Koln-Bonn” will get translated to “kolnbonn”, while “Köln-Bonn” becomes “koelnbonn”, so these two spellings do not match each other.

Let’s have a look at what String::plain_ascii() outputs:

<?php
print String::plain_ascii('Hello, Jürgen!');
// prints "hello-juergen"
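
To make the rules described above a bit more tangible, here is a minimal sketch of the same idea in plain PHP. It is not Gyro-PHP's implementation: the function name fold_to_key() and the small mapping table are assumptions made up for illustration, and the table covers only a handful of characters.

```php
<?php
// Hypothetical helper for illustration only, NOT Gyro-PHP's plain_ascii().
// Assumes this source file and the input are UTF-8 encoded.
function fold_to_key(string $text, string $separator = '-'): string {
    // German umlauts expand to two letters, other accents fold to one.
    // Both cases are listed so no multibyte-aware lowercasing is needed.
    $map = [
        'Ä' => 'ae', 'Ö' => 'oe', 'Ü' => 'ue',
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
        'á' => 'a', 'à' => 'a', 'â' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e',
    ];
    $ascii = strtolower(strtr($text, $map));
    // Replace every run of characters that are not a-z or 0-9 with the separator.
    $folded = preg_replace('/[^a-z0-9]+/', $separator, $ascii);
    // Drop separators left over at the start or end.
    return $separator === '' ? $folded : trim($folded, $separator);
}

print fold_to_key('Hello, Jürgen!') . "\n";  // prints "hello-juergen"
print fold_to_key('Köln Bonn', '') . "\n";   // prints "koelnbonn"
print fold_to_key('KölnBonn', '') . "\n";    // prints "koelnbonn"
print fold_to_key('Köln-Bonn', '') . "\n";   // prints "koelnbonn"
print fold_to_key('Koln-Bonn', '') . "\n";   // prints "kolnbonn"
```

The last two lines show the caveat with umlauts: the spelling without the umlaut ends up with a different key.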

URL building using String::plain_ascii()

String::plain_ascii() was originally introduced to generate pretty URLs, as WordPress and other content management systems do. It can therefore be used as a type for parameterized routes right away:

<?php
new ParameterizedRoute(
  'articles/{id:ui>}-{title:sp}.html',
  $this,
  'articles_view'
);

For an article with title “Hello, Jürgen!” and an id of 5, this will generate the URL /articles/5-hello-juergen.html, which is pretty nice.

However, when handling this route, keep in mind that any URL starting with “/articles/5-” and ending with “.html” will match, e.g. /articles/5-dont-remember-title.html. Therefore, the title must be validated in the action handler function:

<?php
public function action_articles_view(PageData $page_data, $id, $title) {
  $article = Articles::get($id);
  if ($article == false) {
    return self::NOT_FOUND;
  }
  // ActionMapper returns the URL for $article
  $check = ActionMapper::get_url($article, 'view');
  if ($check !== Url::current()->build()) {
    // If not equal to the one invoked, redirect!
    Url::create($check)->redirect(Url::PERMANENT);
  }
  // ... Here goes the code ...
}
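
The same check can be sketched without any framework, which may help to see why it is needed. Everything in this example is an assumption for illustration: the function resolve_article_path(), the $slugs array standing in for the article store, and the returned array shape are not part of Gyro-PHP.

```php
<?php
// Hypothetical sketch, independent of Gyro-PHP.
// $slugs maps article ids to their canonical title slugs.
function resolve_article_path(string $path, array $slugs): array {
    // The pattern accepts ANY title part, just like the route does.
    if (!preg_match('#^/articles/([1-9][0-9]*)-([^/]*)\.html$#', $path, $m)) {
        return ['status' => 404, 'location' => null];
    }
    $id = (int) $m[1];
    if (!isset($slugs[$id])) {
        return ['status' => 404, 'location' => null];
    }
    // Rebuild the one canonical URL and compare it to the requested one.
    $canonical = '/articles/' . $id . '-' . $slugs[$id] . '.html';
    if ($canonical !== $path) {
        return ['status' => 301, 'location' => $canonical];
    }
    return ['status' => 200, 'location' => null];
}

$slugs = [5 => 'hello-juergen'];
$result = resolve_article_path('/articles/5-dont-remember-title.html', $slugs);
print $result['status'] . ' ' . $result['location'] . "\n";
// prints "301 /articles/5-hello-juergen.html"
```

Redirecting permanently to the canonical URL means that only one URL per article ends up being indexed, no matter what title part was requested.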

A broader approach: Unidecode

Accent folding as provided by String::plain_ascii() only covers some frequently used accented characters from what can be called “extended Latin”. It doesn’t cope with Cyrillic letters, for example. Carlos Bueno mentions Perl’s Text::Unidecode module as a solution. There is a Python port of it, and also a PHP port.

The existing PHP port, however, requires the Perl module to be installed and a script to be run against this installation. Since I wanted a more self-contained version, I wrote my own, using the Python port as a base. The result has been released as the module text.unidecode in contributions.

text.unidecode offers a converter that can be used either standalone or through the ConverterFactory, like this:

<?php
print ConverterFactory::encode("北京", CONVERTER_UNIDECODE);
// prints: Bei Jing
print ConverterFactory::encode('Jürgen\'s Café offers à la carte', CONVERTER_UNIDECODE);
// prints: Jurgen's Cafe offers a la carte

As you see, there is no transformation to lower case and no whitespace removal. If you want this, you should pass the unidecode converter’s output to String::plain_ascii().
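
The two steps can be illustrated with a toy version in plain PHP. This is only a sketch under stated assumptions: toy_unidecode() and toy_slug() are made-up names, the lookup table is a tiny hand-written excerpt, and neither function is the code of text.unidecode or String::plain_ascii(). The real tables cover far more of Unicode.

```php
<?php
// Hypothetical sketch: transliterate character by character via a lookup table.
// Assumes this source file and the input are UTF-8 encoded.
function toy_unidecode(string $text): string {
    // Tiny excerpt for illustration; entries for CJK characters carry a
    // trailing space so that consecutive syllables stay separated.
    $table = [
        '北' => 'Bei ', '京' => 'Jing ',
        'ü' => 'u', 'é' => 'e', 'à' => 'a',
    ];
    $out = '';
    // Split the UTF-8 string into single characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    foreach ($chars as $char) {
        if (strlen($char) === 1) {
            $out .= $char;                  // plain ASCII passes through
        } else {
            $out .= $table[$char] ?? '?';   // characters missing from the toy table
        }
    }
    return trim($out);
}

// Second step: lower case, then replace everything else with '-'.
function toy_slug(string $text): string {
    $ascii = strtolower(toy_unidecode($text));
    return trim(preg_replace('/[^a-z0-9]+/', '-', $ascii), '-');
}

print toy_unidecode('北京') . "\n";          // prints "Bei Jing"
print toy_slug('Visiting 北京') . "\n";      // prints "visiting-bei-jing"
print toy_slug('Hello, Jürgen!') . "\n";     // prints "hello-jurgen"
```

Note that ü becomes a plain u here, unlike the ue produced by the umlaut handling of String::plain_ascii().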

URL building using unidecoding

The module text.unidecode also offers a type to be used in parameterized routes. It’s called spu, which stands for “string plain unidecode”. This type will first unidecode the value passed, and then pass the result to String::plain_ascii().

Given the above example:

<?php
new ParameterizedRoute(
  'articles/{id:ui>}-{title:spu}.html',
  $this,
  'articles_view'
);

An article with title “Hello, Jürgen!” will now generate the URL /articles/5-hello-jurgen.html. If the title is “Visiting 北京”, the URL will be /articles/5-visiting-bei-jing.html.

Of course, these URLs must be validated when called, the same way as with the sp route type!