Accent Folding and Unicode Transliteration
The article Accent Folding for Auto-Complete, written by Carlos Bueno, discusses some ambiguities introduced by Unicode character sets. Go and read it before continuing here.
Done? Fine.
One of the article’s theses is: You need a unified representation for your textual data, since this makes searches against this data more accurate. A unified representation should, for example, remove accents, effectively mapping á, à and â to a.
This is called accent folding.
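To make the idea concrete, here is a minimal sketch of accent folding applied to both sides of a comparison. The helpers fold_accents() and folded_equals() are hypothetical names invented for this illustration, and the mapping table is deliberately tiny; a real table would cover far more characters.

```php
<?php
// Hypothetical helper for illustration; not part of Gyro-PHP.
function fold_accents(string $text): string
{
    // Deliberately tiny mapping table; real tables are much larger.
    static $map = [
        'á' => 'a', 'à' => 'a', 'â' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e',
        'Á' => 'A', 'À' => 'A', 'Â' => 'A',
    ];
    // strtr() with an array replaces whole keys, so multi-byte
    // UTF-8 characters are matched as complete byte sequences.
    return strtr($text, $map);
}

// Fold both the stored value and the query before comparing them.
function folded_equals(string $stored, string $query): bool
{
    return fold_accents($stored) === fold_accents($query);
}

var_dump(folded_equals('château', 'chateau')); // bool(true)
```

The important point is that the same folding is applied to the stored data and to the search term, so both end up in the same unified representation.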
Accent folding using String::plain_ascii()
Gyro-PHP supports this in core already: There is the function plain_ascii() on the GyroString class.
If $removewhitespace is set to true, which is the default, this function will replace some accented characters with their ASCII counterparts. This works with UTF-8 only, though.
String::plain_ascii() will additionally convert everything to lower case. It also removes whitespace and punctuation characters and replaces them with the given separator, which is usually ‘-’. If you pass an empty string as separator, non-letters and non-numbers are removed completely. This unifies the source even more. Think, for example, of the Cologne airport, which is called “Köln Bonn”. Since camel casing is very trendy at the moment, however, some people may try “KölnBonn”, while others may use “Köln-Bonn”. Using String::plain_ascii() with an empty separator maps all these cases to “koelnbonn”.
Note, however, that German umlauts are translated to a combination of characters: ü turns into ue, ä turns into ae, and so on. In the above example, “Koln-Bonn” gets translated to “kolnbonn”, while “Köln-Bonn” becomes “koelnbonn”, so the two spellings do not match.
Let’s have a look at what String::plain_ascii() outputs:
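As a rough illustration of the behaviour described above, here is a minimal stand-in that reproduces the examples from this section. The function plain_ascii_sketch() is hypothetical and covers only a handful of characters; it is not Gyro-PHP's implementation, whose mapping table and edge cases may well differ.

```php
<?php
// Hypothetical stand-in that mimics the behaviour described in the text.
// NOT Gyro-PHP's implementation; it knows only a few characters.
function plain_ascii_sketch(string $text, string $separator = '-'): string
{
    // German umlauts become two letters, other accents lose their mark.
    static $map = [
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
        'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
        'á' => 'a', 'à' => 'a', 'â' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e',
    ];
    $text = strtolower(strtr($text, $map));
    // Collapse every run of non-alphanumeric bytes into the separator.
    $text = preg_replace('/[^a-z0-9]+/', $separator, $text);
    // Drop separators left over at the start or end.
    return $separator === '' ? $text : trim($text, $separator);
}

echo plain_ascii_sketch('Köln Bonn', ''), PHP_EOL;   // koelnbonn
echo plain_ascii_sketch('KölnBonn', ''), PHP_EOL;    // koelnbonn
echo plain_ascii_sketch('Köln-Bonn', ''), PHP_EOL;   // koelnbonn
echo plain_ascii_sketch('Koln-Bonn', ''), PHP_EOL;   // kolnbonn
echo plain_ascii_sketch('Hello, Jürgen!'), PHP_EOL;  // hello-juergen
```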
URL building using String::plain_ascii()
String::plain_ascii() was originally introduced to generate pretty URLs, like WordPress or other content management systems do. It can therefore be used as a type for parameterized routes right away:
For an article with the title “Hello, Jürgen!” and an id of 5, this will generate the URL /articles/5-hello-juergen.html, which is pretty nice.
However, when it comes to handling this route, you need to know that any URL starting with “/articles/5-” and ending with “.html” will match, for example /articles/5-dont-remember-title.html. Therefore, the title must be validated in the action handler function:
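The check itself is framework-independent and can be sketched as follows. The function canonical_redirect() is a hypothetical helper, not a Gyro-PHP API, and redirecting to the canonical URL is just one possible reaction to a mismatch; answering with a 404 would be another.

```php
<?php
// Hypothetical helper: returns null if the requested URL is canonical,
// otherwise the URL the handler could redirect to.
function canonical_redirect(int $id, string $requested_slug, string $expected_slug): ?string
{
    if ($requested_slug === $expected_slug) {
        return null;
    }
    return '/articles/' . $id . '-' . $expected_slug . '.html';
}

// $expected_slug would be derived from the stored article title,
// e.g. "Hello, Jürgen!" becomes "hello-juergen" as described above.
$target = canonical_redirect(5, 'dont-remember-title', 'hello-juergen');
if ($target !== null) {
    // A real handler would send a redirect (or a 404) here.
    echo 'Redirect to ', $target, PHP_EOL; // Redirect to /articles/5-hello-juergen.html
}
```

Comparing against the slug computed from the stored title means that exactly one URL per article is accepted as is, which also avoids duplicate content under arbitrary titles.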
A broader approach: Unidecode
Accent folding as provided by String::plain_ascii() covers only some frequently used accented characters from what can be called “extended Latin”. It doesn’t cope with Cyrillic letters, for example.
Carlos Bueno mentions Perl’s Text::Unidecode module as a solution. There is a Python port of it, and also a PHP port.
The existing PHP port, however, requires the Perl module to be installed and a script to be run against this installation.
Since I wanted a more self-contained version, I wrote my own, using the Python port as a base.
The result has been released as module text.unidecode in contributions.
text.unidecode offers a converter that can be used either standalone or through the ConverterFactory like this:
As you can see, there is no transformation to lower case and no whitespace removal. If you want this, you should pass the unidecode converter’s output to String::plain_ascii().
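The general principle behind Unidecode-style transliteration is a table lookup per character. The following sketch shows that principle only: the constant TRANSLIT_SKETCH_TABLE and the function transliterate_sketch() are hypothetical names, the table holds just a few hand-picked entries, and none of this is the text.unidecode module's actual code or API.

```php
<?php
// Hypothetical, heavily truncated lookup table for illustration only.
// Real Unidecode ports ship tables covering most of Unicode.
const TRANSLIT_SKETCH_TABLE = [
    'ü' => 'u', 'ö' => 'o', 'ä' => 'a',
    'М' => 'M', 'о' => 'o', 'с' => 's', 'к' => 'k', 'в' => 'v', 'а' => 'a',
    '北' => 'Bei ', '京' => 'Jing ',
];

function transliterate_sketch(string $text): string
{
    // Split the UTF-8 string into characters; invalid UTF-8 yields no output.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $out = '';
    foreach ($chars as $char) {
        if (strlen($char) === 1) {
            $out .= $char; // plain ASCII passes through unchanged
        } else {
            // Characters missing from the table are dropped in this sketch.
            $out .= TRANSLIT_SKETCH_TABLE[$char] ?? '';
        }
    }
    return $out;
}

echo transliterate_sketch('Москва'), PHP_EOL;        // Moskva
echo transliterate_sketch('Hello, Jürgen!'), PHP_EOL; // Hello, Jurgen!
```

Case, whitespace and punctuation are left untouched here, which is why a second folding step is still needed before the result can serve as a URL fragment.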
URL building using unidecoding
The module text.unidecode also offers a type to be used by parameterized routes. It’s called spu, which stands for “string plain unidecode”. This type will first unidecode the value passed, and then pass the result to String::plain_ascii().
Given the above example:
An article with title “Hello, Jürgen!” will now generate the URL /articles/5-hello-jurgen.html. If the title is “Visiting 北京”, the URL will be /articles/5-visiting-bei-jing.html.
Of course, these URLs must be validated when called, in the same way as with the sp route type!