[Lazarus] UTF8 string compare with correct locale sorting

Jürgen Hestermann
I fully agree with this:

http://www.utf8everywhere.org/

and therefore want to use UTF8 in all my programs.
But the problem is sorting UTF8 strings.
According to

http://forum.lazarus.freepascal.org/index.php?topic=15256.0

UTF8CompareText would be the best choice, and it runs quite fast.
But it does not sort according to the locale: German umlauts, for example,
end up at the end of the list, although they need to be sorted together with
their corresponding non-umlaut characters (Ü with U, Ä with A, and so on).

Does *any* Pascal UTF8 string compare function exist that does this?




--
_______________________________________________
Lazarus mailing list
[hidden email]
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus

Re: [Lazarus] UTF8 string compare with correct locale sorting

Hans-Peter Diettrich
Jürgen Hestermann wrote:

> I fully agree with this:
>
> http://www.utf8everywhere.org/
>
> and therefore want to use UTF8 in all my programs.
> But the problem is sorting UTF8 strings.
> According to
>
> http://forum.lazarus.freepascal.org/index.php?topic=15256.0
>
> UTF8CompareText would be the best choice and it runs quite fast.
> But it does not sort according to the locale: German umlauts, for example,
> end up at the end of the list, although they need to be sorted together with
> their corresponding non-umlaut characters (Ü with U, Ä with A, and so on).
>
> Does *any* Pascal UTF8 string compare function exist that does this?

Such functions should be provided by the system's
Unicode/internationalization library. When they can be located there, a
wrapper can be added to the RTL.

DoDi



Re: [Lazarus] UTF8 string compare with correct locale sorting

Jürgen Hestermann
On 2013-10-17 21:56, Hans-Peter Diettrich wrote:
 > Jürgen Hestermann wrote:
 >> But it does not sort according to the locale: German umlauts, for example,
 >> end up at the end of the list, although they need to be sorted together with
 >> their corresponding non-umlaut characters (Ü with U, Ä with A, and so on).
 >>
 >> Does *any* Pascal UTF8 string compare function exist that does this?
 >
 > Such functions should be provided by the system's Unicode/internationalization library. When they can be located there, a wrapper can be added to the RTL.

Yes, such functions exist. But on Windows, for example, they are not available for UTF-8, only for UTF-16. When I store millions of file names, I already have to convert them from UTF-16 to UTF-8, and now I would need to convert them back again just to get a correct sort order? That would be a real performance hit. It's strange that so many UTF-8 string functions exist, but none that sorts correctly depending on the locale.
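
For illustration, the round trip described here could look roughly like the following Free Pascal sketch. It assumes FPC 3.x, where UTF8Decode returns a UnicodeString and SysUtils.UnicodeCompareText delegates to the installed widestring manager (as far as I know, the Windows API on Windows, and the C library on Unix once the cwstring unit is included). CompareUtf8ByLocale is a hypothetical helper name, not an existing RTL or LCL function, and the outcome depends on the locale that is active at run time.

```pascal
program LocaleCompareDemo;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // assumption: provides a locale-aware widestring manager on Unix
  SysUtils;

// Hypothetical helper: decodes both UTF-8 byte strings to UTF-16 and lets
// the system's widestring manager decide the order (case-insensitive).
function CompareUtf8ByLocale(const A, B: AnsiString): PtrInt;
begin
  Result := UnicodeCompareText(UTF8Decode(A), UTF8Decode(B));
end;

var
  Apfel, Zebra: AnsiString;
begin
  // The umlaut is written as UTF-8 byte escapes ($C3 $84 = U+00C4) so the
  // example does not depend on the encoding of the source file.
  Apfel := #$C3#$84'pfel';
  Zebra := 'Zebra';

  if CompareUtf8ByLocale(Apfel, Zebra) < 0 then
    WriteLn('locale order: the umlaut word sorts before Zebra')
  else
    WriteLn('locale order: the umlaut word sorts after Zebra (check the active locale)');
end.
```

Note that this decodes both operands on every call, which is exactly the cost complained about here; the sketch only shows that the comparison itself is reachable from UTF-8 data.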



Re: [Lazarus] UTF8 string compare with correct locale sorting

Michael Schnell
In reply to this post by Hans-Peter Diettrich
On 10/17/2013 09:56 PM, Hans-Peter Diettrich wrote:
> Jürgen Hestermann wrote:
>> I fully agree with this:
>>
>> http://www.utf8everywhere.org/
>>
>  When they can be located there, a wrapper can be added to the RTL.

The OP seems to claim that with Unicode, localization is obsolete.

If this is not the case, why then use Unicode?
- Michael


Re: [Lazarus] UTF8 string compare with correct locale sorting

Hans-Peter Diettrich
In reply to this post by Jürgen Hestermann
Jürgen Hestermann wrote:

> It's strange that so many UTF-8 string functions exist, but none that
> sorts correctly depending on the locale.

Sorting can be done not only by locale and alphabetically, but also in
phone book order and several other sort orders. In German alone there are a
couple of options for sorting umlauts, and such language-specific rules are
covered by the standard Unicode libraries.

I didn't realize that such functions are supplied only for the target
Unicode representation, thanks for mentioning this topic.

DoDi



Re: [Lazarus] UTF8 string compare with correct locale sorting

Jürgen Hestermann
On 2013-10-18 11:39, Hans-Peter Diettrich wrote:
 > Sorting can be done not only by locale and alphabetically, but also in phone book and more sort orders.

But I don't know of any sort order that puts German umlauts at the end of the whole list (as UTF8CompareText does).
Such a sort order is not usable for me.

 > Just in German you have a couple of options to sort umlauts, and such national stuff is covered by unicode standard libraries.

It seems these libraries are not available for UTF-8 directly (only after a time-consuming conversion to UTF-16).



Re: [Lazarus] UTF8 string compare with correct locale sorting

Jürgen Hestermann
In reply to this post by Michael Schnell
On 2013-10-18 10:43, Michael Schnell wrote:
 > The OP seems to claim that with Unicode, localization is obsolete.

Who claims this?


 > If this is not the case, why then use Unicode?

I thought Unicode only defines an international *encoding* of characters, not a sort order.



Re: [Lazarus] UTF8 string compare with correct locale sorting

Hans-Peter Diettrich
In reply to this post by Jürgen Hestermann
Jürgen Hestermann wrote:
> On 2013-10-18 11:39, Hans-Peter Diettrich wrote:
>  > Sorting can be done not only by locale and alphabetically, but also
> in phone book and more sort orders.
>
> But I don't know of any sort order that puts German umlauts at the end of
> the whole list (as UTF8CompareText does).

Most probably it sorts by codepoints, or by bytes.
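
As a rough illustration of that guess, the following Free Pascal sketch compares the raw UTF-8 bytes of two words with SysUtils.CompareStr. It does not claim to show how UTF8CompareText is implemented; it only demonstrates why any byte-wise or code-point-wise order pushes umlauts behind 'Z'. The variable names are made up for the example, and the umlaut is written as byte escapes so the result does not depend on the source file encoding.

```pascal
program ByteOrderDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

var
  Apfel, Zebra: AnsiString;
begin
  // U+00C4 (A with diaeresis) is encoded in UTF-8 as the two bytes $C3 $84.
  Apfel := #$C3#$84'pfel';
  Zebra := 'Zebra';

  // CompareStr compares byte values. The lead byte $C3 is larger than every
  // ASCII letter ('Z' is $5A), so the umlaut word ends up after 'Zebra'.
  if CompareStr(Apfel, Zebra) > 0 then
    WriteLn('byte order puts the umlaut word after Zebra')
  else
    WriteLn('byte order puts the umlaut word before Zebra');
  // prints: byte order puts the umlaut word after Zebra
end.
```

The same happens with code points: U+00C4 is numerically larger than every unaccented Latin letter, so no locale is needed to explain the observed order.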

> Such a sort order is not usable for me.

Such sorting is useless in almost every language.

>  > Just in German you have a couple of options to sort umlauts, and such
> national stuff is covered by unicode standard libraries.
>
> As it seems these libraries are not available for UTF8 directly (only
> after time consuming conversion to UTF16).

Perhaps the new UnicodeStrings will provide more useful solutions?

DoDi



Re: [Lazarus] UTF8 string compare with correct locale sorting

Hans-Peter Diettrich
In reply to this post by Jürgen Hestermann
Jürgen Hestermann wrote:

> On 2013-10-18 10:43, Michael Schnell wrote:
>  > The OP seems to claim that with Unicode, localization is obsolete.
>
> Who claims this?
>
>
>  > If this is not the case, why then use Unicode ?
>
> I thought Unicode is just for international *coding* of characters but
> not for sort order definition.

The Unicode Consortium has identified many tools and functions that are
required to handle and display Unicode text. Many of them are implemented as
(open source) libraries that a Unicode-aware platform is expected to provide.

DoDi



Re: [Lazarus] UTF8 string compare with correct locale sorting

Jy V
In reply to this post by Jürgen Hestermann

> > Sorting can be done not only by locale and alphabetically, but also in phone book and more sort orders.
>
> But I don't know of any sort order that puts German umlauts at the end of the whole list (as UTF8CompareText does).
> Such a sort order is not usable for me.

You are looking for collation support.
Comparing and sorting require additional information to select the proper collation for your language, and that information is not contained in the UTF-8 encoded string itself.


Re: [Lazarus] UTF8 string compare with correct locale sorting

Michael Schnell
In reply to this post by Jürgen Hestermann
On 10/18/2013 06:16 PM, Jürgen Hestermann wrote:
>
> Who claims this?
Sorry if I over-interpreted your wording.
>
>
> > If this is not the case, why then use Unicode ?
>
> I thought Unicode is just for international *coding* of characters but
> not for sort order definition.

In a Unicode-aware programming language, the handling of Unicode-encoded
strings needs to provide comparison (besides many other string operations,
potentially including conversion between multiple Unicode and
non-Unicode encoding schemes).

If string comparison only allows for "equal" vs. "not equal" results (in
some imaginary language), this is complicated enough, as there can be
multiple different encodings for the same "visual character".
Additionally, it might be desirable to offer both a "case-aware" and a "not
case-aware" operation. To me it's not clear what "case-aware" might mean for
characters of the ancient Egyptian language.
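
To make the "multiple encodings for the same visual character" point concrete, here is a small Free Pascal sketch. It builds the letter Ä twice, once as the precomposed code point U+00C4 and once as 'A' followed by U+0308 (combining diaeresis), and compares the UTF-8 bytes. The variable names are only for the example; nothing here is an existing Lazarus API beyond SysUtils.CompareStr.

```pascal
program EquivalentFormsDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

var
  Precomposed, Decomposed: AnsiString;
begin
  Precomposed := #$C3#$84;     // U+00C4 as UTF-8: one code point, two bytes
  Decomposed  := 'A'#$CC#$88;  // U+0041 + U+0308 as UTF-8: two code points, three bytes

  WriteLn('lengths in bytes: ', Length(Precomposed), ' vs ', Length(Decomposed));

  // Both sequences render as the same glyph, yet a byte-wise test for
  // equality reports them as different.
  if CompareStr(Precomposed, Decomposed) = 0 then
    WriteLn('byte-wise equal')
  else
    WriteLn('byte-wise different');
  // prints: byte-wise different
end.
```

An "equal" test that treats both forms as the same character therefore needs some normalization step before comparing, which is part of what makes even equality non-trivial.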

If string comparison also allows for "greater" vs. "smaller" results, the
programming language needs to impose some sort order (and maybe a lot
more complex "locale"-dependent algorithms). This seems horribly
complicated to me. Rather obviously, you can't define a natural sort order for
the complete set of Unicode characters. Thus some kind of "localization" is
necessary, and it presumably needs to be selectable/definable by the user
via a "locale" or whatever.

-Michael


Re: [Lazarus] UTF8 string compare with correct locale sorting

Jy V

On Mon, Oct 21, 2013 at 10:24 AM, Michael Schnell <[hidden email]> wrote:

> If string compare also allows for "greater" vs "smaller" results the programming language needs to impose some sort order (and maybe a lot more "locale"-depending complex algorithms). This to me seems horribly complicated. Rather obviously you can't define a natural sort order for the complete set of Unicode characters. Thus a kind of "localization" is necessary and supposedly needs to be selectable/definable by the user via "locale" or whatever.

This is the purpose of "collations".


Re: [Lazarus] UTF8 string compare with correct locale sorting

Michael Schnell
On 10/21/2013 01:00 PM, Jy V wrote:
>
> this is the purpose of "Collations"
>
I see:

http://www.unicode.org/reports/tr10/

As expected: horribly complicated.

-Michael


Re: [Lazarus] UTF8 string compare with correct locale sorting

Jy V

>> this is the purpose of "Collations"
>
> I see:
>
> http://www.unicode.org/reports/tr10/
>
> As expected: horribly complicated.

DUCET (Default Unicode Collation Element Table) support has been submitted to the FPC and/or Lazarus source tree by some clever developers:
http://bugs.freepascal.org/view.php?id=24856
I guess it will become available. It may require the user to pass one additional parameter when comparing two strings, so it should not be that difficult to use.




Re: [Lazarus] UTF8 string compare with correct locale sorting

Michael Schnell
On 10/21/2013 07:12 PM, Jy V wrote:
>
> it may require the user to provide 1 additional parameter to compare 2
> strings
> it should not be that difficult to use.

Yep.

Except that traditional Pascal programmers are not used to writing

if compareUTF8String(s1, s2, comparemode) < 0 then ...

but

if s1 < s2 then ....


Thus Delphi and FPC introduce quasi-dynamically encoded strings with
automatic encoding-type handling.

Unexpected comparison results are hardly avoidable here :-( .

-Michael


Re: [Lazarus] UTF8 string compare with correct locale sorting

Sven Barth
On 22.10.2013 09:37, Michael Schnell wrote:

> On 10/21/2013 07:12 PM, Jy V wrote:
>>
>> it may require the user to provide 1 additional parameter to compare
>> 2 strings
>> it should not be that difficult to use.
>
> Yep.
>
> Only that traditional Pascal programmers are not used to do
>
> if compareUTF8String(s1, s2, comparemode) < 0 then ...
>
> but
>
> if s1 < s2 then ....

I myself am more used to "CompareText/Str" than "<" or ">" as I didn't
know until around 1 or 2 years ago that "<" and ">" are supported on
strings at all ^^

Regards,
Sven


Re: [Lazarus] UTF8 string compare with correct locale sorting

Michael Schnell
On 10/22/2013 10:24 AM, Sven Barth wrote:
> I didn't know until around 1 or 2 years ago that "<" and ">" are
> supported on strings at all

Nice try O:-)

-Michael


Re: [Lazarus] UTF8 string compare with correct locale sorting

Lukasz Sokol
On 22/10/13 09:35, Michael Schnell wrote:
> On 10/22/2013 10:24 AM, Sven Barth wrote:
>> I didn't know until around 1 or 2 years ago that "<" and ">" are supported on strings at all
>
> Nice try O:-)
>
> -Michael
>
> --

And (probably) overloaded operators are your friends here?

operator < (S1, S2: UTF8String) b : boolean;
begin
  b := (compareUTF8String(s1, s2, comparemode) < 0)
end;

-L.



Re: [Lazarus] UTF8 string compare with correct locale sorting

Sven Barth
On 22.10.2013 11:36, Lukasz Sokol wrote:

> On 22/10/13 09:35, Michael Schnell wrote:
>> On 10/22/2013 10:24 AM, Sven Barth wrote:
>>> I didn't know until around 1 or 2 years ago that "<" and ">" are supported on strings at all
>> Nice try O:-)
>>
>> -Michael
>>
>> --
> And (probably) overloaded operators are your friends here?
>
> operator < (S1, S2: UTF8String) b : boolean;
> begin
>    b := (compareUTF8String(s1, s2, comparemode) < 0)
> end;

Nope: because strings already have the "<" and ">" operators defined, you
cannot overload them.
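
One possible way around this restriction, sketched here under assumptions, is to overload "<" for a small wrapper record instead of for the string type itself. TCollatedText and Collated are hypothetical names, not RTL or LCL identifiers; the comparison uses SysUtils.UnicodeCompareText, so the result depends on the active locale and, on Unix, on the cwstring unit being included. FPC 3.x is assumed.

```pascal
program CollatedOperatorDemo;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // assumption: provides a locale-aware widestring manager on Unix
  SysUtils;

type
  // Hypothetical wrapper type. The UTF-8 input is decoded once when the
  // value is created, so repeated comparisons do not convert again.
  TCollatedText = record
    Key: UnicodeString;
  end;

function Collated(const Utf8Bytes: AnsiString): TCollatedText;
begin
  Result.Key := UTF8Decode(Utf8Bytes);
end;

// Overloading "<" is possible here because the operands are records,
// not one of the built-in string types.
operator < (const A, B: TCollatedText) R: Boolean;
begin
  R := UnicodeCompareText(A.Key, B.Key) < 0;
end;

var
  X, Y: TCollatedText;
begin
  X := Collated(#$C3#$84'pfel');  // umlaut written as UTF-8 byte escapes
  Y := Collated('Zebra');

  if X < Y then
    WriteLn('the umlaut word sorts before Zebra in the active locale')
  else
    WriteLn('the umlaut word does not sort before Zebra in the active locale');
end.
```

Whether this is nicer than calling a compare function directly is a matter of taste; the wrapper mainly keeps the "s1 < s2" notation while making the choice of sort order explicit in the type.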

Regards,
Sven


Re: [Lazarus] UTF8 string compare with correct locale sorting

Michael Schnell
I get the feeling that _Closed_/_Open_Strings_ (->
http://en.wikipedia.org/wiki/String_theory#Strings ) are easier to
understand and of more practical use than _Unicode_Strings_ .

Thus an IDE / language / library that does not completely hide the
complexity behind Unicode (and its different encoding schemes) should
provide a "non-Unicode" mode that allows for happy programming (at least
in Europe, America and Australia).

-Michael
