[Lazarus] substr return wrong string with some utf8 char

[Lazarus] substr return wrong string with some utf8 char

C Pomalo
hello

I use the substr function to get truncated strings.
These strings are in French and sometimes contain characters such as "à" or
"é".
When these characters are in the string, the wrong substring is returned:
substr('pierre à feu', 11) returns 'pierre à  f', not 'pierre à fe'

length('à') returns 2
utf8length('à') returns 1

Does a utf8Subst()-like function exist?

Thanks
Claude



--
_______________________________________________
Lazarus mailing list
[hidden email]
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus

Re: [Lazarus] substr return wrong string with some utf8 char

Vincent Snijders
2011/2/10 claude Pomalo <[hidden email]>:

> hello
>
> I use the substr function to get truncated strings.
> These strings are in French and sometimes contain characters such as "à" or
> "é".
> When these characters are in the string, the wrong substring is returned:
> substr('pierre à feu', 11) returns 'pierre à  f', not 'pierre à fe'
>
> length('à') returns 2
> utf8length('à') returns 1
>
> Does a utf8Subst()-like function exist?
>
>

http://lazarus-ccr.sourceforge.net/docs/lcl/lclproc/utf8copy.html

Vincent


Re: [Lazarus] substr return wrong string with some utf8 char

Graeme Geldenhuys
In reply to this post by C Pomalo
On 2011-02-10 12:27, claude Pomalo wrote:
>
> length('à') returns 2

Length() returns the number of bytes.

> utf8length('à') returns 1

This returns the number of characters (strictly, UTF-8 code points).


> Does a utf8Subst()-like function exist?

UTF8Copy(), in the LCLProc unit of the LCL.
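
To make the byte count versus character count distinction concrete, here
is a minimal sketch. It deliberately avoids the LCL so that it compiles
with plain FPC; CodePointCount is a hypothetical helper written for this
illustration, not the LCL's UTF8Length, and it assumes well-formed UTF-8.

```pascal
program Utf8LengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper (not an LCL routine): counts UTF-8 code points by
// skipping continuation bytes, which always have the form 10xxxxxx.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

const
  // 'pierre à feu' with 'à' written as its two UTF-8 bytes ($C3 $A0),
  // so the result does not depend on the source file encoding.
  Sample = 'pierre '#$C3#$A0' feu';
begin
  WriteLn('bytes:       ', Length(Sample));
  WriteLn('code points: ', CodePointCount(Sample));
end.
```

For this sample the two numbers differ by one, because 'à' occupies two
bytes but is a single code point.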



Regards,
  - Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/



Re: [Lazarus] substr return wrong string with some utf8 char

Hans-Peter Diettrich
In reply to this post by C Pomalo
claude Pomalo wrote:

> I use the substr function to get truncated strings.
> These strings are in French and sometimes contain characters such as "à" or
> "é".
> When these characters are in the string, the wrong substring is returned:
> substr('pierre à feu', 11) returns 'pierre à  f', not 'pierre à fe'
>
> length('à') returns 2
> utf8length('à') returns 1

It looks to me as if Length returns the byte count, while Utf8Length
returns the character count.

> Does a utf8Subst()-like function exist?

I'd use Pos to get the byte index of the part to copy or remove.

DoDi



Re: [Lazarus] substr return wrong string with some utf8 char

Hans-Peter Diettrich
In reply to this post by Vincent Snijders
Vincent Snijders wrote:

>> Does a utf8Subst()-like function exist?
>>
>>
>
> http://lazarus-ccr.sourceforge.net/docs/lcl/lclproc/utf8copy.html

Do you realize how useless that "documentation" is?

DoDi



Re: [Lazarus] substr return wrong string with some utf8 char

Graeme Geldenhuys
On 2011-02-10 16:24, Hans-Peter Diettrich wrote:
>>
>> http://lazarus-ccr.sourceforge.net/docs/lcl/lclproc/utf8copy.html
>
> Do you realize how useless that "documentation" is?

:-)
I think Michael van Canneyt got it right with FPC and FCL documentation
- don't publish what isn't documented. Currently code-navigation is more
useful than those pages.


I would love to know what MakeMinMax() does (I did not look at the
code)? From the procedure name itself, it doesn't give a lot of clues. I
would guess it returns some Min and Max values (based on the var
parameters), but from what source?  :-)

http://lazarus-ccr.sourceforge.net/docs/lcl/lclproc/makeminmax.html


Regards,
  - Graeme -


Re: [Lazarus] substr return wrong string with some utf8 char

Graeme Geldenhuys
In reply to this post by Hans-Peter Diettrich
On 2011-02-10 16:20, Hans-Peter Diettrich wrote:
>
>> Does a utf8Subst()-like function exist?
>
> I'd use Pos to get the byte index of the part to copy or remove.

Make that UTF8Pos() (in combination with UTF8Copy) because a UTF-8
character can be anything from 1–4 bytes long.
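
A rough sketch of what a character-based copy has to do, given that each
character takes 1 to 4 bytes. SeqLen and CopyChars are hypothetical names
invented for this illustration; they are not the LCL implementation of
UTF8Copy, and they assume the input is already well-formed UTF-8.

```pascal
program Utf8CopyDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the UTF-8 sequence that starts with
// lead byte B. No validation is done; a stray byte is stepped over as 1.
function SeqLen(B: Byte): Integer;
begin
  if B < $80 then Result := 1                   // 0xxxxxxx (ASCII)
  else if (B and $E0) = $C0 then Result := 2    // 110xxxxx
  else if (B and $F0) = $E0 then Result := 3    // 1110xxxx
  else if (B and $F8) = $F0 then Result := 4    // 11110xxx
  else Result := 1;
end;

// Hypothetical character-based Copy: StartChar and CharCount are counted
// in code points and translated to byte offsets before calling Copy.
function CopyChars(const S: string; StartChar, CharCount: Integer): string;
var
  BytePos, StartByte: Integer;
begin
  BytePos := 1;
  while (StartChar > 1) and (BytePos <= Length(S)) do
  begin
    Inc(BytePos, SeqLen(Ord(S[BytePos])));
    Dec(StartChar);
  end;
  StartByte := BytePos;
  while (CharCount > 0) and (BytePos <= Length(S)) do
  begin
    Inc(BytePos, SeqLen(Ord(S[BytePos])));
    Dec(CharCount);
  end;
  Result := Copy(S, StartByte, BytePos - StartByte);
end;

const
  Sample = 'pierre '#$C3#$A0' feu';   // 'pierre à feu' as UTF-8 bytes
begin
  WriteLn(Length(Copy(Sample, 1, 11)));       // byte-based: 11 bytes
  WriteLn(Length(CopyChars(Sample, 1, 11)));  // 11 characters: 12 bytes
end.
```

The byte-based Copy stops one character short because the two bytes of
'à' use up two of the eleven requested positions.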



Regards,
  - Graeme -


Re: [Lazarus] substr return wrong string with some utf8 char

Vincent Snijders
In reply to this post by Hans-Peter Diettrich
2011/2/10 Hans-Peter Diettrich <[hidden email]>:
>>
>> http://lazarus-ccr.sourceforge.net/docs/lcl/lclproc/utf8copy.html
>
> Do you realize how useless that "documentation" is?

I think it tells people which unit to add to the uses clause. I think
that is useful information.

Vincent


Re: [Lazarus] substr return wrong string with some utf8 char

Michael Schnell
On 02/10/2011 04:31 PM, Vincent Snijders wrote:
> I think it tells people which unit to add to the uses clause.
Often enough it does not do that when you need it, because "F1" on an LCL
procedure only takes you to that documentation when the unit is already
in the uses clause. =-O

-Michael


Re: [Lazarus] substr return wrong string with some utf8 char

Michael Schnell
In reply to this post by Hans-Peter Diettrich
On 02/10/2011 03:20 PM, Hans-Peter Diettrich wrote:
>
>> length('à') returns 2
>> utf8length('à') returns 1
>
I think that, according to the definition of UTF8String, it is correct that
Length(s) provides the byte count. I do hope that with "NewStrings" this
might change some day, as it is quite confusing for anybody who does not
want to be bothered with the Unicode internals.

-Michael


Re: [Lazarus] substr return wrong string with some utf8 char

Michael Schnell
In reply to this post by Graeme Geldenhuys
On 02/10/2011 04:00 PM, Graeme Geldenhuys wrote:
>
> Make that UTF8Pos() (in combination with UTF8Copy) because a UTF-8
> character can be anything from 1–4 bytes long.
>
Obviously UTF8Pos() and UTF8Copy() are a lot slower than Pos() and
Copy(), and in many cases (i.e. when the arguments of Copy are obtained
from Pos(), Length(), ...) Copy does what is requested. If single
characters are the point of interest, this will not work. (Some
entry-level documentation on this that is easy to find and read would be
nice, though :-) .)

-Michael


Re: [Lazarus] substr return wrong string with some utf8 char

José Mejuto
Hello Lazarus-List,

Friday, February 11, 2011, 9:24:21 AM, you wrote:

>> Make that UTF8Pos() (in combination with UTF8Copy) because a UTF-8
>> character can be anything from 1–4 bytes long.
MS> Obviously UTF8Pos() and UTF8Copy() are a lot slower than Pos() and
MS> Copy(), and in many cases (i.e. when the arguments of Copy are obtained

If no checks on UTF-8 integrity are performed, they should not be
"a lot slower", only a bit slower; at least UTF8Pos should not, while
UTF8Copy is certainly slower.
A separate issue is that the current implementation is a bit
over-engineered, which adds some overhead.

Is it logical/safe that the UTF-8 functions do not check UTF-8 integrity?
I'm talking about utf8pos, utf8copy, etc...

--
Best regards,
 José



Re: [Lazarus] substr return wrong string with some utf8 char

Hans-Peter Diettrich
In reply to this post by Michael Schnell
Michael Schnell wrote:
> On 02/10/2011 03:20 PM, Hans-Peter Diettrich wrote:
>>
>>> length('à') returns 2
>>> utf8length('à') returns 1
>>
> I think that, according to the definition of UTF8String, it is correct that
> Length(s) provides the byte count. I do hope that with "NewStrings" this
> might change some day, as it is quite confusing for anybody who does not
> want to be bothered with the Unicode internals.

Length() is bound to the physical (array) size, a redefinition would
break this established rule.

MBCS users have always had to live with this problem, and UTF-8 is an
MBCS. I'm not sure whether the difference between the number of characters
(glyphs) and the number of codepoints can be eliminated by any approved
convention.

IMO it's a good idea to forget about "char" in dealing with Unicode/UTF
strings, and only use (sub)strings. This is not a major problem, since
Pascal does not distinguish between char and string literals.

Obviously this code will fail with UTF-8 encoding:
   var a: char = 'à'; //or '`a'?
and even UTF-32 may fail with ligatures or other character combinations.

Some "NewStrings" model IMO should at least distinguish between ASCII,
ANSI and UTF strings:

ASCII: never convert, codes above #$7F are undefined (maybe raw data).
ANSI: SBCS according to a specific codepage.
UCS2: a possible Unicode subset (BMP) of 2-byte (WideChar) characters.
UTF: anything else, with unrelated character and byte counts.

This would at least make happy those coders who are used to dealing with
SBCS and who write applications for local/national use. All coders, in
particular the English (ASCII) speakers, have to learn about UTF and MBCS
when dealing with UTF strings (apart from assignment and display).

DoDi



Re: [Lazarus] substr return wrong string with some utf8 char

Lukasz Sokol
In reply to this post by José Mejuto
On 11/02/2011 09:41, José Mejuto wrote:

> If no checks on UTF-8 integrity are performed, they should not be
> "a lot slower", only a bit slower; at least UTF8Pos should not, while
> UTF8Copy is certainly slower.
> A separate issue is that the current implementation is a bit
> over-engineered, which adds some overhead.
>
> Is it logical/safe that the UTF-8 functions do not check UTF-8 integrity?
> I'm talking about utf8pos, utf8copy, etc...
>

Maybe make the sanity check optional, defaulting to true?
Or add some unit-level flag, default true, so the utf* routines force the
check if told to?

Not that I know anything about this code, but why not let people who
know what they are doing skip the check, or call it when they know they
need to? (And as I said, in order not to break existing code, the default
would be to always check.)

My £0.000002.

Lukasz



Re: [Lazarus] substr return wrong string with some utf8 char

Michael Schnell
In reply to this post by Hans-Peter Diettrich
On 02/11/2011 12:49 PM, Hans-Peter Diettrich wrote:
>
>
> Some "NewStrings" model IMO should at least distinguish between ASCII,
> ANSI and UTF strings:
With a future "NewStrings" implementation I mean a dynamically coded
string type that can hold e.g. "ASCII code page xxxx", "UTF8", "UTF16",
or "UTF32" content and knows about what is stored and how. So "Length"
with this type can be defined as "character count" and copy can work on
character length and position, and automatically convert strings if they
are coded differently.

Of course certain operations might be really slow if the encoding of the
data is not appropriate.

-Michael


Re: [Lazarus] substr return wrong string with some utf8 char

Hans-Peter Diettrich
Michael Schnell wrote:

> With a future "NewStrings" implementation I mean a dynamically coded
> string type that can hold e.g. "ASCII code page xxxx", "UTF8", "UTF16",
> or "UTF32" content and knows about what is stored and how.

How would you determine the byte count for reading and writing text?

> So "Length"
> with this type can be defined as "character count" and copy can work on
> character length and position, and automatically convert strings if they
> are coded differently.

I don't like automatic string conversion, because:
> Of course certain operations might be really slow if the encoding of the
> data is not appropriate.

Consider what will happen when every procedure or component has its
*own* idea of the "appropriate" encoding...

DoDi


Re: [Lazarus] substr return wrong string with some utf8 char

José Mejuto
In reply to this post by Lukasz Sokol
Hello Lazarus-List,

Friday, February 11, 2011, 1:38:58 PM, you wrote:

>> Is it logical/safe that the UTF-8 functions do not check UTF-8 integrity?
>> I'm talking about utf8pos, utf8copy, etc...
LS> Maybe make the sanity check optional, defaulting to true?
LS> Or add some unit-level flag, default true, so the utf* routines force the
LS> check if told to?
LS> Not that I know anything about this code, but why not let people who
LS> know what they are doing skip the check, or call it when they know they
LS> need to? (And as I said, in order not to break existing code, the default
LS> would be to always check.)

The current code does not perform a sanity check, and I think most
functions should not perform one. Only conversion functions and a
"sanitize" function should perform the checks; otherwise most functions
will degrade in speed even when you know that the data is UTF-8 compliant.
The same applies to UTF16.
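
As an illustration of keeping the check in one dedicated routine, here is
a minimal sketch. LooksLikeUtf8 is a hypothetical name, not an existing
LCL function, and it only checks structure (each lead byte followed by
the right number of continuation bytes). It does not reject overlong
forms or surrogates, so it is weaker than a full validator.

```pascal
program Utf8ValidateDemo;
{$mode objfpc}{$H+}

// Hypothetical structural check: every lead byte must be followed by the
// expected number of 10xxxxxx continuation bytes.
function LooksLikeUtf8(const S: string): Boolean;
var
  I, Need: Integer;
  B: Byte;
begin
  Result := False;
  I := 1;
  while I <= Length(S) do
  begin
    B := Ord(S[I]);
    if B < $80 then Need := 0
    else if (B and $E0) = $C0 then Need := 1
    else if (B and $F0) = $E0 then Need := 2
    else if (B and $F8) = $F0 then Need := 3
    else Exit;                  // stray continuation or invalid lead byte
    Inc(I);
    while Need > 0 do
    begin
      if (I > Length(S)) or ((Ord(S[I]) and $C0) <> $80) then
        Exit;                   // sequence is truncated
      Inc(I);
      Dec(Need);
    end;
  end;
  Result := True;
end;

begin
  WriteLn(LooksLikeUtf8('pierre '#$C3#$A0' feu'));  // well-formed
  WriteLn(LooksLikeUtf8('pierre '#$C3' feu'));      // lead byte, no tail
end.
```

Data would be passed through such a routine once, at the point where it
enters the program, so that the search and copy routines can skip the
check.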

--
Best regards,
 José



Re: [Lazarus] substr return wrong string with some utf8 char

Hans-Peter Diettrich
In reply to this post by José Mejuto
José Mejuto wrote:

> If no checks on UTF-8 integrity are performed, they should not be
> "a lot slower", only a bit slower; at least UTF8Pos should not, while
> UTF8Copy is certainly slower.

I see no need for integrity checks, when the procedures are called with
reasonable arguments. Before e.g. Copy can be called, the required
parameters have to be determined, and *this* is where the use of the
appropriate functions will automatically return valid arguments.

> A separate issue is that the current implementation is a bit
> over-engineered, which adds some overhead.
>
> Is it logical/safe that the UTF-8 functions do not check UTF-8 integrity?
> I'm talking about utf8pos, utf8copy, etc...

There exists no need for an utf8pos function, for use with an utf8copy,
when Pos already returns the correct start index for Copy. Only the
count parameter deserves different handling in utf8copy - where the
determination of the byte count can be done once, e.g. in an
(UTF8)ByteCount function. Then Copy can allocate immediately the
requested number of bytes, then move the same number of bytes. The use
of the ByteCount function is not required when the end index is already
known, from e.g. another Pos call.
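
The point about Pos and Copy can be sketched as follows. This is only an
illustration using the standard Pos and Copy on a UTF-8 byte string; the
names Sample, Accent, Head and Tail are made up for the example. It
relies on the fact that, in well-formed UTF-8, a well-formed search
string can only match at a character boundary.

```pascal
program PosCopyDemo;
{$mode objfpc}{$H+}

const
  Sample = 'pierre '#$C3#$A0' feu';   // 'pierre à feu' as UTF-8 bytes
  Accent = #$C3#$A0;                  // 'à' as UTF-8 bytes
var
  P: Integer;
  Head, Tail: string;
begin
  P := Pos(Accent, Sample);                          // byte index of 'à'
  Head := Copy(Sample, 1, P - 1);                    // text before 'à'
  Tail := Copy(Sample, P + Length(Accent), MaxInt);  // text after 'à'
  WriteLn(P);
  WriteLn(Length(Head), ' ', Length(Tail));
end.
```

Because both the start index and the lengths are byte values obtained
from Pos and Length, neither Copy call can split a multi-byte character.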

It also would help to ensure text integrity when indexed access to
bytes/chars in (MBCS/UTF) strings simply would be dropped. Then either a
different string type or different access methods have to be used, at
the choice of the coder.

DoDi



Re: [Lazarus] substr return wrong string with some utf8 char

Michael Schnell
In reply to this post by Hans-Peter Diettrich
On 02/11/2011 02:26 PM, Hans-Peter Diettrich wrote:
>
> How would you determine the byte count for reading and writing text?
E.g. when using a stream? Good question. As, AFAIK, this is no more than
a still-incomplete project in SVN, I don't know.

>
>> So "Length" with this type can be defined as "character count" and
>> copy can work on character length and position, and automatically
>> convert strings if they are coded differently.
>
> I don't like automatic string conversion, because:
>> Of course certain operations might be really slow if the encoding of
>> the data is not appropriate.
>
> Consider what will happen when every procedure or component has its
> *own* idea of the "appropriate" encoding...

As always, comfort can be traded against speed. If the user wants speed
he needs to take care that as few conversions as possible are done.

If he just uses this string type and does not explicitly enforce an
encoding, no conversion is necessary except on entry to and exit from his
code. And the same code will work without re-coding for all encodings
used at entry and exit, provided they are all identical.

E.g. the Windows System API will use UTF-16, while the Linux System API
uses UTF-8 for things like "caption" and "Text". The (even binary)
unmodified user code will not need to do conversions for this kind of
GUI work. (AFAIK, string constants are re-encoded on the first use, if
necessary).

-Michael


Re: [Lazarus] substr return wrong string with some utf8 char

José Mejuto
In reply to this post by Hans-Peter Diettrich
Hello Lazarus-List,

Friday, February 11, 2011, 3:06:50 PM, you wrote:

>> Is it logical/safe that the UTF-8 functions do not check UTF-8 integrity?
>> I'm talking about utf8pos, utf8copy, etc...
HPD> There exists no need for an utf8pos function, for use with an utf8copy,

Nothing is needed for utf8copy ;) utf8pos is needed to return the
"character" position within a string; whether that is used for utf8copy
or to display the information somewhere is a different matter.

HPD> when Pos already returns the correct start index for Copy. Only the

If no integrity checks are performed, yes, it returns a valid index
position.

--
Best regards,
 José

