| Author |
Message |
Guest
|
Posted:
Wed Jan 26, 2005 3:47 am Post subject:
utf-8 to ascii |
|
|
I have a question: how can I generate two files, one in UTF-8 and the other
in ASCII, with the same column length,
so that when I do the conversion from UTF-8 to ASCII the column length
does not change? Any help is appreciated.
Thanks |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Jan 26, 2005 1:13 pm Post subject:
Re: utf-8 to ascii |
|
|
mail2atulmehta@yahoo.com wrote:
| Quote: | I have a question: how can I generate two files, one in UTF-8 and the other
in ASCII, with the same column length,
so that when I do the conversion from UTF-8 to ASCII the column length
does not change? Any help is appreciated.
|
I have absolutely _no_ idea what this has to do with comp.arch!?
However, the solution is simple:
Make sure that your utf8-encoded data consists of nothing but 7-bit US
ASCII! :-)
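A minimal sketch of that check, assuming the data is available as raw bytes (the helper name `is_seven_bit_ascii` is made up for illustration). If no byte has its high bit set, the UTF-8 file and the ASCII file are byte-for-byte identical, so no column can shift:

```python
def is_seven_bit_ascii(data: bytes) -> bool:
    """Return True if no byte has its high bit (0x80) set."""
    return all(b < 0x80 for b in data)


plain = "Republique Francaise".encode("utf-8")
accented = "Republique Fran\u00e7aise".encode("utf-8")  # \u00e7 is c-cedilla

print(is_seven_bit_ascii(plain))     # True
print(is_seven_bit_ascii(accented))  # False

# For pure 7-bit data the two encodings produce the same bytes.
print(plain == "Republique Francaise".encode("ascii"))  # True
```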
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
Bernd Paysan
Guest
|
Posted:
Wed Jan 26, 2005 2:25 pm Post subject:
Re: utf-8 to ascii |
|
|
Terje Mathisen wrote:
| Quote: | However, the solution is simple:
Make sure that your utf8-encoded data consists of nothing but 7-bit US
ASCII! :-)
|
;-). Or use wcwidth() for each character that is > 0x7F, and emit that
number of '?'s.
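A rough sketch of that idea in Python. The C function wcwidth() is not in the Python standard library, so `approx_width` below is a hypothetical stand-in built on `unicodedata`; it only approximates what a real wcwidth() would report:

```python
import unicodedata


def approx_width(ch: str) -> int:
    """Approximate the number of display columns for one character.

    Stand-in for wcwidth(): 0 for combining/format characters,
    2 for East Asian wide or fullwidth characters, 1 otherwise.
    """
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def to_ascii_same_width(text: str) -> str:
    """Replace every character above 0x7F with as many '?' as it is wide."""
    out = []
    for ch in text:
        if ord(ch) > 0x7F:
            out.append("?" * approx_width(ch))
        else:
            out.append(ch)
    return "".join(out)


print(to_ascii_same_width("Republique Fran\u00e7aise"))  # Republique Fran?aise
```

The result contains only ASCII and occupies the same number of columns as the input, under the width approximation above.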
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/ |
|
| Back to top |
|
 |
HP
Guest
|
Posted:
Thu Jan 27, 2005 1:28 am Post subject:
Re: utf-8 to ascii |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message news:ct7jfm$nc9$2@osl016lin.hda.hydro.com...
| Quote: | mail2atulmehta@yahoo.com wrote:
I have a question: how can I generate two files, one in UTF-8 and the other
in ASCII, with the same column length,
so that when I do the conversion from UTF-8 to ASCII the column length
does not change? Any help is appreciated.
I have absolutely _no_ idea what this has to do with comp.arch!?
However, the solution is simple:
Make sure that your utf8-encoded data consists of nothing but 7-bit US ASCII! :-)
|
That fact could be used as a low CPU power test of the subversiveness of
a text - you just need to check the top bits in each character.
If any of them are set, tag the text for further processing. Otherwise the text
must have come from an American and therefore be OK. It's like the evil
bit proposal, only much more cunning.
HP |
|
| Back to top |
|
 |
mail2atulmehta@yahoo.com
Guest
|
Posted:
Tue Feb 01, 2005 10:44 pm Post subject:
Re: utf-8 to ascii |
|
|
Sorry for the confusion. Here is what I meant to say.
I am generating a file (a .txt file, which is opened with Notepad);
the file holds data from some tables. The tables have a fixed column
length, yet when I open the file in Notepad the column length changes. For
example, the data in one of the columns is Republique Française. Suppose the
field length in the table (a FoxPro database) is 75. Yet when I
open it in Notepad it becomes 74. My problem is that when the
encoding changes from ASCII to UTF-8, the field length (or the column
length) for that value also changes. I know it is happening because the number
of bits used per character differs between ASCII and UTF-8. Is there some way I can
keep the column length fixed at 75?
Any help is appreciated. |
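One plausible way to end up with 74 instead of 75, sketched under the assumption that the field is padded to 75 bytes after the value has been encoded as UTF-8 (the post does not say how the file is actually written). The ç takes two bytes in UTF-8, so a 75-byte field holds only 74 characters:

```python
value = "Republique Fran\u00e7aise"  # \u00e7 is c-cedilla
encoded = value.encode("utf-8")

print(len(value))    # 20 characters
print(len(encoded))  # 21 bytes, because the c-cedilla needs two bytes

# Pad the encoded value to a fixed width of 75 *bytes*.
field = encoded.ljust(75, b" ")
print(len(field))                  # 75 bytes
print(len(field.decode("utf-8")))  # 74 characters, as seen in the editor
```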
|
| Back to top |
|
 |
Bill Todd
Guest
|
Posted:
Tue Feb 01, 2005 11:11 pm Post subject:
Re: utf-8 to ascii |
|
|
mail2atulmehta@yahoo.com wrote:
| Quote: | Sorry for the confusion. Here is what I meant to say.
I am generating a file (a .txt file, which is opened with Notepad);
the file holds data from some tables. The tables have a fixed column
length, yet when I open the file in Notepad the column length changes. For
example, the data in one of the columns is Republique Française. Suppose the
field length in the table (a FoxPro database) is 75. Yet when I
open it in Notepad it becomes 74. My problem is that when the
encoding changes from ASCII to UTF-8, the field length (or the column
length) for that value also changes. I know it is happening because the number
of bits used per character differs between ASCII and UTF-8. Is there some way I can
keep the column length fixed at 75?
Any help is appreciated.
|
Analyze the UTF-8 input and compensate explicitly in the padding for
multi-byte characters (this may only require examining the high bit of
each input byte),
or
set up your columns filled with spaces and use overstrikes to populate them.
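The first suggestion could look roughly like this. It is only a sketch: the helper names are invented, the input is assumed to be valid UTF-8, and it counts code points rather than display columns. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so counting the bytes that do not match that pattern gives the number of characters:

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def pad_utf8_field(data: bytes, columns: int) -> bytes:
    """Pad with spaces until the field holds `columns` characters."""
    chars = utf8_char_count(data)
    if chars > columns:
        raise ValueError("value is longer than the field")
    return data + b" " * (columns - chars)


raw = "Republique Fran\u00e7aise".encode("utf-8")  # \u00e7 is c-cedilla
padded = pad_utf8_field(raw, 75)

print(len(padded))                  # 76 bytes
print(len(padded.decode("utf-8")))  # 75 characters
```

The byte length of the field now varies with the content, but the number of characters per column stays fixed at 75.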
- bill |
|
| Back to top |
|
 |
Stefan Monnier
Guest
|
Posted:
Wed Feb 02, 2005 1:07 am Post subject:
Re: utf-8 to ascii |
|
|
| Quote: | length) for that value also changes. I know it is happening because the number
of bits used per character differs between ASCII and UTF-8. Is there some way I can
keep the column length fixed at 75?
Any help is appreciated.
|
Is this on a recent processor? 64bit? 32bit?
Maybe it's the infamous problem of unaligned access?
Stefan |
|
| Back to top |
|
 |
Ketil Malde
Guest
|
Posted:
Wed Feb 02, 2005 1:46 pm Post subject:
Re: utf-8 to ascii |
|
|
"mail2atulmehta@yahoo.com" <mail2atulmehta@yahoo.com> writes:
| Quote: | length) for that value also changes. I know it is happening because the number
of bits used per character differs between ASCII and UTF-8. Is there some way I can
keep the column length fixed at 75?
|
Doesn't Notepad output a BOM¹ in UTF-8? Look for 0xFEFF at the start
of the file. Could that be the explanation?
-kzm
¹ Apparently the relevant Unicode committees have, in their infinite
wisdom, decided that just because a stream is byte-oriented doesn't
mean it has to suffer the injustice -- nay, the insult! -- of lacking a
defined byte order. If somebody can explain the rationale behind
this, I'd be very interested to hear it.
--
If I haven't seen further, it is by standing in the footprints of giants |
|
| Back to top |
|
 |
Anton Ertl
Guest
|
Posted:
Wed Feb 02, 2005 2:07 pm Post subject:
Re: utf-8 to ascii |
|
|
Ketil Malde <ketil+news@ii.uib.no> writes:
| Quote: | Doesn't Notepad output a BOM¹ in UTF-8? Look for 0xFEFF at the start
of the file.
|
The BOM is the Unicode character U+FEFF, encoded as UTF-8, resulting
in the sequence 0xEF 0xBB 0xBF.
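This is easy to confirm, and the same three bytes can be used to detect and drop the mark. A small sketch (the function name `strip_utf8_bom` is made up here; `codecs.BOM_UTF8` is the standard-library constant for these bytes):

```python
import codecs

bom = "\ufeff".encode("utf-8")
print(bom.hex())               # efbbbf
print(bom == codecs.BOM_UTF8)  # True


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(strip_utf8_bom(b"\xef\xbb\xbfhello"))  # b'hello'
```

When decoding to text anyway, Python's "utf-8-sig" codec skips a leading BOM in the same way.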
| Quote: | ¹ Apparently the relevant Unicode committees have, in their infinite
wisdom, decided that just because a stream is byte-oriented doesn't
mean it has to suffer the injustice -- nay, the insult! -- of lacking a
defined byte order. If somebody can explain the rationale behind
this, I'd be very interested to hear it.
|
I am not sure that any Unicode committee had anything to do with this.
Using this with UTF-8 is a Microsoft convention. From
<http://www.cl.cam.ac.uk/~mgk25/unicode.html>:
|One influential non-POSIX PC operating system vendor (whom we shall
|leave unnamed here) suggested that all Unicode files should start with
|the character ZERO WIDTH NOBREAK SPACE (U+FEFF), which is in this role
|also referred to as the "signature" or "byte-order mark (BOM)", in
|order to identify the encoding and byte-order used in a
|file. Linux/Unix does not use any BOMs and signatures.
- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html |
|
| Back to top |
|
 |
Jean-Marc Bourguet
Guest
|
Posted:
Wed Feb 02, 2005 2:31 pm Post subject:
Re: utf-8 to ascii |
|
|
Ketil Malde <ketil+news@ii.uib.no> writes:
| Quote: | ¹ Apparently the relevant Unicode committees have, in their infinite
wisdom, decided that just because a stream is byte-oriented doesn't
mean it has to suffer the injustice -- nay, the insult! -- of lacking a
defined byte order. If somebody can explain the rationale behind
this, I'd be very interested to hear it.
|
As I understand it, starting with a BOM is an in-band way of signaling
Unicode data when other kinds of character data are possible (as FEFF is
improbable in most other encodings). Additionally, it
tells you which kind of UTF is in use and, for forms that have two
variants (UTF-16), lets you detect the variant.
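A sketch of such in-band detection, using the BOM constants from Python's `codecs` module. The function name `sniff_bom` and the returned labels are choices made for this example. The UTF-32 marks are tested first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one:

```python
import codecs
from typing import Optional

# Longest marks first, so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom("\ufeffhi".encode("utf-16-le")))  # utf-16-le
print(sniff_bom("\ufeffhi".encode("utf-16-be")))  # utf-16-be
print(sniff_bom("\ufeffhi".encode("utf-8")))      # utf-8
print(sniff_bom(b"plain text"))                   # None
```

A result of None says nothing about the encoding; that is the case where out-of-band information is still needed.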
I think that MS decided to use that for simple text files. Not a
really bad solution, but when the file is not in Unicode, you still
need out-of-band information to know how to decode it, so only part of the
problem is solved. In the Unix world, the out-of-band information is given by the
LC_* environment variables, so it is not common to put a BOM at the
start. This complicates the portability of files between the
environments.
BTW, I don't know whether encodings other than UTF-8 are supported by MS
software for Unicode text files.
Yours,
--
Jean-Marc |
|
| Back to top |
|
 |
Ketil Malde
Guest
|
Posted:
Wed Feb 02, 2005 5:13 pm Post subject:
Re: utf-8 to ascii |
|
|
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
| Quote: | Doesn't Notepad output a BOM¹ in UTF-8? Look for 0xFEFF at the start
of the file.
The BOM is the Unicode character U+FEFF, encoded as UTF-8, resulting
in the sequence 0xEF 0xBB 0xBF.
|
Right you are, of course. Sloppy of me.
| Quote: | I am not sure that any Unicode committee had anything to do with this.
Using this with UTF-8 is a Microsoft convention.
|
But from <http://www.unicode.org/faq/utf_bom.html>, it seems clear
that U+FEFF at the start of a UTF-8 stream is a BOM, and not a ZWNBSP
being part of the text. Unlike the other Unicode formats, there
doesn't seem to be a UTF-8 variant without a BOM.
| Quote: | file. Linux/Unix does not use any BOMs and signatures.
|
I think MS is actually more compliant with the standard here (although
the use of UTF-8 BOM is apparently recommended against in some cases).
-kzm
--
If I haven't seen further, it is by standing in the footprints of giants |
|
| Back to top |
|
 |
HP
Guest
|
Posted:
Thu Feb 03, 2005 6:04 am Post subject:
Re: utf-8 to ascii |
|
|
"Ketil Malde" <ketil+news@ii.uib.no> wrote in message news:egfz0fgxi8.fsf@dverghimalayaeiner.ii.uib.no...
| Quote: | "mail2atulmehta@yahoo.com" <mail2atulmehta@yahoo.com> writes:
length) for that value also changes. I know it is happening because the number
of bits used per character differs between ASCII and UTF-8. Is there some way I can
keep the column length fixed at 75?
Doesn't Notepad output a BOM¹ in UTF-8? Look for 0xFEFF at the start
of the file. Could that be the explanation?
|
Actually, Notepad adds the UTF-8 encoding of U+FEFF, i.e. the bytes 0xEF 0xBB 0xBF:
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
HP |
|
| Back to top |
|
 |