[Scons-dev] Merge PR #235 before release
Gary Oberbrunner
garyo at oberbrunner.com
Thu May 28 06:54:14 EDT 2015
If you're interested in this problem, I suggest reading
https://docs.python.org/2/howto/unicode.html which has all the details
(including how to ignore decode errors), and of course check out the
python3 branch of SCons, where a lot of Unicode handling has been done (but
much is still left to do, iirc). I don't think pretending strings are in
the cp437 encoding is a particularly good plan. ISO 8859-1 or Windows
CP1252 would probably give better results in some cases, but you still need
to ignore errors in the decode. And of course if the string actually is
UTF-8 with non-ASCII chars, decoding it with either of these encodings
will return a string of the wrong length, not just wrong characters, and
re-encoding it for output or storage will completely mangle it.
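
To make the wrong-length and mangling points concrete, here is a small
sketch. The sample bytes and variable names are made up for illustration
and are not taken from SCons; it only uses the standard decode()/encode()
calls and their "ignore"/"replace" error handlers.

```python
# Sketch only: the sample bytes are made up for illustration.
raw = b"caf\xc3\xa9"  # UTF-8 encoding of a 4-character word ending in e-acute

as_utf8 = raw.decode("utf-8")         # right encoding: 4 characters
as_latin1 = raw.decode("iso-8859-1")  # no error raised, but 5 characters

# Re-encoding the wrongly decoded text for output compounds the damage:
# the result is no longer the original byte sequence.
mangled = as_latin1.encode("utf-8")

# Bytes that are not valid UTF-8 raise unless an error handler is passed.
bad = b"abc\xff"
ignored = bad.decode("utf-8", "ignore")    # silently drops the bad byte
replaced = bad.decode("utf-8", "replace")  # substitutes U+FFFD

# CP1252 leaves a few byte values (0x81 here) undefined, so it can fail too.
try:
    b"\x81".decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True
```

Note that "ignore" and "replace" both lose information, so neither gives
you a way back to the original bytes.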
Of course we _can_ know the encoding of the filenames in the filesystem;
that's what sys.getfilesystemencoding() is for (see the unicode link
above). Reading file contents and handling stdout/stderr from SCons
subprocesses is much more of a challenge.
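
For the filename case, a minimal sketch might look like this. The helper
name filename_to_unicode is hypothetical (it is not an SCons function), and
the UTF-8 fallback and the "replace" error handler are my assumptions, not
something the Python docs prescribe.

```python
import sys

# getfilesystemencoding() may return None on some Python 2 setups;
# falling back to UTF-8 is an assumption made for this sketch.
fs_encoding = sys.getfilesystemencoding() or "utf-8"

def filename_to_unicode(name, encoding=fs_encoding):
    """Hypothetical helper: decode a byte-string filename to text.

    Text input is returned unchanged. Undecodable bytes become U+FFFD,
    so the result is safe to print but may not round-trip to the
    original name on disk.
    """
    if isinstance(name, bytes):
        return name.decode(encoding, "replace")
    return name
```

Something like filename_to_unicode(b"SConstruct") then gives back a text
string regardless of whether the caller had bytes or text to begin with.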
On Thu, May 28, 2015 at 3:28 AM, anatoly techtonik <techtonik at gmail.com>
wrote:
> I found a way to convert any binary string to Unicode without crashing -
> http://stackoverflow.com/a/27527728/239247 That would correctly
> convert all `ascii` characters (and will probably make it possible to use
> ANSI graphics if unicode font supports that), but it will not work for
> other utf-8 characters.
>
> Python 3 adds some surrogateescape, but that is not present in Python 2.
>
> http://stackoverflow.com/questions/19649463/how-to-do-surrogateescape-in-python2
> I don't know why they called it "surrogate" - it is a freaky word.
>
> On Wed, May 27, 2015 at 4:33 PM, Kenny, Jason L <jason.l.kenny at intel.com>
> wrote:
> > I would agree with this.
> >
> >
> >
> > In general, operating systems today store file data (i.e. the file
> > system data, not the data in the file) in Unicode (be it utf-16 or
> > utf-8). On Linux this is not always the case; it could be big5 or some
> > other locale encoding. On Linux there are means to see what the “native”
> > encoding is and to use it.
> >
> >
> >
> > I should note that the idea of converting binary to Unicode does not
> > really exist. The point of a binary string is to hold random data (e.g.
> > a double in its raw 64-bit form vs. the decimal value 1.2385). One can
> > assume that it is in a certain code page encoding and convert from that.
> > And like I stated above, there are APIs to see what the locale code page
> > encoding is, and that can be used to convert the code to the local
> > ANSI/OEM encoding. This is different from a binary string.
> >
> >
> >
> > Jason
> >
> >
> >
> >
> >
> >
> >
> > From: Scons-dev [mailto:scons-dev-bounces at scons.org] On Behalf Of Gary
> > Oberbrunner
> > Sent: Wednesday, May 27, 2015 7:43 AM
> > To: SCons developer list
> > Subject: Re: [Scons-dev] Merge PR #235 before release
> >
> >
> >
> >
> >
> > On Wed, May 27, 2015 at 6:52 AM, anatoly techtonik <techtonik at gmail.com>
> > wrote:
> >
> > What I need is a bulletproof way to convert from anything to unicode.
> > This requires some kind of escaping to go forward and back. Some helper
> > methods like u2b() (unicode to binary) and b2u(). I am quite surprised
> > that so far I found nothing for this "simple" case.
> >
> >
> > That's because in general the encoding of the "binary" string is unknown.
> > Is it ascii, utf-8, Windows CP-1252, shift-JIS, or something else? You
> > can't decode such a string to Unicode without knowing the encoding.
> > Check out the python-3 branch where we've been working through some of
> > those issues. Your u2b is "easy" if you assume you want the binary to be
> > utf-8 encoded, which is normally safe; this conversion is guaranteed to
> > work. Your b2u is not so easy. You can't just assume utf-8 as you might
> > think; if the string has invalid utf-8 bytes it'll raise an error or
> > generate dummy chars depending on the args you pass to str.decode(). At
> > least it'll get mangled if it's in a different encoding than you expect.
> >
> >
> >
> > --
> >
> > Gary
> >
> >
> > _______________________________________________
> > Scons-dev mailing list
> > Scons-dev at scons.org
> > https://pairlist2.pair.net/mailman/listinfo/scons-dev
> >
>
>
>
> --
> anatoly t.
--
Gary