Python 3 open() default encoding
Answer 1
The default UTF-8 encoding of Python 3 only extends to bytes-to-str conversions. open() instead uses your environment to choose an appropriate encoding:
From the Python 3 docs for open():
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
In your case, as you're on Windows with a Western European/North American locale, you will be given the 8-bit Windows-1252 character set. Setting encoding to utf-8 overrides this.
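A minimal sketch of the mismatch described above. The cp1252 codec is named explicitly to stand in for a Western European Windows default, so the snippet behaves the same on any platform; the sample string and the temporary file are illustrative assumptions, not part of the original answer.

```python
import os
import tempfile

text = "caf\u00e9"  # "café", contains one non-ASCII character

# Write the text as UTF-8 bytes to a throwaway file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Decoding with cp1252 simulates what open(path) would do on a Western
# European Windows machine when no encoding argument is given.
with open(path, encoding="cp1252") as f:
    wrong = f.read()

# Passing encoding="utf-8" overrides the platform default.
with open(path, encoding="utf-8") as f:
    right = f.read()

os.remove(path)
print(repr(wrong))  # 'cafÃ©'
print(repr(right))  # 'café'
```

The two UTF-8 bytes of "é" are each decoded as a separate cp1252 character, which is the classic mojibake symptom of this bug.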
Fortunately there are recent attempts to end this madness... someday. – Jeyekomon Apr 28, 2020 at 14:05
Motivation
Using the default encoding is a common mistake
Developers using macOS or Linux may forget that the default encoding is not always UTF-8.
For example, using long_description = open("README.md").read() in setup.py is a common mistake. Many Windows users cannot install such packages if there is at least one non-ASCII character (e.g. emoji, author names, copyright symbols, and the like) in their UTF-8-encoded README.md file.
Of the 4000 most downloaded packages from PyPI, 489 use non-ASCII characters in their README, and 82 fail to install from source on non-UTF-8 locales due to not specifying an encoding for a non-ASCII file. [1]
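One way to avoid the README problem is to name the encoding when reading the file. The sketch below assumes the README is stored as UTF-8; the helper name read_long_description and the throwaway README contents are hypothetical, chosen only for the demonstration.

```python
import tempfile
from pathlib import Path


def read_long_description(readme_path):
    """Return the README text, always decoded as UTF-8 (hypothetical helper)."""
    return Path(readme_path).read_text(encoding="utf-8")


# Demonstration with a temporary README containing non-ASCII characters.
with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.md"
    readme.write_text("Demo package \u00a9 Jos\u00e9\n", encoding="utf-8")
    long_description = read_long_description(readme)

print(long_description)
```

In a real setup.py the call would be something like read_long_description("README.md"), and the result would not depend on the installing user's locale.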
Another example is logging.basicConfig(filename="log.txt"). Some users might expect it to use UTF-8 by default, but the locale encoding is actually what is used. [2]
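A sketch of the same fix for log files, using logging.FileHandler with an explicit encoding argument. The logger name and the temporary log path are assumptions made for the example. On Python 3.9 and later, logging.basicConfig also accepts an encoding argument, which may be the simpler option where it is available.

```python
import logging
import os
import tempfile

fd, log_path = tempfile.mkstemp(suffix=".log")
os.close(fd)

# Without encoding=..., FileHandler falls back to the locale encoding.
handler = logging.FileHandler(log_path, encoding="utf-8")

logger = logging.getLogger("encoding_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger
logger.addHandler(handler)

logger.info("caf\u00e9")

# Detach and close the handler so the file can be read back and removed.
logger.removeHandler(handler)
handler.close()

with open(log_path, encoding="utf-8") as f:
    logged = f.read()
os.remove(log_path)
print(repr(logged))
```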
Even Python experts may assume that the default encoding is UTF-8. This creates bugs that only happen on Windows; see [3], [4], [5], and [6] for example.
Emitting a warning when the encoding argument is omitted will help find such mistakes.
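The opt-in warning can be observed from a child interpreter. This sketch assumes Python 3.10 or later, where the -X warn_default_encoding option and the EncodingWarning category are available (neither name appears in the text above); on older versions the option is ignored and no warning is printed.

```python
import subprocess
import sys

# Child program: opens a file in text mode without an encoding argument.
child_code = "import os; open(os.devnull).close()"

# -X warn_default_encoding opts in to EncodingWarning (Python 3.10+).
result = subprocess.run(
    [sys.executable, "-X", "warn_default_encoding", "-c", child_code],
    capture_output=True,
    encoding="utf-8",
    errors="replace",
)

# The warning, if any, is written to the child's standard error stream.
print(result.stderr)
```

Setting the PYTHONWARNDEFAULTENCODING environment variable to a non-empty value is the other documented way to enable the same warning.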
Explicit way to use locale-specific encoding
open(filename) isn’t explicit about which encoding is expected:
- If ASCII is assumed, this isn’t a bug, but may result in decreased performance on Windows, particularly with non-Latin-1 locale encodings
- If UTF-8 is assumed, this may be a bug or a platform-specific script
- If the locale encoding is assumed, the behavior is as expected (but could change if future versions of Python modify the default)
From this point of view, open(filename) is not readable code.
encoding=locale.getpreferredencoding(False) can be used to specify the locale encoding explicitly, but it is too long and easy to misuse (e.g. one can forget to pass False as its argument).
This PEP provides an explicit way to specify the locale encoding.
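For comparison, the longer spelling mentioned above looks roughly like this. It is a sketch that works on older Python versions too; the file contents and the temporary path are illustrative.

```python
import locale
import os
import tempfile

# Passing False skips the setlocale() call; leaving it out is the
# misuse the text warns about.
locale_encoding = locale.getpreferredencoding(False)

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# The intent ("use the locale encoding") is now visible at the call site.
with open(path, "w", encoding=locale_encoding) as f:
    f.write("plain ASCII text\n")
with open(path, encoding=locale_encoding) as f:
    content = f.read()

os.remove(path)
print(locale_encoding, repr(content))
```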
Prepare to change the default encoding to UTF-8
Since UTF-8 has become the de-facto standard text encoding, we might default to it for opening files in the future.
However, such a change will affect many applications and libraries. If we start emitting DeprecationWarning everywhere the encoding argument is omitted, it will be too noisy and painful.
Although this PEP doesn’t propose changing the default encoding, it will help enable that change by:
- Reducing the number of omitted encoding arguments in libraries before we start emitting a DeprecationWarning by default.
- Allowing users to pass encoding="locale" to suppress the current warning and any DeprecationWarning added in the future, as well as retaining consistent behavior if later Python versions change the default, ensuring support for any Python version >=3.10.
Which encoding should Python open function use?
Answer 1
As clearly stated in Python's open documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Windows defaults to a localized encoding (cp1252 on US and Western European versions). Linux typically defaults to utf-8.
Because it is platform-dependent, use the encoding parameter and specify the encoding of the file explicitly.
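To see what a given machine would use, the relevant values can be printed. This is a small diagnostic sketch; the function name describe_text_defaults and the dictionary keys are hypothetical.

```python
import locale
import sys


def describe_text_defaults():
    """Collect the settings that influence text decoding (hypothetical helper)."""
    return {
        # Encoding open() falls back to when encoding is omitted.
        "locale_encoding": locale.getpreferredencoding(False),
        # Default for str.encode() / bytes.decode(); 'utf-8' on Python 3.
        "str_bytes_default": sys.getdefaultencoding(),
        # 1 when the Python UTF-8 Mode is enabled (Python 3.7+).
        "utf8_mode": sys.flags.utf8_mode,
    }


defaults = describe_text_defaults()
for name, value in defaults.items():
    print(name, "=", value)
```

The first value is the one that varies between platforms, which is why naming the file's encoding explicitly is the safer habit.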
https://docs.python.org/3/library/functions.html#open
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
locale.getpreferredencoding(do_setlocale=True)
Return the locale encoding used for text data, according to user preferences. User preferences are expressed differently on different systems, and might not be available programmatically on some systems, so this function only returns a guess.
On some systems, it is necessary to invoke setlocale() to obtain the user preferences, so this function is not thread-safe. If invoking setlocale is not necessary or desired, do_setlocale should be set to False.
On Android or if the Python UTF-8 Mode is enabled, always return 'UTF-8', the locale encoding and the do_setlocale argument are ignored.
The Python preinitialization configures the LC_CTYPE locale. See also the filesystem encoding and error handler.
Changed in version 3.7: The function now always returns UTF-8 on Android or if the Python UTF-8 Mode is enabled.
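The UTF-8 Mode behavior can be checked by starting a child interpreter with -X utf8 (Python 3.7+). This is a sketch; the comparison is case-insensitive because the exact spelling of the returned name ('UTF-8' versus 'utf-8') may differ between Python versions.

```python
import subprocess
import sys

# Child program: report what the locale module considers the preferred encoding.
child_code = "import locale; print(locale.getpreferredencoding(False))"

# -X utf8 enables the Python UTF-8 Mode for the child process only.
result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", child_code],
    capture_output=True,
    encoding="utf-8",
)

reported = result.stdout.strip()
print(reported)
```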