CI: Replace the existing check for non-ASCII characters with a Python script. #12393
Conversation
For example, in the current version of #12287, the failing check gives little information about the offending character.

And in this PR, the log for the (correctly) failing TravisCI lint check tells us exactly where the offending character is.
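As a sense of scale, a check like this can be quite small. The sketch below is a hypothetical illustration of a line- and column-reporting non-ASCII scan; it is not the actual script added in this PR, and the file handling and output format are assumptions.

```python
# Hypothetical sketch of a non-ASCII checker; NOT the script in this
# PR. The output format and error handling are illustrative only.
import sys


def check_file(path):
    """Print the location and codepoint of each non-ASCII character."""
    count = 0
    with open(path, encoding='utf-8', errors='replace') as f:
        for lineno, line in enumerate(f, start=1):
            for col, char in enumerate(line, start=1):
                if ord(char) > 127:
                    count += 1
                    print(f"{path}:{lineno}:{col}: U+{ord(char):04X} {char!r}")
    return count


if __name__ == "__main__":
    # Usage: python check_ascii.py FILE [FILE ...]; exits nonzero on a hit.
    total = sum(check_file(p) for p in sys.argv[1:])
    sys.exit(1 if total else 0)
```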
force-pushed from 1e203ab to 8276019
And for fun, you can run the script with the … option. (For brevity, I removed some of the output.)
force-pushed from 8276019 to 3a869a3
There must be a precedent for unicode versions of …
@ilayn, there's a whole Wikipedia page about the box-drawing characters.
Yes, that's why I said there must be a precedent; see http://www.asciitable.com/. "Unicode" as a qualifier won't be a tight description, in my opinion.
The only test failure is the intentional lint failure in the TravisCI tests.
Nice!
force-pushed from 17d380e to ac6e6ea
I pushed an update that makes the script a bit more careful about how it reads the files. The previous version would have crashed on a file with latin-1 encoding that contained arbitrary latin-1 characters greater than 127. The file test_mio.py has an encoding comment which says the file's encoding is latin-1, but it also has a literal string that, in my editor (which by default assumes UTF-8), appears as 'Schrödinger'. Instead of a single 0xF6, the file has two bytes, 0xC3 0xB6. Those two bytes are the UTF-8 representation of ö (that explains why the string looked correct in my editor). If I inform my editor that the file is latin-1 encoded, the string is displayed as 'SchrÃ¶dinger'. That is also what the updated version of the script reports.
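To make the byte-level detail concrete, here is what the standard Python codecs do with these bytes (plain stdlib behavior, independent of this PR):

```python
>>> 'ö'.encode('latin-1')    # latin-1: a single byte
b'\xf6'
>>> 'ö'.encode('utf-8')      # UTF-8: two bytes
b'\xc3\xb6'
>>> # Decoding UTF-8 bytes with the wrong codec mangles the text;
>>> # latin-1 maps every byte to a character, so it never raises:
>>> 'Schrödinger'.encode('utf-8').decode('latin-1')
'SchrÃ¶dinger'
```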
Want to fix the test_mio.py problem, too? Maybe just set the encoding at the top to utf-8 for consistency.
Yes, I want to fix that. It looks like it is just a mistaken encoding comment. But changing the comment to specify that the encoding is UTF-8 changes the data that is used in the unit test, so it would be nice if someone familiar with that code could take a look (ping @matthew-brett?).
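For reference, the encoding comment follows PEP 263, and a checker that wants to read a source file the same way the interpreter does can use the standard library's tokenize.open, which honors that comment. This is a general stdlib sketch, not necessarily how the script in this PR reads files:

```python
import tokenize

# tokenize.open() detects the encoding from a PEP 263 comment on the
# first or second line (e.g. "# -*- coding: latin-1 -*-") or a BOM,
# and defaults to UTF-8, matching how CPython reads source files.
with tokenize.open('scipy/io/matlab/tests/test_mio.py') as f:
    print(f.encoding)   # the detected encoding
    source = f.read()   # text decoded with that encoding
```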
force-pushed from ac6e6ea to b9efa0a
I pushed an update. The new script is now …
force-pushed from b9efa0a to 81a39b7
Thanks @WarrenWeckesser! I'll go ahead and merge so that we get this check going, and we can relax the checks / make exceptions as need be going forward.
Thanks Eric, and everyone else who commented. I think we should have no problem adding printable characters from the latin-1 character set (e.g. ç, ï, ñ) when they occur in the proper names of authors referenced in docstrings. The main goal is to avoid non-printing characters and non-ASCII variations of dashes and other punctuation that tend to sneak into docstrings with text that was cut-and-pasted from other sources.
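If we ever want the check to tolerate those name characters automatically, one hypothetical approach (not part of this PR) is to filter on Unicode category, letting latin-1 letters through while still flagging typographic punctuation and invisible characters:

```python
# Hypothetical refinement sketch, not the script in this PR: allow
# printable latin-1 letters such as ç, ï, ñ that appear in authors'
# names, while still flagging typographic punctuation and invisible
# characters that tend to arrive via copy-and-paste.
import unicodedata

ALLOWED_CATEGORIES = {'Ll', 'Lu'}   # lowercase/uppercase letters


def is_suspicious(char):
    """True for non-ASCII characters we still want to reject."""
    if ord(char) <= 127:
        return False
    # Latin-1 letters (e.g. in proper names) are tolerated...
    if unicodedata.category(char) in ALLOWED_CATEGORIES and ord(char) < 256:
        return False
    # ...but dashes (Pd), curly quotes (Pi/Pf), and format/control
    # characters (Cf/Cc) such as ZERO WIDTH SPACE are flagged.
    return True


for c in ['ñ', '–', '\u200b', '“']:
    print(f"U+{ord(c):04X} {unicodedata.name(c, '?')}: {is_suspicious(c)}")
```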
Replace the check that was added in #9615 with a Python script that prints more information about the non-ASCII characters found.