Conversation

@WarrenWeckesser (Member) commented Jun 21, 2020

Replace the check that was added in #9615 with a Python script that prints more information about the non-ASCII characters found.
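
Roughly, the idea is something like this (a simplified, hypothetical sketch of the approach, not the actual tools/ascii-check.py, which also handles Cython files and encoding declarations):

```python
#!/usr/bin/env python
# Simplified sketch: report the line, position and code point of every
# non-ASCII character in the files given on the command line, and exit
# with status 1 if any were found.
import sys


def report_non_ascii(filename):
    found = False
    with open(filename, encoding='utf-8', errors='replace') as f:
        for lineno, line in enumerate(f, start=1):
            for pos, char in enumerate(line, start=1):
                if ord(char) > 127:
                    if not found:
                        print(filename)
                        found = True
                    print(f"... line {lineno}, position {pos}: "
                          f"character {char!r} has code point {ord(char)}.")
    return found


if __name__ == "__main__":
    results = [report_non_ascii(name) for name in sys.argv[1:]]
    sys.exit(1 if any(results) else 0)
```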

@WarrenWeckesser changed the title from "Ascii check" to "CI: Replace the existing check for non-ASCII characters with a Python script." on Jun 21, 2020
@WarrenWeckesser (Member Author)

For example, in the current version of #12287, ascii-check.py prints:

scipy/io/wavfile.py (no explicit encoding)
... line 573, position 21: character 'µ' has code point 181.

@WarrenWeckesser (Member Author)

And in this PR, the log for the (correctly) failing TravisCI lint check tells us where the offending character in foo.py is:

$ tools/ascii-check.py
scipy/constants/foo.py (no explicit encoding)
... line 3, position 1: character 'λ' has code point 955.
The command "tools/ascii-check.py" exited with 1.

@WarrenWeckesser (Member Author)

And for fun, you can run the script with the option --showall and it will show all the non-ASCII characters in the Python and Cython files, even those that have an encoding declared at the beginning of the file:

$ python tools/ascii-check.py --showall
scipy/constants/foo.py (no explicit encoding)
... line 3, position 1: character 'λ' has code point 955.
scipy/io/__init__.py (explicit encoding 'utf-8')
... line 14, position 7: character '®' has code point 174.
... line 24, position 4: character '®' has code point 174.
scipy/io/matlab/tests/test_mio.py (explicit encoding 'latin-1')
... line 1134, position 31: character 'ö' has code point 246.
scipy/linalg/_decomp_cossin.py (explicit encoding 'utf-8')
... line 20, position 36: character '┌' has code point 9484.
... line 20, position 56: character '┐' has code point 9488.
... line 21, position 36: character '│' has code point 9474.

✂... snip ... ✂

... line 889, position 45: character 'λ' has code point 955.
... line 892, position 32: character 'λ' has code point 955.
... line 893, position 32: character 'λ' has code point 955.
scipy/signal/filter_design.py (explicit encoding 'utf-8')
... line 4928, position 18: character '–' has code point 8211.
scipy/signal/fir_filter_design.py (explicit encoding 'utf-8')
... line 1009, position 24: character '−' has code point 8722.
... line 1014, position 26: character 'π' has code point 960.
... line 1018, position 20: character 'π' has code point 960.
... line 1018, position 22: character '∫' has code point 8747.
... line 1018, position 25: character 'ω' has code point 969.

✂... snip ... ✂

... line 1043, position 43: character 'π' has code point 960.
... line 1043, position 48: character 'π' has code point 960.
scipy/signal/tests/test_signaltools.py (explicit encoding 'utf-8')
... line 1100, position 75: character '≠' has code point 8800.
scipy/stats/_continuous_distns.py (explicit encoding 'utf-8')
... line 4004, position 46: character 'é' has code point 233.
... line 4675, position 59: character '’' has code point 8217.
... line 6938, position 26: character '–' has code point 8211.

(For brevity, I removed some of the output.)
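
The "(explicit encoding ...)" annotation comes from the PEP 263 coding comment on the first or second line of a file. A sketch of how such a declaration can be detected (a hypothetical helper; the real script may do this differently):

```python
import re

# PEP 263: an encoding declaration must appear on line 1 or 2 and match
# this pattern, e.g. "# -*- coding: utf-8 -*-" or "# coding: latin-1".
CODING_RE = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def declared_encoding(filename):
    """Return the declared source encoding, or None if there isn't one."""
    with open(filename, 'rb') as f:
        for _ in range(2):
            match = CODING_RE.match(f.readline())
            if match:
                return match.group(1).decode('ascii')
    return None
```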

@ilayn (Member) commented Jun 21, 2020

There must be a precedent for the Unicode versions of these characters: they come from the extended ASCII table, go by the name "box-drawing characters," and were used in DOS programs to draw fences and frames (the Windows cmd console can display them, unlike most other Unicode characters).

@WarrenWeckesser (Member Author)

@ilayn, there's a whole Wikipedia page about the box-drawing characters.

@ilayn (Member) commented Jun 21, 2020

Yes, that's why I said there must be a precedent; see http://www.asciitable.com/.

"Unicode" as a qualifier won't be a tight description, in my opinion.

@WarrenWeckesser (Member Author)

The only test failure is the intentional lint failure in the TravisCI tests.

@ev-br (Member) commented Jun 21, 2020

Nice!

@AtsushiSakai added the CI label (items related to CI tools such as CircleCI, GitHub Actions or Azure) on Jun 21, 2020
@WarrenWeckesser force-pushed the ascii-check branch 2 times, most recently from 17d380e to ac6e6ea, on June 21, 2020
@WarrenWeckesser (Member Author)

I pushed an update that makes the script a bit more careful about how it reads the files. The previous version would have crashed on a latin-1-encoded file containing arbitrary latin-1 characters with code points greater than 127. The file scipy/io/matlab/tests/test_mio.py is interesting: the first line of the file is

# -*- coding: latin-1 -*-

which says the file's encoding is latin-1. The file has a literal string that, in my editor (which assumes UTF-8 by default), appears as Schrödinger (line 1134). The latin-1 encoding of the character ö ("latin small letter o with diaeresis") is 0xF6. So if we read test_mio.py as bytes and look for the sequence of bytes that represents Schrödinger, we might expect to find the byte 0xF6 representing ö. Here are the actual bytes in the file:

In [9]: with open('test_mio.py', 'rb') as f: 
   ...:     content = f.read()
   ...: 

In [10]: content[38307:38320]                                                                                            
Out[10]: b"Schr\xc3\xb6dinger'"

Instead of a single 0xF6, we have two bytes, 0xC3 0xB6. Those two bytes are the UTF-8 representation of ö (which explains why the string looked correct in my editor). If I tell my editor that the file is latin-1 encoded, the string is displayed as 'SchrÃ¶dinger'. That is also what the updated version of ascii-check.py --showall shows for that file:

[...]
scipy/io/matlab/tests/test_mio.py (explicit encoding 'latin-1')
... line 1134, position 31: character 'Ã' has code point 195.
... line 1134, position 32: character '¶' has code point 182.
[...]
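
The mismatch is easy to reproduce directly from those two bytes (the byte values are the ones shown above):

```python
# The two raw bytes found in the file, decoded with each encoding.
raw = b"\xc3\xb6"

print(raw.decode('utf-8'))    # 'ö'   -- what a UTF-8 editor displays
print(raw.decode('latin-1'))  # 'Ã¶'  -- what the declared latin-1 encoding gives
```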

@larsoner (Member) left a comment

Want to fix the test_mio.py problem too? Maybe just set the encoding to utf-8 at the top for consistency.

@WarrenWeckesser (Member Author)

Yes, I want to fix that. It looks like it is just a mistaken encoding comment. But changing the comment to specify that the encoding is UTF-8 changes the data that is used in the unit test, so it would be nice if someone familiar with that code could take a look (ping @matthew-brett?).

@WarrenWeckesser marked this pull request as draft on June 22, 2020 14:25
@WarrenWeckesser marked this pull request as ready for review on June 22, 2020 16:10
@WarrenWeckesser (Member Author)

@larsoner, I went ahead and created a PR to remove the latin-1 encoding comment for test_mio.py: #12404

@WarrenWeckesser (Member Author)

I pushed an update. The new script is now tools/unicode-check.py. It implements the conservative approach that I described in #11925 (comment). As it is now, the unicode-check.py test will not pass until #12404 and #12399 are merged. Until then, you can see what the log looks like when this check fails by looking at the test results for the "Lint" run of the TravisCI tests.

@WarrenWeckesser (Member Author)

The two blockers, #12404 and #12399, have been merged. I rebased on master, and now the tests pass, so this is ready to go.

@larsoner (Member) commented Jul 2, 2020

Thanks @WarrenWeckesser! I'll go ahead and merge so that we get this check going, and we can relax the checks / make exceptions as needed going forward.

@larsoner merged commit 393a192 into scipy:master on Jul 2, 2020
@WarrenWeckesser (Member Author)

Thanks Eric, and everyone else who commented.

I think we should have no problem allowing printable characters from the latin-1 character set (e.g. ç, ï, ñ, etc.) when they occur in the proper names of authors referenced in docstrings. The main goal is to avoid non-printing characters and non-ASCII variations of dashes and other punctuation that tend to sneak into docstrings with text cut-and-pasted from other sources.
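
If we do end up allowing a few such characters, the check could use an explicit allowlist, along these lines (a hypothetical sketch; the allowed characters below are only examples, not what tools/unicode-check.py actually does):

```python
# Hypothetical allowlist-based check; the allowed characters are examples.
ALLOWED_NON_ASCII = set("çïñéöü")


def offending_chars(text):
    """Yield (character, code point) for anything outside ASCII + allowlist."""
    for char in text:
        if ord(char) > 127 and char not in ALLOWED_NON_ASCII:
            yield char, ord(char)


# The en dash (code point 8211) is flagged, but 'ñ' is not:
print(list(offending_chars("Muñoz – 2020")))  # [('–', 8211)]
```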

@WarrenWeckesser deleted the ascii-check branch on July 2, 2020
@tylerjereddy added this to the 1.6.0 milestone on Jul 2, 2020