Conversation

@WarrenWeckesser (Member) commented Jun 21, 2020

Replace the check that was added in #9615 with a Python script that prints more information about the non-ASCII characters found.
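
Roughly, the idea is something like this (a simplified, hypothetical sketch of the approach, not the actual tools/ascii-check.py, which also handles Cython files and encoding declarations):

```python
#!/usr/bin/env python
# Simplified sketch: report the line, position and code point of every
# non-ASCII character in the files given on the command line, and exit
# with status 1 if any were found.
import sys


def report_non_ascii(filename):
    found = False
    with open(filename, encoding='utf-8', errors='replace') as f:
        for lineno, line in enumerate(f, start=1):
            for pos, char in enumerate(line, start=1):
                if ord(char) > 127:
                    if not found:
                        print(filename)
                        found = True
                    print(f"... line {lineno}, position {pos}: "
                          f"character {char!r} has code point {ord(char)}.")
    return found


if __name__ == "__main__":
    results = [report_non_ascii(name) for name in sys.argv[1:]]
    sys.exit(1 if any(results) else 0)
```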

@WarrenWeckesser changed the title from "Ascii check" to "CI: Replace the existing check for non-ASCII characters with a Python script." on Jun 21, 2020
@WarrenWeckesser (Member Author)

For example, in the current version of #12287, ascii-check.py prints:

scipy/io/wavfile.py (no explicit encoding)
... line 573, position 21: character 'µ' has code point 181.

@WarrenWeckesser (Member Author)

And in this PR, the log for the (correctly) failing TravisCI lint check tells us where the offending character in foo.py is:

$ tools/ascii-check.py
scipy/constants/foo.py (no explicit encoding)
... line 3, position 1: character 'λ' has code point 955.
The command "tools/ascii-check.py" exited with 1.

@WarrenWeckesser (Member Author)

And for fun, you can run the script with the option --showall and it will show all the non-ASCII characters in the Python and Cython files, even those that have an encoding declared at the beginning of the file:

$ python tools/ascii-check.py --showall
scipy/constants/foo.py (no explicit encoding)
... line 3, position 1: character 'λ' has code point 955.
scipy/io/__init__.py (explicit encoding 'utf-8')
... line 14, position 7: character '®' has code point 174.
... line 24, position 4: character '®' has code point 174.
scipy/io/matlab/tests/test_mio.py (explicit encoding 'latin-1')
... line 1134, position 31: character 'ö' has code point 246.
scipy/linalg/_decomp_cossin.py (explicit encoding 'utf-8')
... line 20, position 36: character '┌' has code point 9484.
... line 20, position 56: character '┐' has code point 9488.
... line 21, position 36: character '│' has code point 9474.

✂... snip ... ✂

... line 889, position 45: character 'λ' has code point 955.
... line 892, position 32: character 'λ' has code point 955.
... line 893, position 32: character 'λ' has code point 955.
scipy/signal/filter_design.py (explicit encoding 'utf-8')
... line 4928, position 18: character '–' has code point 8211.
scipy/signal/fir_filter_design.py (explicit encoding 'utf-8')
... line 1009, position 24: character '−' has code point 8722.
... line 1014, position 26: character 'π' has code point 960.
... line 1018, position 20: character 'π' has code point 960.
... line 1018, position 22: character '∫' has code point 8747.
... line 1018, position 25: character 'ω' has code point 969.

✂... snip ... ✂

... line 1043, position 43: character 'π' has code point 960.
... line 1043, position 48: character 'π' has code point 960.
scipy/signal/tests/test_signaltools.py (explicit encoding 'utf-8')
... line 1100, position 75: character '≠' has code point 8800.
scipy/stats/_continuous_distns.py (explicit encoding 'utf-8')
... line 4004, position 46: character 'é' has code point 233.
... line 4675, position 59: character '’' has code point 8217.
... line 6938, position 26: character '–' has code point 8211.

(For brevity, I removed some of the output.)
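
The "(explicit encoding ...)" annotation comes from the PEP 263 coding comment on the first or second line of a file. A sketch of how such a declaration can be detected (a hypothetical helper; the real script may do this differently):

```python
import re

# PEP 263: an encoding declaration must appear on line 1 or 2 and match
# this pattern, e.g. "# -*- coding: utf-8 -*-" or "# coding: latin-1".
CODING_RE = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def declared_encoding(filename):
    """Return the declared source encoding, or None if there isn't one."""
    with open(filename, 'rb') as f:
        for _ in range(2):
            match = CODING_RE.match(f.readline())
            if match:
                return match.group(1).decode('ascii')
    return None
```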

@ilayn (Member) commented Jun 21, 2020

There must be a precedent for the Unicode versions of these characters: they come from the extended ASCII table, go by the name "box-drawing characters," and were used in DOS programs to draw fences and frames (the Windows cmd console can display them, unlike most other Unicode characters).

@WarrenWeckesser (Member Author)

@ilayn, there's a whole Wikipedia page about the box-drawing characters.

@ilayn (Member) commented Jun 21, 2020

Yes, that's why I said there must be a precedent; see http://www.asciitable.com/.

"Unicode" as a qualifier won't be a tight description, in my opinion.

@WarrenWeckesser (Member Author)

The only test failure is the intentional lint failure in the TravisCI tests.

@ev-br (Member) commented Jun 21, 2020

Nice!

@AtsushiSakai added the CI label (items related to CI tools such as CircleCI, GitHub Actions or Azure) on Jun 21, 2020
@WarrenWeckesser force-pushed the ascii-check branch 2 times, most recently from 17d380e to ac6e6ea, on June 21, 2020
@WarrenWeckesser (Member Author)

I pushed an update that makes the script a bit more careful about how it reads the files. The previous version would have crashed on a latin-1-encoded file containing arbitrary latin-1 characters with code points greater than 127. The file scipy/io/matlab/tests/test_mio.py is interesting: the first line of the file is

# -*- coding: latin-1 -*-

which says the file's encoding is latin-1. The file has a literal string that, in my editor (which assumes UTF-8 by default), appears as Schrödinger (line 1134). The latin-1 encoding of the character ö ("latin small letter o with diaeresis") is 0xF6. So if we read test_mio.py as bytes and look for the sequence of bytes that represents Schrödinger, we might expect to find the byte 0xF6 representing ö. Here are the actual bytes in the file:

In [9]: with open('test_mio.py', 'rb') as f: 
   ...:     content = f.read()
   ...: 

In [10]: content[38307:38320]                                                                                            
Out[10]: b"Schr\xc3\xb6dinger'"

Instead of a single 0xF6, we have two bytes, 0xC3 0xB6. Those two bytes are the UTF-8 representation of ö (which explains why the string looked correct in my editor). If I tell my editor that the file is latin-1 encoded, the string is displayed as 'SchrÃ¶dinger'. That is also what the updated version of ascii-check.py --showall shows for that file:

[...]
scipy/io/matlab/tests/test_mio.py (explicit encoding 'latin-1')
... line 1134, position 31: character 'Ã' has code point 195.
... line 1134, position 32: character '¶' has code point 182.
[...]
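
The mismatch is easy to reproduce directly from those two bytes (the byte values are the ones shown above):

```python
# The two raw bytes found in the file, decoded with each encoding.
raw = b"\xc3\xb6"

print(raw.decode('utf-8'))    # 'ö'   -- what a UTF-8 editor displays
print(raw.decode('latin-1'))  # 'Ã¶'  -- what the declared latin-1 encoding gives
```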

@larsoner (Member) left a comment

Want to fix the test_mio.py problem too? Maybe just set the encoding to utf-8 at the top for consistency.

@WarrenWeckesser (Member Author)

Yes, I want to fix that. It looks like it is just a mistaken encoding comment. But changing the comment to specify that the encoding is UTF-8 changes the data that is used in the unit test, so it would be nice if someone familiar with that code could take a look (ping @matthew-brett?).

@WarrenWeckesser marked this pull request as draft on June 22, 2020 14:25
@WarrenWeckesser marked this pull request as ready for review on June 22, 2020 16:10
@WarrenWeckesser (Member Author)

@larsoner, I went ahead and created a PR to remove the latin-1 encoding comment for test_mio.py: #12404

@WarrenWeckesser (Member Author)

I pushed an update. The new script is now tools/unicode-check.py. It implements the conservative approach that I described in #11925 (comment). As it is now, the unicode-check.py test will not pass until #12404 and #12399 are merged. Until then, you can see what the log looks like when this check fails by looking at the test results for the "Lint" run of the TravisCI tests.

@WarrenWeckesser (Member Author)

The two blockers, #12404 and #12399, have been merged. I rebased on master, and now the tests pass, so this is ready to go.

@larsoner (Member) commented Jul 2, 2020

Thanks @WarrenWeckesser! I'll go ahead and merge so that we get this check going, and we can relax the checks / make exceptions as needed going forward.

@larsoner merged commit 393a192 into scipy:master on Jul 2, 2020
@WarrenWeckesser (Member Author)

Thanks Eric, and everyone else who commented.

I think we should have no problem allowing printable characters from the latin-1 character set (e.g. ç, ï, ñ, etc.) when they occur in the proper names of authors referenced in docstrings. The main goal is to avoid non-printing characters and non-ASCII variations of dashes and other punctuation that tend to sneak into docstrings with text cut-and-pasted from other sources.
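
If we do end up allowing a few such characters, the check could use an explicit allowlist, along these lines (a hypothetical sketch; the allowed characters below are only examples, not what tools/unicode-check.py actually does):

```python
# Hypothetical allowlist-based check; the allowed characters are examples.
ALLOWED_NON_ASCII = set("çïñéöü")


def offending_chars(text):
    """Yield (character, code point) for anything outside ASCII + allowlist."""
    for char in text:
        if ord(char) > 127 and char not in ALLOWED_NON_ASCII:
            yield char, ord(char)


# The en dash (code point 8211) is flagged, but 'ñ' is not:
print(list(offending_chars("Muñoz – 2020")))  # [('–', 8211)]
```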

@WarrenWeckesser deleted the ascii-check branch on July 2, 2020
@tylerjereddy added this to the 1.6.0 milestone on Jul 2, 2020