Grepping for a single, arbitrary character in a bunch of files.

I hit an error complaining that a source file couldn't be decoded, with no indication of which file was the problem.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 25: ordinal not in range(128)

The most logical way to rip through a number of files is find. On Linux, xxd is the simplest tool for hex dumping, and grep can then look at its output. The problem occurs when you realise you can't use | directly in an -exec with find (it runs a single command, not a shell pipeline), or at least I haven't figured out how. The simplest way to work around that is to put your command into a tiny shell script and -exec that. Obvious really, but it always takes me a moment to remember that.

Note that this is for grepping for a single byte. Grepping for multiple bytes would require a different approach entirely.

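cat > bingrep.sh <<'EOF'
#!/bin/sh
# usage: bingrep.sh HEXBYTE FILE
# exits 0 if the hex dump of FILE contains HEXBYTE as a whole word, so find can use it as a test
# caveat: xxd's ASCII column is searched too, so a literal word "c3" in the text also matches
xxd -g 1 "$2" | grep -qi "\<$1\>"
EOF
chmod +x bingrep.sh
find . -type f -name "*.py" -exec ./bingrep.sh c3 {} \; -print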

That let me track down the rogue 0xc3 bytes in the source code I was dealing with.
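If a separate script file feels like overkill, one way to keep the pipe inline is to hand the whole pipeline to sh -c from within -exec. The sketch below is only that, a sketch: find_byte is a made-up helper name, and I've swapped od in for xxd (an assumption on my part, not what I originally ran) because od -An -tx1 prints just the hex bytes, with no offset or ASCII column for grep to trip over.

```shell
#!/bin/sh
# find_byte HEXBYTE DIR (hypothetical helper name)
# Prints every *.py file under DIR that contains the given byte, e.g. c3.
find_byte() {
    # sh -c receives: $0 = "sh", $1 = the hex byte, $2 = the file find hands over.
    # od -An -tx1 -v dumps one space-separated two-digit hex value per byte,
    # so grep -w only ever matches a whole byte.
    find "$2" -type f -name "*.py" \
        -exec sh -c 'od -An -tx1 -v "$2" | grep -qiw "$1"' sh "$1" {} \; -print
}
```

Calling find_byte c3 . should then behave much like the bingrep.sh version, just without the extra file lying around.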

I suspect I ought to be able to do something similar with regexes, but you run the risk of stupid interpretation problems. Sometimes it's simpler to just look at it in the raw, relatively speaking.
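For the record, here is roughly what the grep-only route might look like. Treat it as a hedged sketch rather than something I've battle-tested: grep_byte is a made-up name, and the approach assumes that forcing the C locale makes grep treat the pattern as a plain byte instead of trying to decode it as UTF-8, which is exactly the kind of interpretation problem I mean.

```shell
#!/bin/sh
# grep_byte HEXBYTE DIR (hypothetical helper name)
# Lists *.py files under DIR containing the raw byte, without a hex dump.
grep_byte() {
    # Convert the hex digits to octal, then let printf emit the raw byte.
    oct=$(printf '%o' "0x$1")
    byte=$(printf "\\$oct")
    # LC_ALL=C is inherited by grep, so the pattern is matched byte for byte.
    LC_ALL=C find "$2" -type f -name "*.py" -exec grep -l -e "$byte" {} +
}
```

Something like grep_byte c3 . should list the offending files. It won't work for a NUL byte, since the shell can't hold one in a variable.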
