Mastodon History

History

02 Sep 2019 | code data history

I’ve been trying to think about what I do.

It’s not uncommon to find writeups on the web describing how to preserve bash history. Set a HISTFILE value which can differentiate your sessions (by pty, user id, date, &c), and you’ve got some durability. Throw in some HISTSIZE and you’ve got history for days.

I’ve been capturing my shell history for a couple years on my personal laptop.

$ head ~/.history/20170629.hist
#1498795792
cd Resilio\ Sync/
#1498795792
ls
#1498795800
cd fool/Games/RPG/Cypher/
#1498795800
ls
#1498795812
zipinfo TftNW-Decompress-Me.zip
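
With HISTTIMEFORMAT set, bash writes each entry's time as a `#<epoch seconds>` comment line ahead of the command itself, which is the layout shown above. A minimal sketch of pairing them back up in Python, assuming every command is a single line preceded by its timestamp. `parse_history` is a hypothetical helper, not part of any library:

```python
import re
from datetime import datetime, timezone

# Matches the "#<epoch seconds>" lines bash writes when HISTTIMEFORMAT is set.
TIMESTAMP = re.compile(r'#(\d{10})$')

def parse_history(lines):
    """Yield (datetime, command) pairs from the lines of one history file.

    Assumes each command sits on the line right after its timestamp;
    a command with no preceding timestamp is yielded with None.
    """
    stamp = None
    for line in lines:
        line = line.rstrip('\n')
        match = TIMESTAMP.match(line)
        if match:
            stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        elif line:
            yield stamp, line
            stamp = None

# First few lines of the sample file above.
sample = ["#1498795792\n", "cd Resilio\\ Sync/\n", "#1498795792\n", "ls\n"]
for when, command in parse_history(sample):
    print(when.isoformat(), command)
```

Multi-line commands would need more care than this; the sketch only covers the one-line case visible in the sample.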

My own swiped-from-all-over version of history preservation looks like this:

HISTCONTROL=$HISTCONTROL${HISTCONTROL+:}ignoredups

if [ ! -d ~/.history ]; then
  mkdir ~/.history
fi

export HISTTIMEFORMAT="%F %T "
export HISTFILE=~/.history/`date +%Y%m%d`.hist
export HISTSIZE=100000

shopt -s histappend

But as much fun as it is to fill storage volumes with the events of the past for their own sake, these writeups often show some analysis too: grep, wc, maybe some awk action if the blogger / stackoverflow essayist / mastodon tooter is of a certain vintage. I went a different way with it. My first sloppy run at it:

"""Read and count commands issued"""
import glob
import operator
import os
import re

"""Get list of files in ${HOME}/.history"""
HOME = os.environ['HOME']

"""Count repeats across all files"""
fileCount = 0
lineCount = 0
commandCount = {}
for session in glob.glob(HOME + '/.history/*.hist'):
    fileCount+=1
    with open(session) as f:
        for l in f:
            lineCount+=1
            # skip the "#<epoch seconds>" timestamp lines
            timestamp = re.match(r'#\d{10}', l)
            if not timestamp:
                firstCommand = re.match(r'\w+', l)
                if firstCommand:
                    commandCount[firstCommand.group(0)] = commandCount.get(firstCommand.group(0), 0) + 1

print('found ', fileCount, ' files')
print('found ', lineCount,  ' lines')
sortedCommands = sorted(commandCount.items(), key=operator.itemgetter(1))
for k,v in sortedCommands:
    if v > 100:
        print (k,': ', v, 'times')

Output looks like this:

found  165  files
found  21748  lines
tf :  108 times
ga :  109 times
file :  110 times
find :  120 times
brew :  127 times
gc :  139 times
heroku :  151 times
ssh :  154 times
pushd :  155 times
rbenv :  168 times
gem :  175 times
gd :  186 times
type :  198 times
rm :  229 times
vagrant :  239 times
open :  245 times
bundle :  284 times
cat :  293 times
mv :  318 times
...

Then I thought “what if I have shell history for 100 years? what then?”. Because that’s a totally reasonable thing to plan for. Then, “what if bash shell speaks natural language in 100 years?” So I took a run at using a Python tokenizing library. Specifically, nltk. No particular reason beyond it having pretty good gJuice when I was hunting for options. That implementation looks like this:

from nltk.corpus import PlaintextCorpusReader
import nltk
corpus_root = '/Users/stp/.history/'
command_history = PlaintextCorpusReader(corpus_root, '.*hist')

command_freq = nltk.FreqDist(command_history.words())

print (len(command_freq), ' distinct tokens')
command_freq.most_common(20)

I’m not ignoring the timestamps here, so the token distribution heavily favors ‘#’, but the actual command distribution ends up looking not too different from the simplistic first word counting version:

13707  distinct tokens
[('#', 10800),
 ('/', 5165),
 ('-', 4391),
 ('.', 4292),
 ('ls', 1750),
 ('\\', 1011),
 ('vi', 888),
 ('docker', 849),
 ('git', 626),
 ('--', 616),
 ('pdf', 510),
 ('cd', 454),
 ("'", 453),
 ('bundle', 433),
 ('l', 407),
 ('rm', 385),
 ('gs', 345),
 ('mv', 328),
 ('bin', 325),
 ('0', 322)]
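
The '#' at the top of that list comes from the timestamp lines. One way to keep them out of the count is to drop those lines before tokenizing. A rough sketch, using a plain regular expression as a stand-in for nltk's tokenizer so it runs on the standard library alone. `token_counts` and the token pattern are illustrative assumptions, and the counts will not match nltk's exactly:

```python
import re
from collections import Counter

# Timestamp lines look like "#1498795792".
TIMESTAMP = re.compile(r'#\d{10}$')
# Crude tokenizer: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r'\w+|[^\w\s]')

def token_counts(lines):
    """Count tokens across history lines, skipping timestamp lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if TIMESTAMP.match(line):
            continue
        counts.update(TOKEN.findall(line))
    return counts

sample = [
    "#1498795792", "ls",
    "#1498795800", "cd fool/Games/RPG/Cypher/",
    "#1498795800", "ls",
]
print(token_counts(sample).most_common(3))
```

Feeding it every line from the files in ~/.history would give a distribution comparable to the one above, minus the timestamp noise.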

So what’s the point of looking at this data? Well, right off the top, I sure do seem to spend a lot of time checking what files exist in my ${PWD}. So maybe I should do alias l=ls and reduce my keystrokes by half at a stroke.
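
As back-of-the-envelope arithmetic for that alias: each aliased invocation saves the difference in name length. A tiny sketch, taking the 1750 `ls` tokens from the nltk output as a rough estimate of invocations (some of those tokens may be arguments rather than commands). `keystrokes_saved` is a hypothetical helper:

```python
def keystrokes_saved(command, alias, invocations):
    """Keystrokes saved by typing alias instead of command (Enter not counted)."""
    return (len(command) - len(alias)) * invocations

# 1750 is the 'ls' token count reported above, used here as an estimate.
print(keystrokes_saved('ls', 'l', 1750))  # → 1750
```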

Or maybe set my ${PROMPT_COMMAND} to auto-list any directory I pushd into.

The possibilities for bikeshedding are endless, given the many places I could optimize my command line environment using my own history.