9 Dec 2015

LMLE : How I created a compilation of "Invisible Man" notes

LMLE stands for "Linux Makes Life Easy". If you don't understand any of the commands used, please google them.

So today I was going through some websites looking for notes on "The Invisible Man", which I have to study as part of our 12th standard syllabus. I found a good resource here http://thebestnotes.com/booknotes/Invisible_Man_Wells/The_Invisible_Man_Study_Guide01.html but realized that it had tons of different HTML pages, each with distracting ads and stuff. I wanted a neat way to read all of it quickly without too much clicking.

Since I had 3 pre-boards left and I'd have to go through these notes at least 3 more times, I thought it would be worth investing some time to get them formatted. But obviously, copy-pasting 30+ times wasn't an option for me. I needed something much quicker.

So, here goes.

I booted up Linux (sadly Windows is still my working OS), and opened up the terminal. I remembered Arjun telling me about wget a few weeks back, so I decided to give it a shot.

This is what I tried first

wget http://thebestnotes.com/booknotes/Invisible_Man_Wells/The_Invisible_Man_Study_Guide01.html

Plain and simple, but it only gave me an offline copy of that one page. Then I tried this

wget -r http://thebestnotes.com/booknotes/Invisible_Man_Wells/The_Invisible_Man_Study_Guide01.html

It gave me the same thing, just inside a folder, which wasn't any use. Being the noob I still am, I tried man wget and tried to figure it out myself. But I didn't really have the time, so I consulted a programmer's best friend - Stack Overflow!

I found this neat combination of options that downloads all the linked pages and dumps them into a single file, all in one go.

wget -r -l1 -H -t1 -nd -N -np -A.html -erobots=off -O file.txt http://thebestnotes.com/booknotes/Invisible_Man_Wells/The_Invisible_Man_Study_Guide01.html
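
For reference, here's what those flags do, as far as I understand them (same command, just with an annotation block above it):

# -r            recurse into the links on the page
# -l1           but only one level deep
# -H            follow links even if they point to other hosts
# -t1           attempt each download only once
# -nd           don't recreate the site's directory structure locally
# -N            skip files that aren't newer than a local copy
# -np           never ascend to the parent directory
# -A.html       accept only .html files
# -erobots=off  ignore robots.txt
# -O file.txt   dump everything downloaded into a single file
wget -r -l1 -H -t1 -nd -N -np -A.html -erobots=off -O file.txt http://thebestnotes.com/booknotes/Invisible_Man_Wells/The_Invisible_Man_Study_Guide01.html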

But guess what? It was no use, because an extra "0" in the "01, 02, 03" naming of the HTML links on the page ruined the ordering of the content in the output file.

So I just went with the messy version of the previous command - download all the HTML files separately and then parse them one by one.

wget -r -l1 -H -t1 -nd -N -np -A.html -erobots=off http://thebestnotes.com/booknotes/Invisible_Man_Wells/The_Invisible_Man_Study_Guide01.html
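
In hindsight, another way around the ordering problem would have been to skip -O and fetch the pages one by one, in numeric order, appending them to a single file myself. A rough sketch (all.html is just a name I made up here):

# seq -w pads the numbers to 01, 02, ..., 17, which matches the page names
for x in $(seq -w 1 17); do
    wget -q "http://thebestnotes.com/booknotes/Invisible_Man_Wells/The_Invisible_Man_Study_Guide$x.html" -O - >> all.html
done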

Anyway, now I had all the content locally, but it was spread across individual HTML files. I did a little research and found a third-party utility called "html2text" that did exactly what I wanted.

sudo apt-get install html2text

Boom. Installed. Just like that. Is it that easy on Windows? Definitely not. 

Only one step left.

html2text -utf8 -nobs -style pretty The_Invisible_Man_Study_Guide01.html > file.txt
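
The flags here, again as far as I understand them:

# -utf8          assume the input (and the terminal) use UTF-8
# -nobs          don't use backspace sequences for bold/underlined text
# -style pretty  use the "pretty" layout style rather than "compact"
html2text -utf8 -nobs -style pretty The_Invisible_Man_Study_Guide01.html > file.txt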

But this gave me the output for only one file, so I wrote a small script to automate it for all the files.

for x in 0{1,2,3,4,5,6,7,8,9} 10 11 12 13 14 15 16 17
do
    html2text -utf8 -nobs -style pretty The_Invisible_Man_Study_Guide$x.html >> file.txt
done

After all this, I had perfectly readable content, but some unnecessary text still remained. So I opened the file manually and, within 2 minutes, removed all the unnecessary stuff. It was finally formatted exactly to my liking. (file.txt is the final output file.)
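
(If the junk had followed a predictable pattern - say, a repeated footer line containing the site's name - a one-liner like this could have done the cleanup too. The pattern below is purely a guess; I really did just delete the lines by hand.)

# delete every line matching a (hypothetical) repeated footer, editing file.txt in place
sed -i '/thebestnotes\.com/d' file.txt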

When I opened it up in gedit, the scrolling flicker hurt my eyes, so I switched to a terminal-based alternative.
To read it now, all I have to type is

vi file.txt

If I'm looking for a particular section, I just grep for it and read only that part.
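
Something like this, assuming the chapter headings survived the conversion (the search term is just an example):

# find which line a section starts on
grep -n -i "chapter 5" file.txt

# or open vi already positioned at the first match
vi '+/chapter 5' file.txt
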
Works pretty well, and well worth the half hour spent.

Any questions? Leave a comment.