Friday, 17 February 2012

SSH passwordless login WITHOUT public keys

I was recently in a situation where I needed SSH & rsync over SSH to be able to log into a remote site without prompting for a password (as it was being called from within a script and it would have been non-trivial to make the script pass in a password, especially as OpenSSH does not provide a trivial mechanism for scripts to pass in passwords - see below).

Normally in this situation one would generate a public / private keypair and use that to log in without a prompt, either by leaving the private key unencrypted (ie, not protected by a passphrase), or by loading the private key into an SSH agent prior to attempting to log in (e.g. with ssh-add).

Unfortunately the server in question did not respect my ~/.ssh/authorized_keys file, so public key authentication was not an option (boo).


Well, it turns out that you can pre-authenticate SSH sessions such that an already open session is used to authenticate new sessions (actually, new sessions are basically tunnelled over the existing connection).

The option in question needs a couple of things set up to work, and it isn't obviously documented as a way to allow passwordless authentication - I had read the man page multiple times and hadn't realised what it could do until Mikey at work pointed it out to me.

To get this to work you first need to create (or modify) your ~/.ssh/config as follows:

Host *
  ControlPath ~/.ssh/master_%h_%p_%r


Now, manually connect to the host with the -M flag to ssh and enter your password as normal:

ssh -M user@host

Now, as long as you leave that connection open, further normal connections (without the -M flag) will use that connection instead of creating their own one, and will not require authentication.
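For scripting, it can be handy to check that the master connection is actually up before relying on it. The following is only a rough sketch, assuming the ControlPath setting shown above is in place; the function names are made up for illustration, and ssh -O check is the stock OpenSSH way of asking a running master whether it is alive:

```shell
#!/bin/sh
# Sketch only: helpers around a shared SSH master connection.
# Assumes ~/.ssh/config contains: ControlPath ~/.ssh/master_%h_%p_%r
# All function names here are hypothetical.

# Print the socket path the ControlPath template expands to.
# Arguments: remote user, host, port (port defaults to 22).
control_socket() {
    printf '%s/.ssh/master_%s_%s_%s\n' "$HOME" "$2" "${3:-22}" "$1"
}

# Succeeds if a master connection for user@host is running.
master_is_up() {
    ssh -O check "$1" 2>/dev/null
}

# Run rsync over SSH only if the master connection is already open.
# Arguments: user@host, then the usual rsync source and destination.
sync_if_authenticated() {
    target=$1
    shift
    if master_is_up "$target"; then
        rsync -a -e ssh "$@"
    else
        echo "No master connection to $target; run: ssh -M $target" >&2
        return 1
    fi
}
```

A script could then call something like sync_if_authenticated user@host ./data/ user@host:backup/ and fail early with a useful message instead of hanging on a password prompt.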


Edit:
Note that you may instead edit your ~/.ssh/config as follows to have SSH always create and use Master connections automatically without having to specify -M. However, some people like to manually specify when to use shared connections so that low latency interactive sessions and high throughput upload/download sessions don't end up sharing one connection, as that can have a huge impact on the interactive session.

Host *
  ControlPath ~/.ssh/master_%h_%p_%r
  ControlMaster auto



Alternate method, possibly useful for scripting


Another method I was looking at using was specifying a program to return the password in the SSH_ASKPASS environment variable. Unfortunately, this environment variable is only used in some rare circumstances (namely, when no tty is present, such as when a GUI program calls SSH or rsync), and would not normally be used when running SSH from a terminal (or in the script as I was doing).

Once I found out about the -M option I stopped pursuing this line of thinking, but it may be useful in a script if the above pre-authentication method is not practical (perhaps for unattended machines).

To make SSH respect the SSH_ASKPASS environment variable when running from a terminal, I wrote a small LD_PRELOAD library libnotty.so that intercepts calls to open("/dev/tty") and causes them to fail.

If anyone is interested, the code for this is in my junk repository (libnotty.so & notty.sh). You will also need a small script that echoes the password (I hope it goes without saying that you should check the permissions on it) and point the SSH_ASKPASS environment variable to it.

https://github.com/DarkStarSword/junk
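The small password-echoing script mentioned above could look something like the following. This is a minimal sketch rather than what is in the repository: the function name and both file paths are hypothetical, and it simply prints the contents of a file that only you can read:

```shell
#!/bin/sh
# Sketch only: create an askpass helper that prints a password kept in a file.
# make_askpass and both paths passed to it are hypothetical names.
# Arguments: file holding the password, path of the helper script to create.
make_askpass() {
    (
        # Subshell so the restrictive umask does not leak into the caller.
        umask 077
        printf '#!/bin/sh\ncat "%s"\n' "$1" > "$2"
        chmod 700 "$2"
    )
}

# Example (not run here):
#   make_askpass ~/.ssh/secret ~/.ssh/askpass.sh
#   export SSH_ASKPASS=~/.ssh/askpass.sh
```

Remember that SSH will still only call the helper when it has no tty, which is the whole reason for the LD_PRELOAD trick.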

Git trick: Deleting non-ancestor tags

Today I cloned the git tree for the pandaboard kernel, only to find that it didn't include the various kernel version tags from upstream, so running things like git describe or git log v3.0.. didn't work.

My first thought was to fetch just the tags from an upstream copy of the Linux kernel I had on my local machine:

git fetch -t ~/linus

Unfortunately I hadn't thought that through very well, as that local tree also contained all the tags from the linux-next tree, the tip tree as well as a whole bunch more from various distro trees and several other random ones, which I didn't want cluttering up my copy of the pandaboard kernel tree.

This led me to try to find a way to delete all the non-ancestor tags (compared to the current branch) to simplify the tree. This may be useful to others to remove unused objects and make the tree smaller after a git gc -- that didn't factor into my needs as I had specified ~/linus to git clone with --reference so the objects were being shared.

Anyway, this is the script I came up with, note that this only compares the tags with the ancestors of the *current HEAD*, so you should be careful that you are on a branch with all the tags you want to keep first. Alternatively you could modify this script to collate the ancestor tags of every local/remote branch first, though this is left as an exercise for the reader.


#!/bin/sh

# Collect the decorated (tagged/branch tip) commits reachable from HEAD:
ancestor_tags=$(mktemp)
printf 'Looking up ancestor tags... '
git log --simplify-by-decoration --pretty='%H' > "$ancestor_tags"
echo done.

for tag in $(git tag --list); do
 printf '%s... ' "$tag"
 # Resolve the tag to the commit it points at (empty if it tags a non-commit):
 commit=$(git rev-parse --verify --quiet "refs/tags/$tag^{commit}")
 if [ -z "$commit" ]; then
  echo has no commit, deleting...
  git tag -d "$tag"
  continue
 fi
 # Match the whole line as a fixed string, not as a regex:
 if grep -qxF "$commit" "$ancestor_tags"; then
  echo is an ancestor
 else
  echo is not an ancestor, deleting...
  git tag -d "$tag"
 fi
done

rm -fv "$ancestor_tags"


Also note that this may still leave unwanted tags in if they are a direct ancestor of the current HEAD - for instance, I found a bunch of tags from the tip tree had remained afterwards, but they were much more manageable to delete with a simple for loop and a pattern.
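As an aside, newer versions of git can do most of this directly. The sketch below is an alternative to the script above rather than a drop-in replacement: it assumes a git new enough to support git tag --no-merged (2.7 or later as far as I know), the function names are made up, and unlike the script above it does not delete tags that point at something other than a commit:

```shell
#!/bin/sh
# Sketch only: an alternative that relies on "git tag --no-merged".
# Function names are hypothetical. Run from inside the repository.

# List tags whose commit is NOT reachable from the given ref (default HEAD).
non_ancestor_tags() {
    git tag --no-merged "${1:-HEAD}"
}

# Delete every tag listed by non_ancestor_tags.
delete_non_ancestor_tags() {
    non_ancestor_tags "$1" | while IFS= read -r tag; do
        git tag -d "$tag"
    done
}
```

Running non_ancestor_tags on its own first is a cheap way to review what would be deleted.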

Sunday, 14 November 2010

Bluetooth 3G Modems on Debian Linux: Chatscripts and rfcomm bluez

I've been using 3G mobile broadband as my primary Internet connection for a couple of years now, and ever since I moved out of college it has become my only Internet connection at home - it's saved me the cost, delays and headache of dealing with Telstra to sort out some kind of wired link.

In my particular setup I removed the 3G data SIM card from the USB modem that came with my plan and placed it in my Nokia N900, which I use as a bluetooth modem for my various computers (my N95 used to fill this role) as well as having the convenience of having the N900 itself connected wherever and whenever I want.

Every now and again I get asked about my setup - a lot of people seem to have had trouble setting up bluetooth modems in Linux. This is understandable - last time I checked out Network Manager I found that it could set up a USB 3G modem pretty easily but had zero provisions to set up a bluetooth modem, and the Linux bluetooth stack (bluez) also leaves something to be desired (try using bluez 4 to pair to something without X... fail). I've previously been directing these people to some posts I made on the CLUG mailing list that had my configuration files, but it's clear that it will be easier to direct people to a blog post.

The quickstart guide for those people would be to scan this post, grab the file excerpts and place them where they belong, restart bluetooth and run the pon <profile> command to try to bring up the 3G connection. Then when that inevitably doesn't work, read the rest of the article to figure out what you need to change to make it work. I should note that I'm using Debian, so some of this article may not apply to other non Debian derived distributions (the pon and poff commands came from Debian, for example).

Firstly, a little background on the technical details we care about: 3G modems provide PPP (Point-to-Point Protocol) links to your ISP, just like the dial-up modems of old did. We even use the same protocol and method to talk to them that we used to use to talk to dial-up modems - the AT command set over some kind of serial like interface (itself encapsulated in a USB or bluetooth link).

A few things have changed though - for one they are much faster than dial-up modems. Authentication is also handled differently - we no longer (typically) use a username and password, instead handling the authentication in the SIM card. And instead of calling a phone number for your local ISP, we instead call a special number (such as *99#) to establish the link. Added to this, we now also have something called an APN (Access Point Name) to identify the IP packet data network that we want to communicate with.

There are a few important consequences of all of this. Firstly, we are using the same infrastructure (ppp, chatscripts, wvdial, ...) in Linux to connect to 3G that we used to use to connect old dial-up connections. Secondly, despite not requiring a username and password any more we still have to provide something in their stead to make everything happy even though they are ignored. We also still have the same nonsense of every ISP having a subtle difference in their authentication that affects how we connect to them. There can also be subtle differences in the AT commands we need to communicate with different modems to get them to do what we want.

Some people like using wvdial to establish their ppp links. If that works for you that's great, but my experience has been that wvdial fails in many circumstances, and getting it to work in those cases is quite often impossible, so I'm going to cover a much more tunable back to basics method: ppp + chatscripts + rfcomm.

Firstly, make sure ppp is installed (apt-get install ppp)... I hope you have some other connection than your 3G link to get that... Perhaps whatever you are reading this blog on?

We'll start with a USB connection - no sense adding the extra complexities of a bluetooth link to the mix until we have that working. I'll show the profiles I use for both the Huawei E220 USB modem that came with the plan and the USB link to my N900 (or N95).

We need two configuration files for each profile - the configuration for the ppp side of the link goes under /etc/ppp/peers/<profile> and the chatscript which tells the modem how to establish the ppp link under /etc/chatscripts/<profile>. The chatscript is referenced from the ppp configuration file, so it is possible to use one chatscript for multiple profiles, assuming the profiles are talking to the same modem (or at least that one modem doesn't require special treatment) and using the same APN.

The chatscript is responsible for initialising the modem and getting the connection to the point where pppd can take over, so I'll start with that. Here's the chatscript that I use for my Nokia N900 (USB and bluetooth), Nokia N95 and Huawei E220 USB modem:

/etc/chatscripts/optus-n900
ABORT BUSY
ABORT ERROR
ABORT 'NO CARRIER'
REPORT CONNECT
TIMEOUT 10
"" "ATZ"
OK "ATE1V1&D2&C1S0=0+IFC=2,2"
OK AT+CGDCONT=1,"IP","<APN>"
OK "ATE1"

OK "ATDT*99#"

CONNECT \c

IMPORTANT: Replace <APN> with the APN for your connection (for me on Optus post-paid mobile broadband that is "connect", for Lucy on Three pre-paid mobile broadband that is "3services" - refer to the documentation that came with your plan to find out what it is for you). If you don't you will run into inexplicable problems later.
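If you maintain chatscripts for several plans, one way to avoid forgetting the APN is to generate the file. This is just a sketch that prints the N900 chatscript from above with the APN filled in; make_chatscript is a name I made up for illustration, and it refuses to run without an APN:

```shell
#!/bin/sh
# Sketch only: print the chatscript above with the APN substituted.
# make_chatscript is a hypothetical helper name. Argument: the APN.
make_chatscript() {
    if [ -z "$1" ]; then
        echo "usage: make_chatscript <APN>" >&2
        return 1
    fi
    cat <<EOF
ABORT BUSY
ABORT ERROR
ABORT 'NO CARRIER'
REPORT CONNECT
TIMEOUT 10
"" "ATZ"
OK "ATE1V1&D2&C1S0=0+IFC=2,2"
OK AT+CGDCONT=1,"IP","$1"
OK "ATE1"
OK "ATDT*99#"
CONNECT \c
EOF
}
```

For example, as root, make_chatscript connect > /etc/chatscripts/optus-n900 would recreate my Optus chatscript.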

I said above that some modems need to be treated specially in the chatscript. I used to have to use this on my Huawei E220 because I could not find one script that would satisfy both it and my N95 (the AT+IPR line below was necessary for the E220, but caused the N95 to fail), but the differences no longer seem to be necessary (firmware upgrade? Some other change I made and forgot about? Phase of the moon? I can't recall), but it might help someone so here it is:

/etc/chatscripts/optus-huawei
ABORT BUSY
ABORT ERROR
ABORT 'NO CARRIER'
REPORT CONNECT
TIMEOUT 10
"" "ATZ"
OK AT+CGDCONT=1,"ip","connect"
OK "ATE1V1&D2&C1S0=0+IFC=2,2"
OK "AT+IPR=115200"

OK "ATE1"

TIMEOUT 60
"" "ATD*99#"

CONNECT \c

Now we need a profile for ppp that references that chatscript and contains all the settings necessary to establish a successful ppp link. I have a number of these for different profiles, depending on which modem I'm using and whether I'm using my Optus link or Lucy's Three link, but they are all pretty similar and include some common elements, so I'll just show one combined file with comments for differences between them. All these options and more are described in man pppd:

/etc/ppp/peers/<profile>
# This can help track down problems:
#debug

# The modem device to talk to:
/dev/ttyACM0 # N900/N95 USB
#/dev/ttyUSB0 # Huawei USB
#/dev/rfcomm0 # N900 Bluetooth

# In some cases it may be necessary to specify a baud rate,
# but generally it's best to let ppp detect this:
#115200
#230400
#460800
#... etc

# Optus requires both of these options, Three requires neither.
# Other ISPs may have different authentication requirements:
refuse-chap
require-pap

# When to detach from the console:
updetach
#nodetach

# These are generally necessary:
crtscts
noauth
noipdefault

# If the connection drops out try to reopen it:
persist

# We want this to be the default internet connection:
defaultroute
replacedefaultroute

# Get DNS settings from the ISP:
usepeerdns

# not used, but we must provide something:
user "na"
password "na"

# Playing with these compression options *may* improve
# performance, but get it working first:
noccp
nobsdcomp
novj
#nodeflate

#What chatscript we are using in this profile:
connect "/usr/sbin/chat -s -S -V -f /etc/chatscripts/optus-n900"
#connect "/usr/sbin/chat -s -S -V -f /etc/chatscripts/optus-huawei"


Got that? Great, let's give it a go! Connect your modem by USB, do whatever magic incantations you need to get your modem to reveal its modem aspects to Linux (for Nokia phones this is usually selecting the PC suite mode when you plug it in; some people report having to do strange things with kernel modules and udev to poke their Huawei E220 modems, though I have never found that necessary myself), shut down your network manager and run this in a terminal:

pon <profile>

All going well hopefully you will see some output like this:

ATZ
OK
ATE1V1&D2&C1S0=0+IFC=2,2
OK
AT+CGDCONT=1,"IP","connect"
OK
ATE1
OK
ATDT*99#
CONNECTchat: Nov 14 13:12:13 CONNECT
Serial connection established.
Using interface ppp0
Connect: ppp0 <--> /dev/rfcomm0
PAP authentication succeeded
Cannot determine ethernet address for proxy ARP
local IP address www.xxx.yyy.zzz
remote IP address 10.6.6.6
primary DNS address 211.29.132.12
secondary DNS address 61.88.88.88

Obviously the exact output will vary, but usually if you see some IP and DNS addresses you have successfully connected. Otherwise you really should try to get this working before continuing to the bluetooth part. If you got as far as the CONNECT... "Serial connection established." your modem and chatscripts are probably working (assuming your APN in the chatscript is correct) and you may need to look at the ppp configuration, though you might just try a few times first - sometimes my connections take a few attempts to come up successfully.

If you haven't got as far as the CONNECT you'll need to check your modem, coverage and chatscripts to try to locate the problem. Also double check that you have specified the correct device in the ppp configuration. If you are using a phone as your modem you might try rebooting it. If you get a NO CARRIER you are likely out of coverage or your modem couldn't connect to a nearby base station for some other reason (such as it being full), though the symptoms for that are unfortunately not always consistent - failing to connect to the modem at all can also be a symptom of that (and a host of other possible causes) for instance.
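The rough decision process from the last two paragraphs can be written down as a small filter over the connection output. It is only a sketch of my rules of thumb, not a real diagnostic tool: diagnose_ppp_log is a made up name, and the strings it looks for are simply the ones from the example output above:

```shell
#!/bin/sh
# Sketch only: classify pon/chat/pppd output read from stdin.
# diagnose_ppp_log is a hypothetical helper name.
diagnose_ppp_log() {
    log=$(cat)
    case $log in
        *"local IP address"*)
            echo "connected" ;;
        *"NO CARRIER"*)
            echo "no carrier: check coverage or try again" ;;
        *CONNECT*)
            echo "modem and chatscript ok: check the ppp options" ;;
        *)
            echo "no CONNECT: check modem, device and chatscript" ;;
    esac
}
```

On Debian something like plog | diagnose_ppp_log should work after a failed attempt, assuming your syslog setup gives plog something to read.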

There are just too many things that can go wrong by this point for me to cover here. Google is your friend. You may be able to find other people's chatscripts and ppp configuration for your modem and/or ISP that you could try.

Now you've successfully got a connection with ppp + chatscripts it's time to add bluetooth into the mix. Serial connections over bluetooth are handled with the rfcomm protocol. They are controlled with the rfcomm program and once bound show up as /dev/rfcomm0 and similar. A device can have different serial services listening on different rfcomm "channels" (like IP ports), and there is no guarantee for which services appear on which rfcomm channel. My Nokia N95 reveals its modem on rfcomm channel 2 and its GPS on rfcomm channel 5 (via ExtGPS), while my N900 reveals its modem on rfcomm channel 1 (In fact it is actually running rfcomm -S -- listen -1 1 /usr/bin/pnatd {}). You can use an rfcomm scanner like rfcomm_scan from Collin Mulliner's BT Audit suite or do some trial and error to find the channel you need (there are only 30 channels and it's usually a low number).

Add a section like the following to your /etc/bluetooth/rfcomm.conf:

/etc/bluetooth/rfcomm.conf:
rfcomm0 {
 bind yes;
 device AA:BB:CC:DD:EE:FF;
 channel 1;
 comment "N900 Data";
}

Replacing the bluetooth address and channel number as appropriate. Then tell rfcomm to bind rfcomm0 to this device with rfcomm bind 0 (this will also happen automatically at boot).
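If you are doing the trial and error approach to finding the channel, regenerating the stanza by hand gets old quickly. Here is a minimal sketch of a helper that prints a stanza in the format above; rfcomm_stanza is a made up name and the arguments are the rfcomm device number, bluetooth address, channel and comment:

```shell
#!/bin/sh
# Sketch only: print an rfcomm.conf stanza in the format shown above.
# rfcomm_stanza is a hypothetical helper name.
# Arguments: device number, bluetooth address, channel, comment.
rfcomm_stanza() {
    cat <<EOF
rfcomm$1 {
 bind yes;
 device $2;
 channel $3;
 comment "$4";
}
EOF
}
```

For example, rfcomm_stanza 0 AA:BB:CC:DD:EE:FF 1 "N900 Data" reproduces the section above. If your phone advertises its services, sdptool browse with the phone's address may also list the channel for you, though I wouldn't count on every device doing so.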

You should now see a new file /dev/rfcomm0 which we use to communicate with the modem over bluetooth. You should make a copy of the /etc/ppp/peers/<profile> you were using to connect over USB and change the new profile to use /dev/rfcomm0.

Now, we need to pair the devices together and tell the phone to trust the computer to connect whenever it wants. Pairing in bluez is still a bit hairy, particularly if you aren't using KDE or GNOME (like me) which provide their own bluez agents. In that case you don't have many options available to you. Bluez 3 used to have a hack in which you could specify a PIN to pair with under /var/lib/bluetooth/<device>/pincodes to allow pairing without an agent, however that does not work in bluez 4. Bluez provides an example console agent in the examples directory, but I have never managed to get it to work reliably with bluez 3 or bluez 4, so we now need a bluez agent, which lacking any decent console/curses agents means we need X (FAIL). This nonsense is now true even of HID devices which could previously be paired and activated with a simple hidd --search, which now doesn't trust them to re-pair to the computer so they stop working as soon as they start power saving (FAIL). Sigh, one day I'll get around to writing a decent ncurses bluez agent if no one beats me to it, but I digress.

If you aren't using GNOME or KDE you might try using the GTK bluez agent blueman instead. You'll need to have its system tray applet (blueman-applet) running for blueman-manager to work properly (FAIL - I don't have a system tray. At least it doesn't actually need to show the tray icon to work, though if you want that "trayer" or "stalonetray" can be used to provide a temporary system tray).

Anyway, once you have some kind of bluez agent running, be it KDE's kbluetooth, gnome-bluetooth or blueman you can try to pair your phone. I say "try" because even with an agent, pairing with bluez is still hairy. In theory running the pon <profile> command will attempt to open the bluetooth link and initiate pairing, causing both phone and computer to ask for a PIN to authenticate each other - enter the same on each. If you're really lucky they might even remember that they have been paired so you don't have to do it again the next time. If you're unlucky and that didn't work you can try deleting any existing pairing from the computer and phone then using your bluetooth agent's interface to initiate a pairing. Rebooting and walking around your computer in circles while chanting "all hail bluez" over and over may also help - I wish you luck.

The good news is that you only need the bluez agent while pairing - once you successfully pair and manage to get the 3G link up (and down and up a second time to make sure it remembered what to do) you usually don't have to touch bluez again and things get a lot easier. Unless one of the device pairings gets lost or confused... Or your bluetooth address changes, or ...

Hopefully by this stage you have successfully managed to pair your computer and phone and you should be able to use the pon and poff commands to bring the connection up and down as above. Congratulations, you're done! You can stop reading now. If you are getting a "host is down" error you have not successfully paired or the bluetooth link has otherwise failed. Another symptom of (non-pairing) bluetooth related problems that I've seen was getting no OK response after the initial ATZ. If you are pairing OK but only getting partway through the connection sequence you may have to go back to debugging your chatscripts and ppp options like I talked about above.


The (broadcom) bluetooth dongle I use on my EeePC introduces another complexity to the process - every time it is plugged in a couple of bits in its bluetooth address change at random for no good reason (check with hciconfig), which as you can imagine makes it rather hard to maintain a pairing between it and anything else. I've also come across some (broadcom) bluetooth dongles with a bluetooth address of 00:00:00:00:00:00. Oddly enough, very few devices like pairing with them, and fewer still will re-pair with them automatically. If you have this problem, short of telling broadcom they suck or buying a CSR dongle, you might try the dbaddr utility in the bluez source to force them to use a particular bluetooth address (if they support changing it through software, which of course is no guarantee). The script I use on my EeePC to connect shuts down my network manager and any running DHCP client, changes the bluetooth address on the dongle and opens the 3G connection:

/etc/init.d/wicd stop
killall dhclient
killall dhclient3

/usr/local/sbin/dbaddr AA:BB:CC:DD:EE:FF
hciconfig hci0 reset

pon optus-blue

Tuesday, 9 November 2010

Remind+wyrd events in other timezones & other tricks

When I bought my EeePC I challenged myself to wherever possible find lightweight (console/curses if possible) and keyboard friendly alternatives to the software I had been using. What I discovered was that I quickly began to prefer that way of interacting with the computer to my previous KDE centric setup, so now almost all of my desktops and laptops have the same setup.

One application which I sought to replace was a calendar. I discovered a lightweight console calendar program called "remind" with an ncurses frontend known as "wyrd".

A basic event in a file processed by remind might look something like this:

REM Nov 09 2010 AT 18:00 MSG Write a blog entry

That should be reasonably self explanatory. You can also specify some quite advanced recurring events in fairly natural ways:

REM Mon Tue Wed Thu Fri AT 9:00 MSG Go to work

REM Dec 25 MSG Christmas!

Or to specify the fourth Thursday of every month (Technically the next Thursday on or after the 22nd of any month):

REM Thursday 22 AT 19:00 DURATION 3:00 MSG Canberra Linux Users Group Meeting

There are also syntaxes for advanced reminders (+) and repetition (*) - but this isn't a full remind tutorial, read the man pages or search google (tip: add wyrd in your search to narrow the results down).

You may have noticed that I never specified a timezone in those examples. Unfortunately remind was written a long time ago on a hermit like platform that knew nothing of how time worked elsewhere in the world (DOS) and as a result doesn't have any support for events in other timezones built in. Just defining the event in local time may not be suitable depending on what both timezones do with daylight savings.

But there is another thing you should know about remind - it's not just a calendar domain specific language (though as you can see from those examples it certainly includes plenty of DSL constructs), it is in fact a calendar oriented programming language and we can use that to work around this limitation.

Seriously, let me say that one more time. My calendar is specified in a programming language. That is awesome. I can specify events to only occur once every blue moon---for real. I could shell out and have reminders only occur if my IP address indicates I'm at the office. Seriously, it could remind me to catch the bus only if I haven't already done so (note to self: make it do that, that would be cool).

Specifying a one off event in another timezone isn't in itself terribly difficult:

REM [trigger(tzconvert('2010-09-11@18:20', "US/Pacific"))] +30 DURATION 1:00 MSG Look up

The problem with this method is that there is no way to specify advanced recurrence. tzconvert takes a datetime and returns a datetime. There's no way to say "every monday in that timezone" or "every fortnight commencing on x in that timezone" or "on the last Sunday of October every year in that timezone", which remind has no trouble doing for local events.

Remind's programming language capability is unfortunately somewhat limited - mixing the DSL grammar and functions together is a bit kludgey. It's easy to cast the output of a function to a string and use it in the grammar (as above), but going the other way is a little more difficult. For instance, variables are set using the SET command, but if there is any way to set a variable from a function it has escaped me. Functional programming techniques may be usable to work around this, but I get the impression that remind's author didn't exactly design it with that in mind - for one thing recursive calls are explicitly disallowed.

But, we can INCLUDE another file, which will then be executed by remind (even if it's included multiple times) and will be able to use the DSL commands and have access to any variables already defined, so we can use that mechanism to create a function that will do what we want. After a bit of playing around today I finally settled on this:

# USAGE:
# SET these variables then INCLUDE this script:
#
# tz_src - the timezone the event is in
# tz_src_date - the date component of the event as would be passed to REM,
# including any repetition and reminders
# tz_src_time - the time component of the event in hh:mm form
# tz_src_trem - any time repetition, reminders, DURATION, etc. as passed into
# REM (if not desired, set to "")
# tz_msg - The message to print.
#
# Afterwards tz_dst_time will be set for *today's* occurrence of the event in
# localtime, or unset if no event occurs.


# Find next date in src timezone that occurs today() in localtime:
REM [tz_src_date] SCANFROM [trigger(today()-2)] UNTIL [trigger(today()+2)] SATISFY \
 coerce("DATE", tzconvert(datetime(trigdate(), tz_src_time), tz_src)) == today()
IF trigvalid()
 # We know local date is today from SATISFY, convert time to local:
 SET __dst_dt tzconvert(datetime(trigdate(), tz_src_time), tz_src)
 SET tz_dst_time coerce("TIME", __dst_dt)

 REM [trigger(today())] AT [tz_dst_time] [tz_src_trem] MSG [tz_msg]
ELSE
 UNSET tz_dst_time
ENDIF

That searches for a date the event occurs on the other timezone that satisfies the condition that the event occurs today() in the local timezone (today() is not necessarily the actual system date, it could be a specific date being looked up or the date of a calendar entry being computed). The source date can be specified with any of the usual remind recurrence constructs, just like an ordinary event. I've noticed some parse errors using this with a one off event on days the event does not occur - I think it might be a bug in remind for non-recurring events with a SATISFY clause that returns 0, but if someone can see something I've done wrong there I'd welcome the feedback. Anyway, for one off events you can just use the more concise syntax above, I've tried a few different forms of recurring events and haven't yet seen it on any of them.
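To show how the variables from the USAGE comment fit together, here is a sketch of a small shell helper that prints the lines you would add to your reminders file for one event. Everything about it is illustrative: tz_event is a made up name, the path of the include file is whatever you saved the script above as, and it makes no attempt to escape quotes in the message:

```shell
#!/bin/sh
# Sketch only: print the SET/INCLUDE lines for one event in another timezone.
# tz_event is a hypothetical helper name.
# Arguments: timezone, date spec as passed to REM, time (hh:mm),
#            extra time clauses ("" if none), message, path of the include file.
tz_event() {
    cat <<EOF
SET tz_src "$1"
SET tz_src_date "$2"
SET tz_src_time $3
SET tz_src_trem "$4"
SET tz_msg "$5"
INCLUDE $6
EOF
}
```

Something like tz_event US/Pacific Mon 18:00 "+15 DURATION 1:00" "Weekly call" ~/.remind/tz.rem >> ~/.reminders would then add a weekly event defined in US Pacific time, assuming those (hypothetical) file locations.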


The title of this post says "and other tricks", so I should probably show you some. I have a weekly meeting whose time varies depending on daylight savings (to better accommodate people elsewhere in the world who call in), so I've come up with this trick checking if every Friday is in (local) daylight savings time to accommodate this (try doing this in iCal!):

REM Fri SATISFY 1
IF isdst(trigdate())
 REM [trigger(trigdate())] +2 SKIP AT 09:30 DURATION 0:30 MSG Some meeting
ELSE
 REM [trigger(trigdate())] +2 SKIP AT 08:30 DURATION 0:30 MSG Some meeting
ENDIF



Finally, for anyone in Canberra, here is a list of public holidays you can import into your remind file. These should take care of any of the floating public holidays as well, and you can use the SKIP keyword to have events automatically be cancelled if they fall on a public holiday, or the BEFORE or AFTER keywords to move them to another day. The only thing these can't predict is any meddling from the Government:

# Public Holidays
FSET next_monday(x) x + (7-wkdaynum(x-1))
FSET next_monday_inc(x) x + (7-wkdaynum(x-1))%7
FSET weekend(x) wkdaynum(x) == 0 || wkdaynum(x) == 6

OMIT Jan 1 SPECIAL COLOR 255 255 255 New Year's Day
REM Jan 1 SCANFROM [trigger(today()-7)] SATISFY weekend(trigdate())
OMIT [trigger(next_monday_inc(trigdate()))] SPECIAL COLOR 255 255 255 New Year's Day Holiday
OMIT Jan 26 SPECIAL COLOR 255 255 255 Australia Day
REM Jan 26 SCANFROM [trigger(today()-7)] SATISFY weekend(trigdate())
OMIT [trigger(next_monday_inc(trigdate()))] SPECIAL COLOR 255 255 255 Australia Day Holiday
REM Mon Mar 8 SCANFROM [trigger(today()-7)] SATISFY 1
OMIT [trigger(trigdate())] SPECIAL COLOR 255 255 255 Canberra Day
SET easter EASTERDATE(YEAR(TODAY()))
OMIT [TRIGGER(easter-2)] SPECIAL COLOR 255 255 255 Good Friday
REM [TRIGGER(easter-1)] SPECIAL COLOR 255 255 255 Easter Saturday
REM [TRIGGER(easter)] SPECIAL COLOR 255 255 255 Easter Sunday
OMIT [TRIGGER(easter+1)] SPECIAL COLOR 255 255 255 Easter Monday
OMIT Apr 25 SPECIAL COLOR 255 255 255 Anzac Day
REM Apr 25 SCANFROM [trigger(today()-7)] SATISFY weekend(trigdate())
OMIT [trigger(next_monday_inc(trigdate()))] SPECIAL COLOR 255 255 255 Anzac Day Holiday
REM Mon Jun 8 SCANFROM [trigger(today()-7)] SATISFY 1
OMIT [trigger(trigdate())] SPECIAL COLOR 255 255 255 Queen's Birthday
REM Mon Oct 1 SCANFROM [trigger(today()-7)] SATISFY 1
OMIT [trigger(trigdate())] SPECIAL COLOR 255 255 255 Labour Day
OMIT 25 Dec SPECIAL COLOR 255 255 255 Christmas
OMIT 26 Dec SPECIAL COLOR 255 255 255 Boxing Day
REM 25 Dec SCANFROM [trigger(today()-7)] SATISFY weekend(trigdate())
IF trigvalid()
 OMIT [trigger(next_monday_inc(trigdate()) )] SPECIAL COLOR 255 255 255 Christmas Holiday
 OMIT [trigger(next_monday_inc(trigdate())+1)] SPECIAL COLOR 255 255 255 Boxing Day Holiday
ENDIF
REM 26 Dec SCANFROM [trigger(today()-7)] SATISFY wkdaynum(trigdate()) == 6
OMIT [trigger(next_monday_inc(trigdate()))] SPECIAL COLOR 255 255 255 Boxing Day Holiday

Wednesday, 14 July 2010

Fun with Foreign Debian Bootstrapping

Yesterday I found myself booting Linux on a device with no attached permanent storage - all I had was several gigabytes of RAM and the ability to netboot it through TFTP. I had been using a very minimal root filesystem inside the kernel image, but I began to wonder if it would be possible to have an entire Debian installation in the ramdisk instead - the box certainly had enough RAM to fit a minimal installation.

Ordinarily one could just use debootstrap to set up a minimal Debian installation inside a directory and make a ramdisk from that, but this was further complicated by the fact that this was a PowerPC device. Debootstrap does have a --foreign option to perform the first part of the installation on a different architecture, but the --second-stage still needs to be run as root on native hardware and assumes that it is being run from within an existing Linux installation with a bunch of standard tools available to it.

The only machines I had root on were all x86 (other than the device in question, but the ramdisk I had been using had some limitations that would have complicated matters) and some other test boxes (which I would have had to wait to requisition). So instead I decided to do a partial debootstrap on my local x86 box and complete the installation using only my local x86 box and that partial image on the PowerPC box.

If you are following this article as a guide I should note that it assumes you are able to compile and boot your own kernel and have a decent familiarity with Linux in general.

So first, begin the debootstrap process, but use --foreign to only perform the first part of the bootstrapping process (NOTE: almost everything here needs to be run as root, signified by the # at the start of each line):

# mkdir deb-ppc
# debootstrap --arch=powerpc --foreign squeeze deb-ppc http://<mirror>/debian
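
If you already know you will want extra packages in the image (openssh-server and rsync turn out to be handy later on), debootstrap's --include option takes a comma-separated package list. This is only a sketch that prints the command so you can eyeball it before running it as root; debootstrap_cmd is a helper name I made up, and mirror.example is a placeholder hostname standing in for your mirror:

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a first-stage debootstrap command.
# Arguments: arch, suite, target directory, mirror URL, comma-separated extra packages.
debootstrap_cmd() {
    arch=$1 suite=$2 target=$3 mirror=$4 extra=$5
    if [ -n "$extra" ]; then
        printf 'debootstrap --arch=%s --foreign --include=%s %s %s %s\n' \
            "$arch" "$extra" "$suite" "$target" "$mirror"
    else
        printf 'debootstrap --arch=%s --foreign %s %s %s\n' \
            "$arch" "$suite" "$target" "$mirror"
    fi
}

# mirror.example is a placeholder, not a real mirror.
debootstrap_cmd powerpc squeeze deb-ppc http://mirror.example/debian openssh-server,rsync
```

Printing first and running second is deliberate: a typo in the suite or architecture only shows up after a long download otherwise.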

After this command completes you have an incomplete Debian installation in deb-ppc - some basic tools are installed (but not configured) and some packages have been downloaded but not installed. I did not select any additional packages into the initial root disk at this stage, though had I been thinking ahead it would have been useful to also include openssh-server and rsync, but that was not a major setback for me. You might want to include them, and if you don't like vi or nano you might also want to install your console editor of choice. At the moment the root disk is not bootable, so let's fix that:

# ln -s /bin/bash deb-ppc/init

This still won't boot into a full Debian installation - after the kernel finishes its initialisation and tries to spawn the init userspace process to take over booting, it will instead spawn an interactive shell which can be used to complete the bootstrapping process. Since I'm bundling this inside the kernel image as an initramfs as opposed to an initrd loaded separately, I link an interactive shell into /init. If you were doing this with an initrd you would instead link it to /initrd.
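
A quick sanity check before building the image can save a wasted kernel build, since the kernel needs something it can run at /init inside the initramfs. A small sketch (check_init is a made-up helper name) that reports what init is without following the link, because /bin/bash only makes sense relative to the PowerPC root and not the x86 host:

```shell
#!/bin/sh
# Hypothetical helper: report what <root>/init is, without following symlinks.
check_init() {
    root=$1
    if [ -L "$root/init" ]; then
        printf 'init -> %s\n' "$(readlink "$root/init")"
    elif [ -f "$root/init" ] && [ -x "$root/init" ]; then
        printf 'init is a regular executable\n'
    else
        printf 'no usable init in %s\n' "$root" >&2
        return 1
    fi
}

# Example against a scratch directory standing in for deb-ppc:
demo=$(mktemp -d)
ln -s /bin/bash "$demo/init"
check_init "$demo"
rm -rf "$demo"
```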

Before we can make a ramdisk image from that directory we need to save this script from Documentation/filesystems/ramfs-rootfs-initramfs.txt in the kernel sources as mkinitramfs.sh:
#!/bin/sh

# Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation.
# Licensed under GPL version 2

if [ $# -ne 2 ]
then
  echo "usage: mkinitramfs directory imagename.cpio.gz"
  exit 1
fi

if [ -d "$1" ]
then
  echo "creating $2 from $1"
  (cd "$1"; find . | cpio -o -H newc | gzip) > "$2"
else
  echo "First argument must be a directory"
  exit 1
fi

NOTE: when using this script be sure you are calling this script and not a separate program also named mkinitramfs from your distribution.

Let's bundle the root disk into a cpio image:

# ./mkinitramfs.sh deb-ppc ramdisk.cpio.gz

Now you need to compile the kernel and netboot it - I'll leave the details of how to actually do that out of this article - there are plenty of good resources for that around already and the netboot procedure may vary depending on your setup (if you are netbooting at all). If you are doing this with an initramfs like I am you will need to point CONFIG_INITRAMFS_SOURCE to that image - once you have configured the kernel edit the .config file and remove the 'CONFIG_INITRAMFS_SOURCE=""' line. Then run make oldconfig which will ask you to set that option as well as some UID and GID mapping (which you can leave as 0 since the image should already have the correct ownership). After that you can run make and wait for the kernel to build. I'll also assume you know which zImage is the correct one to boot on your hardware.
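
The "remove the line, then make oldconfig" step is easy to script if you find yourself rebuilding often. A sketch, assuming the usual .config layout where the option sits on a line of its own; drop_initramfs_source is a hypothetical helper, and it avoids sed -i so that it does not depend on GNU sed:

```shell
#!/bin/sh
# Hypothetical helper: delete the CONFIG_INITRAMFS_SOURCE line from a kernel
# .config so that "make oldconfig" asks for it again.
drop_initramfs_source() {
    config=$1
    tmp=$(mktemp) || return 1
    # Write the filtered copy to a temp file, then copy it back in place
    # (cat rather than mv, so the original file's permissions are kept).
    sed '/^CONFIG_INITRAMFS_SOURCE=/d' "$config" > "$tmp" &&
        cat "$tmp" > "$config"
    status=$?
    rm -f "$tmp"
    return $status
}

# Typical use from the top of the kernel tree (commented out, it needs a real .config):
# drop_initramfs_source .config && make oldconfig
```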

Once you have successfully booted the kernel you should find yourself at a bash prompt. You should be aware that the environment is extremely limited at this point - for one thing there is no job control so don't try to spawn a process that you need to ctrl+c out of (I made the mistake of pinging a host to check that the network was up).

The debootstrap --second-stage did not work for me, so instead I completed the installation manually:

# export PATH=/usr/sbin:/usr/bin:/sbin:/bin
# dpkg --force-depends --install /var/cache/apt/archives/*.deb

A few things may complain during that and you may need to tell apt to fix up any problems:

# apt-get -f install

Now you will have a much more complete userspace - including vi. There are a few more things we need to do to get the system usable. Firstly, let's edit /etc/fstab and add an entry for /proc since so much userspace depends on it:

# vi /etc/fstab

proc /proc proc defaults 0 0

# mount /proc
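
If driving vi over a limited console is awkward, the same line can be appended from the shell instead. A sketch that only adds the entry when no proc line exists yet, so it is safe to run twice (add_proc_entry is my own name for it, not anything Debian ships):

```shell
#!/bin/sh
# Hypothetical helper: append the /proc entry to an fstab file unless a line
# starting with "proc" is already there.
add_proc_entry() {
    fstab=$1
    if ! grep -q '^proc[[:space:]]' "$fstab" 2>/dev/null; then
        printf 'proc /proc proc defaults 0 0\n' >> "$fstab"
    fi
}

# On the target you would run something like:
# add_proc_entry /etc/fstab && mount /proc
```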

Now we should probably get networking set up (I'm assuming you are using DHCP and your interface is eth0):

# vi /etc/network/interfaces

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp

# vi /etc/hostname
# ifup lo
# ifup eth0
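
The interfaces file can also be written in one go with a here-document, which is handy if you end up repeating this on another board. A minimal sketch, assuming DHCP as above; write_interfaces and its arguments are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: write a loopback + DHCP interfaces file.
# Arguments: output file, interface name.
write_interfaces() {
    out=$1 iface=$2
    cat > "$out" <<EOF
auto lo
iface lo inet loopback

auto $iface
iface $iface inet dhcp
EOF
}

# On the target: write_interfaces /etc/network/interfaces eth0
```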

Do not make the mistake I made of checking if the interface is up by pinging something. You can run ifconfig to make sure your IP address looks right.

And set up apt (note /debian postfix in sources.list which isn't in the template provided by debootstrap - I spent around 10 minutes contemplating the 403 I was getting before I noticed that):

# vi /etc/apt/sources.list

deb http://<mirror>/debian squeeze main

# vi /etc/apt/apt.conf.d/10local
APT::Install-Recommends "0";
APT::Install-Suggests "0";

# apt-get update

Now you can install any additional packages you may need (if you didn't do this in the initial debootstrap), so let's install what we need to be able to copy our changes out of the machine (interactive SSH won't work just yet, but file copying will):

# apt-get install openssh-server rsync
# passwd

Note that if you are interacting with the machine via serial it may be a bit awkward to interact with the configuration for some packages (such as localepurge) so just install the bare essentials for the moment. After installing some packages it's probably a good idea to clean the apt cache since we are likely pretty tight on RAM:

# apt-get clean

Speaking of serial, if you are logging into the machine via serial (as I was) you may want to spawn a console on the serial line:

# vi /etc/inittab

T0:2345:respawn:/sbin/getty -L ttyS0 57600 vt100

Back on the x86 box we can now copy all those changes back into the ramdisk and make it actually boot Debian:

# rsync -avx <host>:/ deb-ppc
(NOTE: the x is important, otherwise /proc will be copied as well)
# rm deb-ppc/init
# ln -s /sbin/init deb-ppc/init
# ./mkinitramfs.sh deb-ppc ramdisk.cpio.gz
(again, the mkinitramfs from the kernel doc, not a distro)

Again, compile the kernel and boot it. You will need to do this last part every time you make a change in the ramdisk that you want to make persistent.

Once booted you will be able to interactively SSH into it and will find you now have a complete Debian installation you can do whatever you like with within the constraints of the available RAM. With full SSH, job control and proper TTY management you can now perform some changes that would have been a little tricky earlier, such as reconfiguring any packages you couldn't configure properly earlier (tzdata for me) and stripping out unneeded locales (this messed up a little for me since locales wasn't installed before localepurge. I haven't tested this and it's probably longer than it needs to be, but I think it will work):

# apt-get install locales
# locale-gen en_AU.UTF-8
# dpkg-reconfigure locales
# apt-get install localepurge
# localepurge
# apt-get clean

You might also want to strip out some unneeded packages, for example with:

# apt-get purge logrotate mac-fdisk rsyslog yaboot info install-info man-db manpages nano

Remember to follow the above instructions to make those changes persistent if you are happy with them. Later I'll probably play around with docpurge (from maemo) and look at other ways of reducing the size of the image (disabling logging is probably a good place to start).

If you're after some further reading on booting the kernel with initial ramdisks, check out Documentation/early-userspace/README and Documentation/filesystems/ramfs-rootfs-initramfs.txt in the kernel source.

Wednesday, 4 February 2009

Of Rips and Magical Musical DVDs

If there is one thing that irks me almost as much as mistagged mp3s, it's poorly encoded videos. Why is it that AVI is still so popular when Matroska containers are superior in every way? Why is MP3 still being used for the audio when any device that can play the video certainly will have enough grunt to play OGG Vorbis (codec support notwithstanding)? God forbid something encoded in ... [gasp] MPEG2 - H.264 people, H.264 (patent law notwithstanding)! Even mplayer on my Eee 701SD (running Debian Lenny) can handle all that without missing a frame!
This rant comes about as a result of me having tried to buy a certain music DVD for about 5 months now. I've had it on backorder for months, I've tried JB HiFi and some other local shops and not really trusting eBay for these kinds of purchases I eventually decided to just use my left over internet quota for the month and download the thing.
On a side note, record companies - if you want to make money, why do you make it so difficult to buy things from you? "On backorder, you will receive an email when the item is back in stock", "Unfortunately xxxxx are sold out. Would you like to...", "still out of stock, there's some sorta licensing issue which is taking forever to resolve." - these are just some of the quotes I've heard as a consumer over the last year. Then there was that incident with those CDs stuck in US customs for those two months without anyone knowing where they were and costing more money as replacements were sent, then more money as they were returned after customs finally released them (So, the only thing that Free Trade Agreement did was stuff up our legal system then?)... Yeah, I'm a little off buying CDs over the Internet by now, but of course the alternative of buying in a store is quite difficult given that the bands practically have to have achieved worldwide fame to have a snowball's chance in hell of actually being in stock (ok, I am exaggerating that a little).
Now, since I live in Australia and have limits on how much I can download in a month, I strongly prefer not downloading any files larger than ~700MB - and why should I? All the music DVDs I've ripped myself sound (and to a lesser extent, look) superb at that size, surely it couldn't be that much worse than mine, right? Wrong.
Now, ask yourself this - if you were ripping a music DVD, you would make sure that you set a decent bitrate on the audio track wouldn't you? I certainly would - at least 196kbps, perhaps even as high as 256kbps for those of you with ultra sensitive hearing. Well, let's just say that this particular download was a tad less than that and not go into the details too much. I won't even mention just how excruciatingly painful it was to try to listen to.
Now, it's not that hard to do a decent encoding, but it is important to have a reasonable understanding of what's actually involved in the process. It is important to know your source media - is it interlaced? Does it need to be cropped? Is there a subtitle track that you should rip as well? Is there just the one audio track; which one is the right one? Does the aspect ratio need to be fixed?
Many of those answers will vary from situation to situation and from DVD to DVD, so there isn't a one size perfectly fits all solution. Of course there are graphical tools to do this for you and some of them are no doubt pretty good, though they do not remove the need to have at least a basic understanding of what is actually happening if you want good results. I'm not going to cover any graphical tool though, I learned how to do this on the command line years ago and have stuck with that, merely expanding my knowledge when new codecs and options came out. This shell script (which I know needs work - patches welcome) is my current best practice for ripping DVDs for my personal use.
It does make a few assumptions - that the DVD is interlaced and that you want it de-interlaced (because you will be playing it on a computer monitor as opposed to a TV), that there is no subtitle track that you want to extract (if you do, add "-sid n" without the quotes and where n is the subtitle track you want, usually 0, to the end of each line starting with mencoder, though also note that there are "better" ways to do this), that this is a music DVD and not a movie (I recommend lowering the audiobitrate to 128 if it is a movie), that you only want the one default audio track (if not, specify it with mplayer's -aid option and find the appropriate ID with mplayer's -identify option), and that it doesn't need to be cropped (too error prone to automate - look at the -vf cropdetect and -vf crop options in mplayer if you need it).
You will need a few dependencies: You need the Matroska tools, Vorbis tools and x264 libraries. You will also need to make sure that you have mplayer AND mencoder built with x264 support and able to play your DVD. This probably means you will need to compile it from source, which is outside the scope of this article on account of me needing to sleep soon. Also note that depending on your location you may find that you may have legal issues regarding the patents surrounding the H.264 codec. Not to mention that you may live in a country where you cannot legally format shift or where breaking Technological Protection Measures (such as encrypted DVDs) is plain illegal - I leave it to the reader to verify that they can legally do these things or go away and complain loudly to their Government if they can't, just don't go and drag me into it all, I'm just not in the mood.

So, if you've kept reading instead of going to complain to someone in authority then I guess that you are bearing the responsibility and want to know how to actually use this.
Save it as something like rip.sh and use it like
./rip.sh filename track
where filename is the base filename you will end up with and track is the DVD track number to extract - if you leave the track blank then it will rip whatever would have played with mplayer dvd://

#!/bin/bash

targetfilesize=$(( 700 * 1024 * 1024 ))
audiobitrate=256

file=$1
dvddump="dvd://$2"
rawaudio="$file-rawaudio.wav"
compressedaudio="$file-compressedaudio.ogg"
pass1out="$file-pass1.avi"
pass2out="$file-pass2.avi"
finalcut="$file.mkv"

#extract audio
mplayer "$dvddump" -vc null -vo null -ao pcm:file="$rawaudio":fast </dev/null

#compress audio
oggenc "$rawaudio" -b $audiobitrate -o "$compressedaudio"
rm "$rawaudio"

#Sometimes the length of the video is misreported, so use the length of the audio track instead since it was just encoded and therefore more likely to be accurate:
#NOTE: There is a rare situation where the audio track is really not the same length as the video track - if that is the case you will need to alter this section appropriately
videolength=`echo \`mplayer -identify "$dvddump" -vo null -ao null -frames 0 2>/dev/null |awk -F= '/ID_LENGTH/ {print $2}'\` / 1 + 1 | bc`
audiolength=`echo \`mplayer -identify "$compressedaudio" -vo null -ao null -frames 0 2>/dev/null |awk -F= '/ID_LENGTH/ {print $2}'\` / 1 + 1 | bc`
echo videolength: $videolength
echo audiolength: $audiolength
length=$audiolength
echo length: $length

#calculate video bitrate
videotargetsize=$(( targetfilesize - $(du -b "$compressedaudio" | awk '{print $1}') ))
videobitrate=`echo "$videotargetsize * 8 / $length / 1000" | bc`
echo video bitrate: $videobitrate

#video pass 1
rm -f divx2pass.log
mencoder "$dvddump" -vf kerndeint,scale -ovc x264 -oac lavc -lavcopts abitrate=64 -x264encopts bitrate=$videobitrate:threads=auto:pass=1:turbo=1 -o "$pass1out"

#video pass 2
mencoder "$dvddump" -vf kerndeint,scale -ovc x264 -oac lavc -lavcopts abitrate=64 -x264encopts bitrate=$videobitrate:threads=auto:pass=2 -o "$pass2out"

#compile
mkvmerge -o "$finalcut" -A "$pass2out" "$compressedaudio"
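
To make the bitrate arithmetic in the script concrete: the video gets whatever is left of the target size once the audio is accounted for, spread over the running time. Here is a standalone sketch with made-up numbers (a 700 MiB target, 100 MiB of audio and a one hour disc). video_kbps is a hypothetical helper, and it uses shell integer arithmetic, which truncates each division the same way bc does in the script:

```shell
#!/bin/sh
# Hypothetical helper. Arguments: target size in bytes, audio size in bytes,
# running time in seconds. Prints the video bitrate in kbit/s.
video_kbps() {
    target_bytes=$1 audio_bytes=$2 seconds=$3
    echo $(( (target_bytes - audio_bytes) * 8 / seconds / 1000 ))
}

# 700 MiB target, 100 MiB of audio, 3600 seconds of video:
video_kbps $((700 * 1024 * 1024)) $((100 * 1024 * 1024)) 3600
```

With those numbers it prints 1398, so a longer disc or a fatter audio track eats directly into what the video encoder gets to work with.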

If anyone does want to submit patches for this, the main features I've been intending to implement are a more flexible command line usage, a better way to extract the audio (that doesn't have the same risk of pressing left/right yet still produces perfectly synced audio), get all the subtitle tracks embedded into the mkv file and convert the DVD chapters into a format that can be embedded into the mkv.
Update: I can't believe that I didn't think of this earlier - simply redirecting stdin from /dev/null solves the keyboard input issue when dumping the audio with mplayer.

As for me, well, I guess I'll just eBay it after all hoping it's not a bootleg and go to sleep.

Update: I'm just going to go over an issue I mentioned in this post - how to deal with media that needs its aspect ratio corrected. The symptoms of this are generally that while you are watching a video everything just feels slightly distorted - in many cases this will be your imagination playing tricks on you, but if you are fairly certain that it isn't, read on. I'm going to use the music video for "Stick Together" which was on the bonus DVD from the album "Rock Music" by "The Superjesus" as an example. Every time I watched this it looked distorted, so today I paused it at this frame and used the GIMP to take a screenshot:

Now, the reason I took the screenshot here is that there is a fairly large (easier to measure) drum (circular object) reasonably close to the centre of the screen (not too heavily distorted by the camera's lens) and facing the camera almost perfectly straight on (avoids perspective distortion). Using the measure tool in the GIMP I find that the drum is approximately 170 pixels wide but about 184 pixels high - clearly the aspect ratio is way out and in this case it wasn't just my imagination (phew).
You will also notice the large black bars above and below the image - these need to be cropped. And here's another reason I chose this video - take a look at this screenshot:

Notice where the black bars are and how large they are this time? This is exactly why it's important to know your source media. If I simply run mplayer -vf cropdetect over this it's going to change its mind 3 times during playback - within the first second as that fades in it changes from crop=560:496:80:38 to crop=576:496:72:38. Then when the widescreen video starts it decides on crop=688:496:18:38. None of these are correct - the first two would cut off the left and right of the video and the last one will still leave small black bars at the top and bottom. This is one of the reasons why I mentioned that automating cropping is just too error prone. So, what's the solution? Tell mplayer to start playback after the intro artwork is gone of course! If, hypothetically, you wanted to modify my above script to attempt to detect how to crop it, I would suggest adding a line something like this (and using the crop variable in the appropriate place in the video filter chain - I cover this later):

crop=`mplayer "$dvddump" -vf cropdetect -vo null -ao null -fps 1000 -ss 60 -endpos 5|grep CROP|tail -n 1|sed 's/^.*(-vf //'|sed 's/).*$//'`

This starts the playback one minute in, quickly runs the video for 5 seconds and gives me a crop parameter of crop=688:432:18:72 - checking this with mplayer -vf crop=688:432:18:72 video.vob looks about right so it's time to move back to the problem of the aspect ratio (you could also crop the video after changing the aspect ratio - just remember to keep your video filters, including cropdetect, in the same order that you are working with).
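
If you want to check the grep/tail/sed part of that line without a DVD in the drive, you can feed it canned text. The sample lines below are my recollection of what cropdetect prints rather than captured output, so treat the exact wording as an assumption; last_crop is a made-up wrapper around the same pipeline:

```shell
#!/bin/sh
# Hypothetical wrapper: reduce cropdetect output on stdin to the last
# suggested crop=... value, using the same grep/tail/sed steps as above.
last_crop() {
    grep CROP | tail -n 1 | sed 's/^.*(-vf //' | sed 's/).*$//'
}

# Assumed sample lines, not captured from a real run:
printf '%s\n' \
    '[CROP] Crop area: X: 72..647  Y: 38..533  (-vf crop=576:496:72:38).' \
    '[CROP] Crop area: X: 18..705  Y: 72..503  (-vf crop=688:432:18:72).' |
    last_crop
```

Only the last line survives, which is the point of seeking past the intro: whatever cropdetect settles on at the end of the sampled window is what you get.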
So, let's see - I have a width of 688 pixels with the drum 170 pixels wide, and a height of 432 pixels with the drum 184 pixels high. Personally, I want to keep the width as is and scale the height to adjust the aspect. So, the current aspect ratio is about 1.6 (688/432) and I probably want about 1.7 (scaling the height to 432/184*170, or about 399 pixels, gives 688/399) - plugging this value into mplayer still doesn't look quite right, but I know that this is close to the standard 16:9 (about 1.78) aspect and a little more eyeballing tells me that's probably a bit closer. What I'm trying to get at here is that despite your best measuring efforts, it's quite difficult to get this exact and you eventually will need to just eyeball it and see if it looks good enough.
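
For anyone who wants to redo that arithmetic with their own measurements, here is a sketch. It assumes the object you measured really is circular and that you want to keep the width and rescale the height; corrected_aspect is a hypothetical helper and uses awk because the shell only does integer maths:

```shell
#!/bin/sh
# Hypothetical helper. Arguments: frame width, frame height, measured object
# width, measured object height (the object is assumed to be truly circular).
corrected_aspect() {
    awk -v w="$1" -v h="$2" -v ow="$3" -v oh="$4" 'BEGIN {
        new_h = h * ow / oh          # height that would make the object round
        printf "height %.0f aspect %.2f\n", new_h, w / new_h
    }'
}

corrected_aspect 688 432 170 184
```

For the drum this prints "height 399 aspect 1.72", which is in the same neighbourhood as 16:9 but not equal to it, hence the eyeballing.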
So, all together now the filter chain will look something like this:

mplayer -vf kerndeint,crop=688:432:18:72,dsize=16:9,scale=-1:-2

Breaking that down:
1. Deinterlace the video before any other processing (the absolute last thing you would ever want to do is scale first and then try to deinterlace, unless of course you like to make your eyes bleed).
2. Crop the black bars away (again, if you altered the aspect ratio before cropping the video this would be at the end of the chain).
3. dsize is used to change the intended aspect ratio used by all the following video filters (but doesn't change the aspect ratio itself).
4. Actually change the aspect ratio: a width of -1 tells it to use the original width (688 pixels), and a height of -2 tells it to scale the height using the other dimension and the intended aspect ratio.
Ok, I have lied a little - my source media is actually not interlaced in this case, so I did not use the kerndeint filter, but I wanted to drive home the point about the importance of getting the video filter order correct - I've seen it done wrong. My eyes started to bleed.
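
Since the order is the whole point, here is one way the chain could be assembled in a script so that the deinterlacer is only included when needed but always lands first. This is a sketch rather than something lifted from my rip script; build_vf and its arguments are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper. Arguments: "yes" if the source is interlaced,
# a crop=... string (may be empty), a display aspect such as 16:9 (may be empty).
build_vf() {
    interlaced=$1 crop=$2 aspect=$3
    chain=
    # Deinterlace first, before anything touches the pixels.
    [ "$interlaced" = yes ] && chain=kerndeint
    # Then crop away the black bars.
    [ -n "$crop" ] && chain=${chain:+$chain,}$crop
    # Finally set the intended aspect and let scale fix the height.
    [ -n "$aspect" ] && chain=${chain:+$chain,}dsize=$aspect,scale=-1:-2
    printf '%s\n' "$chain"
}

build_vf yes crop=688:432:18:72 16:9
build_vf no crop=688:432:18:72 16:9
```

The first call reproduces the chain above and the second is what I actually needed for this non-interlaced video; the result would go after -vf on the mplayer or mencoder command line.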

Friday, 16 January 2009

New Years Resolution: Massive Music Tag Cleanup

Once again I find that months have passed since my last entry. The blog will be a year old in little over a week and I will once again be attending linux.conf.au, this time down in Hobart. I've got myself some new gadgets - in particular a Eee PC 701SD which only cost $327 AU from JB Hifi so I have a decent computer for the conference. I'll be posting a lot more about it in the coming weeks, but am just mentioning it now as it is linked to today's post. Allow me to explain - while I've kept the default Xandros install on the internal 8 gig solid state drive I've installed Debian on a 2 gig SD card. 2 gig. yep. small, isn't it? and encrypted, but that's for another post. The point is that I've been looking for lightweight alternatives to all the software that I traditionally use in my day to day tasks, so while I'll happily leave Amarok alone on Xandros, I didn't really want to pull in all the KDE dependencies to have it on Debian, and I've come across a nice little ncurses music player called cmus to use instead.

Now, on my desktop and main laptop I use Amarok pretty much exclusively and have tried to keep all the tags in my music collection accurate - I try to check the track listing, the genre, the year and that the capitalisation complies with English capitalisation rules (except when it is apparent that the odd capitalisation is a conscious decision on part of the artist and forms part of the art). I'm well aware that I've missed some - some of the artists that have been in my collection for longer still have bad capitalisation and I've only started to check the accuracy of the album years recently.

But there is a larger problem - Amarok doesn't reveal every tag to me. While that doesn't matter in the least as long as I'm only using Amarok, it can matter when I use other media players. I'm not worried about any of those albums I own a physical copy of - they're all in ogg (but if you do need a powerful ogg tag editor, tagtool's advanced mode _looks_ promising), but rather the music I've downloaded and have left in mp3. I've been aware of the issue for a while because I occasionally observe some of the symptoms on the various media players available on the Internet Tablet. I have looked at dedicated tag editors, but until now I haven't been able to find one that would show me *every* tag - not just the ones it's programmed to recognise, not just the id3v2.3 tags, but all of them. And not just the first 30 characters of them either.

Why is this *so* important, they're just extra tags, right? Well, my biggest annoyance is that cmus uses the contents of the TPE2 tag if it is present for the Artist in its library view rather than the TPE1 tag which Amarok uses. TPE1 is defined as "Lead performer(s)/Soloist(s)", while TPE2 is defined as "Band/orchestra/accompaniment". Now, the TPE2 tag may well be perfectly valid and correct, but it is not a tag that I have been organising or validating so far with Amarok, so I'd like to get everything consistent and delete the TPE2 tags. While I'm at it, why not remove all the cover art from the mp3s - I've always felt it wasteful to keep 12 copies of the same image when I could and do just put a single image in the same folder. In fact, why not go and remove all the tags that aren't recognised by Amarok - do I really care that it was encoded with lame? I might be happy to leave the 'free download from http://www.last.fm' comment tags alone and I certainly don't want to destroy any comments that I've added, but do I really want any of the other comment tags in there?

So I finally found an id3 tag editing tool that can show me most of the tags - eyeD3. It's still not perfect - there isn't any support for id3v2.2, it doesn't show me the tags that replaygain uses and it did crash while parsing some of the mp3s - I dare say I'll have to come back to those later with another tool, even if it is hexedit. Edit: As the author pointed out, eyeD3 is in fact able to read id3v2.2 tags, just not write them and those crashes will doubtless be solved in no time.

The first step was to find out what tags are actually present in my collection:

find music -iname "*.mp3" -exec eyeD3 -v {} \; | tee index
sort -u index | awk -F\): '/^<.*$/ {print $1}' | uniq | awk -F\)\> '{print $1}' | awk -F\( '{print $(NF)}' > tags

So, that gives me a list of all the different types of tags in my collection - 44 unique tags in my case. Next step is to work out which ones are used by Amarok and if I want to keep any of the others. While I could go through and speculate on which of the three tags I can immediately see that might be a year, it's probably a better idea to look at the source code.

apt-get source amarok libtag1c2a
view amarok-1.4.9.1/amarok/src/metabundle.cpp
view taglib-1.4/taglib/mpeg/id3v2/id3v2tag.cpp

Some immediately obvious tags because it names their identifier directly are TPOS (Disc number), TBPM (beats per minute), TCOM (Composer - admittedly this is one tag that I have not been validating), TPE2 (which is marked as a non-standard MS/Apple extension - so it is aware of it but since it's messing up my collection and Amarok doesn't seem to display it anywhere I'm getting rid of it anyway) and TCMP (Compilation album, ie, show under various artists. Unfortunately cmus doesn't appear to use this tag, though does seem to have some logic for compilation albums - this is a matter I will need to investigate further later on).
Digging deeper to look past the nice friendly names that the programmers can recognise to the harsh id3 reality I also identify that I'll need to keep title (TIT2), artist (TPE1), album (TALB), comment (COMM), genre (TCON), year (TDRC) and track (TRCK) - as well as anything that is used when playing the file that isn't identified here.

Though Amarok can use images embedded in the mp3s, I don't want any - I much prefer to use Amarok's cover manager combined with copycover-offline.py to copy them into the appropriate directory (look through the comments for useful patches - hmmm, should probably submit my fix for albums with Various Artists come to think of it).

So, I made a list of these tags, one per line in a file called amaroktags. Then found all the tags in my collection that aren't supported by Amarok:

cat amaroktags tags | sort | uniq -u
view taglib-1.4/taglib/mpeg/id3v2/id3v2.4.0-frames.txt
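
One thing to watch with uniq -u on the two lists joined together is that it prints anything that appears in only one of them, so a tag that is in amaroktags but nowhere in the collection shows up too. A sketch of a stricter version (tags_to_strip is a made-up name) that lists the keep file twice so that none of its entries can ever be unique:

```shell
#!/bin/sh
# Hypothetical helper: print tags that appear in the "present" file but not in
# the "keep" file. Both files hold one tag per line.
tags_to_strip() {
    present=$1 keep=$2
    # The keep list goes in twice, so every keep tag occurs at least twice
    # and uniq -u can only print tags that came from the present list alone.
    sort -u "$present" | cat - "$keep" "$keep" | sort | uniq -u
}

# Scratch files standing in for the real tags and amaroktags lists:
work=$(mktemp -d)
printf '%s\n' TIT2 TPE1 TPE2 APIC > "$work/tags"
printf '%s\n' TIT2 TPE1 TALB > "$work/amaroktags"
tags_to_strip "$work/tags" "$work/amaroktags"
rm -rf "$work"
```

With the scratch data this prints APIC and TPE2 but not TALB, which is only in the keep list.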


Which left me with a list of tags that I wanted to keep:
COMM, TALB, TBPM, TCMP, TCOM, TCON, TDRC, TIT2, TPE1, TPOS, TRCK, MCDI (Music CD Identifier), TFLT (File type), TLEN (length, used for seeking), TSRC (International Standard Recording Code - the only album using it in my collection is Nine Inch Nails' Ghosts I-IV)

And an even larger list of tags to zap:
TPE2, APIC (Attached picture), TDTG (Tagging time), GEOB (arbitrary file), PCNT (Play count), POPM (Popularimeter), PRIV (private textual & binary data), TCOP (copyright), TDEN (encoding timestamp), TENC (Encoded by), TIT1 (content group description), TIT3 (Description refinement), TLAN (language), TMED (Media type), TOAL (Original title), TOFN (original filename),
TPUB (publisher), TSSE (encoding settings), TXXX (User defined text), UFID (unique file identifier), USLT (lyrics), WCOM (commercial info), WOAR (artist web page), WXXX (other URL)

As well as these ones that I couldn't identify, so I'll zap em and hope nothing breaks:
NCON, TAGC (appears to be a timestamp)

And a couple to manually check later:
TOPE (Original artist - I notice that Kong in Concert uses these for the original track names, though not accurately - they should probably be in TOAL), TYER and TDRL (years with subtly different meanings - taglib does seem to fallback and use these, but I will need to check for conflicts)

So, now I have a pretty definitive list of tags it's time to zap 'em (after backing up in case something blows up in my face of course). Although not immediately obvious it appears that using the --set-text-frame specifying the 4 letter name of the frame and no contents will remove it, even if it isn't a text frame. Now, this doesn't appear to actually conserve any space in the file - it shuffles the rest of the tags upwards and zeroes out the gap (presumably conserving the space would be possible, but I don't know an easy way off the top of my head - suggestions welcome). There may be some tags that you want to have more intelligent processing on - maybe only remove some of the images or maybe only remove some of the GEOBs and if that is the case read the eyeD3 documentation, but for me I'm sick of them all and want them gone:


find music -iname "*.mp3" -exec eyeD3 --set-text-frame=TAGC: --set-text-frame=TPE2: --set-text-frame=TDTG: --set-text-frame=TCOP: --set-text-frame=TDEN: --set-text-frame=TENC: --set-text-frame=TIT1: --set-text-frame=TIT3: --set-text-frame=TLAN: --set-text-frame=TMED: --set-text-frame=TOAL: --set-text-frame=TOFN: --set-text-frame=TPUB: --set-text-frame=TSSE: --set-text-frame=TXXX: --set-text-frame=UFID: --set-text-frame=USLT: --set-text-frame=WCOM: --set-text-frame=WOAR: --set-text-frame=WXXX: --set-text-frame=NCON: --set-text-frame=APIC: --set-text-frame=GEOB: --set-text-frame=PCNT: --set-text-frame=POPM: --set-text-frame=PRIV: --set-text-frame=TCMP: {} \; | tee log
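
Typing that many --set-text-frame options by hand is asking for a typo, so the argument list can be generated from a plain list of frame names instead. A minimal sketch; strip_args is a hypothetical helper and it only prints the arguments, it does not touch any files:

```shell
#!/bin/sh
# Hypothetical helper: turn a space-separated list of frame IDs into
# --set-text-frame=XXXX: arguments (empty contents, which removes the frame).
strip_args() {
    for tag in $1; do
        printf '%s=%s: ' --set-text-frame "$tag"
    done
    printf '\n'
}

strip_args "TPE2 APIC GEOB"
```

The output can be dropped into the find command unquoted, for example as $(strip_args "TPE2 APIC GEOB") in place of the long option list, and putting echo in front of eyeD3 first gives you a dry run.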


Depending on how large your collection is, at this stage you may choose to blink, stretch your arms, get some coffee, go to bed or take a vacation. Personally, I wrote a blog post.

I still have some things I know I'll have to fix up - the Deus Ex Soundtracks all seem to have multiple redundant comments, and there are some non English comment fields, but you should by this stage have a decent understanding on how to do this - that is of course, if this whole article didn't just go over your head (congrats if it did and you still read this far though :)

update: It turns out that the TCMP frame is not actually set by Amarok, so my solution is to remove all the TCMP flags from the library (I've added it to the above list, though where they are 1 in my collection is correct, but very few of the other tracks in the same album are tagged in the same way and would explain some odd behaviour when importing the albums), then to manually add them for all relevant tracks, which hopefully will ease future migration. Unfortunately as best I can tell, cmus doesn't appear to have any concept of compilation albums in its id3.c. OGG files will supposedly get them since their tags don't require almost one thousand lines of C code to process (by contrast, cmus' vorbis.c file has a mere 285 lines including 33 lines of tag parsing), which begs the question as to why only 1 of my OGG compilation albums is marked as such in cmus.

find music/V/Various\ Artists/ -iname "*.mp3" -exec eyeD3 --set-text-frame=TCMP:1 {} \;


update: I've written a simple shell script to do this automatically, just save this as striptags.sh and execute it from your music directory:

#!/bin/sh

oktags="COMM TALB TBPM TCMP TCOM TCON TDRC TIT2 TPE1 TPOS TRCK MCDI TFLT TLEN TSRC"

indexfile=`mktemp`

#Determine tags present:
find . -iname "*.mp3" -exec eyeD3 -v {} \; > $indexfile
tagspresent=`sort -u $indexfile | awk -F\): '/^<.*$/ {print $1}' | uniq | awk -F\)\> '{print $1}' | awk -F\( '{print $(NF)}' | awk 'BEGIN {ORS=" "} {print $0}'`

rm $indexfile

#Determine tags to strip:
tostrip=`echo -n $tagspresent $oktags $oktags | awk 'BEGIN {RS=" "; ORS="\n"} {print $0}' | sort | uniq -u | awk 'BEGIN {ORS=" "} {print $0}'`

#Confirm action:
echo
echo The following tags have been found in the mp3s:
echo $tagspresent
echo These tags are to be stripped:
echo $tostrip
echo The tags will also be converted to ID3 v2.4 where appropriate
echo
echo -n Press enter to confirm, or Ctrl+C to cancel...
read dummy

#Strip 'em
stripstring=`echo $tostrip | awk 'BEGIN {FS="\n"; RS=" "} {print "--set-text-frame=" $1 ": "}'`
find . -iname "*.mp3" -exec eyeD3 --to-v2.4 $stripstring {} \; | tee -a striptags.log