One of the most frequently requested features for historious is del.icio.us link importing, so I looked into how to add it. It turns out that the delicious bookmark export is an HTML page of links, so it should be possible to upload this file, extract all the links and add them to historious.

While trying to do this, I found a URL-validating regular expression on Stack Overflow which is thorough enough not only to validate a URL, but also to find URLs inside a page. The original would not match some URLs, so I have changed it quite a bit, and now it works fine (IPv6 is not supported, however).

Some quick coding later, the feature is almost ready to push to production. In the interests of contributing to the community, and to make it easier for people to validate their URLs, here’s the regex:

(https?:\/\/                                                   # protocol
(?:(?:[a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+            # username
(?::(?:[a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?         # password
@)?                                                            # auth requires @
(?:(?:(?:[a-z0-9][a-z0-9-]*[a-z0-9]\.)+                        # domain segments AND
[a-z]{2}(?:[a-z0-9-]*[a-z0-9])?\.?                             # top level domain OR
|(?:(?:2[0-4][0-9]|25[0-5]|1\d{2}|[1-9]\d|\d)\.){3}
(?:2[0-4][0-9]|25[0-5]|1\d{2}|[1-9]\d|\d))                     # IP address
(?::\d+)?)                                                     # port
(?:(?:(?:\/+(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)* # path
(?:\?(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)         # query string
?)?)?                                                          # path/query string optional
(?:\#(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?)       # fragment

If you use it, make sure to set the “case insensitive” and “verbose” flags. As you can see, it matches authentication, IP address/domain, port, path, query strings and fragments. It has managed to find most URLs I could throw at it, but if you have any valid URLs it won’t match, please send them to me and I will update it.
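
For anyone who wants to try it, here is a minimal Python sketch of setting those two flags and using the pattern to pull links out of a page. The names `URL_RE` and `extract_urls` and the sample bookmark lines are made up for illustration; this is not the historious import code, just one way the regex above could be used.

```python
import re

# The regex from this post, unchanged. It relies on VERBOSE (so the
# whitespace and "#" comments are ignored) and IGNORECASE (the character
# classes only list lowercase letters).
URL_RE = re.compile(r"""
(https?:\/\/                                                   # protocol
(?:(?:[a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+            # username
(?::(?:[a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?         # password
@)?                                                            # auth requires @
(?:(?:(?:[a-z0-9][a-z0-9-]*[a-z0-9]\.)+                        # domain segments AND
[a-z]{2}(?:[a-z0-9-]*[a-z0-9])?\.?                             # top level domain OR
|(?:(?:2[0-4][0-9]|25[0-5]|1\d{2}|[1-9]\d|\d)\.){3}
(?:2[0-4][0-9]|25[0-5]|1\d{2}|[1-9]\d|\d))                     # IP address
(?::\d+)?)                                                     # port
(?:(?:(?:\/+(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)* # path
(?:\?(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)         # query string
?)?)?                                                          # path/query string optional
(?:\#(?:[a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?)       # fragment
""", re.IGNORECASE | re.VERBOSE)


def extract_urls(text):
    """Return every URL found in text, in order of appearance, without duplicates."""
    seen = set()
    urls = []
    for match in URL_RE.finditer(text):
        url = match.group(1)  # group 1 wraps the whole URL
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical snippet in the style of an HTML bookmark export.
sample = '''
<DT><A HREF="http://example.com/some/page?x=1#top">Example</A>
<DT><A HREF="https://user:secret@192.168.0.1:8080/admin">Router</A>
'''

for url in extract_urls(sample):
    print(url)
```

Because the pattern is not anchored, `finditer` scans the whole page and returns each URL it can find; to validate a single string instead, `URL_RE.fullmatch(candidate)` should do the job.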

Also, to the people who say that it’s too unreadable: Learn to read regexes better!