– a Perl script for archiving URL sets in the Internet Archive


Introduction

This script lets you collect URLs in a text file and stores them in the Internet Archive. It fetches the documents itself and scrapes some metadata in order to generate an HTML link list that is suitable for posting to a blog or for publishing as an Atom feed. Windows users who lack Perl on their machine can obtain it as an exe file.

Table of contents


Perl 5.24 (earlier versions are not tested, but the script is likely to work with any build that is capable of installing the required modules).



Collect the URLs you want to archive in the file urls.txt, separated by line breaks and UTF-8-encoded, and run the script with perl without arguments. The script does two things: it fetches the URLs and extracts some metadata (works with HTML and PDF), and it submits them to the Internet Archive by opening them in a browser. This is necessary because the Internet Archive blocks robots globally. Then it generates an HTML file with a link list that you may post to your blog. Alternatively, you can get the link list as an Atom feed. Regardless of the format, you can upload the file to a server via FTP. Optional parameters available:

  • -a: output as Atom feed instead of HTML
  • -c <creator>: name of feed creator (feed only)
  • -d <path>: FTP path
  • -f <filename>: name of input file if other than `urls.txt`
  • -n <username>: FTP user
  • -o <host>: FTP host
  • -p <password>: FTP password
  • -s: save feed in Wayback Machine (feed only)
  • -u <URL>: feed URL (feed only)
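
The input rules described above (UTF-8 text, one URL per line, a possible BOM at the start, blank spaces allowed inside a URL) can be illustrated with a minimal Perl sketch. This is not the script's actual code: the helper names `parse_url_list` and `wayback_save_url` are hypothetical, and the `https://web.archive.org/save/` prefix is an assumption about the submission URL, which the script itself may build differently.

```perl
use strict;
use warnings;

# Hypothetical helper: split the decoded text of a URL list into URLs.
sub parse_url_list {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;              # drop a leading BOM (U+FEFF once decoded)
    my @urls;
    for my $line (split /\R/, $text) {    # line breaks are the only separator
        $line =~ s/\A\s+//;               # trim surrounding whitespace,
        $line =~ s/\s+\z//;               # but keep blanks inside the URL
        next if $line eq '';              # skip empty lines
        push @urls, $line;
    }
    return @urls;
}

# Hypothetical helper: build a Wayback Machine "save" URL.
# The prefix is an assumption, not taken from the script.
sub wayback_save_url {
    my ($url) = @_;
    (my $escaped = $url) =~ s/ /%20/g;    # percent-encode blank spaces
    return 'https://web.archive.org/save/' . $escaped;
}

# Small in-memory demonstration; a real list would be read from a file
# opened with the '<:encoding(UTF-8)' layer.
my $sample = "\x{FEFF}https://example.org/a b\r\n\r\nhttps://example.org/c\n";
print wayback_save_url($_), "\n" for parse_url_list($sample);
```

Each resulting URL could then be handed to a browser, for example with `open_browser` from Browser::Open, which is the module the changelog mentions.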




  • Enhanced metadata scraping.
  • Archive images from Twitter in different sizes.
  • Added project page link to outfile.
  • Remove UTF-8 BOM from infile.
  • User agent avoids the strings `archiv` and `wayback`.
  • Internet Archive via TLS URL.
  • Thumbnail if URL points to an image.



  • Debugging messages removed.
  • Archive.Org URL changed.



  • Internationalized domain names (IDN) allowed in URLs.
  • Blank spaces allowed in URLs.
  • URL list must be in UTF-8 now!
  • Only line breaks allowed as list separator in URL list.



  • Added workaround for Windows ampersand bug in Browser::Open (ticket on CPAN).



Copyright © Ingram Braun
GPL 3 or higher.



or clone it from GitHub:

$ git clone

