– a Perl script for archiving URL sets in the Internet Archive

Introduction

The script lets you collect URLs in a text file and stores them in the Internet Archive. It fetches the documents itself and scrapes some metadata in order to generate an HTML link list suitable for posting to a blog, or an Atom feed. Windows users who lack Perl on their machine can obtain the script as an .exe file.

Table of contents


Perl 5.24 (earlier versions not tested, but the script is likely to work with every build that is capable of installing the required modules). If there are issues with installing the XMLRPC::Lite module, install it with CPAN's notest pragma.



Collect the URLs you want to archive in the file urls.txt, separated by line breaks and UTF-8-encoded, and call the script with perl without arguments. The script does two things: it fetches the URLs and extracts some metadata (works with HTML and PDF), and it submits them to the Internet Archive, optionally with automated submission to WordPress. It then generates an HTML file with a link list that you may post to your blog; alternatively, you can get the link list as an Atom feed. Additionally, you can post the links on Twitter. Regardless of the format, you can upload the output file to a server via FTP. If an archived URL points to an image, a thumbnail is shown in the output file.
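As a minimal sketch of the workflow: the page omits the script's file name, so `ia-archiver.pl` below is a placeholder, and the URLs are examples only.

```shell
# Build the input file: one URL per line, UTF-8-encoded,
# line breaks as the only separator. URLs are placeholders.
printf '%s\n' \
    'https://example.com/article.html' \
    'https://example.org/paper.pdf' \
    > urls.txt

# Then call the script without arguments; it reads urls.txt by default:
#   perl ia-archiver.pl
```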

The Internet Archive has a submission limit of 15 URLs per minute per IP. Set an appropriate delay (at least five seconds) to stay within it. If you want to be faster, use a proxy server that rotates IPs. This is possible, for example, with Tor running as a service (the Tor Browser on Windows does not work here!). Set MaxCircuitDirtiness 10 in the configuration file (/path/to/torrc) to rotate IPs every ten seconds.
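For illustration, the relevant torrc fragment looks like this (the path is whatever your Tor installation uses):

```
# /path/to/torrc — build a fresh circuit (and thus a new exit IP)
# every ten seconds
MaxCircuitDirtiness 10
```

With the Tor service listening on its default SOCKS port 9050, the proxy is then passed to the script via `-P socks4://localhost:9050`, typically combined with a short `-T` delay.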

The following optional parameters are available:

-a                              Output as Atom feed instead of HTML
-c <creator>                    Name of feed creator (feed only)
-d <path>                       FTP path
-D                              Debug mode - don't save to the Internet Archive
-f <filename>                   Name of input file if other than `urls.txt`
-h                              Show help
-i <title>                      Feed or HTML title
-k <consumer key>               Twitter consumer key
-n <username>                   FTP or WordPress user
-o <host>                       FTP host
-p <password>                   FTP or WordPress password
-P <proxy>                      Proxy, e.g. socks4://localhost:9050 for a Tor service
-r                              Obey robots.txt
-s                              Save feed in the Wayback Machine (feed only)
-t <access token>               Twitter access token
-T <seconds>                    Delay per URL in seconds to respect the Internet Archive's request limit
-u <URL>                        Feed or WordPress (xmlrpc.php) URL
-v                              Show version info
-w                              *deprecated*
-x <secret consumer key>        Twitter consumer secret
-y <secret access token>        Twitter access token secret
-z <time zone>                  Time zone (WordPress only)
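Putting several options together, a hypothetical call could look like the following sketch (script name, title, credentials, and hosts are all placeholders):

```
perl ia-archiver.pl -a -f links.txt -T 5 \
     -i "Weekly links" -c "Jane Doe" \
     -o ftp.example.com -n ftpuser -p secret -d /htdocs/feeds
```

This would read links.txt instead of urls.txt, wait five seconds per URL, generate an Atom feed instead of HTML, and upload the result via FTP.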




  • Introduced option -P to connect to a proxy server that can rotate IPs (e.g. Tor).
  • User agent bug in LWP::UserAgent constructor call fixed.
  • -T can operate with floats.
  • Screen logging enhanced (total execution time and total number of links).
  • IA JSON parsing more reliable.


  • Script can save all linked URLs, too (IA has restricted this service to logged-in users running JavaScript).
  • Debug mode (does not save to IA).
  • WordPress bug fixed (8-bit ASCII in text led to a database error).
  • Ampersand bug in ‘URL available’ request fixed.
  • Trim metadata.
  • Disregard robots.txt by default.


  • Post HTML outfile to WordPress.
  • Wayback Machine saves all documents linked in the URL if it is HTML (Windows only).
  • Time delay between processing of URLs because the Internet Archive set up a request limit.
  • Version and help switches.


  • Tweet URLs.
  • Enhanced handling of PDF metadata.
  • Always save biggest Twitter image.



Not published.



  • Supports wget and PowerShell (-w flag).
  • Displays the date of the closest Wayback Machine copy.
  • Better URL parsing.
  • Windows executable is 64-bit only since not all modules install properly on 32-bit.



  • Enhanced metadata scraping.
  • Archive images from Twitter in different sizes.
  • Added project page link to outfile.
  • Remove UTF-8 BOM from infile.
  • User agent avoids the strings "archiv" and "wayback".
  • Internet Archive via TLS URL.
  • Thumbnail if URL points to an image.



  • Debugging messages removed.
  • Archive.Org URL changed.



  • Internationalized domain names (IDN) allowed in URLs.
  • Blank spaces allowed in URLs.
  • URL list must be in UTF-8 now!
  • Only line breaks allowed as list separator in URL list.



  • Added workaround for Windows ampersand bug in Browser::Open (ticket on CPAN).



Copyright © 2015–2020 Ingram Braun
Licensed under GPL 3 or higher.



or clone it from GitHub:

$ git clone

