archive.pl – a Perl script for archiving URL sets in the Internet Archive

Introduction

archive.pl lets you collect URLs in a text file and stores them in the Internet Archive. It fetches the documents itself and scrapes some metadata in order to generate a link list in HTML that is suitable for posting to a blog or as an Atom feed. Windows users who lack Perl on their machine can obtain it as an exe file.

Table of contents

Requirements

Perl 5.24 (earlier versions have not been tested, but the script is likely to work with every build that is capable of getting the required modules installed).

Usage

Collect the URLs you want to archive in a file named urls.txt, one per line and UTF-8-encoded, then call perl archive.pl without arguments. The script does two things. First, it fetches the URLs and extracts some metadata (this works with HTML and PDF). Second, it submits them to the Internet Archive by opening them in a browser, which is necessary because the Internet Archive blocks robots globally. It then generates an HTML file with a link list that you may post to your blog. Alternatively, you can get the link list as an Atom feed. Regardless of the format, you can upload the file to a server via FTP. The following optional parameters are available:

  • -a: output as Atom feed instead of HTML
  • -c <creator>: name of feed creator (feed only)
  • -d <path>: FTP path
  • -f <filename>: name of input file if other than `urls.txt`
  • -n <username>: FTP user
  • -o <host>: FTP host
  • -p <password>: FTP password
  • -s: save feed in Wayback Machine (feed only)
  • -u <URL>: feed URL (feed only)
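
The following is a minimal sketch of a typical session, not a definitive recipe. It first writes a URL list, then shows the default call and a call that combines the feed and FTP options listed above. All URLs, the creator name, the FTP host, user, path and password, and the file name reading-list.txt are made-up placeholders. The run_archive helper is hypothetical and not part of archive.pl; it only guards against the script being absent from the current directory.

```shell
#!/bin/sh
# Sketch only: every URL, name, host and path below is a made-up placeholder.

# 1. Write the URL list: UTF-8, one URL per line. According to the v1.2
#    changelog, blank spaces and internationalized domain names are allowed.
printf '%s\n' \
    'https://example.com/article.html' \
    'https://example.org/papers/annual report.pdf' \
    'https://bücher.example/katalog' > urls.txt

# Hypothetical helper (not part of archive.pl): run the script if it is in
# the current directory, otherwise only print the command that would be run.
run_archive() {
    if [ -f archive.pl ]; then
        perl archive.pl "$@"
    else
        echo "would run: perl archive.pl $*"
    fi
}

# 2. Default call: reads urls.txt and generates the HTML link list.
run_archive

# 3. Atom feed built from a differently named list (-f), with creator (-c)
#    and feed URL (-u), saved in the Wayback Machine (-s) and uploaded
#    via FTP (-o, -n, -p, -d).
cp urls.txt reading-list.txt
FEED_URL='https://example.com/feeds/archive.xml'
FTP_PASSWORD="${FTP_PASSWORD:-changeme}"
run_archive -a -c 'Jane Doe' -u "$FEED_URL" -s \
    -f reading-list.txt \
    -o ftp.example.com -n jane -p "$FTP_PASSWORD" -d /feeds
```

Note that a password passed with -p is visible in the shell history and in the process list, so reading it from an environment variable as above only keeps it out of the script file itself.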

Changelog

v1.3

  • Debugging messages removed.
  • Archive.Org URL changed.

v1.2

  • Internationalized domain names (IDN) allowed in URLs.
  • Blank spaces allowed in URLs.
  • URL list must be in UTF-8 now!
  • Only line breaks allowed as list separator in URL list.
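
Since the URL list must be UTF-8 as of this version, an older list in another encoding needs converting first. The sketch below shows one possible way to check and convert a list with the standard iconv tool. It assumes the old list is in ISO-8859-1, and the file names urls-latin1.txt and urls-utf8.txt are placeholders; none of this is part of archive.pl.

```shell
#!/bin/sh
# Sketch only: file names are placeholders; assumes an old list in ISO-8859-1.

# Create a sample legacy list (byte \374 is "ü" in ISO-8859-1).
printf 'https://b\374cher.example/\n' > urls-latin1.txt

# iconv exits non-zero when the input is not valid in the source encoding,
# so converting UTF-8 to UTF-8 serves as a validity check.
if iconv -f UTF-8 -t UTF-8 urls-latin1.txt > /dev/null 2>&1; then
    # Already valid UTF-8: copy unchanged.
    cp urls-latin1.txt urls-utf8.txt
else
    # Not valid UTF-8: convert from the assumed legacy encoding.
    iconv -f ISO-8859-1 -t UTF-8 urls-latin1.txt > urls-utf8.txt
fi
```

The converted file can then be renamed to urls.txt or passed to the script with -f.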

v1.1

  • Added workaround for Windows ampersand bug in Browser::Open (ticket on CPAN).

License

Copyright © Ingram Braun
GPL 3 or higher.

Download

or clone it from GitHub:

$ git clone https://github.com/CarlOrff/archive.git
