Hyperlinks in your files? How to get them out using tikalinkextract

Related to my work exploring hyperlinks in documentary heritage – something I feel we’ll be taking care of for a long time – I created a hyperlink extract tool called tikalinkextract.

Put simply – the tool will take your collection of files, extract the intellectual content using Apache Tika, and then analyse that content for anything ‘looking like’ a hyperlink. (And recently added, thanks to Andrew Berger on Twitter, ‘mailto:’ links.)

Tika’s list of supported formats grows with each new release, so building a tool around Tika’s ability to extract content from files makes perfect sense. In theory, if the mechanics of tikalinkextract can be perfected over time, then the more formats Tika can read, the more formats we can mine for links to potentially evidential external records, and so think about preserving them.

Architecture

Apache Tika Server: tikalinkextract connects to the Apache Tika server over TCP/IP to minimise the amount of embedding that needs to happen inside the code. This is a pattern I have found very attractive recently and will continue to work on.
httpreserve linkscanner: A golang package I created to tokenize strings and look for hyperlinks. This is the engine of the tool – if we continue to improve the hyperlink spotting capabilities of this, it can be deployed in multiple other applications.
httpreserve tikalinkextract: The front-end for the two other main components that sends requests to the server and then aggregates the results from linkscanner. tikalinkextract takes care of walking directories for you.
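To make the division of labour above a little more concrete, here is a minimal sketch of the same pattern in Python rather than Go. It is not the linkscanner or tikalinkextract code. The function names (extract_text, find_links) are made up for illustration, and it assumes a Tika Server listening on its default port 9998, where a PUT to /tika with an "Accept: text/plain" header returns the extracted text. The link spotting is deliberately naive: split on whitespace, trim surrounding punctuation, keep anything that starts with a known scheme.

```python
import urllib.request
from urllib.parse import urlsplit

# Assumption: a local Tika Server on its default port.
TIKA_URL = "http://localhost:9998/tika"

# Prefixes treated as 'looking like' a link in this sketch.
SCHEMES = ("http://", "https://", "ftp://", "mailto:")


def extract_text(path, tika_url=TIKA_URL):
    """Send one file to Tika Server and return the plain text it extracts."""
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            tika_url,
            data=fh.read(),
            method="PUT",
            headers={"Accept": "text/plain"},
        )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def find_links(text):
    """Tokenise text on whitespace and return the tokens that look like links."""
    links = []
    for token in text.split():
        # Trim punctuation and quotes that commonly surround a link in prose.
        token = token.strip("<>()[]{}\"'‘’“”.,;")
        lower = token.lower()
        if lower.startswith("www."):
            token = "http://" + token
        elif not lower.startswith(SCHEMES):
            continue
        try:
            parts = urlsplit(token)
        except ValueError:
            continue  # malformed, e.g. an unbalanced IPv6 bracket
        if parts.netloc or (parts.scheme == "mailto" and parts.path):
            links.append(token)
    return links
```

With a server running, find_links(extract_text("DPP_Glossary.pdf")) would give the links for a single record. Keeping the scanner separate from the Tika call is what would let it be reused in other applications.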

Output

The default mode of the tool is to output a CSV (comma-separated values) list that tells users the name of the record where a hyperlink has been found, and the hyperlink itself:

"DPP_Glossary.pdf", "http://www.cdlib.org/services/uc3/curation"
"DPP_Glossary.pdf", "http://www.cdlib.org/gateways/technology/glossary.html"
"DPP_Glossary.pdf", "http://www.dcc.ac.uk/digital-curation/glossary"
"DPP_Glossary.pdf", "http://www.dpconline.org/advice/preservationhandbook/introduction/definitions-and-concepts"
"DPP_Glossary.pdf", "http://www.diglib.org/about/dldefinition.htm"
"DPP_Glossary.pdf", "http://www.icpsr.umich.edu/icpsrweb/icpsr/curation/preservation/glossary.jsp"
"DPP_Glossary.pdf", "http://www.copyright.gov/circs/circ1.pdf"
"DPP_Glossary.pdf", "http://www.oclc.org/research/activities/past/orprojects/pmwg/premis-"
"DPP_Glossary.pdf", "http://www.oclc.org/research/activities/past/rlg/trustedrep/repositories.pdf"
"DPP_Glossary.pdf", "http://en.wikipedia.org/wiki/open_source"
"DPP_Glossary.pdf", "http://www.wipo.int/about-ip/en"

Links are unique to the record, but not to the collection. As such, to better support a potential web-archiving workflow in government, a seed-mode was created. This outputs a unique set of URLs per collection of files.
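The difference between the two modes can be sketched in a few lines of Python. This is an illustration of the idea, not how tikalinkextract implements seed-mode, and seeds_from_csv is a hypothetical name. It takes the default CSV output and reduces it to one entry per URL across the whole collection, keeping first-seen order. One detail to watch for: the CSV shown above has a space after each comma, so a standard CSV parser needs to be told to skip it.

```python
import csv
import io


def seeds_from_csv(csv_text):
    """Reduce record,link rows to a de-duplicated list of links."""
    seen = set()
    seeds = []
    # skipinitialspace handles the space that follows each comma in the output.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) < 2:
            continue  # blank or malformed line
        link = row[1].strip()
        if link and link not in seen:
            seen.add(link)
            seeds.append(link)
    return seeds
```

Given two records that both cite the same URL, the CSV would list it twice (once per record), while the seed list carries it once.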

How to make this work for you?

Wrap your files in a top-level directory and run tikalinkextract (instructions below are for Linux):

./start-tools.sh

This will start the Apache Tika-1.16 server.

./tikalinkextract -seeds -file your-files/ > linkslist.txt

This will output a unique list of URLs to linkslist.txt

How to archive this?

Enter wget (manual).

GNU Wget is a free utility for non-interactive download of files from the Web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies.
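Because Wget only speaks those protocols, it may be worth checking the seed list before handing it over. The sketch below assumes, without having confirmed it, that entries such as ‘mailto:’ links can end up in linkslist.txt. The name fetchable is hypothetical; it simply keeps the lines whose scheme is one Wget can retrieve.

```python
from urllib.parse import urlsplit

# The protocols GNU Wget can retrieve.
WGET_SCHEMES = {"http", "https", "ftp"}


def fetchable(seeds):
    """Return the seed URLs whose scheme Wget can download."""
    keep = []
    for line in seeds:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        try:
            scheme = urlsplit(url).scheme.lower()
        except ValueError:
            continue  # skip anything that does not parse as a URL
        if scheme in WGET_SCHEMES:
            keep.append(url)
    return keep
```

Reading linkslist.txt line by line, passing the lines through fetchable, and writing the result back out would give a list that contains only downloadable seeds.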

Wget has supported wrapping downloaded files in the WARC (Web ARChive) format since version 1.14, and it is now on version 1.18.

To make this work for us, we can take linkslist.txt and run the following command.

wget --page-requisites \
   --span-hosts \
   --convert-links \
   --execute robots=off \
   --adjust-extension \
   --no-directories \
   --directory-prefix=output \
   --warc-cdx \
   --warc-file=accession.warc \
   --wait=0.1 \
   --user-agent=httpreserve-wget/0.0.1 \
   -i linkslist.txt

There are a lot of options there. I won’t go into them all (RTM!). For me, the important options to highlight are:

--no-directories
--directory-prefix=output

Wherever you run this command it will still download all the files associated with the web-archive, despite also wrapping them inside a WARC file. To make sure that they can be easily cleaned-up afterwards, and they don’t pollute whatever directory they are in, I have opted for them to be stored in a directory called ‘output’.

-i

This argument takes a file containing a list of links, which all get wrapped into the corresponding WARC.

NB. Here is a further explanation of the other options via Explainshell.com.

When the command completes, a WARC will exist that you can inspect with tools like DROID and Siegfried to see that it contains all the files associated with representations of the websites listed in your seed list.
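For a quick look without either tool, a WARC is simple enough to walk by hand: each record is a block of "Name: value" headers, a blank line, then a body of Content-Length bytes. The sketch below is a minimal reader under that assumption, not a validating parser, and warc_records is a hypothetical name. It lists the record type and target URI of every record, which is enough to compare against the seed list.

```python
import gzip


def warc_records(path):
    """Yield (WARC-Type, WARC-Target-URI) for each record in a WARC file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break  # end of file
            if not line.startswith(b"WARC/"):
                continue  # blank separator lines between records
            headers = {}
            while True:
                line = fh.readline()
                if line in (b"\r\n", b"\n", b""):
                    break  # a blank line ends the header block
                name, _, value = line.decode("utf-8", "replace").partition(":")
                headers[name.strip().lower()] = value.strip()
            # Skip the body so the next read starts at the following record.
            fh.read(int(headers.get("content-length", 0)))
            uri = headers.get("warc-target-uri")
            if uri:
                uri = uri.strip("<>")  # some writers wrap the URI in angle brackets
            yield headers.get("warc-type"), uri
```

Point it at whatever file name Wget actually wrote, since Wget adds its own extension to the value given to --warc-file. Records such as warcinfo have no target URI, so expect None for those.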

What to do with the WARC?

Well, it should be no secret that I’m still exploring web archiving. I need to look into tools that let me dig deeper into the structure of WARC files and what is stored in them.

Of course, the two main httpreserve tools, httpreserve.info, and httpreserve-workbench, have been created to help explore what links are still active, or inactive: https://github.com/httpreserve/wadl-2017

My hope is that a government archive can store a record of the hyperlinks associated with a record, somewhere adjacent to the collection, and inform users about how to access this and what it means to them. This work, however, runs a little deeper than that.

Why is this important?

We cover some of this in the original blog. I also write about this in my recent article for Archives and Manuscripts: Binary Trees? Automatically Identifying the Links Between Born-digital Records.

Where I find myself today, and what I’m trying to do is:

Demonstrate the value of a government permalink service by building an evidence base that shows the number of hyperlinks being used in records and the number of those links being lost. Collecting these links at point-of-transfer may ~~(will)~~ already be too late.
Preserve links to external records: Any hyperlink that has somehow informed a decision made in a public office.
Preserve links to content-management-system records: As more content management systems (CMS) become web-enabled, or web-based, the technology used in web-archiving becomes more relevant to records and information managers. If a CMS link is used in a record, how do we maintain that connection? What happens when the link becomes a 404?

Comments on this blog on any-or-all aspects would be appreciated. Knowing what else is out there is good. Knowing how this work can be improved, also.

Further Reading

Developing this blog, I came across a number of useful bits and pieces.

Archive Team and Wget and WARC: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
Other developers out there trying to get web-archiving working for them: https://www.petekeen.net/archiving-websites-with-wget
wget manual: https://www.gnu.org/software/wget/manual/wget.html
This is a great blog by Raffaele Messuti that outlines a ‘shell-based’ approach on Linux to doing this work as well. https://literarymachin.es/epub-linkrot/

Get tikalinkextract

Code and releases, here: https://github.com/httpreserve/tikalinkextract/releases/tag/0.0.2
