Difficulty: beginner
Estimated Time: 15-20 minutes

Welcome to Web Archive Discovery

In this tutorial, you will learn how to process WARC files so you can perform full-text search, faceted search, and basic analytics on their contents.

We will use the webarchive-discovery toolkit to process the WARCs and populate an Apache Solr full-text search index.

We will then explore the results using Apache Solr itself, and the Shine and Warclight user interfaces.

Pre-requisites

Basic familiarity with the UNIX command line is recommended.

We will use Docker containers to run the services you need during this tutorial. Some basic familiarity with Docker will help you get more out of this exercise, but if you are not familiar with Docker you can just run the commands as suggested and focus on indexing and exploring the WARCs themselves.

IMPORTANT NOTE

This tutorial system provides a safe space for experimentation, but please note that it is temporary: no data will be kept once the session times out or once you leave the tutorial.

Created by The UK Web Archive in partnership with The Archives Unleashed Project.

Well done! You've completed this introduction to Web Archive Discovery!

If you want to know more about the tools and commands you've used, you can visit:

If you have any questions, don't hesitate to get in touch with us via:

Thanks!

The UK Web Archive Team

Introduction to Web Archive Discovery

Step 1 - Start Solr

First, we need to start an Apache Solr service, so we can populate it with our WARC data.

Start Solr

We have prepared a packaged-up Solr server that contains a suitable configuration. Crucially, this includes a data schema that has been developed for working with web archive data as part of the Web Archive Discovery toolkit.

You can run it with the following command. Like all the commands in this tutorial, you can just click it and it will start running in the Terminal view to the right:

docker run --name solr -d -p 8983:8983 ukwa/webarchive-discovery-solr

The first run might take a little while, because Docker has to download and unpack the image before Solr can start up fully.
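If you would rather wait for Solr from the command line than watch for it, a small retry loop is one option. The following is only a sketch: wait_for and solr_is_up are hypothetical helper names, not part of the toolkit, and the URL assumes the 8983 port mapping from the docker run command above.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the probe command quietly; stop as soon as it succeeds.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Hypothetical probe: succeeds once the Solr web endpoint answers
# without an HTTP error (assumes the -p 8983:8983 mapping used above).
solr_is_up() {
  curl -sf "http://localhost:8983/solr/"
}

# Example (not run automatically): try up to 30 times, 2 seconds apart.
#   wait_for 30 2 solr_is_up && echo "Solr is responding"
```

The retry logic is kept separate from the probe so the same loop can be reused for the other services started later in the tutorial.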

Check the log file

Once it's started running, if you want to have a look at what's going on, you can type:

docker logs -f solr

This will follow the logs as they are written.

Once the logs have settled down, exit by pressing Ctrl-C and move on to the next step.
