Happy World Digital Preservation Day from Freiburg!
Since the early days of the PLANETS project we have been working on advancing emulation as a preservation tool. Previous years were mostly focused on simplifying access to digital objects such as CDROMs. These technologies and workflows have matured and are now refined and put into production by the EaaSI initiative led by Yale University Library and generously supported by the Andrew W. Mellon Foundation and the Alfred P. Sloan Foundation.
As a research group we have to move on to new challenges and have started to look into more complex setups and objects. While CDROMs were the primary publication medium for digital content in the 1990s, the (web-based) Internet became the dominant medium in the early 2000s. With this article we start a loose series on our technical and conceptual work on emulating networked computers and the access options this opens to related digital content.
As a first public example use-case for the EaaS emulated network infrastructure, we have acquired two WordPress instances from two DFG Collaborative Research Centers (German: Sonderforschungsbereich, SFB).
Example: “The Heroic as ‘Gift’ on the Victorian and Edwardian Book Market” (CRC 948) is a database resulting from a completed sub-project of the first funding phase of the Collaborative Research Centre 948 “Heroes – Heroizations – Heroisms: Transformations and Conjunctures from Antiquity to the Modern Day”. The database was implemented using WordPress, but due to a number of unfortunate decisions, aspects of sustainability were not considered. The data was entered using an add-on that does not store it in a relational database, so it can only be extracted and transferred to another system with considerable effort. Moreover, after the end of the project the WordPress instance will meet its typical fate: as there will be no administrator left who can take care of updates, the continued operation of the database will, for security reasons, no longer be possible.
To address this dire scenario, we took snapshots of the machines and uploaded them to an EaaS instance. Before adding a machine to a virtual network, we had to perform a few preparatory steps (some of these tasks will be automated in the future):
- Record (or change) the username/password for the admin user
- The web server (and, in our case, WordPress) should take the hostname and port from the Host header (in normal operation, this is an anti-pattern for security reasons)
- Modify /var/www/html/wp-config.php to include define("WP_SITEURL", "//" . $_SERVER["HTTP_HOST"]); define("WP_HOME", "//" . $_SERVER["HTTP_HOST"]);
- The web server should not use/enforce HTTPS (certificates will expire and need maintenance). A secure connection will be provided through the access infrastructure. If the web server previously used TLS, its TLS certificates have to be removed. If this is not possible, use tools like https://crt.sh/?spkisha256= to check whether the same key is used in any other certificates (once a key is compromised, CAs MUST revoke all certificates using it within 24 hours and must not issue new ones, so all of these certificates will, at least eventually, get revoked)
- Optionally, delete/sanitize log files and private data, e.g., in /var/log/apache2, as they might contain passwords for external databases or passwords/HMAC keys used to authenticate users/cookies of the installation. If the original instance continues to be accessible, it might otherwise be jeopardized. This step is recommended for machines with public access.
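The wp-config.php step lends itself to automation. The following is a minimal sketch, not the tooling we use: the function name patch_wp_config is hypothetical, and it assumes a conventional wp-config.php that loads wp-settings.php near its end. The overrides are inserted before that line because constants have to be defined before WordPress is loaded.

```python
# The two overrides from the list above, written with plain ASCII quotes.
OVERRIDES = (
    'define("WP_SITEURL", "//" . $_SERVER["HTTP_HOST"]);\n'
    'define("WP_HOME", "//" . $_SERVER["HTTP_HOST"]);\n'
)


def patch_wp_config(php_source: str) -> str:
    """Return php_source with the Host-header overrides added at most once.

    Hypothetical helper: assumes the usual layout in which wp-config.php
    ends by requiring wp-settings.php.
    """
    if "WP_SITEURL" in php_source:
        # Already patched (or set some other way): leave it for manual review.
        return php_source
    lines = php_source.splitlines(keepends=True)
    for index, line in enumerate(lines):
        # Constants must be defined before WordPress loads wp-settings.php.
        if "wp-settings.php" in line:
            lines.insert(index, OVERRIDES)
            return "".join(lines)
    # Unusual layout without a loader line: append as a fallback.
    separator = "" if php_source.endswith("\n") else "\n"
    return php_source + separator + OVERRIDES


sample = (
    "<?php\n"
    "define('DB_NAME', 'wordpress');\n"
    "require_once ABSPATH . 'wp-settings.php';\n"
)
print(patch_wp_config(sample))
```

Applying the function to the contents of /var/www/html/wp-config.php and writing the result back would complete the step; keeping a copy of the original file is advisable.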
Next, a simple emulated network has to be created. In this case we define an IP network (10.0.0.0/24), enable the internal DNS / DHCP service and connect the previously added machine to this network. The network definition metadata is used by an EaaS network orchestration component to create instances (a network can be started multiple times) and to manage their life cycle (see below).
When we start an emulated network instance, we spawn a virtual Ethernet network consisting of a software-based Ethernet switch with an arbitrary number of connected virtual machines and network service components (e.g. a DHCP server). To exchange network traffic between nodes in this virtual network, Ethernet frames are encapsulated and sent through WebSocket connections over TLS, which are routed through the public Internet. This not only strictly separates virtual network traffic from traffic on the host environment but also eliminates the danger of attacks from archived environments on the host system, as well as abuse of the host system’s network resources to attack third-party entities on the Internet (or the host system’s private network environment). In the same way, this design shields the archived (and unmaintained) environments from attacks from the public Internet.
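To make the network definition more tangible, here is a small sketch of what such metadata could look like. The class and field names, the environment ID and the address plan are assumptions made for illustration; this is not the actual EaaS schema, and in a running network addresses are handed out by the DHCP service.

```python
import ipaddress


class NetworkDefinition:
    """Hypothetical stand-in for the network metadata, not the EaaS schema."""

    def __init__(self, cidr, environments, dhcp=True, dns=True):
        self.cidr = cidr
        self.environments = list(environments)  # IDs of archived machines
        self.dhcp = dhcp
        self.dns = dns

    def plan_addresses(self):
        """One plausible plan: first usable address for the service node
        (DNS/DHCP), then one address per attached machine."""
        hosts = iter(ipaddress.ip_network(self.cidr).hosts())
        plan = {"services": str(next(hosts))}
        for env_id in self.environments:
            plan[env_id] = str(next(hosts))
        return plan


network = NetworkDefinition("10.0.0.0/24", ["crc948-wordpress"])
print(network.plan_addresses())
# {'services': '10.0.0.1', 'crc948-wordpress': '10.0.0.2'}
```

Because the definition is plain data, the same definition can back several independent network instances.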
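The encapsulation idea can be illustrated with a few lines of code. The sketch below assumes that each Ethernet frame travels as the payload of one binary WebSocket message; the exact wire format used by EaaS is not described here, so treat this mapping as an assumption. The WebSocket framing itself follows RFC 6455, and TLS would sit underneath the WebSocket connection (not shown).

```python
import os
import struct


def ethernet_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
    """Build an Ethernet II frame (without preamble and checksum)."""
    return dst + src + struct.pack("!H", ethertype) + payload


def ws_encode(payload: bytes, mask_key: bytes) -> bytes:
    """Wrap payload in one final, masked, binary WebSocket frame (RFC 6455)."""
    header = bytearray([0x82])  # FIN=1, opcode 0x2 (binary)
    length = len(payload)
    if length < 126:
        header.append(0x80 | length)  # 0x80 is the MASK bit (client to server)
    elif length < 65536:
        header.append(0x80 | 126)
        header += struct.pack("!H", length)
    else:
        header.append(0x80 | 127)
        header += struct.pack("!Q", length)
    masked = bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
    return bytes(header) + mask_key + masked


def ws_decode(frame: bytes) -> bytes:
    """Inverse of ws_encode for a single masked frame."""
    length = frame[1] & 0x7F
    offset = 2
    if length == 126:
        (length,) = struct.unpack_from("!H", frame, offset)
        offset += 2
    elif length == 127:
        (length,) = struct.unpack_from("!Q", frame, offset)
        offset += 8
    mask_key = frame[offset:offset + 4]
    data = frame[offset + 4:offset + 4 + length]
    return bytes(b ^ mask_key[i % 4] for i, b in enumerate(data))


frame = ethernet_frame(
    dst=bytes.fromhex("ffffffffffff"),  # broadcast address
    src=bytes.fromhex("020000000001"),  # locally administered example MAC
    ethertype=0x0800,                   # IPv4
    payload=b"\x00" * 46,               # minimum Ethernet payload size
)
message = ws_encode(frame, os.urandom(4))
assert ws_decode(message) == frame
```

Since the virtual switch only ever sees what arrives through such connections, the emulated machines have no direct path to the host's own network interfaces.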
In this blog post we focus on simple access by users through contemporary Web browsers. Leaving an unmaintained, outdated machine connected to the Internet poses a latent and increasing security risk. To facilitate access we have to forward user requests to the machine and deliver content back to the user, without exposing the machine (or the whole emulated network) to the live Internet. This functionality is implemented by a software component called eaas-proxy (https://gitlab.com/emulation-as-a-service/eaas-proxy/), which acts as a gateway in the virtual network and forwards any (TCP/UDP) connections to a physical network on the host system. Traffic between the environments and the gateway is still encapsulated in the described way in a WebSocket connection over TLS. This makes it possible to place the gateway on a special machine separated from the EaaS backend.
To prepare a gateway, a domain name and a TLS certificate are required. The original domain name can be re-used. On the gateway machine, only a Docker installation (incl. docker-compose) is required. For setup, only an eaas-proxy configuration is necessary, which provides the information needed to connect to the EaaS cloud, e.g. the network ID of the chosen network.
If an HTTP request is received by the gateway, the eaas-proxy connects to the EaaS cloud and requests a connection to a given network. For each network different life-cycle options are available:
- The network is started by an admin and keeps running. In this case the user is connected to a running / shared network instance.
- On-demand network (shared). If the network is not running, it is started by the first user access request. Any subsequent user is connected to the same network. The network is shut down after a pre-defined idle time.
- On-demand network (private). A new network instance will be spawned for each user, such that every user has a pristine instance.
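The three options above can be summarized as a toy model. The class, the mode names ("persistent", "shared", "private") and the instance IDs are illustrative assumptions, not part of the EaaS API, and idle time is simplified to the time since the last connection request.

```python
import itertools


class NetworkLifecycle:
    """Toy model of the three life-cycle options; names are illustrative."""

    def __init__(self, mode: str, idle_timeout: float = 600.0):
        if mode not in ("persistent", "shared", "private"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.idle_timeout = idle_timeout  # seconds, only used in "shared" mode
        self.instance = None              # ID of the running shared instance
        self.last_access = 0.0
        self._counter = itertools.count(1)

    def _spawn(self) -> str:
        return f"net-{next(self._counter)}"

    def start(self) -> str:
        """Admin action: start the long-running instance."""
        self.instance = self._spawn()
        return self.instance

    def connect(self, now: float) -> str:
        """Return the ID of the instance that serves a request at time now."""
        if self.mode == "private":
            return self._spawn()  # pristine instance for every user
        if self.mode == "persistent":
            if self.instance is None:
                raise RuntimeError("network has not been started by an admin")
            return self.instance
        # "shared": start on first request, restart after the idle timeout.
        idle = now - self.last_access
        if self.instance is None or idle >= self.idle_timeout:
            self.instance = self._spawn()
        self.last_access = now
        return self.instance


shared = NetworkLifecycle("shared", idle_timeout=600)
print(shared.connect(0), shared.connect(100), shared.connect(1000))
# net-1 net-1 net-2
```

In the shared case the third request arrives after 900 seconds of inactivity, so the earlier instance is considered shut down and a fresh one is started.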
In any case the proxy is transparent to URLs (and parameters), such that deep-links remain working (given that the gateway is hosted under the web server’s original domain).
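Why deep links keep working can be shown with a short sketch. It assumes that the Host header sent by the browser reaches WordPress unchanged and that WordPress builds its links from the protocol-relative site URL configured earlier; the domain archive.example.org and both function names are placeholders.

```python
from urllib.parse import urljoin, urlsplit


def site_url(host_header: str) -> str:
    """What the wp-config.php override yields: a protocol-relative base URL."""
    return "//" + host_header


def link_seen_by_browser(page_url: str, host_header: str, path_and_query: str) -> str:
    """Resolve a link emitted by WordPress against the page the user views."""
    return urljoin(page_url, site_url(host_header) + path_and_query)


page = "https://archive.example.org/"
host = urlsplit(page).netloc  # the Host header the browser sends
print(link_seen_by_browser(page, host, "/?p=42"))
# https://archive.example.org/?p=42
```

The browser fills in the scheme of the page it is viewing, so links point back to the HTTPS gateway even though the archived web server itself only speaks plain HTTP.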
This work was made possible by a team of excellent developers and researchers
- Rafael Gieschke (University of Freiburg)
- Oleg Stobbe (OpenSLX)
- Thomas Liebtraut (@tommie_lie)
- Oleg Zharkov
and generous support by the eScience initiative of the state of Baden-Württemberg, the Andrew W. Mellon Foundation and the Alfred P. Sloan Foundation.