Installation

This section describes how to set up your own DCOR production instance.

Ubuntu and CKAN

Please use an Ubuntu 20.04 installation for any development or production usage. This makes it easier to give support and track down issues.
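Before installing anything, it can help to confirm which release the machine is actually running. The following is a minimal sketch, not part of DCOR; the function name ubuntu_release is made up for illustration, and it assumes the standard os-release file format, which is designed to be sourced by a shell.

```shell
# Print the Ubuntu release found in an os-release file
# (default: /etc/os-release). Prints nothing for other systems.
# The function name is hypothetical and not part of DCOR or CKAN.
ubuntu_release() {
    os_release_file="${1:-/etc/os-release}"
    [ -r "$os_release_file" ] || return 1
    (
        # Source the file in a subshell so that its variables
        # do not leak into the calling shell.
        . "$os_release_file"
        [ "${ID:-}" = "ubuntu" ] && printf '%s\n' "${VERSION_ID:-}"
    )
}

# Warn (but do not abort) if this is not Ubuntu 20.04.
if [ "$(ubuntu_release)" != "20.04" ]; then
    echo "Warning: this guide assumes Ubuntu 20.04" >&2
fi
```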

Before proceeding with the installation of CKAN, install the following packages:

apt update
# CKAN requirements
apt install -y libpq5 redis-server nginx supervisor
# needed for building packages that DCOR depends on (dclab)
apt install -y gcc python3-dev
# additional tools that you might find useful, but are not actually required
apt install -y aptitude net-tools mlocate screen needrestart python-is-python3

Install CKAN:

wget https://packaging.ckan.org/python-ckan_2.9-py3-focal_amd64.deb
dpkg -i python-ckan_2.9-py3-focal_amd64.deb

Note

Do NOT set up file uploads when following the instructions at https://docs.ckan.org. DCOR has its own dedicated directories for data uploads. The command dcor inspect will try to set up or fix that for you.

Follow the remainder of the installation guide at https://docs.ckan.org/en/2.9/maintaining/installing/install-from-package.html#install-and-configure-postgresql. Make sure to note down the PostgreSQL password which you will need in the initialization step.

Make sure to initialize the CKAN database with:

source /usr/lib/ckan/default/bin/activate
export CKAN_INI=/etc/ckan/default/ckan.ini
ckan db init

DCOR by default stores all data on /data. This makes it easier to control backups and separate the CKAN/DCOR software from the actual data. If you have not mounted a block device or a network share on /data, please create this directory with

mkdir /data

Scratch Space

It is important that you have at least 100 GB of scratch space available on your system, so that the ckanext-dc_serve extension can create temporary condensed datasets before uploading them to S3. By default, the cache is located at /data/tmp/ckanext-dc_serve; you can change the location via the configuration option ckanext.dc_serve.tmp_dir.
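One way to check the available scratch space is to ask df for the file system that holds the cache directory. This is a rough sketch; the helper names free_gib and has_scratch_space are made up for illustration, and the result is reported in GiB (1024-based), which is slightly stricter than the decimal GB mentioned above.

```shell
# Print the free space (in whole GiB) of the file system that holds DIR.
# Hypothetical helper, not part of DCOR.
free_gib() {
    # -P forces the portable one-line-per-file-system format and
    # -k reports 1024-byte blocks; column 4 of the second line is
    # the available space.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed if DIR has at least MIN_GIB gibibytes available.
has_scratch_space() {
    [ "$(free_gib "$1")" -ge "$2" ]
}
```

For example, has_scratch_space /data 100 || echo "not enough scratch space" >&2 prints a warning if less than 100 GiB is free on the file system that holds /data.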

Object Storage

You should use a cloud storage provider that you trust instead of setting this up yourself. If you know what you are doing (e.g. for testing) and would like to set up S3-compatible object storage yourself, you can use MinIO. On an Ubuntu/Debian machine, install the latest MinIO server like so:

wget https://dl.min.io/server/minio/release/linux-amd64/minio_RELEASEDATE.0.0_amd64.deb
dpkg -i minio_RELEASEDATE.0.0_amd64.deb

This also installed the minio systemd service which we want to use. First, make sure that the user defined in the service:

systemctl show minio | grep User=

actually exists. You can add a system user via:

useradd -r minio-user

Then, create a file /etc/default/minio with the following content:

# Volume to be used for MinIO server (make sure minio-user has access).
MINIO_VOLUMES="/srv/minio"
# Use if you want to run MinIO on a custom port (console is the web interface).
MINIO_OPTS="--address :9000 --console-address :9001"
# Root user for the server.
MINIO_ROOT_USER=minio-root-user-account-name
# Root secret for the server.
MINIO_ROOT_PASSWORD=secret-password-for-minio-root-user
# set this for MinIO to reload entries with 'mc admin service restart'
MINIO_CONFIG_ENV_FILE=/etc/default/minio
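A typo in this file usually only shows up when the service fails to start, so a quick check beforehand can save time. The sketch below lists any of the variables from the example above that are not assigned in the file; the function name missing_minio_vars is hypothetical, and the choice of which variables count as required is an assumption based on the example.

```shell
# Print every expected variable that is not assigned in ENV_FILE.
# Hypothetical helper; the list of names is taken from the example
# configuration above.
missing_minio_vars() {
    env_file="$1"
    for name in MINIO_VOLUMES MINIO_ROOT_USER MINIO_ROOT_PASSWORD; do
        # Match "NAME=" at the start of a line, so that commented-out
        # entries are not counted.
        grep -q "^${name}=" "$env_file" || printf '%s\n' "$name"
    done
}
```

Running missing_minio_vars /etc/default/minio should print nothing if all three variables are set.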

Now you can enable and start the minio service:

systemctl enable minio
systemctl start minio

Create a “dcor” user (http://minio.server.name:9001/identity/users/add-user) with readwrite permissions and create an access key (via “Service Accounts”) which you can then copy-paste to the ckan.ini configuration:

dcor_object_store.access_key_id = access-key-id
dcor_object_store.secret_access_key = secret-access-key

DCOR Extensions

Installation

Whenever you need to run the ckan/dcor commands or have to update Python packages, you have to first activate the CKAN virtual environment.

source /usr/lib/ckan/default/bin/activate

With the active environment, first install some basic requirements.

pip install --upgrade pip
pip install wheel

Then, install DCOR, which will install all extensions including their requirements.

pip install dcor_control

Background workers

DCOR comes with three job queues (dcor-short, dcor-normal, and dcor-long) for data processing after a resource is added to a dataset. The CKAN instance populates those queues, and CKAN workers (e.g. via ckan jobs worker dcor-short) fetch and run the jobs in the background. The workers are run, like CKAN itself, via supervisor and are defined via individual configuration files in /etc/supervisor/conf.d. When you run dcor inspect (see next section), these files will be created with your approval.
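To see which of the three queues already have a worker defined, you can search the supervisor configuration directory. This is only a sketch: the function name configured_worker_queues is made up, and it assumes that each worker configuration file contains the worker command in the form "jobs worker QUEUE", as in the example command above. The actual files written by dcor inspect may look different.

```shell
# Print the DCOR queues for which a worker command appears in CONF_DIR
# (default: /etc/supervisor/conf.d). Hypothetical helper, not part of DCOR.
configured_worker_queues() {
    conf_dir="${1:-/etc/supervisor/conf.d}"
    for queue in dcor-short dcor-normal dcor-long; do
        # -r searches all files below the directory, -q suppresses
        # output, and -s silences errors about unreadable files.
        grep -rqs -e "jobs worker ${queue}" "$conf_dir" \
            && printf '%s\n' "$queue"
    done
    return 0
}
```

Whether the workers are actually running is a separate question, which supervisorctl status answers.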

Initialization

The dcor_control package installed the entry point dcor which allows you to manage your DCOR installation. Just type dcor --help to find out what you can do with it.

For the initial setup, you have to run the inspect command. You can also run this command on a routine basis to make sure that your DCOR installation is set up correctly.

source /usr/lib/ckan/default/bin/activate
dcor inspect

Testing

If you are setting up a development instance, then you might want to be able to run the DCOR tests. This step is not required if you are setting up an instance for production.

For testing purposes, you can use the DCOR vagrant box. It contains a full install of DCOR (including SOLR and object storage) and is updated regularly.

SSL

You have two options. If your server is reachable through the internet, you should use Let’s Encrypt (or a certificate from your organization) to set up SSL. If you are hosting your server on the intranet (clinics scenario), then you should create your own certificate and distribute it to your users.

Creating an SSL certificate (Intranet only)

Start by creating your certificate (valid for 10 years):

openssl req -newkey rsa:4096 -x509 -sha256 -days 3650 -nodes -out fqdn.cert -keyout fqdn.key

where fqdn is your fully qualified domain name (FQDN) which maps to the server’s IP address. Make sure to enter it in the dialog (otherwise use the IP address). This makes connection tests easier (e.g. if you only have SSH access to the machine and need to use SSH tunneling to connect to the CKAN instance by mapping its FQDN in the /etc/hosts file to 127.0.0.1 on the testing client).
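The command above asks for the certificate fields interactively. If you prefer a scripted variant, the sketch below passes the FQDN on the command line and also adds it as a subject alternative name, which many current clients check instead of the common name. The function name make_self_signed_cert and the domain dcor.example.org are placeholders, and the -addext option assumes OpenSSL 1.1.1 or newer (the version shipped with Ubuntu 20.04).

```shell
# Create a self-signed certificate and key for FQDN in OUT_DIR.
# BITS is optional and defaults to the 4096-bit RSA key used above.
# Hypothetical helper, not part of DCOR.
make_self_signed_cert() {
    fqdn="$1"
    out_dir="$2"
    bits="${3:-4096}"
    # -subj sets the common name without prompting; -addext adds the
    # FQDN as a subject alternative name (OpenSSL >= 1.1.1).
    openssl req -newkey "rsa:${bits}" -x509 -sha256 -days 3650 -nodes \
        -subj "/CN=${fqdn}" \
        -addext "subjectAltName=DNS:${fqdn}" \
        -out "${out_dir}/${fqdn}.cert" -keyout "${out_dir}/${fqdn}.key"
}
```

For example, make_self_signed_cert dcor.example.org . writes dcor.example.org.cert and dcor.example.org.key to the current directory. You can inspect the result with openssl x509 -in dcor.example.org.cert -noout -subject -enddate.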

You may want to create an encrypted access token for your users.

Now proceed with the SSL configuration below, replacing “dcor.mpl.mpg.de” with your FQDN.

Configuring nginx (SSL and uWSGI proxy)

Encrypting data transfer should be a priority for you. If your server is available online, you can use e.g. Let’s Encrypt to obtain an SSL certificate. If you are hosting CKAN/DCOR internally in your organization, you will have to create a self-signed certificate and distribute the public key to the client machines manually.

First copy the certificate to /etc/ssl/certs and the private key to /etc/ssl/private:

cp dcor.mpl.mpg.de.cert /etc/ssl/certs/
cp dcor.mpl.mpg.de.key /etc/ssl/private/
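If nginx later refuses to start because of a key mismatch, it is usually because the certificate and key files come from different runs. A simple way to check this in advance is to compare the public key embedded in the certificate with the public key derived from the private key. The function name cert_matches_key below is made up for illustration.

```shell
# Succeed if the certificate CERT was issued for the private key KEY.
# Hypothetical helper, not part of DCOR.
cert_matches_key() {
    # Extract the public key from the certificate ...
    cert_pub=$(openssl x509 -in "$1" -noout -pubkey) || return 1
    # ... and derive the public key from the private key.
    key_pub=$(openssl pkey -in "$2" -pubout) || return 1
    [ "$cert_pub" = "$key_pub" ]
}
```

For example, cert_matches_key /etc/ssl/certs/dcor.mpl.mpg.de.cert /etc/ssl/private/dcor.mpl.mpg.de.key && echo "certificate and key match".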

Note

If dclab, Shape-Out, or DCOR-Aid cannot connect to your CKAN instance, it might be because the certificate in /etc/ssl/certs/ does not contain the full certificate chain. In this case, just download the entire certificate chain using Firefox (right-click on the shield symbol and look at the certificate - there should be a download option for the chained certificate somewhere) and replace the content of the .cert file with that.

Then, edit /etc/nginx/sites-enabled/ckan and replace its content with the following (change dcor.mpl.mpg.de to whatever domain you use):

Now, we need to modify the CKAN uWSGI file at /etc/ckan/default/ckan-uwsgi.ini:

[uwsgi]

; Since we are behind a webserver (proxy), we use the socket variant.
; We use HTTP1.1 (keep-alives)
http11-socket          =  127.0.0.1:8080
uid                    =  www-data
gid                    =  www-data
wsgi-file              =  /etc/ckan/default/wsgi.py
virtualenv             =  /usr/lib/ckan/default
module                 =  wsgi:application
master                 =  true
pidfile                =  /tmp/%n.pid
; 10 hours for very long-lasting uploads
harakiri               =  36000
harakiri-verbose       =  true
; Restart workers after this many requests
max-requests           =  1000
; How long to wait before forcefully killing workers
worker-reload-mercy    =  30
; Delete sockets during shutdown
vacuum                 =  true
callable               =  application
; Disable threads only if not using threads and performance is critical
enable-threads         =  false
; Do not use multiple interpreters (since we only have one service)
single-interpreter     =  true
; Shutdown when receiving SIGTERM
die-on-term            =  true
; Fail to start if application cannot load
need-app               =  true
; Make sure all options in this file exist.
strict                 =  true

; Unfortunately, buffering the upload with nginx and then sending the upload
; to uWSGI does not work for some reason (uWSGI gets stuck when crunching the
; data). The intuitive choice would be to set this here to "1", but a look
; at the sources reveals that this should be set to the buffer size (2MB).
post-buffering         =  2097152
post-buffering-bufsize =  2097152

; Reduce or increase this number to limit POST requests. By default,
; the size of POST requests is unlimited.
limit-post             =  100000000000

; Set the number of workers to something > 1, otherwise
; only one client can connect via nginx to uWSGI at a time.
; See https://github.com/ckan/ckan/issues/5933
; In addition, use two threads per worker.
workers                =  4
; Use lazy apps to avoid the `__Global` error.
; See https://github.com/ckan/ckan/issues/5933#issuecomment-809114593
lazy-apps              =  true
; If we don't want to cache the files that users want to download
; (i.e. set `proxy_max_temp_file_size 0;` in nginx), then we have to
; set socket-timeout to a very large number (e.g. 7200).
; We may also want to increase this number if the storage location for
; resources has a low write speed (e.g. NFS). From the uWSGI sources,
; it looks like the default value is 4s.
socket-timeout         =  500
; (Note that we are serving CKAN via http11-socket behind nginx).
; Otherwise, downloads will fail with `uwsgi_response_sendfile_do() TIMEOUT !!!`,
; because the client cannot download the file from nginx as fast as
; uWSGI can send the file to nginx. But in this case, we can really only
; have as many connections as we have workers.
; On the other hand, if we set `proxy_max_temp_file_size 100000m;`
; in nginx, then all downloads will be cached by nginx. And nginx will
; handle all users. The purpose of setting `workers` to `4` in uWSGI
; is now only so that CKAN does not block for as long as it takes the
; system to copy the download from uwsgi to nginx's `proxy_temp_path`.
; In other words, CKAN will only be unresponsive if 4 downloads are
; started at the same time for as long as it takes the smallest download
; to be copied over the http socket from uWSGI to nginx.

; Custom logging
; disable logging in general (files easily get above 50MB)
disable-logging        =  true
; enable logging for a few specific cases
log-4xx                =  true
log-5xx                =  true
log-ioerror            =  true
; set the log format to match that of CKAN
log-date               =  %%Y-%%m-%%d %%H:%%M:%%S
logformat-strftime     =  true
logformat              =  %(ftime) uWSGI %(addr) (%(proto) %(status)) %(method) %(uri) => %(size) bytes in %(msecs) msecs to %(uagent)
threaded-logger        =  true

; https://stumbles.id.au/how-to-fix-uwsgi-oserror-write-error.html
disable-write-exception = true
ignore-write-errors     = true
ignore-sigpipe          = true

Unattended upgrades

Unattended upgrades offer a simple way of keeping the server up-to-date and patched against security vulnerabilities.

apt-get install unattended-upgrades apt-listchanges

Edit the file /etc/apt/apt.conf.d/50unattended-upgrades to your liking. The default settings should already work, but you might want to set up email notifications and automated reboots.
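As an alternative to editing the packaged file directly, the settings for email notifications and automated reboots can be placed in a separate file in the same directory, which survives package updates. The sketch below writes such a file; the function name write_upgrade_overrides, the target file name, the email address, and the reboot time are all placeholders you would adapt.

```shell
# Write local unattended-upgrades overrides to TARGET.
# MAIL_TO receives the notifications; REBOOT_TIME is given as HH:MM.
# Hypothetical helper, not part of DCOR.
write_upgrade_overrides() {
    target="$1"
    mail_to="$2"
    reboot_time="$3"
    cat > "$target" <<EOF
// Local overrides for unattended-upgrades
Unattended-Upgrade::Mail "${mail_to}";
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "${reboot_time}";
EOF
}
```

For example, write_upgrade_overrides /etc/apt/apt.conf.d/52unattended-upgrades-local admin@example.org 02:00. The file name only needs to sort after 50unattended-upgrades so that its values take precedence.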

Note

If you have access to an internal email server and wish to get email notifications from your system, install

apt install bsd-mailx ssmtp

and edit /etc/ssmtp/ssmtp.conf:

Note that this is different from CKAN email notifications.

For unattended upgrades to work properly, make sure that needrestart automatically restarts the affected services whenever updates are installed. To do so, edit the file /etc/needrestart/needrestart.conf and set:

$nrconf{restart} = 'a';

Supervisor

Sometimes the ckan-uwsgi start job might take a little longer and the default (1s) is not long enough so supervisor becomes impatient. Edit the file /etc/supervisor/conf.d/ckan-uwsgi.conf and add startsecs=60.
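If you want to script this change, the following sketch sets the value idempotently. The function name set_startsecs is made up, and the append branch assumes that the file contains a single [program:...] section and ends with a newline, so that a line added at the end belongs to that section. It relies on GNU sed for in-place editing.

```shell
# Set startsecs to SECONDS in the supervisor config file CONF_FILE.
# Hypothetical helper, not part of DCOR.
set_startsecs() {
    conf_file="$1"
    seconds="$2"
    if grep -q '^startsecs[[:space:]]*=' "$conf_file"; then
        # An entry exists already: replace its value in place (GNU sed).
        sed -i "s/^startsecs[[:space:]]*=.*/startsecs=${seconds}/" "$conf_file"
    else
        # No entry yet: append one. This assumes a single
        # [program:...] section in the file.
        printf 'startsecs=%s\n' "$seconds" >> "$conf_file"
    fi
}
```

After running set_startsecs /etc/supervisor/conf.d/ckan-uwsgi.conf 60, supervisor has to pick up the change, for example via supervisorctl reread followed by supervisorctl update.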

Systemd

It is important that all services required for CKAN to run are started before supervisor is started. This can be achieved by running systemctl edit supervisor and pasting the following config:

[Unit]
Requires=solr.service
After=solr.service
Requires=redis.service
After=redis.service
Requires=postgresql.service
After=postgresql.service

[Service]
Restart=always
RestartSec=20

If solr is slow when starting up, add the following to its unit file via systemctl edit solr:

[Service]
ExecStartPost=/bin/sleep 250
Restart=on-failure
RestartSec=10s

Afterwards run:

systemctl daemon-reload