Manual Installation
These instructions walk you through a manual installation of DCOR from scratch. As this process is rather tedious and time-consuming, it should only be used as a reference guide for the Ansible-based installation.
Ubuntu and CKAN
Please use an Ubuntu 24.04 installation for any development or production usage. This makes it easier to give support and track down issues.
Before proceeding with the installation of CKAN, install the following packages:
apt update
# CKAN requirements
apt install -y libpq5 redis-server nginx supervisor
# needed for building packages that DCOR depends on (dclab)
apt install -y gcc python3-dev
# additional tools that you might find useful, but are not actually required
apt install -y aptitude net-tools mlocate screen needrestart python-is-python3
Install CKAN:
wget https://packaging.ckan.org/python-ckan_{LATEST-VERSION}-py3-{UBUNTU_RELEASE}_amd64.deb
dpkg -i python-ckan_..._amd64.deb
Note
Do NOT set up file uploads when following the instructions
at https://docs.ckan.org. DCOR has its own dedicated directories
for data uploads. The command dcor inspect will try to
set up or fix that for you.
Follow the remainder of the installation guide at https://docs.ckan.org/en/latest/maintaining/installing/install-from-package.html#install-and-configure-postgresql. Make sure to note down the PostgreSQL password which you will need in the initialization step.
Make sure to initialize the CKAN database with
source /usr/lib/ckan/default/bin/activate
export CKAN_INI=/etc/ckan/default/ckan.ini
ckan db init
DCOR by default stores all data on /data. This makes it easier to
control backups and separate the CKAN/DCOR software from the actual data.
If you have not mounted a block device or a network share on /data,
please create this directory with
mkdir /data
Scratch Space
It is important that you have at least 100 GB of scratch space available
on your system, so that DCOR extensions can create temporary files.
By default, the cache is located at /cache. The relevant configuration
options are ckanext.dc_serve.tmp_dir and ckanext.dcor_depot.tmp_dir.
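The free-space requirement can be checked up front with a short script. The following is only a sketch: the helper name has_free_space and the SCRATCH_DIR variable are hypothetical and not part of DCOR, and the threshold is computed in GiB, which is slightly stricter than 100 GB.

```shell
#!/bin/sh
# Hypothetical helper (not part of DCOR): succeed if directory $1
# has at least $2 GiB available.
has_free_space() {
    dir="$1"
    min_gib="$2"
    [ -d "$dir" ] || return 2
    # `df -Pk` prints one POSIX-format line per file system with sizes in
    # 1024-byte blocks; the fourth column is the available space.
    avail_kib=$(df -Pk "$dir" | awk 'NR == 2 {print $4}')
    [ "$avail_kib" -ge $((min_gib * 1024 * 1024)) ]
}

# SCRATCH_DIR is an assumed variable name; /cache is the default location.
scratch="${SCRATCH_DIR:-/cache}"
if has_free_space "$scratch" 100; then
    echo "OK: $scratch has at least 100 GiB available"
else
    echo "WARNING: $scratch is missing or has less than 100 GiB available"
fi
```

If you point the two tmp_dir options at a different location, run the same check against that directory.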
Object Storage
You should use a cloud storage provider that you trust instead of setting this up yourself. If you know what you are doing (e.g. for testing) and would like to set up S3-compatible object storage yourself, you can use MinIO. On an Ubuntu/Debian machine, install the latest MinIO server like so:
wget https://dl.min.io/server/minio/release/linux-amd64/minio.deb
dpkg -i minio.deb
This also installs the minio systemd service which we want to use.
First, make sure that the user defined in the service:
systemctl show minio | grep User=
actually exists. You can add a system user via:
useradd -r minio-user
Then, create a file /etc/default/minio with the following content:
# Volume to be used for MinIO server (make sure minio-user has access).
MINIO_VOLUMES="/srv/minio"
# Use if you want to run MinIO on a custom port (console is the web interface).
MINIO_OPTS="--address :9000 --console-address :9001"
# Root user for the server.
MINIO_ROOT_USER=minioadmin
# Root secret for the server.
MINIO_ROOT_PASSWORD=secret-password-for-minio-root-user
# set this for MinIO to reload entries with 'mc admin service restart'
MINIO_CONFIG_ENV_FILE=/etc/default/minio
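Since the file uses shell-compatible KEY=value syntax, it can be sanity-checked before the service is started. The sketch below assumes exactly the variable names from the example above; the helper check_minio_env is hypothetical and not part of MinIO.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the environment file $1 defines the
# variables used in the example above.
check_minio_env() (
    # The function body runs in a subshell, so the sourced variables
    # do not leak into the calling shell.
    unset MINIO_VOLUMES MINIO_ROOT_USER MINIO_ROOT_PASSWORD
    . "$1" || exit 2
    [ -n "${MINIO_VOLUMES:-}" ] &&
        [ -n "${MINIO_ROOT_USER:-}" ] &&
        [ -n "${MINIO_ROOT_PASSWORD:-}" ]
)

env_file=/etc/default/minio
if [ -r "$env_file" ] && check_minio_env "$env_file"; then
    echo "$env_file looks complete"
else
    echo "$env_file is missing, unreadable, or incomplete"
fi
```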
Now you can enable and start the minio service:
systemctl enable minio
systemctl start minio
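To confirm that the server actually came up, you can query MinIO's liveness endpoint on the API port. This is a minimal sketch that assumes curl is installed and that the server listens on 127.0.0.1:9000 as configured above; the helper name minio_is_live is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a MinIO server answers its liveness probe.
# $1 is the base URL of the S3 API (port 9000 in the example above).
minio_is_live() {
    curl --silent --fail --max-time 5 --output /dev/null "$1/minio/health/live"
}

if minio_is_live "http://127.0.0.1:9000"; then
    echo "MinIO is up"
else
    echo "MinIO is not reachable; inspect the logs with: journalctl -u minio"
fi
```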
Create a “dcor” user (http://minio.server.name:9001/identity/users/add-user)
with readwrite permissions and create an access key (via “Service Accounts”)
which you can then copy-paste to the ckan.ini configuration:
dcor_object_store.access_key_id = access-key-id
dcor_object_store.secret_access_key = secret-access-key
DCOR Extensions
Installation
Whenever you need to run the ckan/dcor commands or have to update
Python packages, you have to first activate the CKAN virtual environment.
source /usr/lib/ckan/default/bin/activate
With the active environment, first install some basic requirements.
pip install --upgrade pip
pip install wheel
Then, install DCOR, which will install all extensions including their requirements.
pip install dcor_control
Background workers
DCOR comes with three job queues, dcor-short, dcor-normal, and dcor-long,
for data processing after a resource is added to a dataset. The CKAN instance
populates those queues, and CKAN workers (started e.g. via ckan jobs worker dcor-short)
fetch and run the jobs in the background. The workers are run, like
CKAN itself, via Supervisord and are defined via individual configuration
files in /etc/supervisor/conf.d. When you run dcor inspect (see next
section), these files will be created with your approval.
Initialization
The dcor_control package installed the entry point dcor which
allows you to manage your DCOR installation. Just type dcor --help
to find out what you can do with it.
For the initial setup, you have to run the inspect command. You
can also run this command on a routine basis to make sure that your DCOR
installation is set up correctly.
source /usr/lib/ckan/default/bin/activate
dcor inspect
Testing
If you are setting up a development instance, then you might want to be able
to run the DCOR tests. This step is not required if you are setting up an
instance for production. The following command will install the latest DCOR
CKAN extensions in editable mode. After installing the packages
from the tests/requirements.txt files, you can use pytest to test
the extensions.
source /usr/lib/ckan/default/bin/activate
dcor develop
find /dcor-repos -name requirements.txt -exec pip install -r {} \;
find /dcor-repos -name tests -exec pytest {} \;
SSL
You have two options. If your server is reachable through the internet, you should use Let’s Encrypt (or a certificate from your organization) to set up SSL. If you are hosting your server on the intranet (clinics scenario), then you should create your own certificate and distribute it to your users.
Creating an SSL certificate (Intranet only)
Start by creating your certificate (valid for 10 years):
openssl req -newkey rsa:4096 -x509 -config openssl-csr-config.txt -days 3650 -nodes -out dcor-example-com.crt -keyout dcor-example-com.key
using this openssl-csr-config.txt file:
[ req ]
default_md = sha256
prompt = no
x509_extensions = req_ext
distinguished_name = req_distinguished_name
[ req_distinguished_name ]
commonName = dcor.example.com
countryName = DE
stateOrProvinceName = Bavaria
localityName = Erlangen
organizationName = DCOR-med
[ req_ext ]
subjectAltName = @alt_names
[ alt_names ]
IP.0 = 10.5.4.3
DNS.0 = dcor.example.com
email.0 = email@dcor.example.com
where dcor.example.com is your fully qualified domain name (FQDN) which maps to the server’s IP address. Using the FQDN makes connection tests easier (e.g. if you only have SSH access to the machine and need to use SSH tunneling to connect to the CKAN instance by mapping its FQDN in the /etc/hosts file to 127.0.0.1 on the testing client).
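Before distributing the certificate, it can be useful to inspect it and to confirm that it belongs to the generated key. The sketch below assumes the file names from the openssl command above; the helper cert_matches_key is hypothetical and simply compares the public keys that openssl extracts from both files.

```shell
#!/bin/sh
# Hypothetical helper: succeed if certificate $1 and private key $2
# contain the same public key.
cert_matches_key() {
    crt_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 2
    key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 2
    [ "$crt_pub" = "$key_pub" ]
}

crt=dcor-example-com.crt
key=dcor-example-com.key
if [ -f "$crt" ] && [ -f "$key" ]; then
    # Print subject, validity period, and subject alternative names.
    openssl x509 -in "$crt" -noout -subject -dates
    openssl x509 -in "$crt" -noout -text | grep -A1 "Subject Alternative Name"
    if cert_matches_key "$crt" "$key"; then
        echo "certificate and key match"
    else
        echo "certificate and key do NOT match"
    fi
fi
```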
You may want to create an encrypted access token for your users.
Now proceed with the SSL configuration below, replacing “dcor.mpl.mpg.de” with your FQDN.
Configuring nginx (SSL and uWSGI proxy)
Encrypting data transfer should be a priority for you. If your server is available online, you can use e.g. Let’s Encrypt to obtain an SSL certificate. If you are hosting CKAN/DCOR internally in your organization, you will have to create a self-signed certificate and distribute the public key to the client machines manually.
First copy the certificate to /etc/ssl/certs and the private key to /etc/ssl/private:
cp dcor.mpl.mpg.de.cert /etc/ssl/certs/
cp dcor.mpl.mpg.de.key /etc/ssl/private/
Note
If dclab, DCscope, or DCOR-Aid cannot connect to your CKAN instance,
it might be because the certificate in /etc/ssl/certs/ does not
contain the full certificate chain. In this case, download the
entire certificate chain using Firefox (right-click on the shield
symbol and look at the certificate; there should be a download
option for the chained certificate somewhere) and replace the content
of the .cert file with that.
Then, edit /etc/nginx/sites-enabled/ckan and replace its content with
the following (change dcor.mpl.mpg.de to whatever domain you use):
# Note that nginx only caches GET and HEAD (not POST) by default:
# http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_methods
proxy_cache_path /tmp/nginx_cache levels=1:2 keys_zone=ANONYM:30m max_size=250m;
proxy_cache_path /tmp/nginx_cache_static levels=1:2 keys_zone=STATIC:30m max_size=250m;
server {
# Use this if you don't have enough space on your root partition
# for caching large uploads (rw-access to www-data).
# client_body_temp_path /data/tmp/nginx/client_body 1 2;
listen 443 ssl http2;
listen [::]:443 ssl http2;
server_name dcor.mpl.mpg.de;
ssl_certificate "/etc/ssl/certs/dcor.mpl.mpg.de.cert";
ssl_certificate_key "/etc/ssl/private/dcor.mpl.mpg.de.key";
# Uncomment to avoid robots (only on development machines)
#location = /robots.txt { return 200 "User-agent: *\nDisallow: /\n"; }
# Block outdated versions of DCOR-Aid.
if ($http_user_agent ~* "^DCOR-Aid/(0\.[0-9]\.|0\.1[0-3]\.|0\.14\.[0-1])") {
return 400 "Client $http_user_agent outdated.";
}
# Avoid spamming the logs by these frequent invalid requests
location ~ ^(/user/reset|/userportal|/user\.action) {
deny all;
}
# file extensions that should not be used on a CKAN instance
location ~* \.(aspx|gif|html?|php\d?|pl|rar|sql|tar|tar.gz|zip)$ {
return 404;
}
# static/fully cached locations
location ~ ^/(api/i18n|base/|favicon.ico$|fonts|images/.*\.(png|jpg)$|webassets) {
proxy_pass http://127.0.0.1:8080;
proxy_set_header Host $host;
proxy_buffering on;
proxy_cache_key $host$scheme$proxy_host$request_uri;
# Use the static cache
proxy_cache STATIC;
proxy_cache_valid 200 1d;
proxy_cache_use_stale error timeout invalid_header updating http_500 http_502 http_503 http_504;
# Ignore anything CKAN or uWSGI say they are caching
proxy_ignore_headers Expires;
proxy_ignore_headers X-Accel-Expires;
proxy_ignore_headers Cache-Control;
proxy_ignore_headers Set-Cookie;
proxy_hide_header Expires;
proxy_hide_header X-Accel-Expires;
proxy_hide_header Cache-Control;
proxy_hide_header Pragma;
proxy_hide_header Set-Cookie;
# when a client closes the connection then keep the channel to uwsgi open.
# Otherwise uwsgi throws an IOError and possibly segfaults.
proxy_ignore_client_abort on;
}
# GET allow-list for ckan-related directories
location ~ ^/($|about$|contact$|dashboard|imprint$|privacy$|revision) {
limit_except GET {
deny all;
}
proxy_pass http://127.0.0.1:8080;
proxy_set_header Host $host;
# Caching based on cookies does not work since CKAN implemented
# the beaker `ckan` session cookie.
# proxy_cache ANONYM;
# proxy_cache_bypass $cookie_remember_token;
# proxy_no_cache $cookie_remember_token;
# proxy_cache_valid 30m;
# proxy_cache_key $host$scheme$proxy_host$request_uri;
#
# when a client closes the connection then keep the channel to uwsgi open.
# Otherwise uwsgi throws an IOError and possibly segfaults.
proxy_ignore_client_abort on;
}
# package_revise after upload
location = /api/3/action/package_revise {
proxy_pass http://127.0.0.1:8080;
proxy_set_header Host $host;
# Determines whether the connection with a proxied server should be closed
# when a client closes the connection without waiting for a response.
# When a client closes the connection, then keep the channel to uwsgi open.
# Otherwise uwsgi throws an IOError and possibly segfaults.
proxy_ignore_client_abort on;
# Package_revise can take up to 200s to complete for datasets
# with large uploads.
proxy_connect_timeout 500s;
proxy_read_timeout 500s;
proxy_send_timeout 500s;
keepalive_timeout 500s;
# Remove the Connection header if the client sends it,
# it could be "close" to close a keepalive connection
proxy_set_header Connection "";
}
# GET/POST allow-list for ckan-related directories
location ~ ^/(api/2/util/|api/3/|ckan-admin|dataset/groups|group|login_generic|organization|uploads/(admin|group|user)/.+\.(png|jpg|jpeg)$|user) {
proxy_pass http://127.0.0.1:8080;
proxy_set_header Host $host;
# Caching based on cookies does not work since CKAN implemented
# the beaker `ckan` session cookie.
# proxy_cache ANONYM;
# proxy_cache_bypass $cookie_remember_token;
# proxy_no_cache $cookie_remember_token;
# proxy_cache_valid 30m;
# proxy_cache_key $host$scheme$proxy_host$request_uri;
#
# When a client closes the connection, then keep the channel to uwsgi open.
# Otherwise uwsgi throws an IOError and possibly segfaults.
proxy_ignore_client_abort on;
}
# GET allow-list for ckan-related directories (separate for dataset, since
# we have to allow POST in `dataset/groups` (adding a dataset to a group)
# with higher priority above).
location ~ ^/(dataset) {
limit_except GET {
deny all;
}
proxy_pass http://127.0.0.1:8080;
proxy_set_header Host $host;
# Caching based on cookies does not work since CKAN implemented
# the beaker `ckan` session cookie.
# proxy_cache ANONYM;
# proxy_cache_bypass $cookie_remember_token;
# proxy_no_cache $cookie_remember_token;
# proxy_cache_valid 30m;
# proxy_cache_key $host$scheme$proxy_host$request_uri;
#
# when a client closes the connection then keep the channel to uwsgi open.
# Otherwise uwsgi throws an IOError and possibly segfaults.
proxy_ignore_client_abort on;
}
# Redirect /UUID shortcut to /dataset/UUID
location ~ "^/([a-f0-9\-]{36})$" {
return 301 /dataset/$1;
}
}
# Redirect all traffic to SSL
server {
listen 80;
listen [::]:80;
server_name dcor.mpl.mpg.de;
return 301 https://$host$request_uri;
}
# Optional: Reject traffic that is not directed at `dcor.mpl.mpg.de:80`
server {
listen 80 default_server;
listen [::]:80 default_server;
server_name _;
return 444;
}
# Optional: Reject traffic that is not directed at `dcor.mpl.mpg.de:443`
server {
listen 443 default_server;
listen [::]:443 default_server;
server_name _;
return 444;
ssl_certificate "/etc/ssl/certs/ssl-cert-snakeoil.pem";
ssl_certificate_key "/etc/ssl/private/ssl-cert-snakeoil.key";
}
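The user-agent rule that blocks outdated DCOR-Aid versions is easy to get wrong when you adapt it. The following sketch tries the pattern from the configuration above against a few sample user agents. The helper is_blocked_agent is hypothetical, and grep -E -i only approximates nginx's case-insensitive ~* operator, although this particular pattern means the same in both regex dialects.

```shell
#!/bin/sh
# Pattern copied from the `if ($http_user_agent ~* ...)` rule above.
pattern='^DCOR-Aid/(0\.[0-9]\.|0\.1[0-3]\.|0\.14\.[0-1])'

# Hypothetical helper: succeed if user agent $1 would be rejected.
is_blocked_agent() {
    printf '%s\n' "$1" | grep -E -i -q -- "$pattern"
}

for agent in "DCOR-Aid/0.9.3" "DCOR-Aid/0.13.2" "DCOR-Aid/0.14.1" \
             "DCOR-Aid/0.14.2" "DCOR-Aid/0.15.0" "DCOR-Aid/1.0.0"; do
    if is_blocked_agent "$agent"; then
        echo "$agent: rejected (400)"
    else
        echo "$agent: allowed"
    fi
done
```

The first three sample agents are rejected and the last three are allowed. Because the pattern is not anchored at its end, a version string such as 0.14.10 would also be rejected.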
Now, we need to modify the CKAN uWSGI file at
/etc/ckan/default/ckan-uwsgi.ini:
[uwsgi]
; Since we are behind a webserver (proxy), we use the socket variant.
; We use HTTP1.1 (keep-alives)
http11-socket = 127.0.0.1:8080
uid = www-data
gid = www-data
wsgi-file = /etc/ckan/default/wsgi.py
virtualenv = /usr/lib/ckan/default
module = wsgi:application
master = true
pidfile = /tmp/%n.pid
; 10 hours for very long-lasting uploads
harakiri = 36000
harakiri-verbose = true
; Restart workers after this many requests
max-requests = 1000
; How long to wait before forcefully killing workers
worker-reload-mercy = 30
; Delete sockets during shutdown
vacuum = true
callable = application
; Disable threads only if not using threads and performance is critical
enable-threads = false
; Do not use multiple interpreters (since we only have one service)
single-interpreter = true
; Shutdown when receiving SIGTERM
die-on-term = true
; Fail to start if application cannot load
need-app = true
; Make sure all options in this file exist.
strict = true
; Unfortunately, buffering the upload with nginx and then sending the upload
; to uWSGI does not work for some reason (uWSGI gets stuck when crunching the
; data). The intuitive choice would be to set this here to "1", but a look
; at the sources reveals that this should be set to the buffer size (2MB).
post-buffering = 2097152
post-buffering-bufsize = 2097152
; Reduce or increase this number to limit POST requests. By default,
; the size of POST requests is unlimited.
limit-post = 5242880
; Set the number of workers to something > 1, otherwise
; only one client can connect via nginx to uWSGI at a time.
; See https://github.com/ckan/ckan/issues/5933
; In addition, use two threads per worker.
workers = 4
; Use lazy apps to avoid the `__Global` error.
; See https://github.com/ckan/ckan/issues/5933#issuecomment-809114593
lazy-apps = true
; If we don't want to cache the files that users want to download
; (i.e. set `proxy_max_temp_file_size 0;` in nginx), then we have to
; set socket-timeout to a very large number (e.g. 7200).
; We may also want to increase this number if the storage location for
; resources has a low write speed (e.g. NFS). From the uWSGI sources,
; it looks like the default value is 4s.
socket-timeout = 500
; (Note that we are serving CKAN via http11-socket behind nginx).
; Otherwise, downloads will fail with `uwsgi_response_sendfile_do() TIMEOUT !!!`,
; because the client cannot download the file from nginx as fast as
; uWSGI can send the file to nginx. But in this case, we can really only
; have as many connections as we have workers.
; On the other hand, if we set `proxy_max_temp_file_size 100000m;`
; in nginx, then all downloads will be cached by nginx. And nginx will
; handle all users. The purpose of setting `workers` to `4` in uWSGI
; is now only so that CKAN does not block for as long as it takes the
; system to copy the download from uwsgi to nginx's `proxy_temp_path`.
; In other words, CKAN will only be unresponsive if 4 downloads are
; started at the same time for as long as it takes the smallest download
; to be copied over the http socket from uWSGI to nginx.
; Custom logging
; disable logging in general (files easily get above 50MB)
disable-logging = true
; enable logging for a few specific cases
log-4xx = true
log-5xx = true
log-ioerror = true
; set the log format to match that of CKAN
log-date = %%Y-%%m-%%d %%H:%%M:%%S
logformat-strftime = true
logformat = %(ftime) uWSGI %(addr) (%(proto) %(status)) %(method) %(uri) => %(size) bytes in %(msecs) msecs to %(uagent)
threaded-logger = true
; https://stumbles.id.au/how-to-fix-uwsgi-oserror-write-error.html
disable-write-exception = true
ignore-write-errors = true
ignore-sigpipe = true
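The numeric values in the uWSGI file above are plain seconds and bytes. As a small sketch, shell arithmetic reproduces them, which may help when you adapt the limits to your own setup (the variable names are only illustrative):

```shell
#!/bin/sh
# Derive the numeric values used in ckan-uwsgi.ini above.
harakiri=$((10 * 60 * 60))           # 10 hours in seconds
post_buffering=$((2 * 1024 * 1024))  # 2 MiB in bytes
limit_post=$((5 * 1024 * 1024))      # 5 MiB in bytes

echo "harakiri = $harakiri"
echo "post-buffering = $post_buffering"
echo "limit-post = $limit_post"
```

For example, allowing POST requests of up to 100 MiB would mean setting limit-post to 104857600.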
Unattended upgrades
Unattended upgrades offer a simple way of keeping the server up-to-date and patched against security vulnerabilities.
apt-get install unattended-upgrades apt-listchanges
Edit the file /etc/apt/apt.conf.d/50unattended-upgrades to your liking. The default settings should already work, but you might want to set up email notifications and automated reboots.
Note
If you have access to an internal email server and wish to get email notifications from your system, install
apt install bsd-mailx ssmtp
and edit /etc/ssmtp/ssmtp.conf:
Note that this is something different than CKAN email notifications.
In order for unattended upgrades to work properly: whenever updates are installed, make sure that needrestart automatically restarts the services by editing the file /etc/needrestart/needrestart.conf and setting:
$nrconf{restart} = 'a';
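If you prefer to script this change, the line can be set with sed. This is a sketch that assumes GNU sed (as shipped with Ubuntu) and that the file contains a commented or uncommented $nrconf{restart} line; the helper name set_needrestart_auto is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: set the restart mode to automatic ('a') in the
# needrestart configuration file $1, whether or not the line is commented out.
set_needrestart_auto() {
    new="\$nrconf{restart} = 'a';"
    # The bracket expressions [$], [{] and [}] match the literal characters.
    sed -E -i "s/^#?[[:space:]]*[\$]nrconf[{]restart[}][[:space:]]*=.*\$/$new/" "$1"
}

conf=/etc/needrestart/needrestart.conf
if [ -w "$conf" ]; then
    set_needrestart_auto "$conf"
    grep -F 'nrconf{restart}' "$conf"
else
    echo "cannot write $conf (run as root after installing needrestart)"
fi
```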
Supervisor
Sometimes the ckan-uwsgi start job might take a little longer and the default
(1s) is not long enough so supervisor becomes impatient. Edit the file
/etc/supervisor/conf.d/ckan-uwsgi.conf and add startsecs=60.
Also, since we are using die-on-term in the uWSGI configuration,
make sure to remove stopsignal=QUIT in the supervisor configuration
files for ckan and datapusher.
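Both changes can be applied with a few lines of shell. The sketch below assumes GNU sed and that each supervisor file holds a single [program:...] section and ends with a newline; the helper name patch_supervisor_conf is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: apply both changes to the supervisor file $1.
patch_supervisor_conf() {
    # Remove an explicit `stopsignal=QUIT` so that supervisor falls back
    # to its default stop signal (TERM), and drop any old startsecs line.
    sed -i \
        -e '/^[[:space:]]*stopsignal[[:space:]]*=[[:space:]]*QUIT[[:space:]]*$/d' \
        -e '/^[[:space:]]*startsecs[[:space:]]*=/d' "$1"
    # Append the new startsecs value to the (single) program section.
    printf 'startsecs=60\n' >> "$1"
}

conf=/etc/supervisor/conf.d/ckan-uwsgi.conf
if [ -w "$conf" ]; then
    patch_supervisor_conf "$conf"
    echo "patched $conf; apply with: supervisorctl reread && supervisorctl update"
fi
```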
Systemd
It is important that all services required for CKAN to run are started
before supervisor starts. This can be achieved by running
systemctl edit supervisor and pasting the following config:
[Unit]
# https://github.com/systemd/systemd/issues/1312#issuecomment-228874771
# Requires=solr.service
# After=solr.service
Wants=solr.service
Wants=redis.service
Wants=postgresql.service
[Service]
# ExecStartPre= is only honored in the [Service] section
ExecStartPre=systemctl is-active solr.service
ExecStartPre=systemctl is-active redis.service
ExecStartPre=systemctl is-active postgresql.service
Restart=always
RestartSec=20
If solr is slow when starting up, add this to its unit file via systemctl edit solr:
[Service]
ExecStartPost=/bin/sleep 250
Restart=on-failure
RestartSec=10s
Afterwards run:
systemctl daemon-reload