Installation
This section describes how to set up your own DCOR production instance.
Ubuntu and CKAN
Please use an Ubuntu 20.04 installation for any development or production usage. This makes it easier to give support and track down issues.
Before proceeding with the installation of CKAN, install the following packages:
apt update
# CKAN requirements
apt install -y libpq5 redis-server nginx supervisor
# needed for building packages that DCOR depends on (dclab)
apt install -y gcc python3-dev
# additional tools that you might find useful, but are not actually required
apt install -y aptitude net-tools mlocate screen needrestart python-is-python3
Install CKAN:
wget https://packaging.ckan.org/python-ckan_2.9-py3-focal_amd64.deb
dpkg -i python-ckan_2.9-py3-focal_amd64.deb
Note
Do NOT set up file uploads when following the instructions
at https://docs.ckan.org. DCOR has its own dedicated directories
for data uploads. The command dcor inspect will try to set up or fix
them for you.
Follow the remainder of the installation guide at https://docs.ckan.org/en/2.9/maintaining/installing/install-from-package.html#install-and-configure-postgresql. Make sure to note down the PostgreSQL password which you will need in the initialization step.
Make sure to initialize the CKAN database with
source /usr/lib/ckan/default/bin/activate
export CKAN_INI=/etc/ckan/default/ckan.ini
ckan db init
DCOR by default stores all data on /data. This makes it easier to
control backups and separate the CKAN/DCOR software from the actual data.
If you have not mounted a block device or a network share on /data,
please create this directory with
mkdir /data
Scratch Space
It is important that you have at least 100 GB of scratch space available on your system, so that the ckanext-dc_serve extension can create temporary condensed datasets before uploading them to S3. By default, the cache is located at /data/tmp/ckanext-dc_serve; you can change the location via the configuration option ckanext.dc_serve.tmp_dir.
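Before enabling ckanext-dc_serve, it can help to confirm that the scratch location really has enough room. The following is a minimal sketch and not part of DCOR: the helper name has_scratch_space is hypothetical, and it only looks at the free space that df reports for the filesystem holding the given directory.

```shell
# Hypothetical helper (not shipped with DCOR): succeed if the filesystem
# that holds DIR has at least MIN_GB gigabytes available.
has_scratch_space() {
    dir="$1"
    min_gb="$2"
    # "df -Pk" prints POSIX-format output in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Example (assumes /data exists):
#   has_scratch_space /data 100 || echo "less than 100 GB free on /data"
```

The check is deliberately coarse: it does not account for other data that may fill the same filesystem later on.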
Object Storage
You should use a cloud storage provider that you trust instead of setting this up yourself. If you know what you are doing (e.g. for testing) and would like to set up S3-compatible object storage yourself, you can use MinIO. On an Ubuntu/Debian machine, install the latest MinIO server like so:
wget https://dl.min.io/server/minio/release/linux-amd64/minio_RELEASEDATE.0.0_amd64.deb
dpkg -i minio_RELEASEDATE.0.0_amd64.deb
This also installs the minio systemd service, which we want to use.
First, make sure that the user defined in the service:
systemctl show minio | grep User=
actually exists. You can add a system user via:
useradd -r minio-user
Then, create a file /etc/default/minio with the following content:
# Volume to be used for MinIO server (make sure minio-user has access).
MINIO_VOLUMES="/srv/minio"
# Use if you want to run MinIO on a custom port (console is the web interface).
MINIO_OPTS="--address :9000 --console-address :9001"
# Root user for the server.
MINIO_ROOT_USER=minio-root-user-account-name
# Root secret for the server.
MINIO_ROOT_PASSWORD=secret-password-for-minio-root-user
# set this for MinIO to reload entries with 'mc admin service restart'
MINIO_CONFIG_ENV_FILE=/etc/default/minio
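A typo in the file name or in a variable name only shows up when the service fails to start, so a quick sanity check can save time. The sketch below is an assumption-laden helper (the name missing_minio_vars is hypothetical); it merely lists which of the three variables used above are not defined in a given file.

```shell
# Hypothetical helper: print every variable from this guide that is
# not defined (as NAME=...) at the start of a line in FILE.
missing_minio_vars() {
    file="$1"
    for var in MINIO_VOLUMES MINIO_ROOT_USER MINIO_ROOT_PASSWORD; do
        grep -Eq "^${var}=" "$file" || echo "$var"
    done
}

# Example:
#   missing_minio_vars /etc/default/minio
```

No output means that all three variables are present; it says nothing about whether their values are valid.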
Now you can enable and start the minio service:
systemctl enable minio
systemctl start minio
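To see whether the server actually came up, you can query MinIO's liveness endpoint (/minio/health/live) on the API port. This is a minimal sketch: the helper name minio_is_live is hypothetical, curl is assumed to be installed (it is not among the packages listed in this guide), and the port 9000 follows the MINIO_OPTS example above.

```shell
# Hypothetical helper: succeed if the MinIO server at BASE_URL answers
# its liveness probe with a successful HTTP status within five seconds.
minio_is_live() {
    base_url="$1"
    curl -fsS --max-time 5 -o /dev/null "${base_url}/minio/health/live"
}

# Example (assumes MinIO listens on port 9000 of this machine):
#   minio_is_live http://localhost:9000 && echo "MinIO is up"
```

If the probe fails, systemctl status minio and journalctl -u minio are the usual places to look for the reason.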
Create a “dcor” user (http://minio.server.name:9001/identity/users/add-user)
with readwrite permissions and create an access key (via “Service Accounts”),
which you can then copy and paste into the ckan.ini configuration:
dcor_object_store.access_key_id = access-key-id
dcor_object_store.secret_access_key = secret-access-key
DCOR Extensions
Installation
Whenever you need to run the ckan/dcor commands or have to update
Python packages, you have to first activate the CKAN virtual environment.
source /usr/lib/ckan/default/bin/activate
With the active environment, first install some basic requirements.
pip install --upgrade pip
pip install wheel
Then, install DCOR, which will install all extensions including their requirements.
pip install dcor_control
Background workers
DCOR comes with three job queues (dcor-short, dcor-normal, and dcor-long) for data processing after a resource is added to a dataset. The CKAN instance populates those queues, and CKAN workers (e.g. started via ckan jobs worker dcor-short) fetch and run the jobs in the background. The workers are run, like CKAN itself, via supervisor and are defined via individual configuration files in /etc/supervisor/conf.d. When you run dcor inspect (see next section), these files will be created with your approval.
Initialization
The dcor_control package installed the entry point dcor, which
allows you to manage your DCOR installation. Just type dcor --help
to find out what you can do with it.
For the initial setup, you have to run the inspect command. You
can also run this command routinely to make sure that your DCOR
installation is set up correctly.
source /usr/lib/ckan/default/bin/activate
dcor inspect
Testing
If you are setting up a development instance, then you might want to be able to run the DCOR tests. This step is not required if you are setting up an instance for production.
For testing purposes, you can use the DCOR vagrant box. It contains a full install of DCOR (including SOLR and object storage) and is updated regularly.
SSL
You have two options. If your server is reachable through the internet, you should use Let’s Encrypt (or a certificate from your organization) to set up SSL. If you are hosting your server on the intranet (clinics scenario), then you should create your own certificate and distribute it to your users.
Creating an SSL certificate (Intranet only)
Start by creating your certificate (valid for 10 years):
openssl req -newkey rsa:4096 -x509 -config openssl-csr-config.txt -days 3650 -nodes -out dcor-example-com.crt -keyout dcor-example-com.key
using this openssl-csr-config.txt file:
[ req ]
default_md = sha256
prompt = no
x509_extensions = req_ext
distinguished_name = req_distinguished_name
[ req_distinguished_name ]
commonName = dcor.example.com
countryName = DE
stateOrProvinceName = Bavaria
localityName = Erlangen
organizationName = DCOR-med
[ req_ext ]
subjectAltName = @alt_names
[ alt_names ]
IP.0 = 10.5.4.3
DNS.0 = dcor.example.com
email.0 = email@dcor.example.com
where dcor.example.com is your fully qualified domain name (FQDN) which maps to the server’s IP address. Using the FQDN makes connection tests easier (e.g. if you only have SSH access to the machine and need to use SSH tunneling to connect to the CKAN instance by mapping its FQDN in the /etc/hosts file to 127.0.0.1 on the testing client).
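Clients match the host name against the subjectAltName entries of the certificate, so it is worth confirming that your FQDN really ended up there. The following is a minimal sketch; the helper name cert_has_dns_name is hypothetical, and it performs a plain substring search in the text dump produced by openssl x509.

```shell
# Hypothetical helper: succeed if the certificate file CERT lists NAME
# as a DNS entry (substring match on the "DNS:<name>" text in the dump).
cert_has_dns_name() {
    cert="$1"
    name="$2"
    openssl x509 -in "$cert" -noout -text | grep -Fq "DNS:${name}"
}

# Example, using the file name from the command above:
#   cert_has_dns_name dcor-example-com.crt dcor.example.com && echo "SAN ok"
```

Because this is a substring match, it would also accept a longer name that merely starts with the given one; for a one-off check after creating the certificate that is usually sufficient.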
You may want to create an encrypted access token for your users.
Now proceed with the SSL configuration below, replacing “dcor.mpl.mpg.de” with your FQDN.
Configuring nginx (SSL and uWSGI proxy)
Encrypting data transfer should be a priority for you. If your server is available online, you can use e.g. Let’s Encrypt to obtain an SSL certificate. If you are hosting CKAN/DCOR internally in your organization, you will have to create a self-signed certificate and distribute the public key to the client machines manually.
First copy the certificate to /etc/ssl/certs and the key to /etc/ssl/private:
cp dcor.mpl.mpg.de.cert /etc/ssl/certs/
cp dcor.mpl.mpg.de.key /etc/ssl/private/
Note
If dclab, Shape-Out, or DCOR-Aid cannot connect to your CKAN instance,
it might be because the certificate in /etc/ssl/certs/ does not
contain the full certificate chain. In this case, just download the
entire certificate chain using Firefox (right-click on the shield
symbol and look at the certificate; there should be a download
option for the chained certificate somewhere) and replace the content
of the .cert file with that.
Then, edit /etc/nginx/sites-enabled/ckan and replace its content with
the following (change dcor.mpl.mpg.de to whatever domain you use):
Now, we need to modify the CKAN uWSGI file at
/etc/ckan/default/ckan-uwsgi.ini:
[uwsgi]
; Since we are behind a webserver (proxy), we use the socket variant.
; We use HTTP1.1 (keep-alives)
http11-socket = 127.0.0.1:8080
uid = www-data
gid = www-data
wsgi-file = /etc/ckan/default/wsgi.py
virtualenv = /usr/lib/ckan/default
module = wsgi:application
master = true
pidfile = /tmp/%n.pid
; 10 hours for very long-lasting uploads
harakiri = 36000
harakiri-verbose = true
; Restart workers after this many requests
max-requests = 1000
; How long to wait before forcefully killing workers
worker-reload-mercy = 30
; Delete sockets during shutdown
vacuum = true
callable = application
; Disable threads only if not using threads and performance is critical
enable-threads = false
; Do not use multiple interpreters (since we only have one service)
single-interpreter = true
; Shutdown when receiving SIGTERM
die-on-term = true
; Fail to start if application cannot load
need-app = true
; Make sure all options in this file exist.
strict = true
; Unfortunately, buffering the upload with nginx and then sending the upload
; to uWSGI does not work for some reason (uWSGI gets stuck when crunching the
; data). The intuitive choice would be to set this here to "1", but a look
; at the sources reveals that this should be set to the buffer size (2MB).
post-buffering = 2097152
post-buffering-bufsize = 2097152
; Reduce or increase this number to limit POST requests. By default,
; the size of POST requests is unlimited.
limit-post = 100000000000
; Set the number of workers to something > 1, otherwise
; only one client can connect via nginx to uWSGI at a time.
; See https://github.com/ckan/ckan/issues/5933
; In addition, use two threads per worker.
workers = 4
; Use lazy apps to avoid the `__Global` error.
; See https://github.com/ckan/ckan/issues/5933#issuecomment-809114593
lazy-apps = true
; If we don't want to cache the files that users want to download
; (i.e. set `proxy_max_temp_file_size 0;` in nginx), then we have to
; set socket-timeout to a very large number (e.g. 7200).
; We may also want to increase this number if the storage location for
; resources has a low write speed (e.g. NFS). From the uWSGI sources,
; it looks like the default value is 4s.
socket-timeout = 500
; (Note that we are serving CKAN via http11-socket behind nginx).
; Otherwise, downloads will fail with `uwsgi_response_sendfile_do() TIMEOUT !!!`,
; because the client cannot download the file from nginx as fast as
; uWSGI can send the file to nginx. But in this case, we can really only
; have as many connections as we have workers.
; On the other hand, if we set `proxy_max_temp_file_size 100000m;`
; in nginx, then all downloads will be cached by nginx. And nginx will
; handle all users. The purpose of setting `workers` to `4` in uWSGI
; is now only so that CKAN does not block for as long as it takes the
; system to copy the download from uwsgi to nginx's `proxy_temp_path`.
; In other words, CKAN will only be unresponsive if 4 downloads are
; started at the same time for as long as it takes the smallest download
; to be copied over the http socket from uWSGI to nginx.
; Custom logging
; disable logging in general (files easily get above 50MB)
disable-logging = true
; enable logging for a few specific cases
log-4xx = true
log-5xx = true
log-ioerror = true
; set the log format to match that of CKAN
log-date = %%Y-%%m-%%d %%H:%%M:%%S
logformat-strftime = true
logformat = %(ftime) uWSGI %(addr) (%(proto) %(status)) %(method) %(uri) => %(size) bytes in %(msecs) msecs to %(uagent)
threaded-logger = true
; https://stumbles.id.au/how-to-fix-uwsgi-oserror-write-error.html
disable-write-exception = true
ignore-write-errors = true
ignore-sigpipe = true
Unattended upgrades
Unattended upgrades offer a simple way of keeping the server up-to-date and patched against security vulnerabilities.
apt-get install unattended-upgrades apt-listchanges
Edit the file /etc/apt/apt.conf.d/50unattended-upgrades to your liking. The default settings should already work, but you might want to set up email notifications and automated reboots.
Note
If you have access to an internal email server and wish to get email notifications from your system, install
apt install bsd-mailx ssmtp
and edit /etc/ssmtp/ssmtp.conf to match your mail server.
Note that this is different from CKAN email notifications.
For unattended upgrades to work properly, make sure that needrestart automatically restarts the services whenever updates are installed. To do so, edit the file /etc/needrestart/needrestart.conf and set:
$nrconf{restart} = 'a';
Supervisor
Sometimes the ckan-uwsgi start job takes a little longer than the default
(1s) allows, and supervisor becomes impatient. Edit the file
/etc/supervisor/conf.d/ckan-uwsgi.conf and add startsecs=60.
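If you prefer to script this change, the following sketch sets the value without opening an editor. It is an illustration under assumptions: the helper name set_startsecs is hypothetical, and when no startsecs line exists yet it simply appends one, which is only correct if the file contains a single [program:...] section.

```shell
# Hypothetical helper: set "startsecs" to SECS in the supervisor
# configuration file CONF, replacing an existing entry or appending one.
set_startsecs() {
    conf="$1"
    secs="$2"
    if grep -Eq '^[[:space:]]*startsecs[[:space:]]*=' "$conf"; then
        # Rewrite the existing entry via a temporary file so that the
        # original file keeps its owner and permissions.
        tmp_file=$(mktemp)
        sed -E "s/^[[:space:]]*startsecs[[:space:]]*=.*/startsecs=${secs}/" "$conf" > "$tmp_file" &&
            cat "$tmp_file" > "$conf"
        rm -f "$tmp_file"
    else
        # Assumption: a single program section, so appending is safe.
        printf 'startsecs=%s\n' "$secs" >> "$conf"
    fi
}

# Example:
#   set_startsecs /etc/supervisor/conf.d/ckan-uwsgi.conf 60
```

Afterwards, supervisorctl reread followed by supervisorctl update makes supervisor pick up the modified configuration.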
Systemd
All services required for CKAN to run must be started
before supervisor. This can be achieved by running
systemctl edit supervisor
and pasting the following config:
[Unit]
Requires=solr.service
After=solr.service
Requires=redis.service
After=redis.service
Requires=postgresql.service
After=postgresql.service
[Service]
Restart=always
RestartSec=20
If solr is slow when starting up, add this to its unit file via systemctl edit solr:
[Service]
ExecStartPost=/bin/sleep 250
Restart=on-failure
RestartSec=10s
Afterwards run:
systemctl daemon-reload