Online validation configuration
An important part of the UD infrastructure is the automatic online data validation system. It runs on a virtual machine currently hosted at Charles University, maintained by Dan Zeman. It is accessible through the following links:
- Validation report
- Registration of language-specific validation data
On the virtual machine udvalidator (accessible through quest.ms.mff.cuni.cz), the relevant files and folders
reside in /usr/lib/cgi-bin/unidep. This folder has hundreds of subfolders, most of which are clones of UD repositories
from GitHub: all treebank repositories as well as tools, docs, and docs-automation. Special attention must be
paid to the access rights of the folders, including their .git subfolders. When doing manual work, I access the server
under my own user name, but all automation occurs under the user www-data. Therefore, that user must have full access
to the subfolders to be able to pull new versions from GitHub and to save other data. I use access control lists to
achieve that (the commands setfacl and getfacl).
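As an illustration of the ACL setup described above, here is a minimal sketch that builds the setfacl and getfacl calls for one repository clone. The permission string rwX, the use of a default ACL, and the helper names acl_commands and apply_acl are assumptions made for this example; the exact ACL entries used on the server are not documented here.

```python
import subprocess


def acl_commands(folder, user="www-data", perms="rwX"):
    """Build the commands that give `user` access to `folder` recursively.

    Assumption: "rwX" (read, write, traverse directories) is what
    "full access" means here; adjust if the server uses something else.
    """
    spec = f"u:{user}:{perms}"
    return [
        # Grant access to everything that already exists, including .git.
        ["setfacl", "-R", "-m", spec, folder],
        # Set a default ACL so that newly created files inherit the access.
        ["setfacl", "-R", "-d", "-m", spec, folder],
        # Show the resulting ACL of the top folder for checking.
        ["getfacl", folder],
    ]


def apply_acl(folder, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in acl_commands(folder):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling apply_acl("/usr/lib/cgi-bin/unidep/UD_Czech-PDT") only prints the three commands, so they can be reviewed before being run with dry_run=False.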
There is also a clone of udapi-python (alternatively, Udapi could be installed
with pip, but the package there is not updated as frequently) and a Python virtual environment with the packages needed to run Udapi
and the validator (regex, colorama, termcolor).
A webhook is set up in the GitHub UniversalDependencies organization to ensure the virtual server is contacted every
time someone pushes changes to any UD GitHub repository. Organization owners can edit the webhook in the
organization settings on GitHub. The githook.pl script on the server is responsible for processing the hook, which
typically involves two steps:
- Pull the new contents of the affected repository from GitHub.
- Decide whether the changes call for revalidation of one or more treebanks. If so, run the validator and update the validation report.
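The two steps above can be sketched as a small decision function. This is not the logic of githook.pl, which is written in Perl and not reproduced here; it is a hypothetical Python illustration. It reads the repository name from the push payload (GitHub sends it as repository.name) and assumes that treebank repositories are recognizable by the UD_ prefix and that a change in tools calls for revalidating everything. Both rules and the name plan_for_push are assumptions.

```python
import json


def plan_for_push(payload_text):
    """Return the list of actions for one GitHub push event.

    Each action is a (verb, target) pair. The rules below are
    illustrative assumptions, not the actual rules of githook.pl.
    """
    payload = json.loads(payload_text)
    repo = payload["repository"]["name"]

    # Step 1: always pull the new contents of the affected repository.
    actions = [("pull", repo)]

    # Step 2: decide whether any treebank must be revalidated.
    if repo.startswith("UD_"):
        # A treebank changed: revalidate just that treebank.
        actions.append(("validate", repo))
    elif repo == "tools":
        # Assumption: a change in tools may affect every treebank.
        actions.append(("validate", "ALL"))
    # Other repositories (for example docs) only need the pull.
    return actions
```

For a push to a treebank repository the plan contains a pull followed by a validation of that treebank; for a push to docs it contains only the pull.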
However, the hook will not take care of adding a newly created treebank to the virtual server (I did not give user
www-data write access to the main folder). Therefore, new treebanks must be cloned manually on the validation server.
Similarly, if githook.pl or the other scripts called by it change, we must go to the server and activate them.
The scripts are kept in the docs-automation repository, which is updated automatically via webhook, but githook.pl
does not call them from there. Instead, it calls a copy in the main folder. More precisely, it is not a copy but
a hardlink; nevertheless, after git pull the main folder copy gets disconnected from the copy in docs-automation
and must be reconnected by running docs-automation/valdan/lnquest.sh. (Note: I do not remember why I opted for
hardlinks but I suspect the reason may have been that the scripts should think they live in the folder above all the
UD folders and they can access the UD folders via relative paths.)
On the other hand, the CGI scripts responsible for registration of language-specific validation data do not need this.
They are accessed through a symlink (langspec -> docs-automation/valrules/) and they are available immediately after
the automatic git pull.
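The different behavior of the two kinds of links can be reproduced in a temporary folder. The sketch below assumes that an update replaces a file with a newly written one (which is consistent with the disconnection described above) rather than rewriting it in place. The folder layout and the file name form.pl are hypothetical, as is the location of githook.pl inside docs-automation.

```python
import os
import tempfile


def simulate_pull(path, new_text):
    """Replace a file with a freshly written one, giving it a new inode."""
    tmp = path + ".new"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(new_text)
    os.replace(tmp, path)


def link_demo():
    """Compare a hardlinked script with a symlinked folder after an update."""
    with tempfile.TemporaryDirectory() as root:
        repo = os.path.join(root, "docs-automation")
        os.makedirs(os.path.join(repo, "valdan"))
        os.makedirs(os.path.join(repo, "valrules"))
        script = os.path.join(repo, "valdan", "githook.pl")  # assumed location
        form = os.path.join(repo, "valrules", "form.pl")     # hypothetical file
        for path in (script, form):
            with open(path, "w", encoding="utf-8") as f:
                f.write("old\n")

        # Hardlink in the main folder: a second name for the same inode.
        hard = os.path.join(root, "githook.pl")
        os.link(script, hard)
        # Symlink in the main folder: a pointer to the folder path.
        soft = os.path.join(root, "langspec")
        os.symlink(os.path.join("docs-automation", "valrules"), soft)

        linked_before = os.path.samefile(script, hard)
        simulate_pull(script, "new\n")
        simulate_pull(form, "new\n")
        linked_after = os.path.samefile(script, hard)

        with open(hard, encoding="utf-8") as f:
            hard_text = f.read()
        with open(os.path.join(soft, "form.pl"), encoding="utf-8") as f:
            soft_text = f.read()
        return linked_before, linked_after, hard_text, soft_text
```

link_demo() returns (True, False, "old\n", "new\n"): the hardlink keeps pointing at the old file contents until it is recreated, while the symlinked folder shows the new version immediately.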
After a successful edit and save of data in one of the forms provided by those scripts, the changes are pushed both to
docs-automation and to tools (while the source JSON files are always read from docs-automation, which is only
writable by a narrow group of users). This way we survive cases where people overlook or ignore the guidelines and
edit the JSON files in tools (which is currently writable by all members of the Contributors team).
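The idea of one authoritative copy plus a mirrored copy can be sketched as follows. This is a simplified illustration, not the code of the CGI scripts: it leaves out the git commit and push steps, and the function names and the file name passed in are hypothetical.

```python
import json
import os


def save_language_data(data, primary_dir, mirror_dir, filename):
    """Write the same JSON text to the authoritative folder and the mirror."""
    text = json.dumps(data, ensure_ascii=False, indent=2, sort_keys=True)
    for folder in (primary_dir, mirror_dir):
        with open(os.path.join(folder, filename), "w", encoding="utf-8") as f:
            f.write(text + "\n")


def load_language_data(primary_dir, filename):
    """Always read from the authoritative folder, never from the mirror."""
    with open(os.path.join(primary_dir, filename), encoding="utf-8") as f:
        return json.load(f)
```

Because loading never consults the mirror, a stray manual edit in the mirror has no effect on the forms, and the next save overwrites it with the authoritative contents.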
Unsorted
Creating and activating the Python virtual environment:
python3 -m venv /usr/lib/cgi-bin/unidep/.venv
source /usr/lib/cgi-bin/unidep/.venv/bin/activate
More details in valdan/README and valdan/README-system-update.
Install regex (needed directly by the validator), colorama, and termcolor (these two are needed by Udapi) into the virtual environment. Udapi itself could also be installed with pip, but I prefer an unpacked copy of the udapi-python repository from GitHub, with the environment variable $PYTHONPATH set so that it points Python to this copy.
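A minimal sketch of the $PYTHONPATH approach, for cases where a script launches Python as a subprocess. It assumes that the root of the udapi-python clone is the folder that has to be on the path; the helper names env_with_udapi and python_sees are hypothetical.

```python
import os
import subprocess
import sys


def env_with_udapi(udapi_clone, base_env=None):
    """Return a copy of the environment with the clone first in PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    old = env.get("PYTHONPATH")
    env["PYTHONPATH"] = udapi_clone if not old else udapi_clone + os.pathsep + old
    return env


def python_sees(path, env):
    """Check whether a child Python process has `path` in its sys.path."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print('\\n'.join(sys.path))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return path in result.stdout.splitlines()
```

With env = env_with_udapi("/usr/lib/cgi-bin/unidep/udapi-python") (the location of the clone is an assumption), python_sees can confirm that the child interpreter searches the clone before any installed copy of Udapi.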