This morning we realised that a test case failed on Fedora 34 only (the link is in Italian) and we set to debugging.
The initial analysis
This is the initial reproducer:
$ PROJ_DEBUG=3 python setup.py test
test_recipe (tests.test_litota3.TestLITOTA3NordArkimetIFS) ... pj_open_lib(proj.db): call fopen(/lib64/../share/proj/proj.db) - succeeded
proj_create: Open of /lib64/../share/proj/proj.db failed
pj_open_lib(proj.db): call fopen(/lib64/../share/proj/proj.db) - succeeded
proj_create: no database context specified
Cannot instantiate source_crs
EXCEPTION in py_coast(): ProjP: cannot create crs to crs from [EPSG:4326] to [+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs]
ERROR
Note that opening /lib64/../share/proj/proj.db
sometimes succeeds, sometimes
fails. It's some kind of Schrödinger path, which works or not depending on how
you observe it:
# ls -lad /lib64
lrwxrwxrwx 1 1000 1000 9 Jan 26 2021 /lib64 -> usr/lib64
$ ls -la /lib64/../share/proj/proj.db
-rw-r--r-- 1 root root 8925184 Jan 28 2021 /lib64/../share/proj/proj.db
$ cd /lib64/../share/proj/
$ cd /lib64
$ cd ..
$ cd share
-bash: cd: share: No such file or directory
And indeed, stat(2)
finds it, and sqlite doesn't (the file is a sqlite
database):
$ stat /lib64/../share/proj/proj.db
File: /lib64/../share/proj/proj.db
Size: 8925184 Blocks: 17432 IO Block: 4096 regular file
Device: 33h/51d Inode: 56907 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2021-11-08 14:09:12.334350779 +0100
Modify: 2021-01-28 05:38:11.000000000 +0100
Change: 2021-11-08 13:42:51.758874327 +0100
Birth: 2021-11-08 13:42:51.710874051 +0100
$ sqlite3 /lib64/../share/proj/proj.db
Error: unable to open database "/lib64/../share/proj/proj.db": unable to open database file
A minimal reproducer
Later on we started stripping layers of code towards a minimal reproducer: here it is. It works or doesn't work depending on whether proj is linked explicitly, or via MagPlus:
$ cat tc.cc
#include <magics/ProjP.h>
int main() {
magics::ProjP p("EPSG:4326", "+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs");
return 0;
}
$ g++ -o tc tc.cc -I/usr/include/magics -lMagPlus
$ ./tc
proj_create: Open of /lib64/../share/proj/proj.db failed
proj_create: no database context specified
terminate called after throwing an instance of 'magics::MagicsException'
what(): ProjP: cannot create crs to crs from [EPSG:4326] to [+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs]
Aborted (core dumped)
$ g++ -o tc tc.cc -I/usr/include/magics -lproj -lMagPlus
$ ./tc
What is going on here?
A difference between the two is the path used to link to libproj.so:
$ ldd ./tc | grep proj
libproj.so.19 => /lib64/libproj.so.19 (0x00007fd4919fb000)
$ g++ -o tc tc.cc -I/usr/include/magics -lMagPlus
$ ldd ./tc | grep proj
libproj.so.19 => /lib64/../lib64/libproj.so.19 (0x00007f6d1051b000)
Common sense screams that this should not matter, but we chased an intuition and found that one of the ways proj looks for its database is relative to its shared library.
Indeed, gdb in hand, that dladdr
call returns /lib64/../lib64/libproj.so.19
.
From /lib64/../lib64/libproj.so.19
, proj strips two paths from the end,
presumably to pass from something like /something/usr/lib/libproj.so
to /something/usr
.
So, dladdr
returns /lib64/../lib64/libproj.so.19
, which becomes
/lib64/../
, which becomes /lib64/../share/proj/proj.db
, which exists on the
file system and is used as a path to the database.
But depending how you look at it, that path might or might not be valid: it
passes the stat(2)
check that stops the lookup for candidate paths, but
sqlite is unable to open it.
Why does the other path work?
By linking libproj.so in the other way, dladdr
returns
/lib64/libproj.so.19
, which becomes /share/proj/proj.db
, which doesn't
exist, which triggers a fallback
to a PROJ_LIB
constant defined at compile time, which is a path that works no
matter how you look at it.
Why that weird path with libMagPlus?
To complete the picture, we found that libMagPlus.so
is packaged with a
rpath set, which is known to cause
trouble
# readelf -d /usr/lib64/libMagPlus.so|grep rpath
0x000000000000000f (RPATH) Library rpath: [$ORIGIN/../lib64]
The workaround
We found that one can set PROJ_LIB
in the environment to override the normal
proj database lookup. Building on that, we came up with a simple way to
override it on Fedora 34 only:
if distro is not None and distro.linux_distribution()[:2] == ("Fedora", "34") and "PROJ_LIB" not in os.environ:
self.env_overrides["PROJ_LIB"] = "/usr/share/proj/"
This has been a most edifying and educational debugging session, with only the necessary modicum of curses and swearwords. Working in a team of excellent people really helps.