Message boards : Number crunching : "Output file . . .absent"

Jan Henrik
Joined: 22 Mar 16
Posts: 13
Credit: 1,113,528,333
RAC: 940
Message 5288 - Posted: 4 May 2022, 21:07:42 UTC

I got about 130 tasks that failed with compute errors after 1 or 2 seconds,
and the event log printed something like this:

Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Computation for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 finished
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_0 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_1 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_2 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_4 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_5 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Starting task universe_bh2_190723_398_6021763979_20000_1-999999_770100_0

It happened on only one machine, though, and not on another, completely identical one.
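
For anyone who wants to see how many tasks are affected, here is a minimal sketch that counts the "Output file ... absent" lines per task in a saved copy of the event log. It assumes the lines look exactly like the excerpt above; the function name `count_absent_outputs` is made up for illustration and is not part of BOINC.

```python
import re
from collections import Counter

# Matches the client's "Output file <file> for task <task> absent" message,
# assuming the line layout shown in the log excerpt above.
ABSENT_RE = re.compile(r"\| Output file (\S+) for task (\S+) absent\s*$")


def count_absent_outputs(log_text: str) -> Counter:
    """Return a Counter mapping task name -> number of absent output files."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ABSENT_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Three lines copied from the excerpt above: two "absent" lines, one unrelated.
sample = """\
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_0 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Output file universe_bh2_190723_398_6021738979_20000_1-999999_745100_1_r565751666_1 for task universe_bh2_190723_398_6021738979_20000_1-999999_745100_1 absent
Sun 01 May 2022 10:02:33 PM CEST | Universe@Home | Starting task universe_bh2_190723_398_6021763979_20000_1-999999_770100_0
"""

absent = count_absent_outputs(sample)
for task, missing in absent.items():
    print(f"{task}: {missing} output file(s) absent")
```

Grouping by task rather than by file makes it easier to compare the count with the number of errored tasks shown on the project website.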
______________
"less than a pixel"

Keith Myers
Joined: 10 May 20
Posts: 308
Credit: 4,733,484,700
RAC: 229,771
Message 5290 - Posted: 4 May 2022, 23:17:10 UTC

You have a slow storage system or an anti-virus program running that is locking the slots directories so the client can't access or read the output file.

A proud member of the OFA (Old Farts Association)

Jan Henrik
Joined: 22 Mar 16
Posts: 13
Credit: 1,113,528,333
RAC: 940
Message 5342 - Posted: 7 May 2022, 20:00:47 UTC - in response to Message 5290.  

> You have a slow storage system or an anti-virus program running that is locking the slots directories so the client can't access or read the output file.


Thanks for your input.
I'm on Ubuntu 22.04 LTS, have no AV (or would you recommend one?), and the storage is a recent NVMe drive, which I checked anyway and it is OK.
Could it be the server?
It obviously couldn't handle yesterday's uploads, and since the pride show effectively starts before its official start, when the needy bunker tasks, the downloads were getting more and more difficult around that time. Buggy downloads? . . . just a theory.

Anyway, it hasn't happened since then. So for now it is just a peculiar curiosity.

Thanks again for your thoughts.

BTW: what are the requirements for joining the OFA?
______________
"less than a pixel"

Grant (SSSF)
Joined: 23 Apr 22
Posts: 151
Credit: 69,772,000
RAC: 10,241
Message 5345 - Posted: 7 May 2022, 22:42:47 UTC - in response to Message 5342.  
Last modified: 7 May 2022, 22:46:16 UTC

> Buggy downloads? . . . just a theory.
Unlikely.
The error means that there were no result output files on your system to return to the project.
The usual causes are an AV programme taking offence and deleting the files, or system I/O being so heavy that the science application writes the files but they never actually reach the disk (or, at the very least, they reach the disk but the changes don't make it into the computer's file system, so as far as it's concerned they don't exist).

But why would it occur after a few seconds? That should result in a computation error, and even that should result in some sort of result files being produced.

So I guess corrupted downloads are a possibility, but the resulting errors are very odd if that was the case.
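
To illustrate the "written but never made it to disk" case, here is a minimal sketch of what a durable write looks like from an application's point of view. It is not how the Universe@Home application works (that is not known here); the temporary directory merely stands in for a BOINC slot directory, and the file name `result_0.dat` is hypothetical.

```python
import os
import tempfile


def write_result_durably(path: str, data: bytes) -> None:
    """Write data and ask the OS to flush it through to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push its cache to the device


def output_present(path: str) -> bool:
    """True if the file exists and is non-empty, i.e. would not be 'absent'."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


with tempfile.TemporaryDirectory() as slot_dir:  # stand-in for a slot directory
    result_path = os.path.join(slot_dir, "result_0.dat")  # hypothetical name
    before = output_present(result_path)
    write_result_durably(result_path, b"example payload")
    after = output_present(result_path)

print(f"present before write: {before}, after write: {after}")
```

Without the `flush()` and `fsync()` calls the data can sit in buffers for a while, which is the window in which a heavily loaded system might lose it.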
Grant
Darwin NT

entity
Joined: 3 May 18
Posts: 2
Credit: 30,882,667
RAC: 13
Message 5346 - Posted: 8 May 2022, 1:21:42 UTC - in response to Message 5345.  

The task output might be more revealing than the BOINC client log.

Keith Myers
Joined: 10 May 20
Posts: 308
Credit: 4,733,484,700
RAC: 229,771
Message 5347 - Posted: 8 May 2022, 4:47:20 UTC - in response to Message 5346.  

> The task output might be more revealing than the BOINC client log.

But the task output (the result file(s)) is what is missing. No help there.

If you are thinking of the stderr.txt output, that is what he grabbed the errors from. No help there either.

A proud member of the OFA (Old Farts Association)

Keith Myers
Joined: 10 May 20
Posts: 308
Credit: 4,733,484,700
RAC: 229,771
Message 5348 - Posted: 8 May 2022, 5:00:32 UTC

Since you are on Ubuntu 22.04 I assume you are running the 7.18.1 BOINC client.

In the latest releases there were changes in what BOINC is allowed to access, because of the permissions on the /tmp directory used by the systemd service file.

Look at this issue: https://github.com/BOINC/boinc/issues/3355

It affects some projects because the application writes to /tmp, where the user does not have access permission.

GPUGrid is one of the affected projects I know about, since I crunch there. Some projects have had to rewrite their applications to work around the issue, or have the user edit the BOINC systemd service file and change these parameters:

PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true

I have not experienced this issue at Universe, though. So I still think you had a slow storage system: either the write of the output file was delayed and never made it to disk, or the storage was so bogged down that it couldn't respond in time for the BOINC process to read the file and push it upstream to the project scheduler.
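
To check what the three systemd settings listed above are currently set to, a minimal sketch like the following can parse the text of a unit file. The helper name `read_hardening_settings` and the example unit text are made up for illustration; the excerpt is not copied from a real boinc-client.service.

```python
HARDENING_KEYS = ("PrivateTmp", "ProtectSystem", "ProtectControlGroups")


def read_hardening_settings(unit_text: str) -> dict:
    """Return the last value assigned to each hardening key in the unit text."""
    found = {}
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and section headers such as [Service].
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in HARDENING_KEYS:
            found[key] = value.strip()  # later assignments override earlier ones
    return found


# Hypothetical excerpt, used only to exercise the parser.
example_unit = """\
[Service]
# sandboxing options
PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true
"""

settings = read_hardening_settings(example_unit)
for key in HARDENING_KEYS:
    print(f"{key} = {settings.get(key, '(not set)')}")
```

On a real machine the text could come from `systemctl cat` for the BOINC service; the exact service name depends on the distribution's package and is an assumption to verify locally.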

A proud member of the OFA (Old Farts Association)

Jan Henrik
Joined: 22 Mar 16
Posts: 13
Credit: 1,113,528,333
RAC: 940
Message 5352 - Posted: 9 May 2022, 3:33:15 UTC

Thanks for all the input.

First of all, it hasn't reproduced so far, so it's not urgent, more of a curiosity.


> But why would it occur after a few seconds? That should result in a computation error, and even that should result in some sort of result files being produced.
>
> So I guess corrupted downloads are a possibility, but the resulting errors are very odd if that was the case.

Yeah. I have errors that clearly identify as "Error while downloading", and those don't start to compute. So that's OK then. But why did the others even start to compute?

As for the client: yes, it's 7.18.1.
I have 4 machines with the very same OS/update/upgrade/client etc., but it happened on only one.
I know 4 is a small sample, but that has me tilting back to the hardware side.

Although it's a recent NVMe SSD and I checked/benchmarked it, I could still be the "lucky" guy with a faulty brand-new NVMe SSD.
The perpetrator has a complete hardware twin that managed to behave. So I will take both off the project, run tests/benchmarks, and see if there is much difference.
(It's a good time for that now, since the pride show obviously overloads the project server; when all egos are satisfied I might be done with my tests and we can go back to regular boincing.)
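
One simple way to compare the two twins is a small timing probe for synced writes, run in the same directory on both machines. This is only a rough sketch, not a replacement for a proper benchmark tool; the function name, the probe file names, and the default count and size are arbitrary choices for illustration.

```python
import os
import statistics
import tempfile
import time


def time_synced_writes(directory: str, count: int = 20, size: int = 4096) -> list:
    """Time `count` small write+fsync cycles in `directory`; return seconds each."""
    payload = os.urandom(size)
    timings = []
    for i in range(count):
        path = os.path.join(directory, f"probe_{i}.bin")
        start = time.perf_counter()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # include the device flush in the measurement
        timings.append(time.perf_counter() - start)
        os.remove(path)  # clean up each probe file after timing it
    return timings


# The default temporary directory may live on a RAM-backed file system;
# point this at a directory on the NVMe drive for a meaningful number.
with tempfile.TemporaryDirectory() as probe_dir:
    timings = time_synced_writes(probe_dir)

print(f"median {statistics.median(timings) * 1e3:.2f} ms, "
      f"worst {max(timings) * 1e3:.2f} ms")
```

A large gap between the median and the worst case on one machine but not on its twin would point toward the storage side.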
______________
"less than a pixel"

Jan Henrik
Joined: 22 Mar 16
Posts: 13
Credit: 1,113,528,333
RAC: 940
Message 5690 - Posted: 1 Jul 2022, 8:08:45 UTC

. . . almost forgot about this:

Yes! I could reproduce it on another machine and with another project (a different app and different log lingo, but the same 1-2 second failure).

So it's not hardware- or project-specific.

It sometimes happens after updates to the runtime environment.

Restarting the client is not enough; I have to restart the computer, and then it's gone.
______________
"less than a pixel"
Copyright © 2024 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek