21) Message boards : Number crunching : extreme long wu's (Message 2050)
Posted 22 Mar 2017 by Jacob Klein
Post:
How well do you know BOINC? :)

Look, if the task would error out reasonably, on its own, when it detected the problem, then your reply may be sufficient. BOINC was built to be resilient, both client-side and server-side, when tasks error out on their own.

However, the mechanisms to handle this "unending task" situation are a bit different, and raise a higher alarm and concern.

I believe, in the case of an "unending task", BOINC relies on the <rsc_fpops_bound> configuration.

This project uses the following for that setting, for this "Universe BHspin v2 0.01" application, as seen in client_state.xml:
<rsc_fpops_bound>999999999999999.000000</rsc_fpops_bound>


My 4 main computers have the following characteristics, as seen in client_state.xml:
<p_fpops>4312814363.944559</p_fpops>
<p_fpops>3788808068.659085</p_fpops>
<p_fpops>2768310915.476756</p_fpops>
<p_fpops>1770864494.146483</p_fpops>
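For anyone who wants to pull the same two values from their own machine, here is a minimal sketch. The XML below is a hypothetical, heavily trimmed stand-in for client_state.xml (the real file contains far more, and its exact nesting is an assumption here); only the two tag names come from the post, so the code searches for them anywhere in the tree rather than relying on a fixed layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for client_state.xml. Only the tag names
# <p_fpops> and <rsc_fpops_bound> are taken from the post; the surrounding
# structure and the workunit name are assumptions for illustration.
SAMPLE = """
<client_state>
  <host_info>
    <p_fpops>4312814363.944559</p_fpops>
  </host_info>
  <workunit>
    <name>example_wu</name>
    <rsc_fpops_bound>999999999999999.000000</rsc_fpops_bound>
  </workunit>
</client_state>
"""

def read_fpops_values(xml_text):
    """Return (p_fpops, [rsc_fpops_bound, ...]) found anywhere in the XML."""
    root = ET.fromstring(xml_text)
    p_fpops = float(root.find(".//p_fpops").text)
    bounds = [float(el.text) for el in root.iter("rsc_fpops_bound")]
    return p_fpops, bounds

p_fpops, bounds = read_fpops_values(SAMPLE)
print(p_fpops, bounds)
```

To try it on a real file, read the file contents into a string and pass that to read_fpops_values instead of SAMPLE.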


I'm not sure how much "checkpoint" time is banked by the task, before the error presents itself. Let's assume none.

So, to my knowledge, this means that the task must run continuously for this long, on each of my machines, before BOINC will kill it for reason "<rsc_fpops_bound> exceeded":
999999999999999.000000 / 4312814363.944559 = 231867 seconds = 2.7 days
999999999999999.000000 / 3788808068.659085 = 263935 seconds = 3.1 days
999999999999999.000000 / 2768310915.476756 = 361231 seconds = 4.2 days
999999999999999.000000 / 1770864494.146483 = 564695 seconds = 6.5 days
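The same arithmetic as a small sketch, for anyone who wants to plug in their own numbers. It only reproduces the reasoning in this post (bound divided by the host's p_fpops, no checkpointed time banked); whether the client uses exactly this rate for every app version is an assumption, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86400

# Values quoted from client_state.xml in this post.
RSC_FPOPS_BOUND = 999999999999999.0
P_FPOPS = [
    4312814363.944559,
    3788808068.659085,
    2768310915.476756,
    1770864494.146483,
]

def seconds_until_bound(fpops_bound, p_fpops):
    """Seconds of continuous running before the bound is reached,
    assuming the task is charged p_fpops operations per second."""
    return fpops_bound / p_fpops

for p_fpops in P_FPOPS:
    seconds = seconds_until_bound(RSC_FPOPS_BOUND, p_fpops)
    days = seconds / SECONDS_PER_DAY
    print(f"{p_fpops:.0f} FLOPS: {seconds:.0f} seconds = {days:.1f} days")
```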

So, at least we have that safeguard, which matters most for unattended setups: the task should die if run continuously for up to a week. But if the task gets restarted, I think it loses any time that had accumulated since the last checkpoint, which is why I said it must run continuously.

However, I'm typically restarting my machines every 1.5 days. :/ Which is why this is a problem for me.

There, now you know more about BOINC's coping mechanisms.
It is still a problem we should try harder to fix.

It'd be interesting to know how many tasks have failed due to "<rsc_fpops_bound> exceeded". And sad.
22) Message boards : Number crunching : extreme long wu's (Message 2048)
Posted 22 Mar 2017 by Jacob Klein
Post:
So now you're telling me I should monitor the 60 projects that my computer swarm is attached to? What a joke.

I'm done arguing with you. I will continue to try to solve this problem as much as I can, with or without your help.
23) Message boards : Number crunching : extreme long wu's (Message 2045)
Posted 22 Mar 2017 by Jacob Klein
Post:
At least a task running indefinitely will start to stand out as it drops down the list but remains as "In progress".


If a machine is running unattended, will it matter where it is on the list of running tasks, as it wastes the resource? Sigh.
24) Message boards : Number crunching : extreme long wu's (Message 2042)
Posted 22 Mar 2017 by Jacob Klein
Post:
The difference between a CPDN task failing, and the Universe@Home problem in this thread, is that a CPDN task may fail at some point, while this Universe@Home problem will allow a task to run indefinitely.

If you are unable to appreciate that difference, then you are unable to understand the true concern.

I am desperately trying to repro the issue, to help the admin fix it, because an "indefinitely-running-task" is the worst kind of BOINC task, across all 60 of my projects (including my own CPDN task failures).

Regards,
Jacob.
25) Message boards : Number crunching : extreme long wu's (Message 2029)
Posted 20 Mar 2017 by Jacob Klein
Post:
When running a new task, what is the first indication that we've encountered the problem in this thread?
26) Message boards : Number crunching : extreme long wu's (Message 2026)
Posted 18 Mar 2017 by Jacob Klein
Post:
krzyszp:

Would it help if I actively tried to get the problem to happen on a task on my machine(s)? Then what is the next step? Do you have debug output we can get to? Do we copy the files locally to run them outside of BOINC, to see if it is still reproducible on my machine?

Come on ---- Let's SOLVE this already! :)
27) Message boards : Number crunching : extreme long wu's (Message 2022)
Posted 17 Mar 2017 by Jacob Klein
Post:
Hi. You are posting on a forum where computer enthusiasts care passionately about their usage of computer resources. A bug like this can result in large numbers of computers wasting their resources, and wasting the environment's resources, for no gain, in an indefinite loop with no way to exit the problem (tasks run indefinitely).

If you're not here to help, then don't post. I'm offering my help, despite my disagreement with the admins' response thus far.
28) Message boards : Number crunching : extreme long wu's (Message 2012)
Posted 15 Mar 2017 by Jacob Klein
Post:
If there is anything you'd like me to test or try, tell me what to do and I'll do it. I want it solved, and am willing to try things for you.
29) Message boards : Number crunching : extreme long wu's (Message 2010)
Posted 15 Mar 2017 by Jacob Klein
Post:
If my machines were susceptible to that problem to the extent that I found it unacceptable, I would choose another project. You seem to be asking them to cancel the project, or some portion thereof, for a problem that affects some people more than others and they can't find the solution for.

Are you expecting them to find the bad work units in advance? If they could do that, they could fix them.


I am attached to every possible project, about 60 of them. I routinely do work for about 15 of them. I'm also one of the main BOINC Alpha testers.

What I am asking for is not unreasonable. The request is: if a project has a situation where a task can get stuck in the worst possible state of running indefinitely (100% waste), the project should do everything in its power to stop the bleeding, including possibly taking the app offline or cancelling affected batches of tasks.

It has happened to other projects before, and they have responded correctly. I'm hoping for a correct response with this project. In the meantime, I'm lucky I don't have unattended setups, and I easily set No New Tasks on all 4 of my PCs.
30) Message boards : Number crunching : extreme long wu's (Message 2005)
Posted 15 Mar 2017 by Jacob Klein
Post:
To be clear ...

Tasks that error out eventually are a pain to deal with, but an unattended setup will handle them gracefully enough.
Tasks that run continuously without end are also a pain to deal with, and an unattended setup will end up crunching them indefinitely, wasting electricity and resources the whole time.

I speak loudly, because I think we're dealing with the 2nd case here.

It sounds like you are saying "Oh, it's okay to render machines and CPUs completely useless and have them waste energy, if we make progress overall."
... and that's a very very bad idea.
31) Message boards : Number crunching : extreme long wu's (Message 2002)
Posted 15 Mar 2017 by Jacob Klein
Post:
Unattended machines could be wasting more than just days, on this issue!

Admin, please consider both of these:
- Stopping sending work for tasks that could end up in a never-ending state.
- Server-side-aborting tasks that could end up in a never-ending state.

That's what I'd do. Wasting CPU cycles is equivalent to stealing CPU cycles from other projects.
32) Message boards : Number crunching : extreme long wu's (Message 1999)
Posted 15 Mar 2017 by Jacob Klein
Post:
It is utterly unacceptable that this project still continues to waste CPU cycles, with a known bad app or batch of tasks. Literally, unacceptable. Wasting crunching power!!

I may never turn "No New Tasks" off, because of this non-responsiveness! :(

krzyszp, can't you do something to stop the bleeding, even??
33) Message boards : Number crunching : Not Suspending Properly (Universe BHspin v2) (Message 1950)
Posted 13 Feb 2017 by Jacob Klein
Post:
I think we are tracking these problems in this thread, now:
http://universeathome.pl/universe/forum_thread.php?id=199

So far as I know, it is not yet fixed. I have set No New Tasks, until a fix is confirmed, to prevent wasting my resources.
34) Message boards : Number crunching : extreme long wu's (Message 1947)
Posted 12 Feb 2017 by Jacob Klein
Post:
krzyszp:

Any response yet for:
- Is this an application issue, or is it a bad batch of tasks?
- Is it fixed?
35) Message boards : Number crunching : extreme long wu's (Message 1929)
Posted 3 Feb 2017 by Jacob Klein
Post:
Sounds great, honestly!
Once you've resolved the issue, please let us know when it is safe to disable "No New Tasks", with no fear of getting a task that wastes the CPU. I'll be monitoring this thread.
36) Message boards : Number crunching : extreme long wu's (Message 1922)
Posted 1 Feb 2017 by Jacob Klein
Post:
Do the tasks get into a "running forever with no chance to complete" state?

If so, then this seriously needs to get fixed. I have some computers also running unattended, and if they're wasting their resources, that is unacceptable.

Admins: Has anything been done to determine the cause of these problems? I've tried to give relevant info in my other thread ... but we need input and advice from you too, on how these tasks are supposed to behave, and what you can do to fix the bad behaviors! Can you please help us?

I'm going to set No New Tasks for this project, until the admins can fix this. Please admins!

==========================
For instance, the problematic one I'm going to abort right now, had this:

Application: Universe BHspin v2 0.01
Task Name: universe_bh2_160803_59_3_20000_1-999999_360000_2
URL: http://universeathome.pl/universe/result.php?resultid=19571218

"CPU time at last checkpoint" is 02:07:46 (2 hrs)
"CPU time" is 14:32:06 (14.5 hrs)
Estimated time remaining: 03:19:50 (AND INCREASING)
Fraction done: 77.415%

log.txt has several "making checkpoint" entries, but the last one was at:
02:08:41 00:08:21 making checkpoint: j: 15000; iidd: 4305399

checkpoint.dat:
15000 4305399 0 1 2

error.dat:
error: in Renv_con() unknown Ka type: 1, iidd_old: 4436666error: in Menv_con() unknown Ka type: 1, iidd_old: 4436666

error.dat2 and error.dat3:
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 501481
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 501481
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1081777
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1081777
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1748698
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1748698
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 4203947
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 4203947
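If it helps with triage, here is a rough sketch for summarizing warnings like the ones above. The regular expression is inferred only from the lines quoted in this post, and the meaning of the trailing number is not known to me (the field name "trailing" is just a placeholder), so treat this as an illustration and not as a description of the app's real log format.

```python
import re
from collections import Counter

# A few lines copied from the error.dat2 / error.dat3 output quoted above.
SAMPLE = """\
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 501481
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 501481
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 4203947
"""

# Pattern inferred from the quoted lines only (an assumption).
WARNING_RE = re.compile(
    r"warning: derivative from (?P<func>\w+)\(\) not accurate, "
    r"error: (?P<error>[^,\s]+), K: (?P<k>\d+), (?P<trailing>\d+)"
)

def summarize_warnings(text):
    """Count warnings per function and collect the distinct trailing numbers."""
    counts = Counter()
    trailing_numbers = set()
    for match in WARNING_RE.finditer(text):
        counts[match["func"]] += 1
        trailing_numbers.add(int(match["trailing"]))
    return counts, sorted(trailing_numbers)

counts, trailing_numbers = summarize_warnings(SAMPLE)
print(dict(counts), trailing_numbers)
```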
37) Message boards : Number crunching : Not Suspending Properly (Universe BHspin v2) (Message 1920)
Posted 31 Jan 2017 by Jacob Klein
Post:
Is it possible for you to isolate which tasks/batches are problematic, then cancel them server-side?
38) Message boards : Number crunching : Not Suspending Properly (Universe BHspin v2) (Message 1918)
Posted 31 Jan 2017 by Jacob Klein
Post:
How long are these tasks supposed to take?
Do they complete properly, when resumed from a checkpoint?
I ask, because I have another one that is misbehaving.
Please see all details below.

http://universeathome.pl/universe/result.php?resultid=19571218

"CPU time at last checkpoint" is 02:07:46 (2 hrs)
"CPU time" is 14:32:06 (14.5 hrs)
Estimated time remaining: 03:19:50 (AND INCREASING)
Fraction done: 77.415%
I'm not sure if this was resumed from checkpoint or not.

That's about 12.4 hrs without checkpointing, and without completion. It is still using a full CPU core.
Is that expected? I'm on a high-end i7-5960X CPU, using Windows 10 Insider Slow Build 14986.
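The gap between the two CPU-time readings quoted above can be computed with a short sketch like this one. The one-hour threshold and the function names are hypothetical, chosen only to illustrate how a stalled task might be flagged; BOINC itself does not necessarily apply any such rule.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def uncheckpointed_seconds(cpu_time, checkpoint_cpu_time):
    """CPU seconds spent since the last checkpoint was written."""
    return hms_to_seconds(cpu_time) - hms_to_seconds(checkpoint_cpu_time)

# Hypothetical threshold: flag a task with more than 1 hour of
# CPU time since its last checkpoint.
STALL_THRESHOLD_SECONDS = 3600

# Readings quoted in this post.
gap = uncheckpointed_seconds("14:32:06", "02:07:46")
is_stalled = gap > STALL_THRESHOLD_SECONDS
print(f"{gap} s since last checkpoint ({gap / 3600:.1f} hrs), stalled: {is_stalled}")
```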

log.txt has several "making checkpoint" entries, but the last one was at:
02:08:41 00:08:21 making checkpoint: j: 15000; iidd: 4305399

checkpoint.dat has:
15000 4305399 0 1 2

error.dat has:
error: in Renv_con() unknown Ka type: 1, iidd_old: 4436666error: in Menv_con() unknown Ka type: 1, iidd_old: 4436666


error.dat2 and error.dat3 both have:
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 501481
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 501481
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1081777
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1081777
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1748698
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1748698
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 4203947
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 4203947
39) Message boards : Number crunching : Not Suspending Properly (Universe BHspin v2) (Message 1917)
Posted 30 Jan 2017 by Jacob Klein
Post:
Thanks. I'll keep an eye out, in case it happens again. Personally, I feel that the task got into a bad state somehow.
40) Message boards : Number crunching : Not Suspending Properly (Universe BHspin v2) (Message 1915)
Posted 30 Jan 2017 by Jacob Klein
Post:
This task:
- did not suspend correctly (when I suspended BOINC or the task)
- did not exit correctly (when I exited BOINC)
- did not checkpoint anymore (with several hours of runtime wasted, in between BOINC exits)
- did not abort correctly (when I aborted it, it continued to run until the heartbeat check failed).

I don't know the cause, but something was really messed up with it.
I aborted it.


Copyright © 2024 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek