Message boards : Number crunching : Not Suspending Properly (Universe BHspin v2)

Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 1905 - Posted: 27 Jan 2017, 14:38:04 UTC
Last modified: 27 Jan 2017, 14:39:50 UTC

I have a task from your project, and the app version is listed as "Universe BHspin v2 0.01" running on Windows 10 x64.

It has a BUG!

It does not suspend when I request it to. This overloads my CPU, which then causes problems with UI responsiveness and interaction. It also throws off benchmarks.

Can you please fix your app? I may have to set your project for "No New Work" until you get it resolved.

Thanks,
Jacob
Profile Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 1906 - Posted: 27 Jan 2017, 17:16:30 UTC - in response to Message 1905.  
Last modified: 27 Jan 2017, 17:27:18 UTC

Do you have "Leave application in memory while suspended" set in the BOINC Manager options?

Edit:
I just checked: on my Windows 7 machine it suspends correctly and releases the CPU cores.
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 1907 - Posted: 27 Jan 2017, 17:36:02 UTC - in response to Message 1906.  

Yes, I have the "Leave non-GPU tasks in memory while suspended" option checked.
I rely on that, to not waste work, when I suspend and resume.

I hope you can repro the problem.
Profile Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 1908 - Posted: 27 Jan 2017, 18:10:55 UTC - in response to Message 1907.  

Unfortunately I can't.
Also, you are the first person since 22 July 2016 to report this behavior to me...
Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 1909 - Posted: 27 Jan 2017, 20:17:36 UTC
Last modified: 27 Jan 2017, 20:17:56 UTC

Maybe it's related to the specific task?

Can you take a look -- it is:
universe_bh2_160803_59_3_20000_1-999999_190000_0
http://universeathome.pl/universe/result.php?resultid=19531577
http://universeathome.pl/universe/workunit.php?wuid=8665237
Profile Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 1910 - Posted: 27 Jan 2017, 22:08:40 UTC - in response to Message 1909.  

It looks like another user finished it without problems...
Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 1911 - Posted: 27 Jan 2017, 22:20:46 UTC

What does another user finishing the task have to do with whether your app responds to suspend requests?
Profile Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 1912 - Posted: 28 Jan 2017, 2:03:48 UTC - in response to Message 1911.  

Nothing really, but if something were wrong with the WU, I would see it in the database or in the result files.
Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 1915 - Posted: 30 Jan 2017, 14:18:22 UTC
Last modified: 30 Jan 2017, 14:18:30 UTC

This task:
- did not suspend correctly (when I suspended BOINC or the task)
- did not exit correctly (when I exited BOINC)
- did not checkpoint anymore (with several hours of runtime wasted, in between BOINC exits)
- did not abort correctly (when I aborted it, it continued to run until the heartbeat check failed).

I don't know the cause, but something was really messed up with it.
I aborted it.
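The four symptoms listed above (no suspend, no exit, no checkpoint, no abort) would all be consistent with a compute loop that has stopped polling for control requests, although nothing in this thread confirms that this is the cause. As a minimal sketch of what cooperative handling looks like, assuming a hypothetical `Worker` class with made-up `suspended` and `quit_requested` flags (this is not the project's code and not the BOINC API):

```python
import threading
import time


class Worker:
    """Hypothetical compute loop that polls control flags between steps."""

    def __init__(self, total_steps, checkpoint_every):
        self.total_steps = total_steps
        self.checkpoint_every = checkpoint_every
        self.step = 0
        self.last_checkpoint = 0
        self.suspended = threading.Event()       # assumed "suspend" request flag
        self.quit_requested = threading.Event()  # assumed "exit/abort" request flag

    def save_checkpoint(self):
        # A real application would write its state to disk here.
        self.last_checkpoint = self.step

    def run(self):
        while self.step < self.total_steps:
            if self.quit_requested.is_set():
                # Save progress, then stop promptly.
                self.save_checkpoint()
                return "quit"
            if self.suspended.is_set():
                # Stay in memory but release the CPU while suspended.
                time.sleep(0.01)
                continue
            self.step += 1  # one unit of real work would go here
            if self.step % self.checkpoint_every == 0:
                self.save_checkpoint()
        return "done"
```

If the work done in a single step never returns, none of the checks at the top of the loop are reached, so the process would keep a full core busy regardless of what is requested from outside.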
Profile Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 1916 - Posted: 30 Jan 2017, 14:21:48 UTC - in response to Message 1915.  

It really is something with your computer's configuration.
Others have not reported such problems, and on my computers all of these functions work (on both Linux and Windows).
Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 1917 - Posted: 30 Jan 2017, 14:30:02 UTC

Thanks. I'll keep an eye out, in case it happens again. Personally, I feel that the task got into a bad state somehow.
Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 1918 - Posted: 31 Jan 2017, 13:50:11 UTC
Last modified: 31 Jan 2017, 14:30:43 UTC

How long are these tasks supposed to take?
Do they complete properly, when resumed from a checkpoint?
I ask, because I have another one that is misbehaving.
Please see all details below.

http://universeathome.pl/universe/result.php?resultid=19571218

"CPU time at last checkpoint" is 02:07:46 (2 hrs)
"CPU time" is 14:32:06 (14.5 hrs)
Estimated time remaining: 03:19:50 (AND INCREASING)
Fraction done: 77.415%
I'm not sure if this was resumed from checkpoint or not.

That's 12.5 hrs without checkpointing, and without completion. It is still using a full CPU core.
Is that expected? I'm on a high-end i7-5960X CPU, using Windows 10 Insider Slow Build 14986.

log.txt has several "making checkpoint" entries, but the last one was at:
02:08:41 00:08:21 making checkpoint: j: 15000; iidd: 4305399

checkpoint.dat has:
15000 4305399 0 1 2

error.dat has:
error: in Renv_con() unknown Ka type: 1, iidd_old: 4436666error: in Menv_con() unknown Ka type: 1, iidd_old: 4436666


error.dat2 and error.dat3 both have:
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 501481

warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 501481

warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1081777

warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1081777

warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 1748698

warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 1748698

warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 4203947

warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 4203947
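For anyone who wants to check the gap between the two CPU times quoted above, here is a minimal sketch. The helper names `hms_to_seconds` and `checkpoint_gap_hours` are made up for illustration, and the inputs are the `HH:MM:SS` values from the post.

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def checkpoint_gap_hours(cpu_time, last_checkpoint):
    """Hours of CPU time accumulated since the last checkpoint."""
    return (hms_to_seconds(cpu_time) - hms_to_seconds(last_checkpoint)) / 3600


# Values quoted in the post: "CPU time" and "CPU time at last checkpoint".
gap = checkpoint_gap_hours("14:32:06", "02:07:46")
print(f"{gap:.1f} hours since last checkpoint")
```

With those values the gap comes out at about 12.4 hours, in line with the roughly 12.5 hours mentioned in the post.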
Profile Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 1919 - Posted: 31 Jan 2017, 19:46:18 UTC - in response to Message 1918.  

You can definitely stop it.
Something is wrong with the Work Unit (not the application).
Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 1920 - Posted: 31 Jan 2017, 19:55:51 UTC - in response to Message 1919.  

Is it possible for you to isolate which tasks/batches are problematic, then cancel them server-side?
Stephane Yelle

Joined: 5 Nov 15
Posts: 1
Credit: 8,191,756
RAC: 8
Message 1949 - Posted: 12 Feb 2017, 23:41:49 UTC
Last modified: 12 Feb 2017, 23:42:21 UTC

I also got a task that didn't seem to end... It was "stuck" at 98% progress.

http://universeathome.pl/universe/workunit.php?wuid=8416651
http://universeathome.pl/universe/result.php?resultid=18991326

It ran for 37,099.25 CPU seconds before I aborted it (it had timed out anyway), three times longer than my last successful task. Its progress wasn't saved when exiting BOINC.
Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 1950 - Posted: 13 Feb 2017, 0:40:48 UTC
Last modified: 13 Feb 2017, 0:41:18 UTC

I think we are tracking these problems in this thread, now:
http://universeathome.pl/universe/forum_thread.php?id=199

So far as I know, it is not yet fixed. I have set No New Tasks, until a fix is confirmed, to prevent wasting my resources.
RedMenace

Joined: 29 May 15
Posts: 2
Credit: 3,523,519
RAC: 0
Message 2098 - Posted: 31 Mar 2017, 21:06:25 UTC

These jobs are still not suspending properly. My cpu runs at 99% even while jobs show suspended. I have to cancel via task manager to reclaim my cpu. And please don't patronize me with settings suggestions. This is a bug that needs fixing. I am stopping all jobs until fixed.
RedMenace

Joined: 29 May 15
Posts: 2
Credit: 3,523,519
RAC: 0
Message 2103 - Posted: 1 Apr 2017, 16:39:06 UTC

I set my main computer to No New Tasks and it still gets tasks and still runs them at 100% when nothing should be running. Can one of you please fix your code?
Profile Conan
Joined: 4 Feb 15
Posts: 48
Credit: 15,956,546
RAC: 54
Message 2107 - Posted: 2 Apr 2017, 9:27:06 UTC
Last modified: 2 Apr 2017, 9:27:34 UTC

I have had a few WUs do this as well, even downloading after the project was suspended, then taking forever to run.
I had to abort them more than once before they stopped downloading.
This happened on both Windows and Linux.

Conan
mmonnin

Joined: 2 Jun 16
Posts: 169
Credit: 317,253,046
RAC: 6
Message 2111 - Posted: 2 Apr 2017, 15:14:24 UTC

Quarks is also screwed up. Suspending the project did not stop the tasks from running at 100%. I had to detach the whole project. Please fix.

Copyright © 2024 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek