1) Message boards : Number crunching : extreme long wu's (Message 2586)
Posted 30 Jan 2018 by Jacob Klein
Post:
In this thread, an admin gave up trying to solve the issue, despite my providing as much seemingly useful info as I could and offering time to help reproduce and isolate it.

I have set No New Tasks for Universe@Home on all my PCs, so that my resources do not get wasted, and I've recommended the same to other BOINC friends.

I hope an admin takes this more seriously at some point.
2) Message boards : Number crunching : extreme long wu's (Message 2498)
Posted 19 Nov 2017 by Jacob Klein
Post:
If the admin was serious about fixing the problem, it would have been fixed.
I gave everything I could, to try to help him. But he never treated it like a significant problem.

All I can do is recommend people set "No New Tasks" for Universe@Home, until the admin takes this more seriously and has fixes for us to test.

Jacob Klein
3) Message boards : Number crunching : extreme long wu's (Message 2378)
Posted 26 Sep 2017 by Jacob Klein
Post:
Admins:
Could you please try again to fix this?

I am offering to help out with additional testing to help reproduce, isolate, and fix this issue.

Until it is fixed, you are wasting resources, and I'm encouraging everyone I can to set "No New Tasks" on Universe@Home. Your lack of effort leaves no choice.
4) Message boards : Number crunching : extreme long wu's (Message 2359)
Posted 15 Sep 2017 by Jacob Klein
Post:
If any WU calculates longer than 6 hours, feel free to abort it.


Or turn off getting new tasks until this is fixed.


EXACTLY.

I have several PCs doing BOINC work, and I can't monitor the details of every task that they do. This problem is real, and it wastes resources, making a CPU thread completely useless, as it spins its wheels on a task that won't complete...

The devs here should put more effort into solving this problem, instead of not caring about wasted resources. Hell, for that reason alone I'd set "No New Tasks", but I'll also do it because the tasks here sometimes don't work and waste my CPU.

I still hope for a fix, but in the meantime, you don't deserve my CPU if you're going to abuse it. "No New Tasks" for you.
5) Message boards : Number crunching : extreme long wu's (Message 2341)
Posted 25 Aug 2017 by Jacob Klein
Post:
That doesn't work for unattended machines.
Please fix your problem already, so computer resources aren't continually wasted!
You were provided details, 6 months ago.
6) Message boards : Number crunching : extreme long wu's (Message 2279)
Posted 30 Jun 2017 by Jacob Klein
Post:
This is still an issue! :(

Can you PLEASE put more effort to STOP WASTING OUR CPU RESOURCES on this bug?? FIX IT!
I'm now PERMANENTLY setting "No New Tasks" for your project, because it ABUSES my resources!


I just had another task that wasn't responding to suspend and wasn't checkpointing, despite running for many hours. I confirmed that running it standalone exhibited the same problematic behavior. I aborted it. Details below.

OS: Windows 10 Pro x64, Insider Fast Ring, Build 16232
CPU: Intel i7-5960X
Executable: BHspin2_1_windows_intelx86.exe (Executed from an Admin Command Prompt)

param.in:
BHspin2 v:160803
SET num_tested 20000
SET hub_val 1000
SET idum -10000
SET OUTPUT 3
SET Mmina 5.0
SET Mminb 3.0
SET golambda 10
SET Beta 1.0
SET Fa 1.0
SET Sigma3 130
SET Sal -2.3
SET SS 0
SET ZZ 0.0001


error.dat:
error: in Renv_con() unknown Ka type: 1, iidd_old: 687281error: in Menv_con() unknown Ka type: 1, iidd_old: 687281


error.dat2
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 291867
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 291867
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 512091
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 512091


error.dat3
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 291867
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 291867
warning: derivative from dlnRdt() not accurate, error: 1e+030, K: 8, 512091
warning: derivative from dlnRdlnM() not accurate, error: 1e+030, K: 8, 512091


checkpoint.dat:
5000 600133 0 1 2


log.txt:
00:00:00 00:00:00 PROGRAM START: Fri Jun 30 09:49:13 2017
00:00:00 00:00:00 no checkpoint.dat file found00:00:00 00:00:00 cleaning checkpoints
00:00:00 00:00:00 gw_cpfile: source file "data0.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data1.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data2.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "error.dat2" not present
00:00:00 00:00:00 reading checkpoint: istart: -1; pp: 0; n: -1
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -10000
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: golambda = 10
00:00:00 00:00:00 PARAMIN: Beta = 1.0
00:00:00 00:00:00 PARAMIN: Fa = 1.0
00:00:00 00:00:00 PARAMIN: Sigma3 = 130
00:00:00 00:00:00 PARAMIN: Sal = -2.3
00:00:00 00:00:00 PARAMIN: SS = 0
00:00:00 00:00:00 PARAMIN unknown parameter: name: SS; value: 0
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -10000; num_tested: 20000
00:08:46 00:08:46 making checkpoint: j: 1000; iidd: 116149
00:08:46 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:08:46 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:08:46 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:08:46 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:08:46 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:08:46 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:08:46 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:08:46 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:20:05 00:11:19 making checkpoint: j: 2000; iidd: 234065
00:20:05 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:20:05 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:20:05 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:20:05 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:20:05 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:20:05 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:20:05 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:20:05 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:31:17 00:11:12 making checkpoint: j: 3000; iidd: 355368
00:31:17 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:31:17 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:31:17 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:31:17 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:31:17 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:31:17 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:31:17 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:31:17 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:41:49 00:10:32 making checkpoint: j: 4000; iidd: 481406
00:41:49 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:41:49 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:41:49 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:41:49 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:41:49 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:41:49 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:41:49 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:41:49 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:51:08 00:09:19 making checkpoint: j: 5000; iidd: 600133
00:51:08 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:51:08 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:51:08 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:51:08 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:51:08 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:51:08 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:51:08 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:51:08 00:00:00 gw_cpfile: error.dat appended to error.dat3
7) Message boards : Number crunching : extreme long wu's (Message 2247)
Posted 18 May 2017 by Jacob Klein
Post:
Okay. If CPU Time between checkpoints is > 2 hours, it is a problem.
Do examples (like the one I emailed to you), help you to solve the problem??

Definitely, because I get proper data to find the source of the problem...

Have you learned anything new, using the example that I sent to you?
I put in hours of effort testing it and then getting it to you, and was hoping you would have replied by now.


Hello? Progress?
8) Message boards : Number crunching : extreme long wu's (Message 2218)
Posted 10 May 2017 by Jacob Klein
Post:
Okay. If CPU Time between checkpoints is > 2 hours, it is a problem.
Do examples (like the one I emailed to you), help you to solve the problem??

Definitely, because I get proper data to find the source of the problem...

Have you learned anything new, using the example that I sent to you?
I put in hours of effort testing it and then getting it to you, and was hoping you would have replied by now.
9) Message boards : Number crunching : extreme long wu's (Message 2203)
Posted 3 May 2017 by Jacob Klein
Post:
Okay. If CPU Time between checkpoints is > 2 hours, it is a problem.
Do examples (like the one I emailed to you), help you to solve the problem??
10) Message boards : Number crunching : extreme long wu's (Message 2200)
Posted 3 May 2017 by Jacob Klein
Post:
Email sent. I hope that it helps you!
Question: You said "if you experience long delay in checkpoints" ...... but I need more info.
For a task that is working correctly: What is the longest amount of expected CPU Time between 2 checkpoints?
11) Message boards : Number crunching : extreme long wu's (Message 2197)
Posted 3 May 2017 by Jacob Klein
Post:
How long is this task supposed to run?
So far, it has run for 15.6 hours, even from a fresh standalone folder outside of BOINC, which only had the .exe and the param.in file... and is still running.
How long should I let it continue, for it to be useful for us to diagnose the issue?

OS: Windows 10 Pro x64, Insider Fast Ring, Build 16184

CPU: Intel i7-5960X

Executable: BHspin2_1_windows_intelx86.exe (Executed from an Admin Command Prompt)

param.in:
BHspin2 v:160803
SET num_tested 20000
SET hub_val 1000
SET idum -940000
SET OUTPUT 3
SET Mmina 5.0
SET Mminb 3.0
SET golambda 1
SET Beta 0.5
SET Fa 1.0
SET Sigma3 265
SET Sal -2.3
SET SS 0
SET ZZ 0.0001


error.dat:
error: in Renv_con() unknown Ka type: 1, iidd_old: 757382error: in Menv_con() unknown Ka type: 1, iidd_old: 757382


checkpoint.dat:
6000 703458 0 1 2


log.txt:
00:00:00 00:00:00 PROGRAM START: Tue May 02 22:52:30 2017
00:00:00 00:00:00 no checkpoint.dat file found00:00:00 00:00:00 cleaning checkpoints
00:00:00 00:00:00 gw_cpfile: source file "data0.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data1.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "data2.dat2" not present
00:00:00 00:00:00 gw_cpfile: source file "error.dat2" not present
00:00:00 00:00:00 reading checkpoint: istart: -1; pp: 0; n: -1
00:00:00 00:00:00 checkpoint read
00:00:00 00:00:00 default values set
00:00:00 00:00:00 Reading param.in file
00:00:00 00:00:00 PARAMIN: num_tested = 20000
00:00:00 00:00:00 PARAMIN: hub_val = 1000
00:00:00 00:00:00 PARAMIN: idum = -940000
00:00:00 00:00:00 PARAMIN: OUTPUT = 3
00:00:00 00:00:00 PARAMIN: Mmina = 5.0
00:00:00 00:00:00 PARAMIN: Mminb = 3.0
00:00:00 00:00:00 PARAMIN: golambda = 1
00:00:00 00:00:00 PARAMIN: Beta = 0.5
00:00:00 00:00:00 PARAMIN: Fa = 1.0
00:00:00 00:00:00 PARAMIN: Sigma3 = 265
00:00:00 00:00:00 PARAMIN: Sal = -2.3
00:00:00 00:00:00 PARAMIN: SS = 0
00:00:00 00:00:00 PARAMIN unknown parameter: name: SS; value: 0
00:00:00 00:00:00 PARAMIN: ZZ = 0.0001
00:00:00 00:00:00 param.in file read
00:00:00 00:00:00 idum: -940000; num_tested: 20000
00:03:54 00:03:54 making checkpoint: j: 1000; iidd: 119575
00:03:54 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:03:54 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:03:54 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:03:54 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:03:54 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:03:54 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:03:54 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:03:54 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:09:12 00:05:18 making checkpoint: j: 2000; iidd: 241537
00:09:12 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:09:12 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:09:12 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:09:12 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:09:12 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:09:12 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:09:12 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:09:12 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:14:35 00:05:23 making checkpoint: j: 3000; iidd: 360794
00:14:35 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:14:35 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:14:35 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:14:35 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:14:35 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:14:35 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:14:35 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:14:35 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:20:19 00:05:44 making checkpoint: j: 4000; iidd: 477940
00:20:19 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:20:19 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:20:19 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:20:19 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:20:19 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:20:19 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:20:19 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:20:19 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:25:39 00:05:20 making checkpoint: j: 5000; iidd: 594193
00:25:39 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:25:39 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:25:39 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:25:39 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:25:39 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:25:39 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:25:39 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:25:39 00:00:00 gw_cpfile: error.dat appended to error.dat3
00:31:26 00:05:47 making checkpoint: j: 6000; iidd: 703458
00:31:26 00:00:00 gw_cpfile: data0.dat appended to data0.dat2
00:31:26 00:00:00 gw_cpfile: data1.dat appended to data1.dat2
00:31:26 00:00:00 gw_cpfile: data2.dat appended to data2.dat2
00:31:26 00:00:00 gw_cpfile: error.dat appended to error.dat2
00:31:26 00:00:00 gw_cpfile: data0.dat appended to data0.dat3
00:31:26 00:00:00 gw_cpfile: data1.dat appended to data1.dat3
00:31:26 00:00:00 gw_cpfile: data2.dat appended to data2.dat3
00:31:26 00:00:00 gw_cpfile: error.dat appended to error.dat3
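For anyone who wants to check a log.txt like the one above without reading it by eye, here is a minimal Python sketch. It assumes the first time column is the elapsed time since program start and that checkpoints are always logged as "making checkpoint: j: ...; iidd: ..." (both inferred from the log above, not confirmed by the project). The function names and the 2-hour default limit are my own choices for illustration.

```python
import re
from datetime import timedelta

# Matches lines such as:
#   00:31:26 00:05:47 making checkpoint: j: 6000; iidd: 703458
CHECKPOINT_RE = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\s+\d+:\d{2}:\d{2}\s+"
    r"making checkpoint: j: (\d+); iidd: (\d+)"
)


def checkpoint_times(log_text):
    """Return a list of (elapsed, j) for every 'making checkpoint' line."""
    found = []
    for line in log_text.splitlines():
        match = CHECKPOINT_RE.match(line.strip())
        if match:
            hours, minutes, seconds, j, _iidd = match.groups()
            elapsed = timedelta(
                hours=int(hours), minutes=int(minutes), seconds=int(seconds)
            )
            found.append((elapsed, int(j)))
    return found


def stalled(log_text, now_elapsed, limit=timedelta(hours=2)):
    """True if more than `limit` has passed since the last logged checkpoint.

    `now_elapsed` is how long the program has been running so far.
    A log with no checkpoint at all is measured from program start.
    """
    checkpoints = checkpoint_times(log_text)
    last = checkpoints[-1][0] if checkpoints else timedelta(0)
    return now_elapsed - last > limit


if __name__ == "__main__":
    sample = (
        "00:25:39 00:05:20 making checkpoint: j: 5000; iidd: 594193\n"
        "00:25:39 00:00:00 gw_cpfile: data0.dat appended to data0.dat2\n"
        "00:31:26 00:05:47 making checkpoint: j: 6000; iidd: 703458\n"
    )
    # 15.6 hours of runtime, last checkpoint at 00:31:26
    print(stalled(sample, timedelta(hours=15, minutes=36)))
```

With the numbers from this post (last checkpoint at 00:31:26, still running at 15.6 hours), the gap is roughly 15 hours, far beyond the 5 to 6 minutes between the earlier checkpoints.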
12) Message boards : Number crunching : extreme long wu's (Message 2195)
Posted 2 May 2017 by Jacob Klein
Post:
Alright. I now have 2 standalone instances of the exe running:

1) "From beginning": Started with a folder that only had BHspin2_1_windows_intelx86.exe and param.in
2) "From checkpoint": Started with a folder that had everything the BOINC slots folder had, except I removed: boinc_lockfile, boinc_task_state.xml, init_data.xml, job.xml

We'll see how long each takes... and see if the "From checkpoint" one gets stuck in an infinite loop. If so, I could send you a .zip of the files, for you to try.
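In case anyone else wants to set up the "From checkpoint" folder the same way, here is a rough Python sketch of the copy step. The list of excluded files is the one given above; the function name and the example paths are hypothetical, and I am assuming the destination folder does not exist yet.

```python
import shutil
from pathlib import Path

# Files that tie the slot to the running BOINC client (list taken from this post).
BOINC_ONLY_FILES = (
    "boinc_lockfile",
    "boinc_task_state.xml",
    "init_data.xml",
    "job.xml",
)


def make_standalone_copy(slot_dir, dest_dir):
    """Copy a slot folder to dest_dir, leaving out the BOINC-only files.

    dest_dir must not already exist (shutil.copytree creates it).
    Returns the sorted names of the top-level entries that were copied.
    """
    shutil.copytree(
        slot_dir,
        dest_dir,
        ignore=shutil.ignore_patterns(*BOINC_ONLY_FILES),
    )
    return sorted(p.name for p in Path(dest_dir).iterdir())


# Example call (hypothetical paths, adjust to your own data directory):
# make_standalone_copy(r"C:\ProgramData\BOINC\slots\3", r"C:\temp\bhspin_from_checkpoint")
```

Copying rather than editing the slot in place leaves the task that BOINC is still managing untouched.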

I feel like I'm trying to solve this problem alone, sometimes.
13) Message boards : Number crunching : extreme long wu's (Message 2193)
Posted 2 May 2017 by Jacob Klein
Post:
If I try to run the
BHspin2_1_windows_intelx86.exe
... from an Admin Command Prompt, it runs for a couple seconds, then quits, and this is in the stderr.txt:

13:11:18 (12760): BOINC client no longer exists - exiting
13:11:18 (12760): timer handler: client dead, exiting

Should I try anything else?

More importantly:
Is any progress being made to fix this issue?
14) Message boards : Number crunching : extreme long wu's (Message 2184)
Posted 2 May 2017 by Jacob Klein
Post:
Please help me understand this.

I have a Universe@Home task with the following properties:
App: Universe BHspin v2 0.01
Task: universe_bh2_160803_107_1_20000_1-999999_940000_1
URL: http://universeathome.pl/universe/result.php?resultid=21698177
Elapsed: 11:32:39
CPU Time: 17:42:40
CPU Time since checkpoint: 17:10:42
Estimated time remaining: 1d 00:00:32
Status: Will not suspend correctly, and looks like it'll never complete.

Questions:
1) Does this task look like it is a "problem" task that will never complete?
2) What information can I get you to help you solve the problem? Note: I copied the slot and project folder, so I might be able to run it outside of BOINC (if you give me instructions how), if that helps.
3) Is any progress being made to fix this issue?

Setting "No New Tasks" again ... :/
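To put the numbers above in perspective, here is a small Python sketch of the arithmetic. The 2-hour limit is the figure mentioned elsewhere in this thread as the point where time between checkpoints becomes a problem; treating it as a hard cutoff, and the function names, are my own assumptions.

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Assumed cutoff: 2 hours of CPU time without a checkpoint.
LIMIT_SECONDS = 2 * 3600


def looks_stuck(cpu_since_checkpoint, limit_seconds=LIMIT_SECONDS):
    """True if the CPU time since the last checkpoint exceeds the limit."""
    return hms_to_seconds(cpu_since_checkpoint) > limit_seconds


if __name__ == "__main__":
    # "CPU Time since checkpoint" from the task in this post
    print(hms_to_seconds("17:10:42"), looks_stuck("17:10:42"))
```

For this task, 17:10:42 works out to 61842 seconds, more than eight times the 7200-second limit.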
15) Message boards : Number crunching : extreme long wu's (Message 2070)
Posted 26 Mar 2017 by Jacob Klein
Post:
Don't get me wrong, I was running it alongside other programs (I have other high priority RNA World VM tasks, GPUGrid GPU tasks, WuProp non-cpu-intensive tasks, etc.)

So, yeah, now it will run amidst even more, since I set the resource share to be equal to that of my other 10 projects. And I will continue to try to repro the issue.

Thanks.
16) Message boards : Number crunching : extreme long wu's (Message 2068)
Posted 25 Mar 2017 by Jacob Klein
Post:
After a week of setting this project with maximum resource share, in order to repro the problem, and monitoring the results closely, I have so far been unsuccessful in repro'ing it.

I will now set it for normal resource share, and try to keep a lookout for the problem.

Again, if anybody has the input files that recreate the issue, I'd LOVE to test with them. I wish I had saved mine from the problematic tasks earlier, before I aborted them.
17) Message boards : Number crunching : extreme long wu's (Message 2067)
Posted 24 Mar 2017 by Jacob Klein
Post:
Is there any way to send me the inputs to the tasks that I aborted, so I may try them again locally?
18) Message boards : Number crunching : extreme long wu's (Message 2058)
Posted 23 Mar 2017 by Jacob Klein
Post:
@Brummig

1: I'm glad you're not having the nasty problem in this thread. Please don't dump on others that do have the problem and are trying to fix it for everyone else.

2: Regarding work fetch, sounds like you're looking for blood. Good thing I'm kind :) I recommend turning on the <work_fetch_debug> option, using either Options > Event Log Options, or editing cc_config.xml ... then, when odd behavior happens, inspect the work fetch output to make sure it behaved correctly. If something looks odd or doesn't make sense, feel free to send me a private message, and I'll try to help you, if you remain patient.

Same team, bro.
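For the cc_config.xml route mentioned above, here is a minimal Python sketch that prints a config with the flag turned on. The function name is hypothetical. The layout (a log_flags section inside cc_config) is the standard one for BOINC client logging flags, but if you already have a cc_config.xml, merge the flag into it by hand rather than overwriting the file.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9 or newer
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config(["work_fetch_debug"]))
    # On Python 3.9+ this prints:
    # <cc_config>
    #   <log_flags>
    #     <work_fetch_debug>1</work_fetch_debug>
    #   </log_flags>
    # </cc_config>
```

The file goes in the BOINC data directory. The client picks it up on restart, or when you tell the Manager to re-read the config files.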
19) Message boards : Number crunching : extreme long wu's (Message 2055)
Posted 22 Mar 2017 by Jacob Klein
Post:
I want my resources to be busy and not wasted. That goes for all resource types -- my 4 PCs, their 16 CPU cores, their 6 GPUs, their 2 ASIC miners. I'm attached to projects that yield work for all those resource types.

BOINC aims to keep my resources busy and not wasted. It uses work fetch algorithms to provide that, so long as you are attached to projects that have work. BOINC expects those projects to work together, and uses things like Resource Share and Recent Estimated Credit to meet the user's requests.

For testing purposes, I am attached to all 60 projects. I routinely get and do work for about 15 of them. I expect the applications to behave - to not interfere with the other projects. I'm vocal about problems, because I want them fixed - some have been BOINC problems that I've worked with their devs to fix, others have been project problems that I've worked with their admins to fix.

If there's a problem, I'll report it, and try to get it fixed. If it is a problem where it actually hinders some other project, then I'll expect prompt action from the problem project - and this is the scenario we have here.

krzyszp:

I first saw the problem on January 27th, and posted about it, not knowing about this thread. This thread first saw the problem on December 28th.

So, that's been roughly 12 weeks.
And what progress do we have to show for it?
20) Message boards : Number crunching : extreme long wu's (Message 2052)
Posted 22 Mar 2017 by Jacob Klein
Post:
I'll answer.

I'm a Windows 10 Insider. I install and test builds of Windows 10, before they are released, using the Fast Ring, which has releases as fast as once every 1.5 days.

I also participate in RNA World, a project known for super long tasks, that checkpoint by using VirtualBox and doing VM snapshots. I've recently completed a task on their project that took 550 days CPU time.

And I'm a BOINC Alpha tester. I worked with David Anderson personally, to ensure that the work fetch algorithms that you rely on, work correctly.

I've been around. And I really want problems like this to be solved promptly.


Copyright © 2024 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek