Message boards : Number crunching : extreme long wu's

extreme long wu's

Jacob Klein

Joined: 21 Feb 15
Posts: 53
Credit: 1,383,221
RAC: 0
Message 2378 - Posted: 26 Sep 2017, 2:00:08 UTC

Admins:
Could you please try again to fix this?

I am offering to help out with additional testing to help reproduce, isolate, and fix this issue.

Until it is fixed, you are wasting resources, and I'm encouraging everyone I can to set "No New Tasks" on Universe@Home. Your lack of effort leaves no choice.
ID: 2378 · Rating: 0
Luigi R.

Joined: 10 Sep 15
Posts: 12
Credit: 5,407,933
RAC: 0
Message 2406 - Posted: 30 Sep 2017, 18:21:01 UTC
Last modified: 30 Sep 2017, 18:22:53 UTC

I noticed that a faulty task shows no "fraction done" progress.
I made a Bash script that suspends every task whose progress has not advanced for a set interval (default = 10 minutes). I hope this helps on the volunteer side.
However, in BOINC a project with suspended tasks does not get new tasks either. So if you don't have another CPU project, I suggest choosing a backup project (resource share = 0), because you won't get new tasks from U@H until you decide what to do (abort or resume) with the possibly stuck task.
You only have to modify the boinc_path and boinccmd variables.

https://pastebin.com/Rui6artX
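
For readers who cannot open the pastebin link, the detection idea can be sketched in Python. This is not Luigi's script, only a minimal sketch under assumptions: it assumes that `boinccmd --get_tasks` prints `name:`, `active_task_state:` and `fraction done:` lines for each task, and the function names `parse_tasks`, `running_progress` and `stalled_tasks` are hypothetical.

```python
def parse_tasks(get_tasks_output):
    """Split text shaped like `boinccmd --get_tasks` output into one dict per task."""
    tasks = []
    current = None
    for raw in get_tasks_output.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # separators such as "1) -----------"
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "name":  # assumed to be the first field of every task block
            current = {"name": value}
            tasks.append(current)
        elif current is not None:
            current[key] = value
    return tasks


def running_progress(get_tasks_output):
    """Map task name -> fraction done, for tasks that are currently executing."""
    return {
        t["name"]: float(t["fraction done"])
        for t in parse_tasks(get_tasks_output)
        if t.get("active_task_state") == "EXECUTING" and "fraction done" in t
    }


def stalled_tasks(before, after):
    """Names present in both snapshots whose fraction done did not increase."""
    return sorted(n for n, f in after.items() if n in before and f <= before[n])
```

Take one snapshot, wait for the chosen interval (10 minutes in the post above), take a second snapshot, and pass both results of `running_progress` to `stalled_tasks`. Tasks that are queued but not executing are left out on purpose, since their progress never moves.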
ID: 2406 · Rating: 0
hsdecalc

Joined: 2 Mar 15
Posts: 6
Credit: 2,662,437
RAC: 0
Message 2430 - Posted: 19 Oct 2017, 15:47:13 UTC

A set of >80 WUs ran without error, except for one.
Today I killed another endless job (26 h). It was still running, although BOINC showed it as "suppressed" (Windows 10, German).
https://universeathome.pl/universe/workunit.php?wuid=11890988

I have a copy of the slot-folder. File error.dat:
error: in Renv_con() unknown Ka type: 1, iidd_old: 197723error: in Menv_con() unknown Ka type: 1, iidd_old: 197723
ID: 2430 · Rating: 0
Jim1348

Joined: 28 Feb 15
Posts: 186
Credit: 53,077,414
RAC: 78,018
Message 2433 - Posted: 23 Oct 2017, 21:08:48 UTC - in response to Message 2362.  

Jim1348 wrote:
I am going to try a little trick, and see how it works.

I tried my trick (https://universeathome.pl/universe/forum_thread.php?id=199&postid=2362#2362) for a couple of weeks with no problems. Then, I disabled LHC entirely for the last week or two, and ran only Universe. But I just got a long runner today that was going for about 8 or 9 hours as I recall. However, a reboot fixed it and it seems to have completed normally. So the long runners still occur about once a month, though it is a minor problem for me.
ID: 2433 · Rating: 0
Jim1348

Joined: 28 Feb 15
Posts: 186
Credit: 53,077,414
RAC: 78,018
Message 2435 - Posted: 24 Oct 2017, 14:59:08 UTC - in response to Message 2433.  

Jim1348 wrote:
But I just got a long runner today that was going for about 8 or 9 hours as I recall. However, a reboot fixed it and it seems to have completed normally.

Apparently the reboot did not fix it. I kept track of the work unit number this time, and the reboot just reset the time to a low value (30 minutes), but it was still stuck at 52% complete (at 19 hours before the reboot).

Here it is, and I will abort it shortly:
https://universeathome.pl/universe/result.php?resultid=27353831

But it completed normally on another machine, so whatever causes them to get stuck seems to be almost random. There may be nothing "wrong" with the work units, just something external in the system that sometimes causes problems. It will be hard to track down.
ID: 2435 · Rating: 0
Luigi R.

Joined: 10 Sep 15
Posts: 12
Credit: 5,407,933
RAC: 0
Message 2436 - Posted: 24 Oct 2017, 20:14:37 UTC
Last modified: 24 Oct 2017, 20:17:26 UTC

My script caught a faulty task today. A lot of hours saved. :)

https://universeathome.pl/universe/result.php?resultid=27717637

ID: 2436 · Rating: 0
hsdecalc

Joined: 2 Mar 15
Posts: 6
Credit: 2,662,437
RAC: 0
Message 2442 - Posted: 29 Oct 2017, 15:21:02 UTC

It happened again yesterday: one WU whose latest checkpoint was 30 hours old!
I terminated BOINC and restarted it.
The job resumed from the last checkpoint (1:07 runtime).
Two hours later there was still no new checkpoint. It was an endlessly running job, so I aborted it.
Conclusion: no new tasks for the next few months!
ID: 2442 · Rating: 0
Jim1348

Joined: 28 Feb 15
Posts: 186
Credit: 53,077,414
RAC: 78,018
Message 2443 - Posted: 30 Oct 2017, 15:14:04 UTC - in response to Message 1849.  
Last modified: 30 Oct 2017, 15:20:12 UTC

I have moved my Universe work to an Ivy Bridge (i7-3770) machine to see if it changes the long runners from what I was getting on my Haswell (i7-4770) machine. Both are running Ubuntu 16.04.3 and Linux 4.10.0-37. They are the only CPU work units running, and I have been getting a long runner only every six weeks or so.
If that doesn't do it, I will try a later version of Linux.
ID: 2443 · Rating: 0
Jim1348

Joined: 28 Feb 15
Posts: 186
Credit: 53,077,414
RAC: 78,018
Message 2444 - Posted: 1 Nov 2017, 16:49:23 UTC - in response to Message 2443.  

Jim1348 wrote:
I have moved my Universe work to an Ivy Bridge (i7-3770) machine to see if it changes the long runners from what I was getting on my Haswell (i7-4770) machine. Both are running Ubuntu 16.04.3 and Linux 4.10.0-37. They are the only CPU work units running, and I have been getting a long runner only every six weeks or so.
If that doesn't do it, I will try a later version of Linux.

I tried it on two Ivy Bridge machines (both i7-3770 with identical motherboards), and picked up long runners on both right away. One was running Ubuntu 16.04 (Linux 4.10.0-37) and the other Ubuntu 17.10 (Linux 4.13.0-16). So changing the OS or the CPUs does not fix it.

But all the long runners were picked up in the batch right after a reboot. That may explain why I normally don't get them very often, because my machines run 24/7 for weeks at a time. I will just leave this one running and see how it does.
ID: 2444 · Rating: 0
Jim1348

Joined: 28 Feb 15
Posts: 186
Credit: 53,077,414
RAC: 78,018
Message 2447 - Posted: 5 Nov 2017, 9:42:28 UTC - in response to Message 2444.  

Jim1348 wrote:
But all the long runners were picked up in the batch right after a reboot. That may explain why I normally don't get them very often, because my machines run 24/7 for weeks at a time. I will just leave this one running and see how it does.

I had to reboot on 3 November to install a utility, but that did not cause any long runners this time. The long runners are probably induced when I pause Universe work units in order to allow others to finish. This would also explain why running multiple projects could produce long runners, since the Universe WUs would then be paused occasionally by the BOINC scheduler. But I have not really tested this yet; I am just letting the PC run with only Universe work units to see how far it can go without a long runner. My previous record was somewhere around 2 months, so maybe this time I can exceed it by not doing anything to pause a WU. Hopefully there will be no forced shutdowns, etc. during the test either.
ID: 2447 · Rating: 0
Brummig

Joined: 23 Mar 16
Posts: 57
Credit: 6,786,842
RAC: 10,656
Message 2448 - Posted: 5 Nov 2017, 11:56:26 UTC - in response to Message 2447.  
Last modified: 5 Nov 2017, 12:07:49 UTC

My Ubuntu 16.04 PC has a credit of 203,000, and gets completely shut down every day (hibernation has never worked properly). I don't do anything special for BOINC when shutting down. I've yet to experience a "long runner" WU on this or any of my other machines running U@H (which do occasionally get shut down completely, but not very often). So I'm not convinced you're on the right track there, but maybe I just need to crunch more WUs, and I don't have any better ideas.

However, I can't help but notice that there appears to be a problem with a couple of WUs successfully crunched by two of my machines (both R-Pi):

https://universeathome.pl/universe/workunit.php?wuid=11721442
https://universeathome.pl/universe/workunit.php?wuid=12326875

These appear to be causing indigestion for other machines that ordinarily crunch without trouble. One machine went "over time", and terminated the task. See also:

https://universeathome.pl/universe/forum_thread.php?id=282
ID: 2448 · Rating: 0
Jim1348

Joined: 28 Feb 15
Posts: 186
Credit: 53,077,414
RAC: 78,018
Message 2449 - Posted: 5 Nov 2017, 13:21:36 UTC - in response to Message 2448.  
Last modified: 5 Nov 2017, 13:56:02 UTC

Brummig wrote:
My Ubuntu 16.04 PC has a credit of 203,000, and gets completely shut down every day (hibernation has never worked properly). I don't do anything special for BOINC when shutting down. I've yet to experience a "long runner" WU on this or any of my other machines running U@H (which do occasionally get shut down completely, but not very often). So I'm not convinced you're on the right track there, but maybe I just need to crunch more WUs, and I don't have any better ideas.

Your experience shows that R-Pi machines may not have the problem, even though the Intel machines do. However, you really don't have enough samples to draw much of a conclusion. I usually have to go several weeks of continuous operation before getting one (total credit of 9,440,456 over several machines now). And often when I check my "long runners", I find that they have completed successfully on other machines, and I think vice-versa. So it is some seemingly random phenomenon, and not necessarily "bad" work units (or at least not different from any others, and they all share some weakness).

I have informed the project developer of several of my aborted work units, as he requested, but apparently he could not reproduce it.

EDIT: My point may not have been clear, but I now think that the reboot of this machine did not induce the long runner. Rather, when I attached this machine to Universe, I probably paused the newly-downloaded Universe work units in order to allow some other project (Rosetta as I recall) to finish. That may have induced the problem, but that is what I am testing now.
ID: 2449 · Rating: 0
Brummig

Joined: 23 Mar 16
Posts: 57
Credit: 6,786,842
RAC: 10,656
Message 2454 - Posted: 6 Nov 2017, 9:00:24 UTC - in response to Message 2449.  

Just recently I've been playing with BOINC on Android. Initially I had it only crunch when the charger was plugged in, but every time I switched the phone to another power socket, BOINC tasks would suspend and I would lose a percentage of the progress. These losses are often quite large, and certainly more than could be explained by how often BOINC writes to disk. One U@H task lost 100 percentage points after days of crunching. So there does appear to be something wrong with what happens when a task suspends.
ID: 2454 · Rating: 0
Thyme Lawn

Joined: 15 Oct 17
Posts: 3
Credit: 2,133,477
RAC: 2,147
Message 2456 - Posted: 7 Nov 2017, 23:14:39 UTC - in response to Message 2375.  

Quote from Message 2375:
I have regular endless tasks too, on a Windows 10 PC which runs 24/7. After aborting the last WU after 14 hours, I found the process still running in Task Manager two days later!!! So the process wasn't cancelled, only removed from BOINC. I have wasted 80 hours. So I can't run any more WUs because of this bad behavior.

My i7-6700K Windows 10 Pro system has just had its first task which got stuck in a loop (the _2 task in WU universe_bh2_160803_202_16199984_20000_1-999999_200200, with the _1 task being successfully completed using the android_arm_pie application).

The task made a single checkpoint after 0:03:27 and was aborted in BOINC Manager 5:10:29 later. It must have been in a loop because it kept running and had to be manually killed from Process Explorer (i.e. it was never executing the BOINC API code which checks for messages from the BOINC core client). It was run with no interruptions and BOINC's event log shows that the _0 task from universe_bh2_160803_202_94384906_20000_1-999999_385200 was successfully completed 1:14:40 before the problem task started:

07-Nov-2017 14:24:10 [Universe@Home] Starting task universe_bh2_160803_202_94384906_20000_1-999999_385200_0
07-Nov-2017 14:27:50 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 14:31:45 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 14:35:22 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 14:38:51 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 14:42:32 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 14:46:01 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 14:49:49 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 14:53:40 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 14:57:07 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:00:52 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:04:41 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:08:18 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:11:52 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:15:28 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:19:21 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:23:07 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:26:53 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:30:36 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:34:29 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:38:01 [Universe@Home] [checkpoint] result universe_bh2_160803_202_94384906_20000_1-999999_385200_0 checkpointed
07-Nov-2017 15:38:03 [Universe@Home] Computation for task universe_bh2_160803_202_94384906_20000_1-999999_385200_0 finished

07-Nov-2017 16:52:43 [Universe@Home] Starting task universe_bh2_160803_202_16199984_20000_1-999999_200200_2
07-Nov-2017 16:56:10 [Universe@Home] [checkpoint] result universe_bh2_160803_202_16199984_20000_1-999999_200200_2 checkpointed
07-Nov-2017 22:06:39 [Universe@Home] task universe_bh2_160803_202_16199984_20000_1-999999_200200_2 aborted by user
07-Nov-2017 22:07:38 [Universe@Home] [task] abort request timed out, killing task universe_bh2_160803_202_16199984_20000_1-999999_200200_2
07-Nov-2017 22:07:39 [Universe@Home] [task] abort request timed out, killing task universe_bh2_160803_202_16199984_20000_1-999999_200200_2

<--- Timeout every second until task was killed --->

07-Nov-2017 22:10:23 [Universe@Home] [task] abort request timed out, killing task universe_bh2_160803_202_16199984_20000_1-999999_200200_2
07-Nov-2017 22:10:24 [Universe@Home] [task] abort request timed out, killing task universe_bh2_160803_202_16199984_20000_1-999999_200200_2
07-Nov-2017 22:10:25 [Universe@Home] Computation for task universe_bh2_160803_202_16199984_20000_1-999999_200200_2 finished
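
The intervals discussed in this post can be recomputed from event log lines like the ones above. The following Python sketch assumes that every line starts with a `DD-Mon-YYYY HH:MM:SS` timestamp and that the phrases "Starting task", "checkpointed" and "aborted by user" appear as shown in this log; the helper names `parse_event_time` and `task_timeline` are hypothetical.

```python
from datetime import datetime

# English month abbreviations as they appear in the log above.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_event_time(line):
    """Read the leading timestamp, e.g. '07-Nov-2017 16:52:43'."""
    date_part, time_part = line.split()[:2]
    day, month, year = date_part.split("-")
    hour, minute, second = time_part.split(":")
    return datetime(int(year), MONTHS[month], int(day),
                    int(hour), int(minute), int(second))


def task_timeline(log_lines, task_name):
    """Return (start, last_checkpoint, abort) times for one task; None if not seen."""
    start = last_checkpoint = abort = None
    for line in log_lines:
        if task_name not in line:
            continue
        when = parse_event_time(line)
        if "Starting task" in line:
            start = when
        elif "checkpointed" in line:
            last_checkpoint = when
        elif "aborted by user" in line:
            abort = when
    return start, last_checkpoint, abort
```

Subtracting the returned values gives `timedelta` objects: `last_checkpoint - start` is the time until the only checkpoint, and `abort - last_checkpoint` is how long the task ran without checkpointing before it was aborted.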

"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 2456 · Rating: 0
Jim1348

Joined: 28 Feb 15
Posts: 186
Credit: 53,077,414
RAC: 78,018
Message 2463 - Posted: 8 Nov 2017, 19:20:08 UTC - in response to Message 2447.  

After crunching for a week with no problems (setup as noted above with no suspensions or other projects running), I picked up four long runners in quick succession over a 24-hour period. They ran for about 17 hours and then got stuck.

But rather than aborting them, I rebooted and two of them got unstuck and completed normally.
https://universeathome.pl/universe/result.php?resultid=28274101
https://universeathome.pl/universe/result.php?resultid=28267843

However, the last two remain stuck. So after the first two completed, I rebooted again. Surprisingly, the last two then completed normally also.
https://universeathome.pl/universe/result.php?resultid=28274042
https://universeathome.pl/universe/result.php?resultid=28274043

So there are no "bad" work units, only something causing them to get stuck at random intervals ranging from days to about two months apart for me. That is strange. Maybe it is a hyper-threading problem, or a CPU cache limitation that makes some of them hang? I don't know at this point.
ID: 2463 · Rating: 0
xii5ku

Joined: 9 Nov 17
Posts: 7
Credit: 59,625,667
RAC: 4,906
Message 2486 - Posted: 12 Nov 2017, 22:19:23 UTC - in response to Message 2406.  
Last modified: 12 Nov 2017, 22:31:54 UTC

Luigi R. wrote:
I noticed that a faulty task shows no "fraction done" progress.
I made a Bash script that suspends every task whose progress has not advanced for a set interval (default = 10 minutes). I hope this helps on the volunteer side.
However, in BOINC a project with suspended tasks does not get new tasks either. So if you don't have another CPU project, I suggest choosing a backup project (resource share = 0), because you won't get new tasks from U@H until you decide what to do (abort or resume) with the possibly stuck task.
You only have to modify the boinc_path and boinccmd variables.

https://pastebin.com/Rui6artX

Luigi, thank you very much for this script. It's a workaround, but an excellent one.
ID: 2486 · Rating: 0
xii5ku

Joined: 9 Nov 17
Posts: 7
Credit: 59,625,667
RAC: 4,906
Message 2487 - Posted: 12 Nov 2017, 22:29:57 UTC - in response to Message 2463.  
Last modified: 12 Nov 2017, 22:31:39 UTC

Jim1348 wrote:
After crunching for a week with no problems (setup as noted above with no suspensions or other projects running), I picked up four long runners in quick succession over a 24-hour period. They ran for about 17 hours and then got stuck.

But rather than aborting them, I rebooted and two of them got unstuck and completed normally.
[...]
However, the last two remain stuck. So after the first two completed, I rebooted again. Surprisingly, the last two then completed normally also.
[...]
So there are no "bad" work units, only something causing them to get stuck at random intervals ranging from days to about two months apart for me. That is strange. Maybe it is a hyper-threading problem, or a CPU cache limitation that makes some of them hang? I don't know at this point.

It seems random to me. If I switch the option "Leave non-GPU tasks in memory while suspended" off, suspend tasks which get stuck, and a while later un-suspend them, I have seen any one of the following outcomes:

  • Resumed task finishes properly.
  • Resumed task gets stuck again after a while. Suspend and resume again once or twice, and it finishes properly.
  • Resumed task gets stuck again, even after retries. Abort it.


So far I have had a handful of these on Linux, and could finish them all after suspend + resume. I have also had a handful on Windows, and could only finish one but had to abort the others (since multiple retries of suspend and resume did not get them any further).

Note that a stuck task does not suspend immediately after requesting it to do so; it needs some time (to reach a checkpoint? or simply to poll for a signal?).

Furthermore, I have not yet tried to suspend a stuck task with the option "Leave non-GPU tasks in memory while suspended" switched on. My guess is this won't help repair a stuck task.
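
The suspend, resume, retry, then abort procedure described in this post could be automated roughly as follows. This is a sketch, not a tested tool: it assumes the `boinccmd --task URL task_name op` command form with the operations `suspend`, `resume` and `abort`, and the names `boinccmd_task_args`, `recover_stuck_task`, `run_op` and `progresses_after_resume` are hypothetical.

```python
def boinccmd_task_args(project_url, task_name, op):
    """Build the argument list for `boinccmd --task URL NAME OP`."""
    if op not in ("suspend", "resume", "abort"):
        raise ValueError(f"unsupported operation: {op}")
    return ["boinccmd", "--task", project_url, task_name, op]


def recover_stuck_task(run_op, progresses_after_resume, max_retries=3):
    """Suspend and resume a stuck task a few times; abort it if it never recovers.

    run_op(op) carries out "suspend", "resume" or "abort" on the task.
    progresses_after_resume() waits a while, then reports whether the task
    is making progress again.
    Returns "recovered" or "aborted".
    """
    for _ in range(max_retries):
        run_op("suspend")
        run_op("resume")
        if progresses_after_resume():
            return "recovered"
    run_op("abort")
    return "aborted"
```

In practice `run_op` could be `lambda op: subprocess.run(boinccmd_task_args(url, name, op), check=True)`. Because a stuck task does not suspend immediately, `run_op` should also wait until the suspend has taken effect before returning, and "Leave non-GPU tasks in memory while suspended" has to be off for the task to be unloaded at all.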

ID: 2487 · Rating: 0
Jim1348

Joined: 28 Feb 15
Posts: 186
Credit: 53,077,414
RAC: 78,018
Message 2488 - Posted: 12 Nov 2017, 22:57:25 UTC - in response to Message 2487.  


xii5ku wrote:
It seems random to me. If I switch the option "Leave non-GPU tasks in memory while suspended" off, suspend tasks which get stuck, and a while later un-suspend them, I have seen any one of the following outcomes:

  • Resumed task finishes properly.
  • Resumed task gets stuck again after a while. Suspend and resume again once or twice, and it finishes properly.
  • Resumed task gets stuck again, even after retries. Abort it.



Actually, I have also seen situations where a reboot did not fix the stuck ones and they have to be aborted. I think you have a good list of the possibilities.

I will mention another possibility. I built a new Ryzen 1700 machine on 10 November, and have been running BHspin v2.0 on 15 cores (with one core being reserved for CPU support) for 2 1/2 days now with no long runners. That does not prove anything yet, but I am impressed with the consistency of the run times, which all range from 1.5 hours to a little over 4 hours, averaging around 2 1/2 hours. That is about the same average as for my i7-3770 machine (both on Ubuntu), but the times ranged from 1.5 hours to around 12 hours on the Intel chip. My guess is that this has something to do with the "long runners", and the more consistent performance of the AMD chip hopefully means there will be fewer (or none) of them. But it will have to run for at least several weeks to get some idea about that.
ID: 2488 · Rating: 0
Jim1348

Joined: 28 Feb 15
Posts: 186
Credit: 53,077,414
RAC: 78,018
Message 2494 - Posted: 14 Nov 2017, 15:32:41 UTC - in response to Message 2488.  

My Ryzen machine has now picked up two long-runners, so it is not immune either. After a reboot, both started working again, but there is a hitch. The first one to get stuck resumed at the beginning (0 percent), while the second one to get stuck resumed progress where it left off at the time of the reboot (around 30 percent).

Also, there are now two other work units that are past the 6 1/2 hour point, even though they are continuing to make progress. But that is a little long for this machine, and I have not seen it before. So my present guess is that the first long runner induces problems with the other work units that are running at the time, causing them to slow down and eventually get stuck too.

Maybe it is due to the first long runner using up too much cache memory, or some other common resource? I have to leave it to the experts here. I am out of hardware/software combinations to try. Good luck to all.
ID: 2494 · Rating: 0
Dzordzik

Joined: 4 Nov 17
Posts: 3
Credit: 22,130,767
RAC: 4
Message 2495 - Posted: 14 Nov 2017, 17:50:33 UTC

Hi, on my dual Xeon I have been getting many long WUs in the last few days that stay stuck. I cancelled them, but they waste a lot of resources. Will this be fixed, or is it better to stop working for this project until it is? I can send you the numbers of these WUs ...
ID: 2495 · Rating: 0