Message boards : Number crunching : extreme long wu's
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 23
Message 2036 - Posted: 22 Mar 2017, 9:17:09 UTC - in response to Message 2035.  
Last modified: 22 Mar 2017, 9:17:31 UTC

I wonder how you would feel if it was one of your systems that was having its resources stolen.

That's exactly what CPDN was doing; running for hours and then crashing, usually with no credit. In doing nothing useful it was simply a waste of electricity, which ironically meant it was contributing to climate change for no benefit. It also meant it was delaying getting tasks completed (after spending days getting nowhere on my PC, they would be picked up by another host and then complete). So I dropped CPDN and switched to WCG.
ID: 2036 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2037 - Posted: 22 Mar 2017, 9:21:43 UTC
Last modified: 22 Mar 2017, 9:24:14 UTC

This is the here and now. You are advocating that he continue to waste other people's machine resources because, basically, "you're alright". How noble.
ID: 2037 · Rating: 0
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 23
Message 2038 - Posted: 22 Mar 2017, 9:22:41 UTC - in response to Message 2033.  
Last modified: 22 Mar 2017, 9:23:42 UTC

There is the case now on ATLAS that running them by themselves works, while running them with other projects at the same time results in validation errors

Which is interesting, because I have been running ATLAS for a long time alongside WCG, FiND, and various flavours of LHC@Home, with no problems. On the other hand, maybe ATLAS was the reason why I had problems with CPDN :).
ID: 2038 · Rating: 0
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 23
Message 2039 - Posted: 22 Mar 2017, 9:30:17 UTC - in response to Message 2037.  

You are advocating he continues to waste other people's machine resources

No, I'm saying if the project doesn't work for you right now, why not simply drop it and try again later? Then it won't waste any of your resources. Why should everyone be forced to drop the project just because some hosts are having problems?
ID: 2039 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2040 - Posted: 22 Mar 2017, 9:34:24 UTC - in response to Message 2039.  

It is not just me; it could be a large number of people who don't or can't pay careful attention to their systems. I have never suggested anyone drop the project. I have suggested, quite reasonably, that he fix it and not make the problem any worse by pumping out work units.
ID: 2040 · Rating: 0
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 23
Message 2041 - Posted: 22 Mar 2017, 10:31:55 UTC - in response to Message 2040.  
Last modified: 22 Mar 2017, 10:44:18 UTC

krzyszp has said on this thread:

The very long units are still a mystery, as it doesn't happen very often and only on some computers (e.g. on one of my machines it never happens).

and

It's never a "batch" of tasks; it's always just a few tasks (in the worst batch it was 14 WUs) in a batch of around 40,000, and they mostly calculate correctly on the wingman's machine. This is why it is so difficult to find a result.

The corollary of which is that the tasks are running just fine for most people. Yes, it's annoying if you're in that minority for which the tasks are problematic, yes krzyszp needs to find a solution, but again I ask, why should he stop sending tasks to the majority because a minority have a problem, given that the minority can temporarily handle the problem simply by clicking that "No new tasks" button?

And by the way, I check on my Pi most days, because every so often it encounters a problem that requires my attention. Even though it's headless and lives in an opaque box, it's a quick and simple task; I just look to see if it is returning results.
ID: 2041 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2042 - Posted: 22 Mar 2017, 12:19:00 UTC
Last modified: 22 Mar 2017, 12:19:26 UTC

The difference between a CPDN task failing, and the Universe@Home problem in this thread, is that a CPDN task may fail at some point, while this Universe@Home problem will allow a task to run indefinitely.

If you are unable to appreciate that difference, then you are unable to understand the true concern.

I am desperately trying to repro the issue, to help the admin fix it, because an "indefinitely-running-task" is the worst kind of BOINC task, across all 60 of my projects (including my own CPDN task failures).

Regards,
Jacob.
ID: 2042 · Rating: 0
Jim1348
Joined: 28 Feb 15
Posts: 253
Credit: 200,562,581
RAC: 0
Message 2043 - Posted: 22 Mar 2017, 12:28:22 UTC

I am sure the admins appreciate everyone's concern for the well-being of the project. But the comparison with CPDN is amusing, and a bit more than I can bear without comment. They have people there who run for months, and even years, without returning a successful work unit. It is often because they have misconfigured machines, trying to run the 32-bit Linux version on 64-bit machines, or whatever else they can dream up. They obviously never check their results. That really is not the project's fault, whether it is CPDN or Universe.

There is an entire thread on it: very instructive.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7674
ID: 2043 · Rating: 0
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 23
Message 2044 - Posted: 22 Mar 2017, 13:28:01 UTC - in response to Message 2042.  
Last modified: 22 Mar 2017, 13:29:21 UTC

Of course if you want to help krzyszp, then you will need to keep crunching to spot the errant tasks, but anyone else can just kill the task and (if it becomes a routine problem for them) click "No new tasks", leaving the majority for whom it isn't a problem to carry on. At least a task running indefinitely will start to stand out as it drops down the list but remains as "In progress".

Universe@Home also has machines that run for long periods of time grabbing tasks and never successfully completing them. I sometimes get them as wingmen. If you look in the message boards you will see that krzyszp did have a go at addressing the problem a while back. I guess it's inevitable in a DC project that there will be people who don't bother to manage their machines properly. The old adage "when the cat's away, the mice will play" applies very much to computers.
ID: 2044 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2045 - Posted: 22 Mar 2017, 13:33:52 UTC - in response to Message 2044.  

At least a task running indefinitely will start to stand out as it drops down the list but remains as "In progress".


If a machine is running unattended, will it matter where it is on the list of running tasks, as it wastes the resource? Sigh.
ID: 2045 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2046 - Posted: 22 Mar 2017, 14:41:05 UTC - in response to Message 2041.  

Because his current application is stealing computer time from other projects. The BOINC work ethic is "fit it and forget it". By continuing to issue work units with the problem, he goes against that ethic, and he should be held accountable for that.
ID: 2046 · Rating: 0
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 23
Message 2047 - Posted: 22 Mar 2017, 14:45:30 UTC - in response to Message 2045.  
Last modified: 22 Mar 2017, 14:52:48 UTC

The task will appear on the appropriate task list on this website, slowly dropping down the list as other tasks complete, but remaining marked as "In Progress". That tells you from anywhere on the planet that has net access that the machine running that task has a problem, and it needs your attention. Of course any computer, attended or unattended, can go awry for any number of reasons. If you have a dedicated crunching machine that runs out of disk space because some other process generated a set of monster files (as happened today on my Pi, and spotted because the task list on this website showed it only had one task in progress), you will have a machine that is doing nothing, wasting electricity. Such problems can be addressed without physically attending the machine (I used ssh), but if you don't want to even attend the machine remotely, then you must live with the consequences. It's unreasonable to ask everyone else to stop work because you have a problem.

And yes, it would be nice if BOINC tasks ran flawlessly, but they are written by humans and run on machines over which the BOINC task author has no control, so it should be no surprise that intervention is often required by the person maintaining the machine.
ID: 2047 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2048 - Posted: 22 Mar 2017, 14:51:49 UTC - in response to Message 2047.  

So now you're telling me I should monitor the 60 projects that my computer swarm are attached to? What a joke.

I'm done arguing with you. I will continue to try to solve this problem as much as I can, with or without your help.
ID: 2048 · Rating: 0
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 23
Message 2049 - Posted: 22 Mar 2017, 14:54:32 UTC - in response to Message 2048.  

So now you're telling me I should monitor the 60 projects that my computer swarm are attached to

Well if you take on that task and don't have the resources to manage it, why should others be prevented from working?
ID: 2049 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2050 - Posted: 22 Mar 2017, 15:35:19 UTC - in response to Message 2049.  
Last modified: 22 Mar 2017, 15:43:47 UTC

How well do you know BOINC? :)

Look, if the task would error out reasonably, on its own, when it detected the problem, then your reply may be sufficient. BOINC was built to be resilient, both client-side and server-side, when tasks error out on their own.

However, the mechanisms to handle this "unending task" situation are a bit different, and raise a higher alarm and concern.

I believe, in the case of an "unending task", BOINC relies on the <rsc_fpops_bound> configuration.

This project uses the following for that setting, for this "Universe BHspin v2 0.01" application, as seen in client_state.xml:
<rsc_fpops_bound>999999999999999.000000</rsc_fpops_bound>


My 4 main computers have the following characteristics, as seen in client_state.xml:
<p_fpops>4312814363.944559</p_fpops>
<p_fpops>3788808068.659085</p_fpops>
<p_fpops>2768310915.476756</p_fpops>
<p_fpops>1770864494.146483</p_fpops>
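
For anyone who wants to pull the same two values out of their own client_state.xml, here is a minimal sketch using Python's standard XML parser. The sample fragment below is a hypothetical, cut-down stand-in for the real file (the nesting under host_info and workunit and the name "example_wu" are assumptions for illustration), and the helper name read_fpops_settings is made up:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml.
# Only the two tags quoted in this post come from the real file;
# the surrounding layout and the workunit name are assumptions.
SAMPLE = """<client_state>
  <host_info>
    <p_fpops>4312814363.944559</p_fpops>
  </host_info>
  <workunit>
    <name>example_wu</name>
    <rsc_fpops_bound>999999999999999.000000</rsc_fpops_bound>
  </workunit>
</client_state>"""


def read_fpops_settings(xml_text):
    """Return (host p_fpops, {workunit name: rsc_fpops_bound})."""
    root = ET.fromstring(xml_text)
    # First <p_fpops> found anywhere in the tree: the host benchmark figure.
    p_fpops = float(root.findtext(".//p_fpops"))
    # One entry per <workunit> element that carries a bound.
    bounds = {}
    for wu in root.iter("workunit"):
        bound = wu.findtext("rsc_fpops_bound")
        if bound is not None:
            bounds[wu.findtext("name")] = float(bound)
    return p_fpops, bounds


p_fpops, bounds = read_fpops_settings(SAMPLE)
print(p_fpops, bounds)
```

To try it on a real machine, read the file's contents into a string and pass that in instead of SAMPLE; whether a given client_state.xml parses cleanly as strict XML is not something this sketch guarantees.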


I'm not sure how much "checkpoint" time is banked by the task, before the error presents itself. Let's assume none.

So, to my knowledge, this means that the task must run continuously for this long, on each of my machines, before BOINC will kill it for reason "<rsc_fpops_bound> exceeded":
999999999999999.000000 / 4312814363.944559 = 231867 seconds = 2.7 days
999999999999999.000000 / 3788808068.659085 = 263935 seconds = 3.1 days
999999999999999.000000 / 2768310915.476756 = 361231 seconds = 4.2 days
999999999999999.000000 / 1770864494.146483 = 564696 seconds = 6.5 days
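
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the estimate as stated here: it assumes the kill limit is simply rsc_fpops_bound divided by the host's p_fpops, which may not be exactly how the client computes it (it may use a per-application speed estimate instead). The function names are made up for illustration:

```python
# Values quoted from client_state.xml in this post.
RSC_FPOPS_BOUND = 999999999999999.0
P_FPOPS = [
    4312814363.944559,
    3788808068.659085,
    2768310915.476756,
    1770864494.146483,
]

SECONDS_PER_DAY = 86400


def seconds_until_abort(fpops_bound, host_fpops):
    """Estimated continuous run time before the bound is exceeded.

    Assumption: limit = rsc_fpops_bound / p_fpops, as in the post above.
    """
    return fpops_bound / host_fpops


def days_until_abort(fpops_bound, host_fpops):
    """Same estimate, expressed in days."""
    return seconds_until_abort(fpops_bound, host_fpops) / SECONDS_PER_DAY


for host_fpops in P_FPOPS:
    secs = seconds_until_abort(RSC_FPOPS_BOUND, host_fpops)
    days = days_until_abort(RSC_FPOPS_BOUND, host_fpops)
    print(f"{host_fpops:.0f} flops/s: {secs:.0f} seconds = {days:.1f} days")
```

The slower the host, the longer a runaway task survives, which is why the 1.77 GFLOPS machine comes out at roughly a week.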

So ... at least we have that, maybe, especially for unattended setups: the task should die if run continuously for a week. But if the task gets restarted, I think it loses any time that had accumulated since the last checkpoint, which is why I said it must run continuously.

However, I'm typically restarting my machines every 1.5 days. :/ Which is why this is a problem for me.

There, now you know more about BOINC's coping mechanisms.
It is still a problem we should try harder to fix.

It'd be interesting to know how many tasks have failed due to "<rsc_fpops_bound> exceeded". And sad.
ID: 2050 · Rating: 0
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 23
Message 2051 - Posted: 22 Mar 2017, 16:01:00 UTC - in response to Message 2050.  
Last modified: 22 Mar 2017, 16:01:33 UTC

I have to ask, why do you restart your machines so frequently? Until the problem is fixed, maybe the workaround for you is to address the frequency of restarts.

BTW, I have encountered long running WU's before, but on a completely different project:
http://atlasathome.cern.ch/forum_thread.php?id=360

That too has been a bit of a mystery, coming and going seemingly at random. It affected me, and it required a lot of manual intervention when it did so.
ID: 2051 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2052 - Posted: 22 Mar 2017, 19:44:24 UTC - in response to Message 2051.  
Last modified: 22 Mar 2017, 19:45:00 UTC

I'll answer.

I'm a Windows 10 Insider. I install and test builds of Windows 10, before they are released, using the Fast Ring, which has releases as fast as once every 1.5 days.

I also participate in RNA World, a project known for super long tasks that checkpoint by using VirtualBox and taking VM snapshots. I've recently completed a task on their project that took 550 days of CPU time.

And I'm a BOINC Alpha tester. I worked with David Anderson personally, to ensure that the work fetch algorithms that you rely on, work correctly.

I've been around. And I really want problems like this to be solved promptly.
ID: 2052 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2053 - Posted: 22 Mar 2017, 19:58:33 UTC - in response to Message 2052.  
Last modified: 22 Mar 2017, 20:10:45 UTC

Me too. My machines run 24/7/365; they run BOINC and web servers. I restart a machine when I need to. The restarts might be months apart.

Off topic, but I have had no work from RNA for months.
ID: 2053 · Rating: 0
Jim1348
Joined: 28 Feb 15
Posts: 253
Credit: 200,562,581
RAC: 0
Message 2054 - Posted: 22 Mar 2017, 20:32:01 UTC

I run my machines 24/7 too, but I normally check them every day. If I leave home for a while, I might switch them to another project, depending on the state of things.

But the purpose of Universe (or any project) is to get the work of that project done, not to give me something to do. If we can meet on some common ground to achieve that purpose, so much the better. But their obligation is to the project, not to adapt to all possible idiosyncrasies of crunching schedules. This is a volunteer project, not a paid one.
ID: 2054 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2055 - Posted: 22 Mar 2017, 20:40:34 UTC - in response to Message 2054.  
Last modified: 22 Mar 2017, 20:43:15 UTC

I want my resources to be busy and not wasted. That goes for all resource types -- my 4 PCs, their 16 CPU cores, their 6 GPUs, their 2 ASIC miners. I'm attached to projects that yield work for all those resource types.

BOINC aims to keep my resources busy and not wasted. It uses work fetch algorithms to provide that, so long as you are attached to projects that have work. BOINC expects those projects to work together, and uses things like Resource Share and Recent Estimated Credit to meet the user's requests.

For testing purposes, I am attached to all 60 projects. I routinely get and do work for about 15 of them. I expect the applications to behave - to not interfere with the other projects. I'm vocal about problems, because I want them fixed - some have been BOINC problems that I've worked with their devs to fix, others have been project problems that I've worked with their admins to fix.

If there's a problem, I'll report it, and try to get it fixed. If it is a problem where it actually hinders some other project, then I'll expect prompt action from the problem project - and this is the scenario we have here.

krzyszp:

I first saw the problem on January 27th, and posted about it, not knowing about this thread. This thread first saw the problem on December 28th.

So, that's been roughly 12 weeks.
And what progress do we have to show for it?
ID: 2055 · Rating: 0

Copyright © 2024 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek