Message boards : Number crunching : extreme long wu's

adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2056 - Posted: 23 Mar 2017, 7:39:32 UTC

I posted about this in December also. It had been running fine until then. Something must have changed to create the issue. What changed?
ID: 2056 · Rating: 0
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 25
Message 2057 - Posted: 23 Mar 2017, 9:16:28 UTC
Last modified: 23 Mar 2017, 9:23:30 UTC

@Jacob. So you are running a system that is running a preview version of Windows, and which reboots every 1.5 days (apparently highlighting a failing in the Boinc watchdog), and adrianxw says "me too" to that. Have you heard the expression "bleeding edge"? Is it any wonder the problem is rare and that krzyszp can't reproduce it? My Pi also runs 365/7/24, running Boinc when it would otherwise be idling, but it's running a standard release of Raspbian, and ordinarily it does not reboot. It may not be bleeding edge, but that's perhaps why it works just fine.

BTW, ATLAS@Home has had a problem with sporadic long-running tasks for far longer than Universe@Home.

Regarding your work on Work Fetch, would you be the person to complain to about the time I spent trying to get my Windows PC fetching sensible amounts of work for each project? LHC@Home and WCG routinely grabbed more work than they could possibly complete by the deadline, and did so aggressively, whilst ATLAS would grab one task now and again, but mostly it just kept reporting back that the work queue was full. The problem was exacerbated by Boinc choosing to run the most recently downloaded task flat out whilst ignoring a task that was hard up against the deadline (and which would subsequently time out). It took much fiddling with project settings and XML files to make the system work in a reasonable manner. Boinc is great, but it's very far from perfect.
ID: 2057 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2058 - Posted: 23 Mar 2017, 12:04:27 UTC

@Brummig

1: I'm glad you're not having the nasty problem in this thread. Please don't dump on others that do have the problem and are trying to fix it for everyone else.

2: Regarding work fetch, sounds like you're looking for blood. Good thing I'm kind :) I recommend turning on the <work_fetch_debug> option, using either Options > Event Log Options, or editing cc_config.xml ... then, when odd behavior happens, inspect the work fetch output to make sure it behaved correctly. If something looks odd or doesn't make sense, feel free to send me a private message, and I'll try to help you, if you remain patient.
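
For anyone who prefers to edit cc_config.xml rather than use the Event Log Options dialog, here is a minimal sketch of switching the flag on. It assumes the usual layout, where log flags live inside a <log_flags> element under the <cc_config> root. The helper name enable_log_flag is made up for this example and is not part of BOINC.

```python
import xml.etree.ElementTree as ET


def enable_log_flag(config_xml: str, flag: str = "work_fetch_debug") -> str:
    """Return config_xml with <flag>1</flag> set inside <log_flags>.

    Hypothetical helper for illustration; it only manipulates the XML text
    and does not talk to the BOINC client.
    """
    root = ET.fromstring(config_xml)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")

    # Create <log_flags> if the file does not have one yet.
    log_flags = root.find("log_flags")
    if log_flags is None:
        log_flags = ET.SubElement(root, "log_flags")

    # Create or overwrite the flag itself.
    element = log_flags.find(flag)
    if element is None:
        element = ET.SubElement(log_flags, flag)
    element.text = "1"

    return ET.tostring(root, encoding="unicode")


# Example with a minimal config; a real file may contain more sections.
minimal = "<cc_config><log_flags></log_flags></cc_config>"
updated = enable_log_flag(minimal)
print(updated)
```

After writing the result back to cc_config.xml in the BOINC data directory, the client has to re-read its config files (or be restarted) before the extra work fetch lines show up in the Event Log.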

Same team, bro.
ID: 2058 · Rating: 0
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 25
Message 2059 - Posted: 23 Mar 2017, 13:54:21 UTC - in response to Message 2058.  
Last modified: 23 Mar 2017, 14:05:17 UTC

Jacob, I'm not dumping on those that have the problem. I simply asked those with the problem not to dump on the silent, problem-free majority by requesting that the project be suspended until the issue is fixed (which may be some time, of course).

Thanks for offering to look into the fetch problem, but for the moment I would very much prefer to leave things alone (since my fixes are currently working). Moreover, I suspect I wouldn't be able to recreate the problem, because all the CERN projects have been consolidated into LHC@Home (leaving two bruisers, LHC and WCG, to slog it out with no collateral damage). The last to be consolidated was ATLAS, and the last ATLAS@Home task was sent out a week or so ago.

And no, I'm not looking for blood, but you did choose to mention your contribution was the Work Fetch algorithm that (presumably) gave me so much grief, when all I asked was why you had to reboot your machine so often.
ID: 2059 · Rating: 0
Sebastian M. Bobrecki
Volunteer tester
Joined: 4 Feb 15
Posts: 17
Credit: 158,222,691
RAC: 0
Message 2060 - Posted: 23 Mar 2017, 14:25:37 UTC

Since the beginning of March I have computed over 19k tasks, and fewer than 3% of them have had this issue. So for me it is not a big problem. Of course it would be good to have this fixed, but I remember times when projects that are bigger, better financed, and reputed to be stable had a much higher failure rate.
ID: 2060 · Rating: 0
Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 2061 - Posted: 23 Mar 2017, 15:31:25 UTC

Let me explain something:

The "Failed computing" rates are:

windows_x86_64 - 0.6751%
0.01 windows_intelx86 - 6.1462%
x86_64-pc-linux-gnu - 0.3296%

The number of "197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED" errors is 29 at the moment, out of about 45,000 WUs.
All of them were finished successfully by a wingman, e.g. http://universeathome.pl/universe/workunit.php?wuid=9391200
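
To put those figures side by side, here is a small sketch of the arithmetic. It uses only the numbers quoted in this post; the variable names are illustrative, and the ratios say nothing about the cause.

```python
# Figures quoted in the post; variable names are illustrative only.
time_limit_errors = 29
total_workunits = 45_000

failed_computing_percent = {
    "windows_x86_64": 0.6751,
    "windows_intelx86": 6.1462,
    "x86_64-pc-linux-gnu": 0.3296,
}

# Share of work units that ended with EXIT_TIME_LIMIT_EXCEEDED.
rate_percent = 100.0 * time_limit_errors / total_workunits
print(f"time-limit errors: {rate_percent:.4f}% of work units")  # 0.0644%

# Each platform's failure rate relative to the Linux build.
linux_percent = failed_computing_percent["x86_64-pc-linux-gnu"]
relative_to_linux = {
    platform: percent / linux_percent
    for platform, percent in failed_computing_percent.items()
}
for platform, ratio in relative_to_linux.items():
    print(f"{platform}: {ratio:.1f}x the Linux failure rate")
```

On these numbers the time-limit errors are well under a tenth of a percent of all work units, while the 32-bit Windows build fails many times more often than either 64-bit build.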

Can you point me in a direction to find out why 32-bit Windows has a substantially higher failure rate than 64-bit Windows, and why both are less reliable than Linux?
Also, it is extremely difficult to find a bug when other machines finish their tasks correctly...
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
ID: 2061 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2062 - Posted: 23 Mar 2017, 16:17:17 UTC
Last modified: 23 Mar 2017, 16:54:59 UTC

Did you change something around Christmas time? It was okay before Christmas, and problematic afterwards. I don't just mean the program, changes may be in the data etc. What compiler do you use?
ID: 2062 · Rating: 0
Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 2063 - Posted: 23 Mar 2017, 17:02:42 UTC - in response to Message 2062.  
Last modified: 23 Mar 2017, 17:04:26 UTC

Did you change something around Christmas time? It was okay before Christmas, and problematic afterwards. I don't just mean the program, changes may be in the data etc. What compiler do you use?

No, the same application has been in place since the end of July last year.
Only parameters in the input files were changed, but those are very small changes (some parameters were slowly increased over time)...
I use g++ on Linux and Visual Studio 2010 for Windows.
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
ID: 2063 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2064 - Posted: 23 Mar 2017, 17:12:37 UTC
Last modified: 23 Mar 2017, 17:15:26 UTC

I am assuming that the problem started about the time of the thread, certainly, I had not seen the fault before I posted in here. But that is an assumption. Are we SURE the problem started about then? Do you have error rates for periods earlier?

What compiler are you using for Windows?

Trying to narrow some gaps, throw out certain ideas etc.
ID: 2064 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2065 - Posted: 23 Mar 2017, 19:26:34 UTC

Sorry, can't edit. I can see you added the compilers later; they were not there when I wrote the above.
ID: 2065 · Rating: 0
Brummig
Joined: 23 Mar 16
Posts: 95
Credit: 23,431,842
RAC: 25
Message 2066 - Posted: 23 Mar 2017, 20:13:36 UTC
Last modified: 23 Mar 2017, 20:19:53 UTC

I was reading something on CPDN earlier that said that they see very small differences between the results returned by a task on one operating system and the results returned by the same task on a different operating system. Now if, say, an intermediate result were compared with something, then that very small difference between operating systems could mean that a task that runs just fine on one OS could take a different path and go completely awry on another. It could also be that the small changes made to the input file were sufficient to trigger that chaotic behaviour. That may or may not help in tracking down the problem, but it does at least provide a possible explanation for the mystery.
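
A toy illustration of that idea, not the BHspin code: the logistic map below stands in for any iterative calculation that is sensitive to its inputs, and the 1e-12 offset stands in for a rounding difference between platforms. Both the map and the offset are assumptions chosen only to show the effect.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), a standard chaotic toy model."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def first_divergence(a, b, tol):
    """Return the first step at which two runs differ by more than tol."""
    for step, (xa, xb) in enumerate(zip(a, b)):
        if abs(xa - xb) > tol:
            return step
    return None


# Same calculation, started from values that differ in the 12th decimal place.
reference = logistic_trajectory(0.2, 200)
perturbed = logistic_trajectory(0.2 + 1e-12, 200)

diverged_at = first_divergence(reference, perturbed, tol=0.1)
print("runs differ by more than 0.1 from step", diverged_at)
```

Once two runs have drifted that far apart, any comparison of an intermediate value against a threshold can go one way on one machine and the other way on another, which is the kind of path change described above.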
ID: 2066 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2067 - Posted: 24 Mar 2017, 2:21:22 UTC

Is there any way to send me the inputs to the tasks that I aborted, so I may try them again locally?
ID: 2067 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2068 - Posted: 25 Mar 2017, 15:29:45 UTC - in response to Message 2067.  
Last modified: 25 Mar 2017, 15:38:34 UTC

After a week of setting this project with maximum resource share, in order to repro the problem, and monitoring the results closely, I have so far been unsuccessful in repro'ing it.

I will now set it for normal resource share, and try to keep a lookout for the problem.

Again, if anybody has the input files that recreate the issue, I'd LOVE to test with them. I wish I had saved mine from the problematic tasks earlier, before I aborted them.
ID: 2068 · Rating: 0
Jim1348
Joined: 28 Feb 15
Posts: 253
Credit: 200,562,581
RAC: 0
Message 2069 - Posted: 25 Mar 2017, 23:25:29 UTC - in response to Message 2068.  

Jacob,
I would try the opposite. Get a couple of other programs to run alongside Universe and see if you can induce the problem. That would explain why krzyszp cannot reproduce it, since he probably runs it in isolation.

I was running Einstein on 3 cores and Universe on 4 cores (with a GTX 1060 on Folding supported by the other core) when I would get the problem every few weeks. I am trying Universe by itself now, but will need to let it go for at least a month without a long runner before it means anything. Is it possible that BOINC somehow perturbs Universe due to the presence of another program? I don't know.
ID: 2069 · Rating: 0
Jacob Klein
Joined: 21 Feb 15
Posts: 53
Credit: 1,385,888
RAC: 0
Message 2070 - Posted: 26 Mar 2017, 0:45:00 UTC - in response to Message 2069.  
Last modified: 26 Mar 2017, 0:47:16 UTC

Don't get me wrong, I was running it alongside other programs (I have other high priority RNA World VM tasks, GPUGrid GPU tasks, WuProp non-cpu-intensive tasks, etc.)

So, yeah, now it will run amidst more since I set the resource share to be equal to my other 10 projects... And I will continue to try to repro the issue.

Thanks.
ID: 2070 · Rating: 0
Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 2071 - Posted: 27 Mar 2017, 18:59:20 UTC - in response to Message 2070.  

I have just released a new version of the BHspin application; we will see within 24 hours whether the problem persists.
(The new app version is for Windows and Linux only at the moment.)
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
ID: 2071 · Rating: 0
Krzysztof Piszczek - wspieram ...
Project administrator
Project developer
Project tester
Joined: 4 Feb 15
Posts: 841
Credit: 144,180,465
RAC: 2
Message 2075 - Posted: 28 Mar 2017, 0:19:30 UTC - in response to Message 2071.  

I had to roll back the new version due to a bug in it :(
Krzysztof 'krzyszp' Piszczek

Member of Radioactive@Home team
My Patreon profile
Universe@Home on YT
ID: 2075 · Rating: 0
Jim1348
Joined: 28 Feb 15
Posts: 253
Credit: 200,562,581
RAC: 0
Message 2077 - Posted: 28 Mar 2017, 3:36:20 UTC - in response to Message 2075.  
Last modified: 28 Mar 2017, 3:39:48 UTC

I had to roll back the new version due to a bug in it :(

I deleted my first batch of 0.02 BHspin v2 tasks, but have just received more at 02:30 UTC on 28 March.
Are they safe to run, or should I delete them also?
ID: 2077 · Rating: 0
adrianxw
Joined: 1 Oct 16
Posts: 32
Credit: 268,033
RAC: 0
Message 2139 - Posted: 17 Apr 2017, 8:01:12 UTC

All gone quiet...
ID: 2139 · Rating: 0
NotRealName
Joined: 5 Feb 17
Posts: 6
Credit: 2,135,900
RAC: 0
Message 2141 - Posted: 17 Apr 2017, 23:19:23 UTC - in response to Message 2061.  

Are you saying that only about 10% of computers are experiencing issues? That is strange, because there was a performance drop of 70% during the last month: http://universeathome.pl/universe/history.php
ID: 2141 · Rating: 0

Copyright © 2024 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek