Message boards : Number crunching : Long running work units

fractal

Joined: 23 Mar 15
Posts: 3
Credit: 155,811,308
RAC: 94
Message 1166 - Posted: 8 May 2016, 22:47:40 UTC - in response to Message 1165.  

This is what BM reports, but I don't really believe the numbers... Just look at the CPU usage...


CPU usage is normal, as in 98% or higher. I see no problems in the longer running WU's, only that they run longer... excepting the ones that never finish! Hope you can figure those out.

8-)

Yeah, the never finishing ones are a pain in the rear.

Sad to say, but I abort anything with "0" or "10" in the third field of the task name on sight. Some of them do finish but some of them don't, and it is hard to tell which is which. Turning off "keep suspended in memory", then suspending and resuming them, resets them to the last checkpoint, which is often days earlier. So, rather than wait a day to find out, I just abort any "0" or "10" sequence work and hope someone with an AMD processor picks them up.
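
To make the "third field" concrete: task names in this thread look like universe_bh_368_10_20000_1-999999_229800_1, and the field in question is the number right after the batch number. Below is a minimal Python sketch of flagging such tasks by name. The function names are made up for illustration, and the name layout (universe_bh_<batch>_<sequence>_...) is an assumption inferred from the names posted here, not from project documentation.

```python
from typing import Optional

# Sequences reported in this thread as crashing ("0") or running long ("10").
SUSPECT_SEQUENCES = {"0", "10"}


def sequence_field(task_name: str) -> Optional[str]:
    """Return the sequence token, or None if the name has another shape.

    Assumes names of the form universe_bh_<batch>_<sequence>_...
    """
    parts = task_name.split("_")
    if len(parts) < 4 or parts[:2] != ["universe", "bh"]:
        return None
    return parts[3]


def is_suspect(task_name: str) -> bool:
    """True if the task's sequence field is one of the reported problem values."""
    return sequence_field(task_name) in SUSPECT_SEQUENCES
```

Whether a flagged task actually misbehaves seems to depend on the CPU and OS, so this only narrows down which tasks to watch.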
Luigi R.

Joined: 10 Sep 15
Posts: 12
Credit: 20,067,933
RAC: 0
Message 1186 - Posted: 19 May 2016, 19:43:20 UTC
Last modified: 19 May 2016, 19:55:27 UTC

cyrusNGC_224@P3D

Joined: 21 Feb 15
Posts: 46
Credit: 926,538,317
RAC: 647
Message 1189 - Posted: 23 May 2016, 20:24:20 UTC - in response to Message 1166.  

Some of them do finish but some of them don't, and it is hard to tell which is which. Turning off "keep suspended in memory", then suspending and resuming them, resets them to the last checkpoint, which is often days earlier. So, rather than wait a day to find out, I just abort any "0" or "10" sequence work and hope someone with an AMD processor picks them up.

"0" WUs crash; "10" WUs run longer. But "0" WUs do not crash on every CPU and every OS. Windows machines mostly crunch "0" WUs normally.
Tex1954

Joined: 22 Feb 15
Posts: 23
Credit: 37,205,060
RAC: 36
Message 1191 - Posted: 24 May 2016, 14:06:31 UTC - in response to Message 1186.  
Last modified: 24 May 2016, 14:25:10 UTC

My 4770k runs a task in ~4900s, but today I've got 6 tasks >17000s long. Yes, they are "_10_" tasks.

CPU usage is ~100%, I'm sure about this.

Wingmen seem to run (those tasks) in their normal time.


Looking at all those, I see a progression of WU's from _5_ up to _10_ on all my setups. Currently, my 2P 24T setup has all those _10_ WU's y'all are abandoning, I guess. So far, at 8 hours of run time, they are about 68% complete.

Should I abandon them? I think not... I'm here to help the project.

However, it was and still is of some concern that the long tasks make the same points as the short tasks and that motivates folks to abandon them. For those only interested in points, I suppose that is to be expected.

I KNOW there is a way to help compensate point-wise for long tasks... one only has to identify them and use a multiplier on the points, even for fixed point setups. Other projects do this... but their LONG tasks are identified ahead of time which is perhaps something that this project is unable to predict.

Anyway, point production = electricity used in many people's minds, and I'm sure it would benefit the project to pick some average time breakpoints on a certain CPU (via FLOPS or something) and adjust point output using a simple multiple, like under 4 hours on a 3770 = 333, 4-7.9 hours = 666, 8-11.9 hours = 999, and so forth.

In fact, one could simply use a Time(seconds)/FLOPS/Sec value like this:

round((Time/FLOPs) /4) * 333

or something simple like that to determine points. It could even use a fixed CPU average like 3750 or 4750 for FLOPS so people could not cheat, and apply it to the fastest (lowest) time of the two, giving the same points to both (assuming a Primary and a Wingman).
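
The breakpoint idea above can be sketched in Python. This is only one reading of the proposal: run time is first normalized to a fixed reference CPU, then each started 4-hour block earns 333 credits. The function names, the reference-speed parameter, and the use of a floor instead of round() are assumptions made for illustration, not anything the project implements.

```python
import math

BLOCK_HOURS = 4.0    # tier width taken from the proposal
BLOCK_CREDIT = 333   # credit per tier taken from the proposal


def reference_hours(run_seconds: float, host_flops: float, ref_flops: float) -> float:
    """Scale a host's run time to what the (assumed) reference CPU would need."""
    return run_seconds / 3600.0 * (host_flops / ref_flops)


def tiered_credit(ref_hours: float) -> int:
    """Under 4 h -> 333, 4-7.9 h -> 666, 8-11.9 h -> 999, and so on."""
    tier = math.floor(ref_hours / BLOCK_HOURS) + 1
    return tier * BLOCK_CREDIT


def paired_credit(ref_hours_a: float, ref_hours_b: float) -> int:
    """Both wingmen receive the credit for the faster (lower) of the two times."""
    return tiered_credit(min(ref_hours_a, ref_hours_b))
```

Using the lower of the two times means a host that is slow for its own reasons does not inflate the award for the pair.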

8-)
Jim1348

Joined: 28 Feb 15
Posts: 253
Credit: 200,562,581
RAC: 0
Message 1192 - Posted: 24 May 2016, 16:08:21 UTC - in response to Message 1189.  
Last modified: 24 May 2016, 16:09:59 UTC

"0" WUs crash; "10" WUs run longer. But "0" WUs do not crash on every CPU and every OS. Windows machines mostly crunch "0" WUs normally.

I have just set up an Ubuntu 16.04 machine (i7-4790), and have completed 36 of the "0" without problems.

Universe@Home 0.09 Universe BHspin universe_bh_366_0_20000_1-999999_929300_0 03:01:50 (02:29:01) 5/23/2016 6:03:08 AM 5/23/2016 6:46:11 AM 81.95
Reported: OK i7-4790-PC (LAN)


But I have not gotten any of the "10" yet, so that will tell the story.
Conan

Joined: 4 Feb 15
Posts: 48
Credit: 15,956,546
RAC: 60
Message 1193 - Posted: 8 Jun 2016, 1:38:58 UTC
Last modified: 8 Jun 2016, 1:46:09 UTC

Over the last month I have found 3 of these "10" type work units on my Linux 64 bit computer that have run from over 17 to over 20 hours (over 65,000 to over 75,000 seconds). I finished one today.
Each has been partnered with a Windows 7 computer when this has happened.
The Windows computers have all had normal run times.
All have been paid the same 333.33 points.
I have only had 1 of the "0" type that I can find and it has had the shortest run time of any of my work units at 5,500 seconds.
My normal run times are from 7,500 to 14,500 seconds for all other types.
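
For a sense of scale, here is a quick Python calculation of credit per hour under the fixed 333.33 award, using the run times quoted above. The function name is made up for this example.

```python
FIXED_CREDIT = 333.33  # credit paid per task, as reported above


def credit_per_hour(run_seconds: float, credit: float = FIXED_CREDIT) -> float:
    """Credit earned per hour of run time for a single task."""
    return credit / (run_seconds / 3600.0)


normal_fast = credit_per_hour(7_500)    # shortest normal run time
normal_slow = credit_per_hour(14_500)   # longest normal run time
long_task = credit_per_hour(65_000)     # low end of the long "10" tasks

print(f"{normal_fast:.1f} {normal_slow:.1f} {long_task:.1f}")
```

That works out to roughly 160 and 83 credits per hour for normal tasks against about 18 for the long ones.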

My Windows 32 bit computer has had fairly consistent run times from 13,000 to 24,000 seconds for all types.

Other than that I have had no errors running BHSpin work units on either Linux or Windows.

Conan
Jim1348

Joined: 28 Feb 15
Posts: 253
Credit: 200,562,581
RAC: 0
Message 1194 - Posted: 8 Jun 2016, 15:49:51 UTC

Looking through my BoincTask History, I find that I have completed twelve of the "10s" without problem in the past three days. They all ran around 7 hours 40 minutes on this i7-4790 machine. Again, this is with Ubuntu 16.04, so maybe there have been fixes to Linux? I have not used it before.

universe_bh_368_10_20000_1-999999_229800_1
universe_bh_368_10_20000_1-999999_257300_1
universe_bh_368_10_20000_1-999999_270300_0
universe_bh_368_10_20000_1-999999_253800_0
universe_bh_368_10_20000_1-999999_257800_0
universe_bh_368_10_20000_1-999999_277800_1
universe_bh_368_10_20000_1-999999_114300_1
universe_bh_368_10_20000_1-999999_775800_0
universe_bh_368_10_20000_1-999999_775300_1
universe_bh_368_10_20000_1-999999_776300_0
universe_bh_368_10_20000_1-999999_752300_0
universe_bh_368_10_20000_1-999999_606800_1
cyrusNGC_224@P3D

Joined: 21 Feb 15
Posts: 46
Credit: 926,538,317
RAC: 647
Message 1199 - Posted: 10 Jun 2016, 12:15:23 UTC

As noted above, for example, AMD Bulldozer (Vishera, Steamroller, Excavator) based CPUs are running these in normal time (Linux).
Luigi R.

Joined: 10 Sep 15
Posts: 12
Credit: 20,067,933
RAC: 0
Message 1207 - Posted: 11 Jun 2016, 9:08:15 UTC - in response to Message 1194.  
Last modified: 11 Jun 2016, 9:13:52 UTC

Looking through my BoincTask History, I find that I have completed twelve of the "10s" without problem in the past three days. They all ran around 7 hours 40 minutes on this i7-4790 machine. Again, this is with Ubuntu 16.04, so maybe there have been fixes to Linux? I have not used it before.

[...]

It doesn't look like there is anything wrong with them. They just run longer, without proper credit scaling. They don't end in error.

I don't know whether a bug is causing useless loops or whether it is an operating-system inefficiency. CPU usage is fine (CPU time is ~100% of run time).


I think it would be wise to abort them before they start computing, since we know there are other configurations (OS + hardware) that are not affected by this issue.
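
For anyone who wants to do that from the command line, BOINC's boinccmd tool accepts "--task <project URL> <task name> abort". Below is a minimal Python sketch that only builds those commands (it does not run them) for names whose sequence field is "10". The project URL is a placeholder to fill in, and the name layout (universe_bh_<batch>_<sequence>_...) is an assumption inferred from the task names posted in this thread.

```python
def abort_commands(task_names, project_url, sequences=("10",)):
    """Build (not run) one boinccmd argument list per matching task."""
    commands = []
    for name in task_names:
        parts = name.split("_")
        # Assumed layout: universe_bh_<batch>_<sequence>_...
        if len(parts) >= 4 and parts[3] in sequences:
            commands.append(["boinccmd", "--task", project_url, name, "abort"])
    return commands


names = [
    "universe_bh_368_10_20000_1-999999_229800_1",
    "universe_bh_366_0_20000_1-999999_929300_0",
]
PROJECT_URL = "<project URL as shown by your BOINC client>"  # placeholder
for cmd in abort_commands(names, PROJECT_URL):
    print(" ".join(cmd))
```

Printing the commands first makes it easy to review the list before aborting anything.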
mmonnin

Joined: 2 Jun 16
Posts: 169
Credit: 317,253,046
RAC: 6
Message 1252 - Posted: 10 Jul 2016, 20:15:45 UTC

My 2P 2670 got another batch of about 35 tasks with 10 in the name that take 3x longer than normal.
Theadalus

Joined: 9 Nov 15
Posts: 6
Credit: 193,753,698
RAC: 0
Message 1255 - Posted: 12 Jul 2016, 9:00:45 UTC - in response to Message 1252.  

My 2P 2670 got another batch of about 35 tasks with 10 in the name that take 3x longer than normal.

3x longer... then you are lucky; I have 1 machine where they take 9 - 10 times longer!

Worst thing is they are credited the same as short tasks!
mmonnin

Joined: 2 Jun 16
Posts: 169
Credit: 317,253,046
RAC: 6
Message 1291 - Posted: 4 Aug 2016, 0:29:21 UTC

Looks like BHSpin2 is also affected by these same long WUs. Guess I'll check those as well to pre-abort them.

Copyright © 2024 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek