Message boards : Number crunching : Automatic abort work unit
kcharuso

Joined: 26 Jul 22
Posts: 4
Credit: 54,025,333
RAC: 9,527
Message 6114 - Posted: 7 Apr 2023, 11:39:48 UTC

Hi people,
I was wondering if there is a way to automatically abort a work unit. Often I notice work units that are in their ninth hour of crunching while showing 99% completion. I usually assume these work units have hit some error and will not be credited, so I abort the task to free computing resources for other work units.

Is there an option I can set in cc_config.xml or app_config.xml to have these work units automatically aborted after a certain amount of crunching time, say 5 hours, if the task is not yet complete?

From my rough observations, one or two work units end up like this every day. I never had this problem before, so I don't know what's going on. Please assist.

Thanks
ID: 6114
Keith Myers

Joined: 10 May 20
Posts: 308
Credit: 4,733,484,033
RAC: 357,217
Message 6115 - Posted: 7 Apr 2023, 19:46:04 UTC - in response to Message 6114.  

There is no such mechanism. I suggest you investigate why your tasks are not completing. The most likely cause is an overcommitted CPU. Reduce your available cores to 90% instead of 100%.
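While the BOINC client has no built-in abort-after-N-hours option, an external watchdog can be approximated with the `boinccmd` tool. Below is a minimal sketch of the parsing half; the `name:` and `elapsed task time:` field labels are assumptions based on typical `--get_tasks` output, so verify them against your own client version before relying on this.

```python
def overdue_tasks(get_tasks_output, limit_seconds):
    """Return names of tasks whose elapsed time exceeds limit_seconds,
    parsed from `boinccmd --get_tasks` text.  The 'name:' and
    'elapsed task time:' field labels are assumed from typical client
    output -- check them against your BOINC version."""
    overdue, name = [], None
    for line in get_tasks_output.splitlines():
        line = line.strip()
        if line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("elapsed task time:") and name is not None:
            if float(line.split(":", 1)[1]) > limit_seconds:
                overdue.append(name)
    return overdue

# Each returned task could then be aborted by hand, or from a script, with:
#   boinccmd --task <project_url> <task_name> abort
```

Note that this only hides the symptom; a task that routinely stalls at 99% still points at a host-side problem worth diagnosing.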

There are no bad work units at Universe. They all complete normally for all hosts other than your own.

A proud member of the OFA (Old Farts Association)
ID: 6115
Grant (SSSF)

Joined: 23 Apr 22
Posts: 151
Credit: 69,771,333
RAC: 16,935
Message 6116 - Posted: 7 Apr 2023, 22:52:03 UTC

Here's the stderr.txt file from one of your systems for a Task that errored out.

<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
Process still present 5 min after writing finish file; aborting</message>
<stderr_txt>
22:32:10 (2992): Can't acquire lockfile (32) - waiting 35s
22:32:46 (2992): Can't acquire lockfile (32) - exiting
22:32:46 (2992): Error: The process cannot access the file because it is being used by another process.

 (0x20)
01:18:54 (11308): called boinc_finish(0)

</stderr_txt>
]]>

And another

<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
Process still present 5 min after writing finish file; aborting</message>
<stderr_txt>
22:32:11 (5604): Can't acquire lockfile (32) - waiting 35s
22:32:46 (5604): Can't acquire lockfile (32) - exiting
22:32:46 (5604): Error: The process cannot access the file because it is being used by another process.

 (0x20)
01:32:30 (3792): Can't acquire lockfile (32) - waiting 35s
01:33:05 (3792): Can't acquire lockfile (32) - exiting
01:33:05 (3792): Error: The process cannot access the file because it is being used by another process.

 (0x20)
02:06:33 (13224): Can't acquire lockfile (32) - waiting 35s
02:07:08 (13224): Can't acquire lockfile (32) - exiting
02:07:08 (13224): Error: The process cannot access the file because it is being used by another process.

 (0x20)
02:21:37 (8564): Can't acquire lockfile (32) - waiting 35s
02:22:12 (8564): Can't acquire lockfile (32) - exiting
02:22:12 (8564): Error: The process cannot access the file because it is being used by another process.

 (0x20)
02:41:58 (2464): Can't acquire lockfile (32) - waiting 35s
02:42:33 (2464): Can't acquire lockfile (32) - exiting
02:42:33 (2464): Error: The process cannot access the file because it is being used by another process.

 (0x20)
02:43:30 (10852): called boinc_finish(0)

</stderr_txt>
]]>


It's either a mechanical HDD that can't keep up with the disk I/O load, or, more likely, a badly behaved AV programme.
Whitelist the C:\ProgramData\BOINC directory and the issue should no longer occur.
Reducing the number of CPU cores/threads in use shouldn't be necessary.


Then there was another system where all the work timed out. It looks like, for whatever reason, you added it as a new system (it got a new host ID), so all of the existing work timed out.
Grant
Darwin NT
ID: 6116
kcharuso

Joined: 26 Jul 22
Posts: 4
Credit: 54,025,333
RAC: 9,527
Message 6117 - Posted: 8 Apr 2023, 3:02:35 UTC - in response to Message 6116.  

Thank you very much, I am enlightened. All suggestions are now being implemented. I'll keep y'all updated. Thanks again.
ID: 6117
kcharuso

Joined: 26 Jul 22
Posts: 4
Credit: 54,025,333
RAC: 9,527
Message 6118 - Posted: 8 Apr 2023, 3:51:42 UTC - in response to Message 6115.  

Good day Mr. Myers,

I am wondering about the completion time of work units, not just from Universe but from other projects as well. Let me describe my setup, which may point to the possible issue. I'm running a B550 board with a 3950X, 32 GB of RAM, and a 1 TB M.2 drive, plus a pair of GTX 1080s. In BOINC I'm running Universe on the CPU (30 tasks) and Einstein on the GPUs (2 tasks). The machine runs 24/7 at full load for months at a time with zero issues.

A month ago I was given a 5700 XT for free, so I installed it in my last available PCIe slot and added Milkyway just for this new GPU; it crunches nothing else. The machine operated flawlessly, as if nothing had changed... for a week or so. I could do my daily work at the same time with no lag or heat problems, and plenty of RAM was still available. I measured 780 W at the socket, while the PSU can supply over 1000 W. Everything looked well within the capability of the hardware and software.

However, I then noticed random BOINC tasks in every project starting to take longer to complete, and errors or invalid tasks began to appear. Lastly, there is the issue with Universe tasks I described. It seems the newly added 5700 XT is the culprit, but I couldn't figure out why. I monitored as many parameters as the machine can provide, but nothing indicated stress or over-utilization; in fact, less than 50% of resources and capacity were in use. CPU @ 100% (70 C), all GPUs @ 98-100% (82 C).

I really want to fix this issue while also putting the free 5700 XT to use. I don't know if this is possible, or whether there is any setting I can configure. Most importantly, I want to know the cause of the gradual slowdown. Have you come across a similar issue, or is there anything I can look into?

Thank you
ID: 6118
Grant (SSSF)

Joined: 23 Apr 22
Posts: 151
Credit: 69,771,333
RAC: 16,935
Message 6119 - Posted: 8 Apr 2023, 4:40:36 UTC - in response to Message 6118.  
Last modified: 8 Apr 2023, 4:41:49 UTC

Remove the 5700XT and see if the issues go away; something else could have happened around the time the new video card was added that is causing the problem. If removing the card fixes it, then it was the problem; if not, then something else has occurred that is causing the issues.

It's worth getting Process Explorer to see what's running on the system; as much as Task Manager has improved over the years, Process Explorer is still far better for a detailed look at what is using system resources.
In my case, after using Windows Security for years without issue, it started sucking up CPU time, and none of the subsequent updates sorted the problem out, so I disabled it and now use a third-party security application.
Grant
Darwin NT
ID: 6119
nexiagsi16v

Joined: 28 Feb 15
Posts: 23
Credit: 42,229,680
RAC: 1,553
Message 6120 - Posted: 8 Apr 2023, 20:12:46 UTC - in response to Message 6118.  

You are using NVidia cards and an AMD card in one system? That could be a problem. Stop the work for the AMD card (or take it out of the system) and watch what happens.
Maybe your M.2 drive is too hot?
ID: 6120
Keith Myers

Joined: 10 May 20
Posts: 308
Credit: 4,733,484,033
RAC: 357,217
Message 6121 - Posted: 9 Apr 2023, 3:08:26 UTC - in response to Message 6118.  
Last modified: 9 Apr 2023, 3:08:41 UTC

So you have 32 possible threads for use.

You use 30 threads for Universe.

You use 2 threads for Einstein.

You use an unknown number of threads for Milkyway.

30 + 2 + ? = more threads than you have available.

A computer needs at least one, and preferably two, free threads for "system maintenance", IOW running all the background tasks of the OS: input from the mouse or keyboard, disk reads/writes, network polling, etc.

IOW, you have an overcommitted CPU that has to divert attention from your BOINC demands to find time for its own upkeep. You are dragging time slices away from your BOINC work, so the running tasks are not serviced in a timely manner and have to wait several cycles before the CPU can get back around to them.
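The thread arithmetic above can be sketched directly. The two threads assumed for Milkyway below are a placeholder, since the actual count was unknown in the thread:

```python
def free_threads(total, *reserved):
    """Threads left over after each project's reservation;
    a negative result means the CPU is overcommitted."""
    return total - sum(reserved)

# 3950X: 16 cores / 32 threads.  Universe 30, Einstein 2,
# Milkyway 2 (assumed -- the real count was not stated).
print(free_threads(32, 30, 2, 2))  # -2: overcommitted by two threads
```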

A proud member of the OFA (Old Farts Association)
ID: 6121
kcharuso

Joined: 26 Jul 22
Posts: 4
Credit: 54,025,333
RAC: 9,527
Message 6122 - Posted: 10 Apr 2023, 9:37:54 UTC - in response to Message 6121.  

Hi, I'm running 26 Universe, 2 Milkyway, and 2 Einstein tasks; 30 work units in total, so I still have 1 core / 2 threads available for whatever the system needs. As for memory, a little under 10 GB is used out of the 32 GB available. CPU temperature is a stable 76 C and the hottest GPU is the ATI one at 81 C. Everything is within recommendations as far as I can tell from the tools I have.

Also, 2 Milkyway tasks on the 5700 XT give more points than running just 1 task, even though the points per task are halved. Strange, but good. However, I did not get similar results in Einstein on the 1080s, where running 1 task per GPU is always better than running 2.

Does the speed of PCIe matter in BOINC? According to the motherboard specs, I am now running only x4 for all 3 GPUs. Is there any benefit if the GPUs operate at x8 or x16?

As of now, there are still 3-4 errors a day from each project. I'm OK with that, but if I can get rid of them, that would be super.
ID: 6122
Grant (SSSF)

Joined: 23 Apr 22
Posts: 151
Credit: 69,771,333
RAC: 16,935
Message 6123 - Posted: 10 Apr 2023, 10:44:47 UTC - in response to Message 6122.  

does the speed of pcie matter in Boinc? according to the motherboard specs. i am now only run 4x for all 3 gpus. is there any benefit if the gpus are operating in 8x or 16x?
It would depend on the application, but in general the PCIe bus bandwidth has little (if any) impact on processing GPU Tasks (at least if it's PCIe 4).


as of now, there are still 3-4 errors a day from each project. im ok with that but if i can get rid of them, it will be super.
Look at the stderr output for those errored Tasks to see what is occurring.

As far as Universe is concerned, you haven't had any for 4 or more days now (and the only recent one wasn't really an error; the Project cancelled the Task before you started processing it).
Grant
Darwin NT
ID: 6123
Keith Myers

Joined: 10 May 20
Posts: 308
Credit: 4,733,484,033
RAC: 357,217
Message 6125 - Posted: 10 Apr 2023, 18:00:22 UTC - in response to Message 6122.  

x4 PCIe speed is slower on Einstein Separation tasks. I run 3 identical 2080 GPUs on an Asus C7H motherboard, which has the top two GPUs running at x8 PCIe speed with the lanes coming from the CPU. The bottom GPU runs at x4 speed and gets its lanes from the chipset.

The bottom GPU running at x4 is noticeably slower on all GPU tasks from Einstein, Milkyway and Asteroids.

You will see a slowdown, but running 3 GPUs at x4 speed is still more productive than just 2 GPUs running at x8.

A proud member of the OFA (Old Farts Association)
ID: 6125
rsNeutrino

Joined: 1 Nov 17
Posts: 27
Credit: 291,940,933
RAC: 10,883
Message 6137 - Posted: 15 Apr 2023, 17:07:53 UTC - in response to Message 6122.  
Last modified: 15 Apr 2023, 17:11:02 UTC

i measured 780w from the socket while i can supply over 1000w from the psu.

That is a lot of power for permanently running 2 cards, and it could introduce stability problems: if the current draw on some of the PSU's power rails is too high, the voltage begins to sag. One also has to understand that power draw fluctuates rapidly within the system's components, depending on which sub-modules/ALUs/CUDA cores etc. are in use at each moment in the cards and the CPU during program execution. So when all components happen to be active in the same millisecond and the power rail they are on cannot supply a clean 3.3 V (CPU) or 12 V (GPU), this can cause stability problems.
Errors may happen inside the CPU that get automatically detected and corrected, which slows the CPU down by forcing repeated calculations. This can also trigger Windows Hardware Error Architecture events (WHEA errors) in the Windows event log.

HWiNFO can read the voltages of many system rails and components; it may help determine whether a voltage is dropping below spec. The Wikipedia article on the ATX spec says: "Generally, supply voltages must be within ±5% of their nominal values at all times."
HWiNFO also has a built-in WHEA counter.
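That ±5% tolerance is easy to check against logged readings. A small sketch; the rail names and sample voltages below are illustrative, not real measurements:

```python
def within_atx_spec(nominal, measured, tolerance=0.05):
    """True if a measured rail voltage is within +/-tolerance
    (5% by default, per the ATX figure quoted above) of nominal."""
    return abs(measured - nominal) <= tolerance * nominal

# Illustrative readings, e.g. as exported from a HWiNFO log:
for rail, nominal, measured in [("+12V", 12.0, 11.38), ("+3.3V", 3.3, 3.28)]:
    status = "OK" if within_atx_spec(nominal, measured) else "sagging"
    print(rail, status)  # +12V sagging, +3.3V OK
```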

-> Higher clocks need higher voltage (U), which leads to higher current in the silicon (I), and thus to massively higher power throughput (W = U*I), which in turn leads to higher thermals (W_therm = W_el).

If it really is the case that the power is insufficient, getting back to stability can be achieved by any of the following:

1. Downclock unstable components until they run error-free (CPU, GPU, RAM); I recommend looking into this for the CPU.
Maybe AMD PBO can help with that; on my 5900X I can use PPT/EDC/TDC power limits, as in point 2:

2. Downclock components with high power draw to reduce it; e.g. I recommend setting power limits in MSI Afterburner. That is what I did with my GPU (1080 Ti, 70%, to ~200 W) for Folding@home, making it very quiet and extending its life at the same time.
A power limit of around 70% reduces performance by maybe 5-20%, depending on the workload.

3. As others have already said, remove components to reduce power draw and all other system interference at the hardware and software level.

4. Beef up the power supply. Hard in your case, and also very dependent on the quality of the PSU and the cables (connections). 780 W at 12 V = 65 A (a cable cross-section table says 16.00 mm² for a single cable).
In general I don't recommend going over 70% of the PSU's rated power, to avoid power problems and because PSUs age.

The worst option is a cheap PSU: they are usually massively below their rated power spec, can inject "dirty" power with all sorts of signals mixed in, and can break/explode/catch fire and ruin all the other system components with a voltage spike in the process. A good-quality PSU is a must.
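The 65 A figure and the 70% rule of thumb from point 4 can be reproduced directly:

```python
def amps(watts, volts):
    """Current draw: I = P / V."""
    return watts / volts

def min_psu_rating(load_watts, max_fraction=0.70):
    """Smallest PSU rating that keeps the load at or below the given
    fraction of rated power (the 70% rule of thumb above)."""
    return load_watts / max_fraction

print(round(amps(780, 12)))        # 65 A at 12 V, matching the figure above
print(round(min_psu_rating(780)))  # ~1114 W rated PSU for a 780 W load
```

By this rule of thumb, the system's measured 780 W draw would already be pushing a 1000 W unit past its comfortable operating range.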
ID: 6137

Copyright © 2024 Copernicus Astronomical Centre of the Polish Academy of Sciences
Project server and website managed by Krzysztof 'krzyszp' Piszczek