[Openshmem-list] shmem_tool_sync

Mon Mar 20 23:58:34 UTC 2017

>
>
> shmem_tool_sync() would be nice for tool developers, since we could count
> on SHMEM still being usable at the time of the dump.  When shmem_finalize
> is in a DSO, it's suddenly hard to assert that SHMEM calls are safe to use
> if the tool is dumping from a static destructor.  From the user
> perspective, the tools should already provide their own mechanisms for
> dumping data, e.g. TAU has the TAU_DB_DUMP call, and setting the
> TAU_TRACK_SIGNALS=1 environment variable will cause TAU to dump profiles
> if the application crashes.
>
>
Yeah, and my past use of TAU_DB_DUMP with NWChem is certainly part of the
motivation here.  shmem_tool_sync provides a standard wrapper for any tools
that want to do this.
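
To make the idea concrete, here is a rough single-process sketch of the
dispatch I have in mind.  Nothing here is an existing OpenSHMEM API:
tool_register, tool_sync (standing in for the proposed shmem_tool_sync),
demo_profiler and run_phases are all hypothetical names, and a real
version would presumably be collective so that tools can coordinate
their I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tool-side interface; none of these names exist in OpenSHMEM. */
typedef void (*tool_flush_fn)(void *tool_state);

#define MAX_TOOLS 8

static struct {
    tool_flush_fn flush;
    void *state;
} tools[MAX_TOOLS];
static size_t num_tools = 0;

/* A tool registers a callback that makes its state persistent. */
int tool_register(tool_flush_fn flush, void *state)
{
    if (flush == NULL || num_tools == MAX_TOOLS)
        return -1;
    tools[num_tools].flush = flush;
    tools[num_tools].state = state;
    num_tools++;
    return 0;
}

/* Stand-in for the proposed shmem_tool_sync(): ask every registered
 * tool to flush.  This sketch only shows the dispatch, not the
 * collective semantics a real implementation would need. */
void tool_sync(void)
{
    for (size_t i = 0; i < num_tools; i++) {
        assert(tools[i].flush != NULL);
        tools[i].flush(tools[i].state);
    }
}

/* Example tool: just counts how often it was asked to dump. */
struct demo_profiler { int dumps; };

static void demo_flush(void *state)
{
    struct demo_profiler *p = state;
    p->dumps++;  /* a real tool would write its logs here */
}

/* Application side: sync after each phase, so that the data for
 * phase i is on disk before phase i+1 gets a chance to crash.
 * Returns the number of dumps the demo tool performed. */
int run_phases(int nphases)
{
    static struct demo_profiler prof;
    prof.dumps = 0;
    num_tools = 0;  /* reset the registry for this run */
    if (tool_register(demo_flush, &prof) != 0)
        return -1;
    for (int i = 0; i < nphases; i++) {
        /* ... phase i work would go here ... */
        tool_sync();
    }
    return prof.dumps;
}
```

The point of the explicit call is that the application decides when the
dump happens, once per phase here, rather than the tool guessing.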

> At F2F I floated the idea of a SHMEM_T interface to expose and manipulate
> SHMEM internal variables.  I could imagine a user setting flags to instruct
> the tool to dump on certain events, every Nth collective, etc.
>
>
Piggybacking on collectives is a means of last resort that I never want to
see anyone implement, because it can lead to unexpected performance
variation.
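
As an illustration of the concern, here is a minimal sketch of an
every-Nth-collective trigger.  The names (nth_dumper, on_collective,
count_slow_collectives) are made up for this example and do not come
from any real tool; how the tool intercepts the collective is left out
entirely.

```c
#include <assert.h>

/* Hypothetical tool state for an "every Nth collective" dump policy. */
struct nth_dumper {
    int interval;  /* dump on every interval-th collective */
    int calls;     /* collectives seen so far */
    int dumps;     /* dumps triggered so far */
};

/* Called each time the tool intercepts a collective.
 * Returns 1 if this particular call paid for a dump, 0 otherwise. */
int on_collective(struct nth_dumper *d)
{
    assert(d->interval > 0);
    d->calls++;
    if (d->calls % d->interval != 0)
        return 0;
    d->dumps++;  /* a real tool would write its logs here */
    return 1;
}

/* Count how many of ncalls collectives end up carrying the dump cost. */
int count_slow_collectives(int interval, int ncalls)
{
    struct nth_dumper d = { interval, 0, 0 };
    int slow = 0;
    for (int i = 0; i < ncalls; i++)
        slow += on_collective(&d);
    return slow;
}
```

From the application's point of view, the collectives that return 1 are
indistinguishable from the rest, yet they take much longer, which is
exactly the kind of variation that is hard to explain after the fact.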

Given the slow adoption of MPI_T support, I'm not sure how effective a
SHMEM_T interface would be.  I'd rather see implementers try some things and
see if a best practice emerges.

Jeff

> ~John C.
>
> On Mon, Mar 20, 2017 at 5:07 PM, Jeff Hammond <jeff.science at gmail.com>
> wrote:
>
>> The discussion of shmem_finalize made me wonder if it would not be
>> helpful to have an explicit call to tell any active tools to dump their
>> logs.
>>
>> For example, if I have a code like NWChem that goes through a series of
>> phases, and the odds of phase i+1 crashing are nontrivial, but I want to
>> know how phase i did with some tool, then I can call shmem_tool_sync
>> between the phases to ensure that my tool state is (1) persistent and (2)
>> in an analyzable state.
>>
>> It really sucks if an app crashes in such a way that tool state is not
>> useable, particularly when said tool state would help figure out why the
>> app crashed.
>>
>> One advantage of this call vs some hidden implementation is that the
>> hidden implementation is either asynchronous or probably runs more often
>> than it needs to.  For example, if I automatically dump logs every Nth
>> collective, then I might slow down the app noticeably.  On the other hand,
>> if I have to do the dump asynchronously, I cannot take advantage of I/O
>> aggregation or any other coordinated analysis.
>>
>> It would be ideal if somebody from the TAU team could comment on this.
>>
>> Jeff
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> http://jeffhammond.github.io/
>>
>> _______________________________________________
>> Openshmem-list mailing list
>> Openshmem-list at openshmem.org
>> http://www.openshmem.org/mailman/listinfo/openshmem-list
>>
>>
>
>
> --
> John C. Linford, Ph.D.
> Senior Computer Scientist
> ParaTools, Inc. <http://www.paratools.com>
> 5520 Research Park Drive, Suite 100
> Baltimore, MD 21228
> Phone: 540-808-9250
> -------------------------------------------------------------------------
>

-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/