[GLASSFISH-21359] Issue with the glassfish jvm Created: 15/May/15  Updated: 11/Jun/15  Resolved: 11/Jun/15

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: tejas.pathak Assignee: Debayan_Gupta
Resolution: Invalid Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

The GlassFish JVM went down with the following error, as seen in the JVM logs.
Snippet of the error:
[#|2015-05-13T09:41:20.233-0400|WARNING|oracle-glassfish3.1.2|ShoalLogger|_ThreadID=87;_ThreadName=Thread-2;|GMS1078: NetworkUtility.deserialized current objects: thread=GMS-McastMsgProcessor-Group-cluster1-thread-8 messages=

{LMWID=39, targetPeerId=10.1.2.109:9195:228.9.213.46:22325:cluster1:instance1}

failed while deserializing name=HM|#]

[#|2015-05-13T09:41:20.233-0400|WARNING|oracle-glassfish3.1.2|ShoalLogger|_ThreadID=87;_ThreadName=Thread-2;|GMS1071: damaged multicast packet discarded
com.sun.enterprise.mgmt.transport.MessageIOException: failed to deserialize a message : name = HM
at com.sun.enterprise.mgmt.transport.MessageImpl.readMessagesInputStream(MessageImpl.java:349)
at com.sun.enterprise.mgmt.transport.MessageImpl.parseMessage(MessageImpl.java:239)
at com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender$MessageProcessTask.run(BlockingIOMulticastSender.java:350)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)

#]

[#|2015-05-13T09:41:20.265-0400|WARNING|oracle-glassfish3.1.2|ShoalLogger|_ThreadID=84;_ThreadName=Thread-2;|GMS1078: NetworkUtility.deserialized current objects: thread=GMS-McastMsgProcessor-Group-cluster1-thread-9 messages=

{LMWID=39, targetPeerId=10.1.2.109:9195:228.9.213.46:22325:cluster1:instance1}

failed while deserializing name=HM|#]

[#|2015-05-13T09:41:20.265-0400|WARNING|oracle-glassfish3.1.2|ShoalLogger|_ThreadID=84;_ThreadName=Thread-2;|GMS1071: damaged multicast packet discarded
com.sun.enterprise.mgmt.transport.MessageIOException: failed to deserialize a message : name = HM
at com.sun.enterprise.mgmt.transport.MessageImpl.readMessagesInputStream(MessageImpl.java:349)
at com.sun.enterprise.mgmt.transport.MessageImpl.parseMessage(MessageImpl.java:239)
at com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender$MessageProcessTask.run(BlockingIOMulticastSender.java:350)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)

#]

Following are the software versions:
Operating System : Windows server 2012
Jdk version : jdk1.6.0_45
Glassfish version : Oracle GlassFish Server 3.1.2.2 (build 5)



 Comments   
Comment by tejas.pathak [ 19/May/15 ]

Any update regarding this issue?

Comment by tejas.pathak [ 22/May/15 ]

Any update regarding this issue?

Comment by tejas.pathak [ 02/Jun/15 ]

Any update regarding this issue?

Comment by Debayan_Gupta [ 03/Jun/15 ]

This issue is clearly due to packets getting corrupted during transmission. Since it depends on the network you are using, there is very little chance that it can be reproduced on some other network. I would suggest that you capture some packets using Wireshark (www.wireshark.org) and analyze where the packets are getting corrupted. You can also send us the captured .pcap files so we can analyze them on our side.

Comment by tejas.pathak [ 03/Jun/15 ]

Thanks Debayan for the update. I will try to monitor the packets using Wireshark. I would also like to know the following:
a) Is there any way to stop this from happening in GlassFish (a top-up patch or something)?
b) Do I need to enable any additional logging in GlassFish to capture these errors in more detail?

Comment by Debayan_Gupta [ 04/Jun/15 ]

You are welcome, Tejas. As I mentioned, the problem lies with the network and not with the application server. In the log I can see two IP addresses, 10.1.2.109 and 228.9.213.46. To isolate the problem, you could try the same application with destinations having different IP addresses, or try to send some packets between those hosts from a sample application (you can follow the example at http://staff.www.ltu.se/~peppar/java/multicast_example/). Although I do not know much about what your application is doing, I would suggest you give Wireshark a try. You can also set the log levels of the components you are using (e.g. messaging) to FINE (instructions: http://docs.oracle.com/cd/E19798-01/821-1751/ghgwi/index.html).
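For reference, a minimal standalone multicast check along the lines of the linked example could look like the sketch below. This is only an illustration, not GlassFish code; the group address and port are taken from the targetPeerId in the log above, and you may want to use a different port so the test does not collide with the running cluster. Run the receiver on one host and the sender on another to verify that multicast traffic arrives between them:

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

public class MulticastCheck {
    // Group address and port taken from the GMS targetPeerId in the log above.
    static final String GROUP = "228.9.213.46";
    static final int PORT = 22325;

    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName(GROUP);
        if (args.length > 0 && args[0].equals("send")) {
            // Sender: emit one small datagram to the multicast group.
            byte[] data = "multicast-test".getBytes("UTF-8");
            MulticastSocket sender = new MulticastSocket();
            sender.send(new DatagramPacket(data, data.length, group, PORT));
            sender.close();
        } else {
            // Receiver: join the group and wait for a single datagram.
            MulticastSocket receiver = new MulticastSocket(PORT);
            receiver.joinGroup(group);
            byte[] buf = new byte[8192];
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            receiver.receive(packet);
            System.out.println("received " + packet.getLength() + " bytes from " + packet.getAddress());
            receiver.leaveGroup(group);
            receiver.close();
        }
    }
}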

Comment by tejas.pathak [ 04/Jun/15 ]

Thanks Debayan for the update. Next time I run into the same issue, I will try to debug it as you have described.

Comment by Debayan_Gupta [ 05/Jun/15 ]

That is great, Tejas. Since we need to bring this to a conclusive end, is it fine from your side if we close this issue until you face the problem again? You can reopen the issue whenever you run into it. Please confirm. Also, reach out to us for any kind of help regarding this.

Comment by Debayan_Gupta [ 11/Jun/15 ]

The issue is not reproducible and points to network-related problems, as mentioned in the previous comments. The user should investigate according to the suggestions provided. If it turns out to be caused solely by GlassFish (with the Wireshark captures as requested), this issue can be reopened.





[GLASSFISH-21021] GlassFish 3.1.2.2 to 8 cluster replication fails (working in 3.1.1) Created: 31/Mar/14  Updated: 10/Apr/14  Resolved: 08/Apr/14

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: gfuser9999 Assignee: Joe Fialli
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS: Any
Version: 3.1.2.5/3.1.2.8
Working: 3.1.1.3 (working)
Setup: Setup a 2 node cluster on the same box (so no mcast or network issues)



 Description   

Set up a 2-node cluster c1 on the same box (so no mcast or network issues)
and deploy clusterjsp.ear. Enable debug logging on the org.shoal.ha and org.shoal.ha.cache
loggers, as their output reveals the issue.

1. Start node ci1
2. Wait until startup is complete
3. You can warm up the clusterjsp application
4. In the ci1 logs you can confirm the c1 node is fine:

[#|2014-03-31T12:24:14.856+0800|INFO|oracle-glassfish3.1.2|ShoalLogger|_ThreadID=1;_ThreadName=main;|**GroupServiceProvider:: REGISTERED member event listeners for <group, instance> => <c1, ci1>|#]

5. Start node ci2 and wait until startup is complete

6. In the ci1 logs:

[#|2014-03-31T12:25:38.524+0800|INFO|oracle-glassfish3.1.2|ShoalLogger|_ThreadID=17;_ThreadName=GMS ViewWindowThread Group-c1;|GMS1092: GMS View Change Received for group: c1 : Members in view for JOINED_AND_READY_EVENT(before change analysis) are :
1: MemberId: ci1, MemberType: CORE, Address: 10.xx.xx.225:9192:228.9.219.163:13930:c1:ci1
2: MemberId: ci2, MemberType: CORE, Address: 10.xx.xx.225:9111:228.9.219.163:13930:c1:ci2

#]

..
[#|2014-03-31T12:25:38.583+0800|FINE|oracle-glassfish3.1.2|org.shoal.ha.cache.mapper|_ThreadID=107;_ThreadName=GMS-processNotify-Group-c1-thread-5;ClassName=org.shoal.ha.mapper.DefaultKeyMapper MethodName=printMemberStates;|DefaultKeyMapper[ci1].onViewChange (isJoin: true) currentView: ci2; previousView ci1
ReplicaChoices[ci2]: ci2

#]

[#|2014-03-31T12:25:38.585+0800|FINE|oracle-glassfish3.1.2|org.shoal.ha.cache.mapper|_ThreadID=107;_ThreadName=GMS-processNotify-Group-c1-thread-5;ClassName=org.shoal.ha.mapper.DefaultKeyMapper MethodName=printMemberStates;|DefaultKeyMapper[ci1].onViewChange (isJoin: true) currentView: ci2; previousView ci1
ReplicaChoices[ci2]: ci2

#]

i.e., both ci1 and ci2 are seen by ci1

7. In the ci2 logs we can see that the c1 cluster is formed:

[#|2014-03-31T12:25:38.547+0800|INFO|oracle-glassfish3.1.2|ShoalLogger|_ThreadID=17;_ThreadName=GMS ViewWindowThread Group-c1;|GMS1092: GMS View Change Received for group: c1 : Members in view for JOINED_AND_READY_EVENT(before change analysis) are :
1: MemberId: ci1, MemberType: CORE, Address: 10.186.xx.xx:9192:228.9.219.163:13930:c1:ci1
2: MemberId: ci2, MemberType: CORE, Address: 10.186.xx.xx:9111:228.9.219.163:13930:c1:ci2

#]

=====================
ISSUE
=====================
8. On ci2 the mapper does not know about the ci1 node:

[#|2014-03-31T12:25:38.560+0800|FINE|oracle-glassfish3.1.2|org.shoal.ha.cache.mapper|_ThreadID=83;_ThreadName=GMS-processNotify-Group-c1-thread-4;ClassName=org.shoal.ha.mapper.DefaultKeyMapper MethodName=printMemberStates;|DefaultKeyMapper[ci2].onViewChange (isJoin: true) currentView: ; previousView |#]

[#|2014-03-31T12:25:38.562+0800|FINE|oracle-glassfish3.1.2|org.shoal.ha.cache.mapper|_ThreadID=83 _ThreadName=GMS-processNotify-Group-c1-thread-4;ClassName=org.shoal.ha.mapper.DefaultKeyMapper MethodName=printMemberStates;|DefaultKeyMapper[ci2].onViewChange (isJoin: true) currentView: ; previousView |#]

9. Run clusterjsp ONLY on ci2 and create a new session. You see:

[#|2014-03-31T12:27:54.257+0800|FINE|oracle-glassfish3.1.2|org.shoal.ha.cache.command.save|_ThreadID=108;_ThreadName=http-thread-pool-7920(4);ClassName=org.shoal.ha.cache.impl.store.ReplicatedDataStore;MethodName=put;|Skipped replication of 664c58762e15c054433486c4a3c1 since there is only one instance running in the cluster.|#]

=======
IMPACT
======
The last cluster node started does not have its sessions replicated (ci1 works but not ci2,
ci2 being the last node started). If not for sticky sessions there would be visible session
loss; the real issue is that the STATE is not replicated, as it seems the KeyMapper did not
get a state update to reflect the new cluster state.

The worst thing is that this works in GF v3.1.1 but not GF v3.1.2, due to code
that checks and prints
"Skipped replication of xxxx since there is only one instance running in the cluster."
which now exposes the fact that the state is not right (members = empty).



 Comments   
Comment by gfuser9999 [ 01/Apr/14 ]

Additional note:
This happens with a 2-node cluster, where the 2nd node will not replicate.
If you create a 3-node cluster with members {ci1,ci2,ci3} and start them in sequence,
waiting for each previous node to be up before starting the next, then the DefaultKeyMapper members are:

i) ci1 -> replica={ci2,ci3}
ii) ci2 -> replica={ci3}
iii) ci3 -> replica={ci2}

Although at this point this seems fine, since each node has a replica member, if
I shut down either ci2 or ci3, I will get the same failure to
replicate because there is no member.

org/shoal/ha/cache/impl/store/ReplicatedDataStore.java
290                     KeyMapper keyMapper = dsc.getKeyMapper();
291 
292                     // fix for GLASSFISH-18085
293                     String[] members = keyMapper.getCurrentMembers();
294                     if (members.length == 0) {
295                         _saveLogger.log(Level.FINE, "Skipped replication of " + k + " since there is only one instance running in the cluster.");
296                         return result;
297                     }

Line 293 shows that if membership is not managed well, we have a problem.

Comment by Joe Fialli [ 01/Apr/14 ]

Unable to investigate this issue without full server logs for all cluster members.

From the server log fragments, it is obvious that no DAS is running, yet there was no mention
of why this is the case. I would recommend running the test case with the DAS and starting the cluster with
the asadmin start-cluster command.

Comment by gfuser9999 [ 08/Apr/14 ]
  1. The issue may not be seen (sometimes)
    when both instance ci1 and ci2 (is started at the same time w/o DAS)
    but is not seen when DAS is up
  2. Without DAS, however, if you start ci1 (wait until it is up) and then start
    ci2 you will see the issue.
    Also when ci2 is started later and you shutdown ci1 and then start ci1 again
    the issue then is seen with ci1(moved to it)


It would seems that the replication does not work well on manual start up on a two-node
maintenance use-case setup. According to operational guide of GF,

  • it is mentioned that DAS need not be UP all the time (and does not need to be up
    once the instances are synced)
  • In fact instance may be restarted and managed by a service daemon (local startup)
    and hence the setup does mimick what a 2 node cluster would be.

Seems to mean that there is some reliance to have some newly started node bootstrap it's member state

Comment by Joe Fialli [ 08/Apr/14 ]

All cluster testing is performed using "asadmin start-cluster".

This removes a race condition over which node is the master of the cluster.
So for the most stable behavior, it is strongly recommended to start the
cluster with the DAS running. You can then shut down the DAS and an orderly
takeover of master will occur.

If you start both ci1 and ci2 within MASTER_DISCOVERY_TIMEOUT (which I believe defaults to 5 seconds,
but I do not remember for certain), there will be a battle over cluster master that is not well tested at
all. The failure you are reporting is quite likely due to both instances believing
they are the master of a one-instance cluster. (Without the full server logs there is
no way to tell. However, this reported usage scenario is not supported anyway.)

Starting one node, and then waiting until that node's server log states that it is the master for the
cluster before starting the other node, can fix that.

Still, for reliable operation, there is much more testing with the asadmin start-cluster command.
There is sufficient testing of stopping/killing the DAS with the cluster still running.
However, testing is never done on just a 2-node cluster; that is quite small. If there is one failure, there are no other nodes to replicate to.

Minimally, one should be running a cluster of 3 members where 2 members are always running (thus replication would always be enabled within the cluster).

Comment by gfuser9999 [ 09/Apr/14 ]
  1. For the record, no production GF system will start the cluster using
    start-cluster. The reason is that everyone uses "asadmin create-service" to
    create a SERVICE that starts each instance of the cluster individually
    (for each instance on a separate box).
  2. Next, no, I am not starting ci1 and ci2 together, and they are not fighting
    for master (ci1 has been started and up for a long time before ci2 is started).
  3. So the previous argument is not convincing. (Sorry, I can't upload the logs
    since I do not find any attachment feature to upload them.)
    I find it difficult to see why this is not a normal "supported" use case.
  • It is odd, like asking people not to use an OS service to start the instances!
    (since that's the only way to do automatic startup reliably using a service)
  • Next, it also implicitly insists that the DAS must always be up (or
    that the cluster MUST be a 3-node cluster to function properly)
  • Sure, I understand the reliability point, but "functionally" it looks a bit broken
Comment by Joe Fialli [ 09/Apr/14 ]

Without the server logs, there is no way to tell what is going wrong. (There is a MASTER event in each server log that states the master for the cluster.)
Based on your description, something is causing each instance in the cluster to think it is the master. That would explain the behavior you
have described. Without the server logs, there is no way to diagnose the issue.

GF 3.1.2 added support for clusters to work without multicast.
http://docs.oracle.com/cd/E26576_01/doc.312/e24934/clusters.htm#CHDGAIBJ
With the DAS running, this all just works by default, without any thought at cluster creation time.
With no DAS running, the non-multicast cluster member server addresses and ports must be set properly.

Without the commands used to create the cluster and cluster members, it is not possible to know which mode you are using.
The server logs would also show that.

Comment by Joe Fialli [ 10/Apr/14 ]

To work around this issue, the DAS must be running when the cluster is first being formed.
The "asadmin start-cluster" command does not have to be used, but the DAS must be running while the cluster is initially forming.
After the initial formation of the cluster, the DAS is no longer necessary, except for the bootstrap case of forming the cluster again.





[GLASSFISH-19142] update gms adapter to use new nucleus logging API Created: 10/Oct/12  Updated: 26/Oct/12  Resolved: 26/Oct/12

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 4.0
Fix Version/s: 4.0

Type: Improvement Priority: Minor
Reporter: Joe Fialli Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Update nucleus/gms-adapter and nucleus/gms-bootstrap to use the new nucleus logging API as documented
at http://aseng-wiki.us.oracle.com/asengwiki/display/GlassFish/Logging+Guide



 Comments   
Comment by Joe Fialli [ 26/Oct/12 ]

Fixed by svn commit rev. 56750





[GLASSFISH-18145] regression in instances joining cluster after "asadmin start-cluster" Created: 07/Jan/12  Updated: 10/Jan/12  Resolved: 09/Jan/12

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.2_b14, 3.1.2_b15, 3.1.2_b16
Fix Version/s: 3.1.2_b17

Type: Bug Priority: Major
Reporter: Joe Fialli Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File easqezorro1_domain.log    
Tags: 3_1_2-approved

 Description   

Also see shoal http://java.net/jira/browse/SHOAL-118.

The regression was introduced on 11/14 and/or 11/17 in fixing bug 13375653.
Link: https://bug.oraclecorp.com/pls/bug/webbug_print.show?c_rptno=13375653

The regression is that multiple joins for the same instance can occur in the DAS after "asadmin start-cluster".
These multiple joins result in a longer time from start-cluster until "asadmin get-health <cluster-name>"
shows that all members have joined the cluster.

The time for the cluster to start up was well under 45 seconds for the GlassFish Shoal SQE tests starting
a 9-instance cluster. The regression has resulted in timings that sometimes take over 65 seconds.
(There is quite a variance in how much slower startup can be; we did observe it taking over 65 seconds
for the submitted case.)



 Comments   
Comment by Joe Fialli [ 07/Jan/12 ]

A fix has been identified and sanity tested by running one GlassFish Shoal SQE test and analyzing the output.
No duplicate joins occur in the DAS server log when run with the shoal-gms-impl patch.
Waiting for a full SQE test run on the patch to ensure no regressions.

Change Control Form

  • What is the impact on the customer of the bug?

How likely is it that a customer will see the bug and how serious is the bug?
The customer is likely to observe that the time from calling "start-cluster" until "asadmin get-health <clustername>"
lists all instances as started has gotten longer. Additionally, multiple joins per instance may occur.
(Most code works off joined-and-ready notifications, so this problem has not impacted failover, which has a joined-and-ready handler.)

Is it a regression?
Yes. It is a performance regression in how long it takes for all instances in the cluster to join the cluster after a start-cluster.
The regression was introduced while fixing bug 13375653.
Link: https://bug.oraclecorp.com/pls/bug/webbug_print.show?c_rptno=13375653

The regression impacts the Shoal GlassFish SQE tests, since they double-check that the cluster is entirely up and healthy,
and the time the tests have to wait to check whether the cluster started correctly had to be increased from 45 seconds to 90 seconds.
Addressing this issue will remove having to change the existing SQE tests to account for this performance regression in
starting a 10-instance cluster.

Does it meet other bug fix criteria (security, performance, etc.)?
Fix will improve start-cluster performance.

  • What is the cost/risk of fixing the bug?

How risky is the fix? How much work is the fix? Is the fix complicated?

The fix is not risky. Only one file is changed, and the change is
to add a single if conditional. The change eliminates duplicate
joins for a cluster member as the cluster starts up. The change was reviewed by Mahesh.

  • Is there an impact on documentation or message strings?
    There is no impact to documentation or message strings.
  • Which tests should QA (re)run to verify the fix did not destabilize GlassFish?
    Glassfish Shoal SQE tests are being run now.
  • Which is the targeted build of 3.1.2 for this fix?
    Next build.
Comment by Joe Fialli [ 09/Jan/12 ]

Shoal 1.6.17 integrated into the gf 3.1.2 workspace.
Fix available as part of GlassFish 3.1.2 b17.

Comment by Joe Fialli [ 10/Jan/12 ]

Shoal 1.6.17 integrated into bg trunk as part of svn version 52009 on January 10, 2012.
Fix should be in next promoted build which is 4.0 b19.





[GLASSFISH-18085] fail to replicate sessions larger than 64kB when only one active clustered instance Created: 25/Dec/11  Updated: 20/Jun/12  Resolved: 20/Jun/12

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.1_b12
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: janouskovec Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

SPARC, Solaris 10


Tags: 3_1_2-exclude, 3_1_2-release-note-added, 3_1_2-release-notes

 Description   

Cannot replicate sessions larger than 64 kB. I got the following messages in the log:

[#|2011-12-24T13:13:12.446+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=75;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 155 379 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:13:12.555+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=75;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 155 379 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:13:35.648+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=77;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 155 378 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:13:35.654+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=77;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 155 378 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:13:44.250+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=78;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 311 060 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:13:44.304+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=78;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 311 060 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:13:54.048+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=79;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 448 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:13:54.053+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=79;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 448 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:18.749+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=82;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 449 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:18.753+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=82;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 449 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:19.650+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=83;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 543 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:19.655+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=83;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 543 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:23.449+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=84;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 542 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:23.454+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=84;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 542 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:24.049+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=85;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 440 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:24.054+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=85;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 440 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:29.049+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=87;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 440 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:29.055+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=87;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 440 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:29.549+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=88;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 018 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:29.553+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=88;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 018 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:37.949+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=89;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 018 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:37.954+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=89;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 018 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:38.449+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=90;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 623 exceeds max multicast size 65 536|#]
[#|2011-12-24T13:14:38.454+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=90;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 167 623 exceeds max multicast size 65 536|#]

Shoal (or some other component in GF) should split the session into multiple multicast datagrams if the session is larger than 64 kB.



 Comments   
Comment by Mahesh Kannan [ 04/Jan/12 ]

Assigning this to GMS module

Comment by Joe Fialli [ 04/Jan/12 ]

More information is needed from the submitter on this issue.

For GlassFish 3.1.1, Shoal GMS is based on top of Grizzly 1.9, and there is no UDP multicast support in Grizzly 1.9.
The 64K limit for GMS UDP broadcast messages cannot be addressed in the short term.

However, the replication subsystem (used by failover) does not use GMS broadcast messages for session data.
Thus, we need more information on what is using GMS UDP broadcast messages.

Either we need confirmation that the test case runs fine when session
data is smaller than 64K and only fails when session data is larger. We also need to know
what event results in the above messages (i.e., was a clustered server instance killed or stopped?).

Or we need a more basic description of the cluster environment (number of instances in the cluster).

Turning on logging can confirm (or disprove) that session replication is involved.
The entire server log with the above messages in context may or may not be enough.
In case there is not enough information in server.log to determine which module
is calling GMS, below are ways to enable logging for the replication subsystem and the Shoal
GMS subsystem.

Finer logging for the replication subsystem can be enabled with the following command:
% asadmin set-log-levels --target <clustername> org.shoal.ha=FINE

To enable finer logging for the Shoal Group Management Service (GMS), use the following command (while the cluster is running):

% asadmin set-log-levels --target <clustername> ShoalLogger=FINE

No further investigation of this issue can take place until more information is provided.
The full server logs may provide enough context and information that logging need not be enabled.

Comment by Joe Fialli [ 05/Jan/12 ]

Marked as 3_1_2-exclude since we are unable to diagnose this issue without the further information requested from the submitter in the
previous comment.

GMS will not remove the 64K limit for UDP broadcast messages in GF 3.1.2, so all that
can be done is to find out which module is invoking this call and evaluate whether this
call should be a point-to-point message (which has a much larger maximum message size, and that
maximum can be increased).

Comment by Joe Fialli [ 19/Jan/12 ]

We have been able to replicate the reported WARNING in internal testing.

[#|2011-12-24T13:13:12.446+0100|WARNING|glassfish3.1.1|ShoalLogger|_ThreadID=75;_ThreadName=Thread-2;|GMS1073: Multicast datagram of size 155 379 exceeds max multicast size 65 536|#]

We were only able to recreate the reported issue when only one member of the cluster was running and HA was enabled.

The workaround is to always ensure that at least 2 clustered instances are running.
(Use "asadmin get-health <clusterName>" to monitor a cluster's health.
There is also information in the server log that indicates how many cluster members are running in a cluster.)
The fix (post GF 3.1.2) is to disable replication when
there is only one instance in the cluster and there are no eligible replicas to replicate
HA data to. The current bug is that HA is sending the replica session to null, which translates to
a UDP broadcast. Broadcast over UDP multicast will never support a size larger than 64K.

This error occurs when there is only one clustered instance running and HA is enabled.
The error message is an indication that replication is not working, since there are
no other clustered instances to replicate to.

Here is stack trace from internal run that confirms this:

[#|2012-01-18T08:27:53.117-0800|INFO|glassfish3.1.2|ShoalLogger.mcast|_ThreadID=70;_ThreadName=Thread-2;|context for exceeds max UDP message size
java.lang.Exception: stack trace context for exceeds max UDP broadcast size
at com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender.doBroadcast(BlockingIOMulticastSender.java:313)
at com.sun.enterprise.mgmt.transport.AbstractMulticastMessageSender.broadcast(AbstractMulticastMessageSender.java:70)
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyNetworkManager.broadcast(GrizzlyNetworkManager.java:298)
at com.sun.enterprise.mgmt.ClusterManager.send(ClusterManager.java:409)
at com.sun.enterprise.mgmt.ClusterManager.send(ClusterManager.java:419)
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.sendMessage(GroupCommunicationProviderImpl.java:336)
at com.sun.enterprise.ee.cms.impl.base.GroupHandleImpl.sendMessage(GroupHandleImpl.java:142)
at org.shoal.ha.group.gms.GroupServiceProvider.sendMessage(GroupServiceProvider.java:260)
at org.shoal.ha.cache.impl.interceptor.TransmitInterceptor.onTransmit(TransmitInterceptor.java:83)
at org.shoal.ha.cache.api.AbstractCommandInterceptor.onTransmit(AbstractCommandInterceptor.java:98)
at org.shoal.ha.cache.impl.interceptor.ReplicationCommandTransmitterManager.onTransmit(ReplicationCommandTransmitterManager.java:86)
at org.shoal.ha.cache.api.AbstractCommandInterceptor.onTransmit(AbstractCommandInterceptor.java:98)
at org.shoal.ha.cache.impl.interceptor.CommandHandlerInterceptor.onTransmit(CommandHandlerInterceptor.java:74)
at org.shoal.ha.cache.impl.command.CommandManager.executeCommand(CommandManager.java:122)
at org.shoal.ha.cache.impl.command.CommandManager.execute(CommandManager.java:114)
at org.shoal.ha.cache.impl.interceptor.ReplicationCommandTransmitterWithList$BatchedCommandListDataFrame.run(ReplicationCommandTransmitterWithList.java:213)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Comment by Rebecca Parks [ 30/Jan/12 ]

For the Release Notes I think I understand what the user-visible problem is. I can doc the 64 kB limit. I'm guessing that there's no workaround.

Comment by Rebecca Parks [ 30/Jan/12 ]

Upon rereading I see that the workaround is to ensure that at least two instances are running. This would seem like a no-brainer.

Comment by Joe Fialli [ 20/Jun/12 ]

Fixed in Shoal version 1.6.20, GlassFish 3.1.2 patch 1.
Fix committed in Shoal svn revision 1744 on Jan 26, 2012.

The following FINE log message confirmed the fix:
_loadLogger.log(Level.FINE, "Skipped replication of " + key + " since there is only one instance running in the cluster.");
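The fix corresponds to the guard shown in the ReplicatedDataStore excerpt quoted under GLASSFISH-21021 above: before transmitting, the store asks the KeyMapper for the current members and skips the save when none are available. Below is a condensed, purely illustrative sketch of that check; only getCurrentMembers() and the log message come from the quoted code, while the class, stub interface, and method names here are assumptions:

import java.util.logging.Level;
import java.util.logging.Logger;

public class SingleInstanceGuardSketch {
    // Stub of the only KeyMapper operation used by the guard (see the quoted excerpt).
    interface KeyMapper { String[] getCurrentMembers(); }

    private static final Logger saveLogger = Logger.getLogger("org.shoal.ha.cache.command.save");

    // Returns true if replication was skipped because this is the only live instance.
    static boolean skipIfAlone(KeyMapper keyMapper, String key) {
        String[] members = keyMapper.getCurrentMembers();
        if (members.length == 0) {
            // No other live cluster member: skipping avoids the >64K UDP broadcast
            // that triggered the GMS1073 warnings above.
            saveLogger.log(Level.FINE, "Skipped replication of " + key
                    + " since there is only one instance running in the cluster.");
            return true;
        }
        return false;
    }
}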





[GLASSFISH-18047] specifying a network interface name for gms-bind-interface-address does not work correctly on Linux or Windows Created: 19/Dec/11  Updated: 04/Jan/12

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.2_b14
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Joe Fialli Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Initially discovered on Linux 2.6.18-164.0.0.0.1.el5 #1 SMP Thu Sep 3 00:21:28 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
Confirmed also to occur on Windows XP
Did not occur on dual stack Mac OS X 10.6.8 or IPv4 only Solaris 5.10.


Issue Links:
Dependency
blocks GLASSFISH-18024 virtual network interfaces introduced... Resolved
Tags: 3_1_2-exclude

 Description   

Specifying the network interface "eth0" on Linux is not working correctly. (The same failure is confirmed on Windows.)

This issue is marked as minor since the documentation does not state that it is valid to specify a network interface
name for gms-bind-interface-address. This capability was added to assist in network configuration setups
where some machines are multihomed and we were not consistently selecting the appropriate network interface on all machines
in the cluster. Specifying the network interface for the cluster to use bypasses the automated selection of the first network address to use.

The binding address returned by InetAddress.getByName() is "eth0/127.0.0.1".
The loopback interface is not appropriate for GMS inter-machine communications (it is only suitable
when all instances are on one machine, which is only used for development).

Running com.sun.enterprise.mgmt.transport.NetworkUtility shows that this issue exists.

%java -classpath shoal-gms-impl.jar com.sun.enterprise.mgmt.transport.NetworkUtility

Display name: eth0
Name: eth0
PreferIPv6Addresses: false
InetAddress: /fe80:0:0:0:223:8bff:fe64:7a56%7
InetAddress: /10.133.184.160
Up? true
Loopback? false
PointToPoint? false
Supports multicast? true
Virtual? false
Hardware address: [0, 35, -117, 100, 122, 86]
MTU: 1500
Network Inet Address (preferIPV6=false) /10.133.184.160
Network Inet Address (preferIPV6=true) /fe80:0:0:0:223:8bff:fe64:7a56%7
resolveBindInterfaceName(eth0)=127.0.0.1 /* this value should be 10.133.184.160 */

This issue did not occur on Mac or Solaris.
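A minimal sketch of the approach implied above, resolving the interface name through java.net.NetworkInterface instead of InetAddress.getByName(), follows. This is illustrative only, not the actual NetworkUtility fix; the class and method names are made up:

import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Enumeration;

public class ResolveInterfaceName {
    // Resolve an interface name such as "eth0" to its first non-loopback address.
    // InetAddress.getByName("eth0") treats the argument as a host name, which on some
    // platforms resolves to 127.0.0.1, producing the bad result shown above.
    static InetAddress resolve(String interfaceName, boolean preferIPv6) throws Exception {
        NetworkInterface nif = NetworkInterface.getByName(interfaceName);
        if (nif == null) {
            return null; // no such interface on this host
        }
        InetAddress fallback = null;
        for (Enumeration<InetAddress> e = nif.getInetAddresses(); e.hasMoreElements();) {
            InetAddress addr = e.nextElement();
            if (addr.isLoopbackAddress()) {
                continue;                       // never pick 127.0.0.1 / ::1
            }
            if ((addr instanceof Inet6Address) == preferIPv6) {
                return addr;                    // address in the preferred family
            }
            if (fallback == null) {
                fallback = addr;                // non-loopback address of the other family
            }
        }
        return fallback;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(resolve(args.length > 0 ? args[0] : "eth0", false));
    }
}

With the eth0 interface listed above (fe80:... plus 10.133.184.160) and IPv4 preferred, this would return 10.133.184.160 rather than 127.0.0.1.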



 Comments   
Comment by Joe Fialli [ 19/Dec/11 ]

A fix has been completed for this issue.

Here are the network utility results with the fix.

**************************************************
Display name: eth0
Name: eth0
PreferIPv6Addresses: false
InetAddress: /fe80:0:0:0:223:8bff:fe64:7ac4%2
InetAddress: /10.133.184.158
Up? true
Loopback? false
PointToPoint? false
Supports multicast? true
Virtual? false
Hardware address: [0, 35, -117, 100, 122, -60]
MTU: 1500
Network Inet Address (preferIPV6=false) /10.133.184.158
Network Inet Address (preferIPV6=true) /fe80:0:0:0:223:8bff:fe64:7ac4%2
Dec 19, 2011 8:22:23 AM com.sun.enterprise.mgmt.transport.NetworkUtility resolveBindInterfaceName
INFO: Inet4Address.getByName(eth0) returned a local address eth0/127.0.0.1 so ignoring it
Dec 19, 2011 8:22:23 AM com.sun.enterprise.mgmt.transport.NetworkUtility resolveBindInterfaceName
INFO: Inet6Address.getByName(eth0) returned a local address eth0/127.0.0.1 so ignoring it
resolveBindInterfaceName(eth0)=10.133.184.158

The INFO messages confirming the fix will be deleted before the putback.

Comment by Joe Fialli [ 19/Dec/11 ]

The blocked issue requires specifying gms-bind-interface-address
as a network interface name, because some machines in the cluster run the virtualization software Xen,
which creates virtual network interfaces that interfere with the automated selection
of an IP address to represent a machine.

Comment by Joe Fialli [ 04/Jan/12 ]

We did not feel comfortable including this fix at this late stage of 3.1.2.
This functionality is not explicitly documented; this method was suggested as
an easier configuration alternative to what is documented.

Here is a link to the documented way to specify which network interface to use for GMS
on a multihomed machine.
Link: http://docs.oracle.com/cd/E18930_01/html/821-2426/gjfnl.html#gjdlw





[GLASSFISH-18024] virtual network interfaces introduced by virtualization systems regress Glassfish 3.1.2 GMS auto selection of an appropriate network interface to use Created: 16/Dec/11  Updated: 22/May/13  Resolved: 22/May/13

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.2_b14
Fix Version/s: 4.0

Type: Bug Priority: Minor
Reporter: mzh777 Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OEL 5,
JDK 1.6.0_24 64 bits


Attachments: File autotimerFO.war     Zip Archive EJB_Autotimer_FO.zip     Zip Archive issue-18024.zip    
Issue Links:
Dependency
depends on GLASSFISH-18047 specifying a network interface name f... Open
Tags: 312_failover, 312_qa, 3_1_2-exclude, 3_1_2-release-note-added, 3_1_2-release-notes

 Description   

The EJB automatic timer migration works in shutdown-instance mode. That means that after the automatic timer is created, if you use asadmin stop-instance, the timer migration happens. But it does not work in crash mode, when the instance containing the timer is killed.

The EJB timer app and logs are attached. The steps to reproduce the error are in EJB_Autotimer_FO/ant.output. The DAS log and instance logs are under EJB_Autotimer_FO/testListAutoTimer/logs/st-domain and st-cluster.



 Comments   
Comment by mzh777 [ 16/Dec/11 ]

Since the killing of instance103 happened after the tests, attaching more logs with the stack trace during failover.

Comment by marina vatkina [ 16/Dec/11 ]

Joe, can you take a look? See the logs issue-18024.zip

The tx-log-dir is stored correctly on instance103:

[#|2011-12-15T16:50:08.408-0800|INFO|glassfish3.1.2|javax.enterprise.system.core.transaction.com.sun.jts.jta|_ThreadID=10;_ThreadName=Thread-2;|Storing GMS instance instance103 data TX_LOG_DIR : /net/asqe-logs.us.oracle.com/export1/hatxLogsMing/instance103/tx|#]

But on instance104 it's not found:

[#|2011-12-16T00:57:32.670+0000|INFO|glassfish3.1.2|javax.enterprise.system.core.transaction.com.sun.jts.jta|_ThreadID=34;_ThreadName=Thread-2;|[GMSCallBack] Recovering for instance: instance103 logdir: null|#]

Comment by Joe Fialli [ 16/Dec/11 ]

There appears to be an inconsistency in machine configurations for the test.
Some cluster members are defaulting to IPv6 addresses for GMS while others are defaulting to IPv4.
instance103, instance106 and instance110 have a GMS_LISTENER for tcp at an IPv6 address.

Potentially the workaround from http://java.net/jira/browse/GLASSFISH-17926 needs to be employed so that IPv6 addresses
will no longer appear in the log. Since instance103 was the one that did not get the info, and it has an IPv6 address below,
we really need to work around the known issue described in GF-17926. (The fix was integrated in the BG workspace on 12/9;
I am uncertain whether that is in gf 4.0 b14.)

From the server.log:
[#|2011-12-16T00:48:45.055+0000|INFO|glassfish3.1.2|ShoalLogger|_ThreadID=20;_ThreadName=Thread-2;|GMS1092: GMS View Change Received for group: st-cluster : Members in view for CLUSTER_STOP_EVENT(before change analysis) are :
1: MemberId: instance101, MemberType: CORE, Address: 10.133.184.158:9160:228.9.217.5:29944:st-cluster:instance101
2: MemberId: instance102, MemberType: CORE, Address: 10.133.184.159:9106:228.9.217.5:29944:st-cluster:instance102
3: MemberId: instance103, MemberType: CORE, Address: fe80:0:0:0:fcff:ffff:feff:ffff%6:9099:228.9.217.5:29944:st-cluster:instance103
4: MemberId: instance104, MemberType: CORE, Address: 10.133.184.158:9166:228.9.217.5:29944:st-cluster:instance104
5: MemberId: instance105, MemberType: CORE, Address: 10.133.184.159:9168:228.9.217.5:29944:st-cluster:instance105
6: MemberId: instance106, MemberType: CORE, Address: fe80:0:0:0:fcff:ffff:feff:ffff%6:9115:228.9.217.5:29944:st-cluster:instance106
7: MemberId: instance107, MemberType: CORE, Address: 10.133.184.159:9158:228.9.217.5:29944:st-cluster:instance107
8: MemberId: instance109, MemberType: CORE, Address: 10.133.184.159:9152:228.9.217.5:29944:st-cluster:instance109
9: MemberId: instance110, MemberType: CORE, Address: fe80:0:0:0:fcff:ffff:feff:ffff%6:9095:228.9.217.5:29944:st-cluster:instance110
10: MemberId: server, MemberType: SPECTATOR, Address: 10.133.184.158:9165:228.9.217.5:29944:st-cluster:server

#]
Comment by mzh777 [ 16/Dec/11 ]

Network diagnostic util results on asqe-x2250-st3.us.oracle.com:

$ java -classpath shoal-gms-impl.jar com.sun.enterprise.mgmt.transport.NetworkUtility
Java property java.net.preferIPv6Addresses=false
AllLocalAddresses() = [/fe80:0:0:0:fcff:ffff:feff:ffff%6, /fe80:0:0:0:223:8bff:fe64:7a56%7, /10.133.184.160, /fe80:0:0:0:200:ff:fe00:0%5, /192.168.122.1]
interface name:vif0.0 isUp?:true
Found first interface.vif0.0 isUp?:true
Dec 16, 2011 10:30:23 AM com.sun.enterprise.mgmt.transport.NetworkUtility getFirstNetworkInterface
INFO: getFirstNetworkInterface result: interface name:vif0.0 address:/fe80:0:0:0:fcff:ffff:feff:ffff%6
getFirstNetworkInterface() = name:vif0.0 (vif0.0) index: 6 addresses:
/fe80:0:0:0:fcff:ffff:feff:ffff%6;

getFirstInetAddress(preferIPv6Addresses:false)=null
getFirstInetAddress()=/fe80:0:0:0:fcff:ffff:feff:ffff%6
getFirstInetAddress( true ) = /fe80:0:0:0:fcff:ffff:feff:ffff%6
getFirstInetAddress( false ) = null
getLocalHostAddress = asqe-x2250-st3/10.133.184.160
getFirstNetworkInteface() = name:vif0.0 (vif0.0) index: 6 addresses:
/fe80:0:0:0:fcff:ffff:feff:ffff%6;

getNetworkInetAddress(firstNetworkInteface, true) = /fe80:0:0:0:fcff:ffff:feff:ffff%6
getNetworkInetAddress(firstNetworkInteface, false) = null

-------------------------------------------------------

All Network Interfaces

**************************************************
Display name: vif0.0
Name: vif0.0
PreferIPv6Addresses: false
InetAddress: /fe80:0:0:0:fcff:ffff:feff:ffff%6
Up? true
Loopback? false
PointToPoint? false
Supports multicast? false
Virtual? false
Hardware address: [-2, -1, -1, -1, -1, -1]
MTU: 1500
Network Inet Address (preferIPV6=false) null
Network Inet Address (preferIPV6=true) /fe80:0:0:0:fcff:ffff:feff:ffff%6
resolveBindInterfaceName(vif0.0)=fe80:0:0:0:fcff:ffff:feff:ffff%6

**************************************************
Display name: peth0
Name: peth0
PreferIPv6Addresses: false
InetAddress: /fe80:0:0:0:fcff:ffff:feff:ffff%2
Up? true
Loopback? false
PointToPoint? false
Supports multicast? false
Virtual? false
Hardware address: [-2, -1, -1, -1, -1, -1]
MTU: 1500
Network Inet Address (preferIPV6=false) null
Network Inet Address (preferIPV6=true) /fe80:0:0:0:fcff:ffff:feff:ffff%2
resolveBindInterfaceName(peth0)=fe80:0:0:0:fcff:ffff:feff:ffff%2

**************************************************
Display name: eth0
Name: eth0
PreferIPv6Addresses: false
InetAddress: /fe80:0:0:0:223:8bff:fe64:7a56%7
InetAddress: /10.133.184.160
Up? true
Loopback? false
PointToPoint? false
Supports multicast? true
Virtual? false
Hardware address: [0, 35, -117, 100, 122, 86]
MTU: 1500
Network Inet Address (preferIPV6=false) /10.133.184.160
Network Inet Address (preferIPV6=true) /fe80:0:0:0:223:8bff:fe64:7a56%7
resolveBindInterfaceName(eth0)=127.0.0.1

**************************************************
Display name: virbr0
Name: virbr0
PreferIPv6Addresses: false
InetAddress: /fe80:0:0:0:200:ff:fe00:0%5
InetAddress: /192.168.122.1
Up? true
Loopback? false
PointToPoint? false
Supports multicast? true
Virtual? false
Hardware address: null
MTU: 1500
Network Inet Address (preferIPV6=false) /192.168.122.1
Network Inet Address (preferIPV6=true) /fe80:0:0:0:200:ff:fe00:0%5
resolveBindInterfaceName(virbr0)=192.168.122.1

**************************************************
Display name: lo
Name: lo
PreferIPv6Addresses: false
InetAddress: /0:0:0:0:0:0:0:1%1
InetAddress: /127.0.0.1
Up? true
Loopback? true
PointToPoint? false
Supports multicast? false
Virtual? false
Hardware address: null
MTU: 16436
Network Inet Address (preferIPV6=false) /127.0.0.1
Network Inet Address (preferIPV6=true) /0:0:0:0:0:0:0:1%1
resolveBindInterfaceName(lo)=127.0.0.1

$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:23:8B:64:7A:56
    inet addr:10.133.184.160 Bcast:10.133.191.255 Mask:255.255.248.0
    inet6 addr: fe80::223:8bff:fe64:7a56/64 Scope:Link
    UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
    RX packets:103208929 errors:0 dropped:0 overruns:0 frame:0
    TX packets:9896652 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:0
    RX bytes:26781337018 (24.9 GiB) TX bytes:3843440786 (3.5 GiB)

eth1 Link encap:Ethernet HWaddr 00:23:8B:64:7A:57
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:dffa0000-dffc0000

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:525791 errors:0 dropped:0 overruns:0 frame:0
TX packets:525791 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:43047668 (41.0 MiB) TX bytes:43047668 (41.0 MiB)

peth0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF
inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
UP BROADCAST RUNNING NOARP MTU:1500 Metric:1
RX packets:106850888 errors:0 dropped:0 overruns:0 frame:0
TX packets:9907515 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:27051284760 (25.1 GiB) TX bytes:3844287158 (3.5 GiB)
Memory:dffe0000-e0000000

sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

veth1 Link encap:Ethernet HWaddr 00:00:00:00:00:00
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

veth2 Link encap:Ethernet HWaddr 00:00:00:00:00:00
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

veth3 Link encap:Ethernet HWaddr 00:00:00:00:00:00
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

veth4 Link encap:Ethernet HWaddr 00:00:00:00:00:00
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

vif0.0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF
inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
UP BROADCAST RUNNING NOARP MTU:1500 Metric:1
RX packets:9896433 errors:0 dropped:0 overruns:0 frame:0
TX packets:103208954 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3843147487 (3.5 GiB) TX bytes:26781338668 (24.9 GiB)

vif0.1 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

vif0.2 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

vif0.3 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

vif0.4 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:24 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:2954 (2.8 KiB)

xenbr0 Link encap:Ethernet HWaddr FE:FF:FF:FF:FF:FF
UP BROADCAST RUNNING NOARP MTU:1500 Metric:1
RX packets:65446109 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:12163201681 (11.3 GiB) TX bytes:0 (0.0 b)

Comment by Joe Fialli [ 16/Dec/11 ]

Some of the machines in the cluster are configured with Xen and some are not.
See network interface "vif0.0" in http://wiki.xen.org/xenwiki/XenNetworking
for more info on the virtual network interfaces introduced by Xen
on some of the machines running the cluster (but not all of them).

The machines with Xen on them have the Xen network interface vif0.0, which has only an IPv6 address,
as their first network interface. The other machines in the cluster have a dual-stack (IPv4 and IPv6)
first network interface. There is an issue when not all machines have the same
homogeneous networking configured AND one does not specify which network interface to
use on a multihomed machine.

Quickest possible workaround:
asadmin create-cluster --bindaddress eth0 <clustername>

This works around the fact that not all multihomed machines in the cluster are configured in the same manner.
It tells all cluster members to use the "eth0" network interface.

Or remove/disable the Xen network interfaces on the machines if they should not be there.

Comment by Joe Fialli [ 19/Dec/11 ]

Due to Xen being installed on some of the machines that compose the cluster for this issue,
not all machines are selecting network interfaces with matching characteristics.
The Xen-introduced network interfaces are IPv6-only, and the machines in the cluster without
Xen have a dual-stack network interface for eth0. Thus, half the machines are using IPv6-only addresses
and the other half are using IPv4 addresses (as preferred for dual stack).

The suggested workaround of create-cluster --gms-bind-interface-address eth0 hit the
reported bug GLASSFISH-18047.

Comment by Joe Fialli [ 04/Jan/12 ]

Downgraded this issue to minor.

Ming confirmed that when the machine with Xen installed was removed from the cluster,
the test passed.

A review of the server logs showed that not all instances were on the same subnet.
All instances that were on a machine with Xen installed were incorrectly
selecting the Xen virtual network interface for GMS communications.

Summary of issue:

The introduction of non-multicast mode for Group Management Services (GMS) in Glassfish 3.1.2 altered which network interface was automatically selected to be used on a multi-homed machine for clustering communications. This change can result in some clustered instances
no longer being able to join their running cluster.

In GlassFish 3.1-3.1.1, a network interface that did not support multicast was not considered a candidate
for the network interface used for cluster communications.
Thus, the automatic selection of network interfaces was impacted in 3.1.2. Specifically,
virtual network interfaces that used to be ignored, since they did not support multicast,
are now incorrectly selected as the default network interface for cluster communications.

Workarounds:

  • Either disable/remove the network interfaces that are being selected incorrectly.

Or

  • Specify which network interface to use on the machine(s) selecting the incorrect network interface. Here is a pointer to documentation on how to specify which network interface
    to use on a multihomed machine.

Link: http://docs.oracle.com/cd/E18930_01/html/821-2426/gjfnl.html#gjdlw

Comment by Joe Fialli [ 09/Feb/12 ]

Fix for this is integrated into shoal-1.6.18 (shoal svn 1745).

Note that shoal 1.6.17 is in glassfish 3.1.2 so this is not fixed in Glassfish 3.1.2.

Comment by Tom Mueller [ 16/May/13 ]

Is this fixed in 4.0 since shoal 1.6.18 is in 4.0?

Comment by Joe Fialli [ 22/May/13 ]

This issue was reported against quite a complex environment.
We do not have an automated test to verify such an environment.

Here is the detailed commit message describing how this issue was addressed in Shoal GMS.

> Altered algorithm for selecting network interface. Unless java.net.preferIPv6Addresses is set to true,
> will favor network interface supporting IPv4 and multicast. Will settle for network interface that
> does not support multicast if one exists. Lastly will settle for network interface that does not
> support preferred IPv address format.
>
> Fix for GLASSFISH-18047: allow a network interface name as BIND_INTERFACE_ADDRESS.
> Allows one to set network interface such as "eth0" if all machines involved have same network interface name for a cluster.

The issue has been marked resolved but it is unverified.
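For illustration, here is a rough sketch of the selection order described in the quoted commit message. It is illustrative only; the actual Shoal NetworkUtility implementation differs in detail and handles more cases, and the class and method names here are made up:

import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Collections;

public class FirstInterfaceChooser {
    // Pick a network interface following the commit message: prefer one that is up,
    // non-loopback, supports multicast, and has an address in the preferred family
    // (IPv4 unless -Djava.net.preferIPv6Addresses=true); then settle for one without
    // multicast support; lastly settle for one with only addresses of the other family.
    static NetworkInterface choose() throws Exception {
        boolean preferIPv6 = Boolean.getBoolean("java.net.preferIPv6Addresses");
        NetworkInterface noMulticast = null;   // second choice
        NetworkInterface otherFamily = null;   // last resort
        for (NetworkInterface nif : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            if (!nif.isUp() || nif.isLoopback() || !nif.getInetAddresses().hasMoreElements()) {
                continue;   // skip interfaces that are down, loopback, or have no address
            }
            boolean preferredFamily = false;
            for (InetAddress addr : Collections.list(nif.getInetAddresses())) {
                if ((addr instanceof Inet4Address) != preferIPv6) {
                    preferredFamily = true;    // IPv4 when IPv4 is preferred, IPv6 otherwise
                }
            }
            if (preferredFamily && nif.supportsMulticast()) {
                return nif;                                    // best match
            } else if (preferredFamily && noMulticast == null) {
                noMulticast = nif;                             // preferred family, no multicast
            } else if (!preferredFamily && otherFamily == null) {
                otherFamily = nif;                             // only non-preferred addresses
            }
        }
        return noMulticast != null ? noMulticast : otherFamily;
    }
}

With the interface dump shown earlier in this issue, eth0 (IPv4 address, multicast supported) would win over the Xen vif0.0 interface (IPv6 link-local only, no multicast support), which is the behavior the fix is aiming for.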





[GLASSFISH-17926] Elasticity auto-scale up test failed on Ubuntu laptop due to GMS failures when multiple network interfaces exist Created: 07/Dec/11  Updated: 20/Dec/11  Resolved: 08/Dec/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.2_b07, 4.0_b12
Fix Version/s: 3.1.2

Type: Bug Priority: Major
Reporter: mzh777 Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 11.10 on Dell E6420 laptop
java version "1.6.0_24"


Tags: 312_gms, 312_qa, 40_qa

 Description   

GF4.0 b12.

I ran into inconsistent behavior of Elasticity for GF4. The auto-scale test passed at home with a wireless network connection, while it has intermittent failures at work with an Ethernet cable connection. The NetworkUtility test results when the failure was reproduced:
$ ifconfig -a
...
wlan0 Link encap:Ethernet HWaddr 08:11:96:0c:14:b0
inet6 addr: fe80::a11:96ff:fe0c:14b0/64 Scope:Link
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:12 errors:0 dropped:0 overruns:0 frame:0
TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1006 (1.0 KB) TX bytes:22446 (22.4 KB)

$ java -classpath glassfish/modules/shoal-gms-impl.jar com.sun.enterprise.mgmt.transport.NetworkUtility
AllLocalAddresses() = [/fe80:0:0:0:5e26:aff:fe7e:bef8%2, /10.132.179.12]
getFirstNetworkInterface() = name:wlan0 (wlan0) index: 3 addresses:
/fe80:0:0:0:a11:96ff:fe0c:14b0%3;

Dec 7, 2011 1:36:09 PM com.sun.enterprise.mgmt.transport.NetworkUtility getNetworkInetAddress
INFO: enter getFirstInetAddress networkInterface=name:wlan0 (wlan0) index: 3 addresses:
/fe80:0:0:0:a11:96ff:fe0c:14b0%3;
preferIPv6=true
getFirstInetAddress( true ) = /fe80:0:0:0:a11:96ff:fe0c:14b0%3
Dec 7, 2011 1:36:09 PM com.sun.enterprise.mgmt.transport.NetworkUtility getNetworkInetAddress
INFO: enter getFirstInetAddress networkInterface=name:wlan0 (wlan0) index: 3 addresses:
/fe80:0:0:0:a11:96ff:fe0c:14b0%3;
preferIPv6=false
getFirstInetAddress( false ) = null
getLocalHostAddress = chicago/127.0.1.1
getFirstNetworkInteface() = name:wlan0 (wlan0) index: 3 addresses:
/fe80:0:0:0:a11:96ff:fe0c:14b0%3;

getFirstInetAddress(firstNetworkInteface, true) = /fe80:0:0:0:a11:96ff:fe0c:14b0%3
getFirstInetAddress(firstNetworkInteface, false) = null

All Network Interfaces
Display name: wlan0
Name: wlan0
InetAddress: /fe80:0:0:0:a11:96ff:fe0c:14b0%3
Up? false
Loopback? false
PointToPoint? false
Supports multicast? true
Virtual? false
Hardware address: [8, 17, -106, 12, 20, -80]
MTU: 1500
Dec 7, 2011 1:36:09 PM com.sun.enterprise.mgmt.transport.NetworkUtility getNetworkInetAddress
INFO: enter getFirstInetAddress networkInterface=name:wlan0 (wlan0) index: 3 addresses:
/fe80:0:0:0:a11:96ff:fe0c:14b0%3;
preferIPv6=false
Exception in thread "main" java.lang.NullPointerException
at com.sun.enterprise.mgmt.transport.NetworkUtility.displayInterfaceInformation(NetworkUtility.java:695)
at com.sun.enterprise.mgmt.transport.NetworkUtility.main(NetworkUtility.java:674)

The original server.log stack trace:
[#|2011-12-01T14:14:13.673-0800|WARNING|44.0|elasticity-logger|_ThreadID=22;_ThreadName=Thread-2;|Error during groupHandle.sendMessage(cloud-2, ConferencePlanner; size=1567)
com.sun.enterprise.ee.cms.core.GMSException: java.io.IOException: failed to connect to fe80:0:0:0:a11:96ff:fe0c:14b0%3:9188:228.9.10.114:7100:ConferencePlanner:cloud-2
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.sendMessage(GroupCommunicationProviderImpl.java:380)
at com.sun.enterprise.ee.cms.impl.base.GroupHandleImpl.sendMessage(GroupHandleImpl.java:142)
at org.glassfish.elasticity.group.gms.GroupServiceProvider.sendMessage(GroupServiceProvider.java:276)
at org.glassfish.elasticity.engine.message.MessageProcessor.sendMessage(MessageProcessor.java:151)
at org.glassfish.elasticity.expression.ElasticExpressionEvaluator.evaluate(ElasticExpressionEvaluator.java:95)
at org.glassfish.elasticity.expression.ElasticExpressionEvaluator.evaluate(ElasticExpressionEvaluator.java:50)
at org.glassfish.elasticity.engine.util.ExpressionBasedAlert.execute(ExpressionBasedAlert.java:91)
at org.glassfish.elasticity.engine.container.AlertContextImpl.run(AlertContextImpl.java:71)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: failed to connect to fe80:0:0:0:a11:96ff:fe0c:14b0%3:9188:228.9.10.114:7100:ConferencePlanner:cloud-2
at com.sun.enterprise.mgmt.transport.grizzly.grizzly2.GrizzlyTCPMessageSender.send(GrizzlyTCPMessageSender.java:134)
at com.sun.enterprise.mgmt.transport.grizzly.grizzly2.GrizzlyTCPMessageSender.doSend(GrizzlyTCPMessageSender.java:99)
at com.sun.enterprise.mgmt.transport.AbstractMessageSender.send(AbstractMessageSender.java:74)
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyNetworkManager.send(GrizzlyNetworkManager.java:285)



 Comments   
Comment by Joe Fialli [ 07/Dec/11 ]

The GMS selecting the incorrect network interface was due to the wireless network interface being in a slightly unusual and inconsistent state.

From ifconfig -a:

wlan0 Link encap:Ethernet HWaddr 08:11:96:0c:14:b0
inet6 addr: fe80::a11:96ff:fe0c:14b0/64 Scope:Link
UP BROADCAST MULTICAST MTU:1500 Metric:1

Note that the interface is not running but it does have an IPv6 address assigned.

The fix in Shoal GMS NetworkUtility was to check both that the network interface is UP
and that it has an IP address assigned. Before the fix, the code incorrectly assumed that an
interface with an IP address assigned was up.
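
A minimal sketch of the kind of check described (again, not the actual Shoal code; the class name is made up):

import java.net.NetworkInterface;
import java.net.SocketException;

// An interface is only considered usable if it reports itself as up AND
// actually has at least one address assigned; an assigned address alone
// (the pre-fix assumption) is not enough.
public final class InterfaceStateCheck {

    static boolean isUsable(NetworkInterface nif) throws SocketException {
        return nif.isUp() && nif.getInetAddresses().hasMoreElements();
    }
}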

Fix for this issue was confirmed to work on this system.

The workaround for the misconfigured network interface wlan0 was to explicitly
run the following:

% sudo ifconfig wlan0 down

After running this, the inconsistent wlan0 network interface no longer
had an IP address assigned when running "ifconfig -a wlan0".

Comment by Joe Fialli [ 08/Dec/11 ]

Fix committed to shoal 1.6 source code workspace.

Will integrate into 3.1.2 and 4.0 workspace today.

Comment by Joe Fialli [ 09/Dec/11 ]

Shoal 1.6.15 contains this fix. Integrated into the GF 3.1.2 and 4.0 workspaces today.





[GLASSFISH-17798] get-health always say instance as not started Created: 22/Nov/11  Updated: 17/Oct/12

Status: In Progress
Project: glassfish
Component/s: group_management_service
Affects Version/s: 4.0
Fix Version/s: not determined

Type: Bug Priority: Minor
Reporter: Anissa Lam Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File clustersetup.sh     Text File server.log    
Tags: 3_1_2-exclude, 3_1_x-exclude

 Description   

This is on the latest workspace, rev #51051, on the 3.1.2 branch.
Tried several times, and always reproducible.
I created a cluster (clusterABC) with 4 instances, all using the localhost-domain1 node.
I can start the instances, but get-health always says they are not started.

Here is the copy&paste of my commands. I will attach server.log as well.

~/Awork/V3/3.1.2/3.1.2 1)  cd $AS3/bin
~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 2)  asadmin list-clusters
clusterABC not running
Command list-clusters executed successfully.
~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 3)  asadmin list-instances --long
NAME   HOST       PORT   PID  CLUSTER     STATE         
ABC-4  localhost  24848  --   clusterABC   not running  
ABC-3  localhost  24849  --   clusterABC   not running  
ABC-2  localhost  24850  --   clusterABC   not running  
ABC-1  localhost  24851  --   clusterABC   not running  
Command list-instances executed successfully.
~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 4)  asadmin start-instance ABC-1
Waiting for ABC-1 to start ..........
Successfully started the instance: ABC-1
instance Location: /Users/anilam/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/nodes/localhost-domain1/ABC-1
Log File: /Users/anilam/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/nodes/localhost-domain1/ABC-1/logs/server.log
Admin Port: 24851
Command start-local-instance executed successfully.
The instance, ABC-1, was started on host localhost
Command start-instance executed successfully.
~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 5)  asadmin start-instance ABC-2
Waiting for ABC-2 to start ..........
Successfully started the instance: ABC-2
instance Location: /Users/anilam/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/nodes/localhost-domain1/ABC-2
Log File: /Users/anilam/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/nodes/localhost-domain1/ABC-2/logs/server.log
Admin Port: 24850
Command start-local-instance executed successfully.
The instance, ABC-2, was started on host localhost
Command start-instance executed successfully.
~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 6)  asadmin list-instances --long
NAME   HOST       PORT   PID    CLUSTER     STATE         
ABC-4  localhost  24848  --     clusterABC   not running  
ABC-3  localhost  24849  --     clusterABC   not running  
ABC-2  localhost  24850  12517  clusterABC   running      
ABC-1  localhost  24851  12507  clusterABC   running      
Command list-instances executed successfully.
~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 7)  asadmin get-health clusterABC
ABC-1 not started
ABC-2 not started
ABC-3 not started
ABC-4 not started
Command get-health executed successfully.
~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 8)  asadmin start-cluster clusterABC
Command start-cluster executed successfully.
~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 9)  asadmin list-instances --long
NAME   HOST       PORT   PID    CLUSTER     STATE     
ABC-4  localhost  24848  12540  clusterABC   running  
ABC-3  localhost  24849  12541  clusterABC   running  
ABC-2  localhost  24850  12517  clusterABC   running  
ABC-1  localhost  24851  12507  clusterABC   running  
Command list-instances executed successfully.
~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 10)  asadmin get-health clusterABC
ABC-1 not started
ABC-2 not started
ABC-3 not started
ABC-4 not started
Command get-health executed successfully.
~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 11)  



 Comments   
Comment by Joe Fialli [ 22/Nov/11 ]

Unable to recreate the reported issue with build 51075.
Attached a shell script called clustersetup.sh to standardize HOW the cluster and instances are created.
(GF_HOME must be configured to point to a valid 3.1.2 installation.)
My results from running the script contradict the reported issue.

$GF_HOME/bin/asadmin list-instances
instance01 running
instance02 running
instance03 running
Command list-instances executed successfully.
$GF_HOME/bin/asadmin get-health myCluster
instance01 started since Tue Nov 22 11:38:25 EST 2011
instance02 started since Tue Nov 22 11:38:25 EST 2011
instance03 started since Tue Nov 22 11:38:25 EST 2011
Command get-health executed successfully.

***********
Analysis:

The submitted DAS server.log shows no evidence that multicast is working.
Is it possible that this was attempted while connected to VPN?
VPN will interfere with multicast.

Please submit the output of "ifconfig -a" and also follow the HA admin guide instructions for validating
that multicast is working properly on your system:
http://download.oracle.com/docs/cd/E18930_01/html/821-2426/gjfnl.html#gklhd
The instructions assume two different machines, but you can check whether multicast is working between processes
on the same machine by opening two terminal windows on that machine.
Note that multicast does not work when one is connected via VPN
(it disables multicast as a protection mechanism).

Specifying bindinterfaceaddress of 127.0.0.1 allows one to work with clusters on one machine while
connected via VPN.
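
For a quick check of the kind described above, a standalone UDP multicast test along these lines can be run in two terminal windows on the same machine. This is only an illustration of what asadmin validate-multicast exercises, not its implementation; the group address and port are example values.

import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

// Illustration only: a bare-bones UDP multicast check.
// Run "java MulticastSmokeTest recv" in one terminal and
// "java MulticastSmokeTest send" in another on the same machine.
public class MulticastSmokeTest {
    public static void main(String[] args) throws Exception {
        InetAddress group = InetAddress.getByName("228.9.3.1"); // example group/port
        int port = 2048;
        MulticastSocket socket = new MulticastSocket(port);
        socket.joinGroup(group);
        if (args.length > 0 && "send".equals(args[0])) {
            byte[] data = "hello-cluster".getBytes("UTF-8");
            socket.send(new DatagramPacket(data, data.length, group, port));
            System.out.println("sent");
        } else {
            byte[] buf = new byte[8192];
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            socket.receive(packet); // blocks until a datagram arrives
            System.out.println("received: "
                    + new String(packet.getData(), 0, packet.getLength(), "UTF-8"));
        }
        socket.close();
    }
}

If the receiver never prints anything while the sender runs, multicast is not usable between the two processes (as is typical when connected via VPN).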

Comment by Joe Fialli [ 22/Nov/11 ]

The attached shell script creates a domain, a cluster, and 3 instances for the cluster, and starts
up the cluster. It validates that the cluster started using "asadmin get-health" and "asadmin list-instances".
The user must edit the script variable GF_HOME to point to a valid GF v3.1.2 installation.

Comment by Anissa Lam [ 22/Nov/11 ]

Yes, I saw the issue when I was working from home and using VPN.
So, is this a known issue that get-health will NOT provide a correct state of the instance when it is on VPN ?

I think that since there is no way to fix the code if one is on VPN, then even though you cannot give the exact state like 'FAILED' or 'STOPPED' and the timestamp, it should at least report the correct status. It shouldn't just say 'not started'; instead, it should at least report whether the instance is running or not. Can the code detect that multicast is not working and, like list-instances, find out the status of the instance and return that?

The console displays whatever get-health returns, and telling the user that the instance is 'not running' when it actually is running doesn't sound acceptable, especially when the status from list-instances displayed on the same screen says 'RUNNING' and the next line says 'not running', giving conflicting information.

Comment by Joe Fialli [ 22/Nov/11 ]

get-health reports the status of GMS.
GMS in multicast mode (the default) only works when multicast is working.

Please see Bobby's blog; you are misinterpreting the results.
asadmin get-health only works correctly when GMS is working correctly
(asadmin get-health is a GMS client, and it can only work as well as the GMS subsystem is working).

http://blogs.oracle.com/bobby/entry/validating_multicast_transport_where_d

Comment by Anissa Lam [ 22/Nov/11 ]

As a user, when I am seeing the following:

~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 8) asadmin start-cluster clusterABC
Command start-cluster executed successfully.

~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 9) asadmin list-instances --long
NAME HOST PORT PID CLUSTER STATE
ABC-4 localhost 24848 12540 clusterABC running
ABC-3 localhost 24849 12541 clusterABC running
ABC-2 localhost 24850 12517 clusterABC running
ABC-1 localhost 24851 12507 clusterABC running
Command list-instances executed successfully.

~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 10) asadmin get-health clusterABC
ABC-1 not started
ABC-2 not started
ABC-3 not started
ABC-4 not started
Command get-health executed successfully.
~/Awork/V3/3.1.2/3.1.2/dist-gf/glassfish/bin 11)

I can only say that 'get-health' is giving me the wrong information. The server instance is obviously running, so why does 'get-health' say it is not started?
If there is any issue that prevents "get-health" from giving the correct information, then it should return an error informing the user what the problem is. Giving the wrong info and saying the command executed successfully is not acceptable.

Comment by Joe Fialli [ 22/Nov/11 ]

reduced priority from critical to minor.

My recommendation is to change "not started" to "unknown".
The asadmin get-health command tells the state of the cluster
from the GMS point of view. If multicast is not working properly
and cluster is not forming properly, that is what the command should relay.

Comment by Bobby Bissett [ 23/Nov/11 ]

"I can only say that 'get-health' is giving me the wrong information. The server instance is obviously running, why 'get-health' says it is not started ?
If there is any issue that prevents "get-health" to give the correct information, then it should return an error informing the user what the problem is. Giving the wrong info and says executed successfully is not acceptable."

That's the way it is. The whole POINT of get-health is to tell you the state of the cluster. If the instances are up, but can't communicate, then there's a serious problem and the only way the user will know it is by running get-health and seeing the wrong result. This is all documented.

In the admin console, you can say whatever you want. The enum name is "NOT_RUNNING" but you can say whatever you want.

Comment by Bobby Bissett [ 23/Nov/11 ]

When the admin console gets the output from the get-health command, it's getting the enum name from this enumeration:

// NOT_RUNNING means there is no time information associated
public static enum STATE {
    NOT_RUNNING (strings.getString("state.not_running")),
    RUNNING     (strings.getString("state.running")),
    REJOINED    (strings.getString("state.rejoined")),
    FAILURE     (strings.getString("state.failure")),
    SHUTDOWN    (strings.getString("state.shutdown"));

    private final String stringVal;

    STATE(String stringVal) {
        this.stringVal = stringVal;
    }

    @Override
    public String toString() {
        return stringVal;
    }
}

There is no point in changing the name of the state in the enum; it's separate from the i18n'ed value that is presented to the user. So when the admin console sees that state, it can output anything you want. Are you using the LocalStrings.properties file in the gms-bootstrap module to get the actual text to use? If so, we can change that to say "not joined" instead. Otherwise, this issue doesn't really affect GMS since you can use whatever text you want.

Just wanted to check to see if you're using our props file or your own for the text the user sees.
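
For example (a hypothetical console-side helper, not existing GlassFish code), the console could key its own localized labels off the enum name rather than off the server's text, so the browser locale controls what the user sees:

import java.util.Locale;
import java.util.ResourceBundle;

// Hypothetical console-side helper: map the STATE enum name returned by
// get-health (e.g. "NOT_RUNNING") to a label from the console's own bundle.
public final class HealthLabels {

    static String labelFor(String stateName, Locale browserLocale) {
        // "ConsoleStrings" is an assumed bundle with keys such as state.NOT_RUNNING=not joined
        ResourceBundle bundle = ResourceBundle.getBundle("ConsoleStrings", browserLocale);
        String key = "state." + stateName;
        return bundle.containsKey(key) ? bundle.getString(key) : stateName;
    }
}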

Comment by Anissa Lam [ 23/Nov/11 ]

I get it now.
I feel it would be very nice if the user could perform validate-multicast from the console.
Would it be possible to make validate-multicast a remote command so that the console can call it? Or is that too much to ask for 3.1.2?
Thanks Joe and Bobby for helping me understand this.

Comment by Bobby Bissett [ 23/Nov/11 ]

Nope, validate-multicast has to be a local command only because it needs to be run on each machine that will host an instance. In fact, it's better if the server is not up when the command is run. If you're bored, you can watch a screencast with the details:

http://www.youtube.com/watch?v=sJTDao9OpWA

There is an RFE for a tool that's more centralized, which I think fits what you're looking for. It won't happen for 3.1.2, but it's possible it could happen later: GLASSFISH-13056

Comment by Joe Fialli [ 23/Nov/11 ]

It is too big a change for the 3.1.2 release to change the output of asadmin get-health, which
is documented in the asadmin get-health --help documentation.

Recommend considering fixing this in a major release.

We could add a release note in 3.1.2 that the "asadmin get-health" "not started" status applies
both when the instance is not running and when the instance is running but the current configuration
does not allow GMS communications (either multicast is not enabled properly or
non-multicast GMS mode is misconfigured).

Comment by Joe Fialli [ 23/Nov/11 ]

Exclude changing asadmin get-health output in a minor release.





[GLASSFISH-17777] localization issue with get-health command for REST client Created: 21/Nov/11  Updated: 22/Nov/11  Resolved: 22/Nov/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.2_b05
Fix Version/s: 3.1.2_b12, 4.0

Type: Bug Priority: Major
Reporter: Anissa Lam Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

The current get-health Properties field in the action report doesn't allow for localization. I just realized that when trying to add this info to the admin console.
The current action report says:

{"message":"ABC-1 stopped since Sun Nov 20 12:02:12 PST 2011\nABC-3 started since Sun Nov 20 12:00:20 PST 2011\nABC-4 failed since Sun Nov 20 12:03:07 PST 2011\nABC-5 not started","command":"get-health AdminCommand","exit_code":"SUCCESS",
"properties":

{ "ABC-5":"not started", "ABC-4":"failed since Sun Nov 20 12:03:07 PST 201", "ABC-3":"started since Sun Nov 20 12:00:20 PST 201", "ABC-1":"stopped since Sun Nov 20 12:02:12 PST 201"}

,"
extraProperties":{"methods":[

{"name":"GET"}

,{}]}}

This is probably returned based on the locale that the server is running in, but the console may be running with a different browser locale.
This makes it hard to localize. Also, it may be better to separate the status (failed, started, not started, etc.) from the actual data (the time).
Maybe properties can be a List of Maps.
Each Map would have the following keys:

Name: instance name
Status: Started, Stopped, Not Started, Failed etc. (an enum will be good)
timestamp: the time expressed as a Long, so that it can be converted back to Date.

This is similar to the list-instances command. You can try :4848/management/domain/list-instances.json

Sorry, I didn't realize there might be an issue when suggesting the fix before.

As of now, the get-health info is displayed in the Instance General Info page, which may have an i18n issue. When this bug is fixed, the console code will change to extract the info differently.



 Comments   
Comment by Bobby Bissett [ 21/Nov/11 ]

Hi Anissa,

For the state, we have an Enum of states already in HealthHistory.STATE in the gms-bootstrap module. Can you use those? I'll have to express the timestamp as String.valueOf(long) so that it can be stored, but that's simple enough for you to change back to long.

I'm not clear at all about how to stuff all this into properties in the report.getTopMessagePart() object. I know how to express the List of Maps of List of Maps as json or xml, but not how to get it into the report. Example of what I mean in json:

"properties":{ "ABC-1": { "Status" : "foo", "timestamp": "123" }}

Instead, could I store the props with a separator between the state and timestamp? I could do something like this in the get health command code:

top.addProperty(instanceName, health.state.name() + ":" + String.valueOf(health.time))

...and the result should be something like this:

"properties":

{ "ABC-1": "RUNNING:12345678", "ABC-2": "FAILURE:12345678", "ABC-3": "NOT_RUNNING:"}

Note that there's no time associated with the NOT_RUNNING state, so I can either leave it empty after the colon or use a flag like -1. Probably empty is better so you can use String.isEmpty() to check it. Let me know what you'd prefer.

Thanks,
Bobby

Comment by Anissa Lam [ 21/Nov/11 ]

Having:

"properties": [
            {  "name" : "ABC-1",
               "status" : "RUNNING",
               "time": "12345678"
            },

            {  "name" : "ABC-2",
               "status" : "FAILURE",
               "time": "12345678"
            },

	   {  "name" : "ABC-3",
               "status" : "NOT-RUNNING",
               "time": ""
            },
}
can be parsed much easier than

"properties":{ "ABC-1": "RUNNING:12345678", "ABC-2": "FAILURE:12345678", "ABC-3": "NOT_RUNNING:"}

It can be in the extraProperties like the list-instances command instead of using properties.
Can you take a look at how list-instances is done and do the same ?
thanks

Comment by Bobby Bissett [ 21/Nov/11 ]

Ok, I think I have it. This work for you?

{"message":"inst1 started since Mon Nov 21 13:42:39 EST 2011\ninst2 stopped since Mon Nov 21 15:46:09 EST 2011\ninst3 not started","command":"get-health AdminCommand","exit_code":"SUCCESS","extraProperties":{"methods":[

{"name":"GET"}

,{}],"instances":[

{"status":"RUNNING","name":"inst1","time":"1321900959398"}

,

{"status":"SHUTDOWN","name":"inst2","time":"1321908369298"}

,

{"status":"NOT_RUNNING","name":"inst3","time":""}

]}}

So in extraProperties, the key "instances" is mapped to a list of maps containing the name, status, and time for each instance. The status matches the enum mentioned above.
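
For reference, here is a hedged sketch of how a REST client might walk that structure, assuming a JSON-P (javax.json) implementation is available on the classpath; the actual admin console code may parse it differently.

import java.io.StringReader;
import javax.json.Json;
import javax.json.JsonObject;
import javax.json.JsonValue;

// Sketch only: extract the per-instance entries from the get-health action report.
public class GetHealthClient {

    public static void printHealth(String json) {
        JsonObject report = Json.createReader(new StringReader(json)).readObject();
        for (JsonValue v : report.getJsonObject("extraProperties").getJsonArray("instances")) {
            JsonObject instance = (JsonObject) v;
            String name = instance.getString("name");
            String status = instance.getString("status"); // matches the HealthHistory.STATE names
            String time = instance.getString("time");     // empty string for NOT_RUNNING
            System.out.println(name + " -> " + status
                    + (time.isEmpty() ? "" : " since epoch ms " + time));
        }
    }
}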

Comment by Anissa Lam [ 21/Nov/11 ]

yes, that will work. thanks.

Comment by Bobby Bissett [ 21/Nov/11 ]

Just committed the fix in the 3.1.2 branch. Will fix in trunk as well and mark it fixed here.

Comment by Bobby Bissett [ 22/Nov/11 ]

Changes checked into revisions 51038 (3.1.2 branch) and 51054 (trunk).





[GLASSFISH-17571] get-health action report needs to be fixed so that it is parseable. Created: 02/Nov/11  Updated: 03/Nov/11  Resolved: 03/Nov/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.1
Fix Version/s: 3.1.2_b09, 4.0

Type: Bug Priority: Critical
Reporter: Anissa Lam Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
blocks GLASSFISH-17570 information from get-health is not di... Reopened

 Description   

The action report from get-health needs to be fixed.
As of now, this is what is returned:

{"message":"ABC-1 not started\nABC-2 not started","command":"get-health AdminCommand","exit_code":"SUCCESS","extraProperties":{"methods":[

{"name":"GET"}

,{}]}}

This makes it impossible to parse to get information for each instance of the cluster.

One suggestion is to provide the info in the properties field, like the list-clusters command does:

{"message":"clusterABC partially running\nclusterXYZ not running","command":"list-clusters AdminCommand","exit_code":"SUCCESS", "properties":

{"clusterABC":"PARTIALLY_RUNNING","clusterXYZ":"NOT_RUNNING"}

," extraProperties":{"methods":[

{"name":"GET"}

,{"messageParameters":{"id":

{"acceptableValues":"","optional":"true","type":"string","defaultValue":"domain"}

}}]}}



 Comments   
Comment by Joe Fialli [ 02/Nov/11 ]

Initially assigning to Bobby since he implemented this feature.

Anissa:
Since the bug is reported against 3.1.1, please provide feedback on whether you want this fixed in the 3.1.2 or the 4.0 time frame. We do not want to fix it for 3.1.2 unless there are plans to use the requested change.

Comment by Bobby Bissett [ 02/Nov/11 ]

Also, Anissa: can you tell me what I need to do to see the action report output?

Comment by Anissa Lam [ 02/Nov/11 ]

I am going through the exercise of CLI parity and realized we are not displaying the get-health info in the GUI, and found that the action report output cannot be consumed by clients through the REST API.

To see the action report, you can do the following in the browser, eg. for a cluster with name clusterABC:
http://localhost:4848/management/domain/clusters/cluster/clusterABC/get-health.json

and also try:
http://localhost:4848/management/domain/clusters/list-clusters.json

Using Chrome you will see the display on screen. You can omit .json to see the HTML output.

Comment by Bobby Bissett [ 03/Nov/11 ]

Anissa,

I need more information from you on what you want done. I think you're suggesting I change the output message (which goes to the user) from something about the state of the instances to something about the cluster instead:

--quote--
One suggestion can be providing the info in the properties field, like in list-clusters command:

{"message":"clusterABC partially running\nclusterXYZ not running","command":"list-clusters AdminCommand","exit_code":"SUCCESS", "properties":

{"clusterABC":"PARTIALLY_RUNNING","clusterXYZ":"NOT_RUNNING"}

," extraProperties":{"methods":[

{"name":"GET"}

,{"messageParameters":{"id":

{"acceptableValues":"","optional":"true","type":"string","defaultValue":"domain"}

}}]}}
--end quote--

The get-health command takes a cluster as an input and is supposed to tell the user the state of every instance in that cluster. So I can't change "ABC-1 not started\nABC-2 not started" to "clusterABC partially running." That loses all the information that the command is supposed to give.

I can use report.setExtraProperties() to give more info, but the info will still be the same:
<instance name> <state> [since <time>]

Am not sure why this can't be parsed. The output messages in LocalStrings.properties are:

get.health.instance.state={0} {1}
get.health.instance.state.since={0} {1} since {2}

The strings above are meant to match the output format that was used in 2.X. So I'm not sure what change you want me to make, and I still don't know how or what code is getting this info to parse it. Am assigning back to you for more info.

Comment by Bobby Bissett [ 03/Nov/11 ]

I didn't see your update to the issue until after I edited it. Weird.

I think I know what you're looking for now, and will give something a try. Will add some output to the issue for you to ok or not.

Comment by Bobby Bissett [ 03/Nov/11 ]

Hi Anissa,

Does this output look OK to you? Here's what the user sees, just for reference:

hostname% ./asadmin get-health clus
inst1 started since Thu Nov 03 14:14:49 EDT 2011
inst2 not started
inst3 stopped since Thu Nov 03 14:21:33 EDT 2011
Command get-health executed successfully.

Here's the JSON output from the server at http://localhost:4848/management/domain/clusters/cluster/clus/get-health.json

{"message":"inst1 started since Thu Nov 03 14:14:49 EDT 2011\ninst2 not started\ninst3 stopped since Thu Nov 03 14:21:33 EDT 2011","command":"get-health AdminCommand","exit_code":"SUCCESS","properties":

{"inst1":"started since Thu Nov 03 14:14:49 EDT 201","inst3":"stopped since Thu Nov 03 14:21:33 EDT 201","inst2":"not started"}

,"extraProperties":{"methods":[

{"name":"GET"}

,{}]}}

Those props in the action report look ok to you?

Comment by Anissa Lam [ 03/Nov/11 ]

If I can see something like the following in the actionReport from get-health, then I can extract and parse that accordingly:

"properties": {"ABC-1":"not started","ABC-2":"not started"}

Comment by Anissa Lam [ 03/Nov/11 ]

Just like your previous experience, I didn't see your last comment about the suggested change until I added my comments. Something is not quite right in JIRA.
Anyway, what you suggested

{"message":"inst1 started since Thu Nov 03 14:14:49 EDT 2011\ninst2 not started\ninst3 stopped since Thu Nov 03 14:21:33 EDT 2011","command":"get-health AdminCommand","exit_code":"SUCCESS","properties":

{"inst1":"started since Thu Nov 03 14:14:49 EDT 201","inst3":"stopped since Thu Nov 03 14:21:33 EDT 201","inst2":"not started"}

,"extraProperties":{"methods":[

{"name":"GET"}

,{}]}}

is exactly what I am looking for.
thanks.

Comment by Bobby Bissett [ 03/Nov/11 ]

Fixed in revisions 50649 (3.1.2 branch) and 50653 (trunk).





[GLASSFISH-17458] in non-multicast mode, one failed to connect per cluster instance at startup Created: 22/Oct/11  Updated: 07/Mar/12

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.2_b05
Fix Version/s: not determined

Type: Bug Priority: Minor
Reporter: zorro Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 2.6.18-164.0.0.0.1.el5



 Description   

Glassfish version 3.1.2 build 5

Lots of the following exceptions are being seen in the server logs.

Expected: All exceptions to be handled

http://aras2.us.oracle.com:8080/logs/gf31/gms//set_10_20_11_t_11_56_02/scenario_0012_Thu_Oct_20_12_00_37_PDT_2011.html

[#|2011-10-20T18:56:57.888+0000|INFO|glassfish3.1.2|ShoalLogger.nomcast|_ThreadID=84;_ThreadName=Thread-2;|failed to send message to a virtual multicast endpoint[10.133.184.137:9090:230.30.1.1:9090:clusterz1:Unknown_10.133.184.137_9090] message=[MessageImpl[v1:MASTER_NODE_MESSAGE: NAD, Target: 10.133.184.137:9090:230.30.1.1:9090:clusterz1:Unknown_10.133.184.137_9090 , Source: 10.133.184.207:9090:230.30.1.1:9090:clusterz1:server, MQ, ]
java.io.IOException: failed to connect to 10.133.184.137:9090:230.30.1.1:9090:clusterz1:Unknown_10.133.184.137_9090
at com.sun.enterprise.mgmt.transport.grizzly.grizzly1_9.GrizzlyTCPConnectorWrapper.send(GrizzlyTCPConnectorWrapper.java:132)
at com.sun.enterprise.mgmt.transport.grizzly.grizzly1_9.GrizzlyTCPConnectorWrapper.doSend(GrizzlyTCPConnectorWrapper.java:96)
at com.sun.enterprise.mgmt.transport.AbstractMessageSender.send(AbstractMessageSender.java:74)
at com.sun.enterprise.mgmt.transport.VirtualMulticastSender.doBroadcast(VirtualMulticastSender.java:134)
at com.sun.enterprise.mgmt.transport.AbstractMulticastMessageSender.broadcast(AbstractMulticastMessageSender.java:70)
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyNetworkManager.broadcast(GrizzlyNetworkManager.java:295)
at com.sun.enterprise.mgmt.MasterNode.send(MasterNode.java:1338)
at com.sun.enterprise.mgmt.MasterNode.discoverMaster(MasterNode.java:382)
at com.sun.enterprise.mgmt.MasterNode.startMasterNodeDiscovery(MasterNode.java:1235)
at com.sun.enterprise.mgmt.MasterNode.run(MasterNode.java:1204)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at com.sun.grizzly.TCPConnectorHandler.finishConnect(TCPConnectorHandler.java:297)
at com.sun.grizzly.connectioncache.client.CacheableConnectorHandler.finishConnect(CacheableConnectorHandler.java:230)
at com.sun.enterprise.mgmt.transport.grizzly.grizzly1_9.GrizzlyTCPConnectorWrapper$CloseControlCallbackHandler.onConnect(GrizzlyTCPConnectorWrapper.java:185)
at com.sun.grizzly.CallbackHandlerContextTask.doCall(CallbackHandlerContextTask.java:70)
at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59)
at com.sun.grizzly.ContextTask.run(ContextTask.java:71)
at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532)
at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513)
... 1 more

#]


 Comments   
Comment by Joe Fialli [ 24/Oct/11 ]

This is only occurring in the GlassFish Shoal QE test when running with GMS_DISCOVERY_URI_LIST set to
a list of instances that have not been created (or started) yet and the DAS initially joins the cluster.

There is only one exception per cluster member listed in GMS_DISCOVERY_URI_LIST.
For the test case this is reported against, there are 9 instances, so there are nine connection
failures when the DAS initially joins the cluster and those instances have yet to be created and started.
When the DAS first joins the cluster and no instance has even been created yet,
the DISCOVERY_URI_LIST contains connection info for yet-to-be-created instances.

We will demote the failed connections during discovery from WARNING to FINE; this
will enable us to debug network configuration issues (such as firewalls) without
the nuisance of always seeing one failure per cluster member referenced in GMS_DISCOVERY_URI_LIST.

Note: this issue does not apply to GMS_DISCOVERY_URI_LIST set to "generate" or to group discovery
via UDP multicast.
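
A minimal sketch of the kind of logging change described above (a hypothetical helper, not the actual Shoal code):

import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: during initial DISCOVERY_URI_LIST probing a refused connection is
// expected, so it is logged at FINE; outside discovery it remains a WARNING.
final class SendFailureLogger {
    private static final Logger LOG = Logger.getLogger("ShoalLogger.nomcast");

    static void logSendFailure(boolean duringDiscovery, String endpoint, Exception cause) {
        Level level = duringDiscovery ? Level.FINE : Level.WARNING;
        if (LOG.isLoggable(level)) {
            LOG.log(level, "failed to send message to a virtual multicast endpoint[" + endpoint + "]", cause);
        }
    }
}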

Comment by Tom Mueller [ 07/Mar/12 ]

Bulk update to set Fix Version to "not determined" for issues that had it set to a version that has already been released.





[GLASSFISH-17195] GMS fails to initialize due to GMSException: can not find a first InetAddress Created: 16/Aug/11  Updated: 18/Aug/11

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.1
Fix Version/s: None

Type: Bug Priority: Trivial
Reporter: arungupta Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows 7, JDK 7, Wireless Network Interface, GlassFish 3.1.1


Attachments: File ipconfig.out     File ListNetsEx.out     Zip Archive log_2011-08-18_07-41-35.zip    

 Description   

Created a 2-instance cluster using GlassFish 3.1.1 on Windows 7/JDK7 and starting the cluster/instances gives the following error:

[#|2011-08-16T10:59:17.116-0700|CONFIG|glassfish3.1.1|ShoalLogger|_ThreadID=1;_ThreadName=Thread-2;|
GrizzlyNetworkManager Configuration
BIND_INTERFACE_ADDRESS:null NetworkInterfaceName:null
TCPSTARTPORT..TCPENDPORT:9090..9200
MULTICAST_ADDRESS:MULTICAST_PORT:228.9.143.78:9635 MULTICAST_PACKET_SIZE:65536 MULTICAST_TIME_TO_LIV
E: default
FAILURE_DETECT_TCP_RETRANSMIT_TIMEOUT(ms):10000
ThreadPool CORE_POOLSIZE:20 MAX_POOLSIZE:50 POOL_QUEUE_SIZE:4096 KEEP_ALIVE_TIME(ms):60000
HIGH_WATER_MARK:1024 NUMBER_TO_RECLAIM:10 MAX_PARALLEL:15
START_TIMEOUT(ms):15000 WRITE_TIMEOUT(ms):10000
MAX_WRITE_SELECTOR_POOL_SIZE:30
VIRTUAL_MULTICAST_URI_LIST:null

#]

[#|2011-08-16T10:59:17.157-0700|INFO|glassfish3.1.1|grizzly|_ThreadID=20;_ThreadName=Thread-2;|GRIZZ
LY0001: Starting Grizzly Framework 1.9.36 - 8/16/11 10:59 AM|#]

[#|2011-08-16T10:59:17.184-0700|CONFIG|glassfish3.1.1|ShoalLogger|_ThreadID=1;_ThreadName=Thread-2;|
Grizzly controller listening on /0:0:0:0:0:0:0:0:9179. Controller started in 37 ms|#]

[#|2011-08-16T10:59:17.422-0700|SEVERE|glassfish3.1.1|javax.org.glassfish.gms.org.glassfish.gms|_Thr
eadID=1;_ThreadName=Thread-2;|GMSAD1017: GMS failed to start. See stack trace for additional informa
tion.
com.sun.enterprise.ee.cms.core.GMSException: failed to join group c1
at com.sun.enterprise.ee.cms.impl.base.GMSContextImpl.join(GMSContextImpl.java:181)
at com.sun.enterprise.ee.cms.impl.common.GroupManagementServiceImpl.join(GroupManagementServ
iceImpl.java:382)
at org.glassfish.gms.GMSAdapterImpl.initializeGMS(GMSAdapterImpl.java:576)
at org.glassfish.gms.GMSAdapterImpl.initialize(GMSAdapterImpl.java:199)
at org.glassfish.gms.bootstrap.GMSAdapterService.loadModule(GMSAdapterService.java:218)
at org.glassfish.gms.bootstrap.GMSAdapterService.checkCluster(GMSAdapterService.java:192)
at org.glassfish.gms.bootstrap.GMSAdapterService.postConstruct(GMSAdapterService.java:136)
at com.sun.hk2.component.AbstractCreatorImpl.inject(AbstractCreatorImpl.java:131)
at com.sun.hk2.component.ConstructorCreator.initialize(ConstructorCreator.java:91)
at com.sun.hk2.component.AbstractCreatorImpl.get(AbstractCreatorImpl.java:82)
at com.sun.hk2.component.SingletonInhabitant.get(SingletonInhabitant.java:67)
at com.sun.hk2.component.EventPublishingInhabitant.get(EventPublishingInhabitant.java:139)
at com.sun.hk2.component.AbstractInhabitantImpl.get(AbstractInhabitantImpl.java:76)
at com.sun.enterprise.v3.server.AppServerStartup.run(AppServerStartup.java:253)
at com.sun.enterprise.v3.server.AppServerStartup.doStart(AppServerStartup.java:145)
at com.sun.enterprise.v3.server.AppServerStartup.start(AppServerStartup.java:136)
at com.sun.enterprise.glassfish.bootstrap.GlassFishImpl.start(GlassFishImpl.java:79)
at com.sun.enterprise.glassfish.bootstrap.GlassFishDecorator.start(GlassFishDecorator.java:6
3)
at com.sun.enterprise.glassfish.bootstrap.osgi.OSGiGlassFishImpl.start(OSGiGlassFishImpl.jav
a:69)
at com.sun.enterprise.glassfish.bootstrap.GlassFishMain$Launcher.launch(GlassFishMain.java:1
17)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.sun.enterprise.glassfish.bootstrap.GlassFishMain.main(GlassFishMain.java:97)
at com.sun.enterprise.glassfish.bootstrap.ASMain.main(ASMain.java:55)
Caused by: com.sun.enterprise.ee.cms.core.GMSException: initialization failure
at com.sun.enterprise.mgmt.ClusterManager.<init>(ClusterManager.java:142)
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.initializeGroupCommuni
cationProvider(GroupCommunicationProviderImpl.java:164)
at com.sun.enterprise.ee.cms.impl.base.GMSContextImpl.join(GMSContextImpl.java:175)
... 25 more
Caused by: java.io.IOException: can not find a first InetAddress
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyNetworkManager.start(GrizzlyNetworkManag
er.java:376)
at com.sun.enterprise.mgmt.ClusterManager.<init>(ClusterManager.java:140)
... 27 more

#]

Here are the wireless network settings:

D:\tools\glassfish\3.1.1\ose-glassfish3-full>ipconfig /all

Windows IP Configuration

Host Name . . . . . . . . . . . . : ARUNGUP-LAP
Primary Dns Suffix . . . . . . . : st-users.us.oracle.com
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No
DNS Suffix Search List. . . . . . : st-users.us.oracle.com
us.oracle.com

Ethernet adapter Bluetooth Network Connection:

Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Bluetooth Device (Personal Area Network)
Physical Address. . . . . . . . . : 70-F1-A1-9B-D6-3C
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes

Wireless LAN adapter Wireless Network Connection:

Connection-specific DNS Suffix . : us.oracle.com
Description . . . . . . . . . . . : Intel(R) Centrino(R) Advanced-N 6200 AGN
Physical Address. . . . . . . . . : 00-27-10-17-FB-9C
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes
IPv4 Address. . . . . . . . . . . : 10.151.1.82(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.224.0
Lease Obtained. . . . . . . . . . : Tuesday, August 16, 2011 9:18:51 AM
Lease Expires . . . . . . . . . . : Tuesday, August 16, 2011 2:35:41 PM
Default Gateway . . . . . . . . . : 10.151.0.1
DHCP Server . . . . . . . . . . . : 10.196.255.250
DNS Servers . . . . . . . . . . . : 148.87.1.22
148.87.112.101
NetBIOS over Tcpip. . . . . . . . : Enabled

Ethernet adapter Local Area Connection:

Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . : us.oracle.com
Description . . . . . . . . . . . : Intel(R) 82577LM Gigabit Network Connection
Physical Address. . . . . . . . . : 00-26-B9-F1-15-19
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes

validate-multicast with/without --bindaddress passes.

The cluster could be successfully started with the wired network.

Tried the same steps on home wired/wireless network and got the same results.



 Comments   
Comment by Joe Fialli [ 17/Aug/11 ]

There is insufficient information to evaluate this issue. The submitted ipconfig output from Windows does not indicate whether multicast is enabled for the network interface.

GMS will not automatically select a network interface when NetworkInterface.supportsMulticast() does not
return true.

Please submit the output of running the following command to confirm whether NetworkInterface.supportsMulticast()
is returning true for the wireless network interface.

cd to glassfish installation directory and run the following command:

$ java -classpath glassfish3/glassfish/modules/shoal-gms-impl.jar com.sun.enterprise.mgmt.transport.NetworkUtility

Here is the output from my mac from running this.

AllLocalAddresses() = [/fe80:0:0:0:223:32ff:fe97:5cf7%5, /10.152.23.224, /fe80:0:0:0:0:0:0:1%1]
getFirstNetworkInterface() = name:en0 (en0)
getFirstInetAddress( true ) = /fe80:0:0:0:223:32ff:fe97:5cf7%5
getFirstInetAddress( false ) = /10.152.23.224
getFirstNetworkInteface() = name:en0 (en0)
getFirstInetAddress(firstNetworkInteface, true) = /fe80:0:0:0:223:32ff:fe97:5cf7%5
getFirstInetAddress(firstNetworkInteface, false) = /10.152.23.224

The issue is that automatic selection of the network interface is failing; the above is the unit test for this case.
Did you try the workaround of explicitly setting BIND_INTERFACE_ADDRESS?

Comment by Joe Fialli [ 17/Aug/11 ]

Lowered priority since explicitly setting BIND_INTERFACE_ADDRESS will work around the issue.
(See http://download.oracle.com/docs/cd/E18930_01/html/821-2426/gjfnl.html#gjdlw
for details on how to configure this property.)

Additionally, while it is inconvenient that automatic selection of a network interface is not working properly,
more info is needed to verify the network interface configuration since the submitted info does
not indicate whether MULTICAST is enabled. The request for additional info in the comments section will resolve
the shortage of information and allow us to determine whether this is truly a blocking issue
for wireless networks on Windows 7 using JDK 7.

Comment by Joe Fialli [ 17/Aug/11 ]

Awaiting confirmation from the reporter on whether this issue is due to the Windows firewall and requires
network configuration by the user to enable network communications between processes.

Specifically, create an inbound rule in Windows Firewall that allows all connections from all
other members of the cluster.

Comment by arungupta [ 17/Aug/11 ]

The output from the command is:

D:\tools\glassfish\3.1.1\ose-glassfish3-full>java -classpath glassfish3\glassfish\modules\shoal-gms-impl.jar com.sun.enterprise.mgmt.transport.NetworkUtility
AllLocalAddresses() = [/127.0.0.1, /0:0:0:0:0:0:0:1]
getFirstNetworkInterface() = name:lo (Software Loopback Interface 1)
getFirstInetAddress( true ) = null
getFirstInetAddress( false ) = null
getFirstNetworkInteface() = name:lo (Software Loopback Interface 1)
getFirstInetAddress(firstNetworkInteface, true) = null
getFirstInetAddress(firstNetworkInteface, false) = null

Explicitly setting GMS-BIND-INTERFACE-ADDRESS-c1 as a system property is the workaround.

Will the firewall rules be required even if the DAS/instances are all on the local machine?

Comment by Joe Fialli [ 18/Aug/11 ]

Attempted to recreate the reported issue to determine whether this is a general issue that all configurations
of GlassFish 3.1.1, Windows 7, and JDK 7 would hit when running over a wireless network.

Downloaded JDK 7 and GlassFish 3.1.1 to an HP Windows 7 Professional laptop with only a wireless network connection.
(This laptop was running Norton Security instead of the McAfee Security in this report.)

Was able to create a GlassFish cluster, and the instances were able to see each other.
Key log info:

Aug 18, 2011 7:36:01 AM com.sun.enterprise.admin.launcher.GFLauncherLogger info
INFO: JVM invocation command line:
C:\Program Files\Java\jdk1.7.0\bin\java.exe

[#|2011-08-18T07:37:21.524-0400|INFO|glassfish3.1.1|ShoalLogger|_ThreadID=18;_ThreadName=Thread-2;|GMS1092: GMS View Change Received for group: mycluster : Members in view for JOINED_AND_READY_EVENT(before change analysis) are :
1: MemberId: instance01, MemberType: CORE, Address: 10.0.1.11:9145:228.9.29.50:30647:mycluster:instance01
2: MemberId: instance02, MemberType: CORE, Address: 10.0.1.11:9122:228.9.29.50:30647:mycluster:instance02
3: MemberId: server, MemberType: SPECTATOR, Address: 10.0.1.11:9116:228.9.29.50:30647:mycluster:server

#]

Additionally, ran both ipconfig and the ListNetsEx program, which uses the java.net.NetworkInterface methods used by
GMS to locate the first inet address. Attaching the complete output; here is the key output from those two commands.

Wireless LAN adapter Wireless Network Connection:

Connection-specific DNS Suffix . : hsd1.ma.comcast.net.
Description . . . . . . . . . . . : Atheros AR9285 802.11b/g/n WiFi Adapter
Physical Address. . . . . . . . . : C4-17-FE-2C-6C-51
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes
Link-local IPv6 Address . . . . . : fe80::4839:4e5d:143c:ea11%11(Preferred)
IPv4 Address. . . . . . . . . . . : 10.0.1.11(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Lease Obtained. . . . . . . . . . : Monday, August 15, 2011 9:41:36 AM
Lease Expires . . . . . . . . . . : Monday, August 22, 2011 6:30:28 AM
Default Gateway . . . . . . . . . : 10.0.1.1
DHCP Server . . . . . . . . . . . : 10.0.1.1
DHCPv6 IAID . . . . . . . . . . . : 314841086
DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-12-EB-25-F6-00-26-9E-EC-3C-66
DNS Servers . . . . . . . . . . . : 10.0.1.1
NetBIOS over Tcpip. . . . . . . . : Enabled

ListNetsEx output:
Display name: Atheros AR9285 802.11b/g/n WiFi Adapter
Name: net3
InetAddress: /10.0.1.11
InetAddress: /fe80:0:0:0:4839:4e5d:143c:ea11%11
Up? true
Loopback? false
PointToPoint? false
Supports multicast? true
Virtual? false
Hardware address: [-60, 23, -2, 44, 108, 81]
MTU: 1500

Note that unlike the submitted case, the network interface "up" check is returning true, which allows the
inet address to be detected automatically.

There is something about the submitter's configuration that is causing java.net.NetworkInterface.isUp()
to incorrectly return false for the wireless adapter. We cannot be sure what it is since we were unable to recreate the
failure. Downgrading this issue to trivial since there is a workaround and we were not able to recreate
the failure with the provided configuration information.





[GLASSFISH-17016] Inconsistency between validate-multicast and GMS picking interface for binding Created: 12/Jul/11  Updated: 07/Dec/11

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: arungupta Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

In Ubuntu 11.04, with eth0 disabled and no wireless connectivity ifconfig reports:

arun@ArunUbuntu:~/tools/glassfish-web$ ifconfig
eth0 Link encap:Ethernet HWaddr 00:26:b9:f1:15:19
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:5698 errors:0 dropped:0 overruns:0 frame:0
TX packets:4575 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4717594 (4.7 MB) TX bytes:1129576 (1.1 MB)
Interrupt:20 Memory:f6900000-f6920000

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:192770 errors:0 dropped:0 overruns:0 frame:0
TX packets:192770 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:197208585 (197.2 MB) TX bytes:197208585 (197.2 MB)

Explicitly enabled MULTICAST on lo as:

sudo ifconfig lo multicast

and then got:

arun@ArunUbuntu:~/tools/glassfish-web$ ifconfig
eth0 Link encap:Ethernet HWaddr 00:26:b9:f1:15:19
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:5698 errors:0 dropped:0 overruns:0 frame:0
TX packets:4575 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4717594 (4.7 MB) TX bytes:1129576 (1.1 MB)
Interrupt:20 Memory:f6900000-f6920000

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MULTICAST MTU:16436 Metric:1
RX packets:192914 errors:0 dropped:0 overruns:0 frame:0
TX packets:192914 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:197220833 (197.2 MB) TX bytes:197220833 (197.2 MB)

Explicitly added route as:

sudo route add -net 224.0.0.0 netmask 240.0.0.0 dev lo

and then saw:

arun@ArunUbuntu:~/tools/glassfish-web$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
10.151.0.0 0.0.0.0 255.255.224.0 U 2 0 0 wlan0
169.254.0.0 0.0.0.0 255.255.0.0 U 1000 0 0 wlan0
224.0.0.0 0.0.0.0 240.0.0.0 U 0 0 0 lo
0.0.0.0 10.151.0.1 0.0.0.0 UG 0 0 0 wlan0

Running the validate-multicast command in two separate shells shows:

arun@ArunUbuntu:~/tools/glassfish-web$ ./glassfish3/bin/asadmin validate-multicast
Will use port 2048
Will use address 228.9.3.1
Will use bind interface null
Will use wait period 2,000 (in milliseconds)

Listening for data...
Sending message with content "ArunUbuntu" every 2,000 milliseconds
Received data from ArunUbuntu (loopback)
Received data from ArunUbuntu
Exiting after 20 seconds. To change this timeout, use the --timeout command line option.
Command validate-multicast executed successfully.

Creating a cluster with 2 instances and starting it shows the following log message:

Caused by: com.sun.enterprise.ee.cms.core.GMSException: initialization failure
at com.sun.enterprise.mgmt.ClusterManager.<init>(ClusterManager.java:142)
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.initializeGroupCommunicationProvider(GroupCom
municationProviderImpl.java:164)
at com.sun.enterprise.ee.cms.impl.base.GMSContextImpl.join(GMSContextImpl.java:176)
... 22 more
Caused by: java.io.IOException: can not find a first InetAddress
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyNetworkManager.start(GrizzlyNetworkManager.java:376)
at com.sun.enterprise.mgmt.ClusterManager.<init>(ClusterManager.java:140)
... 24 more

Even though validate-multicast is working, the instances are not able to join the cluster.

Here is what Joe mentioned in an email thread:

– cut here –
Validate-multicast is not using NetworkUtility.getFirstInetAddress(false).
validate-multicast is not specifying any IP address by default when creating the multicast socket.
Just to remind you, validate-multicast only creates a MulticastSocket and only communicates
over UDP, while getFirstInetAddress(false) is used to compute the IP address that
another instance can use to communicate with an instance via TCP. That is totally different.
We are trying to use the same IP address for both TCP and UDP in GMS. We need to revisit
this logic. We will need to remove the check for multicast enabled when selecting the network interface
now, since we are working on supporting non-multicast mode.
– cut here –

Explicitly setting the GMS_BIND_INTERFACE_ADDRESS-c1 property to "127.0.0.1" in each instance and the DAS and then restarting the DAS and cluster ensures the instances can join the cluster.



 Comments   
Comment by Bobby Bissett [ 20/Oct/11 ]

Assigning to me.

Comment by Bobby Bissett [ 07/Dec/11 ]

Moving to Joe (hi) since I'm not on the GF project any more. The work for this is mostly done, and Joe knows what change to make in the mcast sender thread so it mirrors what GMS proper is doing.





[GLASSFISH-16908] More than 6 instances does not join to GMS group Created: 24/Jun/11  Updated: 05/Jul/11  Resolved: 05/Jul/11

Status: Closed
Project: glassfish
Component/s: configuration, failover, grizzly-kernel, group_management_service
Affects Version/s: 3.1.1_b08
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: vanya_void Assignee: Bobby Bissett
Resolution: Invalid Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 2.6.18-164.el5PAE RHEL
2xXeon L7555
Red Hat Enterprise Linux Server release 5.4 (Tikanga)


Tags: 3_1_1-scrubbed, cluster, clustered

 Description   

I have a 5-node cluster with 2 instances running on each node. When I run the start-cluster command, only 6 of them join the GMS group at the same time, and the information reported by the get-health command looks like this:

portal-instance1 failed since Fri Jun 24 20:00:57 MSD 2011
portal-instance12 not started
portal-instance2 started since Fri Jun 24 20:01:16 MSD 2011
portal-instance22 started since Fri Jun 24 20:01:16 MSD 2011
portal-instance3 started since Fri Jun 24 20:01:16 MSD 2011
portal-instance32 started since Fri Jun 24 20:01:16 MSD 2011
portal-instance4 not started
portal-instance42 not started
portal-instance5 failed since Thu Jun 23 19:54:10 MSD 2011
portal-instance52 failed since Thu Jun 23 21:04:20 MSD 2011

Do you have any information on why this situation might occur?

[#|2011-06-24T20:08:39.443+0400|INFO|glassfish3.1|ShoalLogger|_ThreadID=12;_ThreadName=Thread-1;|GMS1092: GMS View Change Received for group: portal-cluster : Members in view for ADD_EVENT(before change analysis) are :
1: MemberId: portal-instance1, MemberType: CORE, Address: 192.168.101.31:9188:228.9.96.158:20796:portal-cluster:portal-instance1
2: MemberId: portal-instance12, MemberType: CORE, Address: 192.168.101.31:9091:228.9.96.158:20796:portal-cluster:portal-instance12
3: MemberId: portal-instance4, MemberType: CORE, Address: 192.168.101.34:9096:228.9.96.158:20796:portal-cluster:portal-instance4
4: MemberId: portal-instance42, MemberType: CORE, Address: 192.168.101.34:9146:228.9.96.158:20796:portal-cluster:portal-instance42
5: MemberId: portal-instance5, MemberType: CORE, Address: 192.168.101.35:9102:228.9.96.158:20796:portal-cluster:portal-instance5
6: MemberId: portal-instance52, MemberType: CORE, Address: 192.168.101.35:9129:228.9.96.158:20796:portal-cluster:portal-instance52



 Comments   
Comment by Bobby Bissett [ 24/Jun/11 ]

So far I don't see any bug here. Clusters are known to work.

Please give the output of asadmin list-instances as well as asadmin get-health. If both commands agree that some instances are stopped, look at the logs to figure out why. If the commands don't agree, follow all the steps of this blog to make sure your network supports what you're doing:

http://blogs.oracle.com/bobby/entry/validating_multicast_transport_where_d

It would be better to discuss this on the users list and then file an issue after we find out there's a bug.

Comment by Bobby Bissett [ 05/Jul/11 ]

Glad you got the network issues figured out.





[GLASSFISH-16721] MS1042: failed to send heartbeatmessage with state=aliveandready to group Created: 24/May/11  Updated: 25/May/11  Resolved: 25/May/11

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: sebglon Assignee: Joe Fialli
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu, glassfish 3.1, apache 2.2.1, mod_jk 1.2.26


Tags: apache, glassfish, ioexception, mod_jk, shoallogger

 Description   

I have Apache and GlassFish with mod_jk.
Many times, I get an HTTP 500 error when I make a request to Apache.

In my log, I always have this line:
[#|2011-05-23T19:54:19.749+0200|WARNING|glassfish3.1|ShoalLogger|_ThreadID=11;_ThreadName=Thread-1;|GMS1042: failed to send heartbeatmessage with state=aliveandready to group autre-voyage. Reason: IOException:Operation not permitted|#]

For info, here is the startup log:

19-May-2011 23:10:42 com.sun.enterprise.admin.launcher.GFLauncherLogger info
INFO: JVM invocation command line:
/usr/lib/jvm/java-6-openjdk/bin/java
-cp
/opt/glassfish3/glassfish/modules/glassfish.jar
-XX:+UnlockDiagnosticVMOptions
-XX:MaxPermSize=192m
-XX:+AggressiveOpts
-XX:NewRatio=2
-XX:+UseParallelGC
-Xmx512m
-Xmn192m
-javaagent:/opt/glassfish3/glassfish/lib/monitor/btrace-agent.jar=unsafe=true,noServer=true
-server
-Dosgi.shell.telnet.maxconn=1
-Djdbc.drivers=org.apache.derby.jdbc.ClientDriver
-Dfelix.fileinstall.disableConfigSave=false
-Dfelix.fileinstall.dir=/opt/glassfish3/glassfish/modules/autostart/
-Djavax.net.ssl.keyStore=/opt/glassfish3/glassfish/domains/domain1/config/keystore.jks
-Dosgi.shell.telnet.port=6666
-Djava.security.policy=/opt/glassfish3/glassfish/domains/domain1/config/server.policy
-Dfelix.fileinstall.log.level=2
-Dfelix.fileinstall.poll=5000
-Dcom.sun.aas.instanceRoot=/opt/glassfish3/glassfish/domains/domain1
-Dosgi.shell.telnet.ip=127.0.0.1
-Dcom.sun.enterprise.config.config_environment_factory_class=com.sun.enterprise.config.serverbeans.AppserverConfigEnvironmentFactory
-Djava.endorsed.dirs=/opt/glassfish3/glassfish/modules/endorsed:/opt/glassfish3/glassfish/lib/endorsed
-Dcom.sun.aas.installRoot=/opt/glassfish3/glassfish
-Djava.ext.dirs=/usr/lib/jvm/java-6-openjdk/lib/ext:/usr/lib/jvm/java-6-openjdk/jre/lib/ext:/opt/glassfish3/glassfish/domains/domain1/lib/ext
-Dfelix.fileinstall.bundles.startTransient=true
-Dfelix.fileinstall.bundles.new.start=true
-Djavax.net.ssl.trustStore=/opt/glassfish3/glassfish/domains/domain1/config/cacerts.jks
-Dorg.glassfish.additionalOSGiBundlesToStart=org.apache.felix.shell,org.apache.felix.gogo.runtime,org.apache.felix.gogo.shell,org.apache.felix.gogo.command
-Dcom.sun.enterprise.security.httpsOutboundKeyAlias=s1as
-DANTLR_USE_DIRECT_CLASS_LOADING=true
-Djava.security.auth.login.config=/opt/glassfish3/glassfish/domains/domain1/config/login.conf
-Dgosh.args=-nointeractive
-Djava.library.path=/opt/glassfish3/glassfish/lib:/usr/lib/jvm/java-6-openjdk/jre/lib/amd64/server:/usr/lib/jvm/java-6-openjdk/jre/lib/amd64:/usr/lib/jvm/java-6-openjdk/lib/amd64:/usr/java/packages/lib/amd64:/usr/lib/jni:/lib:/usr/lib
com.sun.enterprise.glassfish.bootstrap.ASMain
-domainname
domain1
-asadmin-args
-host,,,localhost,,,port,,,4848,,,secure=false,,,terse=false,,,echo=false,,,interactive=true,,,start-domain,,,verbose=false,,,debug=false,,,-domaindir,,,/opt/glassfish3/glassfish/domains,,,domain1
-instancename
server
-verbose
false
-debug
false
-asadmin-classpath
/opt/glassfish3/glassfish/modules/admin-cli.jar
-asadmin-classname
com.sun.enterprise.admin.cli.AsadminMain
-upgrade
false
-type
DAS
-domaindir
/opt/glassfish3/glassfish/domains/domain1
-read-stdin
true
19-May-2011 23:10:42 com.sun.enterprise.admin.launcher.GFLauncherLogger info
INFO: Successfully launched in 8 msec.
[#|2011-05-19T23:11:11.590+0200|INFO|glassfish3.1|null|_ThreadID=1;_ThreadName=Thread-1;|Running GlassFish Version: GlassFish Server Open Source Edition 3.1 (build 43)|#]

[#|2011-05-19T23:11:11.821+0200|INFO|glassfish3.1|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=1;_ThreadName=Thread-1;|GMSAD1005: Member server joined group autre-voyage|#]

[#|2011-05-19T23:11:11.828+0200|INFO|glassfish3.1|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=1;_ThreadName=Thread-1;|GMSAD1004: Started GMS for instance server in group autre-voyage|#]

[#|2011-05-19T23:11:11.882+0200|INFO|glassfish3.1|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=1;_ThreadName=Thread-1;|GMSAD1005: Member server joined group evasion|#]

[#|2011-05-19T23:11:11.883+0200|INFO|glassfish3.1|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=1;_ThreadName=Thread-1;|GMSAD1004: Started GMS for instance server in group evasion|#]

[#|2011-05-19T23:11:12.117+0200|INFO|glassfish3.1|javax.enterprise.system.core.com.sun.enterprise.v3.services.impl|_ThreadID=59;_ThreadName=Thread-1;|Grizzly Framework 1.9.31 started in: 76ms - bound to [0.0.0.0:4848]|#]

[#|2011-05-19T23:11:12.117+0200|INFO|glassfish3.1|javax.enterprise.system.core.com.sun.enterprise.v3.services.impl|_ThreadID=56;_ThreadName=Thread-1;|Grizzly Framework 1.9.31 started in: 87ms - bound to [0.0.0.0:8181]|#]

[#|2011-05-19T23:11:12.112+0200|INFO|glassfish3.1|javax.enterprise.system.core.com.sun.enterprise.v3.services.impl|_ThreadID=57;_ThreadName=Thread-1;|Grizzly Framework 1.9.31 started in: 32ms - bound to [0.0.0.0:7676]|#]

[#|2011-05-19T23:11:12.112+0200|INFO|glassfish3.1|javax.enterprise.system.core.com.sun.enterprise.v3.services.impl|_ThreadID=58;_ThreadName=Thread-1;|Grizzly Framework 1.9.31 started in: 49ms - bound to [0.0.0.0:3700]|#]

[#|2011-05-19T23:11:12.140+0200|INFO|glassfish3.1|javax.enterprise.system.core.com.sun.enterprise.v3.admin.adapter|_ThreadID=1;_ThreadName=Thread-1;|The Admin Console is already installed, but not yet loaded.|#]

[#|2011-05-19T23:11:12.556+0200|INFO|glassfish3.1|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=1;_ThreadName=Thread-1;|WEB0170: Apache mod_jk/jk2 attached to virtual-server [server] listening on port [8,080]|#]

[#|2011-05-19T23:11:12.560+0200|INFO|glassfish3.1|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=1;_ThreadName=Thread-1;|WEB0169: Created HTTP listener [http-listener-2] on host/port [0.0.0.0:8181]|#]

[#|2011-05-19T23:11:12.567+0200|INFO|glassfish3.1|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=1;_ThreadName=Thread-1;|WEB0169: Created HTTP listener [admin-listener] on host/port [0.0.0.0:4848]|#]

[#|2011-05-19T23:11:12.596+0200|INFO|glassfish3.1|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=1;_ThreadName=Thread-1;|WEB0171: Created virtual server [server]|#]

[#|2011-05-19T23:11:12.599+0200|INFO|glassfish3.1|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=1;_ThreadName=Thread-1;|WEB0171: Created virtual server [__asadmin]|#]

[#|2011-05-19T23:11:13.015+0200|INFO|glassfish3.1|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=1;_ThreadName=Thread-1;|WEB0172: Virtual server [server] loaded default web module []|#]

[#|2011-05-19T23:11:13.117+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAutreVoyage worker|#]

[#|2011-05-19T23:11:13.120+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAutreVoyage worker|#]

[#|2011-05-19T23:11:13.120+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAdmin worker|#]

[#|2011-05-19T23:11:13.121+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAdmin worker|#]

[#|2011-05-19T23:11:13.121+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfEvasion worker|#]

[#|2011-05-19T23:11:13.121+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAutreVoyage worker|#]

[#|2011-05-19T23:11:13.121+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAdmin worker|#]

[#|2011-05-19T23:11:13.122+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfEvasion worker|#]

[#|2011-05-19T23:11:13.122+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAutreVoyage worker|#]

[#|2011-05-19T23:11:13.122+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAutreVoyage worker|#]

[#|2011-05-19T23:11:13.122+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAutreVoyage worker|#]

[#|2011-05-19T23:11:13.123+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAutreVoyage worker|#]

[#|2011-05-19T23:11:13.123+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfEvasion worker|#]

[#|2011-05-19T23:11:13.123+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAutreVoyage worker|#]

[#|2011-05-19T23:11:13.124+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAutreVoyage worker|#]

[#|2011-05-19T23:11:13.124+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker worker|#]

[#|2011-05-19T23:11:13.124+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAdmin worker|#]

[#|2011-05-19T23:11:13.124+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfEvasion worker|#]

[#|2011-05-19T23:11:13.124+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAdmin worker|#]

[#|2011-05-19T23:11:13.125+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfEvasion worker|#]

[#|2011-05-19T23:11:13.138+0200|WARNING|glassfish3.1|org.apache.tomcat.util.threads.ThreadPool|_ThreadID=10;_ThreadName=Thread-1;|threadpool.max_threads_too_low|#]

[#|2011-05-19T23:11:14.206+0200|INFO|glassfish3.1|javax.enterprise.system.core.com.sun.enterprise.v3.services.impl|_ThreadID=1;_ThreadName=Thread-1;|core.start_container_done|#]

and my mod_jk config (glassfish-jk.properties file):

# Define 1 real worker using ajp13
worker.list=gfEvasion, gfAdmin, gfAutreVoyage

# Set properties for worker1 (ajp13)
worker.gfAdmin.type=ajp13
worker.gfAdmin.host=localhost
worker.gfAdmin.port=8080
worker.gfAdmin.lbfactor=25
worker.gfAdmin.cachesize=10
#worker.gfAdmin.cache_timeout=600
#worker.gfAdmin.socket_keepalive=1
#worker.gfAdmin.socket_timeout=300

# Set properties for worker1 (ajp13)
worker.gfEvasion.type=ajp13
worker.gfEvasion.host=localhost
worker.gfEvasion.port=28081
worker.gfEvasion.lbfactor=25
worker.gfEvasion.cachesize=10
#worker.gfEvasion.cache_timeout=600
#worker.gfEvasion.socket_keepalive=1
#worker.gfEvasion.socket_timeout=300

worker.gfAutreVoyage.type=ajp13
worker.gfAutreVoyage.host=localhost
worker.gfAutreVoyage.port=28080
worker.gfAutreVoyage.lbfactor=25
worker.gfAutreVoyage.cachesize=10
worker.gfAutreVoyage.cache_timeout=600
worker.gfAutreVoyage.socket_keepalive=1
worker.gfAutreVoyage.socket_timeout=300
worker.gfAutreVoyage.connection_pool_timeout=600

My virtual host config: JkMount /* gfAutreVoyage

and my httpd.conf:

JkWorkersFile /opt/glassfish3/glassfish/domains/domain1/config/glassfish-jk.properties

# Where to put jk logs
JkLogFile /var/log/apache2/mod_jk.log
# Set the jk log level [debug/error/info]
JkLogLevel debug
# Select the log format
JkLogStampFormat "[%a %b %d %H:%M:%S %Y] "
# JkOptions indicate to send SSL KEY SIZE,
JkOptions +ForwardKeySize +ForwardURICompat -ForwardDirectories
# JkRequestLogFormat set the request format
JkRequestLogFormat "%w %V %T"


 Comments   
Comment by Nazrul [ 24/May/11 ]

The user is using OpenJDK. Perhaps it would be a good idea to try the HotSpot (Sun) JDK from here: http://www.oracle.com/technetwork/java/javase/downloads/index.html

Comment by Joe Fialli [ 25/May/11 ]

These comments pertain only to the following reported log message and the subject of this issue.

[#|2011-05-23T19:54:19.749+0200|WARNING|glassfish3.1|ShoalLogger|_ThreadID=11;_ThreadName=Thread-1;|GMS1042: failed to send heartbeatmessage with state=aliveandready to group autre-voyage. Reason: IOException:Operation not permitted|#]

Shoal GMS requires UDP multicast to be properly configured for use between clustered GlassFish instances.
The above failure indicates that multicast may not be allowed at the OS networking layer.

The following link provides guidance on verifying UDP multicast is configured properly.
http://blogs.oracle.com/bobby/entry/validating_multicast_transport_where_d

The reported failure is a network or security configuration issue that is preventing the UDP multicast operation
from succeeding, so this issue will be marked invalid.
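
For reference, a minimal standalone sketch (not part of GlassFish; the group address, port, and payload below are illustrative placeholders, not the values GMS actually uses) that attempts the same kind of UDP multicast send the GMS heartbeat performs. An "Operation not permitted" IOException from this check points at OS or firewall policy rather than at GlassFish itself:

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

public class MulticastSendCheck {
    public static void main(String[] args) throws Exception {
        // Illustrative group address and port; GMS negotiates its own values.
        InetAddress group = InetAddress.getByName("228.9.3.1");
        int port = 9090;
        byte[] payload = "heartbeat-test".getBytes("UTF-8");
        MulticastSocket socket = new MulticastSocket();
        try {
            socket.setTimeToLive(1);
            socket.send(new DatagramPacket(payload, payload.length, group, port));
            System.out.println("multicast send succeeded");
        } catch (IOException e) {
            // "Operation not permitted" here indicates an OS or firewall restriction,
            // matching the GMS1042 warning seen in the server log.
            System.out.println("multicast send failed: " + e);
        } finally {
            socket.close();
        }
    }
}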

*******

The following error message is not a GMS issue.

[#|2011-05-19T23:11:13.117+0200|SEVERE|glassfish3.1|org.apache.jk.server.JkMain|_ThreadID=10;_ThreadName=Thread-1;|No class name for worker.gfAutreVoyage worker|#]

It should be filed as a separate issue against a different subcomponent.

Comment by Joe Fialli [ 25/May/11 ]

Marking this issue as incomplete, since no information was provided showing that
UDP multicast was verified to be working properly. The reported failure "Operation not permitted"
on sending a UDP multicast message indicates that UDP multicast is not enabled.

Following the directions on this
link: http://blogs.oracle.com/bobby/entry/validating_multicast_transport_where_d
should resolve the network configuration issue.

If it does not, please submit the "ifconfig -a" output (or the equivalent if the OS does not support that command)
and the DAS server.log with the ShoalLogger log level set to CONFIG (this shows which network addresses and ports are being used).

% asadmin set-log-levels ShoalLogger=CONFIG

Run the above before creating the cluster.

The ifconfig output should have "MULTICAST" on a line similar to the following:
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500





[GLASSFISH-16570] [regression w.r.t 3.1] Classloading issues related to GMS observed in the instance logs on start of RichAccess Big App test. Created: 06/May/11  Updated: 16/May/11  Resolved: 16/May/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.1_b04
Fix Version/s: None

Type: Bug Priority: Major
Reporter: varunrupela Assignee: Joe Fialli
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive fine-shoal-logs.zip    
Issue Links:
Related
is related to GLASSFISH-16568 GMS can select incorrect network inte... In Progress
Tags: 3_1_1-review

 Description   

Please see the parent bug http://java.net/jira/browse/GLASSFISH-15425 for scenario details.

When running the RichAccess Big App test, the following classloading-related exceptions appear in the instance logs, but only at the start of the test. The exceptions seem to be related to GMS.

******
[#|2011-05-06T12:48:24.135+0530|WARNING|glassfish3.1|javax.enterprise.system.core.classloading.com.sun.enterprise.loader|_ThreadID=39;_ThreadName=Thread-1;|LDR5207: ASURLClassLoader EarLibClassLoader :
doneCalled = true
doneSnapshot = ASURLClassLoader.done() called ON EarLibClassLoader :
urlSet = []
doneCalled = false
Parent -> org.glassfish.internal.api.DelegatingClassLoader@1a72d7ef

AT Fri May 06 11:44:58 IST 2011
BY :java.lang.Throwable: printStackTraceToString
at com.sun.enterprise.util.Print.printStackTraceToString(Print.java:639)
at com.sun.enterprise.loader.ASURLClassLoader.done(ASURLClassLoader.java:211)
at com.sun.enterprise.loader.ASURLClassLoader.preDestroy(ASURLClassLoader.java:179)
at org.glassfish.javaee.full.deployment.EarClassLoader.preDestroy(EarClassLoader.java:114)
at org.glassfish.internal.data.ApplicationInfo.unload(ApplicationInfo.java:358)
at com.sun.enterprise.v3.server.ApplicationLifecycle.unload(ApplicationLifecycle.java:999)
at com.sun.enterprise.v3.server.ApplicationLifecycle.disable(ApplicationLifecycle.java:1970)
at com.sun.enterprise.v3.server.ApplicationConfigListener.disableApplication(ApplicationConfigListener.java:278)
at com.sun.enterprise.v3.server.ApplicationConfigListener.handleOtherAppConfigChanges(ApplicationConfigListener.java:198)
at com.sun.enterprise.v3.server.ApplicationConfigListener.transactionCommited(ApplicationConfigListener.java:146)
at org.jvnet.hk2.config.Transactions$TransactionListenerJob.process(Transactions.java:344)
at org.jvnet.hk2.config.Transactions$TransactionListenerJob.process(Transactions.java:335)
at org.jvnet.hk2.config.Transactions$ListenerNotifier$1.call(Transactions.java:211)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at org.jvnet.hk2.config.Transactions$Notifier$1$1.run(Transactions.java:165)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Parent -> org.glassfish.internal.api.DelegatingClassLoader@1a72d7ef
was requested to find class com.sun.enterprise.ee.cms.logging.LogStrings after done was invoked from the following stack trace
java.lang.Throwable
at com.sun.enterprise.loader.ASURLClassLoader.findClassData(ASURLClassLoader.java:780)
at com.sun.enterprise.loader.ASURLClassLoader.findClass(ASURLClassLoader.java:696)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:296)
at java.lang.ClassLoader.loadClass(ClassLoader.java:296)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2289)
at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1364)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1328)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1282)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1282)
at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1224)
at java.util.ResourceBundle.getBundle(ResourceBundle.java:952)
at java.util.logging.Logger.findResourceBundle(Logger.java:1257)
at java.util.logging.Logger.setupResourceInfo(Logger.java:1312)
at java.util.logging.Logger.getLogger(Logger.java:312)
at com.sun.enterprise.ee.cms.logging.GMSLogDomain.getSendLogger(GMSLogDomain.java:87)
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.logSendMessageException(GroupCommunicationProviderImpl.java:395)
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.sendMessage(GroupCommunicationProviderImpl.java:366)
at com.sun.enterprise.ee.cms.impl.base.GroupHandleImpl.sendMessage(GroupHandleImpl.java:142)
at org.shoal.ha.group.gms.GroupServiceProvider.sendMessage(GroupServiceProvider.java:257)
at org.shoal.ha.cache.impl.interceptor.TransmitInterceptor.onTransmit(TransmitInterceptor.java:83)
at org.shoal.ha.cache.api.AbstractCommandInterceptor.onTransmit(AbstractCommandInterceptor.java:98)
at org.shoal.ha.cache.impl.interceptor.ReplicationCommandTransmitterManager.onTransmit(ReplicationCommandTransmitterManager.java:86)
at org.shoal.ha.cache.api.AbstractCommandInterceptor.onTransmit(AbstractCommandInterceptor.java:98)
at org.shoal.ha.cache.impl.interceptor.CommandHandlerInterceptor.onTransmit(CommandHandlerInterceptor.java:74)
at org.shoal.ha.cache.impl.command.CommandManager.executeCommand(CommandManager.java:122)
at org.shoal.ha.cache.impl.command.CommandManager.execute(CommandManager.java:114)
at org.shoal.ha.cache.impl.interceptor.ReplicationCommandTransmitterWithMap$BatchedCommandMapDataFrame.run(ReplicationCommandTransmitterWithMap.java:298)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

#]

*******



 Comments   
Comment by varunrupela [ 16/May/11 ]

This issue was also filed as http://java.net/jira/browse/GLASSFISH-16631.





[GLASSFISH-16568] GMS can select incorrect network interface when a Virtual Machine created bridge n/w interface (virbr0) exists Created: 06/May/11  Updated: 14/Oct/11

Status: In Progress
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.1_b04
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: varunrupela Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux [FQDN removed] 2.6.18-164.el5 #1 SMP Thu Sep 3 04:15:13 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

A virtual machine has created a network interface virbr0 on each of the 3 machines.
java.net.NetworkInterface.getNetworkInterfaces() is returning virbr0 as the first interface.
This interface was not working for TCP point-to-point messaging in GMS.

Here is the network interface configuration from ifconfig -a.
The virbr0 configuration is the same on all 3 machines, so the IP address not being unique is a big problem.

eth0 Link encap:Ethernet HWaddr 00:16:36:FF:D5:C8
inet addr:10.12.153.53 Bcast:10.12.153.255 Mask:255.255.255.0
inet6 addr: fe80::216:36ff:feff:d5c8/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:163499623 errors:0 dropped:0 overruns:0 frame:0
TX packets:164695644 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:40318720962 (37.5 GiB) TX bytes:68091586600 (63.4 GiB)
Interrupt:66 Memory:fdff0000-fe000000

<deleted eth1 - eth3, none were UP>
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:276021187 errors:0 dropped:0 overruns:0 frame:0
TX packets:276021187 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:47722695747 (44.4 GiB) TX bytes:47722695747 (44.4 GiB)

sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:137 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:40467 (39.5 KiB)


Attachments: Zip Archive fine-shoal-logs.zip     Zip Archive logs.zip    
Issue Links:
Dependency
blocks GLASSFISH-15425 [STRESS][umbrella] 24x7 RichAccess ru... Open
Related
is related to GLASSFISH-16570 [regression w.r.t 3.1] Classloading i... Resolved
is related to GLASSFISH-16631 resource bundle resolution failing in... Resolved
Tags: 3_1-next, 3_1_1-scrubbed

 Description   

Please see the parent bug http://java.net/jira/browse/GLASSFISH-15425 for scenario details.

When running the RichAccess Big App test, the instance logs are observed to fill up with Grizzly and Shoal logger messages of the following type:

******
[#|2011-05-06T11:26:53.439+0530|SEVERE|glassfish3.1|com.sun.grizzly.config.GrizzlyServiceListener|_ThreadID=29;_ThreadName=Thread-1;|Connection refused|#]

[#|2011-05-06T11:26:53.445+0530|SEVERE|glassfish3.1|com.sun.grizzly.config.GrizzlyServiceListener|_ThreadID=30;_ThreadName=Thread-1;|Connection refused|#]

[#|2011-05-06T11:26:53.447+0530|WARNING|glassfish3.1|ShoalLogger|_ThreadID=31;_ThreadName=Thread-1;|Error during groupHandle.sendMessage(instance103, /richAccess; size=30672|#]

*******

  • No HTTP failures were observed on the client.
  • 2 sets of logs are attached: 1 with the Shoal logger set to FINE and 1 without. Unzip and look under "logs/st-cluster" for the instance logs.
  • This issue appears with both the Sun JDK and the JRockit JDK


 Comments   
Comment by varunrupela [ 12/May/11 ]

Marked the issue as blocking. It's hard to analyze the logs and extract information useful for debugging the run.

Comment by Joe Fialli [ 13/May/11 ]

Perhaps there is a firewall configuration preventing connections. GMS uses UDP multicast to find all instances and to communicate
GMS notifications. All of that is working just fine.
However, I have not seen any TCP connections succeed.
The GMS send messages that are failing are over TCP, and they are HA replication sends.

Below is a pure Grizzly connection failure that does not have anything to do with GMS. GMS names all of its threads with "gms" in them (even the thread pool given to Grizzly to run GMS handlers uses "gms" in the thread name).

There are 1574 of the following failures.

server.log:[#|2011-05-06T11:28:11.435+0530|SEVERE|glassfish3.1|com.sun.grizzly.config.GrizzlyServiceListener|
_ThreadID=40;_ThreadName=Thread-1;|Connection refused|#]

Such a failure looks similar to what would happen if a firewall were blocking ports. It is suspicious that the instances
are all running on the same machine and the connections are still failing. Thus, a firewall is probably blocking inter-machine TCP communication.

*************

It was not stated, but I have observed that the 3 instances and the DAS are all running on one machine.
I have not observed it yet, but if there is not sufficient memory on the machine, the instances
could start running out of memory. In my past experience with RichAccess, not all instances were run on one machine.

[#|2011-05-06T11:20:37.552+0530|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1092: GMS View Change Received for group: st-cluster : Members in view for JOINED_AND_READY_EVENT(before change analysis) are :
1: MemberId: instance101, MemberType: CORE, Address: 192.168.122.1:9186:228.9.30.160:9176:st-cluster:instance101
2: MemberId: instance102, MemberType: CORE, Address: 192.168.122.1:9163:228.9.30.160:9176:st-cluster:instance102
3: MemberId: instance103, MemberType: CORE, Address: 192.168.122.1:9091:228.9.30.160:9176:st-cluster:instance103
4: MemberId: server, MemberType: SPECTATOR, Address: 192.168.122.1:9114:228.9.30.160:9176:st-cluster:server

#]

More analysis to come. Just wanted to pass this along.

Comment by Joe Fialli [ 13/May/11 ]

The TCP ports that GMS needs to have unblocked by a firewall are between 9090 and 9200.
For the above run, the ports used were randomly selected from that range and are
9186, 9163, 9091 and 9114. The next run will use different ports, so unblocking
the whole TCP port range from 9090 to 9200 is necessary.
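
As a quick sanity check that a firewall is not blocking that range, a small standalone probe along the following lines (a hypothetical helper, not shipped with GlassFish) can be run from one cluster machine against another, using one of the instance ports reported in the GMS view (e.g. 9186, 9163, 9091 or 9114 above):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class TcpPortCheck {
    public static void main(String[] args) throws Exception {
        String host = args[0];                 // remote cluster machine
        int port = Integer.parseInt(args[1]);  // a GMS port between 9090 and 9200
        Socket s = new Socket();
        try {
            s.connect(new InetSocketAddress(host, port), 3000);
            System.out.println("connected to " + host + ":" + port);
        } catch (IOException e) {
            // "Connection refused" or a timeout here mirrors the Grizzly SEVERE
            // messages in the instance logs and suggests a firewall or routing issue.
            System.out.println("failed to connect: " + e);
        } finally {
            s.close();
        }
    }
}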

Comment by Joe Fialli [ 13/May/11 ]

No log message would be printed for this GMS send-message failure, since that message is logged at FINE level.
However, the lookup of the Logger itself is failing (something different in the JRockit environment is causing this).

The following stack trace occurs while trying to get a resource bundle for a java.util.logging.Logger.
The call is java.util.logging.Logger.getLogger("ShoalLogger.send", "com.sun.enterprise.ee.cms.logging.LogStrings");

The resource bundle in question is com.sun.enterprise.ee.cms.logging.LogStrings.properties.

The error message is incorrectly stating that it is looking for a class called com.sun.enterprise.ee.cms.logging.LogStrings.
No such class exists. We need assistance from someone with classloading/resource bundle knowledge to find out why this is going
wrong in the JRockit environment.

[#|2011-05-06T11:26:53.369+0530|WARNING|glassfish3.1|javax.enterprise.system.core.classloading.com.sun.enterprise.loader|_ThreadID=28;_ThreadName=Thread-1;|LDR5207: ASURLClassLoader EarLibClassLoader :
doneCalled = true
doneSnapshot = ASURLClassLoader.done() called ON EarLibClassLoader :
urlSet = []
doneCalled = false
Parent -> org.glassfish.internal.api.DelegatingClassLoader@392aa3fb

AT Fri May 06 11:26:28 IST 2011
BY :java.lang.Throwable: printStackTraceToString
at com.sun.enterprise.util.Print.printStackTraceToString(Print.java:639)
at com.sun.enterprise.loader.ASURLClassLoader.done(ASURLClassLoader.java:211)
at com.sun.enterprise.loader.ASURLClassLoader.preDestroy(ASURLClassLoader.java:179)
at org.glassfish.javaee.full.deployment.EarClassLoader.preDestroy(EarClassLoader.java:114)
at org.glassfish.internal.data.ApplicationInfo.unload(ApplicationInfo.java:358)
at com.sun.enterprise.v3.server.ApplicationLifecycle.unload(ApplicationLifecycle.java:999)
at com.sun.enterprise.v3.server.ApplicationLifecycle.disable(ApplicationLifecycle.java:1970)
at com.sun.enterprise.v3.server.ApplicationConfigListener.disableApplication(ApplicationConfigListener.java:278)
at com.sun.enterprise.v3.server.ApplicationConfigListener.handleOtherAppConfigChanges(ApplicationConfigListener.java:198)
at com.sun.enterprise.v3.server.ApplicationConfigListener.transactionCommited(ApplicationConfigListener.java:146)
at org.jvnet.hk2.config.Transactions$TransactionListenerJob.process(Transactions.java:344)
at org.jvnet.hk2.config.Transactions$TransactionListenerJob.process(Transactions.java:335)
at org.jvnet.hk2.config.Transactions$ListenerNotifier$1.call(Transactions.java:211)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at org.jvnet.hk2.config.Transactions$Notifier$1$1.run(Transactions.java:165)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Parent -> org.glassfish.internal.api.DelegatingClassLoader@392aa3fb
was requested to find class com.sun.enterprise.ee.cms.logging.LogStrings after done was invoked from the following stack trace
java.lang.Throwable
at com.sun.enterprise.loader.ASURLClassLoader.findClassData(ASURLClassLoader.java:780)
at com.sun.enterprise.loader.ASURLClassLoader.findClass(ASURLClassLoader.java:696)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:296)
at java.lang.ClassLoader.loadClass(ClassLoader.java:296)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
at java.util.ResourceBundle$Control.newBundle(ResourceBundle.java:2289)
at java.util.ResourceBundle.loadBundle(ResourceBundle.java:1364)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1328)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1282)
at java.util.ResourceBundle.findBundle(ResourceBundle.java:1282)
at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1224)
at java.util.ResourceBundle.getBundle(ResourceBundle.java:952)
at java.util.logging.Logger.findResourceBundle(Logger.java:1280)
at java.util.logging.Logger.setupResourceInfo(Logger.java:1335)
at java.util.logging.Logger.getLogger(Logger.java:335)
at com.sun.enterprise.ee.cms.logging.GMSLogDomain.getSendLogger(GMSLogDomain.java:87)
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.logSendMessageException(GroupCommunicationProviderImpl.java:395)
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.sendMessage(GroupCommunicationProviderImpl.java:366)
at com.sun.enterprise.ee.cms.impl.base.GroupHandleImpl.sendMessage(GroupHandleImpl.java:142)
at org.shoal.ha.group.gms.GroupServiceProvider.sendMessage(GroupServiceProvider.java:257)
at org.shoal.ha.cache.impl.interceptor.TransmitInterceptor.onTransmit(TransmitInterceptor.java:83)
at org.shoal.ha.cache.api.AbstractCommandInterceptor.onTransmit(AbstractCommandInterceptor.java:98)
:doneCalled = true
doneSnapshot = ASURLClassLoader.done() called ON EarLibClassLoader :
urlSet = []
doneCalled = false
Parent -> org.glassfish.internal.api.DelegatingClassLoader@392aa3fb

Comment by Joe Fialli [ 13/May/11 ]

The submitted FINE server logging did not include the ShoalLogger at FINE level;
it only had FINE logging for org.shoal.ha.

To enable GMS ShoalLogger output, one needs to set ShoalLogger to the FINE level.
(Shoal GMS does not use org.shoal.gms* as its logger name; it still uses ShoalLogger.)

Since the connection refused happens in Grizzly, it might be of more use to
set Grizzly logging to FINE, provided it turns out that no firewall is
blocking GMS TCP communications on ports between 9090 and 9200 (the default
GMS port range; users can override these defaults if necessary).

[#|2011-05-06T12:48:40.087+0530|SEVERE|glassfish3.1|com.sun.grizzly.config.GrizzlyServiceListener|_ThreadID=51;_ThreadName=Thread-1;|Connection refused|#]

So enabling com.sun.grizzly logging at FINE may help find out why the connection was refused.

Comment by varunrupela [ 16/May/11 ]

Clarification regarding the setup:

  • Multiple network interfaces are enabled on all 3 machines in this setup, and GMS on each machine seems to bind to the virtual network interface 192.168.122.1.

Comment by Joe Fialli [ 17/May/11 ]

Please follow the documentation to configure GMS to bind to a specific network interface.

http://download.oracle.com/docs/cd/E18930_01/html/821-2426/gjfnl.html#gjdlw

Also, I recommend running "asadmin validate-multicast -bindaddress X.X.X.X" on all three machines
to double-check that UDP multicast traffic is working properly on whatever subnet you select.

Comment by Joe Fialli [ 17/May/11 ]

Removed blocking and regression from the subject line and changed the subject line to match what the issue was discovered to be.
This was not a regression; the same issue would exist in 3.1 as in 3.1.1. No changes were made in 3.1.1 that caused
this. A change in the configured environment caused this issue to surface.

A simple workaround is to disable or bring down the virbr0 network interface, which was not being used.

The following error messages were being repeated many times in server.log file.

[#|2011-05-06T11:26:53.445+0530|SEVERE|glassfish3.1|com.sun.grizzly.config.GrizzlyServiceListener|_ThreadID=30;_ThreadName=Thread-1;|Connection refused|#]

[#|2011-05-06T11:26:53.447+0530|WARNING|glassfish3.1|ShoalLogger|_ThreadID=31;_ThreadName=Thread-1;|Error during groupHandle.sendMessage(instance103, /richAccess; size=30672|#]

We will extend the GMS log event message above to include the destination IP address, to assist in diagnosing this problem in the future.

java.net.NetworkInterface.getNetworkInterfaces() was returning the virbr0 network interface as the first interface,
and that resulted in this issue. To resolve this issue, GMS will now default to the
network interface associated with InetAddress.getLocalHost(), as long as that network interface is multicast-enabled,
not a loopback address, and UP. This default would have avoided the reported issue.

When there are multiple network interfaces on a machine and the default is not the one desired for GMS,
the following documentation should be followed to configure GMS to use a specific network interface on each machine.

http://download.oracle.com/docs/cd/E18930_01/html/821-2426/gjfnl.html#gjdlw
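
Roughly speaking, the default described above amounts to the following sketch (an illustration of the selection rule only, not the actual GMS source code):

import java.net.InetAddress;
import java.net.NetworkInterface;

public class DefaultInterfaceSketch {
    // Illustration of the default described above: start from the interface
    // bound to InetAddress.getLocalHost() and accept it only if it is UP,
    // multicast-capable, and not a loopback interface.
    public static NetworkInterface pickDefault() throws Exception {
        InetAddress local = InetAddress.getLocalHost();
        NetworkInterface nif = NetworkInterface.getByInetAddress(local);
        if (nif != null && nif.isUp() && nif.supportsMulticast() && !nif.isLoopback()) {
            return nif;
        }
        return null; // fall back to explicit configuration (see the documentation link above)
    }

    public static void main(String[] args) throws Exception {
        System.out.println("default candidate: " + pickDefault());
    }
}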

Comment by Joe Fialli [ 17/May/11 ]

Also, add a configuration message to show the localPeerID and the GMS system advertisement being sent to other machines to dynamically form the GMS group (GlassFish cluster). This configuration message will show which IP address GMS is telling other members of the cluster to contact it at.

Comment by Joe Fialli [ 13/Jun/11 ]

I was unable to identify the non-functional virtual network interface using any of the java.net.NetworkInterface
methods. I recommend postponing the fix for this issue in the 3.1.1 time frame, since changing the algorithm
for selecting the first network address could potentially introduce a regression for a previously
working network configuration. There is no way to correct this issue without changing
how the first network address is selected.

A workaround did exist for this issue: simply disable the virtual network interface that is not
being used.

Comment by Joe Fialli [ 14/Oct/11 ]

Lowered the priority to minor since there is a workaround. Additionally, this problem
only occurs as the result of having a virtual network interface that was created by VirtualBox but
was not being used; simply disabling the unused virbr0 network interface fixed
the problem. At this time, my recommendation is to document the issue and its workaround in the release notes.





[GLASSFISH-16565] AIX 6.1, created a cluster, but: "GMS failed to start" Created: 05/May/11  Updated: 17/May/11  Resolved: 17/May/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.1_b04
Fix Version/s: 3.1.1_b06

Type: Bug Priority: Major
Reporter: easarina Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Java Archive File shoal-gms-impl.jar    
Tags: 3_1_1-approved

 Description   

AIX 6.1, GlassFish 3.1.1 build 04. I installed the build on two machines and configured passwordless ssh between the machines. I started the domain and then created a cluster; the cluster was created successfully, according to the message in the terminal window. But during the cluster creation I saw error messages like the following in server.log:

================================================================================

[#|2011-05-05T14:54:31.365-0700|INFO|glassfish3.1|com.sun.grizzly.config.GrizzlyServiceListener|_ThreadID=11;_Thre
adName=Thread-8;|GRIZZLY0001: Starting Grizzly Framework 1.9.34 - 5/5/11 2:54 PM|#]

[#|2011-05-05T14:54:31.370-0700|CONFIG|glassfish3.1|ShoalLogger|_ThreadID=10;_ThreadName=Thread-8;|Grizzly control
ler listening on /0:0:0:0:0:0:0:0:9164. Controller started in 9 ms|#]

[#|2011-05-05T14:54:31.371-0700|SEVERE|glassfish3.1|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=10;_Thread
Name=Thread-8;|GMSAD1017: GMS failed to start. See stack trace for additional information.
com.sun.enterprise.ee.cms.core.GMSException: failed to join group c2
at com.sun.enterprise.ee.cms.impl.base.GMSContextImpl.join(GMSContextImpl.java:182)
at com.sun.enterprise.ee.cms.impl.common.GroupManagementServiceImpl.join(GroupManagementServiceImpl.java:3
82)
at org.glassfish.gms.GMSAdapterImpl.initializeGMS(GMSAdapterImpl.java:576)
at org.glassfish.gms.GMSAdapterImpl.initialize(GMSAdapterImpl.java:199)
at org.glassfish.gms.bootstrap.GMSAdapterService.loadModule(GMSAdapterService.java:218)
at org.glassfish.gms.bootstrap.GMSAdapterService.checkCluster(GMSAdapterService.java:192)
at org.glassfish.gms.bootstrap.GMSAdapterService.access$100(GMSAdapterService.java:79)
at org.glassfish.gms.bootstrap.GMSAdapterService$1.changed(GMSAdapterService.java:248)
at org.jvnet.hk2.config.ConfigSupport.sortAndDispatch(ConfigSupport.java:289)
at org.glassfish.gms.bootstrap.GMSAdapterService.changed(GMSAdapterService.java:240)
at org.jvnet.hk2.config.Transactions$ConfigListenerJob.process(Transactions.java:379)
at org.jvnet.hk2.config.Transactions$ConfigListenerJob.process(Transactions.java:369)
at org.jvnet.hk2.config.Transactions$ConfigListenerNotifier$1$1.call(Transactions.java:259)
at org.jvnet.hk2.config.Transactions$ConfigListenerNotifier$1$1.call(Transactions.java:257)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:315)
at java.util.concurrent.FutureTask.run(FutureTask.java:150)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:736)
Caused by: com.sun.enterprise.ee.cms.core.GMSException: initialization failure
at com.sun.enterprise.mgmt.ClusterManager.<init>(ClusterManager.java:142)
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.initializeGroupCommunicationProvider
(GroupCommunicationProviderImpl.java:164)
at com.sun.enterprise.ee.cms.impl.base.GMSContextImpl.join(GMSContextImpl.java:176)
... 18 more
Caused by: java.io.IOException: can not find a first InetAddress
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyNetworkManager.start(GrizzlyNetworkManager.java:376)
at com.sun.enterprise.mgmt.ClusterManager.<init>(ClusterManager.java:140)
... 20 more

#]

===========================================================

And then, when instances were created for the cluster, I did not see any GMS messages or events in the server.log.



 Comments   
Comment by Tom Mueller [ 06/May/11 ]

Reassigning to GMS subcategory.

Comment by Joe Fialli [ 06/May/11 ]

More information is needed to investigate why the basic lookup of a network address is not working.
We need to investigate why this failure occurred in the reported stack trace.

Caused by: java.io.IOException: can not find a first InetAddress
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyNetworkManager.start(GrizzlyNetworkManager.java:376)

A simple NetworkUtility method is failing to find an InetAddress for the machine.
There is a simple test that can be run to diagnose why the network configuration is
not working correctly:

$ cd <GlassFishInstallation>/glassfish/modules
$ java -classpath shoal-gms-impl.jar com.sun.enterprise.mgmt.transport.NetworkUtility
AllLocalAddresses() = [/10.152.23.224, /fe80:0:0:0:223:32ff:fe97:5cf7%4, /fe80:0:0:0:0:0:0:1%1]
getFirstNetworkInterface() = name:en0 (en0) index: 4 addresses:
/10.152.23.224;
/fe80:0:0:0:223:32ff:fe97:5cf7%4;

getFirstInetAddress( true ) = /fe80:0:0:0:223:32ff:fe97:5cf7%4
getFirstInetAddress( false ) = /10.152.23.224
getFirstNetworkInteface() = name:en0 (en0) index: 4 addresses:
/10.152.23.224;
/fe80:0:0:0:223:32ff:fe97:5cf7%4;

getFirstInetAddress(firstNetworkInteface, true) = /fe80:0:0:0:223:32ff:fe97:5cf7%4
getFirstInetAddress(firstNetworkInteface, false) = /10.152.23.224

Additionally, please submit ifconfig -a so an assessment can be made of the network configuration of the machine.

Comment by easarina [ 06/May/11 ]

Please see below the DAS machine information:
======================================================
-bash-3.00$ uname -n
aixas13
-bash-3.00$ java -classpath shoal-gms-impl.jar com.sun.enterprise.mgmt.transport.NetworkUtility
AllLocalAddresses() = [/10.133.169.1]
getFirstNetworkInterface() = name:lo0 (lo0) index: 1 addresses:
/0:0:0:0:0:0:0:1;
/127.0.0.1;

getFirstInetAddress( true ) = null
getFirstInetAddress( false ) = null
getFirstNetworkInteface() = name:lo0 (lo0) index: 1 addresses:
/0:0:0:0:0:0:0:1;
/127.0.0.1;

getFirstInetAddress(firstNetworkInteface, true) = null
getFirstInetAddress(firstNetworkInteface, false) = null

-bash-3.00$ ifconfig -a
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),PSEG,LARGESEND,CHAIN>
inet 10.133.169.1 netmask 0xfffff800 broadcast 10.133.175.255
tcp_sendspace 131072 tcp_recvspace 65536
lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>
inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
inet6 ::1/0
tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1

Comment by Joe Fialli [ 06/May/11 ]

java.net.NetworkInterface.supportsMulticast() is returning FALSE for network interface en0.

So even though ifconfig is stating that MULTICAST is enabled, something in the network configuration
is not allowing that method to return true. This is why GMS is not able to find a local address.
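
A quick way to see the same flags that GMS consults is a standalone diagnostic along these lines (illustrative code, not part of GlassFish), run with the same JDK that GlassFish uses on the AIX machine:

import java.net.NetworkInterface;
import java.util.Enumeration;

public class InterfaceFlags {
    public static void main(String[] args) throws Exception {
        Enumeration<NetworkInterface> nifs = NetworkInterface.getNetworkInterfaces();
        while (nifs.hasMoreElements()) {
            NetworkInterface nif = nifs.nextElement();
            // supportsMulticast() returning false for en0 is what leads to the
            // "can not find a first InetAddress" failure in the stack trace above.
            System.out.println(nif.getName()
                    + " up=" + nif.isUp()
                    + " loopback=" + nif.isLoopback()
                    + " multicast=" + nif.supportsMulticast());
        }
    }
}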

Comment by scatari [ 10/May/11 ]

Pre-approved for integration as this is a test blocker.

Comment by Joe Fialli [ 12/May/11 ]

The fix is checked into the Shoal GMS trunk. It has been verified to work.
The fix is not yet integrated into GlassFish 3.1.1.

Use the attached shoal-gms-impl.jar as a patch until the integration is complete.

Comment by Joe Fialli [ 12/May/11 ]

Install the patch into <glassfish-install-dir>/glassfish/modules to work around this issue on AIX 6.1.

Comment by Bobby Bissett [ 17/May/11 ]

Integrated into GF:

Sending packager/resources/pkg_conf.py
Sending pom.xml
Transmitting file data ..
Committed revision 46895.

I think this will be 3.1.1-b06 (looks like the tag for b05 has already been made).





[GLASSFISH-16422] signal.getMemberDetails().get() sometimes return null instead of a stored value Created: 21/Apr/11  Updated: 02/Dec/11  Resolved: 11/Jul/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: None
Fix Version/s: 3.1.1_b08, 4.0

Type: Bug Priority: Major
Reporter: marina vatkina Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File log.txt     Zip Archive server.log.zip     Java Archive File shoal-gms-api.jar     Java Archive File shoal-gms-impl.jar     Java Archive File shoal-gms-impl.jar     Java Archive File shoal-gms-impl.jar     Zip Archive tx-ee-resendautorecovery-logs.zip    

 Description   

There are random failures in http://hudson-sca.us.oracle.com/job/gf-transaction-cluster-devtest that can't be seen in a local run.

With #75, the extra logging enabled yesterday shows the problem. In this run the failed test is 'autorecovery', and the instance and DAS logs are preserved in tx-ee-autorecovery-logs.zip (note that the DAS was started earlier, so it contains everything that happened until the end of the corresponding test run).

In the in1 log you can see (before the crash):
[#|2011-04-21T09:07:43.043-0700|INFO|glassfish3.2|javax.enterprise.system.core.transaction.com.sun.jts.jta|_ThreadID=10;_ThreadName=Thread-1;|Storing GMS instance in1 data TX_LOG_DIR : /export/home/hudson/workspace/gf-transaction-cluster-devtest/appserv-tests/build/module/archive/in1/tx|#]

But in in2, when it gets the notification:
[#|2011-04-21T09:09:30.398-0700|INFO|glassfish3.2|javax.enterprise.system.core.transaction.com.sun.jts.jta|_ThreadID=27;_ThreadName=Thread-1;|[GMSCallBack] Recovering for instance: in1 logdir: null|#]

[#|2011-04-21T09:09:30.405-0700|WARNING|glassfish3.2|javax.enterprise.system.core.transaction.com.sun.jts.jta|_ThreadID=27;_ThreadName=Thread-1;|JTS5077: Transaction log location data is not available for failed Member details for in1|#]



 Comments   
Comment by marina vatkina [ 06/May/11 ]

Attaching log files from the failed run. Hopefully they produce enough info. server.log.zip is a zip of the DAS log files from the whole test (DAS was started once for all tests)

Comment by Joe Fialli [ 09/May/11 ]

There are log files missing from the DAS (server.log.zip). Only server.log was submitted, and there were other server.log* files that were not collected for the DAS. The first log event in the DAS server.log is at 16:04:53.618-0700, while the server.log files for in1 and in2 start
at 15:59:30.752-0700 (over 5 minutes earlier than the first DAS server log event). asadmin collect-log-files collects all the log files
that are needed, so it must be a shell script that is not collecting all the log files for the DAS.

***************

There is definitely a bug in the distributed state cache when instance in1 publishes the distributed state cache entry TX_LOG_DIR before
instance in2 has joined the cluster as seen by in1.
However, the evidence that the DAS is not syncing its DistributedStateCache contents with in2 is in the DAS server.log files
that were not submitted (so I cannot completely confirm why the DistributedStateCache is not getting synched).
There is enough info to go on now to recreate this issue in a GMS dev test, so additional logs are not necessary.

From ins1.log:
[#|2011-05-06T15:59:32.469-0700|INFO|glassfish3.2|javax.enterprise.system.core.transaction.com.sun.jts.jta|_ThreadID=10;_ThreadName=Thread-1;|Storing GMS instance in1 data TX_LOG_DIR : /export/home/hudson/workspace/gf-transaction-cluster-devtest/appserv-tests/build/module/archive/in1/tx|#]

[#|2011-05-06T15:59:34.086-0700|FINER|glassfish3.2|ShoalLogger|_ThreadID=27;_ThreadName=Thread-1;ClassName=com.sun.enterprise.ee.cms.impl.base.DistributedStateCacheImpl;MethodName=printDSCContents;|in1:DSC now contains ---------
-93744772 key=GMSMember:in2:Component:MEMBERDETAILS:key:TX_LOG_DIR : value=/export/home/hudson/workspace/gf-transaction-cluster-devtest/appserv-tests/build/module/archive/in2/tx
-93744771 key=GMSMember:in3:Component:MEMBERDETAILS:key:TX_LOG_DIR : value=/export/home/hudson/workspace/gf-transaction-cluster-devtest/appserv-tests/build/module/archive/in3/tx
-93744773 key=GMSMember:in1:Component:MEMBERDETAILS:key:TX_LOG_DIR : value=/export/home/hudson/workspace/gf-transaction-cluster-devtest/appserv-tests/build/module/archive/in1/tx

#]

[#|2011-05-06T15:59:33.106-0700|INFO|glassfish3.2|ShoalLogger|_ThreadID=14;_ThreadName=Thread-1;|GMS1024: Adding Join member: in2 group: c1 StartupState: GROUP_STARTUP |#]

From ins2.log:
[#|2011-05-06T15:59:34.062-0700|FINER|glassfish3.2|ShoalLogger|_ThreadID=10;_ThreadName=Thread-1;ClassName=com.sun.enterprise.ee.cms.impl.base.DistributedStateCacheImpl;MethodName=printDSCContents;|in2:DSC now contains ---------
-93744772 key=GMSMember:in2:Component:MEMBERDETAILS:key:TX_LOG_DIR : value=/export/home/hudson/workspace/gf-transaction-cluster-devtest/appserv-tests/build/module/archive/in2/tx

#]

Should be able to recreate this issue without further server log.

Comment by marina vatkina [ 09/May/11 ]

server.log.zip contains all 3 log files, with server.log_2011-05-06T15-51-33 starting at 3:39pm

Comment by Joe Fialli [ 09/May/11 ]

Found the missing server.logs. I mistook the DAS server.log from tx*.zip to be from server.log.zip and forgot to unzip the complete DAS log. Thanks for catching that.

Comment by Joe Fialli [ 16/May/11 ]

Investigating.

Noticed the following trend when the test fails.

Instance in1 registers TX_LOG_DIR in the DistributedStateCache while instance in2 has not yet joined the group.
This always happens when the test fails. If in2 is already started when in1 registers TX_LOG_DIR
with the DSC, then things work fine. (A quick workaround would be to delay registering TX_LOG_DIR until
the cluster is started. Currently the registration happens during cluster startup, when all instances
are joining and not all are up yet. The DSC should work for this case, but it seems not to at this time.)
Thus, I am investigating the mechanism by which the DAS is supposed to
synchronize the DSC with a newly joining instance. It is uncertain whether the mechanism is completely broken
or whether a small window exists where some values do not get synched.

Updated an existing GMS ApplicationServer dev test with the current TX_LOG_DIR access pattern used by
FailureRecovery, and am working on recreating and fixing the issue.

Comment by Joe Fialli [ 19/May/11 ]

Status update.

Created a GMS-level test simulation that reproduces the failure condition reported by this issue,
and have confirmed a fix against this GMS-level simulation of the failure.

The fix should be available early next week.
A patch that can be used to confirm this fix will be attached to this bug report:
http://java.net/jira/secure/attachment/45919/shoal-gms-impl.jar

Comment by Joe Fialli [ 19/May/11 ]

Install this patch jar in glassfish3/glassfish/modules as a fix for this issue.

Comment by Joe Fialli [ 24/May/11 ]

Apply the following two attachments to a GlassFish 3.1 or 3.1.1 installation (in the directory glassfish3/glassfish/modules) as a patch to fix this bug.

Comment by Joe Fialli [ 25/May/11 ]

A fix is required for the test case that creates a cluster, performs a quick restart of a killed instance, deletes the cluster, and then creates a cluster with the same name again. There was a rejoin map that was incorrectly a static rather than an instance variable. Thus, it did not get reinitialized when the DAS was not restarted after
creating it. The initially correct rejoin subevent was surviving the cluster deletion and incorrectly
getting reported in the newly created cluster. This rejoin then caused the newly created instance
to have its network peer-id map incorrectly removed, resulting in a failure to perform the
first synchronization from the GMS master to the instance. The failure was only intermittent: if
the order in which the instances came up and initialized the DistributedStateCache was in2 and then in1, then
in1 setting the DSC also set in2's DSC correctly. However, if in1 came up first and then in2, then in2 depended
on the GMS master to synchronize the DSC with the correct values (which included in1's TX_LOG_DIR).
That sendMessage was failing due to the incorrect rejoin subevent. So this fix is required for the transaction test usage pattern.
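
For illustration only, a minimal sketch of the static-versus-instance pitfall described above (the class and field names are hypothetical and are not the actual Shoal source):

import java.util.HashMap;
import java.util.Map;

public class RejoinTracker {
    // BUG PATTERN: a static map lives for the whole DAS JVM, so it survives
    // "delete cluster / create cluster" and stale rejoin entries from the
    // old cluster leak into the newly created one.
    private static final Map<String, Long> staticRejoins = new HashMap<String, Long>();

    // FIX PATTERN: an instance field is re-created along with its owning
    // object, so a newly created cluster starts with an empty map.
    private final Map<String, Long> rejoins = new HashMap<String, Long>();

    public void recordRejoin(String instanceName) {
        rejoins.put(instanceName, System.currentTimeMillis());
    }
}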

Comment by marina vatkina [ 24/Jun/11 ]

Joe, had the fix been checked in?

Comment by marina vatkina [ 24/Jun/11 ]

On the 3.1.1 7/22 run http://hudson-sca.us.oracle.com/job/gf-transaction-cluster-devtest-3.1.1/6 (I marked it to keep forever) the autorecovery test failed with missing TX_LOG_DIR data.

Comment by Joe Fialli [ 27/Jun/11 ]

shoal~svn: revision 1617(5/19) and revision 1619(5/24) had fixes for this issue in glassfish 3.1.1.

The shoal-gms jar containing this fix was integrated into glassfish 3.1.1 on 5/27 in glassfish revision 47138 by Bobby.

> Integrating newer Shoal version into 3.1.1. Addresses the following:
>
>
>- GF-16422: TX_LOG_DIR missing from gms distributed state cache for cluster transaction recovery dev test
>- GF-15788: fix a NPE and bug# 1241855 change a WARNING to FINE in shoal cache
> - Partial for GF-16568: Demoted some shoal INFO messages to FINE since they were reported to be spamming log
> in internal filed bugs and external forum feedback. Also in shoal cache, when a sendMessage
> fails, only log message of failure to a particular instance once every 12 hours. (anti spam log)

So this should have been fixed for some time.
I will look at the logs when I have some time.

Comment by Joe Fialli [ 27/Jun/11 ]

Executive summary:

The committed fix is checked in. Everything was working as expected except that the DAS fails to propagate its
DSC view, which contains the in1 TX_LOG_DIR, to the rest of the cluster (search for "NEW FAILURE" below).
We will need to analyze why the DAS is failing to send itself a message.

The following entries from the in2 server.log confirm that the fix is checked in (these FINE log messages document the fix).

[#|2011-06-22T07:11:27.830-0700|FINE|glassfish3.1|ShoalLogger.dsc|_ThreadID=47;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.DistributedStateCacheImpl;MethodName=getFromCacheForPattern;|DSCImpl.getCacheFromPattern() for in1|#]

[#|2011-06-22T07:11:27.832-0700|FINE|glassfish3.1|ShoalLogger.dsc|_ThreadID=47;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.DistributedStateCacheImpl;MethodName=getFromCacheForPattern;|getFromCacheForPattern componentName:MEMBERDETAILS memberToken:in1 missing data in local cache. look up data from oldest group member:server|#]

[#|2011-06-22T07:11:27.832-0700|FINE|glassfish3.1|ShoalLogger|_ThreadID=47;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl;MethodName=sendMessage;|sending message to PeerID: 10.133.187.30:9187:228.9.146.63:7623:c1:server|#]

[#|2011-06-22T07:11:33.880-0700|FINE|glassfish3.1|ShoalLogger.dsc|_ThreadID=47;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.DistributedStateCacheImpl;MethodName=getFromCacheForPattern;|getFromCacheForPattern waited 6035 ms for result {}|#]

[#|2011-06-22T07:11:33.881-0700|FINE|glassfish3.1|ShoalLogger.dsc|_ThreadID=47;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.DistributedStateCacheImpl;MethodName=getFromCacheForPattern;|retVal is empty|#]

[#|2011-06-22T07:11:33.881-0700|INFO|glassfish3.1|javax.enterprise.system.core.transaction.com.sun.jts.jta|_ThreadID=47;_ThreadName=Thread-2;|[GMSCallBack] Recovering for instance: in1|#]

[#|2011-06-22T07:11:33.886-0700|WARNING|glassfish3.1|javax.enterprise.system.core.transaction.com.sun.jts.jta|_ThreadID=47;_ThreadName=Thread-2;|JTS5077: Transaction log location data is not available for failed Member details for in1|#]

From the DAS server.log: the DAS received the request to propagate its local cache to all members. While the DAS DSC local cache does have the TX_LOG_DIR
value for in1, the request to add all entries to the group's local caches fails. This is a new failure that occurs further along the line. We will need to investigate why this is happening. Here are the key log messages from the DAS showing that the previous fix is checked in and that this is
yet another failure further along the line.

    1. The next line shows that the DAS distributed state cache definitely has the TX_LOG_DIR for in1 within a second of the request for
      the DAS to add all of its DSC entries to the cluster.

[#|2011-06-22T07:10:27.824-0700|FINER|glassfish3.1|ShoalLogger.dsc|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.DistributedStateCacheImpl;MethodName=printDSCContents;|server:DSC now contains -
--------
796990885 key=GMSMember:in2:Component:TRANSACTION-RECOVERY-SERVICE:key:in1 : value=RECOVERY_SERVER_APPOINTED|1308751827796
-93744772 key=GMSMember:in2:Component:MEMBERDETAILS:key:TX_LOG_DIR : value=/export/home/hudson/workspace/gf-transaction-cluster-devtest-3.1.1/appserv-tests/build/module/archive/in2/tx
-93744771 key=GMSMember:in3:Component:MEMBERDETAILS:key:TX_LOG_DIR : value=/export/home/hudson/workspace/gf-transaction-cluster-devtest-3.1.1/appserv-tests/build/module/archive/in3/tx
-93744773 key=GMSMember:in1:Component:MEMBERDETAILS:key:TX_LOG_DIR : value=/export/home/hudson/workspace/gf-transaction-cluster-devtest-3.1.1/appserv-tests/build/module/archive/in1/tx

#]

<deleted>

[#|2011-06-22T07:11:27.846-0700|FINER|glassfish3.1|ShoalLogger|_ThreadID=77;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.MessageWindow;MethodName=run;|Processing received message .... com.sun.enterprise.ee.cms.impl.common.DSCMessage@11aed57|#]

[#|2011-06-22T07:11:27.846-0700|FINE|glassfish3.1|ShoalLogger.dsc|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.MessageWindow;MethodName=handleDSCMessage;|DSCMessageReceived from :in2, Operation :ADDALLLOCAL|#]

[#|2011-06-22T07:11:27.846-0700|FINE|glassfish3.1|ShoalLogger.dsc|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.MessageWindow;MethodName=handleDSCMessage;|Syncing local cache with group ...|#]

[#|2011-06-22T07:11:27.846-0700|FINE|glassfish3.1|ShoalLogger|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.mgmt.ClusterViewManager;MethodName=lockLog;|getLocalView() viewLock Hold count :0, lock queue count:0|#]

[#|2011-06-22T07:11:27.847-0700|FINE|glassfish3.1|ShoalLogger|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.mgmt.ClusterView;MethodName=lockLog;|getView() viewLock Hold count :0, lock queue count:0|#]

[#|2011-06-22T07:11:27.847-0700|FINER|glassfish3.1|ShoalLogger|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl;MethodName=sendMessage;|sending message to member: server|#]

###NEW FAILURE: THIS FAILURE STOPS THE DAS from synching its DSC with the group ######
[#|2011-06-22T07:11:27.847-0700|FINE|glassfish3.1|ShoalLogger|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl;MethodName=sendMessage;|sendMessage(synchronous=true, to=group) failed to send msg com.sun.enterprise.ee.cms.impl.common.DSCMessage@d40911 to member 10.133.187.30:9110:228.9.27.148:3038:c1:server|#]

[#|2011-06-22T07:11:27.847-0700|FINE|glassfish3.1|ShoalLogger.dsc|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.MessageWindow;MethodName=handleDSCMessage;|done with local to group sync...|#]

[#|2011-06-22T07:11:27.847-0700|FINE|glassfish3.1|ShoalLogger.dsc|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.MessageWindow;MethodName=handleDSCMessage;|adding group cache state to local cache..|#]

[#|2011-06-22T07:11:27.847-0700|FINER|glassfish3.1|ShoalLogger.dsc|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.DistributedStateCacheImpl;MethodName=addAllToLocalCache;|done adding all to Distributed State Cache|#]

Comment by marina vatkina [ 27/Jun/11 ]

Can the fix(es) be backported into the trunk?

Comment by Joe Fialli [ 27/Jun/11 ]

could you clarify which fixes you are referring to?

The fixes that were made in May were checked into the Shoal GMS branch for glassfish 3.1.1 and merged into the
Shoal GMS branch for glassfish 3.2 (shoal svn revision 1628: branch gms-transport-modules).

At this time, the shoal-gms jars were only integrated into glassfish 3.1.1. The shoal-gms jars
intended for glassfish 3.2 with these fixes have not been integrated into the glassfish trunk.
It would be simple to do the integration; it just has not been done yet. I believe that might
be what is being requested.

**

However, the failures reported on 6/24 are against 3.1.1.
I am uncertain what fix is needed for the problem reported recently. I thought that this
bug had been fixed and was awaiting confirmation. There is a new issue that I am not able to diagnose
from the current logs.

Comment by marina vatkina [ 27/Jun/11 ]

Trunk tests are also failing. But I do not know if this happens a) because rev 47138 wasn't backported to the trunk yet, b) for the same reason as 3.1.1 tests are currently failing, c) the combination of a&b (most probably), or d) something else not quite right on trunk.

Comment by Joe Fialli [ 27/Jun/11 ]

So trunk is failing since gms changes were not integrated into trunk yet.

The following FINE log message was added to track the fix and occurs in the "in2" server.log when there is a failure.
[#|2011-06-22T07:11:27.832-0700|FINE|glassfish3.1|ShoalLogger.dsc|_ThreadID=47;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.DistributedStateCacheImpl;MethodName=getFromCacheForPattern;|getFromCacheForPattern componentName:MEMBERDETAILS memberToken:in1 missing data in local cache. look up data from oldest group member:server|#]

The above log message is showing up in glassfish 3.1.1 server.log so we have confirmation
of the integration of the fix in glassfish 3.1.1.

Since I was unable to find a shoal-gms integration message for glassfish 3.2 trunk, I am certain the fix is not in glassfish trunk.

Comment by marina vatkina [ 27/Jun/11 ]

In this case it would be nice if the GMS changes were integrated into the trunk.

Comment by Joe Fialli [ 28/Jun/11 ]

Workaround for the new issue hit while confirming that this issue is resolved:
the DAS must be restarted after deleting a cluster and before subsequently recreating the same cluster name.

This is the sequence of steps taken between each of the tests in question.

stop-cluster A
delete-cluster A
create-cluster A
start-cluster A

There is evidence in server.log that simply restarting the DAS between delete-cluster and create-cluster
would work around a new bug that was found. GlassFish GMS dev testing does not cover this scenario,
where a cluster is created, worked on, deleted and then recreated on the same DAS. This is the second case that
we have discovered while investigating this issue where stale data survives across the delete-cluster and
subsequent create-cluster without restarting the DAS.

Here is the log event sequence in server.log confirming that stale info is living across the delete-cluster and subsequent recreation
of the cluster in DAS.

# here is the GMS TCP port before the first delete-cluster
[#|2011-06-22T07:07:28.088-0700|CONFIG|glassfish3.1|ShoalLogger|_ThreadID=18;_ThreadName=Thread-2;|Grizzly controller listening on /0.0.0.0:9110. Controller started in 6 ms|#]

# this is happening due to the delete-cluster command.
[#|2011-06-22T07:09:19.340-0700|INFO|glassfish3.1|ShoalLogger|_ThreadID=54;_ThreadName=Thread-2;|GMS1010: Leaving GMS group: c1 with shutdown type set to InstanceShutdown|#]

#here is the new GMS client tcp port after create-cluster command.
[#|2011-06-22T07:09:33.482-0700|CONFIG|glassfish3.1|ShoalLogger|_ThreadID=60;_ThreadName=Thread-2;|Grizzly controller listening on /0.0.0.0:9187. Controller started in 1 ms|#]

# GMS view just before the Distributed State Cache should be sent to all these instances.

# Note that the server TCP port is 9187. (Address format: tcp address:tcp port:multicast group address:multicast port:clustername:instancename)
[#|2011-06-22T07:10:27.786-0700|INFO|glassfish3.1|ShoalLogger|_ThreadID=62;_ThreadName=Thread-2;|GMS1092: GMS View Change Received for group: c1 : Members in view for FAILURE_EVENT(before change analysis) are :
1: MemberId: in2, MemberType: CORE, Address: 10.133.187.30:9196:228.9.146.63:7623:c1:in2
2: MemberId: in3, MemberType: CORE, Address: 10.133.187.30:9120:228.9.146.63:7623:c1:in3
3: MemberId: server, MemberType: SPECTATOR, Address: 10.133.187.30:9187:228.9.146.63:7623:c1:server
#]

# The following is the request from instance "in2" to update the distributed state cache when it could not find TX_LOG_DIR for instance in1.

[#|2011-06-22T07:11:27.846-0700|FINE|glassfish3.1|ShoalLogger.dsc|_ThreadID=78;
_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.MessageWindow;MethodName=handleDSCMessage;|DSCMessageReceived from :in2, Operation :ADDALLLOCAL|#]

[#|2011-06-22T07:11:27.846-0700|FINE|glassfish3.1|ShoalLogger.dsc|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.MessageWindow;MethodName=handleDSCMessage;|Syncing local cache with group ...|#]

[#|2011-06-22T07:11:27.847-0700|FINER|glassfish3.1|ShoalLogger|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl;MethodName=sendMessage;|sending message to member: server|#]

# The server address below uses server TCP port 9110, which no longer exists; it was left over from the prior test run.
# It is a bug that this reference remained around, but it is a different bug than the current distributed state cache update problem. If this message broadcast had succeeded, this bug would have been fixed.
# The workaround is simply to restart the DAS after delete-cluster.
[#|2011-06-22T07:11:27.847-0700|FINE|glassfish3.1|ShoalLogger|_ThreadID=78;_ThreadName=Thread-2;ClassName=com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl;MethodName=sendMessage;|sendMessage(synchronous=true, to=group) failed to send msg com.sun.enterprise.ee.cms.impl.common.DSCMessage@d40911 to member 10.133.187.30:9110:228.9.27.148:3038:c1:server|#]
Comment by Joe Fialli [ 28/Jun/11 ]

Fix for the latest intermittent failure of this test. This patch ensures that a
cached reference to the distributed state cache does not survive in the DAS when
the following sequence of asadmin commands takes place without restarting the DAS
between the delete and the create (a generic sketch of the pattern follows the commands below).

asadmin delete-cluster clusterX
asadmin create-cluster clusterX
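
For illustration only, a minimal generic sketch of the stale-cache pattern involved (the class and method names below are hypothetical and are not the Shoal GMS source): if a handle keyed by cluster name is cached and never evicted on delete-cluster, a later create-cluster with the same name silently reuses the stale entry, and its outdated member addresses, until the DAS is restarted.

import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry illustrating the stale-reference pattern; not Shoal GMS code.
final class GroupHandleRegistry {
    private final ConcurrentHashMap<String, Object> handlesByCluster =
            new ConcurrentHashMap<String, Object>();

    // Reuses a cached handle if present; otherwise creates and caches one.
    Object getOrCreate(String clusterName) {
        Object handle = handlesByCluster.get(clusterName);
        if (handle == null) {
            Object fresh = new Object(); // a real group handle would be built here
            handle = handlesByCluster.putIfAbsent(clusterName, fresh);
            if (handle == null) {
                handle = fresh;
            }
        }
        return handle;
    }

    // Without this eviction on delete-cluster, a recreated cluster with the same
    // name keeps using the handle from the previous incarnation.
    void onClusterDeleted(String clusterName) {
        handlesByCluster.remove(clusterName);
    }
}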

Fix committed to the shoal-gms workspace trunk (for glassfish 3.1.1) in svn revision 1661 and
to the shoal-gms workspace branch gms-transport-modules (for glassfish trunk) in svn revision 1662.

New shoal-gms-impl.jar from shoal trunk integrated in gf 3.1.1 in glassfish svn revision 47807 on June 30th.
Integration in glassfish trunk is still pending.

Comment by Bobby Bissett [ 11/Jul/11 ]

Integrated into GF trunk as well, revision 47962.





[GLASSFISH-16421] Factor Shoal GMS grizzly transport dependent classes into shoal-gms-grizzly-1_9.jar and shoal-gms-grizzly-2_0.jar. Created: 21/Apr/11  Updated: 20/Oct/11  Resolved: 20/Oct/11

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: None
Fix Version/s: 3.1.2_b02

Type: Improvement Priority: Major
Reporter: Bobby Bissett Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Completes the transition from Grizzly 1.9 to Grizzly 2.0 for the Shoal GMS impl jar. When completed, shoal-gms-grizzly-2_0.jar would be integrated into the GlassFish 3.2 branch. Currently, Shoal GMS Grizzly transport support for both 1.9 and 2.0 is in shoal-gms-impl.jar, integrated with the glassfish 3.2 workspace.



 Comments   
Comment by Joe Fialli [ 20/Oct/11 ]

fixed in 3.1.2 branch and in 4.0 trunk

Comment by Joe Fialli [ 20/Oct/11 ]

fix is checked into 3.1.2 branch and trunk (4.0)





[GLASSFISH-16420] New GMS configuration info on cluster and group-management-service element in domain.xml Created: 21/Apr/11  Updated: 17/Oct/12

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: None
Fix Version/s: future release

Type: New Feature Priority: Major
Reporter: Bobby Bissett Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Tags: 3_1_2-exclude

 Description   

A new heartbeat failure detection implementation may need alternative configuration parameters (given a different algorithm; unknown at this point).
SSL configuration for GMS TCP.
GMS member authentication.
(Impacts GMSAdapterImpl config processing and asadmin create-cluster subcommand parameters.)



 Comments   
Comment by Bobby Bissett [ 07/Dec/11 ]

Moving to Joe since I'm no longer on project.





[GLASSFISH-16419] Virtual multicast optimization to send messages concurrently. Created: 21/Apr/11  Updated: 17/Oct/12

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: None
Fix Version/s: future release

Type: New Feature Priority: Major
Reporter: Bobby Bissett Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Tags: 3_1_2-exclude

 Description   

The only reason to wait for completion of a send is to be notified of failed delivery.
With Grizzly 2.0 using async send, it should be easy to have a no-wait mode for delivery.
Point-to-point messages could be sent synchronously, and unicast sends that are
part of a broadcast could be sent without waiting for the send to complete.
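
A rough sketch of the proposed split, for illustration only (the UnicastSender interface below is hypothetical, not the Shoal or Grizzly API): a point-to-point send blocks on the result so failed delivery is reported, while the per-member unicast sends that make up a broadcast are fired without waiting.

import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical transport facade used only to illustrate the idea.
interface UnicastSender {
    Future<Boolean> sendAsync(String memberId, byte[] payload);
}

final class BroadcastHelper {
    // Point-to-point: the caller wants to know about failed delivery, so block on the Future.
    static boolean sendPointToPoint(UnicastSender sender, String member, byte[] msg)
            throws InterruptedException, ExecutionException {
        return sender.sendAsync(member, msg).get();
    }

    // Broadcast emulated as a unicast fan-out: fire each send without waiting,
    // trading per-send failure reporting for concurrency.
    static void broadcastNoWait(UnicastSender sender, List<String> members, byte[] msg) {
        for (String member : members) {
            sender.sendAsync(member, msg); // intentionally not calling get()
        }
    }
}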






[GLASSFISH-16418] New Heartbeat Failure Detection implementation optimized for non-multicast and no DAS Created: 21/Apr/11  Updated: 17/Oct/12

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: None
Fix Version/s: future release

Type: New Feature Priority: Major
Reporter: Bobby Bissett Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Self-configuring cluster case. (Note: this item's priority should track the priority of the self-configuring cluster feature.)






[GLASSFISH-16417] asadmin get-health needs to work in self configuring(ad hoc clusters) cluster env Created: 21/Apr/11  Updated: 20/Oct/11  Resolved: 20/Oct/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1.2_b01
Fix Version/s: 3.1.2_b06

Type: New Feature Priority: Major
Reporter: Bobby Bissett Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Tags: 3_2prd

 Description   

In GlassFish 3.1, this command runs only against the DAS, and there is no DAS in a self-configuring cluster environment.
Other commands that run against the DAS (list-instances, start-cluster) are non-requirements.
So we need to evaluate whether this command should be implemented for self-configuring clusters.

Health info could be stored in GMS master.
Command needs to be able to locate the master via cluster name.
(investigate means to associate clustername with configuration info such as GMS_DISCOVERY_URI_LIST)



 Comments   
Comment by Bobby Bissett [ 20/Oct/11 ]

Is in b06





[GLASSFISH-16416] heartbeats over UDP unicast Created: 21/Apr/11  Updated: 21/Oct/11

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major
Reporter: Bobby Bissett Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Tags: 3_1_2-exclude

 Description   

Administrator shall be able to configure heartbeats to be sent over UDP unicast transport when multicast is disabled.






[GLASSFISH-16415] Administrators shall be able to configure clustered instances to potentially be separated by a firewall. Created: 21/Apr/11  Updated: 17/Oct/12

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: 4.0
Fix Version/s: future release

Type: New Feature Priority: Critical
Reporter: Bobby Bissett Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Tags: 3_2prd

 Description   

Support for hybrid cloud (part private and part public cloud).
Default heartbeat failure detection configuration will require adjustment to account for potentially slower network throughput across the firewall.



 Comments   
Comment by shreedhar_ganapathy [ 27/Oct/11 ]

Changed AffectsVersion to 4.0





[GLASSFISH-16414] Administrator shall be able to configure a cluster to not require UDP multicast. Created: 21/Apr/11  Updated: 21/Oct/11  Resolved: 21/Oct/11

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1.2_b06

Type: New Feature Priority: Critical
Reporter: Bobby Bissett Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Provide new cluster properties to the asadmin create-cluster subcommand to use group discovery rather than UDP multicast.

The new cluster property is GMS_DISCOVERY_URI_LIST.
Changes are checked in.

% asadmin create-cluster --properties GMS_DISCOVERY_URI_LIST=generate:GMS_LISTENER_PORT=9091 myCluster1
% asadmin create-cluster --properties GMS_DISCOVERY_URI_LIST=generate:GMS_LISTENER_PORT=9092 myCluster2



 Comments   
Comment by Joe Fialli [ 21/Oct/11 ]

Functionality checked into glassfish 3.1.2 and 4.0 workspaces.





[GLASSFISH-16413] Administrators shall be able to configure a GMS group discovery mechanism for a site. Created: 21/Apr/11  Updated: 17/Oct/12

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: 4.0
Fix Version/s: not determined

Type: New Feature Priority: Critical
Reporter: Bobby Bissett Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Tags: 3_2prd

 Description   

Mechanism is used to enable a GMS cluster when UDP multicast is unavailable between clustered instances.
Provide CLI to install a group discovery service as an OS service at a Well Known Address.
Provide CLI to configure VM template to reference a site-wide group discovery mechanism.
Provide CLI to configure S3-based group discovery.
(See GLASSFISH-3636 for issue that GMS requires UDP multicast.)



 Comments   
Comment by shreedhar_ganapathy [ 27/Oct/11 ]

Changed AffectsVersion to 4.0

Comment by Bobby Bissett [ 07/Dec/11 ]

Moving to Joe since I'm no longer on project.





[GLASSFISH-16276] get-health reports incorrect status using jdk7 b136 after cluster is stopped Created: 28/Mar/11  Updated: 14/Mar/12  Resolved: 16/Apr/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1_b43
Fix Version/s: 3.1.1_b01, 4.0_b37

Type: Bug Priority: Major
Reporter: zorro Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

10-machine cluster setup on linux 64bit



 Description   

Used jdk7_b136 on b43, and get-health reported incorrect status after a successful cluster shutdown.
After stopping the cluster I get:
n1c1m1 failed since Mon Mar 28 02:32:24 UTC 2011
n1c1m2 failed since Mon Mar 28 02:32:24 UTC 2011
n1c1m3 stopped since Mon Mar 28 02:32:16 UTC 2011
n1c1m4 failed since Mon Mar 28 02:32:24 UTC 2011
n1c1m5 failed since Mon Mar 28 02:32:26 UTC 2011
n1c1m6 failed since Mon Mar 28 02:32:24 UTC 2011
n1c1m7 failed since Mon Mar 28 02:32:24 UTC 2011
n1c1m8 failed since Mon Mar 28 02:32:24 UTC 2011
n1c1m9 failed since Mon Mar 28 02:32:24 UTC 2011
Command get-health executed successfully.

http://aras2.us.oracle.com:8080/logs/gf31/gms//set_03_27_11_t_19_33_25/scenario_installonecluster_Sun_Mar_27_19_33_48_PDT_2011.html



 Comments   
Comment by Joe Fialli [ 28/Mar/11 ]

The instances are being stopped by stop-cluster. All instances report being stopped in their server.log:

>[#|2011-03-28T02:32:16.172+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=18;_ThreadName=Thread-1;|
> GMS1015: Received Group Shutting down message from member: server of group: clusterz1|#]
>
>[#|2011-03-28T02:32:16.315+0000|INFO|glassfish3.1|
>javax.enterprise.system.tools.admin.com.sun.enterprise.v3.admin.cluster|_ThreadID=115;_ThreadName=Thread-1;|
>Server shutdown initiated|#]
>
>[#|2011-03-28T02:32:16.322+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=115;_ThreadName=Thread-1;|
>GMS1096: member: n1c1m1 is leaving group: clusterz1|#]

The UDP notification message from the stopped instance to the DAS is being lost for 8 out of the 9
instances. The planned shutdown notification for n1c1m3 is received in DAS.

Could the test be rerun with the following log level set to FINEST to help track what is going wrong?

% asadmin set-log-level --target clusterz1 ShoalLogger=FINEST // enable FINEST ShoalLogger in clustered instances
% asadmin set-log-level ShoalLogger=FINEST // enable ShoalLogger in DAS server.log

Comment by Joe Fialli [ 30/Mar/11 ]

recreated failure and verified fix.

fix integrated into glassfish 3 trunk. should be in next build.

Comment by Joe Fialli [ 16/Apr/11 ]

fix integrated into 3.1.1 branch.





[GLASSFISH-16173] JDK1.7: GF3.1 was not able to see member ship views in a cluster Created: 08/Mar/11  Updated: 15/Mar/11  Resolved: 15/Mar/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1_b43
Fix Version/s: future release

Type: Bug Priority: Major
Reporter: mzh777 Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OEL5.4
JDK1.7.0-ea-b130


Attachments: Java Archive File shoal-gms-api.jar     Java Archive File shoal-gms-impl.jar     Zip Archive testTXCPDriver.zip    

 Description   

Using JDK1.7.0 + GF3.1 B43, create 4 instances in a cluster (st-cluster) on 1 machine. The EJB fail-over test failed due to no membership view among cluster instances. Used asadmin validate-multicast on 2 windows of the machine and it worked.

The st-domain logs and instance logs are attached.



 Comments   
Comment by Joe Fialli [ 09/Mar/11 ]

The attached log files are insufficient: only the server.log files are provided, and the server*.log files are required.
I recommend using the following commands to collect log files with the DAS running.

// unsure if command creates a directory or if you need to create st-cluster-logs
% asadmin collect-log-files --target st-cluster --retrieve st-cluster-logs
% asadmin collect-log-files --retrieve st-cluster-logs

The generated zip files in st-cluster-logs directory should have all the server.logs
(including the server logs).

Also, if it is possible to rerun the test, running with the following logging enabled would be helpful.
It will provide Shoal configuration information in the server logs.

% asadmin set-log-levels --target st-cluster ShoalLogger=CONFIG
% asadmin set-log-levels ShoalLogger=CONFIG

Will attempt to log in to the machine to inspect "ifconfig -a" results.
Any GMS failure like this should always include "ifconfig -a"
info to assist in analyzing network interface configurations.

Lastly, is there a chance the test was run over a wireless network interface or
with VPN enabled? If so, the remedy is to create the cluster with the following command
(when running only on one machine). Consider whether there is any non-typical network
interface situation. Also, does this configuration work with JDK 1.6 but not JDK 1.7
on the same configuration? It was not explicitly stated in the bug report whether changing to JDK 1.6
causes the test to run fine.

$ asadmin create-cluster --bindaddress 127.0.0.1 st-cluster

The above command ensures that all instances use an IP address that is sure to work
independent of VPN being on.

Comment by Joe Fialli [ 09/Mar/11 ]

There is a subtle difference between "asadmin validate-multicast" and how Shoal GMS registers for
multicast in GlassFish 3.1. The Shoal GMS trunk includes a fix that Tom Mueller needed to work
on an IPV6-only system. asadmin validate-multicast worked on this system and shoal gms in glassfish 3.1
did not. I will attach shoal-gms-impl.jar and shoal-gms-api.jar from shoal-gms trunk to see if that
addresses the problem for you. That shoal gms code uses precisely the same mechanism that asadmin
validate-multicast uses to configure a multicast socket.

Comment by Joe Fialli [ 09/Mar/11 ]

Shoal GMS jars from Shoal GMS trunk. Have a multicast fix in them.

Comment by Joe Fialli [ 09/Mar/11 ]

There is a different code path for GMS multicast for JDK 1.7.
This code path was verified to work once at the very beginning of
Shoal GMS over Grizzly. No other testing was ever done.

Minimally, I can remove the alternative code path so things work the same
under JDK 1.7 as under JDK 1.6.

The alternative code path is verified to be working on my Solaris 10 box
using JDK 1.7 32 bit.

Comment by Joe Fialli [ 09/Mar/11 ]

On Solaris 10 Sparc using 32 bit JDK 1.7 and glassfish 3.1 release, all the instances were able to see each other after I stopped using --bindaddress 127.0.0.1 on create-cluster. Checking the submitted domain.xml, bindaddress was not set on cluster.

So more analysis will need to be done on the machine in question, since I have not recreated the reported issue except when using the loopback address.

[#|2011-03-09T11:27:55.873-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=17;_ThreadName=Thread-1;|GMS1092: GMS View
Change Received for group: myCluster : Members in view for ADD_EVENT(before change analysis) are :
1: MemberId: instance01, MemberType: CORE, Address: 10.152.20.67:9142:228.9.1.3:2231:myCluster:instance01
2: MemberId: instance02, MemberType: CORE, Address: 10.152.20.67:9156:228.9.1.3:2231:myCluster:instance02
3: MemberId: server, MemberType: SPECTATOR, Address: 10.152.20.67:9131:228.9.1.3:2231:myCluster:server

#]

bash-3.00$ asadmin get-health myCluster
instance01 started since Wed Mar 09 11:27:42 EST 2011
instance02 started since Wed Mar 09 11:27:54 EST 2011
Command get-health executed successfully.

bash-3.00$ java -version
java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b132)
Java HotSpot(TM) Client VM (build 21.0-b03, mixed mode, sharing)

Comment by Joe Fialli [ 09/Mar/11 ]

The fix was to simply disable the conditional code that used the NIO multicast support available in JDK 1.7.

Confirmed the fix on the machine where the problem was reported.

While the code worked on Solaris, it is not working on OEL, so it is just being disabled for the time being.

Below is the commit message from shoal gms trunk:

Revisions:
----------
1561

Modified Paths:
---------------
trunk/gms/impl/src/main/java/com/sun/enterprise/mgmt/transport/grizzly/GrizzlyUtil.java

Diffs:
------
Index: trunk/gms/impl/src/main/java/com/sun/enterprise/mgmt/transport/grizzly/GrizzlyUtil.java
===================================================================
--- trunk/gms/impl/src/main/java/com/sun/enterprise/mgmt/transport/grizzly/GrizzlyUtil.java (revision 1560)
+++ trunk/gms/impl/src/main/java/com/sun/enterprise/mgmt/transport/grizzly/GrizzlyUtil.java (revision 1561)
@@ -78,7 +78,9 @@
     private static Method getNIOMulticastMethod() {
         Method method = null;
         try {
-            method = DatagramChannel.class.getMethod( "join", InetAddress.class, NetworkInterface.class );
+            // WORKAROUND: disable using NIO multicast in JDK 1.7 due to failure on OEL reported as Glassfish issue 16173.
+            // Will revisit enabling again in the future.
+            //method = DatagramChannel.class.getMethod( "join", InetAddress.class, NetworkInterface.class );
         } catch( Throwable t ) {
             method = null;
         }

The Shoal version 1.5.30 shoal-gms-impl.jar and shoal-gms-api.jar will address this issue when running glassfish 3.1 with JDK 1.7.

Comment by Joe Fialli [ 09/Mar/11 ]

Investigating why the jdk 1.7 specific code is failing on OEL and not Solaris.

Comment by Joe Fialli [ 09/Mar/11 ]

On an OEL multihomed machine, I am seeing the following failure with the JDK 1.7 multicast code.

[#|2011-03-10T02:46:03.982+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=75;_ThreadName=Thread-1;|getFirstNetworkInterface result: interface namebond0 address:/fe80:0:0:0:21e:68ff:feee:fecf%7|#]

[#|2011-03-10T02:46:03.982+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=75;_ThreadName=Thread-1;|MulticastSelectorHandler ctor: first Network Interface:bond0 address: /fe80:0:0:0:21e:68ff:feee:fecf%7|#]

[#|2011-03-10T02:46:03.982+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=75;_ThreadName=Thread-1;|MulticastSelectorHandler joinMethod=public abstract java.nio.channels.MembershipKey java.nio.channels.MulticastChannel.join(java.net.InetAddress,java.net.NetworkInterface) throws java.io.IOException|#]

[#|2011-03-10T02:46:03.992+0000|INFO|glassfish3.1|com.sun.grizzly.config.GrizzlyServiceListener|_ThreadID=79;_ThreadName=Thread-1;|GRIZZLY0001: Starting Grizzly Framework 1.9.31 - 3/10/11 2:46 AM|#]

[#|2011-03-10T02:46:04.000+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=81;_ThreadName=Thread-1;|MulticastSelectorHandler.initSelector role=CLIENT_SERVER datagramSocket.isConnected false datagramSocket.localSocketAddressAddress=/0:0:0:0:0:0:0:0:2231|#]

[#|2011-03-10T02:46:04.000+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=81;_ThreadName=Thread-1;|MulticastSelectorHandler.initSelector calling join anInterface=bond0|#]

[#|2011-03-10T02:46:04.001+0000|WARNING|glassfish3.1|ShoalLogger|_ThreadID=81;_ThreadName=Thread-1;|Exception occured when tried to join datagram channel
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:613)
at com.sun.enterprise.mgmt.transport.grizzly.MulticastSelectorHandler.initSelector(MulticastSelectorHandler.java:163)
at com.sun.enterprise.mgmt.transport.grizzly.MulticastSelectorHandler.preSelect(MulticastSelectorHandler.java:129)
at com.sun.grizzly.SelectorHandlerRunner.doSelect(SelectorHandlerRunner.java:188)
at com.sun.grizzly.SelectorHandlerRunner.run(SelectorHandlerRunner.java:132)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.SocketException: Invalid argument
at sun.nio.ch.Net.joinOrDrop6(Native Method)
at sun.nio.ch.Net.join6(Net.java:444)
at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:805)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:841)
... 11 more

#]

It looks like there is a mixture of IPv4 (the multicast address is IPv4) and IPv6 (the first address of the network interface is IPv6): the join call is made with an IPv4 multicast address, while the provided network interface bond0
has a first address that is IPv6.

Still investigating...

Comment by Joe Fialli [ 09/Mar/11 ]

The encountered failure is definitely due to mixing ipv4 and ipv6.
By configuring GMS to use only IPv6 (by setting multicastaddress to an IPv6 address),
I was able to get the cluster working using JDK 7 on OEL with a network interface that
had both IPv4 and IPv6 enabled.

One workaround would be to totally disable ipv6.
Another is to configure GMS to use ipv6 by explicitly setting multicastaddress for the cluster
to an ipv6 address. (I only tried setting multicastaddress to ipv6).

I do not know enough about jdk 7 to know if it prefers ipv6 over ipv4 when both are available.
This is an observation at this point that needs further research.

On the solaris 10 system that I had no problem on, only ipv4 was configured.

On the OEL systems where we have observed issues, both IPv4 and IPv6 are configured to be on.
[aroot@gf-ha-dev-10 ~]# ifconfig -a
bond0 Link encap:Ethernet HWaddr 00:1E:68:EE:FE:CF
inet addr:10.133.184.235 Bcast:10.133.191.255 Mask:255.255.248.0
inet6 addr: fe80::21e:68ff:feee:fecf/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:9217811 errors:0 dropped:0 overruns:0 frame:0
TX packets:257417 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1269365649 (1.1 GiB) TX bytes:34826247 (33.2 MiB)

By creating the cluster with an IPV6 multicast address, I was able to get things to work with Glassfish 3.1 and JDK 7.

%asadmin create-cluster --multicastaddress FF02:16:16:0:0:0:0:16 yourClusterName

Here is a sample of it working with GMS View and asadmin get-health.

[hatester@gf-ha-dev-10 hatester]$ asadmin get-health myCluster
instance01 started since Thu Mar 10 04:19:16 UTC 2011
instance02 started since Thu Mar 10 04:19:15 UTC 2011
Command get-health executed successfully.

from DAS server.log
/export/hatester/jdk1.7.0/bin/java
<deleted>
[#|2011-03-10T04:19:21.899+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=75;_ThreadName=Thread-1;|GMS1092: GMS View Change Received for group: myCluster : Members in view for JOINED_AND_READY_EVENT(before change analysis) are :
1: MemberId: instance01, MemberType: CORE, Address: 10.133.184.235:9144:FF02:16:16:0:0:0:0:16:2231:myCluster:instance01
2: MemberId: instance02, MemberType: CORE, Address: 10.133.184.235:9180:FF02:16:16:0:0:0:0:16:2231:myCluster:instance02
3: MemberId: server, MemberType: SPECTATOR, Address: 10.133.184.235:9113:FF02:16:16:0:0:0:0:16:2231:myCluster:server

#]
Comment by Joe Fialli [ 10/Mar/11 ]

Summary of issue and its workaround:

GlassFish 3.1 specifies an IPv4 multicast address by default, and if the multicast address is set by
the user, it would typically be set in IPv4 format.
On a network interface that has both IPv4 and IPv6 enabled (verified using ifconfig -a),
and using JDK 7 on Oracle Enterprise Linux, GMS uses the NIO multicast support that was introduced
in JDK 7. There is a JDK 7 ease-of-use issue that prevents mixing an IPv4 multicast address
with a DatagramChannel open that defaults to IPv6 when it is available.

Available workarounds (some of the workarounds below have ramifications that might not be acceptable for production environments; options 2 and 3 may not be acceptable):

There are three possible workarounds (only one needs to be done).

1. Specify cluster multicastaddress using IPV6 address format when IPV6 is available on network interface.
% asadmin create-cluster --multicastaddress FF02:16:16:0:0:0:0:16

2. Disable ipv6 on the network interface. (OS specific on how to achieve this.)

3. Configure GlassFish to prefer IPv4 using the following commands:
$GF_HOME/bin/asadmin start-domain ${DOMAIN}
$GF_HOME/bin/asadmin create-jvm-options -Djava.net.preferIPv4Stack=true
$GF_HOME/bin/asadmin restart-domain ${DOMAIN}

Note: The above should be done before creating the cluster.

Details:

On a network interface that has both IPv4 and IPv6 enabled, the default behavior
with JDK 7 on OEL is to create an IPv6 channel when one calls DatagramChannel.open() with no
parameter specifying the protocol family. Shoal GMS calls open() with no
protocol family specified. The javadoc for this method explicitly states that if the
protocol family is not provided when calling open, the protocol family of the channel's socket is
unspecified. The source code implementation confirms the preference for IPv6 when
it is available and DatagramChannel.open() is called with no protocol family.
See lines 103-104 at the following
link: http://hg.openjdk.java.net/jdk7/2d/jdk/file/f06f30b29f36/src/share/classes/sun/nio/ch/DatagramChannelImpl.java

Given that Shoal GMS relies on the unspecified protocol family for the
open, this bug may or may not happen on other platforms depending on their defaulting.
When the multicast address is specified in IPv4 and
the default protocol family is relied upon, we encounter what can be considered
an ease-of-use bug.

Caused by: java.net.SocketException: Invalid argument
at sun.nio.ch.Net.joinOrDrop6(Native Method)
at sun.nio.ch.Net.join6(Net.java:444)
at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:805)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:841)

If GMS simply computed the address family of the multicast address and called DatagramChannel.open(protocolFamily),
this issue would go away. Link: http://download.java.net/jdk7/docs/api/java/nio/channels/DatagramChannel.html#open%28java.net.ProtocolFamily%29

However, since DatagramChannel.open(ProtocolFamily) and java.net.ProtocolFamily were only added in JDK 7,
it would take additional reflection lookups and calls (we already do this for the JDK 7 method DatagramChannel.join(..)).

Shoal GMS code is clearly violating the recommendation to explicitly specify the protocol family
when opening the DatagramChannel. (See the fragment of code at http://blogs.sun.com/alanb/entry/multicasting_with_nio)

The recommendation is to compute the protocol family of the provided multicast address and be sure to specify the same protocol family when calling DatagramChannel.open(ProtocolFamily).
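
A minimal sketch of that recommendation, shown with direct JDK 7 calls for clarity (the real Shoal code would have to reach these APIs via reflection to keep compiling on JDK 6; the class below is illustrative, not the Shoal source):

import java.io.IOException;
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.ProtocolFamily;
import java.net.StandardProtocolFamily;
import java.net.StandardSocketOptions;
import java.nio.channels.DatagramChannel;
import java.nio.channels.MembershipKey;

final class MulticastChannelFactory {
    // Open the channel with the same protocol family as the configured multicast group,
    // so an IPv4 group address is never joined on a channel that defaulted to IPv6.
    static DatagramChannel openFor(InetAddress group, int port, NetworkInterface nif)
            throws IOException {
        ProtocolFamily family = (group instanceof Inet6Address)
                ? StandardProtocolFamily.INET6
                : StandardProtocolFamily.INET;
        DatagramChannel channel = DatagramChannel.open(family)
                .setOption(StandardSocketOptions.SO_REUSEADDR, true)
                .bind(new InetSocketAddress(port));
        channel.setOption(StandardSocketOptions.IP_MULTICAST_IF, nif);
        MembershipKey key = channel.join(group, nif); // families now match
        return channel;
    }
}

Usage would be along the lines of MulticastChannelFactory.openFor(InetAddress.getByName("228.9.1.3"), 2231, NetworkInterface.getByName("bond0")).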

Comment by Joe Fialli [ 10/Mar/11 ]

Link to issue reported to JDK 7 http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7026376

It may take a few days to be visible through above link.

Comment by Joe Fialli [ 15/Mar/11 ]

With an internal version of jdk 7 b136 with a fix for JDK 7 issue 7026376, confirmed that glassfish 3.1 works on OEL.

No workaround is needed with this fix.

Comment by Joe Fialli [ 15/Mar/11 ]

Fixed with JDK 1.7 b136 when it is released in a few weeks.

Confirmed fix with an internal JDK 1.7 b136 against the glassfish 3.1 release.
Before b136 is available, use one of the recommended workarounds mentioned
in a comment on this issue.





[GLASSFISH-16109] get-health command not showing rejoins Created: 28/Feb/11  Updated: 14/Mar/12  Resolved: 29/Mar/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1_b43
Fix Version/s: 3.1.1_b01, 4.0_b37

Type: Bug Priority: Minor
Reporter: Bobby Bissett Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
depends on SHOAL-116 rejoin subevent is null in JoinedAndR... Closed
Tags: 3_1-next

 Description   

When a rejoin event happens (an instance fails and restarts before GF knows it has failed), the status for it in 'asadmin get-health' shows started instead of rejoined. This is happening because the rejoin subevent is null in the JoinedAndReadyNotificationSignal. Am filing this for GF integration and will file a more technical one against Shoal to get the fix in.

This is a minor issue now because instances probably won't restart this quickly. But it would be good to get the fix in. To cause this, here's a snippet of domain.xml that leads to an absurdly long time to kill/restart an instance:

<group-management-service>
<failure-detection
heartbeat-frequency-in-millis="1800000"
max-missed-heartbeats="30"></failure-detection>
</group-management-service>



 Comments   
Comment by Bobby Bissett [ 28/Feb/11 ]

Adding dependency.

Comment by Bobby Bissett [ 02/Mar/11 ]

This has been fixed in the shoal workspace and will be in the next integration of shoal into glassfish.

Comment by Bobby Bissett [ 29/Mar/11 ]

Fixed in revision 45770.





[GLASSFISH-16108] validate-multicast tool can give duplicate results Created: 28/Feb/11  Updated: 14/Mar/12  Resolved: 29/Mar/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1_b43
Fix Version/s: 3.1.1_b01, 4.0_b37

Type: Bug Priority: Minor
Reporter: Bobby Bissett Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Dependency
depends on SHOAL-115 MultiCastReceiverThread is not cleari... Resolved
Tags: 3_1-next

 Description   

When running the "asadmin validate-multicast" command, there can be duplicate entries shown so that a host shows up more than once. For instance:

Received data from myhost (loopback)
Received data from someotherhost
Received data from myhost

'myhost' appears twice, when it should only be there once. This is a minor issue, and is due to a bug in the gms code that does not clear out the data buffer between receive() calls on the socket. If you run the tool with the --verbose option, you can see the full messages that are being received that cause the duplicate entries.

What's important is that the host name shows up at all, not how many times it shows up. But this could cause confusion with users (it confused me).
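
For illustration, here is a minimal sketch (not the actual Shoal MultiCastReceiverThread code) of the underlying pitfall: when a DatagramPacket and its byte buffer are reused across receive() calls, stale bytes from an earlier, longer datagram remain in the buffer unless the packet length is reset and only getLength() bytes are read.

import java.net.DatagramPacket;
import java.net.DatagramSocket;

// Illustrative receive loop showing the buffer-reuse pitfall and its fix.
final class ReceiveLoop {
    static void run(DatagramSocket socket) throws Exception {
        byte[] buf = new byte[8192];
        DatagramPacket packet = new DatagramPacket(buf, buf.length);
        while (!socket.isClosed()) {
            packet.setLength(buf.length); // reset: receive() shrinks the length to the last datagram's size
            socket.receive(packet);
            // Read only this datagram's bytes; anything in buf past getLength()
            // is leftover data from a previous, longer message.
            String msg = new String(packet.getData(), packet.getOffset(), packet.getLength(), "UTF-8");
            System.out.println("Received data from " + packet.getAddress().getHostName() + ": " + msg);
        }
    }
}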



 Comments   
Comment by Bobby Bissett [ 28/Feb/11 ]

Adding a dependency to the Shoal issue. Will fix in trunk there and integrate into GF with another Shoal promotion.

Comment by Bobby Bissett [ 03/Mar/11 ]

This is now fixed in the Shoal workspace and will be in the next Shoal integration into GlassFish.

Comment by Bobby Bissett [ 29/Mar/11 ]

Fixed in revision 45770.





[GLASSFISH-16103] GMS fails to start on IPv6 only system Created: 25/Feb/11  Updated: 14/Mar/12  Resolved: 20/Apr/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1_b43
Fix Version/s: 3.1.1_b01, 4.0_b37

Type: Bug Priority: Major
Reporter: Tom Mueller Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Oracle Enterprise Linux 5, with an IPv6-only stack. The ifconfig output is below. Note that
the eth0 interface is down but it is the first one listed.

# ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:09:3D:10:CF:C8
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:185 Memory:fe810000-fe820000

eth1 Link encap:Ethernet HWaddr 00:09:3D:10:CF:C9
inet6 addr: fe80::209:3dff:fe10:cfc9/64 Scope:Link
inet6 addr: fc01::25/96 Scope:Global
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:293292 errors:0 dropped:0 overruns:0 frame:0
TX packets:347212 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:228022244 (217.4 MiB) TX bytes:378699726 (361.1 MiB)
Interrupt:193 Memory:fe830000-fe840000

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:6863 errors:0 dropped:0 overruns:0 frame:0
TX packets:6863 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:6932837 (6.6 MiB) TX bytes:6932837 (6.6 MiB)

sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)


Attachments: Java Archive File shoal-gms-impl.jar     File shoal.dif    
Tags: 3_1-exclude, 3_1-next

 Description   

To recreate the problem, create a cluster on an IPv6-only system with a single instance on another node. The cluster has to be created with the multicast address set as follows:

asadmin create-cluster --multicastaddress ff02::1 c1

Then restart the domain. The log will contain the following exception message:

[#|2011-02-24T22:43:19.510+0100|SEVERE|glassfish3.2|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=1;_ThreadName=Thread-1;|GMSAD1017: GMS failed to start. See stack trace for additional information.
com.sun.enterprise.ee.cms.core.GMSException: failed to join group c1
at com.sun.enterprise.ee.cms.impl.base.GMSContextImpl.join(GMSContextImpl.java:182)
at com.sun.enterprise.ee.cms.impl.common.GroupManagementServiceImpl.join(GroupManagementServiceImpl.java:381)
at org.glassfish.gms.GMSAdapterImpl.initializeGMS(GMSAdapterImpl.java:573)
at org.glassfish.gms.GMSAdapterImpl.initialize(GMSAdapterImpl.java:199)
at org.glassfish.gms.bootstrap.GMSAdapterService.loadModule(GMSAdapterService.java:218)
at org.glassfish.gms.bootstrap.GMSAdapterService.checkCluster(GMSAdapterService.java:192)
at org.glassfish.gms.bootstrap.GMSAdapterService.checkAllClusters(GMSAdapterService.java:180)
at org.glassfish.gms.bootstrap.GMSAdapterService.postConstruct(GMSAdapterService.java:132)
at com.sun.hk2.component.AbstractCreatorImpl.inject(AbstractCreatorImpl.java:131)
at com.sun.hk2.component.ConstructorCreator.initialize(ConstructorCreator.java:91)
at com.sun.hk2.component.AbstractCreatorImpl.get(AbstractCreatorImpl.java:82)
at com.sun.hk2.component.SingletonInhabitant.get(SingletonInhabitant.java:67)
at com.sun.hk2.component.EventPublishingInhabitant.get(EventPublishingInhabitant.java:139)
at com.sun.hk2.component.AbstractInhabitantImpl.get(AbstractInhabitantImpl.java:76)
at com.sun.enterprise.v3.server.AppServerStartup.run(AppServerStartup.java:243)
at com.sun.enterprise.v3.server.AppServerStartup.start(AppServerStartup.java:135)
at com.sun.enterprise.glassfish.bootstrap.GlassFishImpl.start(GlassFishImpl.java:79)
at com.sun.enterprise.glassfish.bootstrap.GlassFishMain$Launcher.launch(GlassFishMain.java:117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.sun.enterprise.glassfish.bootstrap.GlassFishMain.main(GlassFishMain.java:97)
at com.sun.enterprise.glassfish.bootstrap.ASMain.main(ASMain.java:55)
Caused by: com.sun.enterprise.ee.cms.core.GMSException: initialization failure
at com.sun.enterprise.mgmt.ClusterManager.<init>(ClusterManager.java:142)
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.initializeGroupCommunicationProvider(GroupCommunicationProviderImpl.java:164)
at com.sun.enterprise.ee.cms.impl.base.GMSContextImpl.join(GMSContextImpl.java:176)
... 23 more
Caused by: java.net.SocketException: Cannot assign requested address
at java.net.PlainDatagramSocketImpl.socketSetOption(Native Method)
at java.net.PlainDatagramSocketImpl.setOption(PlainDatagramSocketImpl.java:299)
at java.net.MulticastSocket.setNetworkInterface(MulticastSocket.java:506)
at com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender.start(BlockingIOMulticastSender.java:165)
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyNetworkManager.start(GrizzlyNetworkManager.java:434)
at com.sun.enterprise.mgmt.ClusterManager.<init>(ClusterManager.java:140)
... 25 more

#]

The exception persists even if the bind address is set to various different values: the IPv6 site-scope address, the IPv6 link-scope address.

A shoal-gms-impl.jar file that fixes the problem, along with the diffs, is attached. Just put this JAR file into the glassfish/modules directory and GMS works fine on an IPv6-only system.

It is still unknown whether the disabled eth0 interface or the sit0 interface has anything to do with this problem, or whether there is some other IPv6 configuration change that could be made to work around it. The problem does not show up on a dual-stack system that uses both IPv4 and IPv6.

There are several Java problems related to multicast sockets and IPv6 reported in the past. The one that seems most relevant to this issue is:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6262075

The enclosed patch works around the issue by calling MulticastSocket.setInterface() instead of
setNetworkInterface(), and it explicitly checks whether a network interface is up before using it.
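
A minimal sketch of that approach using plain JDK calls (illustrative only; the attached shoal.dif is the authoritative change): skip interfaces that are down, loopback, or multicast-incapable, and pass one of the chosen interface's addresses to MulticastSocket.setInterface().

import java.net.InetAddress;
import java.net.MulticastSocket;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Enumeration;

final class MulticastInterfaceSelector {
    // Pick the first usable interface: up, not loopback, and multicast-capable,
    // so a disabled eth0 or a sit0 tunnel is never selected.
    static void configure(MulticastSocket socket) throws Exception {
        Enumeration<NetworkInterface> nifs = NetworkInterface.getNetworkInterfaces();
        while (nifs.hasMoreElements()) {
            NetworkInterface nif = nifs.nextElement();
            if (!nif.isUp() || nif.isLoopback() || !nif.supportsMulticast()) {
                continue;
            }
            Enumeration<InetAddress> addrs = nif.getInetAddresses();
            if (addrs.hasMoreElements()) {
                socket.setInterface(addrs.nextElement()); // instead of setNetworkInterface(nif)
                return;
            }
        }
        throw new SocketException("no usable multicast-capable network interface found");
    }
}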



 Comments   
Comment by Joe Fialli [ 25/Feb/11 ]

Checked fix into shoal trunk.

Still needs to be integrated into glassfish 3.1.

Comment by Bobby Bissett [ 20/Apr/11 ]

This was integrated into GF in revision 45770 (before 3.1.1 branch created).





[GLASSFISH-16073] GMS1077 gms total message size is too big (default max is 4MB) is repeated in server log Created: 22/Feb/11  Updated: 14/Mar/12  Resolved: 20/Apr/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1_b43
Fix Version/s: 3.1.1_b01, 4.0_b37

Type: Bug Priority: Minor
Reporter: Joe Fialli Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Tags: 3-1_exclude, 3_1-next

 Description   

An attempt to send a message larger than GMS max message size results in repeated reporting of the WARNING
in server.log.

Here is sample message:
[#|2011-02-22T17:06:00.412-0500|WARNING|glassfish3.1|ShoalLogger|_ThreadID=24;_ThreadName=Thread-1;|GMS1077: total message size is too big: size = 12,584,300, max size = 4,196,352|#]

[#|2011-02-22T17:06:00.548-0500|WARNING|glassfish3.1|ShoalLogger|_ThreadID=32;_ThreadName=Thread-1;|GMS1077: total message size is too big: size = 12,584,366, max size = 4,196,352|#]

************************************

WORKAROUND:

A workaround exists if the default max gms message size is too small.
To create a cluster with a larger max message size, use the following command:

% asadmin create-cluster --properties "GMS_MAX_MESSAGE_LENGTH=8200000" theClusterName

****

To increase the max message size for an already created cluster, use the following command with cluster running.

$ asadmin set clusters.cluster.myCluster.property.GMS_MAX_MESSAGE_LENGTH=8200000
clusters.cluster.myCluster.property.GMS_MAX_MESSAGE_LENGTH=8200000
Command set executed successfully.

After setting the value, stop the cluster and DAS domain, then restart the DAS and the cluster with the new settings.



 Comments   
Comment by Joe Fialli [ 22/Feb/11 ]

Fix is already known for this issue and is checked into shoal gms trunk.

Comment by Bobby Bissett [ 20/Apr/11 ]

This was integrated into GF in rev 45770 (before 3.1.1 branch was created).





[GLASSFISH-15717] "Very Intermittent: Drop of Planned Shutdown notification of DAS (a spectator) to one of the clustered instances". Created: 27/Jan/11  Updated: 22/Oct/11  Resolved: 21/Oct/11

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1_b38
Fix Version/s: 3.1.2_b06

Type: Bug Priority: Minor
Reporter: zorro Assignee: Joe Fialli
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

linux



 Description   

This is a very intermittent drop of das planned shutdown notification seen in scenarios 10 and 11.
http://aras2.us.oracle.com:8080/logs/gf31/gms/set_01_21_11_t_08_03_25/final_Fri_Jan_21_14_44_57_PST_2011.html

http://aras2.us.oracle.com:8080/logs/gf31/gms//set_01_11_11_t_13_45_23/scenario_0010_Tue_Jan_11_23_55_27_PST_2011.html

The failed constraint was that the PlannedShutdown notification for the DAS was not received by one of the clustered instances.
(Scenario 10 explicitly stops the DAS in the middle of the scenario to verify the GroupLeadership change.)
This failure happened in only one out of 32 runs, and for only one instance in the cluster, so it is definitely quite intermittent.

There is a strong possibility that this was a dropped UDP message. While I have fixed dropped UDP broadcast messages
in this release, this is unfortunately a boundary case that I cannot address with the current design: the rebroadcast of the missed event
cannot take place, since the last event the DAS broadcast was its own shutdown. So when the clustered instance notices
it missed an event, the instance it would ask to rebroadcast the missed event no longer exists, and the dropped UDP packet
cannot be rebroadcast. This would be nontrivial
to fix and is not advised to attempt at this late stage of the release.

Luckily, the DAS does not take part in replicating data, so this missed PlannedShutdown of a SPECTATOR member would not impact HA.
There is no application that I am aware of that depends on the planned shutdown notification of the SPECTATOR DAS. Everything else is okay in the logs:
the instance was notified of a new GroupLeader to replace the shut-down DAS, and the list of currently alive and ready members is correct
(it reflects that the DAS "server" is no longer part of the cluster).

Extracted from http://aras2.us.oracle.com:8080/logs/gf31/gms///set_01_11_11_t_13_45_23/scenario_0010_Tue_Jan_11_23_55_27_PST_2011/easqezorro8_n1c1m7.log

[#|2011-01-12T07:56:38.260+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1093: adding GroupLeadershipNotification signal leadermember: n1c1m1 of group: clusterz1|#]

[#|2011-01-12T07:56:38.260+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1092: GMS View Change Received for group: clusterz1 : Members in view for MASTER_CHANGE_EVENT(before change analysis) are :
1: MemberId: n1c1m1, MemberType: CORE, Address: 10.133.184.208:9132:228.9.53.86:31524:clusterz1:n1c1m1
2: MemberId: n1c1m2, MemberType: CORE, Address: 10.133.184.209:9154:228.9.53.86:31524:clusterz1:n1c1m2
3: MemberId: n1c1m3, MemberType: CORE, Address: 10.133.184.211:9140:228.9.53.86:31524:clusterz1:n1c1m3
4: MemberId: n1c1m4, MemberType: CORE, Address: 10.133.184.213:9196:228.9.53.86:31524:clusterz1:n1c1m4
5: MemberId: n1c1m5, MemberType: CORE, Address: 10.133.184.214:9147:228.9.53.86:31524:clusterz1:n1c1m5
6: MemberId: n1c1m6, MemberType: CORE, Address: 10.133.184.137:9195:228.9.53.86:31524:clusterz1:n1c1m6
7: MemberId: n1c1m7, MemberType: CORE, Address: 10.133.184.138:9121:228.9.53.86:31524:clusterz1:n1c1m7
8: MemberId: n1c1m8, MemberType: CORE, Address: 10.133.184.139:9194:228.9.53.86:31524:clusterz1:n1c1m8
9: MemberId: n1c1m9, MemberType: CORE, Address: 10.133.184.140:9191:228.9.53.86:31524:clusterz1:n1c1m9

#]


 Comments   
Comment by Joe Fialli [ 27/Jan/11 ]

I confirmed that there were UDP drops on the machine that has the missing PlannedShutDown notification.

% netstat -su

Udp:
19870588 packets received
97130 packets to unknown port received.
1 packet receive errors
506777 packets sent

I checked another machine and it had two UDP receive errors.
I did verify that /etc/sysctl.conf had appropriate settings for the
receive buffer. (So the OEL OS is configured as we have requested in
the past.)

This failure can only happen in either Shoal GMS QE Scenario 10 or 11
and it has only ever happened on machine running n1c1m7 (easqezorro8).

The recreation rate at the time this issue was submitted was twice in 104 runs.

It may be possible to tune away the UDP drops by increasing the UDP receive buffer and write buffer sizes
somewhat beyond their current size. If increasing these values makes the failure go away and we do not observe
UDP packet receive errors in "netstat -su", then we would have confirmed the hypothesis that this failure is
due to a UDP drop. As I mentioned in my previously attached email, there is a boundary condition in the current design
that does not allow rebroadcast of a dropped planned shutdown, since the rebroadcast logic is solely
in the master, which has shut down in this case.

The following document describes how to check and set udp buffer sizes for various OS.
http://www.29west.com/docs/THPM/udp-buffer-sizing.html

An unconfirmed workaround for this issue is to tune the system's current UDP buffer sizing by
increasing its value. It would be helpful if we could validate, with the existing GMS QE scenario 10 and
11 testing, whether this workaround addresses the failure that has been reported.

Given that the current UDP read/write buffer size is 512 * 1024, we could increase it to 756 * 1024 to see if that
causes the issue to go away on the easqezorro8 machine.
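
As a side note, a tiny illustrative check (not GMS code) can show what receive buffer the OS actually grants a Java datagram socket after such tuning, since the kernel caps requests above its configured maximum:

import java.net.DatagramSocket;

// Request a 756 KB receive buffer and print what the OS actually granted.
public final class UdpBufferCheck {
    public static void main(String[] args) throws Exception {
        DatagramSocket socket = new DatagramSocket();
        socket.setReceiveBufferSize(756 * 1024);
        System.out.println("effective SO_RCVBUF = " + socket.getReceiveBufferSize());
        socket.close();
    }
}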

Comment by Joe Fialli [ 28/Jan/11 ]

Minimally will investigate if proposed workaround mitigates this very intermittent issue.

Comment by Joe Fialli [ 21/Oct/11 ]

This failure has not been reported in recent GlassFish GMS QE test runs, so closing as cannot-reproduce for the time being.

Comment by zorro [ 22/Oct/11 ]

Confirming that this issue is not being seen in b4 and b5 of version 3.1.2





[GLASSFISH-15445] [Blocking] Incomplete/split GMS view on instances. Created: 05/Jan/11  Updated: 12/Jan/11  Resolved: 05/Jan/11

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: sonymanuel Assignee: Joe Fialli
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive gms-view-ha-dev-setup.zip     Zip Archive gms-view-qe-setup.zip    
Issue Links:
Dependency
blocks GLASSFISH-15446 ReplicaChoice does not have all avail... Resolved

 Description   

With the latest GF nightly build, the GMS views are not consistent across instances in the cluster. I have a 2-machine, 6-instance cluster.

DAS + instance101, 103 & 105 on bigapp-oblade-10
instance102 , 104 & 106 on bigapp-oblade-9

The instances have a split GMS view. Instances on bigapp-oblade-10 only see that view :
[#|2011-01-05T11:57:33.537-0800|INFO|glassfish3.1|ShoalLogger|_ThreadID=13;_ThreadName=Thread-1;|GMS1092: GMS View Change Received for group: st-cluster : Members in view for JOINED_AND_READY_EVENT(before change analysis) are :
1: MemberId: instance101, MemberType: CORE, Address: 10.5.216.156:9182:228.9.45.208:7694:st-cluster:instance101
2: MemberId: instance103, MemberType: CORE, Address: 10.5.216.156:9193:228.9.45.208:7694:st-cluster:instance103
3: MemberId: instance105, MemberType: CORE, Address: 10.5.216.156:9175:228.9.45.208:7694:st-cluster:instance105
4: MemberId: server, MemberType: SPECTATOR, Address: 10.5.216.156:9105:228.9.45.208:7694:st-cluster:server

#]

Instances on bigapp-oblade-9 see only the instances on that machine:
[#|2011-01-05T11:57:38.396-0800|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1092: GMS View Change Received for group: st-cluster : Members in view for JOINED_AND_READY_EVENT(before change analysis) are :
1: MemberId: instance102, MemberType: CORE, Address: 10.133.184.205:9119:228.9.45.208:7694:st-cluster:instance102
2: MemberId: instance104, MemberType: CORE, Address: 10.133.184.205:9157:228.9.45.208:7694:st-cluster:instance104
3: MemberId: instance106, MemberType: CORE, Address: 10.133.184.205:9098:228.9.45.208:7694:st-cluster:instance106

#]

The machines are on the same subnet. I ran asadmin validate-multicast to check. Here is the output.

bigapp-oblade-10:~ # hostname
bigapp-oblade-10
bigapp-oblade-10:~ # /space/gf-ha/glassfish3/bin/asadmin validate-multicast
Will use port 2,048
Will use address 228.9.3.1
Will use bind interface null
Will use wait period 2,000 (in milliseconds)

Listening for data...
Sending message with content "bigapp-oblade-10" every 2,000 milliseconds
Received data from bigapp-oblade-10 (loopback)
Received data from bigapp-oblade-9
Exiting after 20 seconds. To change this timeout, use the --timeout command line option.
Command validate-multicast executed successfully.
bigapp-oblade-10:~ #

bigapp-oblade-9:~ # hostname
bigapp-oblade-9
bigapp-oblade-9:~ # /space/gf-ha/glassfish3/bin/asadmin validate-multicast
Will use port 2,048
Will use address 228.9.3.1
Will use bind interface null
Will use wait period 2,000 (in milliseconds)

Listening for data...
Sending message with content "bigapp-oblade-9" every 2,000 milliseconds
Received data from bigapp-oblade-9 (loopback)
Received data from bigapp-oblade-10
Received data from bigapp-oblade-9
Exiting after 20 seconds. To change this timeout, use the --timeout command line option.
Command validate-multicast executed successfully.
bigapp-oblade-9:~ #

A GMS View issue was also seen on Mahesh's setup (gf-ha-dev-sb6-14 - 18). Here none of the instances have the DAS in their GMS view. Mahesh and I checked the multicast and it works.

Attached are the logs from both setups.



 Comments   
Comment by Joe Fialli [ 05/Jan/11 ]

From the submitted GMS View information, there appeared to be an inconsistent network interface configuration
for the two multihomed machines being used. The "ifconfig -a" output from the 2 machines below confirms this.
It appears to be the same problem that was occurring on Mahesh's gf-ha-dev-sb6-14 and 15 machines.

Note: there is no network interface defined for the 10.5.216 subnet on bigapp-oblade-9.
Thus, bigapp-oblade-10 should not be selecting an IP address on that subnet.
bigapp-oblade-9 /home/jf39279> ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:14:4F:E1:8E:7C
inet addr:10.133.184.205 Bcast:10.133.191.255 Mask:255.255.248.0
inet6 addr: fe80::214:4fff:fee1:8e7c/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:39495713 errors:0 dropped:0 overruns:0 frame:0
TX packets:2340661 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6146412550 (5861.6 Mb) TX bytes:537299629 (512.4 Mb)
Base address:0xdc00 Memory:bbde0000-bbe00000

eth1 Link encap:Ethernet HWaddr 00:14:4F:E1:8E:7D
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Base address:0xd800 Memory:bbd80000-bbda0000

eth2 Link encap:Ethernet HWaddr 00:14:4F:9E:A0:4A
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:15 Base address:0x2000

eth3 Link encap:Ethernet HWaddr 00:14:4F:9E:A0:4B
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:7 Base address:0x8000

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:153172 errors:0 dropped:0 overruns:0 frame:0
TX packets:153172 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:31005158 (29.5 Mb) TX bytes:31005158 (29.5 Mb)

sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

bigapp-oblade-10 /home/jf39279> ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:14:4F:80:0D:DA
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:233 Base address:0x2000

eth1 Link encap:Ethernet HWaddr 00:14:4F:80:0D:DB
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:58 Base address:0x6000

eth2 Link encap:Ethernet HWaddr 00:14:4F:3C:AB:DA
inet addr:10.133.184.206 Bcast:10.133.191.255 Mask:255.255.248.0
inet6 addr: fe80::214:4fff:fe3c:abda/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:48333578 errors:0 dropped:0 overruns:0 frame:0
TX packets:4976490 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:7242454147 (6906.9 Mb) TX bytes:1951041059 (1860.6 Mb)
Base address:0xdc00 Memory:bbde0000-bbe00000

eth3 Link encap:Ethernet HWaddr 00:14:4F:3C:AB:DB
inet addr:10.5.216.156 Bcast:10.5.216.255 Mask:255.255.255.0
inet6 addr: fe80::214:4fff:fe3c:abdb/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:764119 errors:0 dropped:0 overruns:0 frame:0
TX packets:249398 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:196907658 (187.7 Mb) TX bytes:288379041 (275.0 Mb)
Base address:0xd800 Memory:bbd80000-bbda0000

A quick fix would be to disable eth3 (IP address 10.5.216.156) on bigapp-oblade-10.

To fix via GMS configuration:
setting GMS-BIND-INTERFACE-ADDRESS-st-cluster appropriately for all instances, to
the respective IP addresses that share a common subnet, will correct this issue.
For this case, the shared subnet is 10.133.184.

So for this scenario, add
<system-property name="GMS-BIND-INTERFACE-ADDRESS-st-cluster" value="10.133.184.206"/>
to the <server> element for instance101, instance103 and instance105.
Additionally, one would need to add that system property as a child element of
<config name="server"/> to set the DAS bind-interface-address.
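The same system property can also be set from the command line instead of editing domain.xml directly; a sketch using the create-system-properties subcommand, assuming it accepts the per-instance and DAS targets shown here (the last command targets the DAS itself):

% asadmin create-system-properties --target instance101 GMS-BIND-INTERFACE-ADDRESS-st-cluster=10.133.184.206
% asadmin create-system-properties --target instance103 GMS-BIND-INTERFACE-ADDRESS-st-cluster=10.133.184.206
% asadmin create-system-properties --target instance105 GMS-BIND-INTERFACE-ADDRESS-st-cluster=10.133.184.206
% asadmin create-system-properties --target server GMS-BIND-INTERFACE-ADDRESS-st-cluster=10.133.184.206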

[Analysis of Mahesh's machine issues:
on Mahesh's machines, I investigated with him; the ifconfig output on sb6-14 defined two network interfaces,
while the sb6-15..sb6-18 machines each defined only one network interface. The default network interface
on sb6-14 (the first IP address returned by Java's network interface enumeration) was not on the same subnet as
the single IP address defined on the other machines.]

For this issue, for some unknown reason, some of the instances are selecting network interfaces on
one subnet and the others are selecting the network interface on another subnet.

The GMS View submitted with this issue shows this clearly.
bigapp-oblade-10 is using IP address 10.5.216.156.
bigapp-oblade-9 is using IP address 10.133.184.205.

> The instances have a split GMS view. Instances on bigapp-oblade-10 only see that view :
> [#|2011-01-05T11:57:33.537-0800|INFO|glassfish3.1|ShoalLogger|_ThreadID=13;_ThreadName=Thread-1;|GMS1092: GMS View > Change Received for group: st-cluster : Members in view for JOINED_AND_READY_EVENT(before change analysis) are :
> 1: MemberId: instance101, MemberType: CORE, Address: 10.5.216.156:9182:228.9.45.208:7694:st-cluster:instance101
> 2: MemberId: instance103, MemberType: CORE, Address: 10.5.216.156:9193:228.9.45.208:7694:st-cluster:instance103
> 3: MemberId: instance105, MemberType: CORE, Address: 10.5.216.156:9175:228.9.45.208:7694:st-cluster:instance105
> 4: MemberId: server, MemberType: SPECTATOR, Address: 10.5.216.156:9105:228.9.45.208:7694:st-cluster:server
> |#]

> Instance on bigapp-oblade-9 see only the instances on that machines :
> [#|2011-01-05T11:57:38.396-0800|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1092: GMS View > Change Received for group: st-cluster : Members in view for JOINED_AND_READY_EVENT(before change analysis) are :
> 1: MemberId: instance102, MemberType: CORE, Address: 10.133.184.205:9119:228.9.45.208:7694:st-cluster:instance102
> 2: MemberId: instance104, MemberType: CORE, Address: 10.133.184.205:9157:228.9.45.208:7694:st-cluster:instance104
> 3: MemberId: instance106, MemberType: CORE, Address: 10.133.184.205:9098:228.9.45.208:7694:st-cluster:instance106
> |#]

These are definitely not on the same subnet.
The "asadmin validate-multicast" test must be run with the above network interfaces set
explicitly using the --bindinterface parameter (to replicate what is
occurring when the GlassFish clustered instances are started).

So on bigapp-oblade-10, run the following command:
% asadmin validate-multicast --bindinterface 10.5.216.156

Then on bigapp-oblade-9, run the following command:
% asadmin validate-multicast --bindinterface 10.133.184.205

The issue is probably that there are multiple network interfaces on one of the machines
and, by default, the network interface on the wrong subnet is getting selected. This is
probably due to the machines' network interfaces not being consistently defined on bigapp-oblade-9
and bigapp-oblade-10 (like Mahesh's gf-ha-dev-sb6-14 machine differing from gf-ha-dev-sb6-15).
The output from "ifconfig -a" on both machines will shed some light on this.
One could explicitly set the system property "GMS-BIND-INTERFACE-ADDRESS-st-cluster"
in each <server> element of the cluster, and in <config name="server"> for the DAS,
to the IP address of a network interface on the shared subnet.

By default, "asadmin validate-multicast" uses all network interfaces on a machine.
Thus, when asadmin validate-multicast was run without specifying the bind-interface addresses
being used by the app server clustered instances, it appeared to work, but it was not running under the
same conditions as the clustered instances.

Comment by sonymanuel [ 05/Jan/11 ]

Thanks, Joe, for the analysis. Disabled the additional network interface on both machines. The GMS view is now consistent.

[#|2011-01-05T17:59:18.889-0800|INFO|glassfish3.1|ShoalLogger|_ThreadID=12;_ThreadName=Thread-1;|GMS1092: GMS View Change Received for group: st-cluster : Members in view for JOINED_AND_READY_EVENT(before change analysis) are :
1: MemberId: instance101, MemberType: CORE, Address: 10.133.184.206:9152:228.9.26.79:30837:st-cluster:instance101
2: MemberId: instance102, MemberType: CORE, Address: 10.133.184.205:9091:228.9.26.79:30837:st-cluster:instance102
3: MemberId: instance103, MemberType: CORE, Address: 10.133.184.206:9128:228.9.26.79:30837:st-cluster:instance103
4: MemberId: instance104, MemberType: CORE, Address: 10.133.184.205:9157:228.9.26.79:30837:st-cluster:instance104
5: MemberId: instance105, MemberType: CORE, Address: 10.133.184.206:9200:228.9.26.79:30837:st-cluster:instance105
6: MemberId: instance106, MemberType: CORE, Address: 10.133.184.205:9197:228.9.26.79:30837:st-cluster:instance106
7: MemberId: server, MemberType: SPECTATOR, Address: 10.133.184.206:9100:228.9.26.79:30837:st-cluster:server

#]

[#|2011-01-05T17:59:18.890-0800|INFO|glassfish3.1|ShoalLogger|_ThreadID=12;_ThreadName=Thread-1;|GMS1016: Analyzing new membership snapshot received as part of event: JOINED_AND_READY_EVENT for member: instance105 of group: st-cluster|#]

Comment by Joe Fialli [ 12/Jan/11 ]

Tips for identifying a split GMS view and how to correct it.

After starting the cluster, ensure that all instances in the cluster are visible with
"asadmin get-health <cluster-name>". If all instances are listed, one has confirmed
that the default network configuration is sufficient for all members of the cluster and the DAS to see each other. If instances are missing from the listing, then one needs to look at the server logs on
the individual machines whose instances did not show up in the list. Those instances either did not start due to a failure, or they started but the default network configuration is inconsistent between the machines running the DAS and the clustered instances.

In the DAS server log (under glassfish3/glassfish/domains/<domain>/logs/server.log),
find the last "GMS View" entry in the file. This output should list all members that appear in
the "asadmin get-health <cluster-name>" output. However, it also lists the IP address of the
interface being used, for example:

MemberId: instance101, MemberType: CORE, \
Address: 10.133.184.206:9152:228.9.26.79:30837:st-cluster:instance101

The Address is composed of Interface IP Address : TCP Port : Multicast Group Address : Multicast Port,
i.e. 10.133.184.206, 9152, 228.9.26.79 and 30837 in the example above.

All instances in a cluster must have the same Multicast Group Address and Port.
Additionally, all instances in the cluster and the DAS must have Interface IP Addresses on a common
subnet. (For example, 10.133.184.205 and 10.133.184.206 with netmask 255.255.248.0 are both on the 10.133.184.0/21 subnet, while 10.5.216.156 with netmask 255.255.255.0 is not.)

After getting the address from the DAS, go to the log of an instance that belongs to the cluster
but is not showing up. There are two possibilities:

(1) The instance failed during startup and stopped; this will show up in its server.log.

Or
(2) the instance started but was not able to discover the other members of the cluster because its interface IP address is not on the same subnet. This instance's "GroupLeader" (search for the last
GroupLeader in the missing instance's server.log) will not be "server"; it will either be the instance itself
or another instance in the fragmented GMS dynamic cluster.

For this issue, running "ifconfig -a" on each machine showed that one machine had 2 network interfaces configured and the other machine involved had only one network interface configured.
The default network interface on the machine with 2 network interfaces was not on the same subnet as
the machine configured with only one network interface.

Fix:
Disable the network interface on the machine with 2 network interfaces that is not on the same subnet as the other machine.

OR
Configure ${GMS-BIND-INTERFACE-ADDRESS-<cluster-name>} to use an address on the common subnet across all machines running the cluster (including the DAS). See the existing documentation on how to do this correctly.
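
A consolidated sketch of the checks described above, using the cluster name and log location from this issue (actual paths and node directories will vary per installation):

# 1. Confirm all instances and the DAS can see each other
asadmin get-health st-cluster

# 2. In the DAS log, locate the last GMS view change and inspect the member
#    addresses that follow it (interface IP, TCP port, multicast address/port)
grep -n "GMS View Change" glassfish3/glassfish/domains/<domain>/logs/server.log | tail -1

# 3. On an instance that is missing from get-health, check whether it failed at
#    startup or formed its own fragment (its GroupLeader will not be "server")
grep -n "GroupLeader" <node-dir>/<instance-name>/logs/server.log | tail -1

# 4. Compare the network interfaces and subnets on each machine
ifconfig -a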





[GLASSFISH-15428] [STRESS] NoClassDefFoundError from Shoal observed on an instance's log when another instance goes down Created: 04/Jan/11  Updated: 05/Jan/11  Resolved: 05/Jan/11

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1_b35
Fix Version/s: None

Type: Bug Priority: Major
Reporter: varunrupela Assignee: Joe Fialli
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive jrockit_windows_gmssuspected_failure.zip     Text File server.log    
Issue Links:
Dependency
blocks GLASSFISH-15425 [STRESS][umbrella] 24x7 RichAccess ru... Open

 Description   

Please see parent issue http://java.net/jira/browse/GLASSFISH-15425 for details of the scenario that shows this bug.

The following log messages were observed in instance102 when instance103 went down:

**********
[#|2011-01-02T22:23:05.584+0530|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1092: GMS View Change Received for group: st-cluster : Members in view for IN_DOUBT_EVENT(before change analysis) are :
1: MemberId: instance101, MemberType: CORE, Address: 10.12.153.54:9195:228.9.37.37:31818:st-cluster:instance101
2: MemberId: instance102, MemberType: CORE, Address: 10.12.153.52:9147:228.9.37.37:31818:st-cluster:instance102
3: MemberId: instance103, MemberType: CORE, Address: 10.12.153.53:9110:228.9.37.37:31818:st-cluster:instance103
4: MemberId: server, MemberType: SPECTATOR, Address: 10.12.153.54:9103:228.9.37.37:31818:st-cluster:server

#]

[#|2011-01-02T22:23:05.584+0530|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1016: Analyzing new membership snapshot received as part of event: IN_DOUBT_EVENT for member: instance103 of group: st-cluster|#]

[#|2011-01-02T22:23:05.585+0530|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1007: Received FailureSuspectedEvent for member: instance103 of group: st-cluster|#]

[#|2011-01-02T22:23:05.586+0530|WARNING|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1090: handled exception processing event packet IN_DOUBT_EVENT from instance103|#]

[#|2011-01-02T22:23:05.586+0530|WARNING|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|stack trace
java.lang.NoClassDefFoundError:
at com.sun.enterprise.ee.cms.impl.base.ViewWindowImpl.addInDoubtMemberSignals(ViewWindowImpl.java:399)
at com.sun.enterprise.ee.cms.impl.base.ViewWindowImpl.analyzeViewChange(ViewWindowImpl.java:287)
at com.sun.enterprise.ee.cms.impl.base.ViewWindowImpl.newViewObserved(ViewWindowImpl.java:222)
at com.sun.enterprise.ee.cms.impl.base.ViewWindowImpl.run(ViewWindowImpl.java:193)
at java.lang.Thread.run(Thread.java:662)

#]

***********

The server log showing this message is attached.



 Comments   
Comment by Joe Fialli [ 04/Jan/11 ]

Here is the source code line where the NoClassDefFoundError occurred.

signals.add(new FailureSuspectedSignalImpl(token, member.getGroupName(), member.getStartTime()));

So the class that was not found was FailureSuspectedSignalImpl which is defined in shoal-gms-impl.jar.
2975 Fri Dec 03 09:11:38 EST 2010 com/sun/enterprise/ee/cms/impl/common/FailureSuspectedSignalImpl.class

There was no class-not-found error when reporting the Failure 2 seconds later, based on the following log event from server.log.
[#|2011-01-02T22:23:07.601+0530|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1016: Analyzing new membership snapshot received as part of event: FAILURE_EVENT for member: instance103 of group: st-cluster|#]

[#|2011-01-02T22:23:07.603+0530|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1019: member: instance103 of group: st-cluster has failed|#]

So we can assume that FailureNotificationSignalImpl was found in the same shoal-gms-impl.jar (unless the ClassNotFoundError is in the next server.log file).

dhcp-burlington9-3rd-a-east-10-152-23-224:impl jf39279$ jar tvf target/shoal-gms-impl.jar | grep FailureNotificationSignalImpl
3798 Fri Dec 03 09:11:38 EST 2010 com/sun/enterprise/ee/cms/impl/common/FailureNotificationSignalImpl.class

This does not look like a bug in Shoal GMS but rather a higher-level class-loading issue.
Potentially, one could see this type of behavior after a previous OutOfMemoryError had been
thrown. Looking through the attached server.log, there was no previous history of OutOfMemory,
but there was evidence of other class-loading issues. Specifically, the following log event occurred
5 times before the reported NoClassDefFoundError.

[#|2011-01-02T20:28:35.674+0530|SEVERE|glassfish3.1|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=22;_ThreadName=Thread-3;|Exception in thread "RMI RenewClean-[10.12.153.52:27688]" |#]

[#|2011-01-02T20:28:35.675+0530|SEVERE|glassfish3.1|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=22;_ThreadName=Thread-3;|java.lang.ClassFormatError: sun/reflect/GeneratedSerializationConstructorAccessor9015 : illegal JVM_CONSTANT_Methodref name: <init>
at sun.misc.Unsafe.defineClass(Native Method)
at sun.reflect.ClassDefiner.defineClass(ClassDefiner.java:45)
at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:381)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:377)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:95)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:313)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1327)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:437)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:413)
at java.io.ObjectStreamClass.lookup0(ObjectStreamClass.java:310)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java)

At this point, there is not enough info in this report to investigate this further.
It would be helpful to note whether any out-of-memory exceptions occurred in this instance's server.logs before this failure. (The other instance is presumably failing due to out of memory, but it is only this instance that is of interest.)
Once there is an out-of-memory in the server.log, all subsequent failures are suspect and may be the result of the system dealing with out-of-memory issues.

Comment by Joe Fialli [ 04/Jan/11 ]

I researched whether this could possibly be a missing OSGi export.

The Export-Package below is from the shoal-gms-impl.jar manifest. Both the referencing class ViewWindowImpl and
the not-found class FailureSuspectedSignalImpl are in the same jar: shoal-gms-impl.jar. The [1] rule below
represents the dependency of com.sun.enterprise.ee.cms.impl.base.ViewWindowImpl
on com.sun.enterprise.ee.cms.impl.common.FailureSuspectedSignalImpl.

From shoal-gms-impl.jar/META-INF/MANIFEST.MF
(carriage returns added below so each rule can be inspected):

Export-Package: com.sun.enterprise.ee.cms.impl.common;uses:="com.sun.e
nterprise.ee.cms.core,com.sun.enterprise.ee.cms.spi,com.sun.enterpris
e.ee.cms.impl.client,com.sun.enterprise.ee.cms.logging,com.sun.enterp
rise.ee.cms.impl.base",

com.sun.enterprise.ee.cms.impl.client;uses:="c
om.sun.enterprise.ee.cms.core,com.sun.enterprise.ee.cms.logging",

<!-- begin [1] -->
com.sun.enterprise.ee.cms.impl.base;uses:="com.sun.enterprise.ee.cms.logg
ing,com.sun.enterprise.ee.cms.core,com.sun.enterprise.ee.cms.impl.com
mon,com.sun.enterprise.ee.cms.spi,com.sun.enterprise.mgmt,com.sun.ent
erprise.mgmt.transport,com.sun.enterprise.mgmt.transport.grizzly",
<!-- end [1] -->

com .sun.enterprise.gms.tools,com.sun.enterprise.ee.cms.logging;uses:="su
n.security.action",
com.sun.enterprise.mgmt;uses:="com.sun.enterprise.
ee.cms.impl.base,com.sun.enterprise.ee.cms.logging,com.sun.enterprise
.ee.cms.core,com.sun.enterprise.mgmt.transport,com.sun.enterprise.ee.
cms.impl.common,com.sun.enterprise.ee.cms.impl.client,com.sun.enterpr
ise.ee.cms.spi",
com.sun.enterprise.mgmt.transport;uses:="com.sun.ente
rprise.ee.cms.impl.base,com.sun.enterprise.ee.cms.logging,com.sun.ent
erprise.ee.cms.core,com.sun.enterprise.ee.cms.impl.common,com.sun.ent
erprise.ee.cms.spi",

com.sun.enterprise.mgmt.transport.grizzly;uses:="
com.sun.grizzly,com.sun.grizzly.util,com.sun.grizzly.connectioncache.
server,com.sun.enterprise.mgmt.transport,com.sun.grizzly.connectionca
che.spi.transport,com.sun.enterprise.mgmt,com.sun.enterprise.ee.cms.i
mpl.common,com.sun.enterprise.ee.cms.impl.base,com.sun.grizzly.connec
tioncache.client,com.sun.grizzly.async,com.sun.grizzly.filter"

Additional info is needed to continue any further investigation of this issue from the group-management-service
point of view.

Comment by varunrupela [ 05/Jan/11 ]

We have started a re-run of this scenario with the latest build to collect more information on issue http://java.net/jira/browse/GLASSFISH-15426. Will also check whether this issue still exists.

Comment by Joe Fialli [ 05/Jan/11 ]

If the re-run has just restarted, it is possible to verify the Shoal GMS code path by doing the following.

Select one of the clustered instances to be killed using "kill -9".
Shoal GMS will generate the FailureSuspected notification 6 seconds later.
If this fails as reported at the beginning of the run, the failure is not stress related,
but would have something to do with running on Windows with JRockit. (No GMS QE tests
have been run with JRockit or on Windows at this time.)
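
A sketch of that quick check on a Unix-like host (on Windows the instance's java process would be ended from Task Manager instead); the log path here is a placeholder:

# identify the clustered instance's JVM and kill it hard
jps -lv          # find the pid of the chosen clustered instance
kill -9 <pid>

# about 6 seconds later the surviving members should log the suspicion
grep "GMS1007" <das-or-instance>/logs/server.log | tail -1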

Comment by Joe Fialli [ 05/Jan/11 ]

I have verified, running with jrockit-jdk1.6.0_20-R28.1.0-4.0.1 (from the public website; I do not know how to get the reported JRockit version, 1.6.0_22) on 32-bit Windows XP, that the reported NoClassDefFoundError exception does not occur. I have attached complete server logs of the run.

We are also performing a GMS QE run with JRockit on OEL.

Comment by Joe Fialli [ 05/Jan/11 ]

DAS and instance logs from a run on Windows XP with JRockit 1.6.0_20.
They illustrate that the FailureSuspectedSignalImpl NoClassDefFoundError did not occur in a simple
Shoal GMS run. Thus, the reported failure is suspected to be an effect of the stress run on class loading.

Comment by Joe Fialli [ 05/Jan/11 ]

Abbreviated server.log showing the JRockit version and FailureSuspected being reported.

Jan 5, 2011 9:53:10 AM com.sun.enterprise.admin.launcher.GFLauncherLogger info
INFO: JVM invocation command line:
C:\jrockit-jdk1.6.0_20-R28.1.0-4.0.1\bin\java.exe

<deleted>
[#|2011-01-05T09:54:17.625-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1092: GMS View Change Received for group: myCluster : Members in view for IN_DOUBT_EVENT(before change analysis) are :
1: MemberId: instance01, MemberType: CORE, Address: 129.148.6.186:9121:228.9.1.3:2231:myCluster:instance01
2: MemberId: instance02, MemberType: CORE, Address: 129.148.6.186:9130:228.9.1.3:2231:myCluster:instance02
3: MemberId: server, MemberType: SPECTATOR, Address: 129.148.6.186:9173:228.9.1.3:2231:myCluster:server

#]

[#|2011-01-05T09:54:17.625-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1016: Analyzing new membership snapshot received as part of event: IN_DOUBT_EVENT for member: instance01 of group: myCluster|#]

[#|2011-01-05T09:54:17.625-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1007: Received FailureSuspectedEvent for member: instance01 of group: myCluster|#]

[#|2011-01-05T09:54:17.625-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1005: Sending FailureSuspectedSignals to registered Actions. member: instance01 ...|#]

[#|2011-01-05T09:54:21.547-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1092: GMS View Change Received for group: myCluster : Members in view for FAILURE_EVENT(before change analysis) are :
1: MemberId: instance02, MemberType: CORE, Address: 129.148.6.186:9130:228.9.1.3:2231:myCluster:instance02
2: MemberId: server, MemberType: SPECTATOR, Address: 129.148.6.186:9173:228.9.1.3:2231:myCluster:server

#]

[#|2011-01-05T09:54:21.547-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1016: Analyzing new membership snapshot received as part of event: FAILURE_EVENT for member: instance01 of group: myCluster|#]

[#|2011-01-05T09:54:21.547-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1019: member: instance01 of group: myCluster has failed|#]

In the original report, the above sequence of log events included a NoClassDefFoundError for FailureSuspectedSignalImpl.
Here it does not occur, so the path is definitely working in the non-stress case.

Comment by Joe Fialli [ 05/Jan/11 ]

Unable to proceed on this as a group-management-service issue with the current information.

Previous comments on this issue document my attempts to recreate the reported failure in a
simplified Shoal GMS test on the Windows XP platform (32-bit) using JRockit 1.6.0_20. The
simple test demonstrates that the package exports for Shoal GMS are correct and working.
The attached logs illustrate that, for this simple run, the class FailureSuspectedSignalImpl
was found and that NoClassDefFoundError did not occur. At this time, there is no evidence that
this is a GMS issue; it looks more like a class-loading issue caused by the stress test.
The issue may have been caused by previous out-of-memory exceptions or some other bad conditions.
In simple terms, the Shoal GMS SuspectedFailure notification has been verified with JRockit on
both Windows and Linux. We have successfully run all Shoal GMS QE tests on OEL with JRockit 1.6.0_20,
with at least one run of each test. We have started 5 iterations of each scenario just to make sure.





[GLASSFISH-15417] "Port 2,048" formatting should be "port 2048" in validate-multicast command Created: 03/Jan/11  Updated: 21/Feb/11  Resolved: 21/Feb/11

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: None
Fix Version/s: 3.1_ms08

Type: Bug Priority: Trivial
Reporter: jclingan Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

The port number shouldn't have a comma; the formatting is inconsistent with Unix CLIs (netstat being a good example). The validate-multicast command output shows a port number with a comma:

$ asadmin validate-multicast
Will use port 2,048
Will use address 228.9.3.1
Will use bind interface null
Will use wait period 2,000 (in milliseconds)



 Comments   
Comment by Bobby Bissett [ 04/Jan/11 ]

Once this is integrated, I need to change the admin dev test to check this output more specifically. Right now, the only reason the test passes is that the params in the asadmin command are part of the output from the asadminWithOutput() method. Oops. I am adding this note here and watching the issue so I don't forget.





[GLASSFISH-15347] java.lang.OutOfMemoryError: Java heap space and other failures during Nile Book Store longevity run. Created: 25/Dec/10  Updated: 28/Dec/10  Resolved: 27/Dec/10

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1_b33
Fix Version/s: 3.1_b34

Type: Bug Priority: Blocker
Reporter: zorro Assignee: Mahesh Kannan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Solaris Sparc jdk 6u22


Issue Links:
Duplicate
is duplicated by GLASSFISH-15357 Command stop-cluster failed after 600... Resolved
Tags: 3_1-blocking

 Description   

b33 started 7-day longevity runs using the NileBookStore bigapp against a 3-node cluster on 4 SPARC Solaris machines.

Bug:
After a few hours of running, the following exceptions were thrown in large numbers.
Note: transactions intermittently succeed.

[#|2010-12-25T13:58:04.991-0800|SEVERE|oracle-glassfish3.1|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=16;_ThreadName=Thread-1;|java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:215)
at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
at java.nio.CharBuffer.toString(CharBuffer.java:1157)
at com.sun.enterprise.web.PEAccessLogValve.log(PEAccessLogValve.java:652)
at com.sun.enterprise.web.PEAccessLogValve.run(PEAccessLogValve.java:1122)
at java.lang.Thread.run(Thread.java:662)

#]

[#|2010-12-24T22:35:15.600-0800|WARNING|oracle-glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=16;_ThreadName=Thread-1;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]

[#|2010-12-24T22:35:15.900-0800|WARNING|oracle-glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=16;_ThreadName=Thread-1;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]

[#|2010-12-24T22:35:17.000-0800|WARNING|oracle-glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=16;_ThreadName=Thread-1;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]

[#|2010-12-24T22:35:18.530-0800|WARNING|oracle-glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=16;_ThreadName=Thread-1;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]

org.shoal.ha.cache.command.save|_ThreadID=16;_ThreadName=Thread-1;|Aborting command transmission for ReplicationFramePayloadCommand:1 because beforeTransmit returned false|#]
[#|2010-12-25T13:41:39.169-0800|WARNING|oracle-glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|Error during groupHandle.sendMessage(null, /NileBookStore; size=287193|#]

java.net.SocketException: Invalid argument
at sun.nio.ch.Net.setIntOption0(Native Method)
at sun.nio.ch.Net.setIntOption(Net.java:157)
at sun.nio.ch.SocketChannelImpl$1.setInt(SocketChannelImpl.java:406)
at sun.nio.ch.SocketOptsImpl.setBoolean(SocketOptsImpl.java:38)
at sun.nio.ch.SocketOptsImpl$IP$TCP.noDelay(SocketOptsImpl.java:284)
at sun.nio.ch.OptionAdaptor.setTcpNoDelay(OptionAdaptor.java:48)
at sun.nio.ch.SocketAdaptor.setTcpNoDelay(SocketAdaptor.java:268)
at com.sun.grizzly.http.SelectorThread.setSocketOptions(SelectorThread.java:1490)
at com.sun.grizzly.http.SelectorThreadHandler.configureChannel(SelectorThreadHandler.java:91)
at com.sun.grizzly.http.SelectorThreadHandler.onAcceptInterest(SelectorThreadHandler.java:102)
at com.sun.grizzly.SelectorHandlerRunner.handleSelectedKey(SelectorHandlerRunner.java:300)
at com.sun.grizzly.SelectorHandlerRunner.handleSelectedKeys(SelectorHandlerRunner.java:263)
at com.sun.grizzly.SelectorHandlerRunner.doSelect(SelectorHandlerRunner.java:200)
at com.sun.grizzly.SelectorHandlerRunner.run(SelectorHandlerRunner.java:132)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

#]

All logs:
http://aras2.us.oracle.com:8080/logs/gf31/gms/set_12_25_10_t_14_13_41/scenario_0001_Sat_Dec_25_14_14_09_PST_2010/

physical location:
/net/asqe-logs.us.oracle.com/export1/gms/gf31/gms/set_12_25_10_t_14_13_41/scenario_0001_Sat_Dec_25_14_14_09_PST_2010/



 Comments   
Comment by zorro [ 27/Dec/10 ]

The 7-day run against nightly build 33 stopped after 2 days with the failures stated above.
Stopping the cluster failed with:
asadmin stop-cluster clusterz1
No response from Domain Admin Server after 600 seconds.
The command is either taking too long to complete or the server has failed.
Please see the server log files for command status.
Command stop-cluster failed.

all logs:
http://aras2.us.oracle.com:8080/logs/gf31/gms/set_12_27_10_t_12_14_01/scenario_0001_Mon_Dec_27_12_31_14_PST_2010/

physical location.
/net/asqe-logs.us.oracle.com/export1/gms/gf31/gms/set_12_27_10_t_12_14_01/scenario_0001_Mon_Dec_27_12_31_14_PST_2010/

Comment by shreedhar_ganapathy [ 27/Dec/10 ]

Based on feedback from Rajiv and Sony, the issue seems to be the same as the one reported in 15231, which was seen in b33 and fixed in b34.

Also, the heap size for the Nile app should be -Xmx1024m, based on input from Sony from runs in prior releases. The domain.xml shows the run was set at 512m.
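
A sketch of making that heap change from the CLI, assuming the run's JVM options live in the cluster's configuration (named clusterz1-config here) and that the option value is quoted as required by the asadmin version in use:

% asadmin delete-jvm-options --target clusterz1-config "-Xmx512m"
% asadmin create-jvm-options --target clusterz1-config "-Xmx1024m"
% asadmin stop-cluster clusterz1
% asadmin start-cluster clusterz1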

Please run with b34 and if you see this issue, please reopen it.

Comment by Mahesh Kannan [ 27/Dec/10 ]

Closing this based on Shreedhar's comment





[GLASSFISH-15252] incorrect request to resend a GMS broadcast notification when an instance transitions from being master to not being master Created: 17/Dec/10  Updated: 21/Oct/11  Resolved: 21/Oct/11

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1_ms06
Fix Version/s: 3.1

Type: Bug Priority: Minor
Reporter: Joe Fialli Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File unableToFindResend-gf-ha-dev-1_domain.log    
Tags: 3_1-approved

 Description   

The failure impacts GMS QE tests that stop or kill the DAS; these are scenarios 8, 10 and 11.
The issue occurs in all runs (it is not intermittent).

Attached is the complete server log, but the following log events capture the issue.
They appear in every run of scenarios 8, 10 and 11.

The following messages occur at the end of the test, when the DAS is restarted so that "asadmin stop-cluster" can be run.
When "stop-cluster" is run, the DAS takes over group leadership again (otherwise there would be
cascading group leadership during the entire shutdown process). The instance that had been the master was
incorrectly requesting resends of messages that it had itself sent out to the group while it was the master.

Here are the log events that capture the requests to resend the supposedly missed events.

[#|2010-12-15T01:24:48.382-0800|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1093: adding GroupLeadershipNotification signal leadermember: server of group: clusterz1|#]

[#|2010-12-15T01:24:48.384-0800|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1057: Announcing Master Node designation for member: server of group: clusterz1. Local view contains 10 entries|#]

[#|2010-12-15T01:24:49.217-0800|INFO|glassfish3.1|ShoalLogger.mcast|_ThreadID=16;_ThreadName=Thread-1;|GMS1112: unable to find message to resend broadcast event with masterViewId: 21 to member: n1c1m1 of group: clusterz1|#]

[#|2010-12-15T01:24:49.218-0800|INFO|glassfish3.1|ShoalLogger.mcast|_ThreadID=16;_ThreadName=Thread-1;|GMS1112: unable to find message to resend broadcast event with masterViewId: 22 to member: n1c1m1 of group: clusterz1|#]

[#|2010-12-15T01:24:49.218-0800|INFO|glassfish3.1|ShoalLogger.mcast|_ThreadID=16;_ThreadName=Thread-1;|GMS1112: unable to find message to resend broadcast event with masterViewId: 23 to member: n1c1m1 of group: clusterz1|#]

[#|2010-12-15T01:24:49.219-0800|INFO|glassfish3.1|ShoalLogger.mcast|_ThreadID=16;_ThreadName=Thread-1;|GMS1112: unable to find message to resend broadcast event with masterViewId: 24 to member: n1c1m1 of group: clusterz1|#]

[#|2010-12-15T01:24:49.231-0800|INFO|glassfish3.1|ShoalLogger.mcast|_ThreadID=16;_ThreadName=Thread-1;|GMS1111: resend broadcast event with masterViewId: 25 to member: n1c1m1 of group: clusterz1 resends=1 broadcast seq id:25 viewChangeEvent:MASTER_CHANGE_EVENT member:server peerId:10.133.184.226:9116:228.9.32.97:5229:clusterz1:server|#]

[#|2010-12-15T01:24:50.394-0800|INFO|glassfish3.1|javax.enterprise.system.tools.admin.com.sun.enterprise.v3.admin.cluster|_ThreadID=16;_ThreadName=Thread-1;|Stopping cluster clusterz1|#]

Fix is already known. It is quite minor.



 Comments   
Comment by Joe Fialli [ 17/Dec/10 ]

How bad is its impact? (Severity)

The impact of this issue is unnecessary network traffic
when the DAS was not the Master and stop-cluster is called.
There are then 3 to 4 log events indicating that an instance requested
a resend of MasterChangeEvents that were not really missed.

*******
How often does it happen? Will many users see this problem? (Frequency)

For this issue, it is happening at stop-cluster time.
Based on the fix that is already known, there is a potential for
issues whenever the GroupLeader of the cluster changes due
to the current one being stopped or killed.

******

How much effort is required to fix it? (Cost)
Fix is already done. No further cost.

******

What is the risk of fixing it and how will the risk be mitigated? (Risk)
It is riskier not to fix this issue, since the fix is so straightforward.
There was an obvious bug in the code (no idea why), where a boolean return value was
obviously incorrect.

Comment by Chris Kasso [ 17/Dec/10 ]

Approved for 3.1

Comment by Joe Fialli [ 21/Oct/11 ]

This was fixed; I overlooked closing it. Verified that the reported message is no longer being generated in any
Shoal GlassFish QE test runs.





[GLASSFISH-14664] ability to configure GMS member to use SSL for p2p communication Created: 12/Nov/10  Updated: 19/Sep/14

Status: In Progress
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 4.1

Type: Improvement Priority: Critical
Reporter: Joe Fialli Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 14,664
Tags: 3_1-exclude, 3_1_1-scrubbed, 3_1_2-exclude

 Description   

See shoal issue 112 for details
http://java.net/jira/browse/SHOAL-112

Since the failover system uses GMS messaging to replicate session data
over the GMS-over-Grizzly TCP transport, it is desirable to have a means
of configuring this data to be transferred to the other members of the
cluster over a secured transport.



 Comments   
Comment by Joe Fialli [ 12/Nov/10 ]

Additional information:

replication of session data only takes place between clustered instances on the same
subnet, behind a firewall.





[GLASSFISH-14663] capability to configure authentication for GMS members Created: 12/Nov/10  Updated: 19/Sep/14

Status: In Progress
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 4.1

Type: Improvement Priority: Critical
Reporter: Joe Fialli Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 14,663
Tags: 3_1-exclude, 3_1_2-exclude, 3_2-exclude

 Description   

Each clustered application server instance is a GMS member. The Shoal Group Management Service (GMS) allows group members to dynamically locate each other via a common multicast address and port, OR via a virtual member list of IP addresses (when not relying on multicast). The goal of this issue is to authenticate that application servers trying to join as GMS members are allowed to join the cluster.

see details at shoal.dev.java.net issue 111
http://java.net/jira/browse/SHOAL-111






[GLASSFISH-14479] gms-specific task for upgrade test Created: 08/Nov/10  Updated: 30/Nov/10  Resolved: 08/Nov/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: not determined

Type: Task Priority: Critical
Reporter: Bobby Bissett Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All
URL: https://glassfish.dev.java.net/servlets/ReadMsg?list=dev&msgNo=19095


Issue Links:
Dependency
blocks GLASSFISH-14467 umbrella task for upgrade testing in ... Closed
Issuezilla Id: 14,479
Tags: 3_1-upgrade-task

 Description   

See parent issue 14467 for more details. To close this task, include a
description of what was tested in your area.



 Comments   
Comment by Bobby Bissett [ 08/Nov/10 ]

We've done the scenarios for these and have some GMS testing in the upgrade dev
test. Upgrading a cluster, I always check asadmin get-health as well.





[GLASSFISH-14298] GMS (hidden) command name inconsistencies Created: 29/Oct/10  Updated: 08/Dec/10  Resolved: 04/Nov/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms07

Type: Bug Priority: Minor
Reporter: Chris Kasso Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 14,298

 Description   

Command names, public and hidden, use dashes to separate words in the command
name. For example:

_get-activation-spec-class

The hidden commands _gmsAnnounceAfterStartClusterCommand,
_gmsAnnounceAfterStopClusterCommand, _gmsAnnounceBeforeStartClusterCommand and
_gmsAnnounceBeforeStopClusterCommand use camel case.



 Comments   
Comment by Bobby Bissett [ 04/Nov/10 ]

Fixed in revision 42455





[GLASSFISH-14148] NPE occurs in server log during creating cluster by admin console Created: 21/Oct/10  Updated: 25/Oct/10  Resolved: 25/Oct/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms06

Type: Bug Priority: Major
Reporter: jasonw401 Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Windows XP
Platform: All


Attachments: Zip Archive server_log.zip    
Issuezilla Id: 14,148

 Description   

A NullPointerException appears in the server log when creating a cluster via the admin
console. Steps to reproduce:
1. login to admin console and navigate to [Cluster] page
2. Click [New] button and enter Cluster name C1
3. Click [New] button and enter instance name i1
4. Click [OK] button
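
For reference, an equivalent CLI sequence (a sketch; the report used the admin console, and the instance here is created on the DAS host's default local node):

% asadmin create-cluster C1
% asadmin create-local-instance --cluster C1 i1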

The new cluster is created, but the following exception appears in the server log:

[#|2010-10-22T15:05:06.921+1100|INFO|glassfish3.1|ShoalLogger|_ThreadID=15;_ThreadName=Thread-1;|GMS1099: GMS:Reporting Joined and Ready state to group C1|#]

[#|2010-10-22T15:05:06.921+1100|SEVERE|glassfish3.1|javax.enterprise.system.core.org.glassfish.gms.bootstrap|_ThreadID=15;_ThreadName=Thread-1;|Exception while processing config bean changes :
java.lang.NullPointerException
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.reportJoinedAndReadyState(GroupCommunicationProviderImpl.java:446)
at com.sun.enterprise.ee.cms.impl.common.GroupManagementServiceImpl.reportJoinedAndReadyState(GroupManagementServiceImpl.java:478)
at com.sun.enterprise.ee.cms.impl.common.GroupManagementServiceImpl.reportJoinedAndReadyState(GroupManagementServiceImpl.java:460)
at org.glassfish.gms.bootstrap.GMSAdapterService$1.changed(GMSAdapterService.java:246)
at org.jvnet.hk2.config.ConfigSupport.sortAndDispatch(ConfigSupport.java:286)
at org.glassfish.gms.bootstrap.GMSAdapterService.changed(GMSAdapterService.java:236)
at org.jvnet.hk2.config.Transactions$ConfigListenerJob.process(Transactions.java:376)
at org.jvnet.hk2.config.Transactions$ConfigListenerJob.process(Transactions.java:366)
at org.jvnet.hk2.config.Transactions$ConfigListenerNotifier$1$1.call(Transactions.java:256)
at org.jvnet.hk2.config.Transactions$ConfigListenerNotifier$1$1.call(Transactions.java:254)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

#]

[#|2010-10-22T15:05:16.890+1100|INFO|glassfish3.1|null|_ThreadID=15;_ThreadName=Thread-1;|Using DAS host FASTN9154 and port 4848 from existing das.properties for node|#]

[#|2010-10-22T15:05:16.890+1100|INFO|glassfish3.1|null|_ThreadID=15;_ThreadName=Thread-1;|localhost. To use a different DAS, create a new node using create-node-ssh or|#]

[#|2010-10-22T15:05:16.890+1100|INFO|glassfish3.1|null|_ThreadID=15;_ThreadName=Thread-1;|create-node-config. Create the instance with the new node and correct|#]

[#|2010-10-22T15:05:16.890+1100|INFO|glassfish3.1|null|_ThreadID=15;_ThreadName=Thread-1;|host and port:|#]

[#|2010-10-22T15:05:16.890+1100|INFO|glassfish3.1|null|_ThreadID=15;_ThreadName=Thread-1;|asadmin --host das_host --port das_port create-local-instance --node node_name instance_name.|#]

[#|2010-10-22T15:05:16.906+1100|INFO|glassfish3.1|null|_ThreadID=15;_ThreadName=Thread-1;|Command _create-instance-filesystem executed successfully.|#]

[#|2010-10-22T15:05:16.968+1100|INFO|glassfish3.1|javax.enterprise.system.tools.admin.com.sun.enterprise.v3.admin.cluster|_ThreadID=15;_ThreadName=Thread-1;|Using DAS host FASTN9154 and port 4848 from existing das.properties for node localhost. To use a different DAS, create a new node using create-node-ssh or create-node-config. Create the instance with the new node and correct host and port:
asadmin --host das_host --port das_port create-local-instance --node node_name instance_name.
Command _create-instance-filesystem executed successfully.|#]



 Comments   
Comment by Anissa Lam [ 21/Oct/10 ]

Stack trace shows the NPE is from GMS

java.lang.NullPointerException
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.reportJoinedAndReadyState(GroupCommunicationProviderImpl.java:446)

Reassigning to GMS for initial evaluation.

Comment by Anissa Lam [ 21/Oct/10 ]

Oops, I didn't mean to change the summary. Let me put it back to the original.

Comment by Joe Fialli [ 22/Oct/10 ]

Still unconfirmed. But I don't doubt that there are conditions that
can cause this to happen; they are just not all represented in the steps to
recreate the bug.

The steps provided do not recreate this reported failure.
I was not able to use last night's nightly (it did not work for me),
so I attempted to recreate this in a local workspace and tried 3 or 4 times to
recreate the failure, unsuccessfully.

I find what one would expect in the server.log for the successful creation of
the cluster and the DAS joining that cluster (as the first member).
The following log messages would not all have printed out in the DAS if
there had been a failure.

[#|2010-10-22T10:29:49.492-0400|INFO|glassfish3.1|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=16;_ThreadName=Thread-1;|GMSAD1004:
Started GMS for instance server in group C1|#]

[#|2010-10-22T10:29:49.493-0400|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1099:
GMS:Reporting Joined and Ready state to group C1|#]

[#|2010-10-22T10:29:49.494-0400|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1040:
Calling reportMyState() with READY...|#]

[#|2010-10-22T10:29:49.497-0400|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1092:
GMS View Change Received for group C1 : Members in view for
JOINED_AND_READY_EVENT(before change analysis) are :
1: MemberId: server, MemberType: SPECTATOR, Address:
10.152.23.224:9115:228.9.218.140:13953:C1:server

#]

[#|2010-10-22T10:29:49.497-0400|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1016:
Analyzing new membership snapshot received as part of event :
JOINED_AND_READY_EVENT for Member: server of Group: C1|#]

Given that I cannot recreate the failure, can you attach a server.log and
the domain.xml of the DAS? I am assuming the reported NPE was in the DAS.
I believe that there is something in your config such that the DAS is not able to
successfully join the GMS group.

*****

I do have one other question. Is it important for recreating the issue to
create instance "i1"? I did follow all steps, but the code in question that
had the NPE only runs on the DAS.

Comment by jasonw401 [ 23/Oct/10 ]

I have checked this issue again. Unfortunately, I cannot reproduce this NPE
either after restarting the DAS. I don't think creating instance i1 is important
for reproducing this issue, but now I remember something about it. The DAS I was
testing had been running for a couple of days. When I checked the server log, I
found the following messages:

GMS1092: GMS View Change Received for group C1 : Members in view for
IN_DOUBT_EVENT..
...
GMS1019: The following member has failed: i1 of Group: C1...

These messages remind me that a couple of days ago I was testing cluster
reliability by killing the java process of a cluster instance. So I think this
could be related to this issue. I have tried killing the cluster instance and
creating a new cluster again, but I still cannot reproduce the NPE.

I have attached the server log and domain.xml. Please refer to the following
timestamps:

[#|2010-10-21T18:41:38.843+1100 ---> this is the time when I killed the cluster
instance. After that I deleted all the cluster configurations.

[#|2010-10-22T14:43:03.625+1100 ---> this is the time that the NPE occurred for
the first time.

Comment by jasonw401 [ 23/Oct/10 ]

Created an attachment (id=5205)
server log and domain.xml

Comment by Joe Fialli [ 25/Oct/10 ]

I cannot figure out how to recreate the state that caused this failure, but we
can fix it so the NPE does not happen.

Here are the key log messages in this issue. The first log event, concerning the
failure to join group C1,
is the more interesting one to figure out, but there is not enough info available in the log to
determine what happened.

[#|2010-10-22T14:43:03.625+1100|WARNING|glassfish3.1|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=15;_ThreadName=Thread-1;|GMSAD1008: GMSException occurred : failed to join group C1|#]

[#|2010-10-22T14:43:03.625+1100|INFO|glassfish3.1|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=15;_ThreadName=Thread-1;|GMSAD1004: Started GMS for instance server in group C1|#]

[#|2010-10-22T14:43:03.625+1100|INFO|glassfish3.1|ShoalLogger|_ThreadID=15;_ThreadName=Thread-1;|GMS1099: GMS:Reporting Joined and Ready state to group C1|#]

[#|2010-10-22T14:43:03.625+1100|SEVERE|glassfish3.1|javax.enterprise.system.core.org.glassfish.gms.bootstrap|_ThreadID=15;_ThreadName=Thread-1;|Exception while processing config bean changes :
java.lang.NullPointerException
at com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.reportJoinedAndReadyState(GroupCommunicationProviderImpl.java:446)

When the DAS fails to join the cluster, no other operations should be attempted.

Found a bug in GMSAdapterService that was disregarding the fact that the instance could
not join the cluster. Will fix that issue and close this one.

Comment by Joe Fialli [ 25/Oct/10 ]

fix committed.





[GLASSFISH-14077] [PERF] investigate performance bottleneck in GMS deserialization in trade2 benchmark Created: 19/Oct/10  Updated: 07/Dec/10  Resolved: 07/Dec/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms07

Type: Bug Priority: Major
Reporter: Joe Fialli Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File gmsstats.txt    
Issue Links:
Dependency
blocks GLASSFISH-13582 [PERF] Really huge regression with tr... Closed
Issuezilla Id: 14,077

 Description   

With availability-enabled set to true, running the trade2 benchmark with
profiling info being collected, it was identified that GMS deserialization was
taking a substantial amount of time.

Given the buffering of replicated data that HA is using, it is believed that the
message size is around 80K. Given that the GMS over Grizzly transport relies on a
composite buffer (a virtual buffer that spans multiple Grizzly physical buffers),
there is potential that improvements there would help this issue.

Alexey will assist in improving composite buffer management. The GMS team will also
work on a stats monitor so we know precisely what the usage characteristics are
and can track where the time is going during deserialization in this benchmark.



 Comments   
Comment by Joe Fialli [ 19/Oct/10 ]

Stack trace from trade2 that shows the composite object get (ByteBuffersBuffer.get())
occurring during NetworkUtility.deserialize().
The theory is that deserialization is waiting for virtual composite object
construction. During profiling, the network latency shows up
as part of deserialization, since deserialization is occurring in the virtual buffer
(that is accumulating multiple Grizzly buffers).

"GMS-GrizzlyControllerThreadPool-Group-trade2(72)" daemon prio=10 tid=0x00002aabdca6a800 nid=0x6093 runnable [0x0000000044fc9000]
java.lang.Thread.State: RUNNABLE
at com.sun.enterprise.mgmt.transport.ByteBuffersBuffer.get(ByteBuffersBuffer.java:56)
at com.sun.enterprise.mgmt.transport.BufferInputStream.read(BufferInputStream.java:70)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2266)
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2279)
at java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:3019)
at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2820)
at java.io.ObjectInputStream.readUTF(ObjectInputStream.java:1051)
at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:616)
at java.io.ObjectInputStream.readClassDescriptor(ObjectInputStream.java:809)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1565)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1316)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1947)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
at com.sun.enterprise.mgmt.transport.NetworkUtility.deserialize(NetworkUtility.java:435)
at com.sun.enterprise.mgmt.transport.MessageImpl.readMessagesInputStream(MessageImpl.java:277)
at com.sun.enterprise.mgmt.transport.MessageImpl.parseMessage(MessageImpl.java:265)
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyMessageProtocolParser.hasNextMessage(GrizzlyMessageProtocolParser.java:229)
at com.sun.grizzly.filter.ParserProtocolFilter.execute(ParserProtocolFilter.java:145)
at com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:137)
at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:104)
at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:90)
at com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:54)
at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:59)
at com.sun.grizzly.ContextTask.run(ContextTask.java:71)
at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:532)
at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:513)
at java.lang.Thread.run(Thread.java:619)

Comment by Joe Fialli [ 22/Oct/10 ]

Created an attachment (id=5200)
GMS send/receive message monitoring stats per targetcomponent for trade2

Comment by Joe Fialli [ 22/Oct/10 ]

Received three patches from Alexey to address this.

First patch was to incorporate performance improvements made to
ByteBuffersBuffer from Grizzly 2.0.

Second patch was to ensure workbuffer resets itself properly after errors.

The third patch ensured that when a selector fired for OP_WRITE, the ParserFilter
simply returned. Alexey observed in the trade2 logs that there were concurrent
READs and WRITEs of the same buffer.
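
For illustration, a hedged sketch of the kind of guard described for the third patch, written against plain java.nio (this is generic NIO code, not the actual Grizzly filter):

— begin sketch (java) —
import java.nio.channels.SelectionKey;

// Hedged sketch: report whether a selection key fired only for writability, so a
// read-side parser can return early instead of touching a buffer that is being written.
public final class OpWriteGuard {
    private OpWriteGuard() {
    }

    public static boolean isWriteOnly(SelectionKey key) {
        return key.isValid() && (key.readyOps() & ~SelectionKey.OP_WRITE) == 0;
    }
}
— end —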

Also, GMS monitoring stats taken every 30 seconds over a 10 minute run show that
there is no issue when no interfering task (such as profiling) is running. There
does look to be an issue with deserialization recovery that still needs work.

See the attached gmsstats.txt, which demonstrates no bottleneck in gms send or
receive when no heavy-duty monitoring or logging is occurring.

Comment by Joe Fialli [ 25/Oct/10 ]

A fix for deserialization failure recovery when large messages are sent is
still being worked on.

In the absence of overhead caused by general Java profiling, jmap heap dumps, or
an excessively verbose logging level (GrizzlyMessageProtocolParser debug), we are
not aware of a bottleneck in gms send/receive based on the numbers in gmsstats.txt
attached to this issue. Those stats were taken from a 10 minute trade2 run and
showed no gms message send write timeouts (nor any send-side times greater than
120 ms wall clock).

Comment by Joe Fialli [ 15/Nov/10 ]

Alexey provided a grizzly/shoal-gms fix, so we no longer see deserialization
failures receiving gms messages nor write timeouts sending gms messages when
profiling the trade2 run.

We ran jprofile on a trade2 run and additionally did another run taking
jmap -histo/jstack every 2 minutes. There were no write timeouts or
deserialization failures in server.log.

Will close this issue with next shoal integration.

Comment by Joe Fialli [ 03/Dec/10 ]

There are no identified bottlenecks at this time.
However, we have been blocked from getting a successful trade2 profiling run
by other outstanding issues.

Leaving this open, but all of the previously reported issues have been addressed.
This can be closed after a successful profiling run confirms there are no
outstanding bottlenecks in the gms subsystem.

Comment by Joe Fialli [ 07/Dec/10 ]

Closing this since the initial issue was only observable when profiling or other extra monitoring tools slowed the system so much that GMS started timing out during writes. This resulted in stream corruption and deserialization
WARNING messages. That is now corrected in gms over grizzly, so this issue is being closed for the time being. It can be reopened if future profiling shows any issues with GMS deserialization.





[GLASSFISH-14030] [Shoal] Error needs to be logged when Backing Store is not available on a remote instance Created: 17/Oct/10  Updated: 03/Dec/10  Resolved: 03/Dec/10

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms07

Type: Bug Priority: Major
Reporter: varunrupela Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File gmsstats.txt    
Issuezilla Id: 14,030

 Description   

When a Backing Store is not available on a remote instance to replicate the
data, Shoal must log an exception/error message when it tries to use the remote
Backing Store (for save or remove), to indicate that replication/failover will
not work.

Issues 13945 and 14029 discuss the scenario when remote Backing Stores are not
available.



 Comments   
Comment by Mahesh Kannan [ 18/Oct/10 ]

Will talk to the GMS team about this. Not sure if the WARNING/Exception can be
logged when BackingStores are created, but certainly the absence of a
BackingStore can be detected during gms.sendMessage.

Will update the issue once we understand what can be done here.

Comment by Mahesh Kannan [ 21/Oct/10 ]

Assigning this to the GMS team to log a WARNING message when a message arrives
for a non-existent targetToken.

Comment by Joe Fialli [ 22/Oct/10 ]

Will log a message the first time a message arrives for a targetComponent with
no MessageActionFactory registered in a CORE member.

Comment by Joe Fialli [ 22/Oct/10 ]

Created an attachment (id=5201)
GMS send/receive message monitoring stats per targetcomponent for trade2

Comment by Joe Fialli [ 03/Dec/10 ]

Fixed.

Here is the log event that notifies that a message arrived at a GMS CORE member (all glassfish clustered instances
are CORE except the DAS) and there is no handler registered to process it. The target component is the
name of the ReplicationBackingStore type (the web application name for web sessions, the ejb class name for ejb, ...).

GMS1116: unable to deliver message to a non-existent target component {0} for group: {1}. Note: only reported for first message.
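
A minimal sketch of the "only reported for first message" behavior described above (the class and field names here are hypothetical, not the actual Shoal implementation):

— begin sketch (java) —
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

// Hypothetical sketch: emit the GMS1116 warning only the first time a message
// arrives for a target component that has no registered handler.
public class MissingTargetWarner {
    private static final Logger LOG = Logger.getLogger("ShoalLogger");
    private final Set<String> alreadyReported =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    public void onUndeliverableMessage(String targetComponent, String group) {
        if (alreadyReported.add(targetComponent)) {
            LOG.warning("GMS1116: unable to deliver message to a non-existent target component "
                    + targetComponent + " for group: " + group
                    + ". Note: only reported for first message.");
        }
    }
}
— end —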





[GLASSFISH-14006] reporting invalid gms Bind Interface Address Created: 15/Oct/10  Updated: 04/Nov/10  Resolved: 04/Nov/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms07

Type: Bug Priority: Major
Reporter: Joe Fialli Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 14,006

 Description   

Glassfish gms placeholder for shoal gms issue 108

https://shoal.dev.java.net/issues/show_bug.cgi?id=108

The same method that is used in shoal gms should be used in GMSAdapterImpl
when it processes the gms-bind-interface-address value. If the
provided IP address is not valid, a WARNING will be emitted stating that the
value for gms-bind-interface-address is not valid and will not be used, and the
default means of selecting an IP address will be used.
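
As an illustration only, a hedged sketch of that kind of validation using standard java.net calls (the class, method names, and fallback behavior are assumptions; the actual Shoal validator may differ):

— begin sketch (java) —
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.logging.Logger;

// Hypothetical sketch: accept the configured gms-bind-interface-address only if it
// resolves to an address bound to a local network interface; otherwise warn and
// fall back to the default address-selection logic.
public class BindAddressValidator {
    private static final Logger LOG = Logger.getLogger("ShoalLogger");

    public static boolean isUsableLocalAddress(String configured) {
        try {
            InetAddress addr = InetAddress.getByName(configured);
            return NetworkInterface.getByInetAddress(addr) != null;
        } catch (Exception e) {
            return false;
        }
    }

    public static String chooseBindAddress(String configured, String defaultAddress) {
        if (configured != null && isUsableLocalAddress(configured)) {
            return configured;
        }
        LOG.warning("gms-bind-interface-address " + configured
                + " is not a valid local address and will not be used; "
                + "falling back to default address selection.");
        return defaultAddress;
    }
}
— end —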



 Comments   
Comment by Bobby Bissett [ 27/Oct/10 ]

This is fixed in Shoal version 1.5.21-SNAPSHOT. Will add the code to GF to call
the Shoal address validator and will commit when we integrate 1.5.21.

Comment by Bobby Bissett [ 04/Nov/10 ]

Fixed in revision 42455





[GLASSFISH-13972] validate-multicast: asarch review comments Created: 13/Oct/10  Updated: 15/Oct/10  Resolved: 15/Oct/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms06

Type: Bug Priority: Major
Reporter: Tom Mueller Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 13,972

 Description   

These are comments from the ASArch review of asadmin commands that are new to 3.1
on 10/13/2010.

  • if bindinterface is an IP address, change option to --bindaddress (also change
    in create-cluster to match)
  • add -v short option for verbose
  • remove System.exit()


 Comments   
Comment by Bobby Bissett [ 14/Oct/10 ]

I've made all of these changes, along with returning a usable exit code so that
the command fails when multicast is not available (thank you VPN for not
permitting multicast!):

— begin output —
Listening for data...
Sending message with content
"dhcp-whq-twvpn-1-vpnpool-10-159-226-178.vpn.oracle.com" every 2,000 milliseconds
Exiting after 3 seconds. To change this timeout, use the --timeout command line
option.
Received no multicast data
Command validate-multicast failed.
— end —

When we do the next Shoal integration, I'll commit the GlassFish changes as well
and resolve the issue.

Comment by Bobby Bissett [ 14/Oct/10 ]

The fixes on the Shoal side have been committed. We'll do a Shoal promotion
today and I can integrate the GlassFish changes.

hostname% svn commit
Sending
impl/src/main/java/com/sun/enterprise/gms/tools/MultiCastReceiverThread.java
Sending
impl/src/main/java/com/sun/enterprise/gms/tools/MulticastSenderThread.java
Sending impl/src/main/java/com/sun/enterprise/gms/tools/MulticastTester.java
Sending
impl/src/main/resources/com/sun/enterprise/gms/tools/LocalStrings.properties
Transmitting file data ....
Committed revision 1299.

Comment by Bobby Bissett [ 14/Oct/10 ]

Fixed in revision 41736.

Sending
admin/config-api/src/main/java/com/sun/enterprise/config/serverbeans/Cluster.java
Sending
cluster/gms-adapter/src/main/java/org/glassfish/gms/admin/ValidateMulticastCommand.java
Sending packager/resources/pkg_conf.py
Sending pom.xml

Comment by Bobby Bissett [ 15/Oct/10 ]

Oops. Typo in last commit left the params not actually matching. Now it's
really fixed.

hostname% svn commit
Sending
admin/config-api/src/main/java/com/sun/enterprise/config/serverbeans/Cluster.java
Transmitting file data .
Committed revision 41765.





[GLASSFISH-13530] GMS_LISTENER port conflict when start-cluster of multiple instances on 1 machine Created: 17/Sep/10  Updated: 20/Feb/11  Resolved: 21/Sep/10

Status: Closed
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: V3

Type: Bug Priority: Critical
Reporter: gopaljorapur Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: Sun


Attachments: Text File server.log.das     Text File server.log.instance101     Text File server.log.instance102     Text File server.log.instance103    
Issuezilla Id: 13,530
Tags: 3_1-verified

 Description   

I have a 3-instance cluster, with all instances (instance101, instance102, instance103)
on the same box.

One of the instances (instance102) is not detected by GMS.

All logs attached.



 Comments   
Comment by gopaljorapur [ 17/Sep/10 ]

Created an attachment (id=4915)
Server log of DAS

Comment by gopaljorapur [ 17/Sep/10 ]

Created an attachment (id=4916)
Instance log

Comment by gopaljorapur [ 17/Sep/10 ]

Created an attachment (id=4917)
instance log

Comment by gopaljorapur [ 17/Sep/10 ]

Created an attachment (id=4918)
i

Comment by Bobby Bissett [ 17/Sep/10 ]

Looking at the instance logs, it looks like you're using the same TCP port range
used by Grizzly for all instances. Sometimes this will work, and other times
(like this case) it won't. Instances 1 and 2 have both tried to use port 9091 for
communication, and so instance 2 is blocked from being able to communicate with
GMS. This is only a problem when you run more than one instance on the same machine.

You could recreate the instances and specify the port ranges to make sure there
are no conflicts. But Joe has made changes in GMS so that you don't have to, so
you might as well try with the new GMS jars rather than change the way you're
setting up your system. I'm trying to promote a new version of Shoal now – if I
can, I'll send a link to the new Shoal bits in Maven. Currently I can't get
Hudson to respond to me. If this keeps up, I'll just attach a temporary
shoal-gms-impl jar to the bug report (but I'd rather point to a more official one).

One very simple workaround, in the mean time, is to start each instance
individually rather than using asadmin start-cluster. That should avoid the
concurrent-port-grabbing issue.

Comment by Joe Fialli [ 20/Sep/10 ]

Looking at the submitted logs, I confirmed that GMS_LISTENER_PORT-clustername is
not being set on each clustered server instance. Thus, the DAS and all clustered
instances are using the default Shoal gms port range of 9090 to 9120.

The reason this issue just started happening in glassfish v3.1 is that "asadmin
start-cluster" recently changed to start all clustered instances at the same time
instead of serially.
This issue is already fixed in shoal gms workspace. So when the next shoal-gms
integration occurs, one will not have to do the workaround described below.
Before the integration occurs, below is the workaround so you will not
be blocked anymore.

Simplest workaround is to not use "asadmin start-cluster" and simply use
"asadmin start-instance" and start clustered instances serially rather than
concurrently.

A workaround that enables use of "asadmin start-cluster":
with the current implementation, when running multiple clustered instances on one
machine, one must set GMS_LISTENER_PORT-<clustername>. Here is a script showing
how to do this in the most convenient manner possible.

$GF_HOME/bin/asadmin create-domain --nopassword=true mydomain
$GF_HOME/bin/asadmin start-domain mydomain
$GF_HOME/bin/asadmin create-cluster myCluster

# need to set unique GMS_LISTENER_PORT when running multiple instances on same
# machine. When instances are all started at once, there was a bug in shoal gms
# that many instances will try to use the same first port in the default range.
# commonly DAS uses default port 9090 and the failure to start is over
# contention for port 9091
# no need to set GMS_LISTENER_PORT when running one instance on each machine
# (includes DAS running on its own machine)
$GF_HOME/bin/asadmin create-instance --node localhost --cluster myCluster --systemproperties "GMS_LISTENER_PORT-myCluster=9491" instance1
$GF_HOME/bin/asadmin create-instance --node localhost --cluster myCluster --systemproperties "GMS_LISTENER_PORT-myCluster=9492" instance2
$GF_HOME/bin/asadmin create-instance --node localhost --cluster myCluster --systemproperties "GMS_LISTENER_PORT-myCluster=9493" instance3
$GF_HOME/bin/asadmin start-cluster myCluster

Comment by Joe Fialli [ 20/Sep/10 ]

Removed blocking since there are multiple workarounds available
for one to proceed.

As soon as next shoal-gms integration occurs, this will be marked fixed.

We have already performed extensive testing with latest shoal-gms jar
and confirmed that one will no longer need to set GMS_LISTENER_PORT-clustername
when running multiple instances on one machine.

Comment by Joe Fialli [ 20/Sep/10 ]

Altered subject to describe what is occurring.

GMS did not detect an instance since the instance failed to start with this
SEVERE warning.

[#|2010-09-17T14:03:52.579-0700|SEVERE|glassfish3.1|ShoalLogger|_ThreadID=15;_ThreadName=Thread-1;|Exception
during starting the controller
java.net.BindException: No free port within range:
9091=com.sun.grizzly.ReusableTCPSelectorHandler@40a47f
at com.sun.grizzly.TCPSelectorHandler.initSelector(TCPSelectorHandler.java:430)
at com.sun.grizzly.TCPSelectorHandler.preSelect(TCPSelectorHandler.java:376)
at com.sun.grizzly.SelectorHandlerRunner.doSelect(SelectorHandlerRunner.java:186)
at com.sun.grizzly.SelectorHandlerRunner.run(SelectorHandlerRunner.java:130)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

#]

[#|2010-09-17T14:03:52.581-0700|WARNING|glassfish3.1|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=15;_ThreadName=Thread-1;|GMSAD1008:
GMSException occurred : failed to join group st-cluster|#]

Comment by Joe Fialli [ 21/Sep/10 ]

shoal-gms with fix for this issue confirmed to be integrated in
b21.





[GLASSFISH-13212] cluster nodes receive inconsistent notifications 'failure' and 'joined and ready' Created: 31/Aug/10  Updated: 31/Aug/10  Resolved: 31/Aug/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: not determined

Type: Bug Priority: Major
Reporter: zorro Assignee: Joe Fialli
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: Linux


Issuezilla Id: 13,212

 Description   

build: ogs-3.1-web-b18-08_30_2010

  • start DAS
  • wait for DAS to start
  • start cluster (9 CORE instances on 9 machines)
  • wait for all cluster instances to start
  • kill n1c1m4
  • wait 20 seconds
  • restart n1c1m4
  • wait 5 seconds
  • stop cluster
  • wait for all cluster CORE nodes to stop
  • stop DAS
  • wait for DAS to stop
  • collect logs

bug:
Node 9 got 'joined and ready' notification whereas others got failure notifications.

Expected:
'joined and ready' and 'failure notification' should be mutually exclusive for
nodes of a cluster.
Otherwise, nodes receiving different notifications could take different business
actions and take the system into an inconsistent state.

Despite allowing 15 seconds after starting the cluster and after stopping the
cluster, failure and appointed notifications are not seen in some cluster instances.
logs:
http://aras2.sfbay.sun.com:8080/testresults/export1/gms/gf31/gms//set_08_31_10_t_11_15_02/scenario_0003_Tue_Aug_31_11_38_25_PDT_2010.html



 Comments   
Comment by Joe Fialli [ 31/Aug/10 ]

It is difficult to analyze this issue due to significant time skew between
instances in the cluster.

The following information is extracted from following file:
http://aras2.sfbay.sun.com:8080/testresults/export1/gms/gf31/gms//set_08_31_10_t_11_15_02/scenario_0003_Tue_Aug_31_11_38_25_PDT_2010.html

Here is the REJOIN event from node 9 that was reported.

[#|2010-08-31T11:41:09.552-0700|WARNING|oracle-glassfish3.1|ShoalLogger|_ThreadID=15;_ThreadName=Thread-1;|Instance
n1c1m4 was restarted at 11:41:58 AM PDT on Aug 31, 2010.|#]

The above log event from node 9 is stating that instance "n1c1m4" was restarted
49 seconds in the future. The current time on node 9 is 11:41:09, yet instance
n1c1m4 reports that it was restarted at 11:41:58, which is in node 9's future.
There is at least a 49 second clock skew between node9 and the machine
running n1c1m4. While we should be able to handle such a case, these are not
ideal conditions under which to investigate this issue. Given that the test
timing uses waits at a granularity of 15 and 20 seconds, the time skew should
not be so large in these initial test runs (unless we have a test scenario that
specifically tests how GMS fares when the clustered instances have significant
time skew between them).

Examining time skew across clustered instances based on a common event.
The FailureSuspected event is handled in node9 at time 11:40:50 and it
is received in node8 at time 11:41:17.211, representing a skew of approximately
27 seconds between these instances. The FailureSuspected event was sent by
DAS at time 11:41:36.674 and received in node9 at 11:40:50, representing a skew
of 46 seconds between master and node9.

I would like to propose that this test run be considered invalid due to the
significant time skew between machines in the cluster, which is not the
functionality being tested. At this point, there is no way to infer whether the
time skew impacted the test or not, but it is a variable better off being
eliminated from initial test runs.





[GLASSFISH-13209] failure in getting failure/appointed notifications in b18-08_30_2010 Created: 31/Aug/10  Updated: 15/Sep/10  Resolved: 15/Sep/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: not determined

Type: Bug Priority: Major
Reporter: zorro Assignee: Joe Fialli
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: Linux


Issuezilla Id: 13,209

 Description   

build: ogs-3.1-web-b18-08_30_2010

  • start DAS
  • wait for DAS to start
  • start cluster (9 CORE instances on 9 machines)
  • wait for all cluster instances to start
  • kill n1c1m4
  • wait 10 seconds
  • stop cluster
  • wait for all cluster CORE nodes to stop
  • stop DAS
  • wait for DAS to stop
  • collect logs

bug:

Despite allowing 15 seconds after starting the cluster and after stopping the
cluster, failure and appointed notifications are not seen in some cluster instances.
http://aras2.sfbay.sun.com:8080/testresults/export1/gms/gf31/gms//set_08_31_10_t_10_28_29/scenario_0002_Tue_Aug_31_10_44_44_PDT_2010.html



 Comments   
Comment by Joe Fialli [ 03/Sep/10 ]

Confirmed that this issue was being observed on machines that had their time
synchronized. However, I have been unable to recreate the failure on another set
of machines at this time. I tried with 4 instances on a single machine and was
unable to recreate this failure. I also tried with 7 instances on Oracle
Enterprise Linux 5. I am running that test many times to see if it fails
intermittently.

*******

Corrections to bug description.

When starting a 9 instance cluster, one needs to wait 45 seconds after starting
the cluster for it to reach steady state. In the logs I have been analyzing,
that is the case. Below it is stated that the test only waited 15 seconds after
start-cluster, which is just too short a period of time.

Additionally, the glassfish application server turns off all logging when
shutting down. None of the GMS logging messages come out after shutdown, so
please allow more than 15 seconds from the last event to shutting down the
cluster, just in case. There are log messages in shoal announcing the completion
of message processing that just are not appearing because glassfish terminates
logging very early in the shutdown cycle. The following is now ALWAYS the last
line in the log.

[#|2010-09-03T14:27:30.190-0400|INFO|glassfish3.1|javax.enterprise.system.tools.admin.com.sun.enterprise.v3.admin.cluster|_ThreadID=15;_ThreadName=Thread-1;|Server
shutdown initiated|#]

No GMS log messages come out afterwards, but we know gms shutdown is called since
the DAS gets a planned shutdown notification when the instance initiates its shutdown.

Here are the log messages we see in the shoal gms simulated app server test after
shutdown starts.

[#|2010-09-03T14:52:25.809-0400|INFO|Shoal|ShoalLogger|_ThreadID=44;_ThreadName=MasterNode
processOutStandingMessages;ClassName=MasterNode$ProcessOutstandingMessagesTask;MethodName=run;|Completed
processing outstanding master node messages for member:instance03
group:testgroup oustandingMessages to process:0|#]

[#|2010-09-03T14:52:25.848-0400|INFO|Shoal|ShoalLogger|_ThreadID=15;_ThreadName=ViewWindowThread:testgroup;ClassName=ViewWindowImpl;MethodName=run;|normal
termination of ViewWindow thread for group testgroup|#]

[#|2010-09-03T14:52:25.849-0400|INFO|Shoal|ShoalLogger|_ThreadID=45;_ThreadName=MessageWindowThread:testgroup;ClassName=MessageWindow;MethodName=run;|MessageWindow
thread for group testgroup terminated due to shutdown notification|#]

[#|2010-09-03T14:52:25.849-0400|INFO|Shoal|ShoalLogger|_ThreadID=40;_ThreadName=com.sun.enterprise.ee.cms.impl.common.Router
Thread;ClassName=SignalHandler;MethodName=run;|SignalHandler task named
com.sun.enterprise.ee.cms.impl.common.Router Thread exiting|#]

These would be in the glassfish server.log if logging were not shut down so
quickly after shutdown starts. The following log message should show that
glassfish is calling gms.shutdown(), and it is not coming out either.

logger.info("Calling gms.shutdown()...");

We know it is getting called since planned shutdown is occurring in the master's
server.log.

Comment by Joe Fialli [ 08/Sep/10 ]

Analysis:
We have confirmed drops of multicast packets on the systems this was reported
on. These systems are configured with the Linux default values.

Below are the minimum values that we require.

net.core.rmem_max=524288
net.core.wmem_max=524288
net.core.rmem_default=524288
net.core.wmem_default=524288

On the systems where we are experiencing drops,
all of the above values are set to 131071.

We are still waiting for confirmation that configuring the system with these
values and rerunning the test addresses the dropped messages reported by this issue.
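
One way to check from Java whether the OS limits above are actually in effect is to ask for a large receive buffer and read back what was granted; a hedged sketch using standard java.net calls (on Linux the granted size is capped by net.core.rmem_max, so a value well below the request suggests the defaults have not been raised):

— begin sketch (java) —
import java.net.MulticastSocket;

// Hedged sketch: request a 512 KB receive buffer and print what the OS actually grants.
public class UdpBufferCheck {
    public static void main(String[] args) throws Exception {
        MulticastSocket socket = new MulticastSocket();
        socket.setReceiveBufferSize(524288);
        System.out.println("effective receive buffer: "
                + socket.getReceiveBufferSize() + " bytes");
        socket.close();
    }
}
— end —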

Comment by Joe Fialli [ 15/Sep/10 ]

This issue was resolved by the same resolution as issue 13200, so it is being
marked as a duplicate of that issue. The same documentation of system
configuration will correct this issue.

The solution to this issue is to document how to increase the UDP buffer size on
various OSes. The initial issue was reported on Linux, whose default values were
too small for a 9 instance cluster. This is definitely a
group-management-service issue, but "docs" was selected to indicate that the
solution is documenting how to tune the UDP buffer size based on the number of
instances in the cluster and the number of instances on a machine.

*** This issue has been marked as a duplicate of 13200 ***




[GLASSFISH-13166] umbrella feature for gms logging Created: 27/Aug/10  Updated: 26/Nov/10  Resolved: 24/Sep/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms06

Type: New Feature Priority: Major
Reporter: Bobby Bissett Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Dependency
depends on GLASSFISH-12196 Update external library Shoal GMS to ... Resolved
depends on GLASSFISH-11845 cluster/gms-adapter gms msg with mis... Resolved
depends on GLASSFISH-11990 Message Key is missing in LogStrings.... Resolved
Issuezilla Id: 13,166

 Description   

This feature is being used to track various issues around logging in the Shoal
and GlassFish GMS code. Some logging needs cleaning up, some needs message IDs,
some is still to be done, etc.



 Comments   
Comment by Bobby Bissett [ 27/Aug/10 ]

Added dependencies, cc list, and set final milestone to HCF since this issue
will include bug fixes. Individual sub-issues will have their own milestones set
appropriately.

Comment by Bobby Bissett [ 24/Sep/10 ]

Am marking fixed now that the sub-issues are all fixed and integrated.





[GLASSFISH-13084] GMSAnnounceBeforeStartClusterCommand generates error if gms-enabled=false Created: 23/Aug/10  Updated: 24/Aug/10  Resolved: 24/Aug/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1

Type: Bug Priority: Critical
Reporter: Joe Di Pol Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 13,084

 Description   

If you create a cluster using --gmsenabled=false:

$ asadmin create-cluster --gmsenabled=false ch1

And then later try to start the cluster, you get an error:

$ asadmin start-cluster ch1
remote failure: Usage: start-cluster [--verbose=false] clustername

Command start-cluster failed.

Here's the output in server.log:

[#|2010-08-20T09:45:14.168-0700|SEVERE|glassfish3.1|javax.enterprise.system.tools.admin.com.sun.enterprise.v3.admin|_ThreadID=14;_ThreadName=Thread-1;|The
log message is null.|#]

If I comment out this line in GMSAnnounceBeforeStartClusterCommand.java then the
error goes away and the cluster starts.

} finally {
    /* GMSAnnounceSupplementalInfo result = new GMSAnnounceSupplementalInfo(clusterMembers, gmsStartCluster, gmsadapter);
       report.setResultType(GMSAnnounceSupplementalInfo.class, result); */
}

Maybe this needs an "if (gmsAdapterService.isGmsEnabled()"?



 Comments   
Comment by Joe Di Pol [ 23/Aug/10 ]

Just noticed that this affects GMSAnnounceBeforeStopClusterCommand too.

Comment by Joe Fialli [ 24/Aug/10 ]

Fix committed for gms supplementary command for before start and stop cluster.

Comment by Joe Fialli [ 24/Aug/10 ]

Just wanted to add that the resolution differs very slightly from the proposed fix.

The fix checks that the instance variable gms is non-null before registering the
GMSAnnounceSupplementalInfo with the report in the finally block of the "before"
supplemental command added by gms for start-cluster and stop-cluster.
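
A hedged sketch of the shape of that fix, using the variable and type names quoted in the description (this is a fragment for illustration, not the actual command source; the surrounding fields and the announce logic are elided):

— begin sketch (java) —
// Hedged sketch: only attach the supplemental info when GMS is actually in use,
// so clusters created with --gmsenabled=false no longer hit a null reference here.
private void announceClusterStateChange() {
    try {
        // ... announce start/stop of the cluster via GMS ...
    } finally {
        if (gms != null) {
            GMSAnnounceSupplementalInfo result = new GMSAnnounceSupplementalInfo(
                    clusterMembers, gmsStartCluster, gmsadapter);
            report.setResultType(GMSAnnounceSupplementalInfo.class, result);
        }
    }
}
— end —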





[GLASSFISH-13079] make validate-multicast less verbose Created: 23/Aug/10  Updated: 17/Sep/10  Resolved: 17/Sep/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: v3.0.1
Fix Version/s: 3.1

Type: Improvement Priority: Minor
Reporter: Bobby Bissett Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: Macintosh


Issuezilla Id: 13,079

 Description   

RFE to remove the "|<uuid>" portion of the multicast messages that are output.
It makes the results a little hard to read.



 Comments   
Comment by Bobby Bissett [ 17/Sep/10 ]

Fixed in MS5.





[GLASSFISH-13056] Add validate-nodes-multicast command Created: 20/Aug/10  Updated: 07/Dec/11

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: future release

Type: New Feature Priority: Major
Reporter: Tom Mueller Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 13,056
Tags: 3_1_2-exclude

 Description   

This is a request for adding a validate-nodes-multicast command. This would be
a remote command (running in the DAS) that would use the SSH information in SSH
nodes to start validate-multicast on each node, and then would collect the
output and give a picture of what the multicast situation is for the entire
collection of nodes that are defined for the domain.

For example, if all nodes can communicate with each other via multicast (the
ideal situation for GMS), the output might be:

Multicast Groups:
1: localhost (DAS), n1, n2, n3, n4

However, if we have only n1<->n2 and n3<->n4 communicating, and the DAS can't
multicast to any of them, then the output could be:

Multicast Groups:
1: n1, n2
2: n3, n4

Isolated Nodes:
localhost (DAS)

If we had the situation where multicast doesn't work at all, the output could
be:

Isolated Nodes:
localhost (DAS), n1, n2, n3, n4

This information can currently be derived by running "asadmin validate-
multicast" on all of the nodes, and then analyzing the output. The idea of this
command is to automate the running of the command on all the nodes and to
analyze the output for the user.
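
For illustration, a hedged sketch of how the DAS-side analysis might group nodes once it has per-node "who did I hear" results (hypothetical helper, not part of any existing command):

— begin sketch (java) —
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: given, for each node, the set of nodes whose multicast packets
// it heard, partition the nodes into multicast groups; a group of size one is isolated.
public class MulticastGrouper {
    public static List<Set<String>> group(Map<String, Set<String>> heard) {
        List<Set<String>> groups = new ArrayList<Set<String>>();
        Set<String> visited = new HashSet<String>();
        for (String node : heard.keySet()) {
            if (visited.contains(node)) {
                continue;
            }
            Set<String> group = new HashSet<String>();
            Deque<String> stack = new ArrayDeque<String>();
            stack.push(node);
            while (!stack.isEmpty()) {
                String current = stack.pop();
                if (!visited.add(current)) {
                    continue;
                }
                group.add(current);
                Set<String> peers = heard.get(current);
                if (peers != null) {
                    for (String peer : peers) {
                        if (!visited.contains(peer)) {
                            stack.push(peer);
                        }
                    }
                }
            }
            groups.add(group);
        }
        return groups;
    }
}
— end —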



 Comments   
Comment by Joe Fialli [ 25/Mar/11 ]

Recommend broadening this command to encompass validate-cluster.

In GlassFish 3.2, there will exist a mode to enable GMS without UDP multicast.
It would be helpful if this command could verify GMS discovery based on current cluster configuration,
independent of whether multicast is enabled or not.

Comment by Bobby Bissett [ 25/Apr/11 ]

The command should definitely use the cluster configuration information in domain.xml, such as multicast address/port, or whatever is being used for non-multicast setups. Based on user feedback, the command should also give some warning about settings that are NOT specified in the config. For instance, if no network adapter is specified for a node, the tool should let the user know that it's not specifying one when run.

Comment by Bobby Bissett [ 07/Dec/11 ]

Moving to Joe since I'm no longer on project.





[GLASSFISH-12971] Multi-cast problem on an oracle enterprise linux machine with DHCP Created: 12/Aug/10  Updated: 02/Sep/10  Resolved: 02/Sep/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1

Type: Bug Priority: Major
Reporter: mzh777 Assignee: Joe Fialli
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: Linux


Attachments: XML File domain.xml     Text File linux_server.log     Text File solaris_server.log    
Issuezilla Id: 12,971

 Description   

Promoted build 15 on Oracle Enterprise Linux 5. One machine setup.

Used the script appserv-tests/devtests/ejb/ee/cluster-tests/create-cluster on an
oracle linux machine (10.132.106.153) to create a cluster with 4 instances for
build 15. The script was modified for BIND_INTERFACE_ADDRESS=10.132.106.153.
After finishing the run, I didn't see MemberId defined in the server.log of inst1
(see the attached linux_server.log).

IP config info on linux:
$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:14:4F:24:04:B0
inet addr:10.132.106.153 Bcast:10.132.107.255 Mask:255.255.254.0
inet6 addr: fe80::214:4fff:fe24:4b0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1431430 errors:0 dropped:0 overruns:0 frame:0
TX packets:900521 errors:549 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:787573651 (751.0 MiB) TX bytes:1636481844 (1.5 GiB)
Interrupt:217 Base address:0x4000

eth1 Link encap:Ethernet HWaddr 00:14:4F:24:04:B1
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:74 Base address:0x6000

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:1039948 errors:0 dropped:0 overruns:0 frame:0
TX packets:1039948 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1456315744 (1.3 GiB) TX bytes:1456315744 (1.3 GiB)

sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

I was able to run the same script (with BIND_INTERFACE_ADDRESS change) on a
solaris machine for build 15 successfully.
% ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232
index 1
inet 127.0.0.1 netmask ff000000
eri0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 10.5.220.242 netmask ffffff00 broadcast 10.5.220.255
ether 0:3:ba:3a:29:b1
The server.log of inst1 is also attached.



 Comments   
Comment by mzh777 [ 12/Aug/10 ]

Created an attachment (id=4679)
inst1 server.log on linux

Comment by mzh777 [ 12/Aug/10 ]

Created an attachment (id=4680)
inst1 server.log on solaris

Comment by Joe Fialli [ 12/Aug/10 ]

Request for more information.

Could you attach the domain.xml for the DAS for the linux machine?

GMS_BIND_INTERFACE_ADDRESS is verified to be working by various tests,
thus, this is probably a configuration issue.

I will be looking for following in domain.xml.

<system-property name="GMS-BIND-INTERFACE-ADDRESS-ming-cluster"
value="10.132.106.153"/>

as a child of element <config name="server-config" />
(to configure the DAS) and the system property should be set in each server
element of the cluster in the domain.xml.

If all of the above looks correct in the domain.xml on the linux machine, then
the DAS server.log from the linux machine would assist in diagnosing why inst1
is not seeing the DAS.

<system-property
name="GMS-BIND-INTERFACE-ADDRESS-dev-cluster"
value="129.148.71.168"/>

Comment by mzh777 [ 12/Aug/10 ]

Created an attachment (id=4681)
DAS domain.xml

Comment by mzh777 [ 19/Aug/10 ]

I used the same script to create a cluster with 4 instances on an oracle
enterprise linux machine in the lab and it worked. The instances are able to see
each other and GMS is now working. The ifconfig output from the lab machine:
-bash-3.2$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:23:8B:98:20:D4
inet addr:10.5.220.151 Bcast:10.5.220.255 Mask:255.255.255.0
inet6 addr: fe80::223:8bff:fe98:20d4/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3695092 errors:0 dropped:0 overruns:0 frame:0
TX packets:1625946 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3541044324 (3.2 GiB) TX bytes:408901012 (389.9 MiB)
Memory:dffe0000-e0000000

eth1 Link encap:Ethernet HWaddr 00:23:8B:98:20:D5
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Memory:dffa0000-dffc0000

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:338740 errors:0 dropped:0 overruns:0 frame:0
TX packets:338740 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:82001728 (78.2 MiB) TX bytes:82001728 (78.2 MiB)

sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

It seems to be an issue with multicast on the network. But how do I debug it?
asadmin validate-multicast
doesn't return anything.

Comment by Joe Fialli [ 20/Aug/10 ]

Lowered priority since this looks like a multicast configuration issue, not a
complete failure to work on a specific OS platform. Additionally, a workaround
is provided below to allow one to manage this issue if the default network
interface on the system is not multicast enabled but other network interfaces
are. (not sure what the issue is here. Have seen virtual imaging cause issues
in past, not sure if virtual images are involved or not in this reported case)

Here is how to use "asadmin validate-multicast" to find a network interface with
working multicast on a machine.

The "asadmin validate-multicast" both defaults and allows one to specific which
network interface address to use for multicast.
I encourage you to try other network interfaces on your machine with the asadmin
validate-multicast command and
see if one of the other interface works.

Here is command to specify the explicit network interface to try:

asadmin validate-multicast --bindinterface <ip address of network interface to
try to send multicast over>

Perform ifconfig and try each network interface on your machine. If one works,
then set system-property GMS-BIND-INTERFACE-ADDRESS-your-cluster-name
on all server elements in your cluster AND in config element for DAS, the config
with name "server-config". (You can search your existing domain.xml with
a cluster for the string "gms-bind-interface-address" to get the exact name of
the system property that you need to set. All instances must use IP addresses
on same subnet.
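
To pick candidate addresses to try with --bindinterface, a hedged sketch that lists the local interfaces reporting multicast support (standard java.net calls; whether multicast actually works end to end still has to be verified with asadmin validate-multicast):

— begin sketch (java) —
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Enumeration;

// Hedged sketch: print every up, non-loopback interface address that claims multicast
// support, as candidates for --bindinterface / GMS-BIND-INTERFACE-ADDRESS.
public class ListMulticastInterfaces {
    public static void main(String[] args) throws Exception {
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        while (nics.hasMoreElements()) {
            NetworkInterface nic = nics.nextElement();
            if (!nic.isUp() || nic.isLoopback() || !nic.supportsMulticast()) {
                continue;
            }
            Enumeration<InetAddress> addrs = nic.getInetAddresses();
            while (addrs.hasMoreElements()) {
                System.out.println(nic.getName() + " -> "
                        + addrs.nextElement().getHostAddress());
            }
        }
    }
}
— end —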

Comment by mzh777 [ 24/Aug/10 ]

Modified the summary, as this looks like a problem with multicasting on a DHCP machine.

Comment by Joe Fialli [ 02/Sep/10 ]

Marking this issue as WONTFIX.

GMS will only work when multicast is properly enabled on the network.
asadmin validate-multicast is a tool to verify whether multicast is working or not.

The user installed the OS himself and configured it for DHCP.
On lab machines with static IP addresses, multicast and thus GMS are working fine.

It is the user's responsibility to work with a network admin to configure
multicast properly for their environment.

I believe we identified that specifying the gms-bind-interface-address as
127.0.0.1 allowed one to run a cluster on a single DHCP machine for developer
level testing.





[GLASSFISH-12905] Spectator (DAS) is getting the 'appointed recovery notification' Created: 05/Aug/10  Updated: 25/Aug/10  Resolved: 25/Aug/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1

Type: Bug Priority: Major
Reporter: zorro Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: Linux


Issuezilla Id: 12,905

 Description   

build oges-3.1-web-b14-08_05_2010
DAS is getting 'appointed recovery notification' which was not the case in gf v2x.

http://aras2.sfbay.sun.com:8080/testresults/export1/gms/gf31/gms//set_08_05_10_t_15_32_22/scenario_0003_Thu_Aug__5_15_33_09_PDT_2010.html



 Comments   
Comment by Joe Fialli [ 25/Aug/10 ]

Fix checked in on Aug 2nd, so definitely in latest build for verification.





[GLASSFISH-12858] appointed recovery notification not seen for 1-node failure (without restart) Created: 30/Jul/10  Updated: 02/Aug/10  Resolved: 02/Aug/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms05

Type: Bug Priority: Major
Reporter: zorro Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: Linux


Issuezilla Id: 12,858

 Description   

build 14

Scenario 0002
================

  • start SPECTATOR
  • wait
  • start 10 CORE instances on 10 machines
  • kill n1c1m4
  • wait for CORE nodes to stop
  • wait for SPECTATOR to stop
  • collect logs
    ================

In this scenario one node is killed (without restarting it).
No appointed notification is seen in any of the three runs below.

http://aras2.sfbay.sun.com:8080/testresults/export1/gms/gf31/gms//set_07_30_10_t_20_19_15/final_Fri_Jul_30_21_33_11_PDT_2010.html

Expected:
Appointed recovery serve notification to be emitted.



 Comments   
Comment by Joe Fialli [ 02/Aug/10 ]

Confirmed this problem running 4 instance glassfish v3.1 cluster on one machine.

This appears to only occur when running under glassfish v3.1.
shoal gms runsimulatedcluster.sh kill test does see the required message.

#|2010-08-02T07:15:08.476-0400|INFO|Shoal|ShoalLogger|_ThreadID=13;_ThreadName=ViewWindowThread:testgroup;ClassName=RecoveryTargetSelector;MethodName=setRecoverySelectionState;|Appointed
Recovery Server:instance03:for failed member:instance02:for group:testgroup|#]
/ws/2010/transport/shoal/gms/LOGS/simulateCluster_kill//instance01.log
[#|2010-08-02T07:15:08.479-0400|INFO|Shoal|ShoalLogger|_ThreadID=13;_ThreadName=ViewWindowThread:testgroup;ClassName=RecoveryTargetSelector;MethodName=setRecoverySelectionState;|Appointed
Recovery Server:instance03:for failed member:instance02:for group:testgroup|#]
/ws/2010/transport/shoal/gms/LOGS/simulateCluster_kill//instance03.log
[#|2010-08-02T07:15:08.499-0400|INFO|Shoal|ShoalLogger|_ThreadID=13;_ThreadName=ViewWindowThread:testgroup;ClassName=RecoveryTargetSelector;MethodName=setRecoverySelectionState;|Appointed
Recovery Server:instance03:for failed member:instance02:for group:testgroup|#]
/ws/2010/transport/shoal/gms/LOGS/simulateCluster_kill//instance04.log

Comment by Joe Fialli [ 02/Aug/10 ]

Implemented a fix, verified it on my dev machine, checked the fix in.
Can be verified in first nightly v3.1 build available on Aug 3rd.





[GLASSFISH-12850] server not able to start if shoal-gms-impl jar is removed from modules directory Created: 30/Jul/10  Updated: 18/Aug/10  Resolved: 18/Aug/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms05

Type: Bug Priority: Major
Reporter: janey Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 12,850

 Description   

The GF RI distro should not include clustering jar files. But the server is not able to start if shoal-gms-
impl.jar is removed from the glassfish/modules directory.

After talking with Bobby Bissett, the understanding is that shoal-gms-api.jar contains the common api
classes that server startup depends on, not shoal-gms-impl.jar. It looks like shoal-gms-impl.jar currently
bundles classes from the api.

Also, we need to move shoal-gms-api.jar to a common packager (e.g. glassfish-common).

cc'ing Snjezana



 Comments   
Comment by Bobby Bissett [ 30/Jul/10 ]

Somehow the impl jar is bundling most of the core classes in the api jar, and so
we have the same package split between the jars and this is a Bad Thing. I think
this was all working correctly when Joe did the work to split a class into
api/impl parts and we could remove a package that otherwise would be in both
jars. Somewhere it's been messed up along the way.

I have the OSGi metadata cleanup task anyway and will take this. I've confirmed
all the behavior that Jane was seeing in my workspace, so I know it's not an
issue during GF packaging somehow.

Will sort out the jar story and follow up with Snjezana if I need any help
moving the gms bits to glassfish-common. It may have to wait until mid-next week
though.

Comment by Snjezana Sevo-Zenzerovic [ 30/Jul/10 ]

Sorry, I'll introduce yet another twist - ideally, we should not move
shoal-gms-api into glassfish-common package, but if at all possible completely
obliterate the need for shoal-gms-api if the domain configuration does not
contain any cluster/gms related configuration...

I'll take the discussion offline with Bobby and Joe.

Comment by Bobby Bissett [ 30/Jul/10 ]

I might as well put it here since it's not in writing anywhere else: there used
to be one gms package, and we didn't want that loaded all the time if clustering
wasn't being used. But there is some small api that people need to code to so
that they can use gms when clustering is present. So the jar was split into
the api and impl pieces, and the api jar is supposed to be very small but
necessary so that potential gms client code can compile.

The bulk of the gms code is supposed to be in the impl jar, but there needs to
be some small amount that is imported anyway. For instance, the code that checks
whether or not there is a cluster needs some way to get gms started when needed.
So it depends on the smaller api jar, but not the larger impl jar. You're
correct that most of it should not be loaded if there are no clusters, and we've
worked to keep the api jar small.

Comment by Snjezana Sevo-Zenzerovic [ 30/Jul/10 ]

Well, my primary motivation for trying to keep all of shoal (including the API jar)
together in a single IPS package was to be able to update shoal content
independently, given that it comes from a separate project. I completely understand
the rationale behind the API/impl split and it is the right thing to do. And now
I'm really taking it offline.

Comment by Bobby Bissett [ 04/Aug/10 ]

I've tried with the new api/impl jars that don't have the code duplicated, but
am still seeing a problem loading the bootstrap module:

Unresolved constraint in bundle org.glassfish.cluster.gms-bootstrap [131]:
Unable to resolve 131.0: missing requirement [131.0] package;
(package=com.sun.enterprise.ee.cms.core) - [131.0] package;
(package=com.sun.enterprise.ee.cms.core)|#]

That package is in the shoal-gms-api jar, so there must be some problem in the
metadata. At least it doesn't work in the case where the api is removed and impl
is still present, which is progress.

Comment by Bobby Bissett [ 04/Aug/10 ]

I don't understand it (the error, OSGi, life). Here's the info in the manifest
of the shoal-gms-api.jar file:

— begin —
Manifest-Version: 1.0
Export-Package: com.sun.enterprise.ee.cms.core;uses:="com.sun.enterpri
se.ee.cms.spi",com.sun.enterprise.ee.cms.spi;uses:="com.sun.enterpris
e.ee.cms.core"
Built-By: bobby
Tool: Bnd-0.0.357
Bundle-Name: shoal-gms-api
Created-By: Apache Maven Bundle Plugin
Bundle-Version: 1.5.8.SNAPSHOT
Build-Jdk: 1.6.0_20
Bnd-LastModified: 1280947610041
Bundle-ManifestVersion: 2
Bundle-Activator: com.sun.enterprise.osgi.ShoalActivator
Import-Package: com.sun.enterprise.ee.cms.core,com.sun.enterprise.ee.c
ms.spi,com.sun.enterprise.osgi
Bundle-SymbolicName: org.shoal.gms-api
— end —

It's exporting the right things (I think). Could it be that it's also importing
the same packages it's exporting?

Comment by Bobby Bissett [ 11/Aug/10 ]

Setting target milestone. Have sent out some more requests for OSGi help, so this may be resolved much
sooner than ms5. (I imagine the fix is small; I just don't know what it is.)

Comment by Bobby Bissett [ 12/Aug/10 ]

Solved! I finally realized there was a Bundle-Activator element in both pom.xml
files containing the class com.sun.enterprise.osgi.ShoalActivator. But that
class only existed in the impl jar, and so the api jar probably wasn't loading
properly without it.

The com.sun.enterprise.osgi.ShoalActivator class didn't actually do anything, so
I removed it and the references to it. Then with a rebuilt shoal-gms-api.jar
file in my v3 modules dir, I could start the server without the
shoal-gms-impl.jar present.

If someone tries to create a cluster, all kinds of OSGi panic happens. I'll see
if there's anything I can do about that before promoting/integrating a new Shoal
version.

Comment by Sanjeeb Sahoo [ 18/Aug/10 ]

While doing the split, please also move
cluster/gms-bootstrap/src/main/java/org/glassfish/gms/bootstrap/GMSAdapterService.java
to an impl package, which does not get exported by default. We don't want users
to directly depend on GMSAdapterService, do we? This should be a straightforward
change.

Comment by Bobby Bissett [ 18/Aug/10 ]

Yes, we want users to directly inject an instance of GMSAdapterService. See
comments in the class.
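
For illustration, a hedged sketch of what such client code could look like (the HK2 @Service/@Inject annotations of that era are assumed, and isGmsEnabled() is the check suggested in GLASSFISH-13084, not a verified signature):

— begin sketch (java) —
import org.glassfish.gms.bootstrap.GMSAdapterService;
import org.jvnet.hk2.annotations.Inject;
import org.jvnet.hk2.annotations.Service;

// Hedged sketch: a GlassFish service that depends only on the gms-bootstrap api,
// not on the shoal-gms-impl jar.
@Service
public class MyClusterAwareService {

    @Inject
    private GMSAdapterService gmsAdapterService;

    public void doClusterWork() {
        // Assumed check; the exact GMSAdapterService API is not confirmed here.
        if (gmsAdapterService != null && gmsAdapterService.isGmsEnabled()) {
            // interact with GMS via the adapter when clustering is configured
        }
    }
}
— end —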

Comment by Bobby Bissett [ 18/Aug/10 ]

The shoal changes have been integrated, and I made a change in GMSAdapterService
in revision 39887 to handle the missing impl gracefully.





[GLASSFISH-12826] asadmin get-health cluster fails Created: 27/Jul/10  Updated: 28/Jul/10  Resolved: 28/Jul/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms04

Type: Bug Priority: Major
Reporter: zorro Assignee: Joe Fialli
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: Linux


Issuezilla Id: 12,826

 Description   

build glassfish-3.1-b13-07_27_2010

Clause GMS-06 in GlassFishv3.1GMS states:
GMS-06 P2 asadmin get-health clustered-instance or cluster YES Joe 3-7 days
feature parity note: only works for gms-enabled cluster

Bug:
get-health is not functioning as follows:

./asadmin get-health
CLI194 Previously supported command: get-health is not supported for this release.
iteas1@easqezorro1:/export/space/glassfishv3/glassfish/bin>./asadmin version
Version = GlassFish Server Open Source Edition 3.1-SNAPSHOT (build 13)
Command version executed successfully.

Expected:
As stated in the feature delivery list, GMS-06 should work in MS3



 Comments   
Comment by Joe Fialli [ 28/Jul/10 ]

Feature is not scheduled to be implemented till MS4.

*** This issue has been marked as a duplicate of 12193 ***




[GLASSFISH-12730] gms.getGroupHandle().getPreviousAliveAndReadyCoreView() returns different values on different instances Created: 19/Jul/10  Updated: 26/Nov/10  Resolved: 29/Jul/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms03

Type: Bug Priority: Critical
Reporter: Mahesh Kannan Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File SFSBDriver.war    
Issue Links:
Dependency
blocks GLASSFISH-12228 HA-3: Implement load() operation to r... Resolved
Issuezilla Id: 12,730

 Description   

This was working till yesterday...

Currently the replication module uses the following gms APIs to obtain current
and previous views:

List<String> currentAliveAndReadyMembers =
gms.getGroupHandle().getCurrentAliveOrReadyMembers();
AliveAndReadyView aView =
gms.getGroupHandle().getPreviousAliveAndReadyCoreView();

The above methods are called when JoinAndReady or Failure events occur. I
noticed (and Rajiv faces the issue too) that
gms.getGroupHandle().getPreviousAliveAndReadyCoreView() returns different values
on different instances when a gf instance is shut down.

The previous view is used by the replication module to load the data after a failure.
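
For context, a hedged sketch of how a replication-style consumer might use these two calls on a failure notification (GroupManagementService and AliveAndReadyView are assumed here to live in the com.sun.enterprise.ee.cms.core api package mentioned elsewhere in these reports; the getMembers() accessor on AliveAndReadyView is an assumption for illustration):

— begin sketch (java) —
import java.util.List;

import com.sun.enterprise.ee.cms.core.AliveAndReadyView;
import com.sun.enterprise.ee.cms.core.GroupManagementService;

// Hedged sketch: on a failure signal, compare the current members with the previous
// alive-and-ready core view to decide where replicated data may live.
public class ViewInspector {
    public static void onFailure(GroupManagementService gms, String failedMember) {
        List<String> current = gms.getGroupHandle().getCurrentAliveOrReadyMembers();
        AliveAndReadyView previous = gms.getGroupHandle().getPreviousAliveAndReadyCoreView();
        System.out.println("failed=" + failedMember
                + " current=" + current
                + " previousView=" + previous.getMembers());  // getMembers() assumed
    }
}
— end —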

To reproduce

1. create a cluster of 3 or more instances.
2. Deploy a .war file using "asadmin deploy --target=<cluster-name>
--availabilityenabled=true --force=true <.war>
3. Now after a few seconds, kill an instance and look into gf logs of other
instances. The previous view reported by each of these may be different (this
happens intermittently).

Without this, HTTP and EJB failover will not work.

Note: the previous view does match (on all instances) sometimes. On those
occasions, failover works.



 Comments   
Comment by Joe Fialli [ 20/Jul/10 ]

Will try to recreate based on description provided.

However, if this issue is timing related, may not be possible to recreate.
Recommend attaching server.log that illustrate issue.

Comment by Joe Fialli [ 20/Jul/10 ]

Can you confirm that the "kill" step was a "kill -9".

Comment by Joe Fialli [ 20/Jul/10 ]

Recreated this issue in the shoal gms dev test "runsimulatecluster.sh stop" when
running with 10 instances (all instances started at the same time in the background).

This approximates the original issue, where asadmin start-cluster starts 4 instances
serially (instance1 is started first and waited on until it finishes starting up,
then the next instance in the cluster is started, and so on).

Checked a fix into shoal gms that fixed the issue I recreated in the shoal dev test.
Still need to verify in the glassfish v3.1 env (will do and report back).

Comment by Mahesh Kannan [ 20/Jul/10 ]

Created an attachment (id=4601)
Web + Ejb app to reproduce this issue

Comment by Joe Fialli [ 21/Jul/10 ]

integrated fix into v3.1 b12 (MS3 build)

Confirmed fix with Mahesh using patch with gf v3.1.

Comment by Mahesh Kannan [ 21/Jul/10 ]

This is still seen on ms3 b12 build

Comment by Joe Fialli [ 21/Jul/10 ]

Working on resolving the issue.

The workaround for what is checked into b12 is the following: replace
"start-cluster" with "start-instance" for all members of the cluster, and place
each invocation in the background.

#asadmin start-cluster myCluster
asadmin start-instance instance1 &
asadmin start-instance instance2 &
asadmin start-instance instance3 &
asadmin start-instance instance4 &

All Shoal gms dev testing started gms clients in the background at the same time
(this simulates how GF v2.1 start-cluster starts all instances at once).

What is failing is that gf v3.1 start-cluster currently starts instances serially,
one after the other. The current shoal gms impl was not tested against such a
case. Working on a new impl that will handle all cases, but in the meantime,
starting instances simultaneously provides a workaround that will not fail
anywhere near as frequently as the worst case of starting one after another.

Plan on having a well tested fix soon.

Comment by Joe Fialli [ 23/Jul/10 ]

The fix is checked into the shoal/gms workspace.

A distributed developer-level test within GF v3.1 is being run that validates
previous and current views in the following manner:

After start-cluster completes, the test verifies that all instances
have the same CURRENT view. The PREVIOUS view is undefined in the current implementation.

After FAILURE, verify that all running instances have the same CURRENT and PREVIOUS
views.

After PLANNED_SHUTDOWN, verify that all running instances have the same CURRENT and
PREVIOUS views.

After an instance starts, all running instances should have the same CURRENT view.
In the current implementation, the just-started instance will have an empty
previous view.

The test is being run 30 times to check whether any timing issues remain.

Comment by Rajiv Mordani [ 28/Jul/10 ]

I just did a fresh checkout of the sources today. We need this bug to be fixed
for us to reliably test HA functionality. Currently when I do a start-cluster,
not all the instances have the correct "current view" and "previous view".

Comment by Joe Fialli [ 28/Jul/10 ]

This bug was not marked as having the fix integrated.
The fix was checked into the shoal/gms workspace; it was not integrated into
glassfish. To get the fix and try it out yourself, you can build the
latest shoal/gms:

% svn checkout
https://shoal.dev.java.net/svn/shoal/branches/SHOAL_1_1_ABSTRACTING_TRANSPORT

% cd shoal
% mvn install
% cp gms/impl/target/shoal-gms-impl.jar glassfishv3/glassfish/modules
% cp gms/api/target/shoal-gms-api.jar glassfishv3/glassfish/modules

I was awaiting review of changes proposed to StartClusterCommand.java so
GMS would know the difference between GROUP_STARTUP and INSTANCE_STARTUP.

I have it all working but the review comments are that I need to use
@Supplemental. Will work on this now.

P1 is too high for this issue. HA is supposed to be able to deal with
the situation where the hint about where the data is located is incorrect, and fall
back to a broadcast. This is non-optimal and must be fixed, but it is not a P1.

We will integrate the shoal/gms changes into the glassfish v3.1 workspace tomorrow.
However, until the start-cluster change is checked in, the previous view after
start-cluster completes could still be undefined and not the same on all instances.

Comment by Mahesh Kannan [ 28/Jul/10 ]

Yes, the replication module could have done (and eventually it should do) a broadcast
to find the session, but for MS3 I thought I could just rely on
getPreviousView.

While I agree that HA should resort to a broadcast if the instance that is
supposed to contain the data (according to getPreviousView) doesn't have it, doing a
directed load request followed by a broadcast is far worse than doing only a
broadcast. At least for this release, where we are considering just a single
failure (meaning the session states have been replicated before another
failure occurs), I am assuming that we do not have to resort to a broadcast.

I agree with you that this is not a P1, though.

Comment by Rajiv Mordani [ 29/Jul/10 ]

Hi Joe,
When do you plan to integrate this into GlassFish? I want to do a handoff to
QA but would like to use a build of GlassFish that works better, with the GMS
fix in it.

Mahesh, do you plan to support the broadcast case in 3.1?

Comment by Bobby Bissett [ 29/Jul/10 ]

Adding self to cc list. We can promote a Shoal release and integrate into GF
today if everyone is ready.

Comment by Joe Fialli [ 29/Jul/10 ]

Fixes integrated into the glassfish v3.1 workspace (shoal-gms revision 1.5.6).
They will be in tonight's nightly.

Confirmed fixed by an automated distributed dev test with 8 instances, run 30 times
to verify that no intermittent issues occur.
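
A simple way to repeat such a soak run against the shoal dev test mentioned earlier (a sketch only; it assumes a bash shell and that the script reports failure through its exit status, so adjust the script name and arguments to your checkout):

# run the shoal gms dev test 30 times, stopping at the first failing run
for i in $(seq 1 30); do
    echo "=== run $i ==="
    ./runsimulatecluster.sh stop || { echo "failure detected on run $i"; break; }
done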





[GLASSFISH-12692] NPE and Obnoxiously long error Created: 16/Jul/10  Updated: 19/Jul/10  Resolved: 19/Jul/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1

Type: Bug Priority: Critical
Reporter: Byron Nevins Assignee: Joe Fialli
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 12,692

 Description   

1 severe error message for every item on the call stack!

[#|2010-07-16T12:48:36.412-0700|WARNING|glassfish3.1|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=15;_ThreadName=Configuration
Updater;|An exception occurred while processing GMS configuration properties:

{0}
java.lang.NullPointerException
at
org.glassfish.gms.GMSAdapterImpl.readGMSConfigProps(GMSAdapterImpl.java:289)
at org.glassfish.gms.GMSAdapterImpl.initializeGMS(GMSAdapterImpl.java:394)
at org.glassfish.gms.GMSAdapterImpl.initialize(GMSAdapterImpl.java:190)
at
org.glassfish.gms.bootstrap.GMSAdapterService.loadModule(GMSAdapterService.java:191)
at
org.glassfish.gms.bootstrap.GMSAdapterService.checkCluster(GMSAdapterService.java:171)
at
org.glassfish.gms.bootstrap.GMSAdapterService.postConstruct(GMSAdapterService.java:116)
at com.sun.hk2.component.AbstractWombImpl.inject(AbstractWombImpl.java:174)
at com.sun.hk2.component.ConstructorWomb$1.run(ConstructorWomb.java:87)
at java.security.AccessController.doPrivileged(Native Method)
at com.sun.hk2.component.ConstructorWomb.initialize(ConstructorWomb.java:84)
at com.sun.hk2.component.AbstractWombImpl.get(AbstractWombImpl.java:77)
at
com.sun.hk2.component.SingletonInhabitant.get(SingletonInhabitant.java:58)
at com.sun.hk2.component.LazyInhabitant.get(LazyInhabitant.java:107)
at
com.sun.hk2.component.AbstractInhabitantImpl.get(AbstractInhabitantImpl.java:60)
at
com.sun.enterprise.v3.server.AppServerStartup.run(AppServerStartup.java:236)
at
com.sun.enterprise.v3.server.AppServerStartup.start(AppServerStartup.java:128)
at
com.sun.enterprise.glassfish.bootstrap.GlassFishActivator$2.addingService(GlassFishActivator.java:157)
at
org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:896)
at
org.osgi.util.tracker.AbstractTracked.trackAdding(AbstractTracked.java:261)
at org.osgi.util.tracker.AbstractTracked.track(AbstractTracked.java:233)
at
org.osgi.util.tracker.ServiceTracker$Tracked.serviceChanged(ServiceTracker.java:840)
at
org.apache.felix.framework.util.EventDispatcher.invokeServiceListenerCallback(EventDispatcher.java:864)
at
org.apache.felix.framework.util.EventDispatcher.fireEventImmediately(EventDispatcher.java:732)
at
org.apache.felix.framework.util.EventDispatcher.fireServiceEvent(EventDispatcher.java:662)
at org.apache.felix.framework.Felix.fireServiceEvent(Felix.java:3745)
at org.apache.felix.framework.Felix.access$000(Felix.java:80)
at org.apache.felix.framework.Felix$2.serviceChanged(Felix.java:717)
at
org.apache.felix.framework.ServiceRegistry.registerService(ServiceRegistry.java:107)
at org.apache.felix.framework.Felix.registerService(Felix.java:2862)
at
org.apache.felix.framework.BundleContextImpl.registerService(BundleContextImpl.java:251)
at
org.apache.felix.framework.BundleContextImpl.registerService(BundleContextImpl.java:229)
at
org.jvnet.hk2.osgiadapter.HK2Main$StartupContextService.updated(HK2Main.java:113)
at
org.apache.felix.cm.impl.ConfigurationManager$UpdateConfiguration.run(ConfigurationManager.java:1389)
at org.apache.felix.cm.impl.UpdateThread.run(UpdateThread.java:88)
|#]

[#|2010-07-16T12:48:36.413-0700|WARNING|glassfish3.1|javax.org.glassfish.gms.org.glassfish.gms|_ThreadID=15;_ThreadName=Configuration
Updater;|An exception occurred while processing GMS configuration properties: {0}

java.lang.NullPointerException
at
org.glassfish.gms.GMSAdapterImpl.readGMSConfigProps(GMSAdapterImpl.java:289)
at org.glassfish.gms.GMSAdapterImpl.initializeGMS(GMSAdapterImpl.java:394)
at org.glassfish.gms.GMSAdapterImpl.initialize(GMSAdapterImpl.java:190)
at
org.glassfish.gms.bootstrap.GMSAdapterService.loadModule(GMSAdapterService.java:191)
at
org.glassfish.gms.bootstrap.GMSAdapterService.checkCluster(GMSAdapterService.java:171)
at
org.glassfish.gms.bootstrap.GMSAdapterService.postConstruct(GMSAdapterService.java:116)
at com.sun.hk2.component.AbstractWombImpl.inject(AbstractWombImpl.java:174)
at com.sun.hk2.component.ConstructorWomb$1.run(ConstructorWomb.java:87)
at java.security.AccessController.doPrivileged(Native Method)
at com.sun.hk2.component.ConstructorWomb.initialize(ConstructorWomb.java:84)
at com.sun.hk2.component.AbstractWombImpl.get(AbstractWombImpl.java:77)
at
com.sun.hk2.component.SingletonInhabitant.get(SingletonInhabitant.java:58)
at com.sun.hk2.component.LazyInhabitant.get(LazyInhabitant.java:107)
at
com.sun.hk2.component.AbstractInhabitantImpl.get(AbstractInhabitantImpl.java:60)
at
com.sun.enterprise.v3.server.AppServerStartup.run(AppServerStartup.java:236)
at
com.sun.enterprise.v3.server.AppServerStartup.start(AppServerStartup.java:128)
at
com.sun.enterprise.glassfish.bootstrap.GlassFishActivator$2.addingService(GlassFishActivator.java:157)
at
org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:896)
at
org.osgi.util.tracker.AbstractTracked.trackAdding(AbstractTracked.java:261)
at org.osgi.util.tracker.AbstractTracked.track(AbstractTracked.java:233)
at
org.osgi.util.tracker.ServiceTracker$Tracked.serviceChanged(ServiceTracker.java:840)
at
org.apache.felix.framework.util.EventDispatcher.invokeServiceListenerCallback(EventDispatcher.java:864)
at
org.apache.felix.framework.util.EventDispatcher.fireEventImmediately(EventDispatcher.java:732)
at
org.apache.felix.framework.util.EventDispatcher.fireServiceEvent(EventDispatcher.java:662)
at org.apache.felix.framework.Felix.fireServiceEvent(Felix.java:3745)
at org.apache.felix.framework.Felix.access$000(Felix.java:80)
at org.apache.felix.framework.Felix$2.serviceChanged(Felix.java:717)
at
org.apache.felix.framework.ServiceRegistry.registerService(ServiceRegistry.java:107)
at org.apache.felix.framework.Felix.registerService(Felix.java:2862)
at
org.apache.felix.framework.BundleContextImpl.registerService(BundleContextImpl.java:251)
at
org.apache.felix.framework.BundleContextImpl.registerService(BundleContextImpl.java:229)
at
org.jvnet.hk2.osgiadapter.HK2Main$StartupContextService.updated(HK2Main.java:113)
at
org.apache.felix.cm.impl.ConfigurationManager$UpdateConfiguration.run(ConfigurationManager.java:1389)
at org.apache.felix.cm.impl.UpdateThread.run(UpdateThread.java:88)

#]

[#|2010-07-16T12:48:36.421-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;|java.lang.NullPointerException|#]

[#|2010-07-16T12:48:36.422-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.glassfish.gms.GMSAdapterImpl.readGMSConfigProps(GMSAdapterImpl.java:337)|#]

[#|2010-07-16T12:48:36.422-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.glassfish.gms.GMSAdapterImpl.initializeGMS(GMSAdapterImpl.java:394)|#]

[#|2010-07-16T12:48:36.422-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.glassfish.gms.GMSAdapterImpl.initialize(GMSAdapterImpl.java:190)|#]

[#|2010-07-16T12:48:36.423-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.glassfish.gms.bootstrap.GMSAdapterService.loadModule(GMSAdapterService.java:191)|#]

[#|2010-07-16T12:48:36.423-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.glassfish.gms.bootstrap.GMSAdapterService.checkCluster(GMSAdapterService.java:171)|#]

[#|2010-07-16T12:48:36.423-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.glassfish.gms.bootstrap.GMSAdapterService.postConstruct(GMSAdapterService.java:116)|#]

[#|2010-07-16T12:48:36.423-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
com.sun.hk2.component.AbstractWombImpl.inject(AbstractWombImpl.java:174)|#]

[#|2010-07-16T12:48:36.424-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
com.sun.hk2.component.ConstructorWomb$1.run(ConstructorWomb.java:87)|#]

[#|2010-07-16T12:48:36.424-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at java.security.AccessController.doPrivileged(Native Method)|#]

[#|2010-07-16T12:48:36.424-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
com.sun.hk2.component.ConstructorWomb.initialize(ConstructorWomb.java:84)|#]

[#|2010-07-16T12:48:36.424-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
com.sun.hk2.component.AbstractWombImpl.get(AbstractWombImpl.java:77)|#]

[#|2010-07-16T12:48:36.425-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
com.sun.hk2.component.SingletonInhabitant.get(SingletonInhabitant.java:58)|#]

[#|2010-07-16T12:48:36.425-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
com.sun.hk2.component.LazyInhabitant.get(LazyInhabitant.java:107)|#]

[#|2010-07-16T12:48:36.425-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
com.sun.hk2.component.AbstractInhabitantImpl.get(AbstractInhabitantImpl.java:60)|#]

[#|2010-07-16T12:48:36.427-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
com.sun.enterprise.v3.server.AppServerStartup.run(AppServerStartup.java:236)|#]

[#|2010-07-16T12:48:36.428-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
com.sun.enterprise.v3.server.AppServerStartup.start(AppServerStartup.java:128)|#]

[#|2010-07-16T12:48:36.428-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
com.sun.enterprise.glassfish.bootstrap.GlassFishActivator$2.addingService(GlassFishActivator.java:157)|#]

[#|2010-07-16T12:48:36.429-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.osgi.util.tracker.ServiceTracker$Tracked.customizerAdding(ServiceTracker.java:896)|#]

[#|2010-07-16T12:48:36.429-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.osgi.util.tracker.AbstractTracked.trackAdding(AbstractTracked.java:261)|#]

[#|2010-07-16T12:48:36.429-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.osgi.util.tracker.AbstractTracked.track(AbstractTracked.java:233)|#]

[#|2010-07-16T12:48:36.430-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.osgi.util.tracker.ServiceTracker$Tracked.serviceChanged(ServiceTracker.java:840)|#]

[#|2010-07-16T12:48:36.430-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.apache.felix.framework.util.EventDispatcher.invokeServiceListenerCallback(EventDispatcher.java:864)|#]

[#|2010-07-16T12:48:36.431-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.apache.felix.framework.util.EventDispatcher.fireEventImmediately(EventDispatcher.java:732)|#]

[#|2010-07-16T12:48:36.431-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.apache.felix.framework.util.EventDispatcher.fireServiceEvent(EventDispatcher.java:662)|#]

[#|2010-07-16T12:48:36.432-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.apache.felix.framework.Felix.fireServiceEvent(Felix.java:3745)|#]

[#|2010-07-16T12:48:36.432-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at org.apache.felix.framework.Felix.access$000(Felix.java:80)|#]

[#|2010-07-16T12:48:36.433-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.apache.felix.framework.Felix$2.serviceChanged(Felix.java:717)|#]

[#|2010-07-16T12:48:36.433-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.apache.felix.framework.ServiceRegistry.registerService(ServiceRegistry.java:107)|#]

[#|2010-07-16T12:48:36.433-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.apache.felix.framework.Felix.registerService(Felix.java:2862)|#]

[#|2010-07-16T12:48:36.434-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.apache.felix.framework.BundleContextImpl.registerService(BundleContextImpl.java:251)|#]

[#|2010-07-16T12:48:36.434-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.apache.felix.framework.BundleContextImpl.registerService(BundleContextImpl.java:229)|#]

[#|2010-07-16T12:48:36.435-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.jvnet.hk2.osgiadapter.HK2Main$StartupContextService.updated(HK2Main.java:113)|#]

[#|2010-07-16T12:48:36.435-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.apache.felix.cm.impl.ConfigurationManager$UpdateConfiguration.run(ConfigurationManager.java:1389)|#]

[#|2010-07-16T12:48:36.436-0700|SEVERE|glassfish3.1|null|_ThreadID=15;_ThreadName=Configuration
Updater;| at
org.apache.felix.cm.impl.UpdateThread.run(UpdateThread.java:88)|#]

[#|2010-07-16T12:48:36.436-0700|INFO|glassfish3.1|javax.enterprise.system.core.com.sun.enterprise.v3.server|_ThreadID=15;_ThreadName=Configuration
Updater;|Startup service failed to start : null|#]

[#|2010-07-16T12:48:36.537-0700|INFO|glassfish3.1|javax.enterprise.system.core.com.sun.enterprise.v3.server|_ThreadID=15;_ThreadName=Configuration
Updater;|GlassFish Server Open Source Edition 3.1-SNAPSHOT (re-continuous)
startup time : Felix(25238ms) startup services(13134ms) total(38372ms)|#]

[#|2010-07-16T12:48:37.868-0700|INFO|glassfish3.1|javax.enterprise.system.core.com.sun.enterprise.v3.server|_ThreadID=51;_ThreadName



 Comments   
Comment by Joe Fialli [ 19/Jul/10 ]

Fix checked in; see 12719 for a description of the problem and its resolution.

*** This issue has been marked as a duplicate of 12719 ***




[GLASSFISH-12662] repetitive WARNING msg from GMS causes lots of rotated logfiles Created: 14/Jul/10  Updated: 19/Jul/10  Resolved: 19/Jul/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms03

Type: Bug Priority: Major
Reporter: Anissa Lam Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Mac OS X
Platform: Macintosh


Attachments: Text File ABC-server.log     Text File server.log_2010-07-14T00-38-49    
Issuezilla Id: 12,662

 Description   

I have created 1 standalone instance and 1 instance that's part of a cluster. Both are local on my
machine.
Very often, I am seeing this WARNING in server.log coming up every 2 minutes. Maybe this is due to my
machine sleeping during the night?
When I woke up this morning, there were many huge log files on my system.

~/Awork/V3/v3/dist-gf/glassfish/domains/domain1/logs 246) ll
total 60200
-rw-r--r-- 1 anilam owner 2002298 Jul 13 13:02 server.log_2010-07-13T13-02-36
-rw-r--r-- 1 anilam owner 2003620 Jul 14 00:38 server.log_2010-07-14T00-38-49
-rw-r--r-- 1 anilam owner 2004963 Jul 14 01:29 server.log_2010-07-14T01-29-29
-rw-r--r-- 1 anilam owner 2000381 Jul 14 01:29 server.log_2010-07-14T01-29-31
-rw-r--r-- 1 anilam owner 2004180 Jul 14 01:36 server.log_2010-07-14T01-36-49
-rw-r--r-- 1 anilam owner 2001510 Jul 14 01:36 server.log_2010-07-14T01-36-50
-rw-r--r-- 1 anilam owner 2002610 Jul 14 01:36 server.log_2010-07-14T01-36-51
-rw-r--r-- 1 anilam owner 2001849 Jul 14 01:46 server.log_2010-07-14T01-46-26
-rw-r--r-- 1 anilam owner 2002492 Jul 14 01:46 server.log_2010-07-14T01-46-27
-rw-r--r-- 1 anilam owner 2001328 Jul 14 01:46 server.log_2010-07-14T01-46-29
-rw-r--r-- 1 anilam owner 2014212 Jul 14 01:57 server.log_2010-07-14T01-46-24
-rw-r--r-- 1 anilam owner 0 Jul 14 02:03 server.log.lck
-rw-r--r-- 1 anilam owner 2002649 Jul 14 02:05 server.log_2010-07-14T02-05-09
-rw-r--r-- 1 anilam owner 2001532 Jul 14 02:05 server.log_2010-07-14T02-05-11
-rw-r--r-- 1 anilam owner 2002077 Jul 14 02:05 server.log_2010-07-14T02-05-13
-rw-r--r-- 1 anilam owner 2002545 Jul 14 02:05 server.log_2010-07-14T02-05-15
-rw-r--r-- 1 anilam owner 16384 Jul 14 02:05 jvm.log
-rw-r--r-- 1 anilam owner 735066 Jul 14 08:28 server.log

Following is the WARNING. I will attach one of the rotated log files.

I am not sure if this should be under 'admin' or gms. Please reassign if needed.

[#|2010-07-14T08:07:08.102-0700|WARNING|glassfish3.1|ShoalLogger|_ThreadID=30;_ThreadName=HealthMonitor for Group:clusterABC;|Failed to send message
java.io.IOException: Can't assign requested address
at java.net.PlainDatagramSocketImpl.send(Native Method)
at java.net.DatagramSocket.send(DatagramSocket.java:612)
at com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender.doBroadcast(BlockingIOMulticastSender.java:250)
at com.sun.enterprise.mgmt.transport.AbstractMulticastMessageSender.broadcast(AbstractMulticastMessageSender.java:65)
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyNetworkManager.broadcast(GrizzlyNetworkManager.java:538)
at com.sun.enterprise.mgmt.HealthMonitor.send(HealthMonitor.java:830)
at com.sun.enterprise.mgmt.HealthMonitor.reportMyState(HealthMonitor.java:726)
at com.sun.enterprise.mgmt.HealthMonitor.run(HealthMonitor.java:773)
at java.lang.Thread.run(Thread.java:637)

#]

[#|2010-07-14T08:07:08.102-0700|WARNING|glassfish3.1|ShoalLogger|_ThreadID=30;_ThreadName=HealthMonitor for Group:clusterABC;|null, send returned false|#]

[#|2010-07-14T08:07:08.119-0700|WARNING|glassfish3.1|ShoalLogger|_ThreadID=45;_ThreadName=HealthMonitor for Group:clusterXYZ;|Failed to send message
java.io.IOException: Can't assign requested address
at java.net.PlainDatagramSocketImpl.send(Native Method)
at java.net.DatagramSocket.send(DatagramSocket.java:612)
at com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender.doBroadcast(BlockingIOMulticastSender.java:250)
at com.sun.enterprise.mgmt.transport.AbstractMulticastMessageSender.broadcast(AbstractMulticastMessageSender.java:65)
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyNetworkManager.broadcast(GrizzlyNetworkManager.java:538)
at com.sun.enterprise.mgmt.HealthMonitor.send(HealthMonitor.java:830)
at com.sun.enterprise.mgmt.HealthMonitor.reportMyState(HealthMonitor.java:726)
at com.sun.enterprise.mgmt.HealthMonitor.run(HealthMonitor.java:773)
at java.lang.Thread.run(Thread.java:637)

#]

[#|2010-07-14T08:07:08.119-0700|WARNING|glassfish3.1|ShoalLogger|_ThreadID=45;_ThreadName=HealthMonitor for Group:clusterXYZ;|null, send returned false|#]

[#|2010-07-14T08:07:10.103-0700|WARNING|glassfish3.1|ShoalLogger|_ThreadID=30;_ThreadName=HealthMonitor for Group:clusterABC;|Failed to send message
java.io.IOException: Can't assign requested address
at java.net.PlainDatagramSocketImpl.send(Native Method)
at java.net.DatagramSocket.send(DatagramSocket.java:612)
at com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender.doBroadcast(BlockingIOMulticastSender.java:250)
at com.sun.enterprise.mgmt.transport.AbstractMulticastMessageSender.broadcast(AbstractMulticastMessageSender.java:65)
at com.sun.enterprise.mgmt.transport.grizzly.GrizzlyNetworkManager.broadcast(GrizzlyNetworkManager.java:538)
at com.sun.enterprise.mgmt.HealthMonitor.send(HealthMonitor.java:830)
at com.sun.enterprise.mgmt.HealthMonitor.reportMyState(HealthMonitor.java:726)
at com.sun.enterprise.mgmt.HealthMonitor.run(HealthMonitor.java:773)
at java.lang.Thread.run(Thread.java:637)

#]


 Comments   
Comment by Tom Mueller [ 14/Jul/10 ]

GMS issue.

Comment by Joe Fialli [ 14/Jul/10 ]

accepting this issue.

Comment by Joe Fialli [ 14/Jul/10 ]

Will immediately fix the issue that the same error is repeatedly reported.
No matter what the underlying problem is, we cannot output the same error message that
many times.

Could the reporter attach a server.log that contains ShoalLogger CONFIG output
and one of the server.log files with the WARNINGs in it? (Perhaps it can be the same
server.log.) It would be helpful to have a server.log from the DAS
(which is under ~/Awork/V3/v3/dist-gf/glassfish/domains/domain1) and one from
the clustered instance (which would be found under
~/Awork/V3/v3/dist-gf/glassfish/nodeagent/<nameofinstance>/logs/).

Would like to see if both instances were experiencing the same issue.

Since this issue states that the system may have gone to sleep, it would
help to set the OS and Platform fields to reflect the system on which this issue
occurred.

Comment by Joe Fialli [ 14/Jul/10 ]

Recreated the issue of excessive log messages by pulling the network cable for the IP address
that the glassfish app server was using. The problem stopped once I put the cable back
in. So the issue could have been the result of shutting off a wireless network
(perhaps by sleep) or pulling an ethernet cable from the system (assuming that
this issue occurred on a laptop).

Comment by Anissa Lam [ 14/Jul/10 ]

I will be attaching the 2 server.log files, one from the DAS which has all those WARNINGs
and another one from the instance ABC-2. I don't see the problem in the instance log file.

Yes, the internet connection is terminated when the machine goes to sleep; that may explain it.
Updated platform and OS; I am using Mac OS X 10.5.8.

Comment by Anissa Lam [ 14/Jul/10 ]

Created an attachment (id=4577)
DAS log file

Comment by Anissa Lam [ 14/Jul/10 ]

Created an attachment (id=4578)
server.log from the instance

Comment by Joe Fialli [ 15/Jul/10 ]

Fix checked in to shoal/gms. Will be fixed by M3.

The next shoal/gms integration into gf v3.1 will pick this up.

A single WARNING message will now be reported only once every 2 hours per message
send that fails.
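
One way to confirm the throttling after picking up the fix (a sketch only, assuming a Unix-like shell and the log location shown in the description) is to count how often the warning appears in each log file:

# count occurrences of the repeated warning per (rotated) server log
grep -c "Failed to send message" ~/Awork/V3/v3/dist-gf/glassfish/domains/domain1/logs/server.log*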

Comment by Joe Fialli [ 19/Jul/10 ]

GMS jars integrated for M3.






[GLASSFISH-12563] upgrade from v2.1 cluster & gms element's to v3.1 Created: 07/Jul/10  Updated: 18/Aug/10  Resolved: 18/Aug/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms04

Type: New Feature Priority: Critical
Reporter: Joe Fialli Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 12,563

 Description   

Upgrade the v2.1 cluster and group-management-service elements'
attributes/properties to the v3.1 cluster/group-management-service equivalents.

See http://wikis.sun.com/download/attachments/209654762/gmsconfig_gfv3_1.rtf
for how v2.1 maps to v3.1.



 Comments   
Comment by Bobby Bissett [ 03/Aug/10 ]

Already part of the planning for cluster upgrade.

*** This issue has been marked as a duplicate of 12738 ***
Comment by Bhakti Mehta [ 09/Aug/10 ]

Bobby,
I have upgraded the cluster elements as part of issue 12738 and will add a devtest
for that. There are still changes in the gms config which need to be taken care of.
Would you be taking care of those? Let me know.
Regards,
Bhakti

Comment by Bobby Bissett [ 10/Aug/10 ]

Can you give me more information? What else needs to happen that isn't part of
issue 12738? Do you just want me to pick up from where you started?

Comment by Bhakti Mehta [ 10/Aug/10 ]

Bobby,
I think you will have to take care of these, basically the config changes
related to gms, for example:

From: fd-protocol-timeout-in-millis
To: failure-detection.heartbeat-frequency-in-millis

From: fd-protocol-max-tries
To: failure-detection.max-missed-heartbeats

From: vs-protocol-timeout-in-millis
To: failure-detection.verify-failure-waittime-in-millis

  • Pro
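
For illustration only, the v3.1 target names above surface as asadmin dotted names under the cluster's config. A sketch, assuming a cluster configuration named mycluster-config and example values (the exact dotted prefix may differ in your domain):

# set the v3.1 failure-detection settings on the cluster's config
asadmin set mycluster-config.group-management-service.failure-detection.max-missed-heartbeats=3
asadmin set mycluster-config.group-management-service.failure-detection.heartbeat-frequency-in-millis=2000
asadmin set mycluster-config.group-management-service.failure-detection.verify-failure-waittime-in-millis=1500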
Comment by Joe Fialli [ 18/Aug/10 ]

Completed implementation and manual testing.
Automated testing to be placed in admin devtesting. (will be added shortly)





[GLASSFISH-12321] split shoal-gms-1.x.jar into an api and impl jar Created: 21/Jun/10  Updated: 01/Jul/10  Resolved: 01/Jul/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms03

Type: New Feature Priority: Critical
Reporter: Joe Fialli Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 12,321

 Description   

Only shoal-gms-api-1.x.jar should be loaded in glassfish v3.1 when gms is not
being used. For the DAS, gms is only used if there exists at least one cluster that
has gms-enabled set to true. For a clustered instance, gms should only be enabled
if the cluster that the instance belongs to has gms-enabled set to true. (Note that
the cluster attribute gms-enabled defaults to true and only shows up in domain.xml if a
cluster element has gms-enabled set to false.)
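
For reference, a sketch of how gms-enabled can be controlled from the command line (it assumes a cluster named mycluster; verify the option and dotted names against your build):

# create a cluster with GMS turned off; gms-enabled then appears in domain.xml
asadmin create-cluster --gmsenabled=false mycluster

# or flip the attribute on an existing cluster
asadmin set clusters.cluster.mycluster.gms-enabled=false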



 Comments   
Comment by Joe Fialli [ 21/Jun/10 ]

set to introduce shoal-gms-api-1.x.jar and shoal-gms-impl-1.x.jar by M3.
impl jar will only get loaded when gms is actually being used. shoal-gms-api is
always loaded.

Comment by Bobby Bissett [ 30/Jun/10 ]

"You break it, you own it."

Comment by Bobby Bissett [ 01/Jul/10 ]

The split shoal-gms jars are now available in maven, version 1.5.3. They have
been incorporated into GlassFish as of revision 38263.

Log message from commit:
Switched to a newer version of shoal in order to use the new shoal-gms-api and
shoal-gms-impl jars. The gms bootstrap module in GF should now
depend only on the api; the impl module and gms adapter module will be loaded
only when needed.

Also changed the name of the gms adapter service from GmsAdapterService to
GMSAdapterService for consistency.





[GLASSFISH-12196] Update external library Shoal GMS to meet GF v3.1 logging requirements Created: 09/Jun/10  Updated: 26/Nov/10  Resolved: 15/Sep/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms05

Type: Bug Priority: Critical
Reporter: Joe Fialli Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Dependency
blocks GLASSFISH-13166 umbrella feature for gms logging Resolved
Issuezilla Id: 12,196

 Description   

Place ALL SEVERE, WARNING, and INFO log messages into logstrings.properties. Each
message must have an event id (GMS-050).



 Comments   
Comment by Joe Fialli [ 15/Sep/10 ]

completed.

Updated both glassfish v3.1 cluster/gms-bootstrap and cluster/gms-adapter.
Updated external workspace shoal/gms to meet these logging requirements.





[GLASSFISH-12195] multicast enabled diagnostic utility Created: 09/Jun/10  Updated: 18/Aug/10  Resolved: 18/Aug/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms04

Type: New Feature Priority: Critical
Reporter: Joe Fialli Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 12,195

 Description   

A diagnostic tool to determine if multicast is enabled for a subnet.

This is more of a packaging and documentation effort than an engineering one.



 Comments   
Comment by Bobby Bissett [ 03/Aug/10 ]

Code is finished for the tool, along with a script (in the Shoal workspace) to run it. Need
to find out in the tech meeting how it should be integrated into GF. If we want an
asadmin command for it, that should take <= 1 day.

Comment by Bobby Bissett [ 18/Aug/10 ]

This has been integrated. The asadmin command is:

asadmin validate-multicast
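
A usage sketch follows: run it on each machine in the subnet at roughly the same time, and every host should report seeing the others. The options shown are assumptions, so check asadmin validate-multicast --help for the exact set:

asadmin validate-multicast

# if the cluster uses a non-default multicast address/port, pass them explicitly
asadmin validate-multicast --multicastaddress 228.9.3.1 --multicastport 2048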





[GLASSFISH-12194] Monitoring Stats Provider Created: 09/Jun/10  Updated: 17/Oct/12

Status: Open
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: future release

Type: New Feature Priority: Major
Reporter: Joe Fialli Assignee: Joe Fialli
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 12,194

 Description   

Message throughput, thread utilization, number of detected SUSPECTED events, and number of
FAILURES.



 Comments   
Comment by Joe Fialli [ 18/Aug/10 ]

deferred to ms5

Comment by Joe Fialli [ 15/Sep/10 ]

Will implement this to be used for development testing.

It will be exposed to end users in 3.2.





[GLASSFISH-12192] REJOIN subevent for JOIN and JOINED_AND_READY Created: 09/Jun/10  Updated: 14/Jul/10  Resolved: 14/Jul/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms03

Type: New Feature Priority: Blocker
Reporter: Joe Fialli Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 12,192

 Description   

Specified in
http://wiki.glassfish.java.net/attach/V3FunctionalSpecs/gms_gfv3_1_onepager.txt.

This functionality is required when the app server is registered as a native OS service and
it restarts faster than the default GMS heartbeat failure detection can detect the failure.

With this implementation, when GMS notices that an instance has restarted and
no FAILURE notification was sent out, the rejoining instance will have a REJOIN
subevent that represents the past failure. GMS clients must register a handler
for JOIN or JOINED_AND_READY and check for the REJOIN subevent to identify all
cases in which a previous gms member has FAILED.



 Comments   
Comment by Bobby Bissett [ 30/Jun/10 ]

Bobby needs things to keep him out of trouble.

Comment by Bobby Bissett [ 14/Jul/10 ]

Fixed in shoal revision 1057.





[GLASSFISH-12191] Introduce GMS GroupHandle.getPreviousAliveOrReadyCoreView() Created: 09/Jun/10  Updated: 14/Jul/10  Resolved: 14/Jul/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms03

Type: New Feature Priority: Critical
Reporter: Joe Fialli Assignee: Joe Fialli
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 12,191

 Description   

Required method for HA.

Specified in
http://wiki.glassfish.java.net/attach/V3FunctionalSpecs/gms_gfv3_1_onepager.txt



 Comments   
Comment by Joe Fialli [ 14/Jul/10 ]

Initial implementation delivered into the shoal gms workspace.

See GroupHandle.getPreviousAliveAndReadyCoreView() and getCurrentAliveAndReadyCoreView().

Also introduced AliveAndReadyView, which is returned by the above two methods.





[GLASSFISH-12190] Configure Shoal GMS via domain.xml Created: 09/Jun/10  Updated: 22/Jun/10  Resolved: 22/Jun/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms02

Type: New Feature Priority: Blocker
Reporter: Joe Fialli Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 12,190

 Description   

Implement the following specification in Glassfish v3.1.

http://wiki.glassfish.java.net/PageInfo.jsp?page=GlassFishv3.1GMS/gmsconfig_gfv3_1.rtf



 Comments   
Comment by Joe Fialli [ 22/Jun/10 ]

Integrated into M2.

Documentation of configuration is in following document.

http://wiki.glassfish.java.net/PageInfo.jsp?page=GlassFishv3.1GMS/gmsconfig_gfv3_1.rtf

This document was AS arch reviewed on June 22, 2010.
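
Once integrated, the resulting settings can be inspected with asadmin dotted names; a sketch, assuming a cluster configuration named mycluster-config (the exact prefix may differ in your domain):

# list all group-management-service settings for a cluster's config
asadmin get "mycluster-config.group-management-service.*"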





[GLASSFISH-12189] Integrate Shoal GMS using tmp property files to configure shoal Created: 09/Jun/10  Updated: 22/Jun/10  Resolved: 22/Jun/10

Status: Resolved
Project: glassfish
Component/s: group_management_service
Affects Version/s: 3.1
Fix Version/s: 3.1_ms02

Type: New Feature Priority: Blocker
Reporter: Joe Fialli Assignee: Bobby Bissett
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issuezilla Id: 12,189

 Description   

Enable module cluster/gms-adapter to provide GMSService.
Users will access GMSService with the following line:

@Inject(optional="true") GMSService gmsService;

If gms-enabled is true for the cluster that the instance belongs to,
then gmsService will be set; otherwise it will not be.

The format of the temporary property files and the scripts to distribute them are being
worked on.



 Comments