17 Replies. Latest reply: Jan 23, 2012 10:47 AM by Lin Ye

Replication timeout with UDP

Lin Ye Novice

We are running performance tests against our application, which uses Infinispan, and tried to compare TCP with UDP. One issue we noticed is that, for a chunk of data of the same size, TCP works fine while UDP fails with "Replication timeout". I have a couple of questions:

  1. We were doing synchronous update, and noticed the "Replication timeout" on the node where update is done; however, the remote node running a listener did receive notification on data being updated. Is this desirable? Could we end up with partially updated and inconsistent data?
  2. Although we can chunk the data into small pieces for each update, I am wondering whether there is any JGroups UDP configuration we can tweak to achieve what TCP can do. Attached are our TCP and UDP configuration files. Please suggest what changes could help.
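
The chunking mentioned in question 2 could look roughly like the sketch below. `Batches.split` and the chunk size of 1000 are hypothetical and only illustrate the idea; each resulting chunk would be passed to its own `cache.putAll()` call.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class Batches {
    /** Splits a map into consecutive chunks of at most maxEntries entries each. */
    static <K, V> List<Map<K, V>> split(Map<K, V> source, int maxEntries) {
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        List<Map<K, V>> chunks = new ArrayList<>();
        Map<K, V> current = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : source.entrySet()) {
            current.put(e.getKey(), e.getValue());
            if (current.size() == maxEntries) {
                chunks.add(current);
                current = new LinkedHashMap<>();
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current); // last, possibly smaller, chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Hypothetical batch: 5000 entries with 890-byte placeholder values.
        Map<String, byte[]> batch = new LinkedHashMap<>();
        for (int i = 0; i < 5000; i++) {
            batch.put("id-" + i, new byte[890]);
        }
        List<Map<String, byte[]>> chunks = split(batch, 1000);
        System.out.println(chunks.size() + " chunks");
        // Each chunk would then be written separately, e.g. cache.putAll(chunk).
    }
}
```

Smaller chunks mean more round trips for a synchronous cache, so the chunk size is something to measure rather than guess.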

 

Thanks in advance.

 

Lin

  • 1. Re: Replication timeout with UDP
    Manik Surtani Master

    Lin Ye wrote:

    1. We were doing synchronous update, and noticed the "Replication timeout" on the node where update is done; however, the remote node running a listener did receive notification on data being updated. Is this desirable? Could we end up with partially updated and inconsistent data?

    The replication timeout may have pertained to any of the data owners. For example, a put() may have to replicate to 2 separate data owners, and one of them may have received the update while the other may not have. If you were not running in a transaction, this could lead to inconsistencies; if you were running in a transaction, the first owner would roll back its update.

  • 2. Re: Replication timeout with UDP
    Bela Ban Master

    Re 2: what's the average size of the keys/values (mainly values) you put into Infinispan ? For a 9 node cluster running on UDP, updating (20% writes, 80% reads) 1K values at all nodes, I can get ca 22'000 accesses / node / sec consistently, just to give you an idea of the type of performance to expect. This is on a 1 gig network.

     

    So UDP definitely works, you must have misconfigured something...

     

    The diff between TCP and UDP is that UDP scales much better to larger clusters for cluster-wide messages (e.g. Infinispan's invalidation in its current form, stability messages, view installations, failure detection and so on).

     

    You didn't post your config...

    Cheers,

  • 3. Re: Replication timeout with UDP
    Lin Ye Novice

    Bela/Manik,

     

    Thanks for the response.

     

    Our key is a string with a length ranging from 11 to 16. For the value, if I wrap a ByteArrayOutputStream in an ObjectOutputStream, and write my value object into it, the size is around 890. For our TCP configuration, we can write 100k objects in a putAll() without a problem. However, we were seeing replication timeout when we reached 5k objects in a batch for our UDP configuration.
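
The size measurement described above can be reproduced with plain JDK serialization. This is a minimal sketch: `SerializedSize` is a hypothetical helper name, and the number it reports applies to java.io serialization only. Infinispan's own marshalling may produce a different size on the wire.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializedSize {
    /** Returns the number of bytes standard Java serialization produces for one object. */
    static int of(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        // A 16-character key; the stream header and type tag add a few bytes on top.
        System.out.println(of("ABCDEFGHIJKLMNOP") + " bytes for a 16-char String");
    }
}
```

For custom value classes, the class descriptor (class name, field names and types) is written into the stream as well, which is one common reason a small object serializes to several hundred bytes.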

     

    Sorry I forgot to attach the configuration files. Here they are now, and I'd appreciate your suggestions.

     

    Furthermore, like you, I would expect UDP to scale better than TCP. However, during my test earlier last year with 10 nodes in replicated mode, I didn't see a huge difference. We'll re-run the tests after we fix the above UDP issue, and share the comparison results with you.

     

    Thanks,

    Lin

  • 4. Re: Replication timeout with UDP
    Bela Ban Master

    Our key is a string with a length ranging from 11 to 16. For the value, if I wrap a ByteArrayOutputStream in an ObjectOutputStream, and write my value object into it, the size is around 890.

     

    What ? A string of 16 chars gets blown up into 890 bytes ? This is certainly not the core issue, but I suggest looking into making this smaller (maybe Infinispan's marshalling code already does this), so you can send more key/value modifications across the wire !

     

    For our TCP configuration, we can write 100k objects in a putAll() without a problem. However, we were seeing replication timeout when we reached 5k objects in a batch for our UDP configuration.

     

    You mean a cache.put(key,val) where val is ca 100K ?

     

    Do you have an estimate of what the total data is in your batch ? I assume you're modifying more than one single 5K value, aren't you ?

     

    So I assume you're using synchronous replication ? If you enable logging, what do you see on the receiver side ?

     

    Is this reproducible ?

     

     

    Sorry I forgot to attach the configuration files. Here they are now, and I'd appreciate your suggestions.

     

    The UDP config looks OK, although you could use the same values for FC in udp.xml that you use in tcp.xml.

     

     

    Furthermore, like you, I would expect UDP to scale better than TCP. However, during my test earlier last year with 10 nodes in replicated mode, I didn't see a huge difference. We'll re-run the tests after we fix the above UDP issue, and share the comparison results with you.

     

    Might be a network / switch issue. Difficult to diagnose from here... Make sure you apply [1], this might help. If you use replication, you might also run the JGroups perf test with udp.xml, to see whether the max perf can be reached in your environment.

    If you use distribution, there's another test you can run, but I think you use replication, correct ?

     

    [1] http://community.jboss.org/wiki/PerfTuning

  • 5. Re: Replication timeout with UDP
    Lin Ye Novice

    Sorry, we had a network issue on Friday, and I couldn't respond to you.

     

    Basically, the 16-char string is the ID of our value object. As I mentioned, if I write the value object into an ObjectOutputStream that wraps a ByteArrayOutputStream, the size is 890. However, when I call cache.putAll(), what I actually passed in is an object that got a Map interface, with 5k key-value mappings (in the case of UDP) in the map. And the object with the Map interface got marshalled by the JBoss marshalling, and I can't really tell the size of the Map after marshalling.

     

    We used a timer to schedule an update of 5k key-value pairs every 1 second. How the timer works is that if the update finishes within 1 sec, it would schedule an update the next second; otherwise (like in our timeout case), it would schedule the next update after the prior one finishes.
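
The scheduling rule described above can be written down as a small function. This is only a sketch of the rule as stated in the post, not of the actual timer code, which was not posted; `UpdateSchedule` and the durations in `main` are hypothetical.

```java
final class UpdateSchedule {
    /**
     * Start time of the next update: one period after the previous start if the
     * update finished in time, otherwise as soon as the previous update finished.
     */
    static long nextStartMillis(long previousStartMillis, long durationMillis, long periodMillis) {
        return previousStartMillis + Math.max(durationMillis, periodMillis);
    }

    public static void main(String[] args) {
        long periodMillis = 1000;
        long[] durationsMillis = {400, 2500, 900}; // hypothetical update durations
        long start = 0;
        for (long duration : durationsMillis) {
            System.out.println("update starts at " + start + " ms and runs " + duration + " ms");
            start = nextStartMillis(start, duration, periodMillis);
        }
    }
}
```

One consequence of this rule is that the offered load drops whenever an update overruns its period, so slow updates throttle the writer instead of piling up.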

     

    Yes, we used synchronous replication, and it is reproducible. I looked at the Perf Tuning page you recommended, and tuning Ethernet flow control may help. I'll try that and see how it goes. Will update you later.

  • 6. Re: Replication timeout with UDP
    Lin Ye Novice

    Hi Bela,

     

    I finally got around to trying what you suggested. I tried flow control per [1] above. The following is my configuration:

    Pause parameters for eth0:

    Autonegotiate:  off

    RX:             on

    TX:             on

     

    With the above configuration, I ran the perf test. Here is the command I used:

    java -cp "./*" -Djgroups.bind_addr=<address> -server -Xmx600M -Xms400M -XX:+UseParallelGC -XX:+AggressiveHeap -XX:CompileThreshold=100 -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=31 -Dcom.sun.management.jmxremote -Dresolve.dns=false org.jgroups.tests.perf.Test -sender -config "/home/ssi/jgroups_test/conf/config.txt" -props "/home/ssi/jgroups_test/conf/udp.xml" (tcp.xml for TCP).

     

    Over a number of runs I got varying results. However, even my best results are not close to what you got. My UDP result looks like this:

    Configuration is:

     

    ----------------------- TEST -----------------------

    Date: Thu Jul 28 14:16:18 EDT 2011

    Run by: ssi

     

    log_interval:   1000000

    msg_size:       1000

    config: /home/ssi/jgroups_test/conf/config.txt

    transport:      org.jgroups.tests.perf.transports.JGroupsTransport

    num_senders:    2

    sender: true

    props:  /home/ssi/jgroups_test/conf/udp.xml

    num_msgs:       1000000

    num_members:    2

    JGroups version: 2.10.0.GA

     

    Jul 28, 2011 2:16:18 PM org.jgroups.logging.JDKLogImpl info

    INFO: JGroups version: 2.10.0.GA

     

    -------------------------------------------------------------------

    GMS: address=dt-ly-l-ssi01-17011, cluster=perf, physical address=10.44.44.134:38107

    -------------------------------------------------------------------

    -- dt-ly-l-ssi01-17011 joined

    -- waiting for 2 members to join

    -- dt-ly-l-ssi02-37957 joined

    -- READY (2 acks)

     

    -- sending 1000000 1KB messages

    -- received 1000000 messages (34956 ms, 28607.39 msgs/sec, 28.61MB/sec)

    ++ sent 1000000

    -- received 2000000 messages (12552 ms, 79668.58 msgs/sec, 79.67MB/sec)

     

    -- results:

     

    dt-ly-l-ssi01-17011 (myself):

    num_msgs_expected=2000000, num_msgs_received=2000000 (loss rate=0.0%), received=2GB, time=47508ms, msgs/sec=42098.17, throughput=42.1MB

     

    dt-ly-l-ssi02-37957:

    num_msgs_expected=2000000, num_msgs_received=2000000 (loss rate=0.0%), received=2GB, time=47478ms, msgs/sec=42124.77, throughput=42.12MB

     

    combined: 42111.47 msgs/sec averaged over all receivers (throughput=42.11MB/sec)
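
The per-node summary figures above follow from the message count and the elapsed time. The sketch below redoes that arithmetic for the first node (2000000 messages of 1000 bytes in 47508 ms); `PerfMath` is a hypothetical helper, and it assumes the tool counts 1 MB as 1000000 bytes, which matches the printed numbers.

```java
final class PerfMath {
    /** Messages per second for a given message count and elapsed time in milliseconds. */
    static double msgsPerSec(long messages, long elapsedMillis) {
        return messages * 1000.0 / elapsedMillis;
    }

    /** Throughput in MB/sec, taking 1 MB as 1,000,000 bytes (assumption). */
    static double megabytesPerSec(long messages, int msgSizeBytes, long elapsedMillis) {
        return msgsPerSec(messages, elapsedMillis) * msgSizeBytes / 1_000_000.0;
    }

    public static void main(String[] args) {
        long messages = 2_000_000;
        long elapsedMillis = 47_508;
        System.out.printf("%.2f msgs/sec, %.2f MB/sec%n",
                msgsPerSec(messages, elapsedMillis),
                megabytesPerSec(messages, 1000, elapsedMillis));
    }
}
```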

     

    The TCP results looked bad:

    Configuration is:

     

    ----------------------- TEST -----------------------

    Date: Thu Jul 28 14:35:21 EDT 2011

    Run by: ssi

     

    log_interval:   1000000

    msg_size:       1000

    config: /home/ssi/jgroups_test/conf/config.txt

    transport:      org.jgroups.tests.perf.transports.JGroupsTransport

    num_senders:    2

    sender: true

    props:  /home/ssi/jgroups_test/conf/tcp.xml

    num_msgs:       1000000

    num_members:    2

    JGroups version: 2.10.0.GA

     

    Jul 28, 2011 2:35:21 PM org.jgroups.logging.JDKLogImpl info

    INFO: JGroups version: 2.10.0.GA

     

    -------------------------------------------------------------------

    GMS: address=dt-ly-l-ssi02-346, cluster=perf, physical address=10.44.44.8:7800

    -------------------------------------------------------------------

    -- dt-ly-l-ssi02-346 joined

    -- waiting for 2 members to join

    -- dt-ly-l-ssi01-13097 joined

    -- READY (2 acks)

     

    -- sending 1000000 1KB messages

    -- received 1000000 messages (639760 ms, 1563.09 msgs/sec, 1.56MB/sec)

    ++ sent 1000000

    -- received 2000000 messages (410661 ms, 2435.1 msgs/sec, 2.44MB/sec)

     

    -- results:

     

    dt-ly-l-ssi01-13097:

    num_msgs_expected=2000000, num_msgs_received=2000000 (loss rate=0.0%), received=2GB, time=1050494ms, msgs/sec=1903.87, throughput=1.9MB

     

    dt-ly-l-ssi02-346 (myself):

    num_msgs_expected=2000000, num_msgs_received=2000000 (loss rate=0.0%), received=2GB, time=1050421ms, msgs/sec=1904, throughput=1.9MB

     

    combined: 1903.93 msgs/sec averaged over all receivers (throughput=1.9MB/sec)

     

    What do you think the problem is in my case? Also, the above data indicates that UDP performs much better than TCP. However, when we used Infinispan, we didn't have any replication issue with TCP. On the contrary, the replication timeout happened with UDP. When I was running the test, the 2 nodes were pretty much idle apart from the Infinispan update. The update is consistently done at node1. Attached are the log files from the 2 nodes. The first replication timeout occurred around 18:32, and there were GMS warnings in both logs around that time. However, I didn't see GMS warnings in all cases of replication timeout. Sometimes I got the timeout without any JGroups-related log entries around that time.

     

    My send/receive buffer configuration for JGroups is as I attached before. The OS-level send/receive buffer setup is as follows:

    net.ipv4.udp_wmem_min = 4096

    net.ipv4.udp_rmem_min = 4096

    net.ipv4.udp_mem = 385920       514560  771840

    net.ipv4.tcp_rmem = 4096        87380   20981520

    net.ipv4.tcp_wmem = 4096        16384   4194304

    net.ipv4.tcp_mem = 196608       262144  393216

    net.ipv4.igmp_max_memberships = 20

    net.core.optmem_max = 20480

    net.core.rmem_default = 129024

    net.core.wmem_default = 129024

    net.core.rmem_max = 25165824

    net.core.wmem_max = 4194304

    vm.lowmem_reserve_ratio = 256   256     32

    vm.overcommit_memory = 0

     

    I would appreciate your feedback, as we do need to address this issue.

     

    Regards,

    Lin

  • 7. Re: Replication timeout with UDP
    Lin Ye Novice

    Sorry I forgot to attach the logs again. Here they are.

  • 8. Re: Replication timeout with UDP
    Mircea Markus Master

    However, when I call cache.putAll(), what I actually passed in is an object that got a Map interface, with 5k key-value mappings (in the case of UDP) in the map. And the object with the Map interface got marshalled by the JBoss marshalling, and I can't really tell the size of the Map after marshalling.

    A putAll serializes all the key-value pairs passed (5k in your scenario) into a byte array and sends this byte array to all the nodes that might use it, i.e. in case of distribution, to all the nodes that own at least one key. With a 5k putAll, there's a high chance for this list to be the entire cluster.

    I think grouping the keys by their main owner and doing multiple putAll operations would significantly increase your performance. Also, this is something that Infinispan can/should be able to do for you, so if you find this useful please create a JIRA to add such a feature.
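
Grouping a batch by owner before writing it could be sketched as follows. The `ownerOf` function is a hypothetical stand-in for however the main owner of a key is determined; it is not an Infinispan API, and the hash-based owner choice in `main` is for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class OwnerGrouping {
    /** Splits one batch into one sub-map per owner, as decided by the supplied function. */
    static <K, V, O> Map<O, Map<K, V>> groupByOwner(Map<K, V> batch, Function<K, O> ownerOf) {
        Map<O, Map<K, V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : batch.entrySet()) {
            grouped.computeIfAbsent(ownerOf.apply(e.getKey()), owner -> new LinkedHashMap<>())
                   .put(e.getKey(), e.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, String> batch = new LinkedHashMap<>();
        for (int i = 0; i < 10; i++) {
            batch.put("id-" + i, "value-" + i);
        }
        // Hypothetical owner lookup: spread keys over three node names by hash.
        Function<String, String> ownerOf = key -> "node" + Math.floorMod(key.hashCode(), 3);
        groupByOwner(batch, ownerOf).forEach((owner, part) ->
                System.out.println(owner + " gets " + part.size() + " entries"));
        // Each sub-map would then go into its own cache.putAll(part) call.
    }
}
```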

  • 9. Re: Replication timeout with UDP
    Lin Ye Novice

    Hi Mircea,

     

    Thanks for sharing thoughts.

     

    My current test is against a replicated cache, and there is no "main owner" at the moment. So at this point I am more concerned about my environment and this replication issue with UDP in particular. I don't know if it's a network/switch issue or something else. I need your help to track down the root cause of the problem.

     

    Thanks,

    Lin

  • 10. Re: Replication timeout with UDP
    Bela Ban Master

    Hard to say what's going on, without access to the system you're running the test on.

     

    A few things that I think should be changed though are:

    • rmem_max of 25MB is low for a perf test, this should be increased
    • 100 iterations before the JIT starts generating machine code (-XX:CompileThreshold=100 in your command line) is a bit on the low side, this will result in a slow speed ramp up. I suggest putting it back to 10'000 (the default, or leave it out), and using -warmup 200000. I also suggest using a larger number of messages
      • For 2 nodes, you should get ~125MBytes/sec/node on a 1Gbit network !
    • Check your switches/firewalls etc; this is out of my control
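
The "100 iterations" in the list above corresponds to the -XX:CompileThreshold=100 flag in the command line posted earlier. A running JVM can report whether that flag was passed; the sketch below does so with the standard management API. `JitFlags` is a hypothetical helper name.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.OptionalInt;

final class JitFlags {
    private static final String PREFIX = "-XX:CompileThreshold=";

    /** Returns the explicitly configured compile threshold, if the flag is present. */
    static OptionalInt compileThreshold(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(PREFIX)) {
                return OptionalInt.of(Integer.parseInt(arg.substring(PREFIX.length())));
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Arguments the current JVM was started with (not the program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        OptionalInt threshold = compileThreshold(jvmArgs);
        System.out.println(threshold.isPresent()
                ? "CompileThreshold explicitly set to " + threshold.getAsInt()
                : "CompileThreshold not set on the command line (JVM default applies)");
    }
}
```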

     

    Running the perf test is not your original issue, but the low numbers you get point to some configuration/network issue...

     

    With only 2 nodes, you should be getting perf that's close to the bandwidth, ca 125MBytes/sec/node !
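
The 125 MBytes/sec figure is simply the 1 Gbit/s line rate divided by 8 bits per byte. As a rough comparison, the sketch below relates the combined 42.11 MB/sec from the UDP run posted earlier to that ceiling. `LinkMath` is a hypothetical helper, and the calculation ignores protocol overhead.

```java
final class LinkMath {
    /** Converts a rate in Mbit/s to MB/s (8 bits per byte). */
    static double megabytesPerSec(double megabitsPerSec) {
        return megabitsPerSec / 8.0;
    }

    /** Measured throughput as a percentage of the nominal link rate. */
    static double utilizationPercent(double measuredMegabytesPerSec, double linkMegabitsPerSec) {
        return 100.0 * measuredMegabytesPerSec / megabytesPerSec(linkMegabitsPerSec);
    }

    public static void main(String[] args) {
        double linkMbps = 1000.0;     // 1 Gbit network
        double measuredMBps = 42.11;  // combined UDP result from the earlier run
        System.out.printf("ceiling: %.1f MB/sec, measured: %.1f%% of the link%n",
                megabytesPerSec(linkMbps), utilizationPercent(measuredMBps, linkMbps));
    }
}
```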

  • 11. Re: Replication timeout with UDP
    Lin Ye Novice

    Bela,

     

    Thanks for the response.

     

    I increased the rmem_max to 250MB per your suggestion, and it now looks like:

    net.core.rmem_max = 251658240

    net.core.wmem_max = 4194304
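
Whether a larger kernel limit actually reaches a JVM socket can be checked from Java, since the OS may grant a different buffer size than the one requested. The sketch below is an assumption-laden illustration: `UdpBufferCheck` is a hypothetical helper, the 25000000-byte request is arbitrary, and the JGroups buffer settings from the attached config are not reproduced here.

```java
import java.io.UncheckedIOException;
import java.net.DatagramSocket;
import java.net.SocketException;

final class UdpBufferCheck {
    /** Requests a receive buffer of the given size and returns what the OS reports back. */
    static int grantedReceiveBuffer(int requestedBytes) {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setReceiveBufferSize(requestedBytes); // only a hint to the OS
            return socket.getReceiveBufferSize();
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int requested = 25_000_000; // arbitrary example value
        int granted = grantedReceiveBuffer(requested);
        System.out.println("requested " + requested + " bytes, OS reports " + granted + " bytes");
    }
}
```

If the reported value stays well below the request, the kernel limit (or a per-socket cap) is likely still in the way.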

     

    I am not sure what the "100 iterations before JIT" refers to, and how to change that configuration. Regardless, I did use the warmup, and a larger number of messages. However, I still don't see a big difference, and the result is shown below.

     

    Also, when I switch between (autoneg=on, rx=off, tx=off) & (autoneg=off, rx=on, tx=on), there is no big difference either. So basically, for (autoneg=on, rx=off, tx=off), my result is close to what you posted. However, changing to (autoneg=off, rx=on, tx=on) doesn't bring any improvement. Does this behavior give you any hint? You mentioned checking switches/firewalls; what specific settings do you recommend looking at? I'd appreciate your suggestions.

     

    Configuration is:

     

    ----------------------- TEST -----------------------

    Date: Mon Aug 15 10:03:05 EDT 2011

    Run by: ssi

     

    log_interval:   1000000

    msg_size:       1000

    config: /home/ssi/jgroups_test/conf/config.txt

    transport:      org.jgroups.tests.perf.transports.JGroupsTransport

    num_senders:    2

    sender: true

    props:  /home/ssi/jgroups_test/conf/udp.xml

    num_msgs:       10000000

    num_members:    2

    JGroups version: 2.10.0.GA

     

    Aug 15, 2011 10:03:05 AM org.jgroups.logging.JDKLogImpl info

    INFO: JGroups version: 2.10.0.GA

     

    -------------------------------------------------------------------

    GMS: address=dt-ly-l-ssi02-30806, cluster=perf, physical address=10.44.44.8:55316

    -------------------------------------------------------------------

    -- dt-ly-l-ssi02-30806 joined

    -- waiting for 2 members to join

    -- dt-ly-l-ssi01-26706 joined

    -- READY (2 acks)

     

    sending 200000 warmup messages

    done

    -- sending 10000000 1KB messages

    -- received 1000000 messages (30856 ms, 32408.61 msgs/sec, 32.41MB/sec)

    -- received 2000000 messages (30066 ms, 33260.16 msgs/sec, 33.26MB/sec)

    ++ sent 1000000

    -- received 3000000 messages (26397 ms, 37883.09 msgs/sec, 37.88MB/sec)

    -- received 4000000 messages (25792 ms, 38771.71 msgs/sec, 38.77MB/sec)

    ++ sent 2000000

    -- received 5000000 messages (23577 ms, 42414.22 msgs/sec, 42.41MB/sec)

    -- received 6000000 messages (24441 ms, 40914.86 msgs/sec, 40.91MB/sec)

    ++ sent 3000000

    -- received 7000000 messages (26455 ms, 37800.04 msgs/sec, 37.8MB/sec)

    -- received 8000000 messages (27253 ms, 36693.21 msgs/sec, 36.69MB/sec)

    ++ sent 4000000

    -- received 9000000 messages (26900 ms, 37174.72 msgs/sec, 37.17MB/sec)

    -- received 10000000 messages (22045 ms, 45361.76 msgs/sec, 45.36MB/sec)

    ++ sent 5000000

    -- received 11000000 messages (25225 ms, 39643.21 msgs/sec, 39.64MB/sec)

    -- received 12000000 messages (24220 ms, 41288.19 msgs/sec, 41.29MB/sec)

    ++ sent 6000000

    -- received 13000000 messages (24025 ms, 41623.31 msgs/sec, 41.62MB/sec)

    -- received 14000000 messages (24647 ms, 40572.89 msgs/sec, 40.57MB/sec)

    ++ sent 7000000

    -- received 15000000 messages (24947 ms, 40084.98 msgs/sec, 40.08MB/sec)

    ++ sent 8000000

    -- received 16000000 messages (26292 ms, 38034.38 msgs/sec, 38.03MB/sec)

    -- received 17000000 messages (25332 ms, 39475.76 msgs/sec, 39.48MB/sec)

    ++ sent 9000000

    -- received 18000000 messages (25964 ms, 38514.87 msgs/sec, 38.51MB/sec)

    -- received 19000000 messages (23029 ms, 43423.51 msgs/sec, 43.42MB/sec)

    ++ sent 10000000

    -- received 20000000 messages (22555 ms, 44336.07 msgs/sec, 44.34MB/sec)

     

    -- results:

     

    dt-ly-l-ssi02-30806 (myself):

    num_msgs_expected=20000000, num_msgs_received=20000000 (loss rate=0.0%), received=20GB, time=510018ms, msgs/sec=39214.3, throughput=39.21MB

     

    dt-ly-l-ssi01-26706:

    num_msgs_expected=20000000, num_msgs_received=20000000 (loss rate=0.0%), received=20GB, time=509974ms, msgs/sec=39217.69, throughput=39.22MB

     

    combined: 39215.99 msgs/sec averaged over all receivers (throughput=39.22MB/sec)

  • 12. Re: Replication timeout with UDP
    Bela Ban Master

    Hi Lin,

     

    hard to say what's wrong, without having access to the system (HW/SW). Again, I'm using this on 4 HP blades with a 1 GB switch, and get good results, see [1].

     

    You could try out the following things:

    • iperf: run it with -u -b1000M as client and server, to see what the max performance is
    • Then run iperf with -B239.1.1.1, I mean pick a multicast address
      • Both iperf runs should get close to 1000 MBits/sec --> 125 MBytes/sec
    • Are both of your nodes hanging off of the same switch ?
    • If the switch is manageable, access it and see whether there is
      • traffic prioritization: turn this off, especially for UDP datagrams and multicasts
      • flow control: may need to turn it off, too
      • rate limiting: turn this off
    • Check your JVM options, e.g. do you use -Xms and -Xmx ? I use -Xms300m and -Xmx300m by default
      • Take a look at JGroups/bin/jgroups.sh [2]. These are the flags used in my perf test. You might want to use jgroups.sh as well (change $IP_ADDR).
    • If you have remote access to your test environment, I could take a look

     

    Hope this helps

     

    [1] http://www.jgroups.org/performance.html

     

    [2] https://github.com/belaban/JGroups/blob/master/bin/jgroups.sh

  • 13. Re: Replication timeout with UDP
    Lin Ye Novice

    Hi Bela,

     

    Thanks for the advice. I tried both tests you suggested, and the detailed results are below. Both tests got around 740 MBits/sec, and I am afraid that's not close enough. I don't have access to the switch myself, but I'll ask the lab people to try what you suggested. Meanwhile, do you think it's a switch issue? I'd appreciate your opinion.

     

    [ssi@SSELabClusterMaster src]$ ./iperf -c 10.80.8.2 -u -b 1000M

    ------------------------------------------------------------

    Client connecting to 10.80.8.2, UDP port 5001

    Sending 1470 byte datagrams

    UDP buffer size:  108 KByte (default)

    ------------------------------------------------------------

    [  3] local 10.80.8.1 port 48502 connected with 10.80.8.2 port 5001

    [ ID] Interval       Transfer     Bandwidth

    [  3]  0.0-10.0 sec   882 MBytes   740 Mbits/sec

    [  3] Sent 629362 datagrams

    [  3] Server Report:

    [  3]  0.0-10.0 sec   881 MBytes   739 Mbits/sec   0.048 ms 1136/629361 (0.18%)

    [  3]  0.0-10.0 sec  1 datagrams received out-of-order

     

    [ssi@SSELabClusterMaster src]$ ./iperf -c 10.80.8.2 -u -b 1000M -B239.1.1.1

    ------------------------------------------------------------

    Client connecting to 10.80.8.2, UDP port 5001

    Binding to local address 239.1.1.1

    Joining multicast group  239.1.1.1

    Sending 1470 byte datagrams

    UDP buffer size:  108 KByte (default)

    ------------------------------------------------------------

    [  3] local 239.1.1.1 port 5001 connected with 10.80.8.2 port 5001

    [ ID] Interval       Transfer     Bandwidth

    [  3]  0.0-10.0 sec   881 MBytes   739 Mbits/sec

    [  3] Sent 628606 datagrams

     

    Thanks,

    Lin

  • 14. Re: Replication timeout with UDP
    Bela Ban Master

    To try out multicasting with iperf, you'll need to do the following:

     

    server: iperf -us -i 2 -B238.1.1.1 -w 100M

    client: iperf -uc 238.1.1.1 -b 1000M
