X-Git-Url: https://code.wpia.club/?a=blobdiff_plain;f=lib%2Fopenssl%2Fcrypto%2Fbn%2Fasm%2Fia64.S;h=f2404a3c1e0ef702f9390ce2cb9f992b4df41b0d;hb=HEAD;hp=951abc53ea5bbae8e308643147fcaa26182e560b;hpb=f69f31caeda734d6d9c8ab00e693671ac7512bea;p=cassiopeia.git

diff --git a/lib/openssl/crypto/bn/asm/ia64.S b/lib/openssl/crypto/bn/asm/ia64.S
index 951abc5..f2404a3 100644
--- a/lib/openssl/crypto/bn/asm/ia64.S
+++ b/lib/openssl/crypto/bn/asm/ia64.S
@@ -3,6 +3,13 @@
 .ident	"ia64.S, Version 2.1"
 .ident	"IA-64 ISA artwork by Andy Polyakov <appro@fy.chalmers.se>"
 
+// Copyright 2001-2016 The OpenSSL Project Authors. All Rights Reserved.
+//
+// Licensed under the OpenSSL license (the "License"). You may not use
+// this file except in compliance with the License. You can obtain a copy
+// in the file LICENSE in the source distribution or at
+// https://www.openssl.org/source/license.html
+
 //
 // ====================================================================
 // Written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL
@@ -22,7 +29,7 @@
 // ports is the same, i.e. 2, while I need 4. In other words, to this
 // module Itanium2 remains effectively as "wide" as Itanium. Yet it's
 // essentially different in respect to this module, and a re-tune was
-// required. Well, because some intruction latencies has changed. Most
+// required. Well, because some instruction latencies has changed. Most
 // noticeably those intensively used:
 //
 //			Itanium	Itanium2
@@ -363,7 +370,7 @@ bn_mul_words:
 // The loop therefore spins at the latency of xma minus 1, or in other
 // words at 6*(n+4) ticks:-( Compare to the "production" loop above
 // that runs in 2*(n+11) where the low latency problem is worked around
-// by moving the dependency to one-tick latent interger ALU. Note that
+// by moving the dependency to one-tick latent integer ALU. Note that
 // "distance" between ldf8 and xma is not latency of ldf8, but the
 // *difference* between xma and ldf8 latencies.
 .L_bn_mul_words_ctop:
@@ -422,10 +429,10 @@ bn_mul_add_words:
 
 // This loop spins in 3*(n+10) ticks on Itanium and in 2*(n+10) on
 // Itanium 2. Yes, unlike previous versions it scales:-) Previous
-// version was peforming *all* additions in IALU and was starving
+// version was performing *all* additions in IALU and was starving
 // for those even on Itanium 2. In this version one addition is
 // moved to FPU and is folded with multiplication. This is at cost
-// of propogating the result from previous call to this subroutine
+// of propagating the result from previous call to this subroutine
 // to L2 cache... In other words negligible even for shorter keys.
 // *Overall* performance improvement [over previous version] varies
 // from 11 to 22 percent depending on key length.
@@ -495,7 +502,7 @@ bn_sqr_words:
 // scalability. The decision will very likely be reconsidered after the
 // benchmark program is profiled. I.e. if perfomance gain on Itanium
 // will appear larger than loss on "wider" IA-64, then the loop should
-// be explicitely split and the epilogue compressed.
+// be explicitly split and the epilogue compressed.
 .L_bn_sqr_words_ctop:
 { .mfi;	(p16)	ldf8		f32=[r33],8
 	(p25)	xmpy.lu		f42=f41,f41	}
@@ -568,7 +575,7 @@ bn_sqr_comba8:
 // I've estimated this routine to run in ~120 ticks, but in reality
 // (i.e. according to ar.itc) it takes ~160 ticks. Are those extra
 // cycles consumed for instructions fetch? Or did I misinterpret some
-// clause in Itanium µ-architecture manual? Comments are welcomed and
+// clause in Itanium µ-architecture manual? Comments are welcomed and
 // highly appreciated.
 //
 // On Itanium 2 it takes ~190 ticks. This is because of stalls on
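
For orientation, the comments retouched by this patch all belong to OpenSSL's word-level bignum primitives, and the one discussed at greatest length, bn_mul_add_words, is a multiply-accumulate over an array of machine words that returns the final carry. Below is a minimal C sketch of those semantics only; it is an illustration, not part of the patched file (which hand-schedules the loop in IA-64 assembly), and the unsigned __int128 intermediate and the _ref suffix are assumptions of this sketch.

typedef unsigned long BN_ULONG;           /* 64-bit word, as on IA-64 */

/* Reference semantics: for i in [0, num), rp[i] is replaced by the low
 * word of ap[i]*w + rp[i] + carry, and the final carry is returned.
 * The assembly commented on above folds the rp[i] addition into the
 * FPU multiply-add (xma), which is the optimisation it describes. */
BN_ULONG bn_mul_add_words_ref(BN_ULONG *rp, const BN_ULONG *ap,
                              int num, BN_ULONG w)
{
    BN_ULONG carry = 0;
    int i;

    for (i = 0; i < num; i++) {
        unsigned __int128 t = (unsigned __int128)ap[i] * w + rp[i] + carry;
        rp[i] = (BN_ULONG)t;              /* low 64 bits back to result */
        carry = (BN_ULONG)(t >> 64);      /* high 64 bits become carry  */
    }
    return carry;
}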