Cray Single Precision vs. IEEE 754 Double Precision

What is meant by single precision and double precision in floating point data types?

Single precision is a 32-bit type; double precision is a 64-bit type. Accordingly, doubles store a much broader range of values, and with much more precision.

The floating-point format is made up of a sign bit, exponent bits (which give you range), and fraction bits (which give you numerical precision). 32-bit floats have 1 sign bit, 8 exponent bits, and 24 fraction bits, whereas 64-bit floats have 1 sign bit, 11 exponent bits, and 53 fraction bits. (The first effective bit of the fraction is always implied rather than stored; that's how the totals add up to 32 and 64 bits, respectively.)

With 3 extra exponent bits (8 -> 11), a double's maximum exponent is 8x that of a single, leading to much greater range. And with 29 extra bits of precision (24 -> 53), a double is far more precise when representing problematic numbers such as 1/7 or pi.
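As a concrete illustration, here is a short Java sketch (Java's float and double are the same IEEE 754 single and double formats) showing the precision gap for 1/7:

```java
public class PrecisionDemo {
    public static void main(String[] args) {
        float f = 1.0f / 7.0f;   // ~7 significant decimal digits
        double d = 1.0 / 7.0;    // ~15-16 significant decimal digits

        System.out.println("float : " + f);
        System.out.println("double: " + d);

        // The single-precision value drifts from the double-precision
        // one around the 8th significant digit.
        System.out.println("difference: " + Math.abs((double) f - d));
    }
}
```

The difference printed at the end is on the order of a few billionths: invisible for casual use, but fatal when errors accumulate over millions of operations.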

How is a float or double value stored in memory in Java?

There used to be a bunch of ways of storing floats and doubles in the bad old days, but they were standardized (as IEEE 754) so that they could be hardware accelerated. Adding hardware acceleration to math was a major trend in computing back in the nineties, and was something of a predecessor to graphics acceleration (which uses floats).

Today, basically all floats in all languages use exactly the same storage method: a 32-bit word divided into three parts. Those three parts are called the sign (1 bit), the exponent (8 bits), and the mantissa (23 bits). Double precision numbers have a similar format, just with more bits and bigger values (1 for the sign, 11 for the exponent, and 52 for the mantissa, specifically). Note that the mantissa is also often called the "fraction."

If we call the sign s and consider it equal to 0 or 1, the exponent field x, and the mantissa field m, then the value of the number is usually:

[math]2^{x-150}(1-2s)(m+2^{23})[/math]

The value 1.0 is represented with s=0, m=0, x=127, for example. The extra [math]2^{23}[/math] added to m is called the "implicit leading bit," and is there because every binary number in exponent notation (except zero) starts with a 1, so the implied leading bit doesn't need to be stored. Note that this default rule cannot store a zero.

There are two important exceptions to this notation.

First, if x is 255 (the max value), the bit pattern is not an ordinary number, and its meaning depends on m: if m is 0, it is positive (s=0) or negative (s=1) infinity. If m is not zero, it is NaN (not-a-number). There are many legal ways to represent NaN, and this is sometimes used to differentiate between cases where NaN ought to raise an exception (signaling NaN) and where it ought not to (quiet NaN).

The second case is when x=0. In this case, the formula becomes:

[math]2^{-149}(1-2s)m[/math]

This allows an entire range of extra-small numbers, called denormalized (or subnormal) numbers, to be represented, and has the double advantage that if you set all the bits to zero, the encoded number is also zero. Another major advantage of this system is that integer comparisons (<, >, etc.) work in nearly the same way for floats as for integers!

Anyway, hopefully that helped; floats are a dense topic.
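In Java specifically, you can pull the three fields out of a float with Float.floatToIntBits, which returns exactly the bit pattern described above. A minimal sketch:

```java
public class FloatBits {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(1.0f);   // raw 32-bit pattern
        int sign     = bits >>> 31;              // 1 bit
        int exponent = (bits >>> 23) & 0xFF;     // 8 bits, biased by 127
        int mantissa = bits & 0x7FFFFF;          // 23 stored fraction bits

        System.out.printf("sign=%d exponent=%d mantissa=%d%n",
                          sign, exponent, mantissa);
        // 1.0f decodes as sign=0, exponent=127, mantissa=0, matching
        // 2^(127-150) * (1-2*0) * (0 + 2^23) = 2^-23 * 2^23 = 1.0
    }
}
```

Double.doubleToLongBits does the same for doubles, with an 11-bit exponent (bias 1023) and a 52-bit mantissa.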

In your favorite programming language, what is one feature that you indignantly refuse to use in your code due to the sheer stupidity of its existence?

Not indignant about it, but I essentially refuse to use the "float" type in C or C++.

Using it generally means a floating-point representation with poor precision and a poor range of values. For instance, in IEEE 754 single precision floating point (a typical "float" type), the smallest positive integer that can't be represented exactly is 16,777,217. That's small enough that you really just expect numbers like that to behave correctly. And numbers like [math]10^{40}[/math] can't be represented at all. Such numbers are well within the realm of numbers used to answer real-world questions, such as "how many atoms are in the Earth?" If you were to use the "float" type to calculate that number, it would likely end up returning a value of +Infinity, which isn't particularly useful.

Back in the 1970s or 1980s, it made a lot of sense to have a type with those limitations; performance of double precision floating point was poor. (In 1985 and 1986, I sped up a program by a factor of about 5x by converting it from floating point to fixed-point math.) But most modern computers have hardware floating point units, and double precision floating point doesn't give up much in the way of performance, so I'll pretty much always use double precision (the "double" type) when needed.

EDIT, in response to the comments about using less memory and vectorization of operations: while at Qualcomm, I worked with their HVX vector processing solution, which is part of their Snapdragon chips. It provides vector operations which work on 1024-bit vectors, but the individual units within the vector are fixed-point, not floating-point (8-bit/16-bit/32-bit, your choice as the developer). Qualcomm made the specific decision not to use floating point, as they found that floating point was algorithmically not needed for most computer vision and image processing applications.

So in a sense, in many cases, if you're choosing "float" over a fixed-point type based on "int32_t", you could be under-utilizing 8 bits (the exponent) out of every 32, and might get better results by using fixed point with all 32 bits.
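Both limits are easy to demonstrate in any language with IEEE 754 singles; a small Java sketch (Java's float is the same 32-bit format as a typical C "float"):

```java
public class FloatLimits {
    public static void main(String[] args) {
        // 2^24 + 1 = 16,777,217 is the smallest positive integer a
        // 32-bit float cannot represent; it rounds to 16,777,216.
        float f = 16_777_217f;
        System.out.println((int) f);            // prints 16777216

        // 10^40 is beyond float's range (max ~3.4e38) entirely.
        float huge = (float) 1e40;
        System.out.println(huge);               // prints Infinity

        // A double handles both cases without trouble.
        System.out.println((long) 16_777_217d); // prints 16777217
    }
}
```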

How did they manage to land Curiosity in a specific place?

The simplest answer is that you simulate landing where you want to land, and then run the movie in reverse to see what your trajectory looks like arriving at Mars. You launch to that trajectory from Earth, do a few (three to five) maneuvers between Earth and Mars to fine-tune the trajectory, and arrive at the right entry point in the atmosphere at the right time. For Curiosity, those were within a few hundred meters and a few seconds! (Our navigators are really good.)

But we're not there yet. Even though we hit the atmosphere very accurately, random variations in the atmosphere's density spread out the set of possible points on the surface where we will land (called the landing ellipse, due to its shape) to around a hundred kilometers. We can fix that with another kind of control called hypersonic guidance.

The entry capsule has an offset center of gravity, which gives it a little bit of lift. The capsule can be rotated with thrusters to rotate that lift vector: it can be turned to point the lift up, down, left, right, and so on. We take the knowledge of where in the atmosphere we entered (which is better than the control) and propagate it with gyroscopes and accelerometers on the vehicle to get the actual trajectory. Where the current trajectory differs from the desired trajectory due to density variations, we rotate the lift vector appropriately to correct forward, back, left, or right to get back on target.

With that, Curiosity was able to target an ellipse about 20 km in length and 7 km wide. It actually landed 2.4 km from the center of the target.
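The steering logic described above can be caricatured in a few lines. This is purely a toy: the method name, the error inputs, and the decision rule are all invented for illustration, not MSL flight software, which blends downrange and crossrange corrections continuously through bank angle.

```java
public class LiftSteeringToy {
    // Given the current trajectory's downrange error (positive = will
    // overshoot) and crossrange error (positive = drifting left), pick
    // which way to point the capsule's lift vector.
    static String bankCommand(double downrangeErrKm, double crossrangeErrKm) {
        if (Math.abs(downrangeErrKm) >= Math.abs(crossrangeErrKm)) {
            // Downrange error dominates: use lift to extend or shorten.
            return downrangeErrKm > 0 ? "lift down (shorten)" : "lift up (extend)";
        }
        // Crossrange error dominates: bank to push back toward the target.
        return crossrangeErrKm > 0 ? "bank right" : "bank left";
    }

    public static void main(String[] args) {
        System.out.println(bankCommand(5.0, 1.0));   // overshooting: shorten
        System.out.println(bankCommand(-0.5, 2.0));  // drifting left: bank right
    }
}
```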

What is the answer for (1+1e20)-(1e20) and 1+(1e20-1e20)?

The nature of the question and the accompanying imperative to "justify the answer" both scream homework problem.

That said, here's the only answer that's guaranteed to be correct given the question as asked (i.e. without any clarifying details): it depends.

It depends on the environment in which these expressions are evaluated. One answer assumed a theoretical math (infinite precision) environment, while most other answers assumed a computing platform of some sort.

Assuming a computing platform: it depends on the language in which the expression is being evaluated. For instance, the terms between the operators could conceivably be interpreted as numbers... or strings (with "+" denoting string concatenation and "-" denoting string match-and-splice).

Assuming a numeric interpretation: it depends on the internal representation of said numbers. Options include, but may not be limited to:

- Fixed-width integers too narrow to hold 10^20 (with attendant consideration of how integer overflow is handled)
- Fixed-width integers wide enough to hold 10^20 + 1 (at least 67 bits unsigned, for which overflow is not an issue)
- Arbitrary-width integers (e.g. using the GNU MP library)
- Single-precision IEEE 754 floats
- Double-precision IEEE 754 floats
- Quad-precision IEEE 754 floats
- Non-IEEE 754 floats
- Symbolic integers
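For the common case of double-precision IEEE 754 floats (e.g. Java's double), the two expressions really do evaluate differently, because floating-point addition is not associative:

```java
public class FloatAssociativity {
    public static void main(String[] args) {
        // 1e20 has a unit in the last place of 2^14 = 16384, so
        // adding 1 to it rounds back to 1e20 and the 1 is absorbed.
        double a = (1.0 + 1e20) - 1e20;
        // Here the two huge terms cancel first, leaving the 1 intact.
        double b = 1.0 + (1e20 - 1e20);

        System.out.println(a);  // prints 0.0
        System.out.println(b);  // prints 1.0
    }
}
```

This absorption effect is exactly why naive summation of many small values against one large accumulator loses precision, and why techniques like Kahan summation exist.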
