8378: Floating-Point Numbers
时间限制: 1 Sec 内存限制: 128 MB
提交: 10 解决: 4
[提交] [状态] [讨论版] [命题人:admin]
题目描述
In this problem, we consider floating-point number formats, data representation formats to approximate real numbers on computers.
Scientific notation is a method to express a number, frequently used for numbers too large or too small to be written tersely in usual decimal form. In scientific notation, all numbers are written in the form m × 10e. Here, m (called significand) is a number greater than or equal to 1 and less than 10, and e (called exponent) is an integer. For example, a number 13.5 is equal to 1.35×101, so we can express it in scientific notation with significand 1.35 and exponent 1.
As binary number representation is convenient on computers, let's consider binary scientific notation with base two, instead of ten. In binary scientific notation, all numbers are written in the form m × 2e. Since the base is two, m is limited to be less than 2. For example, 13.5 is equal to 1.6875×23, so we can express it in binary scientific notation with significand 1.6875 and exponent 3. The significand 1.6875 is equal to 1 + 1/2 + 1/8 + 1/16, which is 1.10112 in binary notation. Similarly, the exponent 3 can be expressed as 112 in binary notation.
A floating-point number expresses a number in binary scientific notation in finite number of bits. Although the accuracy of the significand and the range of the exponent are limited by the number of bits, we can express numbers in a wide range with reasonably high accuracy.
In this problem, we consider a 64-bit floating-point number format, simplified from one actually used widely, in which only those numbers greater than or equal to 1 can be expressed. Here, the first 12 bits are used for the exponent and the remaining 52 bits for the significand. Let's denote the 64 bits of a floating-point number by b64...b1. With e an unsigned binary integer (b64...b53)2, and with m a binary fraction represented by the remaining 52 bits plus one (1.b52...b1)2, the floating-point number represents the number m × 2e.
We show below the bit string of the representation of 13.5 in the format described above.
In floating-point addition operations, the results have to be approximated by numbers representable in floating-point format. Here, we assume that the approximation is by truncation. When the sum of two floating-point numbers a and b is expressed in binary scientific notation as a + b = m × 2e (1 ≤ m < 2, 0 ≤ e < 212), the result of addition operation on them will be a floating-point number with its first 12 bits representing e as an unsigned integer and the remaining 52 bits representing the first 52 bits of the binary fraction of m.
A disadvantage of this approximation method is that the approximation error accumulates easily. To verify this, let's make an experiment of adding a floating-point number many times, as in the pseudocode shown below. Here, s and a are floating-point numbers, and the results of individual addition are approximated as described above.
s := a
for n times {
s := s + a
}
For a given floating-point number a and a number of repetitions n, compute the bits of the floating-point number s when the above pseudocode finishes.
输入
The input consists of at most 1000 datasets, each in the following format.
n
b52...b1
n is the number of repetitions. (1 ≤ n ≤ 1018) For each i, bi is either 0 or 1. As for the floating-point number a in the pseudocode, the exponent is 0 and the significand is b52...b1.
The end of the input is indicated by a line containing a zero.
输出
For each dataset, the 64 bits of the floating-point number s after finishing the pseudocode should be output as a sequence of 64 digits, each being 0 or 1 in one line.
样例输入
1
0000000000000000000000000000000000000000000000000000
2
0000000000000000000000000000000000000000000000000000
3
0000000000000000000000000000000000000000000000000000
4
0000000000000000000000000000000000000000000000000000
7
1101000000000000000000000000000000000000000000000000
100
1100011010100001100111100101000111001001111100101011
123456789
1010101010101010101010101010101010101010101010101010
1000000000000000000
1111111111111111111111111111111111111111111111111111
0
样例输出
0000000000010000000000000000000000000000000000000000000000000000
0000000000011000000000000000000000000000000000000000000000000000
0000000000100000000000000000000000000000000000000000000000000000
0000000000100100000000000000000000000000000000000000000000000000
0000000000111101000000000000000000000000000000000000000000000000
0000000001110110011010111011100001101110110010001001010101111111
0000000110111000100001110101011001000111100001010011110101011000
0000001101010000000000000000000000000000000000000000000000000000
来源/分类
【题意】
看图,解释了小数的科学计数法存储方式,前12位存指数,后52位存小数部分,整数部分默认为1。
给出n,和小数a的后52位,求n+1个a相加的结果(注意是逐个相加,并考虑每一步的误差损失),用题目描述的存储方式输出。
【分析】
错误做法1:我第一遍直接用java大数计算出来 (n+1)*a 的具体值,然后从最高位的第二位开始数52位保留下来,作为小数,剩下的位数长度作为指数e。结果前7个样例是对的,最后一个是错的。错误:题目声明了这种存储方式是有误差损失的,而我直接大数算出具体值是不损失的,故答案不符。
错误做法2:放弃大数之后,我想到用快速幂的方法实现这n次加法,然而得出的结果还是最后一个是错的。原因:快速幂过程中,只会加倍,并不能考虑到逐个加a时产生的损失, 因为逐个加a时,没加一次都有可能产生误差,而用快速幂的思想只能在加倍是产生误差,大概可以理解成 产生的误差不一样。
正确做法:考虑一个过程,我们是逐个加a,其实很多时候,我们加上一个a,数值的最高位可能并不变化,也就是不产生进位,那指数e肯定不会变化。
所以我们寻找临界状态,即现在的值 加上 几个a,能恰好进位,这个用普通除法就能算出。然后我们直接让数值加上这些a,进位。一旦进位,指数e就可以加1,而小数部分就需要右移一位了。 不断重复这个过程,知道n用尽或者小数部分不足以产生影响。
小数部分直接用long long型变量存下来。
【代码】
/****
***author: winter2121
****/
#include<bits/stdc++.h>
#define rep(i,s,t) for(int i=s;i<=t;i++)
#define SI(i) scanf("%d",&i)
#define PI(i) printf("%d\n",i)
using namespace std;
typedef long long ll;
const int mod=1e9+7;
const int MAX=2e5+5;
const int INF=0x3f3f3f3f;
const double eps=1e-8;
int dir[9][2]={0,1,0,-1,1,0,-1,0, -1,-1,-1,1,1,-1,1,1};
template<class T>bool gmax(T &a,T b){return a<b?a=b,1:0;}
template<class T>bool gmin(T &a,T b){return a>b?a=b,1:0;}
template<class T>void gmod(T &a,T b){a=((a+b)%mod+mod)%mod;}
ll gcd(ll a,ll b){ while(b) b^=a^=b^=a%=b; return a;}
ll inv(ll b){return b==1?1:(mod-mod/b)*inv(mod%b)%mod;}
ll qpow(ll n,ll m)
{
n%=mod;
ll ans=1;
for(;m;m>>=1)
{
if(m&1)ans=ans*n%mod;
n=n*n%mod;
}
return ans;
}
int main()
{
ll n;
int bit;
while(scanf("%lld",&n),n)
{
ll num=1ll<<52;
for(int i=51;i>=0;i--)
{
scanf("%1d",&bit);
num|=((ll)bit)<<i;
}
ll ans=0,res=num;
while(n>0)
{
// cout<<num<<endl;
ll Tim=((1ll<<53)-res +num-1)/num; //Tim次就能加到1<<53
// cout<<n<<' '<<Tim<<endl;
if(Tim>n) //超了
{
res+=n*num;
break;
}
else
{
res+=Tim*num;
ans++; //进位
res>>=1; //产生了进位,幂加1,所以res右移1
num>>=1; //幂涨了,所以最初的小数也就右移了
if(num<1)break; //当num不足1时,不会再对后面的加法产生影响。
n-=Tim;
}
}
for(int i=11;i>=0;i--)
printf("%d",((1<<i)&ans)?1:0);
for(int i=51;i>=0;i--)
printf("%d",((1ll<<i)&res)?1:0);
printf("\n");
}
return 0;
}
标签:binary,point,浮点数,Point,number,upc,floating,bits,ll From: https://blog.51cto.com/u_16125110/6404558