怎么求中位数syz

合集下载

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

求中位数相关
一．快排与中位数算法
问题描述：
某石油公司计划建造一条由东向西的主输油管道。

该管道要穿过一个有n口油井的油田。

从每口油井都要有一条输油管道沿最短路经(或南或北)与主管道相连。

如果给定n口油井的位置,即它们的x坐标（东西向）和y坐标（南北向）,应如何确定主管道的最优位置, 即使各油井到主管道之间的输油管道长度总和最小的位置?证明可在线性时间内确定主管道的最优位置。

解决方法：求y坐标的中位数。

1．最坏时间复杂度为O(n2)，平均为O(n)
int Partition(int A[],int low,int high)
{
int temp = A[low];
int i = low ,j = high;
while (i < j)
{
while (i < j && A[j] >= temp) j--;
while (i < j && A[i] <= temp) i++;
swap(A[i],A[j]);
}
A[low] = A[i];
A[i] = temp;
return i;
}
int RandomizedPartition(int A[],int low,int high)
{
int i = rand() % (high-low+1) + low;
swap(A[i],A[low]);
return Partition(A,low,high);
}
void QuickSort(int A[],int p,int r)
{
if (p < r)
{
int q = Partition(A,p,r);
QuickSort(A,p,q-1);
QuickSort(A,q+1,r);
}
}
void RandomizedQuickSort(int A[],int p,int r)
{
if (p < r)
{
int q = RandomizedPartition(A,p,r);
RandomizedQuickSort(A,p,q-1);
RandomizedQuickSort(A,q+1,r);
}
}
int RandomizedSelect(int A[],int low,int high,int k) {
if (low == high)
return A[low];
int i = RandomizedPartition(A,low,high);
int j = i - low + 1;
if (k <= j) return RandomizedSelect(A,low,i,k);
else
return RandomizedSelect(A,i+1,high,k-j); }
int main(void)
{
int A[10];
int i;
srand(time(NULL));
for (i = 0; i < 10;++i)
{
A[i] = rand() % 100;
cout << A[i] << " ";
}
cout << endl;
cout << RandomizedSelect(A,0,9,5) << endl;
RandomizedQuickSort(A,0,9);
for (i = 0; i < 10;++i)
cout << A[i] << " ";
cout << endl;
return 0;
}
2．We can ﬁnd the i th smallest element in O(n) time in the worst case. Weíll describe a procedure SELECT that does so.
SELECT recursively partitions the input array.
•Idea: Guarantee a good split when the array is partitioned.
•Will use the deterministic procedure PARTITION, but with a small modiﬁca-
tion. Instead of assuming that the last element of the subarray is the pivot, the
modiﬁed PARTITION pro cedure is told which element to use as the pivot.
SELECT works on an array of n > 1 elements. It executes the following steps:
1. Divide the n elements into groups of 5. Get n/5 groups: n/5 groups with exactly 5 elements and, if 5 does not divide n, one group with the remaining n mod 5 elements.
2. Find the median of each of the n/5 groups:
•Run insertion sort on each group. Takes O(1) time per group since each group has ≤5 elements.
•Then just pick the median from each group, in O(1) time.
3. Find the median x of the n/5 medians by a recursive call to SELECT. (If n/5 is even, then follow our convention and ﬁnd the lower median.)
4. Using the modiﬁed version of PARTITION that takes the pivot element as input,
partition the input array around x. Let x be the kth element of the array after
partitioning, so that there are k −1 elements on the low side of the partition and
n − k elements on the high side.
5. Now there are three possibilities:
•If i = k, just return x.
•If i < k, return the i th smallest element on the low side of the partition by
making a recursive call to SELECT.
•If i > k, return the (i −k)th smallest element on the high side of the partition
by making a recursive call to SELECT.
二．腾讯题：海量数据中找出中位数
题目：在一个文件中有10G 个整数，乱序排列，要求找出中位数。

内存限制为2G。

只写出思路即可（内存限制为2G的意思就是，可以使用2G的空间来运行程序，而不考虑这台机器上的其他软件的占用内存）。

关于中位数：数据排序后，位置在最中间的数值。

即将数据分成两部分，一部分大于该数值，一部分小于该数值。

中位数的位置：当样本数为奇数时，中位数=(N+1)/2 ; 当样本数为偶数时，中位数为N/2与1+N/2的均值（那么10G个数的中位数，就第5G大的数与第5G+1大的数的均值了）。

分析：明显是一道工程性很强的题目，和一般的查找中位数的题目有几点不同。

1. 原数据不能读进内存，不然可以用快速选择，如果数的范围合适的话还可以考虑桶排序或者计数排序，但这里假设是32位整数，仍有4G种取值，需要一个16G大小的数组来计数。

2. 若看成从N个数中找出第K大的数，如果K个数可以读进内存，可以利用最小或最大堆，但这里K=N/2,有5G个数，仍然不能读进内存。

3. 接上，对于N个数和K个数都不能一次读进内存的情况，《编程之美》里给出一个方案：设k<K,且k个数可以完全读进内存，那么先构建k个数的堆，先找出第0到k大的数，再扫描一遍数组找出第k+1到2k的数，再扫描直到找出第K个数。

虽然每次时间大约是nlog(k)，但需要扫描ceil(K/k) 次，这里要扫描5次。

解法：首先假设是32位无符号整数。

1. 读一遍10G个整数，把整数映射到256M个区段中，用一个64位无符号整数给每个相应区段记数。

说明：整数范围是0 - 2^32 - 1，一共有4G种取值，映射到256M个区段，则每个区段有16（4G/256M = 16）种值，每16个值算一段，0～15是第1段，16～31是第2段，……2^32-16 ～2^32-1是第256M段。

一个64位无符号整数最大值是0～8G-1，这里先不考虑溢出的情况。

总共占用内存256M×8B=2GB。

2. 从前到后对每一段的计数累加，当累加的和超过5G时停止，找出这个区段（即累加停止时达到的区段，也是中位数所在的区段）的数值范围，设为[a，a+15]，同时记录累加到前一个区段的总数，设为m。

然后，释放除这个区段占用的内存。

3. 再读一遍10G个整数，把在[a，a+15]内的每个值计数，即有16个计数。

4. 对新的计数依次累加，每次的和设为n，当m+n的值超过5G时停止，此时的这个计数所对应的数就是中位数。

总结：
1.以上方法只要读两遍整数，对每个整数也只是常数时间的操作，总体来说是线性时间。

2. 考虑其他情况。

若是有符号的整数，只需改变映射即可。

若是64为整数，则增加每个区段的范围，那么在第二次读数时，要考虑更多的计数。

若过某个计数溢出，那么可认定所在的区段或代表整数为所求，这里只需做好相应的处理。

噢，忘了还要找第5G+1大的数了，相信有了以上的成果，找到这个数也不难了吧。

3. 时空权衡。

花费256个区段也许只是恰好配合2GB的内存（其实也不是，呵呵）。

可以增大区段范围，
减少区段数目，节省一些内存，虽然增加第二部分的对单个数值的计数，但第一部分对每个区段的计数加快了（总体改变？？待测）。

4. 映射时尽量用位操作，由于每个区段的起点都是2的整数幂，映射起来也很方便。

三．给定n个互不相同的数组成集合S，以及正整数k<=n，设计一个O(n)时间算法找出S 中最接近S的中位数的k个数。

首先假定读者知道找一个无序数列中的第k小的数的select算法，这个算法的时间复杂度是O(n)的。

如果给出的n是奇数，则中位数是第n/2个；如果是偶数，则中位数是第n/2和n/2 + 1个的中位数。

(也可以无论n是奇是偶，都取n/2)
第一步，通过select算法得到这个中位数；
第二步，这个集合中的每个数都减去这个中位数然后取abs()；
第三步，在第二步得到的集合中，取第k小的数，再取小于第k小的数的k-1个数，他们原来的数，就是要求的k个数。

有两个已排好序（从小到大）的数组A和B，长度均为n,把它们合并为一个数组，要求时间代价为O(logn),请给出算法。

下面是我在网上找到的：
Say the two arrays are sorted and increasing, namely A and B.
It is easy to find the median of each array in O(1) time.
Assume the median of array A is m and the median of array B is n.
Then,
1' If m=n, then clearly the median after merging is also m, the algorithm holds.
2' If m <n, then reserve the half of sequence A in which all numbers are greater than
m, also reserve the half of sequence B in which all numbers are smaller than n.
Run the algorithm on the two new arrays.
3' If m>n, then reserve the half of sequence A in which all numbers are smaller than
m, also reserve the half of sequence B in which all numbers are larger than n.
Run the algorithm on the two new arrays.
Time complexity: O(logn)。