Week 8-9: Data Streams and Streaming Algorithms Flashcards
Data Stream
An ordered and potentially infinite sequence of elements.
Data streams have the following properties:
1. Potentially infinite (transient, single pass over the data, only summaries can be stored, real-time processing)
2. Non-static (incremental updates, concept drift, forgetting old data)
3. Temporal order may be important
Data streams can be modelled as server logs, a collection of small text files (e.g., online posts), or an ordered list of potentially infinite elements.
Element/Data Point
An element can include, but isn’t limited to:
1. Event
2. Figure
3. Record
4. Graph (in context of social network data streams)
Data Stream Example 1:
SELECT
System.Timestamp AS OutputTime, dspl AS SensorName,
Avg(temp) AS AvgTemperature
INTO
output
FROM
InputStream TIMESTAMP BY time
GROUP BY TumblingWindow(second,30),dspl
HAVING Avg(temp)>100
Monitor Avg(temp) every 30 seconds and output SensorName and Avg(temp) if Avg(temp) > 100 degrees.
Data Stream Management System (DSMS)
It’s software that handles continuous data streams.
Processor
Streams enter the DSMS via the Processor and often differ in velocity (arrival rate) and data type. Input streams need not have the same number of elements.
Archival Storage
It archives streams. Archival Storage isn’t used for answering queries, because retrieving data from it is very inefficient.
Limited Working Storage
It serves as storage space on disk or in main memory. It’s used for answering queries. It can’t store all the data.
Standing Queries
They reside in the Processor, execute permanently, and produce output at appropriate times.
Example: “output an alert when the variable TEMPERATURE exceeds 25”
Ad-hoc Queries
These are asked once about the current state of a stream or streams. The output of the DSMS, which comes from the Processor, is the answers to the query.
Example: “list the total number of transactions in the last hour”
Queries: Sampling Data from a Stream
Build a random sample, such as a random subset of the stream.
Queries: Sliding Windows
Find the number of items of type x in the last k elements of the stream.
Queries: Filtering a Data Stream
Select elements with property x from the stream.
Queries: Counting Distinct Elements
Number of distinct elements in the last k elements of the stream.
Queries: Estimating Frequency Moments
Estimate the average/standard deviation of the last k elements.
Queries: Finding Frequent Elements
Return the elements that appear most frequently in the stream. One can specify k to return the k most frequent elements.
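As a simple illustration, frequent elements can be tracked exactly with a hash-map counter. This is only a baseline: memory grows with the number of distinct elements, so a true streaming setting usually demands approximate counting instead. The toy stream below is illustrative.

```python
from collections import Counter

# Exact baseline: one counter per distinct element (not practical
# for unbounded streams with many distinct elements).
counts = Counter()
for item in ["a", "b", "a", "c", "a", "b"]:  # toy stream
    counts[item] += 1                        # incremental update per element

top2 = counts.most_common(2)                 # the k = 2 most frequent elements
print(top2)  # → [('a', 3), ('b', 2)]
```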
Data Stream Query Applications
- Mining Query Streams: Google search wants to know what queries are more frequent today than yesterday.
- Mining Click Streams: websites want to know which of its pages are getting an unusual number of clicks in the past hour.
- Mining social network news feeds: e.g., looking for trending topics on Twitter, Facebook.
Naïve Solution for Sampling
If we want to store a proportion r of the data stream, then sample proportion r of each user’s queries.
Does the Naïve Solution for Sampling Work for Estimating the Proportion of Duplicate Queries in the Stream by an Average User?
No.
Suppose each user issues x queries once and d queries twice (x + 2d queries in total). If 1/10 of the queries are sampled, the expected fraction of duplicate queries observed in the sample is d/(10x+19d).
The correct answer is d/(x+d)
A better solution is to select 1/10 of the users, store all their queries, and count the duplicates. You can use a hash function h: user -> {1,2,…,10}. If h(user) = 1, we accept their queries. Otherwise, we discard them.
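A minimal sketch of this user-level sampling, assuming user ids are strings (the hash function and the 1-in-10 rate are illustrative choices):

```python
import hashlib

def user_bucket(user_id: str, num_buckets: int = 10) -> int:
    """Hash a user id into one of the buckets 1..num_buckets."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_buckets + 1

def keep_query(user_id: str) -> bool:
    """Accept every query from the ~1/10 of users who hash to bucket 1."""
    return user_bucket(user_id) == 1
```

Because the decision depends only on the user id, either all of a user's queries are kept or none are, so duplicates within a sampled user are preserved.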
Reservoir Sampling
The goal of reservoir sampling is to get a uniform sample of the stream.
Method:
1. Add the first s tuples from the stream into a reservoir R.
2. For j>s, with probability s/j replace a random entry of R with the j-th tuple of the stream.
3. At j=n, return R.
The result is that R contains each tuple seen so far with probability s/n.
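The steps above can be sketched in Python (this is Algorithm R; the stream and sample size are illustrative):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of size s from a stream."""
    R = []
    for j, item in enumerate(stream, start=1):
        if j <= s:
            R.append(item)                 # 1. fill the reservoir with the first s tuples
        elif random.random() < s / j:      # 2. with probability s/j ...
            R[random.randrange(s)] = item  # ... replace a random entry with the j-th tuple
    return R                               # 3. at the end, return R
```

A single pass suffices, and at any point each tuple seen so far is in R with equal probability.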
DGIM Method
This approach approximates how many 1s are in the last k bits of a window of length N, where k <= N. The naïve method (storing all N bits and counting by brute force) isn’t practical for large values of N.
The DGIM Method doesn’t assume uniformity, has a bounded approximation error of O(1/r) (where r is the maximum number of buckets of each size), and stores only O(log^2 N) (base 2) bits.
Note that the larger r is, the smaller the error, but the more bits we need to store.
DGIM Method: Timestamp
Each element has timestamp t mod N, where t is 1,2,3,… and N is the window length.
Each timestamp needs O(log N) bits of space.
Memory is required to store each bucket. We have O(log N) buckets, which means O(log(N) * log(N)) = O(log^2 N) bits.
DGIM Method: Bucket
A bucket is a segment of the window. It contains the timestamp of its most recent element and the number of 1s in it (size).
- The right end of a bucket is always a 1.
- Every 1 is in a bucket.
- No position is in more than one bucket.
- There are one or two buckets of any given size, up to some maximum size.
- All sizes must be a power of 2.
- Bucket sizes can’t increase as we move from left (older) to right (more recent).
DGIM Method: Pseudocode
For each new bit:
    if the end-time of the oldest bucket is prior to N time units before the current time:
        drop the oldest bucket
    if bit == 1:
        create a new bucket of size 1 containing just this bit
        while there are three buckets of the same size k:
            combine the two oldest of them into a bucket of size 2*k
    if bit == 0:
        make no changes to the buckets
To answer a query (how many 1s in the last N bits), add the sizes of all the buckets except the oldest, plus half the size of the oldest bucket, which may be partially cut off for extending beyond N.
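The update and query steps can be sketched as follows, assuming buckets are stored as (timestamp, size) pairs, newest first (the class name and representation are mine, not part of the original definition):

```python
class DGIM:
    """Sketch of the DGIM method: approximate the count of 1s in the
    last N bits, keeping at most two buckets of each size."""

    def __init__(self, N):
        self.N = N          # window length
        self.t = 0          # current time (element index)
        self.buckets = []   # (timestamp of most recent 1, size), newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once it falls entirely outside the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            # A new 1 starts its own bucket of size 1.
            self.buckets.insert(0, (self.t, 1))
            # Whenever three buckets share a size, merge the two oldest
            # of them into a bucket of twice that size; cascade upward.
            i = 0
            while i + 2 < len(self.buckets):
                if (self.buckets[i][1] == self.buckets[i + 1][1]
                        == self.buckets[i + 2][1]):
                    ts, size = self.buckets[i + 1]  # newer of the two oldest
                    self.buckets[i + 1:i + 3] = [(ts, 2 * size)]
                else:
                    i += 1

    def estimate(self):
        """Sum all bucket sizes, counting only half of the oldest bucket."""
        if not self.buckets:
            return 0
        return sum(size for _, size in self.buckets) - self.buckets[-1][1] // 2
```

The estimate's error comes only from the oldest bucket, whose 1s may straddle the window edge, which is why half of it is counted.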
Hash Function Filter
Given a set of keys S that we want to filter on (e.g., a whitelist of email addresses):
1. Create a bit array B of n bits, initially all 0’s.
2. Choose a hash function h with range [0,n)
3. Hash each member s ∈ S to one of the n buckets and set that bit to 1, i.e., B[h(s)] = 1
4. Hash each element a of the stream with h and output a if it hashes to a bit of B that is 1 (i.e., Output a if B[h(a)] == 1)
If an element is in S, it will hash to a bucket whose bit is set to 1, so it’ll always get through (no false negatives).
If an element isn’t in S, the chance that it gets through is approximately 1 - e^{-|S|/n}, where n is the number of bits in the array (the range of the hash function).
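A small sketch of the four steps (the hash function and the key/stream values are illustrative):

```python
import hashlib

def h(key, n):
    """One hash function with range [0, n); MD5 is an illustrative choice."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

def build_filter(S, n):
    """Steps 1-3: create an n-bit array, set B[h(s)] = 1 for each s in S."""
    B = [0] * n
    for s in S:
        B[h(s, n)] = 1
    return B

def passes(B, a):
    """Step 4: output element a only if it hashes to a set bit."""
    return B[h(a, len(B))] == 1
```

With |S| = 100 keys and n = 1000 bits, roughly 1 - e^{-100/1000} ≈ 9.5% of stream elements outside S would slip through as false positives, while every element of S always passes.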