2012 F HW1(1) .pdf
Original filename: 2012-F-HW1(1).pdf
This PDF 1.5 document has been generated by Microsoft® Word 2010, and has been sent on pdf-archive.com on 20/09/2012 at 17:30, from IP address 76.127.x.x.
The current document download page has been viewed 804 times.
File size: 494 KB (5 pages).
Privacy: public file
Download original PDF file
2012-F-HW1(1).pdf (PDF, 494 KB)
Share on social networks
Link to this file download page
Fall 2012 CS 3200: Homework 1
Goals: (1) Experience how much of a difference optimizing data-intensive programs can make and how
this depends on physical data layout and data and query properties. (2) Gain deeper understanding of
approaches we will later revisit when discussing index structures in a database. (3) Find out if you have
sufficient programming skills to handle future homework, in particular for the JDBC-based assignment. If
you find this assignment very difficult, please contact the instructor ASAP!
This homework is to be completed individually (i.e., no teams). You can discuss general problems with
other students, but you have to create all deliverables yourself from scratch. In particular, it is not
allowed to copy someone else’s code or text and modify it.
All deliverables for this HW are due on Friday, September 21 at 11pm. For late submissions you will lose
one percentage point per hour after the deadline. This HW is worth 100 points and accounts for 10% of
your overall homework score. To encourage early work, you will receive a 5-point bonus if you submit
your solution on or before Monday, September 17 at 11pm. (Notice that your total score cannot exceed
100 points, but the extra points would compensate for any deductions.)
Please submit your solution through the Blackboard assignment submission system and make sure your
report is a PDF file. All programs need to be submitted as source code files, e.g., MyProgram.java.
Package all your solution files, including the report, into a single ZIP file. Make sure you are using
standard ZIP, i.e., the format also used by Windows archives.
A note about Java versus other programming languages: We strongly recommend that everybody uses
Java, as we cannot help you with problems in other languages. This assignment also points to Java
resources and approaches. If you choose another language, you need to find out on your own how to
get the equivalent functionality. In short, you are allowed to use other languages at your own risk.
Consider a relation Users with schema (userID, age, stuff) that is stored in a large file. Our goal is to find
the user IDs for users in a certain age range. We will explore the performance of two different
algorithms for four different queries. For each of the eight combinations you will measure the total
runtime and report some results.
The Users data set has 7 million tuples (aka records). Each tuple consists of a userID of type long (8 byte
integer type), the age value of type long, and the stuff field, which is a sequence of 20 values of type
long. User ages range from 0 to 99 years old. There are about 100,000 users of each age for age 0 to 69,
and about 10 users of each age for ages 70 to 99. The data is not ordered in any way. This data set is
available on Blackboard.
Sequential Scan Algorithm
Implement the simple sequential scan algorithm for finding the userIDs for all users in a selected age
range between low and high (including low and high). It has the following pseudo-code:
1. Open Users file and make sure reading starts at the beginning of the file
2. Initialize and start timer
3. While not reached the end of file do
a. Read next (userID, age, stuff) tuple from file
// You can also skip irrelevant values
b. If low age and age high then add userID to output data structure
4. Stop timer and return elapsed time
5. Close file
The second algorithm uses an index to “jump” directly to the tuples with the selected ages in the Users
file. To make life easier, we will construct an in-memory index, instead of creating a file-based index.
1. // Construct an in-memory index on the age attribute of the Users relation
2. Create an array Index[ 100 ], where each element Index[ i ] is a list data structure that will
contain file locations where a tuple with age=i is found in the file
3. Open Users file and make sure reading starts at the beginning of the file
4. While not reached the end of file do
a. pos = current position (i.e., offset) in the file
b. Read next (userID, age, stuff) tuple from file
// You can also skip irrelevant values
c. Index[ age ].append( pos )
// Add position of record to corresponding index entry
5. Close file
7. // Now we have an appropriate in-memory index on age for our Users data file
8. // Let’s use that index to try and speed up the query
9. Open Users file
10. Initialize and start timer
11. For age from low to high do
// For each age in selected range [low, high]
a. For each position pos stored in the list at Index[ age ] do
i. Seek (i.e., jump) to position pos in the Users file
ii. Read userID from that position and add it to the output data structure
12. Stop timer and return elapsed time
13. Close file
Notice how the index allows us to directly “jump” to the right locations in the file.
We want to run the following four queries:
Q1: Find the userIDs of all users who are exactly 75 years old. (low=75, high=75)
Q2: Find the userIDs of all users who are between 80 and 99 years old. (low=80, high=99)
Q3: Find the userIDs of all users who are exactly 25 years old. (low=25, high=25)
Q4: Find the userIDs of all users who are between 30 and 49 years old. (low=30, high=49)
Implement the two algorithms discussed above. (60 points)
Run each of the algorithms for each of the queries, making sure the buffer size for the sequential scan is
at least a few hundred kilobytes, while the index-based file accesses should not use a buffer. Repeat this
at least three times. This gives you an idea about the variance of the observed results, e.g., due to
external effects such as other processes running on your machine.
Report the runtime for each of these eight experiments for each of the three runs. (Note: You need to
show at least 24 numbers here.) Since runtime might be affected significantly by the method of writing
the output, make sure you turn off all screen printing and file writing statements when measuring
runtime. Also make sure you keep the Users file on the local disk of the machine you are working on. For
instance, if you work in a CCIS lab and store the file on a network drive, file I/O becomes very slow and
might skew the results.
Discuss the following: (30 points)
1. For each query, argue why the index does or does not speed up the query.
2. Consider the scan algorithm. How does its runtime depend on the query? Explain why this
3. Consider the index-based algorithm. How does its runtime depend on the query? Explain why
4. Queries Q1 and Q3 look very similar. Is the runtime of the index-based algorithm also similar for
the two queries? If not, explain why.
5. Queries Q2 and Q4 also look very similar. Is the runtime of the index-based algorithm also
similar for the two queries? If not, explain why.
6. Discuss anything else that you observe about the results and find interesting, surprising, or
Testing Your Code
For Q1, report all userIDs returned by your two programs (2.5 points).
For Q2, Q3, and Q4, report the number of userIDs found by your two programs (7.5 points).
Write a brief report about your findings, using the following structure.
This should provide information such as class number, HW number, and your name.
Briefly discuss how you implemented the algorithms in Java (or whichever language you used) and if
there were any difficulties you encountered. Explain which files hold which source code.
Explain on what platform (hardware specs like CPU speed, CPU cache size, memory size, hard disk type,
size and speed, operating system) you ran your experiments and which commands you used to measure
Discuss the points requested in the task description and report the results requested in the code testing
Deliverables (All in a SINGLE ZIP File)
1. The report as discussed above. (1 PDF file)
2. Your Java source code for the two algorithms. (one or more .java files)
The java.io library has all the file-related functionality we need for this homework. Take a look at
http://docs.oracle.com/javase/6/docs/api/. For sequentially scanning a file, you can create a file input
stream as follows:
FileInputStream FIS = new FileInputStream( inputFileName );
To add buffering to it, you can use
BufferedInputStream BIS = new BufferedInputStream( FIS, desiredBufferSize );
Since the example file stores all data as long integers, create a DataInputStream object:
DataInputStream DIS = new DataInputStream( BIS );
Then you can use DIS.readLong() to read a single long value from the file. Note that to read a complete
tuple (userID, age, stuff) from Users, you need to (1) read a long to get the userID, (2) read another long
to get the age, and (3) read 20 more longs to get all “stuff” entries. You can skip over irrelevant bytes by
using method skipBytes. Whenever you need to read from DIS from the beginning, just close and then
re-create that stream. (Or find a more elegant way by looking in the Java API.)
For the index-based file access, the RandomAccessFile class is more useful. You can create a
RandomAccessFile object by using the constructor that takes the file name as input. RandomAccessFile
already has a readLong() method. The seek(long pos) method allows you to “jump” to a desired position
in the file. To get the current position in the file, you can use the getFilePointer() method. This is useful
during index construction (line 4.a).
Everything else is essentially almost a one-to-one translation of the pseudo-code. As a guideline,
excluding the necessary boilerplate code for imports and exception handling etc., your program should
ideally not have more than 200 lines of code.
In all cases, do not forget to close your files, streams, and writers in the end.
To measure the runtime of a program call, you can do the following:
long startTime = System.currentTimeMillis();
long endTime = System.currentTimeMillis();
A final suggestion for the index data structure: The pseudo-code suggests that Index be an array where
each element Index[ i ] contains a list of file position pointers. You do not have to use those exact data
structures. Feel free to use any data structure that can efficiently store and retrieve all relevant file
positions for each age value between 0 and 99. Probably the simplest data structure would be a twodimensional array of longs (notice that file pointers coincidentally are of type long). The array would
have 100 rows and a sufficiently large number of columns, which you can find by counting the number
of users for each age while scanning the file.
Link to this page
Use the permanent link to the download page to share your document on Facebook, Twitter, LinkedIn, or directly with a contact by e-Mail, Messenger, Whatsapp, Line..
Use the short link to share your document on Twitter or by text message (SMS)
Copy the following HTML code to share your document on a Website or Blog