Archive: Data
October 24, 2008
Using web services with Google Docs

Last week I wrote about a couple of cool dynamic data capabilities that are built into Google Docs, including the GoogleFinance function, which lets you link to external stock ticker data in your spreadsheets.
Hackszine reader Tony Hirst, who previously showed us how to incorporate Wikipedia tables into a spreadsheet document, sent us some examples for accessing different web services from the Spreadsheets application using the importXML function. His examples include a howto for calling Amazon Web Services and another for accessing the New York Times Campaign Data API. Tony mentions, "I've now started thinking that google docs is a good place for people with little coding experience to play with web services."
For non-coders, the importXML feature is great in that it gives your spreadsheets access to a number of existing APIs without you needing to do a lot of work. For the programmers out there, however, this is even more powerful - you now have a mechanism for easily presenting and graphing your application data, assuming you can toss together a quick XML service.
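As a rough sketch of what such a quick XML service might produce, the Python below turns a few application records into an XML document that importXML could then query with XPath. The record fields, the element names, and the records_to_xml helper are all made up for illustration, and how you serve the result (a CGI script, a small web app, a static file) is left open.

```python
import xml.etree.ElementTree as ET

def records_to_xml(records):
    """Serialize a list of dicts into a <report><row>...</row></report> document."""
    root = ET.Element("report")
    for record in records:
        row = ET.SubElement(root, "row")
        for field, value in record.items():
            ET.SubElement(row, field).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical application data
signups = [
    {"day": "2008-10-22", "count": 41},
    {"day": "2008-10-23", "count": 57},
]
print(records_to_xml(signups))
```

Once a document like that is published at some URL, a formula along the lines of =importXML("your-url", "//row/count") should pull the counts into a column, ready for charting.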
Calling Amazon Associates/Ecommerce Web Services from a Google Spreadsheet
Viewing Campaign Finance Data In a Google Spreadsheet via the New York Times Campaign Data API
Previously
HOWTO - track stocks in Google Spreadsheets
Scraping Wikipedia tables with Google Spreadsheets
Posted by Jason Striegel |
Oct 24, 2008 07:33 PM
Ajax, Amazon, Data, Google, Life |
October 23, 2008
Simple stock quote grabbing with Perl
Hackszine reader 3riador wrote in to recommend a quick and easy way to grab stock quotes using Perl and the Finance::Quote CPAN module. The codebase is actively maintained and has been around for some time, having first been distributed as part of GNUCash before becoming its own project.
Paul Fenwick, one of the GNUCash developers, had this to say in an article for The Perl Journal in 2000:
If you have a reason to watch the world's financial markets, and you know a little about perl, then you may find that the Finance::Quote package comes in handy. I personally use it to remind myself that I should never buy shares, as I have a good history of losing money on the stock exchange. However, you can use Finance::Quote to help track those tasty stock options you've been offered, or even to help you build dynamic artwork driven by fluctuations in the world markets.
Near as I can tell, the dynamic artwork that's referred to is the Stock Puppets presentation which was shown at the 2000 Burning Man event (can anyone confirm this?). The idea was to have large marionettes controlled directly by stock market data, some servos, Basic Stamp microcontrollers, and IBM Thinkpads pulling market data using the Finance::Quote library.
Using Finance::Quote in your own projects is simple. Here are a few lines of code that will print the current price of Google shares:
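#!/usr/bin/perl -w
use strict;
use Finance::Quote;
# Fetch the latest quote for GOOG from the US market sources
my $q = Finance::Quote->new();
my %data = $q->fetch('usa', 'GOOG');
# Results are keyed by (symbol, attribute)
print $data{'GOOG', 'price'} . "\n";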
The Dabbler Blog has more information on installation and basic usage, and The Perl Journal article is a good resource for those wishing to delve any deeper.
Finance::Quote Perl Library
Dabbler Blog - Fast and Simple Stock Quotes Using Perl
Finance::Quote Article In The Perl Journal
Stock Puppets
Posted by Jason Striegel |
Oct 23, 2008 07:53 PM
Data, Life, Online Investing, Perl |
October 18, 2008
Easy OS drive cloning for Blades with Compact Flash
If you've ever been tasked with setting up a server room full of machines, you can sympathize with the challenge of doing this with 90 boxes that use slow Compact Flash storage. Hackszine reader Left-O-Matic sent in the following story in which he describes a pretty efficient way to clone a ton of Windows-based Blade servers using Linux, a pile of USB CF adapters and GParted, the Swiss Army knife of filesystem tools that lets you grow, shrink, and duplicate most Unix and Windows partition formats.
Working in a high stressed R&D environment, I find myself crunched for time, fighting new requests that almost always end with "...we needed this yesterday". In the past 2 years I've worked out a system using the onboard RAID controllers on Dell 1950s and SunFire X4600s to effectively clone entire Node setups for deployment around the world, using RAID 10 for the Dells and RAID 1 for the Suns (Server 2003 with appropriate licences for all machines).
That all came to a crashing halt when someone higher up the food chain decided to consolidate this mess using Sun 6000 Series blades. The X6250 Blade did not pose a problem as we ordered them with Raid Expansion Modules (REM) to control the Hard drives.
The $h!t hit the fan when the X6450 Blades rolled through the door loaded with 4 Quad-Core Xeons and no room for 2.5" SAS drives... This leads to the problem: loading Server 2003 on to 90+ Blades equipped with 48GB CF cards using a USB CD-ROM.
Now let me remind you that this needed to be done "yesterday".
With the time required to load Server 2003, set the system parameters, and harden the security for each blade, I'm looking at around 5 hours apiece. Let's do the math: 5x90=450 hours...YIKES! And let's just say that I can do 2 or 3 at a time...that's still forever.
On top of that, almost any windows based program won't work correctly.
Solution:
Enter the greatest FREE Linux based solution. GParted!
I found a crappy PC lying around the office that has a bunch of USB Ports on it. I then downloaded the LiveCD version, booted up the PC from CD, plugged in 10 CF card readers, and loaded them all with brand new CF cards fresh out of the blades along with one master CF card.
GParted allowed me to first create a NTFS partition on each card (leaving a 8MB slack space) and attach a boot flag to it.
Next I selected the data from the Master CF card, clicked copy, then selected the destination partition and clicked paste.
The beauty of this program is that instead of doing each step one at a time and waiting the 2 hours for each copy, it enabled me to line up 10 jobs that copy the data from the master CF card to each destination card, basically cloning 10 machines in 16 hours (Compact Flash transfer speeds are really, really, really slow). After I had transferred the master copy to an internal HDD, it cut the time by...well, a lot, eliminating the delay of reading a CF.
So, in conclusion, before I leave for the evening I set up a batch to clone and it's ready for me in the morning.
Quick and easy, nice and cheezy.
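To put the numbers from the story side by side, here's a back-of-the-envelope comparison in Python. It only uses figures quoted above (90 blades, about 5 hours each by hand, batches of 10 cards at about 16 hours per batch) and assumes the manual installs would run one after another. The further speedup from copying off an internal HDD isn't quantified in the story, so it's left out.

```python
import math

BLADES = 90            # blades to load, from the story
MANUAL_HOURS_EACH = 5  # install, configure and harden one blade by hand
BATCH_SIZE = 10        # CF readers plugged into the cloning PC
BATCH_HOURS = 16       # time to clone one batch of 10 cards

manual_total = BLADES * MANUAL_HOURS_EACH
batches = math.ceil(BLADES / BATCH_SIZE)
clone_total = batches * BATCH_HOURS

print(f"By hand: {manual_total} hours of hands-on work")
print(f"Cloning: {batches} batches, {clone_total} hours of mostly unattended copying")
```

The bigger win is probably that the cloning hours are unattended: a batch started in the evening is done by morning.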
Thanks for the tip Left-O-Matic! I'm sure there are more than a few of you IT folks out there who could save some time this way. I have to admit I haven't done anything like this in many years. What's your favorite way to clone a room full of boxes? Ghost? GParted? Send us your cloning tips in the comments.
Posted by Jason Striegel |
Oct 18, 2008 08:24 PM
Data, Linux Server, Windows Server |
October 15, 2008
Scraping Wikipedia tables with Google Spreadsheets
Fitting in nicely with the discussion on pulling financial data into Google Spreadsheets, the OUseful blog recently demonstrated another Spreadsheet data import function, importHTML(), which allows you to easily link an external HTML table to your workbook.
The Google spreadsheet function =importHTML("","table",N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N'th table in the page (counting starts at 0) as the target table for data scraping.
The author goes on to show you how to pull a country population table from a Wikipedia entry into a spreadsheet, create a graph from it, publish the spreadsheet as a CSV, consume the CSV in Yahoo Pipes, export the Pipe output to KML, and import the KML into a Google Map. Whew!
The importHTML function will accept either "list" or "table" as the second parameter, which allows you to retrieve records from either UL/OL/DL lists or TABLE contents, respectively. If you want to retrieve something that's not table or list based, the importXML function may also come in handy. With importXML, you can pull data from any XML or HTML file using an XPath query to target a specific tag or attribute. For more information on these import functions, consult the official documentation below.
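If XPath is new to you, it may help to try a couple of queries outside the spreadsheet first. The sketch below uses Python's built-in ElementTree module, which supports only a subset of XPath, against a made-up XML snippet. The element names and values are invented for illustration and don't come from any real service.

```python
import xml.etree.ElementTree as ET

# Made-up XML standing in for whatever a web service might return
document = """
<countries>
  <country code="IS"><name>Iceland</name><population>319000</population></country>
  <country code="MT"><name>Malta</name><population>410000</population></country>
</countries>
"""

root = ET.fromstring(document)

# Target a tag: every <name> under a <country>
names = [node.text for node in root.findall("./country/name")]

# Target by attribute: the population of the country whose code is "MT"
malta = root.find("./country[@code='MT']/population").text

print(names)
print(malta)
```

The path you'd hand to importXML follows the same pattern: a path of tags, optionally narrowed by an attribute test in square brackets.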
Data Scraping Wikipedia with Google Spreadsheets
Google Docs Documentation: Functions For External Data
Previously:
HOWTO - track stocks in Google Spreadsheets
Posted by Jason Striegel |
Oct 15, 2008 11:49 PM
Ajax, Data, Google, Google Maps, Life, Mapping, Yahoo! |
October 13, 2008
HOWTO - track stocks in Google Spreadsheets

One of the most convenient features in Google Spreadsheets is the ability to pull live external data sources into any worksheet. Instead of copying data in by hand, you link to the source, and when it changes, the cells in your spreadsheet update automatically, which can save a lot of work if you pull reports regularly. This external data can be pulled from XML, other spreadsheet documents, and even (assuming you can bear to look) current and historical stock quotes from Google Finance.
Linking a worksheet to Google Finance is as simple as calling the GoogleFinance spreadsheet function. There are two ways to use it: you can pull current information on a ticker symbol, or you can pull historical trade data for a particular date range. Here's how:
Retrieving Current Stock Information
If you call the GoogleFinance function with two arguments, a symbol and an attribute, you can link to current market data for a particular ticker symbol. Just open any cell in your worksheet and enter the following:
=GoogleFinance("symbol", "attribute")
Replace "symbol" with the ticker id, such as GOOG or AAPL. The attribute parameter determines what information will be retrieved for that symbol. There are a number of supported attributes, including price, volume, tradetime, beta, pe (price to earnings ratio), and changepct. If you omit the attribute parameter, it will default to "price". There are a number of other possible attributes which I haven't listed, including some specific to mutual funds, so check the documentation link below for the full list.
Pulling Historical Stock Data
Another thing that you can do is retrieve historical stock data over a large date range. Once you have this in your spreadsheets, you can use formulas to process, compare, and chart this information over time.
Here's the syntax for pulling historical stock data:
=GoogleFinance("symbol", "attribute", "start_date", "end_date", "interval")
As in the previous example, "symbol" needs to be replaced with the desired ticker ID. The "attribute" parameter, however, works a little differently. Its possible values are limited to high, low, open, close, vol, and all. "start_date" and "end_date" define the range of data that will be retrieved, and "interval" should be set to "DAILY", "WEEKLY", or a number from 1-7, which represents the number of days between measurements.
When the stock data is retrieved, a number of columns and rows will be consumed to capture the linked data, so make sure you have room to accommodate the data you've requested. It's not a bad practice to contain this data in a separate sheet. One thing I noticed is that the column names always appear in French for me, despite my language preference settings. If you notice this as well, you'll just have to ignore it until it's fixed.
You can have up to 250 of these Google Finance feeds in a single spreadsheet. It's not an unlimited amount, but it's not exactly lightning fast to pull that much data anyway. If you need more than that, one possible option is to separate your report data into different spreadsheets and then refresh them as needed.
Example Google Finance Spreadsheet
GoogleFinance Documentation and Examples
Posted by Jason Striegel |
Oct 13, 2008 08:53 PM
Ajax, Data, Google, Life, Online Investing, Productivity |
September 14, 2008
SnackUpon

The Yahoo Hack Day 2008 event wrapped up this weekend and the winners have been announced. One of my favorite entries didn't make the cut but is worth mentioning: SnackUpon, a Yahoo Pipes application that provides StumbleUpon-like behavior using your Delicious username as input.
The output is an RSS feed based on data gathered from what you bookmark in Delicious. Simply enter a Delicious username in the form and the output is a feed of random webpages found by searching Yahoo and Delicious for a random set of tags from the given user's Delicious account.
The output isn't perfect - it seems like the search algorithm could use a bit of polish to return more relevant results. That said, the idea is a worthwhile, if simplified, take on the classic "agent" device that the 1990s promised we'd have in the 2000s. Given a list of a user's bookmarks, the agent should be able to mine the news, blogs, links from similar people, and new bookmarks with similar classification, allow the user to consume the results at her leisure, and use the user's feedback (new bookmarks) to refine future searches.
SnackUpon
The Full List of Hack Day Entries and Winners
Posted by Jason Striegel |
Sep 14, 2008 07:03 PM
Ajax, Data, Yahoo! |
September 6, 2008
Write a Hadoop MapReduce job in any programming language
Hadoop is a Java-based distributed application and storage framework that's designed to run on thousands of commodity machines. You can think of it as an open source approximation of Google's search infrastructure. Yahoo!, in fact, runs many components of its search and ad products on Hadoop, and it's not too surprising that they are a major contributor to the project.
MapReduce is a method for writing software that can be parallelized across thousands of machines to process enormous amounts of data. For instance, let's say you want to count the number of referrals, by domain, in all the world's Apache server logs. Here's the gist of how you'd do it:
- Get all the world to upload their server logs to your gigantor distributed file system. You might automate and approximate this by having every web administrator add some JavaScript code to their site that causes their visitors' browsers to ping your own server, resulting in one giant log file of all the world's server logs. Your filesystem of choice is HDFS, the Hadoop Distributed Filesystem, which handles partitioning and replicating this enormous file between all of your cluster nodes.
- Split the world's largest log file into tiny pieces, and have your thousands of cluster machines parse the pieces, looking for referrers. This is the "Map" phase. Each chunk is processed and the referrers found in that chunk are output back to the system, which stores the output keyed by the referrer hostname. The chunk assignments are optimized so that the cluster nodes will process chunks of data that happen to be stored on their local fragment of the distributed file system.
- Finally, all the outputs from the Map phase are collated. This is called the "Reduce" phase. The cluster nodes are assigned a hostname key that was created during the Map phase. All of the outputs for that key are read in by the node and counted. The node then outputs a single result which is the domain name of the referrer, and the total number of referrals that were produced from that referrer. This is done hundreds of thousands of times, once for each referrer domain, and distributed across the thousands of cluster nodes.
At the end of this hypothetical MapReduce job, you're left with a concise list of each domain that's referred traffic, and a count of how many referrals it's given. What's cool about Hadoop and MapReduce is that it makes writing distributed applications like this surprisingly simple. The two functions to perform the example referrer parsing might only be about 20 lines of code. Hadoop takes care of the immense challenges of distributed storage and processing, letting you focus on your specific task.
Since Hadoop is written in Java, the natural way for you to create distributed jobs is to encapsulate your Map and Reduce functions into a Java class. If you're not a Java junkie, though, don't worry: there's a job wrapper called HadoopStreaming which can communicate with any program you write using the usual STDIN and STDOUT. This lets you write your distributed job in Perl, Python or even a shell script! You create two programs, one for the mapper and one for the reducer, and HadoopStreaming handles uploading them to all of the cluster nodes and passing data to and from your programs.
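To make the referrer-counting example a bit more concrete, here's a minimal Python sketch of the two halves of such a job. It assumes a drastically simplified log format where the referrer URL is the last whitespace-separated field on each line, and the function names and sample data are my own. In a real streaming job the mapper and reducer would be two separate scripts that read lines from STDIN and print tab-separated key/value lines to STDOUT, with Hadoop doing the sorting in between. Here the sort is faked locally so the whole thing runs on one machine.

```python
from itertools import groupby
from urllib.parse import urlparse

def mapper(log_lines):
    """Map phase: emit "hostname<TAB>1" for each referrer URL found."""
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        host = urlparse(fields[-1]).hostname  # assumes the referrer is the last field
        if host:
            yield f"{host}\t1"

def reducer(sorted_pairs):
    """Reduce phase: sum the counts per hostname. Input must be sorted by key."""
    keyed = (pair.split("\t") for pair in sorted_pairs)
    for host, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{host}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the cluster: map, sort by key, then reduce
sample_log = [
    "/index.html http://example.com/a",
    "/about.html http://example.org/x",
    "/index.html http://example.com/b",
]
for line in reducer(sorted(mapper(sample_log))):
    print(line)
```

Because each reducer only ever sees the lines for the keys it was assigned, already sorted, it can get away with a running sum per hostname and never needs the whole dataset in memory.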
If you want to play around with this, I really recommend a couple of howtos written by German hacker Michael G. Noll. He put together a walkthrough for getting Hadoop up and running on Ubuntu, and also a nice introduction to writing a MapReduce program using HadoopStreaming (with Python as an example).
Are any Hackszine readers using Hadoop? Let us know what you're doing and point us to more information in the comments.
Hadoop
Running Hadoop On Ubuntu Linux
Writing An Hadoop MapReduce Program In Python
Posted by Jason Striegel |
Sep 6, 2008 09:58 PM
Data, Software Engineering |
September 5, 2008
Read Excel files in Perl and PHP
Relational databases that speak SQL are the data-storage backbone for most developers. Unfortunately, most of the data that's created outside the control of the technology caste at a typical workplace is in Excel format. Because of this, being able to procedurally read and write Excel documents with a familiar language can open up a whole world of possibilities for automation and data migration.
Assuming you're attempting to read and write standard text (i.e. not binary/graphic) data from Excel worksheets, this is actually fairly doable in PHP and Perl.
A recent article by Mike Diehl at Linux Journal piqued my interest in this. He shows off some of the features of the Spreadsheet::ParseExcel Perl module, which can be used to pull data and even formatting information from cells in an Excel worksheet. Once you have your hands on the data, you can do what you want with it: output it to XML, toss it in a database for subsequent querying, or even convert it into other Excel documents (oh, the shame).
Perl Excel Libraries and Information
Spreadsheet::ParseExcel - Read from Excel 95/97/2000 documents
Spreadsheet::WriteExcel - Write to Excel 97/2000/2002/2003 documents
Linux Journal - Reading Native Excel Files in Perl
There are libraries for dealing with native Excel files in PHP as well. The following two seem to be the only options for binary Excel documents.
PHP Excel Libraries
PHP Excel_Reader - Read Excel 95 and 97 documents
Spreadsheet_Excel_Writer - Write Excel 5.0 documents
Reading and Writing Spreadsheets with PHP
With the most recent version of Excel, there is an XML file format option that will allow you to read and write data in a worksheet by directly interacting with the saved file's DOM. IBM has a document that details doing this with PHP, and it would be straightforward to apply this technique to Perl as well.
Read/Write XML Excel Data in PHP
Finally, if all you need to do is output a document that can be read in Excel, a standard CSV-format file will usually do the trick. Escaping can be a bit tricky, however, and my preferred format has become a plain-old HTML table. Just create a file that contains a TABLE element (no BODY or HTML tags necessary), with any number of TR rows and html-escaped data in the TDs, and save it out. If you use the XLS file extension, it will open directly in Excel with a double-click and Excel never seems to mind reading in the data.
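Here's a small sketch of that HTML table trick, written in Python rather than PHP or Perl just to keep it short. The rows_to_html_table helper and the sample rows are invented for illustration. The escaping is the important part, since a stray ampersand or angle bracket in a cell would otherwise break the markup.

```python
from html import escape

def rows_to_html_table(rows):
    """Build a bare TABLE element with HTML-escaped cell contents."""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(str(cell))}</td>" for cell in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)

rows = [
    ["Name", "Notes"],
    ["Smith & Sons", "5 < 6"],
]
# Write this string to a file with an .xls extension to open it in Excel
print(rows_to_html_table(rows))
```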
Do you have any other Excel programming hacks? Give us a shout in the comments.
Posted by Jason Striegel |
Sep 5, 2008 08:23 PM
Data, Excel, PHP, Perl |
August 6, 2008
Memcached and high performance MySQL
Memcached is a distributed object caching system that was originally developed to improve the performance of LiveJournal and has subsequently been used as a scaling strategy for a number of high-load sites. It serves as a large, extremely fast hash table that can be spread across many servers and accessed simultaneously from multiple processes. It's designed to be used for almost any back-end caching need, and for high performance web applications, it's a great complement to a database like MySQL.
In a typical environment, a web developer might employ a combination of process level caching and the built-in MySQL query caching to eke out that extra bit of performance from an application. The problem is that in-process caching is limited to the web process running on a single server. In a load-balanced configuration, each server is maintaining its own cache, limiting the efficiency and available size of the cache. Similarly, MySQL's query cache is limited to the server that the MySQL process is running on. The query cache is also limited in that it can only cache row results. With memcached you can set up a number of cache servers which can store any type of serialized object, and this data can be shared by all of the load-balanced web servers. Cool, no?
To set up a memcached server, you simply download the daemon and run it with a few parameters. From the memcached web site:
First, you start up the memcached daemon on as many spare machines as you have. The daemon has no configuration file, just a few command line options, only 3 or 4 of which you'll likely use:
# ./memcached -d -m 2048 -l 10.0.0.40 -p 11211
This starts memcached up as a daemon, using 2GB of memory, and listening on IP 10.0.0.40, port 11211. Because a 32-bit process can only address 4GB of virtual memory (usually significantly less, depending on your operating system), if you have a 32-bit server with 4-64GB of memory using PAE you can just run multiple processes on the machine, each using 2 or 3GB of memory.
It's about as simple as it gets. There's no real configuration. No authentication. It's just a gigantor hash table. Obviously, you'd set this up on a private, non-addressable network. From there, the work of querying and updating the cache is completely up to the application designer. You are afforded the basic functions of set, get, and delete. Here's a simple example in PHP:
$memcache = new Memcache;
$memcache->addServer('10.0.0.40', 11211);
$memcache->addServer('10.0.0.41', 11211);
$value = "Data to cache";
// Memcache::set() takes (key, value, flags, expire), so the 60 second expiry goes fourth
$memcache->set('thekey', $value, 0, 60);
echo "Caching for 60 seconds: $value <br>\n";
$retrieved = $memcache->get('thekey');
echo "Retrieved: $retrieved <br>\n";
The PHP library takes care of the dirty work of serializing any value you pass to the cache, so you can send and retrieve arrays or even complete data objects.
In your application's data layer, instead of immediately hitting the database, you can now query memcached first. If the item is found, there's no need to hit the database and assemble the data object. If the key is not found, you select the relevant data from the database and store the derived object in the cache. Similarly, you update the cache whenever your data object is altered and updated in the database. Assuming your API is structured well, only a few edits need to be made to dramatically alter the scalability and performance of your application.
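The read-through and update logic described above can be sketched in a few lines. This Python version uses in-memory stand-ins for both memcached and the database so it runs without any servers. FakeCache, FakeDatabase and the user functions are hypothetical names, and a real client would also deal with expiry and serialization.

```python
class FakeCache:
    """In-memory stand-in for a memcached client (get/set/delete, no expiry)."""
    def __init__(self):
        self._items = {}
    def get(self, key):
        return self._items.get(key)
    def set(self, key, value):
        self._items[key] = value
    def delete(self, key):
        self._items.pop(key, None)

class FakeDatabase:
    """Pretend users table that counts how often it is read."""
    def __init__(self, rows):
        self.rows = rows
        self.reads = 0
    def select_user(self, user_id):
        self.reads += 1
        return dict(self.rows[user_id])
    def update_name(self, user_id, name):
        self.rows[user_id]["name"] = name

def get_user(cache, db, user_id):
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:                      # cache miss: hit the database
        user = db.select_user(user_id)
        cache.set(key, user)              # store the derived object for next time
    return user

def rename_user(cache, db, user_id, name):
    db.update_name(user_id, name)                     # update the database first
    cache.set(f"user:{user_id}", dict(db.rows[user_id]))  # then refresh the cache

cache = FakeCache()
db = FakeDatabase({42: {"id": 42, "name": "alice"}})
get_user(cache, db, 42)
get_user(cache, db, 42)   # served from the cache, no second database read
print(db.reads)
```

Whether you refresh the cached object on update, as here, or simply delete the key and let the next read repopulate it is a design choice. Deleting is simpler, refreshing saves one database read.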
I've linked to a few resources below where you can find more information on using memcached in your application. In addition to the documentation on the memcached web site, Todd Hoff has compiled a list of articles on memcached and summarized several memcached performance techniques. It's a pretty versatile tool. For those of you who've used memcached, give us a holler in the comments and share your tips and tricks.
Memcached
Strategies for Using Memcached and MySQL Better Together
Memcached and MySQL tutorial (PDF)
Posted by Jason Striegel |
Aug 6, 2008 10:37 PM
Data, Linux, Linux Server, MySQL, Software Engineering |
August 4, 2008
Shield your files with Reed-Solomon codes
Thanassis Tsiodras wrote in about a utility for adding additional error correction redundancy to your backup data:
The way storage quality has been nose-diving in the last years, you'll inevitably end up losing data because of bad sectors. Backing up, using RAID and version control repositories are some of the methods used to cope; here's another that can help prevent data loss from bad sectors. It is a software-only method, and it has saved me from a lot of grief.
The technique uses Reed-Solomon coding to add additional parity bytes to your data. If you suffer partial damage to the storage media, these files can still be recoverable.
Storage media are of course block devices, that work or fail on 512-byte sector boundaries (for hard disks and floppies, at least - in CDs and DVDs the sector size is 2048 bytes). This is why the shielded stream must be interleaved every N bytes (that is, the encoded bytes must be placed in the shielded file at offsets 1,N,2N,...,2,2+N,etc): In this way, 512 shielded blocks pass through each sector (for 512 byte sectors), and if a sector becomes defective, only one byte is lost in each of the shielded 255-byte blocks that pass through this sector. The algorithm can handle 16 of those errors, so data will only be lost if sector i, sector i+N, sector i+2N, ... up to sector i+15N are lost! Taking into account the fact that sector errors are local events (in terms of storage space), chances are quite high that the file will be completely recovered, even if a large number of sectors (in this implementation: up to 127 consecutive ones) are lost.
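The interleaving idea is easier to see with a toy model. The Python sketch below is not the actual freeze.sh layout, which I haven't inspected. It simply assumes 512 blocks of 255 bytes are written column by column (byte 0 of every block, then byte 1 of every block, and so on) and checks how many bytes each block loses when one 512-byte sector goes bad. The function names are mine.

```python
BLOCK_SIZE = 255    # bytes per Reed-Solomon block, from the description
SECTOR_SIZE = 512   # bytes per disk sector

def interleave(blocks):
    """Write byte j of every block before byte j+1 of any block."""
    return bytes(block[j] for j in range(BLOCK_SIZE) for block in blocks)

def damaged_bytes_per_block(num_blocks, bad_sector):
    """Count how many bytes each block loses when one sector goes bad."""
    losses = [0] * num_blocks
    start = bad_sector * SECTOR_SIZE
    for offset in range(start, start + SECTOR_SIZE):
        losses[offset % num_blocks] += 1   # offset = j * num_blocks + block index
    return losses

num_blocks = 512    # assumed interleave group: 512 blocks of 255 bytes
blocks = [bytes([i % 256]) * BLOCK_SIZE for i in range(num_blocks)]
stream = interleave(blocks)

print(len(stream) // SECTOR_SIZE, "sectors in the interleaved stream")
print(max(damaged_bytes_per_block(num_blocks, bad_sector=3)), "byte lost per block at most")
```

With this layout a dead sector costs every block exactly one byte, and since the description says each block can absorb 16 such errors, it would take many bad sectors within the same group before anything became unrecoverable.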
The application works similarly to any other command line archiving utility, so you can tar your files as normal and then send them to the freeze.sh script. Running melt.sh on the archive will return your original data, even if there was a reasonable amount of corruption to the file. Thanks, Thanassis!
Hardening your files with Reed-Solomon codes
Posted by Jason Striegel |
Aug 4, 2008 10:04 PM
Data, Linux |
July 23, 2008
NTFS Alternate Data Streams - hide files inside other files
The NTFS file system has support for additional data, called Alternate Data Streams (ADS), to be attached to any file. Normally this is used by the operating system and file explorer to bind extra data to a file, such as the file's access control information, searchable file meta-data like keywords, comments and revision history, and even information that can mark a file as having been downloaded from the internet. Because this extra information is bound to the file at the filesystem level, you can move the file from one folder to another and all of the various meta-information and permission data stays with the file.
The interesting thing is that any file or directory can have zero or more ADS forks attached to it. While some of the ADS identifiers are used by the OS, there's nothing stopping you from adding other ADS forks to a file. You can do this directly from the command line, using a simple colon ":" notation.
Let's say you have a file called test.txt. You can store a secret message in the file like this:
echo "This is a secret" > test.txt:secretdata
If you view the contents of the file, you won't see anything peculiar. If you know about the existence of the secretdata ADS entry, however, you can easily extract the hidden information with the following command:
more < test.txt:secretdata > output.txt
When you now open output.txt, you'll find your secret data inside.
Because it's a lower level OS feature, you can even trick most programs into loading the data. In the scenario above, you could actually load and edit the secretdata stream inside of notepad by running "notepad test.txt:secretdata".
You can even store and execute binary data of any size in an ADS fork. For instance, maybe you want to shove solitaire inside one of your text file's ADS entries:
type c:\winnt\system32\sol.exe > test.txt:timewaster.exe
Running the file is as simple as "start .\test.txt:timewaster.exe". Wild, no?
So the odd thing is that all these hidden streams are floating about your filesystem and until Vista's /R flag on the DIR command, there hasn't really been a very good built-in way of detecting them. To solve this, Frank Heyne created an application called LADS which is an excellent command line utility that will scan a directory and print out stream names and sizes for files within it.
There was also a tool released in an MSDN article about file streams that will add an extra tab to the file properties in Windows Explorer. I've linked to a FAQ that Frank maintains about ADS that walks you through setting up the dll and registry entries to make this work. When it's activated, the Streams tab in the properties panel will let you create, view, edit or delete the stream data that's attached to any file, right in Explorer.
I can see how this file system feature could be useful, but it's a little odd that it's so hidden from the user and there seem to be a few problems with the concept. Obviously, because of ADS's hidden nature, there are a number of malicious uses that can be employed by jerk-o's who write virii and that sort of thing. Even ignoring that, there are also data interchange issues—moving a file between NTFS and another file system causes the loss of all this attached information. Call me old fashioned, but I like my files the way they used to be, with a start, an end, and some bytes in between.
Frank Heyne - Alternate Data Streams in NTFS FAQ
LADS - NTFS alternate data stream list utility
The Dark Side of NTFS
MSDN: A Programmer's Perspective on NTFS Streams and Hard Links
Posted by Jason Striegel |
Jul 23, 2008 10:30 PM
Cryptography, Data, Windows, Windows Server |
July 15, 2008
When to denormalize
There's been a bit of a database religious war on Dare Obasanjo and Jeff Atwood's blogs, all on the subject of database normalization: when to normalize, when not to, and the performance and data integrity issues that underlie the decision.
Here's the root of the argument. What we've all been taught regarding database design is irrelevant if the design can't deliver the necessary performance results.
The 3rd normal form helps to ensure that the relationships in your DB reflect reality, that you don't have duplicate data, that the zero to many relationships in your system can accommodate any potential scenario, and that space isn't wasted and reserved for data that isn't explicitly being used. The downside is that a single object within the system may span many tables and, as your dataset grows large, the joins and/or multiple selects required to extract entities from the system begin to impact the system's performance.
By denormalizing, you can compromise and pull some of those relationships back into the parent table. You might decide, for instance, that a user can have only 3 phone numbers, 1 work address, and 1 home address. In doing so, you've met the requirements of the common scenario and removed the need to join to separate address or contact number tables. This isn't an uncommon compromise. Just look at the contacts table in your average cell phone to see it in action.
Jeff writes:
Both solutions have their pros and cons. So let me put the question to you: which is better -- a normalized database, or a denormalized database?
Trick question! The answer is that it doesn't matter! Until you have millions and millions of rows of data, that is. Everything is fast for small n.
So for large n, what's the solution? In my personal experience, you can usually have it both ways.
Design your database to 3NF from the beginning to ensure data integrity and to allow room for growth, additional relationships, and the sanity of future querying and indexing. Only when you find there are performance problems do you need to think about optimizing. Usually this can be accomplished through smarter querying. When it cannot, you derive a denormalized data set from the normalized source. This can be as simple as an extra field in the parent table that derives sort information on inserts, or it can be a full-blown object cache table that's updated from the official source at some regular interval or when an important event occurs.
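As a rough sketch of the "object cache table" idea, the following Python/sqlite3 snippet rebuilds a denormalized read table from the normalized source. The schema, the contact_cache table, and the rebuild_contact_cache function are hypothetical names chosen for this example, and a full rebuild is only one of several possible refresh strategies.

```python
import sqlite3

def rebuild_contact_cache(conn):
    """Rebuild the denormalized read table from the normalized tables."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM contact_cache")
        conn.execute("""
            INSERT INTO contact_cache (user_id, name, phone_count, phones)
            SELECT u.id, u.name, COUNT(p.number), GROUP_CONCAT(p.number, ', ')
            FROM users u LEFT JOIN phones p ON p.user_id = u.id
            GROUP BY u.id, u.name
        """)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The normalized tables remain the official source of truth.
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE phones (user_id INTEGER NOT NULL REFERENCES users(id),
                         number TEXT NOT NULL);
    -- The cache is derived data and can be thrown away and rebuilt.
    CREATE TABLE contact_cache (user_id INTEGER PRIMARY KEY, name TEXT,
                                phone_count INTEGER, phones TEXT);
    INSERT INTO users VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO phones VALUES (1, '555-0100'), (1, '555-0101');
""")

rebuild_contact_cache(conn)
rows = conn.execute(
    "SELECT user_id, name, phone_count FROM contact_cache ORDER BY user_id"
).fetchall()
```

You would call rebuild_contact_cache from a scheduled job or after the writes that matter. Because the cache is always derived from the 3NF tables, a stale or corrupted cache costs you a rebuild rather than your data.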
Read the discussions and share your comments. To me, the big takeaway is that there's no one solution that will fit every real world problem. Ultimately, your final design has to reflect the unique needs of the problem that is being solved.
When Not to Normalize your SQL Database
Maybe Normalizing Isn't Normal
Posted by Jason Striegel |
Jul 15, 2008 08:47 PM
Data, Software Engineering
July 5, 2008
Crawling AJAX
Traditionally, a web spider system is tasked with connecting to a server, pulling down the HTML document, scanning the document for anchor links to other HTTP URLs and repeating the same process on all of the discovered URLs. Each URL represents a different state of the traditional web site. In an AJAX application, much of the page content isn't contained in the HTML document, but is dynamically inserted by Javascript during page load. Furthermore, anchor links can trigger Javascript events instead of pointing to other documents. The state of the application is defined by the series of Javascript events that were triggered after page load. The result is that the traditional spider is only able to see a small fraction of the site's content and is unable to index any of the application's state information.
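Here is a minimal sketch of that traditional loop in Python, using only the standard library. To keep it runnable offline, pages come from a made-up in-memory SITE dictionary passed in as the fetch function; the example.test URLs and the loadNews() handler are invented for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the targets of anchor tags that point at other documents."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Fragment-only and javascript: links lead to no new document.
        if not href or href.startswith(("#", "javascript:")):
            return
        self.links.append(urljoin(self.base_url, href))

def crawl(start_url, fetch):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# Hypothetical two-page site. The "News" link only fires a script.
SITE = {
    "http://example.test/":
        '<a href="/about">About</a> <a href="#" onclick="loadNews()">News</a>',
    "http://example.test/about": '<a href="/">Home</a>',
}
visited = crawl("http://example.test/", SITE.get)
```

The spider finds the About page but never sees whatever loadNews() would have inserted, which is exactly the blind spot described above.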
So how do we go about fixing the problem?
Crawl AJAX Like A Human Would
To crawl AJAX, the spider needs to understand more about a page than just its HTML. It needs to be able to understand the structure of the document as well as the Javascript that manipulates it. To be able to investigate the deeper state of an application, the crawling process also needs to be able to recognize and execute events within the document to simulate the paths that might be taken by a real user.
Shreeraj Shah's paper, Crawling Ajax-driven Web 2.0 Applications, does a nice job of describing the "event-driven" approach to web crawling. It's about creating a smarter class of web crawling software which is able to retrieve, execute, and parse dynamic, Javascript-driven DOM content, much like a human would operate a full-featured web browser.
The "protocol-driven" approach does not work when the crawler comes across an Ajax embedded page. This is because all target resources are part of JavaScript code and are embedded in the DOM context. It is important to both understand and trigger this DOM-based activity. In the process, this has lead to another approach called "event-driven" crawling. It has following three key components
- Javascript analysis and interpretation with linking to Ajax
- DOM event handling and dispatching
- Dynamic DOM content extraction
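The event-driven idea can be illustrated without a browser by treating the application as a set of states reached through events. The toy model below is an assumption made purely for illustration: the "DOM" is a plain dictionary, and the handlers (open_news, load_more) stand in for Javascript event handlers. A real crawler would drive an actual browser instead.

```python
from collections import deque

def open_news(dom):
    # Stand-in for a handler that swaps in the news tab via Ajax.
    return {**dom, "content": "news headlines"}

def load_more(dom):
    # Only does something once the news tab is showing.
    if dom["content"] == "news headlines":
        return {**dom, "content": "news headlines, page 2"}
    return dom

HANDLERS = {"click #news-tab": open_news, "click #more": load_more}

def freeze(dom):
    """Hashable snapshot of the DOM, used to recognize repeated states."""
    return tuple(sorted(dom.items()))

def crawl_events(initial_dom, handlers, max_depth=5):
    """Breadth-first search over event sequences.

    Returns a mapping of each distinct DOM state to the shortest list of
    events that reaches it.
    """
    seen = {freeze(initial_dom): ()}
    queue = deque([(initial_dom, ())])
    while queue:
        dom, path = queue.popleft()
        if len(path) >= max_depth:
            continue
        for event, handler in handlers.items():
            new_dom = handler(dom)  # dispatch the event
            key = freeze(new_dom)   # extract the resulting content
            if key not in seen:
                seen[key] = path + (event,)
                queue.append((new_dom, seen[key]))
    return seen

states = crawl_events({"content": "welcome"}, HANDLERS)
discovered = {dict(state)["content"]: path for state, path in states.items()}
```

Each entry in discovered pairs a piece of content with the event sequence needed to reach it, which is the "state information" a URL-only spider has no way to record.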
The Necessary Tools
The easiest way to implement an AJAX-enabled, event-driven crawler is to use a modern browser as the underlying platform. There are a couple of tools available, namely Watir and Crowbar, that will allow you to control Firefox or IE from code, allowing you to extract page data after it has processed any Javascript.
Watir is a library that enables browser automation using Ruby. It was originally built for IE, but it's been ported to both Firefox and Safari as well. The Watir API allows you to launch a browser process and then directly extract and click on anchor links from your Ruby application. This application alone makes me want to get more familiar with Ruby.
Crowbar is another interesting tool which uses a headless version of Firefox to render and parse web content. What's cool is that it provides a web server interface to the browser, so you can issue simple GET or POST requests from any language and then scrape the results as needed. This lets you interact with the browser from even simple command line scripts, using curl or wget.
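Calling a Crowbar-style service from Python might look like the sketch below. The endpoint address and the url and delay query parameters are assumptions on my part, so check the Crowbar documentation for the address and parameter names your instance actually uses.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Crowbar instance is listening locally on this address.
CROWBAR_ENDPOINT = "http://127.0.0.1:10000/"

def crowbar_request_url(target_url, delay_ms=3000, endpoint=CROWBAR_ENDPOINT):
    """Build the GET URL asking the headless browser to render target_url.

    The "url" and "delay" parameter names are assumed, not confirmed.
    """
    query = urlencode({"url": target_url, "delay": delay_ms})
    return f"{endpoint}?{query}"

def fetch_rendered(target_url, delay_ms=3000, timeout=30):
    """Return the page source as serialized after scripts have run."""
    request_url = crowbar_request_url(target_url, delay_ms)
    with urlopen(request_url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

request_url = crowbar_request_url("http://example.com/app?id=1")
```

The delay gives the page's scripts time to finish before the DOM is serialized. Feeding the result of fetch_rendered into an ordinary HTML link extractor is the simplest way to bolt this onto an existing crawler.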
Which tool you use depends on the needs of your crawler. Crowbar has the benefit of being language agnostic and simple to integrate into a traditional crawler design to extract page information that would only be present after a page has completed loading. Watir, on the other hand, gives you deeper, interactive access to the browser, allowing you to trigger subsequent Javascript events. The downside is that the logic behind a crawler that can dig deep into application state is quite a bit more complicated, and with Watir you are tied to Ruby, which may or may not be your cup of tea.
Crowbar - server-side headless Firefox
Watir - browser remote control in Ruby
Crawling Ajax-driven Web 2.0 Applications (PDF)
Posted by Jason Striegel |
Jul 5, 2008 12:57 PM
Ajax, Data, Web
June 24, 2008
Videos from past Shmoocons
You may have dug the videos of past DEFCON conferences that we posted back in May, but there's a whole other infosec conference, Shmoocon, which is held in D.C. every February.
ShmooCon is an annual East coast hacker convention hell-bent on offering three days of an interesting atmosphere for demonstrating technology exploitation, inventive software & hardware solutions, and open discussions of critical infosec issues.
It's a while until the next conference comes up, but there have been some great presentations at past conferences, most of which are available online. Peteris Krumins recently assembled links to all of the videos and presentation files that are available at the Shmoocon site (including the 2008 conference), posting them to his blog as a single big index.
A quick search on YouTube also turned up a series of videos by Scott Moulton from Shmoocon 2007 and 2008 on the topic of data recovery for both traditional hard disks and flash drives. It's pretty fascinating stuff, whether you're interested in this from a forensics or security perspective, or if you've ever just wondered what exactly goes into recovering important data from a crashed disk when you send it out to a data recovery shop.
Hacking Videos from Shmoocon
Scott Moulton's videos on data recovery for SSD flash drives and hard disks
Shmoocon Infosec Conference
See also: Videos from past DEFCONs
Posted by Jason Striegel |
Jun 24, 2008 09:14 PM
Cryptography, Data, Network Security
May 13, 2008
drop.io - simple anonymous file sharing
Sometimes I need to send people files that are too large to attach to an email. Inevitably, the solution is to upload them to an FTP or web server that I have access to and then send the recipient a download URL. It's a pretty inefficient process, and unless you like your FTP server becoming an overwhelming mess of random downloads, you have to remember to go back and remove things at a later date.
drop.io is a web service that solves this sort of problem perfectly. You create a drop URL with a unique name, upload a file to it, and set an expiration time when it will be deleted, all in a single step. The drop folder can have both an access and an admin password, and you can choose what level of access (read, read/write, read/write/delete) the non-admin has. After you've created a drop folder, you can continue to add files and notes to it via the web interface or by email. Each drop also has a phone extension that will allow you to call in and record messages that are added to the drop. It's brilliantly simple.
What I like best is that, aside from tracking IP addresses for legal or terms of service violations, it's completely anonymous. You don't make an account to use the service. There is no profile. The drop folders aren't search indexable unless you choose to make them without passwords and publish the URL somewhere crawlable. You can renew the expiration period of the drop, but when it expires, it goes away along with its contents.
I like.
drop.io - Simple Private Exchange
Posted by Jason Striegel |
May 13, 2008 08:25 PM
Data