Page 1 of 1

Nuts and Bolts

Posted: May 17th, 2012, 9:00 pm
by rocketsciguy
I'm creating this topic for the discussion of the stuff that goes on behind the scenes, to keep it from clogging up normal discussions. I hope to be able to pick the brains of our benevolent developers. If any of them wish to tidy up the other forum topics that have posts that have this level of discussion, feel free to dump them in here.

WARNING!
:geek: Geek alert from this point forward! :geek:
Corfman Clan wrote: here's the SQL query for the above pic, another small glimpse into the LonelyCache database schema: :roll:

Code: Select all

SELECT Row,Ranking,CacherID,Cacher,Total
      ,(select sum(OverallPoints)from GeocacherPoints
        Where CacherID=r.CacherID and job=1 and RegionID = 1) FindsTotal
      ,(select sum(OverallPoints)from GeocacherPoints
        Where CacherID=r.CacherID and job=2 and RegionID = 1) HidesTotal
 FROM
 (SELECT
   ROW_NUMBER() OVER (ORDER BY Total DESC, g.Name) AS Row
  ,RANK() OVER (ORDER BY Total DESC) AS Ranking
  ,CacherID, g.Name as Cacher, Total
  FROM
 (SELECT CacherID, SUM(g.OverallPoints) AS Total
  FROM dbo.GeocacherPoints g
  WHERE regionid = 1 AND Job IN (1,2)
  GROUP BY CacherID) s
JOIN dbo.Geocacher g ON g.ID = CacherID) r
WHERE Row between 1 and 100
ORDER BY Row
So, because I noticed, I have to ask: Why the one letter source tables/queries names? "r", "g", "s", "p"? Or is this some syntax with which I am unfamiliar?

Another question or two, because I'd like to understand how a website like this operates behind the scenes. Are the webpages static HTML generated en masse with each update (hope not), or generated on demand from a set of template pages which use, e.g., PHP fields passed through the URL to query the database? Looks like you have the point totals pre-computed and stored in tables, probably with the heavy crunching performed offline then pushed to the server, is that right? It would be cool if the Live API data were fed directly into your live database, and all the supporting queries update with new data as it comes in. But I imagine that would be difficult to keep from breaking and might be "unfair" because of the round-robin way the Pocket Queries bring in data, and would likely bog down the server enormously. Just displaying this Top 10 rankings would require that every cache's points be recomputed, LCP-wide, and every cacher's total hide and find scores be summed -- looks like you're taking the sane route of doing those computations off-line.

Re: Nuts and Bolts

Posted: May 18th, 2012, 1:13 am
by Corfman Clan
rocketsciguy wrote:So, because I noticed, I have to ask: Why the one letter source tables/queries names? "r", "g", "s", "p"? Or is this some syntax with which I am unfamiliar?
I'm no SQL export so I usually run to my brother Robert for help. The problem there is he's absolutely amazing with Oracle SQL and we're using SQL Server. A lot transfers but there are differences. I also run to Redfist too since he's much better with SQL than I. So take what follows with a grain of salt. :?

For Select statements, there's the select part, that is, what you want to select, and the predicate (the from part, that is, what you are selecting from). If the predicate is another select, then it has to be named. So, SELECT a FROM (SELECT c FROM d) b the "b" is giving a name to the select within the parenthesis. This is the case with the "r" and "s". I picked "r" for selecting the rank, and "s" for selecting the sum.

I'm not positive but I believe the "g" is only needed for shorthand. That is, it is a lot easier to just type g.Name than dbo.Geocacher.Name. Wow, I just noticed I used "g" to alias two different tables too, though they are scoped differently so it is not an issue.
rocketsciguy wrote:Are the webpages static HTML generated en masse with each update (hope not), or generated on demand from a set of template pages which use, e.g., PHP fields passed through the URL to query the database?
The website is using Microsoft's .NET so the pages are essentially templates with server side code that runs when they are requested.

For the LeaderBoard page, the above query is generated and then run and the results are utilized in creating the HTML for the page. Let's take a look at the above query. It is using RegionID=1. RegionID=1 is LonelyCache wide. For some other region (such as AZ, CO, etc), the RegionID would be different. The query also has Job=1 or Job=2. Job specifies what type of statistics are stored in the GeocacherPoints table. 1 is for find points and 2 is for hide points. We'll have different job value for the backcountry boards. You can also see use of the OverallPoints column in the table. We have other columns for the different cache types too. So for the Multi-cache board, we'll replace the OverallPoints column with the MultiPoints column. At the bottom of the query, you'll also see WHERE Row BETWEEN 1 AND 100. By adjusting the 1 and 100, we can, for example, allow paging through the leaderboard.

So, why did I just write all the above? Well, when the leaderboard page is requested, there will be some parameters passed in the URL that specify the region and board and the page of the board to display. The server side code will evaluate all that and modify the query as needed then execute the query. When the results come back, the server side code will then generate the HTML for the page and that is then sent by the web server to your webbrowser and displayed.
rocketsciguy wrote: Looks like you have the point totals pre-computed and stored in tables, probably with the heavy crunching performed offline then pushed to the server, is that right?
Yes, things are set up to update the statistics once a day once things are up to date. We decided to not spend the time updating the statistics if data gathering is taking place. On the test site, that is set to 4:00 am Eastern time. Usually they're done a few minutes after 4, other times they don't run until a few hours later. Updating the statistics seems to take a few minutes of pretty intense number crunching on the database server. That time will only grow as the number of caches, cachers, and logs increase and as we add more statistics. The updates are done online at the server though. That is, all statistic updates are actually SQL update or insert queries.
rocketsciguy wrote: It would be cool if the Live API data were fed directly into your live database, and all the supporting queries update with new data as it comes in. But I imagine that would be difficult to keep from breaking and might be "unfair" because of the round-robin way the Pocket Queries bring in data, and would likely bog down the server enormously. Just displaying this Top 10 rankings would require that every cache's points be recomputed, LCP-wide, and every cacher's total hide and find scores be summed -- looks like you're taking the sane route of doing those computations off-line.
Like I mentioned above, the statistics updates are being done online, not offline. They are only being run once a day though and only then if no data gathering is taking place. We may update a cache's points as we gather data, but we will not be updating the statistics used to generate the leaderboards. Our feeling is that updating the statistics in realtime would be less meaningful than once a day and we took that into account when we came up with the current design. One problem with this is that while the statistics are being updated, anyone browsing the leaderboards will probably see a lot of garbage and have poor response. I'm thinking we may put up some banner or disable disable the leaderboard pages while this is going on.

Re: Nuts and Bolts

Posted: May 18th, 2012, 1:22 am
by Corfman Clan
Corfman Clan wrote:One problem with this is that while the statistics are being updated, anyone browsing the leaderboards will probably see a lot of garbage and have poor response. I'm thinking we may put up some banner or disable disable the leaderboard pages while this is going on.
Or optimally, we can use some form of data partition. Have an active partition with the current statistics and update the statistics in the inactive partition. Once the statistics are updated, make the inactive partition active.

Re: Nuts and Bolts

Posted: May 23rd, 2012, 12:21 am
by Corfman Clan
So was that it :?: No more questions :?:

Re: Nuts and Bolts

Posted: May 23rd, 2012, 7:57 am
by rocketsciguy
Sorry -- been too busy lately to write a response or follow-up, just enough time to check in to read. Thanks, this was helpful and enlightening. I will follow up soon!

Re: Nuts and Bolts

Posted: May 26th, 2012, 11:51 am
by rocketsciguy
Corfman Clan wrote:If the predicate is another select, then it has to be named. So, SELECT a FROM (SELECT c FROM d) b the "b" is giving a name to the select within the parenthesis. This is the case with the "r" and "s". I picked "r" for selecting the rank, and "s" for selecting the sum.
Ah, I did not know that. My only modern database (if you don't count dBASE IV for DOS waybackwhen) experience has been in MS Access SQL where most query building can be done with the GUI, although I've had to write or edit the underlying SQL myself sometimes. I will have to check to see if I can do that in Access too.
Corfman Clan wrote:The website is using Microsoft's .NET so the pages are essentially templates with server side code that runs when they are requested....
Well, when the leaderboard page is requested, there will be some parameters passed in the URL that specify the region and board and the page of the board to display. The server side code will evaluate all that and modify the query as needed then execute the query. When the results come back, the server side code will then generate the HTML for the page and that is then sent by the web server to your webbrowser and displayed.
Good -- that's what I had hoped to hear.
Corfman Clan wrote:Yes, things are set up to update the statistics once a day once things are up to date. We decided to not spend the time updating the statistics if data gathering is taking place. On the test site, that is set to 4:00 am Eastern time. Usually they're done a few minutes after 4, other times they don't run until a few hours later. Updating the statistics seems to take a few minutes of pretty intense number crunching on the database server. That time will only grow as the number of caches, cachers, and logs increase and as we add more statistics. The updates are done online at the server though. That is, all statistic updates are actually SQL update or insert queries....
Like I mentioned above, the statistics updates are being done online, not offline. They are only being run once a day though and only then if no data gathering is taking place. We may update a cache's points as we gather data, but we will not be updating the statistics used to generate the leaderboards. Our feeling is that updating the statistics in realtime would be less meaningful than once a day and we took that into account when we came up with the current design. One problem with this is that while the statistics are being updated, anyone browsing the leaderboards will probably see a lot of garbage and have poor response. I'm thinking we may put up some banner or disable disable the leaderboard pages while this is going on.
So if I read that right, you effectively have two layers to the database (perhaps it's really two separate databases, but whatever). The first layer is an accumulator -- it receives the data from the PQs and the API. The second layer is the daily snapshot. So at 4 AM ET, the server generates some UPDATE and INSERT queries for the database that makes aggregate statistic tables. Those tables become the new daily snapshot layer. At the user level, if a page requires aggregate data, it's pulled from the daily update layer. If it needs non-aggregate data, it could be pulled from the accumulator tables or from a static version of them generated at the same time as the daily snapshot. The advantage of using the accumulator data directly is could be more up-to-date, with the disadvantage that occasionally there will be a mismatch between (for instance) a cache's own reported â‚§ value and the value that appears on a top 100 list of caches. The upside and downside for using a static version would be vice versa. This is great! I'm glad you guys have thought this through. Thanks!

Care to share more details about how this is deployed and hosted? You mentioned it running on MS SQL Server and hosted on the east coast. Was the decision to go with MS SQL Server mostly due to compatibility with .NET and C# with which you are already proficient? Will there be ongoing licensing expenses associated with running on an Microsoft base that could be avoided if the project were migrated to free/open-source software (i.e. running on a LAMP server = Linux, Apache, MySQL, Perl/PHP/Python)?

Re: Nuts and Bolts

Posted: May 26th, 2012, 6:02 pm
by Corfman Clan
rocketsciguy wrote:So if I read that right, you effectively have two layers to the database (perhaps it's really two separate databases, but whatever). The first layer is an accumulator -- it receives the data from the PQs and the API. The second layer is the daily snapshot. So at 4 AM ET, the server generates some UPDATE and INSERT queries for the database that makes aggregate statistic tables. Those tables become the new daily snapshot layer. At the user level, if a page requires aggregate data, it's pulled from the daily update layer. If it needs non-aggregate data, it could be pulled from the accumulator tables or from a static version of them generated at the same time as the daily snapshot. The advantage of using the accumulator data directly is could be more up-to-date, with the disadvantage that occasionally there will be a mismatch between (for instance) a cache's own reported â‚§ value and the value that appears on a top 100 list of caches. The upside and downside for using a static version would be vice versa. This is great! I'm glad you guys have thought this through. Thanks!
Well no, I wouldn't describe the DB as having two layers. It has many tables and some tables are updated via the Geocaching.com Live API, and some are updated daily when we update the stats. One table is updated by both mechanisms. The basic tables are:
  • Geocache: Contains a row for each geocache. Each row contains all data for a geocache, such as GC code, name, location, owner, current point calculations, number of finds, etc. The DB on my laptop currently has 69,274 geocaches.
  • Geocacher: Contains a row for every geocacher that has found or hid a cache in the LonelyCache domain. Each row contains the geocacher ID and name. My laptop DB currently has 114,076 rows.
  • Logs: Contains a row for every geocache find. Each row contains the log ID, logger (geocacher), date, and geocache it is for. My laptop DB currently has 4,361,046 rows.
  • Region: Contains a row for each region. Each row contains a region ID, name, and region type. Currently we have 6 regions, one for LonelyCache wide, and one for AZ, CO, NV, NM, and UT.
  • GeocacheRegions: Contains a row for each region each geocache is in. Each row contains the cache ID and the region ID. Currently every cache is in two regions, LonelyCache Wide and whatever state it is in. If we add more regions, such as counties, then a cache may end up being in more than two regions.
  • GeocacherPoints: Each row contains the geocacher ID, region ID, job, and point totals for each cache type, along with a point total for all cache types.
The Geocache, Geocacher, and Log tables are updated through the Geocaching.com Live API. Each geocache will be updated after a certain amount of time has passed since it was last updated. New geocaches, geocachers, and logs are added as they are come across.

The statistics are updated about once a day. This procedure occurs once there are no geocaches ready to be updated. The idea here is why update the stats if we're still updating geocaches. There are several steps involved in updating the stats:
  • Update the points and number of finders for each geocache.
  • Delete every row in the GeocacheRegions table and then repopulate the table. This will most likely change when we add regions based on polygons in addition to the current LonelyCache wide and state regions.
  • Delete every row in the GeocacherPoints table and then repopulate the table. Population of the table is an iterative process for each job and geocache type.
Updating the stats works well, but there is a flaw in it, in that when the stats are being updated, there is essentially incomplete data to display back to users. It takes minutes to update the stats so there are minutes where users will get garbage when a page is displayed. Redfist and I need to take a look at some options on dealing with this, or at least minimizing the amount of time that this situation exists.
rocketsciguy wrote:Care to share more details about how this is deployed and hosted? You mentioned it running on MS SQL Server and hosted on the east coast. Was the decision to go with MS SQL Server mostly due to compatibility with .NET and C# with which you are already proficient? Will there be ongoing licensing expenses associated with running on an Microsoft base that could be avoided if the project were migrated to free/open-source software (i.e. running on a LAMP server = Linux, Apache, MySQL, Perl/PHP/Python)?
The web hosting company we are using has Microsoft IIS for the web servers so that is why we are using ADO.Net for the active server pages. We could have used PHP instead but we aren't except for this forum. The web hosting company standardizes on Microsoft SQL Server for the database server so that is why we are using it. I believe they also have MySQL available but we didn't pursue that. I won't go into the hosting costs here but suffice it to say that it is currently not an issue.

Re: Nuts and Bolts

Posted: June 10th, 2012, 12:46 pm
by Corfman Clan
Corfman Clan wrote:The basic tables are:
  • (Snipped descriptions for: Geocache, Geocacher, Logs, Region, and GeocacheRegions)
  • GeocacherPoints: Each row contains the geocacher ID, region ID, job, and point totals for each cache type, along with a point total for all cache types.
We updated the GeocacherPoints table to only include one Points value instead of one for each cache type. We also added a Rank and Row field and renamed the Job field to Board.

Values for Board are: FinderOverall, FinderTraditional, FinderMulti, ... FinderWherigo, HiderOverall, ..., Hider Wherigo, ..., CombinedOverall, ..., CombinedWherigo, ..., BCOverall, ..., BCWherigo, FDF.

Since more than one geocacher may have the exact same rank, the Row field is used to order them. Basically, geocachers with the same rank are ordered by caching name.

This added a lot more rows to the table and increased the time it takes to compute leader boards, but it really simplified things overall. It also speeded things up a lot when browsing the website. Displaying the Region and Geocacher pages, along with the leader board pages is much snappier. This is mostly because the ranking is now computed a priori instead of on demand.

As an example of how things are simplified, that ugly query that was shown at the top of this thread is now:

Code: Select all

SELECT Row,Ranking,CacherID,g.Name AS Cacher,Points AS Total
    ,(SELECT SUM(fp.Points) FROM GeocacherPoints f
      WHERE f.CacherID=p.CacherID AND f.RegionID=p.RegionID AND f.Board=(p.Board-20)
     ) AS FindsTotal
    ,(SELECT SUM(hp.Points) FROM GeocacherPoints h
      WHERE h.CacherID=p.CacherID AND h.RegionID=p.RegionID AND h.Board=(p.Board-10)
     ) AS HidesTotal
FROM GeocacherPoints p
JOIN Geocacher g ON (g.ID = p.CacherID)
WHERE RegionID=1 AND Board=21 AND Row BETWEEN 1 AND 100
ORDER BY Row

Re: Nuts and Bolts

Posted: June 11th, 2012, 9:13 pm
by rocketsciguy
Corfman Clan wrote:
Corfman Clan wrote:The basic tables are:
  • (Snipped descriptions for: Geocache, Geocacher, Logs, Region, and GeocacheRegions)
  • GeocacherPoints: Each row contains the geocacher ID, region ID, job, and point totals for each cache type, along with a point total for all cache types.
We updated the GeocacherPoints table to only include one Points value instead of one for each cache type. We also added a Rank and Row field and renamed the Job field to Board.

Values for Board are: FinderOverall, FinderTraditional, FinderMulti, ... FinderWherigo, HiderOverall, ..., Hider Wherigo, ..., CombinedOverall, ..., CombinedWherigo, ..., BCOverall, ..., BCWherigo, FDF.
..
This added a lot more rows to the table and increased the time it takes to compute leader boards, but it really simplified things overall. It also speeded things up a lot when browsing the website. Displaying the Region and Geocacher pages, along with the leader board pages is much snappier. This is mostly because the ranking is now computed a priori instead of on demand.
This is a good change, and I would expect more flexibility in pulling data (though you need to recode some stuff) and speedier queries even if you didn't calculate Rank and Row a priori. But calculating them will improve end-user experience at the slight cost of longer downtime for crunching. More rows is not a problem for databases, but messing with the columns (fields) will certainly break things once things are going. And a good rule of thumb is to avoid empty/null fields (not many cachers will have a score for hiding Wherigos in Nevada, but the database would hold an unnecessary zero there for them in the old structure if they've hid anything in that state [if the new update query that populates this table is generating rows for zero scores, you should filter them out as unnecessary]). And an other rule of thumb is to have only one unique piece of data per row, which this change does: All other fields are qualifiers to describe what that datapoint is -- who, what (hide/find), where (region), and which (cache type). A very good change, IMHO. I actually would bet that the size (bytes) of this GeocacherPoints table reduced in size after this change.

I will have to compare the SQL, but yes it does look cleaner now -- mostly due to having the rankings computed a priori.

Re: Nuts and Bolts

Posted: July 10th, 2012, 8:20 pm
by Corfman Clan
Corfman Clan wrote:Updating the stats works well, but there is a flaw in it, in that when the stats are being updated, there is essentially incomplete data to display back to users. It takes minutes to update the stats so there are minutes where users will get garbage when a page is displayed. Redfist and I need to take a look at some options on dealing with this, or at least minimizing the amount of time that this situation exists.
While on vacation, I was able to have a few really good discussions with two of my brothers about this. Both of them are database experts so this sort of issue is something they both are familiar with. We came up with two solutions. Either is just as effective as the other, though one gives some flexibility the other doesn't so that's the one I opted for.

In general, both solutions are the same in that we have a set of statistics that is marked active. We keep that set around while the new set of statistics is being generated. Once the new set of statistics is finished being generated, then it is marked active and the older set is marked out of date and later deleted. Changing the new set to active and the old to out of date is essentially instantaneous, so there is no down time as far as anyone can tell when viewing the website.

I know, that seems like a pretty obvious thing to do. However, the problem isn't the concept, but the implementation. During vacation, we came up with two possible implementations so I'm happy about that - I like options. Anyway, I hope to complete the implementation tonight, test it, and have it deployed by tomorrow. This was one item that I felt needed to be taken care of before we open the site up for testing

Re: Nuts and Bolts

Posted: July 11th, 2012, 9:34 am
by firennice
Thanks for the update... and the info. :)

Re: Nuts and Bolts

Posted: July 11th, 2012, 12:36 pm
by rocketsciguy
Corfman Clan wrote:In general, both solutions are the same in that we have a set of statistics that is marked active. We keep that set around while the new set of statistics is being generated. Once the new set of statistics is finished being generated, then it is marked active and the older set is marked out of date and later deleted. Changing the new set to active and the old to out of date is essentially instantaneous, so there is no down time as far as anyone can tell when viewing the website.
Happy to hear that you've found a way to fix this problem. A fortunate consequence of this solution is that you have the potential to generate historical (that is, time-varying) statistics. Old point values for individual caches probably aren't that interesting, but tracking cachers' score totals over time or leaderboard movements could be interesting. Some people have asked for an equivalent to DGP's front page stats of CP Gains and CP Finds -- who was the top lonely cache point finder in the last 30 days or 6 months? Who lost the most points (Achielle's, archived)?

I think there would be little need to save the old cache point calculations, but I think cacher histories might be cool. You could create a point total history table. Every time the stat update cycle is complete, you would append this table with cacher stats. Fields might include a date or timestamp, CacherID, Region (start with Region=1 site-wide only, but allow for future archiving of state and sub-region stats), Job (allows for distinguishing hides and finds), and PointTotal. I don't think it's necessary to save histories by cache type (Traditional, etc.).

But maybe it's simpler to take the entire GeocacherPoints table (CacherID, RegionID, Board, Rank, Row, Points), tack on a datestamp to every record, and append that to your history archive. That gives the potential to do a ton of history queries on the data, but it also means that the size of the history grows very large very quickly. You could limit the records that you save to only the top boardleaders (Rank or Row >= 20 or 100, etc.).

I guess that you don't really need an "active" and "old" database to do what I described, but it gave me the idea and it just kept going from there.

Re: Nuts and Bolts

Posted: July 12th, 2012, 1:03 am
by Corfman Clan
rocketsciguy wrote:Happy to hear that you've found a way to fix this problem. A fortunate consequence of this solution is that you have the potential to generate historical (that is, time-varying) statistics. Old point values for individual caches probably aren't that interesting, but tracking cachers' score totals over time or leaderboard movements could be interesting. Some people have asked for an equivalent to DGP's front page stats of CP Gains and CP Finds -- who was the top lonely cache point finder in the last 30 days or 6 months? Who lost the most points (Achielle's, archived)?
We are planning on implementing this but we haven't gotten around to it yet. The big problem with how you describe a possible way of doing it (and you allude to this too) is that there just would be too much data saved. For example, with today's stats, there are 2,133,286 rows in the GeocacherPoints table. If we saved 30 days worth, that would be over 60 million rows. For six months, it would be over 365 million rows.

When we do this, I imagine we'll only track total point scores (combined hider/finder points for all of the LonelyCache territory). We'll add a new table that will probably have the following columns:
  • GeocacherID
  • Date
  • Points
  • Points from Hides/finds in the last 30 days
  • Points from Hides/finds in the last 180 days
Then everyday when the stats are updated, a new row will be added to the table for each geocacher. One thing to note is that this depends on historical data and those data will not exist at first but will build as time goes on.