Wednesday, June 5, 2013

Poor Batch Scripts

I was whipping up some Batch scripts to rename .eml files for me so that the names included the date, subject, to and from fields to make it easier to organize and find my archived emails.
I have long used bat/cmd scripts to do stuff like this but this little project took me a full day with many breaks to relieve the frustration.
Knowing Steve Wood who built the cmd batch processor is cool but it still never helped me figure out this problem:

The variables in batch scripts can use a syntax which allows some string substitutions like:

%foovar:x=y%  which returns the value of the variable with all x strings replaced by y.
Its nice for simple cases but I found it impossible to substitute out * characters.
This is because doing:

%foovar:*=-% is specially interpreted to mean all of the variable contents up to whatever comes after the *.

I tried %foovar:^*=-%%foovar:"*"=-% and even %foovar:**=-% with no luck.

Hmmmm.

Wednesday, April 10, 2013

Ah the joys of working with Microsoft's SQL Server

The past 3 days have been largely unproductive because of the complexities arising from multiple versions of SQL Server and Visual Studio on my machine.  There is a known bug where if you installed VSExpress 2010 at sometime in the past and then try to install the SQL Server client tools you will have a dependency problem that prevents the SQL tool installation.
In my efforts to get this installed I pulled off all my SQL Servers and my Visual Studios but to no avail.  Even the hack found here failed me.
Today after many many hours of frustration I believe I have things back on track but man is my machine probably messed up getting to this state and I doubt I will remember all the steps and things I tried if I have to do this again.
Why is it that software for money is so complex?  I think Christ was right when he said that money is the root of all evil. 
The motivations to make money in software are contrary to the motivation to actually solve a problem.  New releases are needed to keep money flowing in - whether or not they are really needed.  And you can't charge someone for a bug-fix release - that is just too embarrassing to admit and to audacious to charge people for.  You need to package it in ways that maximizes the ROI.  You have to invent more fancy ways to use your solution, and you better not find a better solution that makes your old one obsolete - not unless you have competition forcing your hand.
SQL Server has historically been a pretty expensive product and not easy to get at the Microsoft Company Store.  Today I installed SQL Server Express 2012 - the competition has forced the great Microsoft to release its technology free.  MySQL and other free solutions are just too good to keep the fences up.
When working on Sidewalk, an early Microsoft website done before the days of ASP, a very clever developer on our team invented CQL (Conceptual Query Language) which consisted of an ODBC driver that accepted queries in a high-level language that was then translated into the optimal SQL query and cached.  Once this engine got loaded up it screamed and devs no longer needed to know much at all about the schema.
That dev was later hired by the SQL Server team, and I had high hopes to see CQL shipped as a revolutionary solution - but it never happened.
I doubt it could be due to a technical problem, as we were using it under real life loads in Sidewalk for years.
I think such a revolutionary solution would have put too many DB consultants out of business, including a whole division within Microsoft.
So we live with layer of crap over a pretty good basic DB engine and the whole world gets to download gigabyte plus sized install packages to deal with the immense complexities of products that have 20+ versions and packages out there.  Just look at this download page.
Code reuse - till the business model fails.

Sorry, I'm just burnt out on this stuff.

Thursday, February 14, 2013

SQL - the problem what should not be.

After spending about 5 hours trying to get a nested SQL query working I sit in frustration at how lame the language is and how even more lame the parsers are for the language.
They say the last 10% of any project is the hardest, and I agree.
We now have lame speech recognition but we don't have excellent SQL parsing.
What's worse, I know too much.  I know that this whole SQL thing was solved by the Sidewalk team at Microsoft back in 1997 by an ingenious developer who devised CQL (Conceptual Query Language) which was implemented using an ODBC driver on top of SQL Server.
We had a schema that was so big, it covered about 100 feet of wall-space around our offices when printed on huge 4x5 foot sheets of paper in 12pt font or so!  It became so difficult to generate efficient (even correct) SQL queries that CQL was devised.  It analyzed the schema and then, given a much more free-form language describing the problem in a table-less flat space would create the optimal SQL query for the CQL specification and cache it and return the results.  Parameters could be set up to substitute into the cached queries and, whalla, the whole SQL efficiency, complexity and parsing problem was gone.
That enterprising developer was hired by the SQL Server team.  Unfortunately, CQL never saw the light of day.  I suppose that it didn't make sense to solve a core problem like that.  There was just too much invested in SQL consulting and systems to bypass it.  I'm sure it came down to $.
Anyway, my specific problem is this:

Schema:
-- phpMyAdmin SQL Dump
-- version 3.4.5
-- http://www.phpmyadmin.net
SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";
SET time_zone = "+00:00";
CREATE TABLE IF NOT EXISTS `links` (
  `pageId` int(10) unsigned NOT NULL,
  `linkId` int(10) unsigned NOT NULL,
  `whenUsed` datetime NOT NULL,
  `whenRendered` datetime NOT NULL,
  KEY `pageId` (`pageId`),
  KEY `linkId` (`linkId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `links` (`pageId`, `linkId`, `whenUsed`, `whenRendered`) VALUES
(1, 9, '2013-02-14 00:44:20', '2013-02-14 00:42:39'),
(1, 9, '2013-02-14 00:44:21', '2013-02-14 00:42:39'),
(1, 9, '2013-02-14 00:44:23', '2013-02-14 00:44:21'),
(1, 8, '2013-02-14 00:44:25', '2013-02-14 00:44:23'),
(1, 7, '2013-02-14 00:44:26', '2013-02-14 00:44:25'),
(1, 6, '2013-02-14 00:44:28', '2013-02-14 00:44:26'),
(1, 3, '2013-02-14 00:44:29', '2013-02-14 00:44:26'),
(1, 7, '2013-02-14 00:44:31', '2013-02-14 00:44:29'),
(1, 8, '2013-02-14 00:44:32', '2013-02-14 00:44:31'),
(1, 11, '2013-02-14 00:44:34', '2013-02-14 00:44:32');

Inner Query:
SELECT `linkId`, 
TIMEDIFF(`whenUsed`, `whenRendered`) AS 'URDiff', 
CONVERT(DATEDIFF(`whenUsed`, UTC_TIMESTAMP()), SIGNED) AS 'dateDiff',
CONVERT('dateDiff' = 0, UNSIGNED) AS 'usedToday',
CONVERT('dateDiff' < 8 AND 'dateDiff' > 0, UNSIGNED) AS 'usedThisWeek',
CONVERT('dateDiff' < 31 AND 'dateDiff' > 7, UNSIGNED) AS 'usedThisMonth', 
CONVERT('dateDiff' < 365 AND 'dateDiff' > 30, UNSIGNED) AS 'usedThisYear'
FROM `links` 
WHERE MINUTE('URDiff') < 15 AND HOUR('URDiff') = 0
AND `pageId` = '1'
) AS T

Results of inner query:
+--------+----------+----------+-----------+--------------+---------------+--------------+
| linkId | URDiff   | dateDiff | usedToday | usedThisWeek | usedThisMonth | usedThisYear |
+--------+----------+----------+-----------+--------------+---------------+--------------+
|      9 | 00:01:41 |        0 |         1 |            0 |             0 |            0 |
|      9 | 00:01:42 |        0 |         1 |            0 |             0 |            0 |
|      9 | 00:00:02 |        0 |         1 |            0 |             0 |            0 |
|      8 | 00:00:02 |        0 |         1 |            0 |             0 |            0 |
|      7 | 00:00:01 |        0 |         1 |            0 |             0 |            0 |
|      6 | 00:00:02 |        0 |         1 |            0 |             0 |            0 |
|      3 | 00:00:03 |        0 |         1 |            0 |             0 |            0 |
|      7 | 00:00:02 |        0 |         1 |            0 |             0 |            0 |
|      8 | 00:00:01 |        0 |         1 |            0 |             0 |            0 |
|     11 | 00:00:02 |        0 |         1 |            0 |             0 |            0 |

+--------+----------+----------+-----------+--------------+---------------+--------------+ Outer Query:


SELECT 
`linkId`,
SUM('T.usedToday') AS 'countToday',
SUM('T.usedThisWeek') AS 'countThisWeek',
SUM('T.usedThisMonth') AS 'countThisMonth',
SUM('T.usedThisYear') AS 'countThisYear'
FROM

(
SELECT `linkId`, 
TIMEDIFF(`whenUsed`, `whenRendered`) AS 'URDiff', 
CONVERT(DATEDIFF(`whenUsed`, UTC_TIMESTAMP()), SIGNED) AS 'dateDiff',
CONVERT('dateDiff' = 0, UNSIGNED) AS 'usedToday',
CONVERT('dateDiff' < 8 AND 'dateDiff' > 0, UNSIGNED) AS 'usedThisWeek',
CONVERT('dateDiff' < 31 AND 'dateDiff' > 7, UNSIGNED) AS 'usedThisMonth', 
CONVERT('dateDiff' < 365 AND 'dateDiff' > 30, UNSIGNED) AS 'usedThisYear'
FROM `links` 
WHERE MINUTE('URDiff') < 15 AND HOUR('URDiff') = 0
AND `pageId` = '1'
) AS T

GROUP BY `linkId`        
Outer query result:+--------+------------+---------------+----------------+---------------+
| linkId | countToday | countThisWeek | countThisMonth | countThisYear |
+--------+------------+---------------+----------------+---------------+
|      3 |          0 |             0 |              0 |             0 |
|      6 |          0 |             0 |              0 |             0 |
|      7 |          0 |             0 |              0 |             0 |
|      8 |          0 |             0 |              0 |             0 |
|      9 |          0 |             0 |              0 |             0 |
|     11 |          0 |             0 |              0 |             0 |
+--------+------------+---------------+----------------+---------------+


Go figure?  I can clearly see from the inner results that there are 3 rows in the usedToday column that are 1 (and I have converted them to unsigned so they should SUM ok).  Yet the countToday for linkId 9 is 0 and not 3.  Why?

MySQL gives me some interesting warnings:
Warning (Code 1292): Truncated incorrect time value: 'URDiff'
Warning (Code 1292): Truncated incorrect time value: 'URDiff'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'dateDiff'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'dateDiff'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'dateDiff'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'dateDiff'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'dateDiff'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'dateDiff'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'dateDiff'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedToday'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisWeek'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisMonth'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisYear'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedToday'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisWeek'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisMonth'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisYear'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedToday'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisWeek'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisMonth'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisYear'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedToday'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisWeek'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisMonth'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisYear'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedToday'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisWeek'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisMonth'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisYear'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedToday'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisWeek'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisMonth'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisYear'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedToday'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisWeek'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisMonth'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisYear'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedToday'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisWeek'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisMonth'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisYear'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedToday'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisWeek'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisMonth'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisYear'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedToday'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisWeek'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisMonth'
Warning (Code 1292): Truncated incorrect DOUBLE value: 'T.usedThisYear'


From what I have been able to find online about this warning, it appears to happen for some people who didn't us quotes on their names (which I use everywhere).  Now why would a parser give such a warning for quote usage?

LAMEOTECH I say!

Answer:
My son correctly pointed out that I was quoting my column names (a no no I guess, with no help from the parser) and referencing aliases outside the GROUP BY, HAVING, and ORDER BY clauses (also no help from the parser).
The correct query becomes:
SELECT 
`linkId`,
SUM(T.usedToday)ascountToday,
SUM(T.usedThisWeek)as countThisWeek,
SUM(T.usedThisMonth)as countThisMonth,
SUM(T.usedThisYear)as countThisYear
FROM
(
    SELECT `linkId`, 
    TIMEDIFF(`whenUsed`, `whenRendered`) as URDiff, 
    DATEDIFF(`whenUsed`, UTC_TIMESTAMP()) as `dateDiff`,
    if(DATEDIFF(`whenUsed`, UTC_TIMESTAMP()) = 0, 1, 0)as usedToday,
    if(DATEDIFF(`whenUsed`, UTC_TIMESTAMP()) < 8 AND DATEDIFF(`whenUsed`, UTC_TIMESTAMP()) > 0, 1, 0)as usedThisWeek,
    if(DATEDIFF(`whenUsed`, UTC_TIMESTAMP()) < 31 AND DATEDIFF(`whenUsed`, UTC_TIMESTAMP()) > 7, 1, 0) as usedThisMonth, 
    if(DATEDIFF(`whenUsed`, UTC_TIMESTAMP()) < 365 AND DATEDIFF(`whenUsed`, UTC_TIMESTAMP()) > 30, 1, 0) as usedThisYear
    FROM `links` 
    WHERE MINUTE(TIMEDIFF(`whenUsed`, `whenRendered`)) < 15 AND HOUR(TIMEDIFF(`whenUsed`, `whenRendered`)) = 0
    AND `pageId` = 1
) as T
GROUP BY `linkId`    
 
With a cheery results of:

+--------+--------------+---------------+----------------+---------------+
| linkId | ascountToday | countThisWeek | countThisMonth | countThisYear |
+--------+--------------+---------------+----------------+---------------+
|      3 |            1 |             0 |              0 |             0 |
|      6 |            1 |             0 |              0 |             0 |
|      7 |            2 |             0 |              0 |             0 |
|      8 |            2 |             0 |              0 |             0 |
|      9 |            3 |             0 |              0 |             0 |
|     11 |            1 |             0 |              0 |             0 |
+--------+--------------+---------------+----------------+---------------+


Friday, January 25, 2013

Where the real power of computers lies - raw text

There are two things that really drive computers:
  1. Conventions/Standards
  2. Raw Text
Almost every PC operation (with the exception of UI rendering of images, controls and sound) ultimately use simple raw text as the true container for information.  These simple text messages are sent across your machine and the Internet to transfer information.
Many times data is stored in a binary form and there are complex protocols for transferring and interpreting data, but really, when you unpack everything and see where it is ultimately used, it is generally just text.
Even the languages used to express programs consist of text.
This is because text embodies human-readable and interpretable information.  It is the ability of humans to read text that ultimately gives text its value and capacity to hold information.
This is a significant point which I think our modern gadget/image/sound driven world is forgetting as they design software.
Many times I see information (ultimately in the form of simple text of course), packaged into forms that are difficult to extract.  Even UI that presents information to you on say a web page or a PDF file often can prevent the extraction of the raw text for transfer to other places.
Just think about it.  Where would you be with out Cut/Copy and Paste?  The clipboard is probably the most-used feature of any GUI.  In the old command line days (still alive and well today) it was the ability to pipe text between commands that gave UNIX and DOS users the power to take disparate tools and combine them efficiently to perform complex tasks.
Can you imagine where we would be without piping and clipboards?
With standards like UTF-8 there just aren't any more excuses.  We can pass virtually all information in human-readable text and have the most efficient systems possible.
So here's my tip for the day - are you coding up tons of UI or interface code just to ultimately get some text across a wire or into another program or across to a user?  If so, you may be approaching your problem wrong.

Tuesday, January 8, 2013

Ideas on building software

Over my many years of programming time, I have often had ideas on how the tools should be made to make building software more efficient and pleasurable.
There have been lots of advances over the years but I, being the old fart that I am, still think very highly of Kernighan and Ritchie, and their first OS.
I long for the days of simplicity.
The creation of the IDE (Integrated Development Environment) is more of a boon to software companies wanting to rope developers into their pet APIs of choice.  After all, there is only so much a developer can learn so if you can keep him doing things your way you can sell him software forever.
There is so much creativity out there it's quite hard to corral these guys into a single paradigm, but for most it has been done.
I admit there is synergy in having a bunch of programmers all working on the same language and tools, I'm not against that, but I am against a single paradigm being boxed up and shoved down your throat - you want to work differently, then make your own tools.
The openness of a simple compiler, linker and make system I still like best.  The tools are individually comprehensible and manageable - especially if the input and output are mostly text files.
I wish they had gone further in developing the multi-phased approach to building software.
Here's a list of the parts I see as being nice to have separated and individually accessible:
A Pre-compile Phase Macro Package
The old C compiler had this built in which you could utilize by invoking a build so you could see the output for help with buggy macros.  But I think such an intermediate file is actually quite useful to have in the build process as a separate phase of the process.
It is incredibly useful to place things in one place and have other files build from those definitions in different ways. Maybe you have common constants you want to use across php, javascript, C and SQL.  Maybe you have some key core data that would really speed things up if it were hard-coded in each layer of your distributed design.  Keeping things identical can only be done easily if the parts are derived from a single source combined with formatting or other options to create the output.
These pre-compiled output files can themselves become checked-in source that is only rarely built when the core definitions change.  Why precompile EVERY time right?
File Dependency Modules
The real nightmare of make-files is keeping dependencies straight.  When done manually, they create a kind of straight-jacket that eventually ties you in knots.  I envision tools that inspect the source tree for how files are related.  It could be as simple as looking for #Include statements but it could get quite sophisticated as well.  Imagine just building each function in a separate file with no references to any headers or other modules of your code.  The dependency tool simply figures out based on the symbolic analysis of the modules what files it depends on.  Of course the modules would be language specific but they could also be invoked further down the build process like at link-time.  The output of these tools would be dependency description files.  Date/Time stamps can be used to optimize only re scanning things that changed and naturally, these intermediate files can be checked-in as well.
File location based on dependencies
I have long thought of this as one of my cooler ideas.  Suppose you had a set of directories titled Layer0, Layer1, ... where Layer0 constitutes all the base OS APIs you have at your disposal - things that you know will always be there and you can't change.  Above that in Layer1 is only code that depends on Layer0 code.  Linking this code need know nothing above Layer1 to work.  Layer2 is just code that depends on Layer1 and lower code, etc.
If  you make a change that adds a dependency to an equal or higher layer, the file gets moved to the higher layer automatically, or a warning gets issued and it won't build till you put it in the right place.
Programmers become very aware of how their changes effect the dependencies of a project and they instinctively know that the lower order layers are likely simpler and faster bits of code.
I would think such a system will have a natural ordering to it that will promote faster and better factored code.
Compilers to a common AST structure
An AST stands for Abstract Syntax Tree and is an internal structure that compilers use as an intermediate form of the code as it parses the source code.  What if you could output the AST in text form?  You could see how the parser is interpreting your code and errors would be much better understood if you could see the associated AST.  What is nice is that this intermediate form of the build process could also be used to help editors do smarter editing and syntax errors would have so much information to present on a problem as to make it very clear, most likely, what is wrong without needing to create really great error messages for the problems.
What is even more powerful is the ability to write parsers for multiple languages that output the exact same AST format.  Now we have cross-language support without runtime or target machine dependencies!
And, as suggested in other areas of this article, the AST files could also be checked in and the re-parsing of unchanged files skipped in the build process.
AST to target object compilers
The next phase involves taking input from the AST files and building linkable objects or IL (Intermediate Language) output files.  It's not a long move from AST form to IL or Linkable form - it could be a very fast process.  Optimization options go in this phase of processing I should think.
Run-Time and Link-Time tools
I recall trying to better understand .obj files sometimes to help me understand link dependencies better.  We just didn't have the tools.  But it seems at this point one could start testing, profiling and debugging your code before its even compiled into final form.  Unreachable code could be found and dependencies analyzed here.  These tools could feed back into the file location based dependencies by noting where and how many link ends there are.  Graphical models could be constructed to see dependencies across modules and suggestions made on better ways to pull parts together into modules.
Final Compilation and post compilation optimization
Another pass can be used to construct stand alone .exes or .dlls or IL assemblies.  Here is where massive configuration options can be used for proper packaging.
I have always hated COM because of the way it forces you to create so many layers of overhead that you might never really need.  COM has a great way of abstracting things out so you treat everything the same but many times its just not needed.  I have often wish I had the choice of how to connect with other code - use COM with IDispatch interfaces, use COM with core interfaces, use direct DLL export linking, or just paste the code into my code.  These are all legitimate ways to go, I wish I had a world where I could just use all of them together and change them as needed.
Build analysis tools
How long did each phase of the build of parts of a big project take?  How many times did the same module have to be linked in for others to use?  Can I see a dependency map of all parts of the process from start to finish?
Graphical Programming
Finally, the ultimate dream - building software like legos.  If there were a nice way to represent code via a hierarchy of ORM models with different shaped knobs on them to represent interfaces.  You pick up a piece and you instantly see a list of other parts it can connect to and you just graphically drag them together to form a new part.
It's all a dream but some ideas that I hope might inspire someone to think of a way to do this better.
I'm not a compiler writer so I can't really claim any real expertise here, but it just seems we could do things better.