Saturday, April 30, 2011

FAST Search for SharePoint 2010 (FS4SP) and AdvancedFilterPack file types

Topic: SharePoint 2010, FAST Search for SharePoint 2010 (FS4SP), AdvancedFilterPack, SearchExportConverter
Subject:  File types included in the Advanced Filter Pack
Problem: I can’t seem to find any documentation on what file types are included OOB for IFilters or what file types are included in the AdvancedFilterPack?
Response: This is a great question that I get all the time.  I see a lot of blogs about PDFs not indexing correctly, with the advice to enable the AdvancedFilterPack. I have blogged about problems with the built-in PDF converter ( http://fs4sp.blogspot.com/2011/04/why-arent-my-pdfs-indexed-correctly-and.html ), but you may be surprised to learn that PDFs are enabled without the Advanced Filter Pack.  The majority of issues I see with PDFs or other OOB types are permission related.  When using the FS4SP configuration wizard, always remember to “Run as Administrator”.  The configuration wizard (which can also be run manually, and provides a whole lot more information that way) propagates permissions through the FASTSearch installation directory.  If the permissions are not set correctly, the IFilter conversion process may fail.  If you use the FAST Shell command “docpush” to test the installation, never use a text-based file such as .txt or .html, as these files do not invoke the IFilterConverter.
DocPush Example:  docpush -c <collection name; "sp" is the default> "C:\TEST\1.pdf" -l verbose

So let’s take a look at how to determine what is included OOB and what is included with the AdvancedFilterPack.

Solution\Example:
1.      Here is the list of OOB file types
a.      .doc,.docm,.docx,.dot,.dotx,.eml,.html,.mht,.mhtml,.msg,.nws,.odp,.ods,.odt,.one,.pdf,.pot,.pps,.ppt,.pptm,.pptx,.pub,.rtf,.txt,.vdw,.vsd,.vss,.vst,.vsx,.vtx,.xlb,.xlc,.xls,.xlsb,.xlsm,.xlsx,.xlt,.xlm,.xps,.zip
b.      Pretty easy … I stole it right off TechNet. http://technet.microsoft.com/nb-no/library/gg471168(en-us).aspx
c.      It doesn’t make any reference to Advanced Filter Pack file types, and I wouldn’t blog about it if the answer were simply a link to TechNet.

2.      Let’s ignore Step 1 and start again. How can I find what is OOB and what is in AdvancedFilterPack?

3.      Most people simply enable the AdvancedFilterPack as soon as they know about it, and that makes sense in most situations. However, if every document in the corpus to be indexed falls into the list provided in #1, there is no need to enable the Advanced Filter Pack and add additional overhead to the FAST pipeline.

4.      Enable the Advanced Filter Pack

a.      Open the FAST Command Shell as Administrator on the FAST Admin Node
                                                    i.     Navigate to the <FAST Install Drive>\FASTSearch\installer\scripts directory.
                                                   ii.     Issue: “.\AdvancedFilterPack.ps1 -enable”

5.      There are two ways to verify the Advanced Filter Pack is enabled.
a.      1st - Re-issue the “.\AdvancedFilterPack.ps1 -enable” command.
                                                    i.     If it is already enabled it will tell you.

b.      2nd – The optional processor “SearchExportConverter” will be toggled to active=”yes”
                                                    i.     Use Windows Explorer to navigate to:
1.      <FAST Install Drive>\FASTSearch\etc\config_data\DocumentProcessor
                                                   ii.     Open the optionalprocessing.xml file in Internet Explorer, WordPad, or Notepad.
                                                  iii.     Locate the processor name ”SearchExportConverter”

When Enabled:
<optionalprocessing>
               <processor name="SearchExportConverter" active="yes" />
</optionalprocessing>

When Disabled:
<optionalprocessing>
               <processor name="SearchExportConverter" active="no" />
</optionalprocessing>
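
The manual check in step 5b can also be scripted. The following is a minimal Python sketch, assuming the file contains only the `<optionalprocessing>` and `<processor>` elements shown above; the function name `is_processor_active` and the two sample strings are made up for illustration and are not part of FS4SP.

```python
import xml.etree.ElementTree as ET

# Sample documents copied from the "When Enabled" / "When Disabled" fragments above.
ENABLED = """<optionalprocessing>
    <processor name="SearchExportConverter" active="yes" />
</optionalprocessing>"""
DISABLED = ENABLED.replace('active="yes"', 'active="no"')


def is_processor_active(xml_text, name="SearchExportConverter"):
    """Return True when the named <processor> element has active="yes".

    Hypothetical helper: it only knows the structure shown in this post.
    """
    root = ET.fromstring(xml_text)
    for processor in root.iter("processor"):
        if processor.get("name") == name:
            return processor.get("active", "").lower() == "yes"
    return False  # the processor is not listed at all
```

To check a real installation, read the text of optionalprocessing.xml from the DocumentProcessor directory named in step 5b and pass it to `is_processor_active`.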

6.      The FAST Pipeline processor stage “SearchExportConverter” is the stage which is invoked by the AdvancedFilterPack.

7.      The SearchExportConverter process uses two configuration files:
a.      user_converter_rules.xml, which is used to extend the pipeline to use additional IFilters (http://fs4sp.blogspot.com/2011/04/fs4sp-and-userconverterrulesxml.html)
b.      converter_rules.xml.  This file holds the information as to what is included in the OOB file types and the Advanced Filter Pack file types

8.      Inspect the converter_rules.xml
a.      Use Windows Explorer to navigate to <FAST Install Drive>\FASTSearch\etc\formatdetector
b.      Open converter_rules.xml using Internet Explorer, WordPad, or Notepad
c.      Inspect the <filetypes> node under the <OutsideIn> node
d.      Any file type where process=”true”  is a file type which is covered under the Advanced Filter Pack
e.      Any file type where process=”false” is a file type which is covered OOB.

9.      Let’s find out if Autodesk file types and PDFs are covered.  The extension for AutoCAD is .dwg

10.   Search for “dwg” or “AutoCAD”
a.      Side Note: You may have to look at the comments as not all file types have an associated mime type.

11.   Search for “pdf” or “Adobe”

<ConverterRules>
  <OutsideIn>
    <filetypes>
      ....
      <file id="1557" process="false" mimetype="application/pdf"/> <!-- Adobe Acrobat (PDF) -->
      ....
      <file id="1552" process="true" mimetype="image/vnd.dwg"/> <!-- AutoCAD Drawing 12 -->
      <file id="1553" process="true" mimetype="image/vnd.dwg"/> <!-- AutoCAD Drawing 13 -->
      ....
    </filetypes>
    ....
  </OutsideIn>
  ....
</ConverterRules>

12.   Note that process=”false” for PDF (covered OOB) and process=”true” for dwg (covered by the Advanced Filter Pack).

13.   Inspect the <ignore> node under the <OutsideIn> node. If it looks familiar, it is because it is essentially the same list as displayed from TechNet in #1.  It also happens to closely coincide with the <filetypes> node where process=”false”

        <ignore>
            <!--
            A list of extensions that OutsideIn should ignore. NOTE, the extension
            here is the format detected (or bypassed) extension of a document and may not
            necessarily correspond to the URL extension.
            -->
            <ext>.doc</ext>
            <ext>.docm</ext>
            <ext>.docx</ext>
            <ext>.dotx</ext>
            <ext>.dot</ext>
            <ext>.eml</ext>
            <ext>.html</ext>
            <ext>.mht</ext>
            <ext>.msg</ext>
            <ext>.nws</ext>
            <ext>.odp</ext>
            <ext>.ods</ext>
            <ext>.odt</ext>
            <ext>.one</ext>
            <ext>.pdf</ext>
            <ext>.pot</ext>
            <ext>.pps</ext>
            <ext>.ppt</ext>
            <ext>.pptm</ext>
            <ext>.pptx</ext>
            <ext>.pub</ext>
            <ext>.rtf</ext>
            <ext>.txt</ext>
            <ext>.vdx</ext>
            <ext>.vsd</ext>
            <ext>.vss</ext>
            <ext>.vst</ext>
            <ext>.vtx</ext>
            <ext>.xlb</ext>
            <ext>.xlc</ext>
            <ext>.xls</ext>
            <ext>.xlsb</ext>
            <ext>.xlsm</ext>
            <ext>.xlsx</ext>
            <ext>.xlt</ext>
            <ext>.xml</ext>
            <ext>.xps</ext>
            <ext>.zip</ext>
        </ignore>
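
The inspection in steps 8 to 13 can be automated along these lines. This is a minimal Python sketch, not a supported tool: it assumes the `<ConverterRules>`/`<OutsideIn>`/`<filetypes>`/`<ignore>` layout shown in the fragments above, and `SAMPLE` is a trimmed, made-up stand-in for the real converter_rules.xml. The function name `summarize_rules` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for converter_rules.xml, built from the fragments above.
SAMPLE = """<ConverterRules>
  <OutsideIn>
    <filetypes>
      <file id="1557" process="false" mimetype="application/pdf"/>
      <file id="1552" process="true" mimetype="image/vnd.dwg"/>
      <file id="1553" process="true" mimetype="image/vnd.dwg"/>
    </filetypes>
    <ignore>
      <ext>.doc</ext>
      <ext>.pdf</ext>
    </ignore>
  </OutsideIn>
</ConverterRules>"""


def summarize_rules(xml_text):
    """Split file types by their process attribute and list the ignored extensions."""
    outside_in = ET.fromstring(xml_text).find("OutsideIn")
    advanced, oob = set(), set()
    for entry in outside_in.findall("./filetypes/file"):
        # process="true" means the Advanced Filter Pack handles the type (step 8d).
        target = advanced if entry.get("process") == "true" else oob
        # Not every entry has a mime type (see the side note in step 10),
        # so fall back to the numeric id.
        target.add(entry.get("mimetype") or "id:" + entry.get("id", "?"))
    ignored = [ext.text.strip() for ext in outside_in.findall("./ignore/ext")]
    return sorted(advanced), sorted(oob), ignored
```

Because the format names live in XML comments, which this parser discards, the sketch reports mime types only; searching the file by hand, as in steps 10 and 11, is still the way to match a product name.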

Conclusion: Understanding the processor stages and how they work within the pipeline, plus a little research, can reveal a lot of information.   If you are wondering whether a file type is covered by the AdvancedFilterPack, converter_rules.xml is the place to start. There are 456 different file types covered by the Advanced Filter Pack (in some flavor, e.g. WordStar 5.0, 4.0, & 2000 or, as in this example, 2 versions of AutoCAD), which is way too many to list in the blog.  Though this file provides a list of the file types covered by the Advanced Filter Pack, the best way to know if a file type is covered is to try to index it with the Advanced Filter Pack enabled.

KORITFW

Tuesday, April 26, 2011

SharePoint 2010 Scalability and Host Headers

Topic: SharePoint 2010, WFE, Host Headers, Scalability, Content Databases, Multiple root Site Collections, MOSS Database Migration
Subject:  Using Host Headers for Scalability.
Problem: We have multiple WFEs and multiple Web Applications.  What is the best way to scale my SharePoint WFEs?
Response: There are a number of ways to architect and implement your SharePoint Web Applications.  Some turn out to be more scalable than others.  This solution\example will focus on the most scalable solution and also touch on an implementation in which a company may be using a “Search First” approach.  In a “Search First” approach, a company already has an existing “MOSS” farm and is not ready to migrate it to SharePoint 2010, but wants to take advantage of the enhanced capabilities of SharePoint 2010 or FS4SP Enterprise Search.  This approach allows the customer to get the quick benefit of implementing “Search First” and tackle the migration of applications later.  I added a lot to this blog to make sure multiple scenarios are covered and not just the one I wanted to blog about.
Do: Use Host Headers within SharePoint 2010 to allow for multiple root site collections contained within a single Web Application.  This is the most scalable solution with minimal maintenance but a little more manual work upfront.
Don’t:  Set up Host Headers in IIS.  Doing so requires manually adjusting all Web Front Ends to keep them in sync when making changes or adding WFEs to the farm.
Don’t: Use multiple ports to host separate Applications.  Each Web Application will create a separate IIS site.   These sites will require more resources than a single IIS Site handling multiple Applications leading to more application pool recycles. It is also more difficult to perform load balancing on ports other than 80. If you don’t really need load balancing but you want fail-over this can be performed using DNS.  This will not work for ports other than 80.

Solution/Example:
The only thing that distinguishes Web Front End servers from Application Servers in a SharePoint 2010 farm is the services running on them.  Any server which has the service “Microsoft SharePoint Foundation Web Application” enabled is considered a WFE.  When a Web Application is created, its definition is replicated to each server with this service enabled.  Servers which will not be used as a WFE should not have this service enabled, as it will slow deployment and consume resources.
In this example we will use 2 servers designated as WFEs for the fictitious ACME Corporation, which will host 3 applications (each one a host header Site Collection).  I will use a simple DNS Round Robin to simulate Load Balancing/Fail-over.  DNS will not perform true load balancing, as it does not use the load on a server to determine which server a user is directed to.  DNS uses the proximity of the user's IP to the server IPs to determine which server a user is directed to. If one server is unavailable, all traffic will be directed to the available server.  NLB (Windows 2008 Network Load Balancing), F5, NetScaler or other load balancing software/hardware solutions can be used to replace the DNS in this example.
WFEs:
1.      ACMEWFE01
2.      ACMEWFE02
Applications:
1.      ACMESEARCH
2.      ACMEPORTAL
3.      ACMEMIGRATED

1.      Create a new Web Application on Port 80.  (Do not create a Site Collection)

2.      Optional:  I like to use an HTML trick to help with future debugging issues.
a.      On each WFE, add a file called “test.html” to the root of the IIS site on port 80, typically found at C:\inetpub\wwwroot\wss\VirtualDirectories\80.   Fill in the appropriate server name for each WFE.

test.html
<html>
<B>ACMEWFE01</B>
</html>

3.      Create 2 new “Host (A or AAAA)…” records in the DNS Forward Lookup Zone, one for the IP address of each WFE, both using the same name.
a.      NAME: ACMESP

4.      Depending on your Domain this may take time to replicate.  Once you can ping ACMESP you are ready to continue.

5.      Modify the Alternate Access Mapping to use the new load-balanced name “ACMESP”
a.      Central Administration -> System Settings -> Configure alternate access mappings

b.      Select the SharePoint – 80 Web Application

c.      Edit the Public URL

d.      Change the Default URL from http://<yourserver>:80 to http://acmesp

6.      Open a browser from your desktop and point to: http://acmesp/test.html
a.      The response will be ACMEWFE01 or ACMEWFE02

b.      I always suggest browsing from your desktop and not the actual server
                                                    i.     You can run into issues accessing from the server
                                                   ii.     I don’t want to add more to this lengthy blog covering these issues
                                                  iii.     The end user will not be using the server to access SharePoint

c.      This will show which WFE is the rendering server.

d.      This is a useful trick which can be used when dealing with end users who are faced with an application error.  Have the user point to this address and you will have a starting point for which Hive Logs to peruse first.
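
The test.html check in step 6 can be repeated from a script to see how requests are spread across the WFEs. The following is a minimal Python sketch under the assumptions of this walkthrough: the page body is just the server name wrapped in tags, and the names `fetch`, `server_name` and `tally_wfes` are made up for illustration.

```python
import re
from collections import Counter
from urllib.request import urlopen


def fetch(url):
    """Download a page and return its text (default fetcher, needs network access)."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8", errors="replace")


def server_name(html):
    """Strip the tags from test.html, leaving only the server name."""
    return re.sub(r"<[^>]+>", "", html).strip()


def tally_wfes(url, attempts=10, fetch_page=fetch):
    """Request the test page several times and count which WFE rendered it."""
    return Counter(server_name(fetch_page(url)) for _ in range(attempts))
```

From a desktop, `tally_wfes("http://acmesp/test.html")` returns a count per server name. Clients usually cache DNS answers, so a run from a single machine may show only one WFE; that is expected with DNS Round Robin and does not by itself indicate a problem.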

7.      Create the 1st Host Header
a.      From SharePoint Shell as Administrator issue:
                                                    i.     New-SPSite -Url "http://acmesearch" -OwnerAlias "domain\user" -HostHeaderWebApplication "http://acmesp"

b.      The HostHeaderWebApplication can be looked up under CA -> Web Application. Use the Url for port 80 if you did not follow step #5.

8.      Add 2 new “Host (A or AAAA)…” records in the DNS Forward Lookup Zone, one for the IP address of each WFE, both using the same name.
                                                    i.     NAME: ACMESEARCH

9.      Depending on your Domain this may take time to replicate.  Once you can ping ACMESEARCH you are ready to continue.

10.   When the DNS entries have propagated, open a browser (from a desktop) to http://acmesearch
a.      The first time you access the URL you will be prompted to choose a Site Template

b.      Note: We could have specified the Template within the New-SPSite command

c.      In this case select a FAST Search Center or Enterprise Search Center

d.      Make sure the appropriate “Search Service Application Proxy” is set for the Template you choose.

11.   Let’s take a look at what we created from a Content Database Perspective
a.      From the SharePoint Shell as Administrator Issue:
                                                    i.     Get-SPContentDatabase -WebApplication http://acmesearch

              Id                             : 3da80c46-d0fd-464d-9aaa-f77a379f905f
              Name                      : WSS_Content
              WebApplication   : SPWebApplication Name=SharePoint - 80
              Server                    : <your database server>
              CurrentSiteCount : 1

12.   The New-SPSite command created the Application with the default “WSS_Content” database.
a.      This is perfectly acceptable, but what happens if we are using the “Search First” approach I mentioned earlier and want separate content databases for the different applications?

b.      From the SharePoint Shell as Administrator issue:
                                                    i.     New-SPSite -Url "http://acmeportal" -OwnerAlias "domain\user" -HostHeaderWebApplication "http://acmesp"

c.      From the SharePoint Shell as Administrator issue:
                                                    i.     Get-SPContentDatabase -WebApplication http://acmeportal

Id                             : 3da80c46-d0fd-464d-9aaa-f77a379f905f
Name                      : WSS_Content
WebApplication    : SPWebApplication Name=SharePoint - 80
Server                     : <your database server>
CurrentSiteCount : 2

                                                   ii.     Both Host Header applications point to the same Content Database
                                                  iii.     On the Get-SPContentDatabase command both http://acmeportal and http://acmesearch will return the same results as both resolve to the same Web Application

d.      From the SharePoint Shell as Administrator issue:
                                                    i.     New-SPContentDatabase -Name "WSS_ACMEPORTAL" -WebApplication http://acmeportal
                                                   ii.     Get-SPContentDatabase -WebApplication http://acmeportal

Id                            : 42ee9865-82bf-4356-b1ee-1c241b626a29
Name                     : WSS_Content
WebApplication   : SPWebApplication Name=SharePoint - 80
Server                    : <your database server>
CurrentSiteCount : 2

Id                             : 9acc2238-3bfb-4a22-acd5-d76060ff97c8
Name                     : WSS_ACMEPORTAL
WebApplication   : SPWebApplication Name=SharePoint - 80
Server                    : <your database server>
CurrentSiteCount : 0

                                                  iii.     Closer, but definitely not what we are looking for.  I chose these steps to show:
1.      Order matters
2.      Regardless of host headers, we are still using one Web Application

e.      From SharePoint Shell as Administrator issue:
                                                    i.     Remove-SPSite -Identity http://acmeportal
                                                   ii.     New-SPSite -Url "http://acmeportal" -OwnerAlias "domain\user" -HostHeaderWebApplication "http://acmesp" -ContentDatabase "WSS_ACMEPORTAL"
                                                  iii.     Get-SPContentDatabase -WebApplication http://acmeportal

Id                            : 42ee9865-82bf-4356-b1ee-1c241b626a29
Name                     : WSS_Content
WebApplication   : SPWebApplication Name=SharePoint - 80
Server                    : <your database server>
CurrentSiteCount : 1

Id                            : 9acc2238-3bfb-4a22-acd5-d76060ff97c8
Name                     : WSS_ACMEPORTAL
WebApplication   : SPWebApplication Name=SharePoint - 80
Server                    : <your database server>
CurrentSiteCount : 1

                                                  iv.     The results are much better
1.      1 Web Application
2.      2 Host Headers (2 root site collections)
3.      2 Content Databases

13.   Repeat Steps 8, 9 & 10 replacing the name “ACMESEARCH” with “ACMEPORTAL” and selecting the desired Template for the new Portal.

14.   To use an existing Content Database the steps are a little bit different
a.      In this example I am using a MOSS 2007 database which will be migrated to SharePoint 2010 and converted to use Host Headers

15.   Restore an existing Content Database to the SharePoint database server (SQL Server) under the name “WSS_Content_Migrated”
a.      I will use a MOSS Content Database to show the steps to migrate a database as well, but we could just as easily be moving from a SharePoint 2010 development environment to a production environment.

b.      Note: To avoid Migration issues initially you can create a new Web Application in MOSS and add a simple document library

c.      Backup the existing MOSS Content Database

d.      Note: Do not use a Content DB from the same farm as the Id will already exist on the farm upon mounting the database

e.      Note: Do not use a Content DB from a farm which has a higher CU patch level than the target farm, or a schema error will be thrown when mounting the database

16.   Create a Web Application on a different Port
a.      I created an Application on Port 66

b.      For ease, name the Content Database “WSS_Content_66”

c.      Create a Site Collection. Any template

17.   From the SharePoint Shell as Administrator issue:
a.      Dismount-SPContentDatabase -Identity "WSS_Content_66"
                                                    i.     “WSS_Content_66” is the Content Database created in Step 16b.
                                                   ii.     You can also issue:
1.      Get-SPContentDatabase -WebApplication "http://<yourserver>:66" to retrieve the Name

b.      Mount-SPContentDatabase -Name "WSS_Content_Migrated" -WebApplication "http://<yourserver>:66"
                                                    i.     Errors will appear if problems exist

c.      Test-SPContentDatabase -Name "WSS_Content_Migrated" -WebApplication "http://<yourserver>:66"
                                                    i.     Errors or warnings will appear regarding issues which may need to be resolved before or after the migration

18.   Check the Migrated Site
a.      Open a browser to your Web Application “http://<yourserver>:66”

b.      Verify the Migration was successful

19.   Back up the Site Collection in the new Web Application
a.      From the SharePoint Shell as Administrator issue:
                                                    i.     Backup-SPSite -Identity http://<yourserver>:66 -Path C:\Backup\Test66.bak

20.   Create the New Host Header Application
a.      From the SharePoint Shell as Administrator issue:
                                                    i.     New-SPContentDatabase -Name "WSS_ACMEMIGRATED" -WebApplication "http://acmesp"
                                                   ii.     New-SPSite -Url "http://acmemigrated" -OwnerAlias "domain\user" -HostHeaderWebApplication "http://acmesp" -ContentDatabase "WSS_ACMEMIGRATED"

21.   Restore the Migrated Site to the new HostHeader
a.      From the SharePoint Shell as Administrator issue:
                                                    i.     Restore-SPSite -Identity "http://acmemigrated" -Path C:\backup\Test66.bak -HostHeaderWebApplication "http://acmesp" -Force -DatabaseName "WSS_ACMEMIGRATED"

22.   Repeat Steps 8 & 9 replacing the name “ACMESEARCH” with “ACMEMIGRATED”

23.   Use a browser (from a desktop) to test all 3 Host Header Site Collections created on Port 80
a.      ACMESEARCH
b.      ACMEPORTAL
c.      ACMEMIGRATED


Conclusions:  I could simply have put a few commands in this blog and wished you good luck, but then I wouldn’t have written the blog.  If you have read this far and attempted to implement the steps, you should have a pretty good understanding of Host Headers and how to make them fit your situation, whether you are starting with SharePoint 2010 from scratch or migrating to SharePoint 2010 from MOSS.  I am not saying it is a bad practice to create Web Applications on multiple ports, but if the SharePoint farm is going to grow over time, the most scalable solution is to use Host Headers.
Side Notes: 
1.      How do Host Headers work behind the scenes?  When setting up multiple root site collections on a single port using Host Headers, SharePoint directs the user to the appropriate Site Collection based on the host name in the URL submitted on that port.  In our example, SharePoint will direct the user to the ACMESEARCH site collection based on the access URL http://acmesearch.

2.      When Migrating Content Databases and using Host Headers you may need to create managed paths if you have a Web Application which has multiple Content databases

a.      New-SPManagedPath "<path>" -HostHeader (-HostHeader is a switch that takes no value; the managed path then applies to host header site collections)
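
The host header lookup described in the first note above can be pictured as a simple table lookup. This is a conceptual Python sketch only, not SharePoint's actual implementation; the table `HOST_HEADER_SITES` and the function `resolve_site_collection` are hypothetical, and the database names are the ones used in this walkthrough.

```python
from urllib.parse import urlsplit

# Hypothetical routing table: host header -> content database from this walkthrough.
# All three site collections live in the single port 80 Web Application.
HOST_HEADER_SITES = {
    "acmesearch": "WSS_Content",
    "acmeportal": "WSS_ACMEPORTAL",
    "acmemigrated": "WSS_ACMEMIGRATED",
}


def resolve_site_collection(url, sites=HOST_HEADER_SITES):
    """Pick the site collection from the host name of the requested URL."""
    host = (urlsplit(url).hostname or "").lower()
    if host not in sites:
        raise LookupError("no host header site collection for " + repr(host))
    return host, sites[host]
```

The point of the sketch is that the port and the Web Application are the same for every request; only the host name in the URL decides which root Site Collection, and therefore which Content Database, serves it.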

 KORITFW