12 Replies Latest reply on Jun 5, 2012 6:54 AM by hchiorean

    Querying contents of delimited file

    eliquious

      Hi,

       

      I'm just getting started with ModeShape and find the technology fascinating. I've followed the docs and the beginner's guide on the forum but still need a bit of help. I've modified the beginner's guide example to add a text sequencer, but I can't figure out how to query for the information in the file. It's a comma-delimited file, and from what I've read the content also ends up in the repository, not just the metadata. Please correct me if I'm wrong.

       

      I'm using 2.8.1.Final.

       

      Can someone please tell me how to query the contents of the file once it's been added to a repository?

        • 1. Re: Querying contents of delimited file
          hchiorean

          Hi,

           

          Assuming the sequencing was successful, the structure of the output should be similar to what's defined here: https://docs.jboss.org/author/display/MODE28/Text+Sequencers

          To query this content, one possibility is to build and execute a JCR-SQL2 query, like so:

           

                  Query query = session.getWorkspace().getQueryManager().createQuery(
                      "select * from [text:column] where [text:data] like 'string'", Query.JCR_SQL2);
                  QueryResult result = query.execute();
                  RowIterator iter = result.getRows();
                  while (iter.hasNext()) {
                      Row row = iter.nextRow(); // advance the iterator, otherwise the loop never ends
                      // your code
                  }

           

          The query can be anything you need; take a look at https://docs.jboss.org/author/display/MODE28/Querying+and+Searching+using+JCR for more information.

          • 2. Re: Querying contents of delimited file
            eliquious

            Thanks for the help. At first I thought I just didn't know how to query, which I didn't. But as it turns out, I'm not doing the sequencing correctly either, because when I execute a query similar to the one you posted I get something like this:

             

            Exception in thread "main" javax.jcr.RepositoryException: Table 'text:column' does not exist
            Table 'text:column' does not exist
                      at org.modeshape.jcr.query.qom.JcrAbstractQuery.checkForProblems(JcrAbstractQuery.java:133)
                      at org.modeshape.jcr.query.JcrQuery.execute(JcrQuery.java:104)
                      at TestJCR.doExample(TestJCR.java:79)
                      at TestJCR.main(TestJCR.java:27)

             

            Which basically tells me I'm still not loading the file correctly. Here's the sequencer from my config and the code I'm using to load it:

             

            Config:

             

               <mode:sequencers>
                   <mode:sequencer jcr:name="Delimited Text Sequencer">
                       <mode:classname>
                        org.modeshape.sequencer.text.DelimitedTextSequencer
                       </mode:classname>
                       <mode:description>Sequences delimited files to extract values</mode:description>
                       <mode:pathExpression>//(*.(csv)[*])/jcr:content[@jcr:data] => /csv/$1</mode:pathExpression>
                       <mode:splitPattern>,</mode:splitPattern>
                   </mode:sequencer>
               </mode:sequencers>

             

            Java:

             

                    final JcrConfiguration jcrConfig = new JcrConfiguration();       // org.modeshape.jcr.JcrConfiguration
                    final JcrEngine engine = jcrConfig.loadFrom(configFile).build(); // org.modeshape.jcr.JcrEngine;
                    engine.start();

                    final JcrRepository repo = engine.getRepository(REPO_NAME);      // org.modeshape.jcr.JcrRepository
                    final Session session = repo.login(WORKSPACE_NAME);              // javax.jcr.Session

                    final Node rootNode = session.getRootNode();
                    final Node testFolder = rootNode.getNode("test_folder");
                    final Node fileNode = testFolder.addNode("people.csv", "nt:file");

                    session.save();

                    final String jql = "SELECT * FROM [text:column]";

                    final QueryManager queryManager = session.getWorkspace().getQueryManager();
                    final Query query = queryManager.createQuery(jql, Query.JCR_SQL2);
                    final QueryResult queryResult = query.execute();

             

            As I said in my first post, I'm just trying to modify the "beginner's guide" code in this forum to get comfortable using ModeShape. I'm not sure how ModeShape distinguishes [nt:file] types from the delimited file types. I did look in the docs at the [text:column] and [text:row] types, but I'm not sure where they go or how to use or specify them.

             

            Any other pointers you (or anyone else) could provide would go a long way in helping me get more familiar with ModeShape.

             

            Thanks!

            • 3. Re: Querying contents of delimited file
              hchiorean

              The sequencing configuration looks right, but there is something missing as far as the file content goes - something along the lines of:

               

              Node content = fileNode.addNode("jcr:content", "nt:resource"); // creates the jcr:content child of the people.csv node

              content.setProperty("jcr:data", session.getValueFactory().createBinary(... /* an input stream over the actual file */)); // this is the property the sequencer is configured to look at

               

              Once you have those in place and call session.save(), there is another "small trick": since sequencing is done asynchronously, there may be a small gap (a matter of milliseconds) before the content actually appears in the repository.

              In ModeShape 3 we've extended the JCR event model so that clients can be notified when sequencing is completed, but in 2.x you have to do a Thread.sleep(ms) to make sure the content is written to the repository.

               

              After that, the first thing you can do to make sure the sequencing was successful is to check that session.getNode("/csv/people.csv") != null.
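
              As a side note on the Thread.sleep(ms) workaround: rather than one long fixed sleep, the wait can be written as a short polling loop with a timeout. The following is only a minimal sketch in plain Java; the names SequencingWait and waitUntil are made up for illustration and are not part of ModeShape. In real code the condition passed in would be some check against the repository, for example whether a node exists at /csv/people.csv (JCR 2.0's Session.nodeExists could serve for that, wrapped to handle its checked exception).

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class SequencingWait {

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition became true in time, false on timeout.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long start = System.nanoTime();
        long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            Thread.sleep(pollMillis); // give up the CPU between checks instead of spinning
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long begin = System.currentTimeMillis();
        // Stand-in for "the sequenced node exists": becomes true after about 200 ms.
        boolean found = waitUntil(() -> System.currentTimeMillis() - begin >= 200, 5_000, 50);
        System.out.println(found ? "sequenced output is available" : "timed out waiting for sequencing");
    }
}
```

              The poll interval trades latency against load: a few tens of milliseconds is usually plenty for a small file, while the timeout keeps a failed sequencing run from hanging the program.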

              • 4. Re: Querying contents of delimited file
                eliquious

                Horia,

                 

                Thanks for the help so far. I think the sequencing is working now. Kinda.

                 

                But there are a couple questions I have.

                 

                1. I used session.getNode("/csv/people.csv") != null. It works just fine and returns true. Then I tried to query the file with this: "SELECT [text:column] FROM [nt:file] where [jcr:path] = '/csv/people.csv'". It returns this:

                 

                    /csv/people.csv {jcr:createdBy=<anonymous>, jcr:primaryType=nt:file, jcr:created=2012-05-10T00:17:36.038-05:00, }
                        jcr:createdBy=<anonymous>
                        jcr:primaryType=nt:file
                        jcr:created=2012-05-10T00:17:36.038-05:00

                 

                But I still can't get to the rows and columns in the file. The output above was created by using the NodeIterator and PropertyIterator. So now that the sequencer seems to be working, am I querying correctly? What else do I need to do?

                 

                2. This relates to memory usage. Apparently after the file is sequenced, there seems to be a jump in memory consumption. I'm not sure if the sequencing is still going on in the background or what, but it grows to a GB then runs out of memory. I could just increase the heap size, but was curious as to what was causing it. The CSV is only 9MB with 500k rows and 5 cols. Any ideas on why this is happening?

                 

                Again, thanks so much.

                • 5. Re: Querying contents of delimited file
                  hchiorean

                  Hi,

                   

                  Regarding 1: you need to query something like "select [text:data] from [text:column]" to make sure you actually get the content from the file.

                  Regarding 2: I don't know what to say offhand (we haven't encountered this so far). Sequencing shouldn't have any memory impact, so maybe something else is going on behind the scenes (indexing, perhaps). It would help if you could profile this problem to see which parts of the system are causing it.

                  • 6. Re: Querying contents of delimited file
                    eliquious

                    Horia,

                     

                    I'm still looking into the memory usage, but I've attached what I've been able to gather so far. It's a snapshot taken before it crashed with an OutOfMemoryError. I increased the heap to 2048m and it still crashed. I'm not sure how helpful the memory profile will be, but I posted it anyway.

                     

                    As for querying, I still get the same error as before, Table 'text:column' does not exist, when using SELECT * FROM [text:column]. I'll try again tomorrow; if I still can't get it working, I might post my code. After that I'll probably start playing with v3 Alpha4; maybe I'll have better luck with the new version.

                     

                     

                    Sorry for the headache,

                     

                    eliquious

                    • 7. Re: Querying contents of delimited file
                      hchiorean

                      Thanks for investigating the memory usage. I had a quick look at the attached file and it's a bit worrying that SequencerOutputMap$PropertyValue takes up about 300MB of heap memory.

                       

                      It would be great if you could post the code snapshot and also, if possible, the file being sequenced - this may help reproduce the memory problem locally. Also, please post the modeshape-config used in this case.

                      • 8. Re: Querying contents of delimited file
                        eliquious

                        Sure, no problem. I've attached the code with the config file. The CSV is just a randomly generated file with 5 columns and 500k rows. Nothing special. I've also included the output of the program, which includes the query error.

                        • 9. Re: Querying contents of delimited file
                          hchiorean

                          I've just taken a look at the attached files:

                           

                          *) The memory increase is "normal" in the sense that it's a 9MB text file with 500k rows. To change the memory footprint we would actually have to redesign the sequencing output API, which I don't think is in scope for 2.x (we're only doing patch releases for 2.x). However, in 3.x we should be a lot better in this area.

                           

                          **) Since you're using a file system connector, the only allowed types are nt:file, nt:folder and nt:resource (see https://docs.jboss.org/author/display/MODE28/File+System+Connector for the whole documentation). This means that even though the sequencing is happening behind the scenes, none of the nodes are actually being saved into the repository. If you've configured logging, you should have an error like: " org.modeshape.graph.connector.RepositorySourceException: Primary type "beginners_source" for path "nt:unstructured" in workspace "/" in beginners_workspace is not valid for the file system connector.  Valid primary types are nt:file, nt:folder, nt:resource, and dna:resouce."
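
                          On the memory point (*), a rough back-of-the-envelope calculation shows the scale of the overhead. This is only a sketch built from the figures quoted in this thread (a 9MB file, 500k rows, 5 columns, and roughly 300MB of SequencerOutputMap$PropertyValue in the heap snapshot). It assumes, purely for the sake of the estimate, that this heap usage is spread evenly over one value per cell; the class name SequencingMemoryEstimate is made up.

```java
final class SequencingMemoryEstimate {

    static final long BYTES_PER_MB = 1024L * 1024L;

    /** Average bytes per value when totalMb megabytes are spread over rows * columns values. */
    static double bytesPerValue(long totalMb, long rows, long columns) {
        return (double) (totalMb * BYTES_PER_MB) / (rows * columns);
    }

    public static void main(String[] args) {
        long rows = 500_000;
        long columns = 5;                                 // 2.5 million cells in total
        double raw = bytesPerValue(9, rows, columns);     // the CSV as it sits on disk
        double heap = bytesPerValue(300, rows, columns);  // the PropertyValue share of the heap
        System.out.printf("raw: %.1f bytes/value, heap: %.1f bytes/value, ratio: %.0fx%n",
                raw, heap, heap / raw);
    }
}
```

                          Under those assumptions each cell averages under 4 bytes on disk but well over 100 bytes once held as an in-memory object, so a heap footprint tens of times larger than the file is not surprising when the whole output is built in memory at once.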

                           


                          Hope this helps

                          • 10. Re: Querying contents of delimited file
                            eliquious

                            I'll take that as good news. I was already planning on going with 3.x. I saw that error as well but wasn't sure how to remedy it. I'll look into using the other connectors.

                             

                            I'm much more knowledgeable about ModeShape now than I was before. Thanks for all the help and suggestions.

                            • 11. Re: Querying contents of delimited file
                              eliquious

                              Hey Horia,

                               

                              I've started playing with v3.0.0.Alpha4 and have succeeded in sequencing the test file included with ModeShape. However, when I query for the file contents, no results are returned. I can iterate through the nodes, rows and columns, but querying doesn't return the data from the file. As you might notice, I've taken most of the code below from the text sequencing test code.

                               

                              public static void main(String[] args) throws Exception {
                                  String REPO_NAME = "Test Repository";
                                  String filename = "multiLineCommaDelimitedFile.csv";
                                  String filePath = "delimited/" + filename;
                                  RepositoryConfiguration config = RepositoryConfiguration.read("config/repo-config.json");

                                  // Verify the configuration for the repository ...
                                  Problems problems = config.validate();
                                  if (problems.hasErrors()) {
                                      System.err.println("Problems starting the engine.");
                                      System.err.println(problems);
                                      System.exit(-1);
                                  }

                                  System.out.println("Starting");
                                  JcrEngine engine = new JcrEngine();
                                  engine.start();
                                  JcrRepository repository = engine.deploy(config);
                                  Session session = repository.login(); // javax.jcr.Session; login assumed here, it was missing from the posted snippet
                                  Node rootNode = session.getRootNode();

                                  System.out.println("Sequencing");
                                  Node parent = rootNode;
                                  String[] paths = filePath.split("/");
                                  for (int i = 0; i < paths.length - 2; i++) {
                                      parent = parent.addNode(paths[i], JcrConstants.NT_FOLDER);
                                  }
                                  parent = parent.addNode(paths[paths.length - 1], JcrConstants.NT_FILE);

                                  Node content = parent.addNode(JcrConstants.JCR_CONTENT, JcrConstants.NT_RESOURCE);
                                  File file = new File(filePath);
                                  Binary binary = (((javax.jcr.Session) session).getValueFactory().createBinary(new FileInputStream(file)));
                                  content.setProperty(JcrConstants.JCR_DATA, binary);
                                  session.save();

                                  Node n = getSequencedNode(rootNode, filePath, 5);
                                  if (n == null)
                                      throw new Exception("Sequencing took longer than 5 seconds.");
                                  System.out.println(n);

                                  System.out.println("Querying");
                                  final String jql = "SELECT * from [text:column]";
                                  Workspace ws = session.getWorkspace();
                                  QueryManager queryManager = ws.getQueryManager(); // javax.jcr.query.QueryManager
                                  Query query = queryManager.createQuery(jql, Query.JCR_SQL2);
                                  QueryResult queryResult = query.execute();
                                  System.out.println(queryResult);

                                  System.out.println("Shutting down...");
                                  Future<Boolean> future = engine.shutdownRepository(REPO_NAME);
                                  future.get();
                                  engine.shutdown();

                                  System.out.println("Exiting...");
                                  System.exit(0);
                              }

                              public static Node getSequencedNode(Node parentNode, String path, int maxWaitTimeSeconds) throws Exception {
                                  //TODO author=Horia Chiorean date=12/14/11 description=Change this hack once there is a proper way (events) of retrieving the sequenced node
                                  long maxWaitTime = TimeUnit.SECONDS.toNanos(maxWaitTimeSeconds);
                                  long start = System.nanoTime();
                                  Node n = null;
                                  while (System.nanoTime() - start <= maxWaitTime) {
                                      try {
                                          n = parentNode.getNode(path);
                                          if (n != null) {
                                              return n;
                                          }
                                      } catch (Exception e) {
                                          // node not available yet; keep polling until the timeout
                                      }
                                  }
                                  return null;
                              }

                               

                              I find it odd that I can iterate through the nodes, but nothing is found when querying. I've attached my configuration files. Do you have any ideas on what might be causing this behavior?

                              • 12. Re: Querying contents of delimited file
                                hchiorean

                                Hi,

                                 

                                I've had a quick look at the example code: if you look at the sequencing configuration, the path expression is: "pathExpressions" : [ "default:/(*.csv)/jcr:content[@jcr:data] => /delimited" ], which means that only .csv files added directly under the root node will trigger the sequencing.

                                The multiLineCommaDelimitedFile.csv is added under the /delimited/multiLineCommaDelimitedFile.csv path in the repo so the sequencer will not be triggered for it.
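
                                To illustrate the difference, here is a simplified model of the input side of a path expression, written as plain Java regular expressions. It is only a sketch and not how ModeShape actually evaluates path expressions: it assumes that a single "/" matches exactly one level, that "//" matches any number of levels, and that "*" matches any characters within one node name, and it ignores the [@jcr:data] predicate. The class name PathExpressionSketch is made up for this example.

```java
import java.util.regex.Pattern;

final class PathExpressionSketch {

    // Simplified stand-in for "/(*.csv)/jcr:content": exactly one name segment below the root.
    static final Pattern ROOT_ONLY = Pattern.compile("^/([^/]*\\.csv)/jcr:content$");

    // Simplified stand-in for "//(*.csv)/jcr:content": any number of parent segments.
    static final Pattern ANY_DEPTH = Pattern.compile("^(?:/[^/]+)*/([^/]*\\.csv)/jcr:content$");

    static boolean matches(Pattern pattern, String nodePath) {
        return pattern.matcher(nodePath).matches();
    }

    public static void main(String[] args) {
        String atRoot = "/multiLineCommaDelimitedFile.csv/jcr:content";
        String nested = "/delimited/multiLineCommaDelimitedFile.csv/jcr:content";
        System.out.println(matches(ROOT_ONLY, atRoot)); // true
        System.out.println(matches(ROOT_ONLY, nested)); // false
        System.out.println(matches(ANY_DEPTH, atRoot)); // true
        System.out.println(matches(ANY_DEPTH, nested)); // true
    }
}
```

                                In this model a file stored under /delimited is only picked up by the "//" form, which is the same shape as the //(*.(csv)[*]) expression used in the 2.x configuration earlier in this thread.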

                                 

                                If you have a look at https://docs.jboss.org/author/display/MODE/Sequencing you can find out more about how the input - output path expressions can be configured.

                                 

                                The line: Node n = getSequencedNode(rootNode, filePath, 5); will retrieve the original file, not the sequenced output, which is why the code doesn't fail.

                                The SQL2 query is correct (I've tested the query locally, in a test, using a part of the people.csv file you attached earlier) and returns the expected results.

                                 

                                One final note regarding the retrieval of the sequencing output: in 3.0.0.Alpha4 we've already added sequencing events, which are the preferred way of "waiting" for a sequencing operation to finish (instead of suspending the main thread). You can see a code sample of using the event listener here: https://github.com/ModeShape/modeshape/blob/master/modeshape-jcr/src/test/java/org/modeshape/jcr/sequencer/AbstractSequencerTest.java
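
                                The general pattern of waiting on an event rather than sleeping or polling can be sketched with a CountDownLatch. This is plain Java and deliberately does not use the ModeShape event API: the SequencedListener interface and the fake background "sequencer" thread are hypothetical stand-ins, so see the linked test class for the real listener registration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

final class SequencingEventSketch {

    /** Hypothetical callback, standing in for a real sequencing event listener. */
    interface SequencedListener {
        void onSequenced(String outputPath);
    }

    /** Simulates an asynchronous sequencer that reports completion through the listener. */
    static Thread startFakeSequencer(String outputPath, SequencedListener listener) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(100); // pretend to do the sequencing work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            listener.onSequenced(outputPath);
        });
        worker.start();
        return worker;
    }

    /** Blocks until the listener fires or the timeout elapses; returns the output path, or null on timeout. */
    static String awaitSequencing(String outputPath, long timeoutMillis) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        AtomicReference<String> result = new AtomicReference<>();
        startFakeSequencer(outputPath, path -> {
            result.set(path);
            done.countDown(); // wake up the waiting thread
        });
        return done.await(timeoutMillis, TimeUnit.MILLISECONDS) ? result.get() : null;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(awaitSequencing("/delimited/multiLineCommaDelimitedFile.csv", 5_000));
    }
}
```

                                Compared with a busy loop such as getSequencedNode above, the waiting thread consumes no CPU while it waits and resumes as soon as the notification arrives.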