Sublime Forum

Selecting Text Between Strings

#1

Need some help. I’ve got a couple hundred txt files with some JSON in them, but it’s buried inside some markup/code. I’m not a coder; I’m a statistician, and the data I need is the lines of JSON. The good news is that the JSON is located between two unique strings that don’t appear anywhere else on the page. So right before the JSON I’ve got (quotation marks are mine):

"]
, "

Basically I need to be able to select and copy everything between those character strings. Or, perhaps even more simply, just delete everything before and after the character strings (including the strings themselves). That would save me from having to paste the copied text elsewhere.

I thought Sublime might be able to accomplish this, but I’ve played around with a couple of packages and it’s not immediately obvious how I would do this, or how I would do this without knowing some regex (which I don’t; again, outside of R, I’m not much of a coder).

So any suggestions, or even just an existing plugin or snippet that gets me close and that I can modify, would be extremely helpful. Thank you.

#2

You can search and select using a regular expression that matches your sequence.

#3

Thanks for the reply. As I mentioned explicitly in the original post, I don’t know regular expressions.

#4

I’m having a hard time understanding what your files look like from the description. Is the text you want surrounded by “]” and ", " or by two instances of those two lines combined?

Would it be possible to post a sample from your files? Something that includes two or three instances of the text you want and the bits around it.

#5

estmatic:

Thanks for the reply. What I’ve got is a few hundred lines of code (some HTML and JS). The line where I need to start grabbing the JSON is actually inside a var declaration so it looks like:

var initialData = [1111,47,'Station_01',Station_02','03/13/2014 02:30:00','03/16/201',6,'FT','1 : 3','3 : 3',,,'3 : 3']
, [1111,'Station_01',7.05,...

So that’s two lines. If the string I search for includes the close bracket at the end of the first line, then I get something unique to the file. In other words, if I just start with the ", " on the second line, it’s not unique, but if I back up and grab the carriage return and the previous close bracket, it’s something that only exists once in the file, so it’s useful.

On a side note, I found a package called “Select Until” on GitHub. It might be an out-of-the-box solution that’s close enough. Using it I can ‘select’ from the top of the text file to the string indicated above, but it’s not selected in the sense that I normally understand. All of that code is just outlined (there is one giant border around all of it, but I can’t do anything to it). What is that behaviour called? And is there a quick way to turn that into a huge amount of selected text I can delete in a couple of keystrokes?

#6

Just for the sake of clarity, I’m trying to grab starting with that second line, so I want my first few characters to be:

[1111,'Station_01',7... etc.

Then that goes until another unique character string.

I’m happy either quickly deleting everything above and below what I need, or getting something that just extracts the JSON I need from the middle of everything. Thanks.

#7

BracketHighlighter plugin has the ability to select or delete and various other things between two brackets (string quotes are included). I frankly am a bit confused at what you are trying to do so I cannot tell if this will help you. But I thought I would throw it out here. If it doesn’t help, hopefully one of these people can help you find what you are looking for. If the issue is more complicated, learning regex may be something to consider for future productivity.

#8

I’ve got a bunch of text files (several hundred). They were all pulled from the web. So it’s a bunch of code (HTML and JS, and A LOT of it), but in the middle of each page is a huge chunk of JSON. I’m trying to isolate the JSON because that’s all I want. If I can do that, I can easily convert the JSON to a csv, move that into R, and be on my way to getting work done. Now, because I’m going to have to do this with several hundred pages, I’m trying to find the easiest and fastest way to select just the JSON on each page.

I’m not sure I can put it in much simpler terms than that.

It’s like that bit in the Cryptonomicon where all of that gold is sitting in the jungle but it’s so inaccessible that it’s useless (although this is maddening because it seems like it should be super easy to get the JSON).

Because I figured out that the JSON sits between two unique character strings, I thought that if I could find a way to tell a text editor, “Hey, just go to Unique Character String 1 and start selecting all that JSON until you get to Unique Character String 2, then stop,” that would be about 95% of the work I need done.

I’m roughly ambivalent between “Hey, select everything before the JSON and delete it, then select everything after the JSON and delete it as well” and “Start at the beginning of the JSON then just select everything until the end of it.”

So that’s it. The code lines in the previous post are just an example of what’s immediately preceding where I’m trying to start. The unique character string that indicates “Hey, this is the end of what we’d like to grab” is slightly different.
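
For readers who would rather script the "go to Unique Character String 1, select until Unique Character String 2" idea than do it in an editor, here is a minimal sketch in Python that uses plain string searching and no regex. The function name `extract_between` and the `<<END>>` marker are made up for illustration; the real end string would go in its place.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text strictly between start_marker and the next end_marker."""
    # str.index raises ValueError if a marker is missing, which flags odd files.
    start = text.index(start_marker) + len(start_marker)
    end = text.index(end_marker, start)
    return text[start:end]


# Hypothetical miniature page; the real files are far longer.
page = "<html>junk A]\n, [1, 2, 3]<<END>>more junk</html>"
print(extract_between(page, "A]\n, ", "<<END>>"))
```

Because both markers are left out of the result, this has the same effect as deleting everything before and after the JSON, markers included.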

#9

Can you post that bit?

Something like this gets the match started but it needs to know where to stop:

(?<=\]\n, )[\s\S]+(?=<end pattern here>)

#10

Gladly. And thanks for the reply.

At the end:

...],'49212',[5,1]
,[2,3]

So if you start at the end of the first line with the “1]” then include the carriage return to the next line of “,[2,3]” that will make a unique string that occurs in each file once (that ‘49212’ is a value that changes in each file, so I need to keep that).

And not to sound completely stupid, but how exactly will I use that code snippet? Is it something that I just use in the search field if I do a cmd-f? And just out of curiosity, does anyone know what that behavior is that is essentially putting a massive outline/fence around all that text but not actually selecting it (that’s when I try to use the “Select Until” package)?

Thanks for the help on this. I’m sure it seems simple enough when it’s something you use all the time, but I don’t code in python, I don’t do regex, I don’t even deal with much HTML. I understand syntax that’s not intuitive (e.g. in R) but when it’s an unintuitive syntax you don’t use, not being able to do something that seems like it should be easy becomes frustrating.

#11

I have to say, statistician, programmer, or something else, if this is something you are commonly going to be dealing with, learning regex is going to be extremely beneficial to you. You really don’t need formal classes to learn things like regex; there are a lot of good websites you can learn from. Learning some kind of scripting language to auto apply the regex and process the data would be even more beneficial. Just like with regex, you can learn almost any language online if you are dedicated and patient. Once you learn one, learning more is even easier. Anyways, it looks like you got some helpful people that can guide you on this specific issue.
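
As a rough illustration of what "a scripting language to auto apply the regex" could look like for a folder of several hundred files, here is a hedged Python sketch. The function names, the folder layout, and the `START`/`END` markers are all assumptions made up for the example; the two unique strings from the real files would replace them.

```python
import re
import tempfile
from pathlib import Path


def extract_chunk(text, start_marker, end_marker):
    """Return the text between the two markers, or None if either is missing."""
    pattern = re.compile(
        re.escape(start_marker) + r"(.*?)" + re.escape(end_marker),
        re.DOTALL,  # let "." cross line breaks
    )
    match = pattern.search(text)
    return match.group(1) if match else None


def process_folder(in_dir, out_dir, start_marker, end_marker):
    """Write one cleaned file per input .txt file; return names with no match."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    skipped = []
    for src in sorted(Path(in_dir).glob("*.txt")):
        chunk = extract_chunk(src.read_text(encoding="utf-8"), start_marker, end_marker)
        if chunk is None:
            skipped.append(src.name)
        else:
            (out_path / src.name).write_text(chunk, encoding="utf-8")
    return skipped


# Tiny demonstration on throwaway files with made-up markers.
with tempfile.TemporaryDirectory() as tmp:
    raw, clean = Path(tmp, "raw"), Path(tmp, "clean")
    raw.mkdir()
    (raw / "a.txt").write_text("<html>START[1, 2]END</html>", encoding="utf-8")
    (raw / "b.txt").write_text("<html>no data here</html>", encoding="utf-8")
    skipped = process_folder(raw, clean, "START", "END")
    results = {p.name: p.read_text(encoding="utf-8") for p in clean.glob("*.txt")}

print(skipped)
print(results)
```

Files where the markers are not found are reported rather than silently dropped, so any page with a different layout can be checked by hand.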

#12

Ok, so the following will match everything between the start and end strings you provided:

(?<=\]\n, )[\s\S]+(?=1\]\n,\[2,3\])

For reference, the two bits between the parentheses are a positive look-behind and a positive look-ahead. When you’re searching for some text, these allow you to match something that precedes or follows your search term without actually including it in your final match or selection. The middle bit, [\s\S]+, is just saying “match anything” in between (any character, including line breaks, one or more times).

Yes, this is what you enter in the Find box (Cmd+F). You just need to make sure that the “Regular expression” toggle is turned on. For me it’s the left-most button that looks like an asterisk (but that may be platform/theme dependent, I’m not sure). Enter the search string, click Find, and it should select the text. Then you can just copy/paste it wherever.

I haven’t tried that “Select Until” extension but that behavior you describe may be something Sublime is doing itself. If you highlight a word in Sublime then it will put that outline around any other instances of that word within the document. It also does this in the Find panel to show if anything matches the search before you actually hit “Find”.
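
For anyone who wants to check what that pattern captures before running it on real files, here is a small sketch using Python's `re` module rather than Sublime's Find box. The `sample` text is a shortened, made-up stand-in for one of the pages, built from the snippets quoted earlier in the thread.

```python
import re

# The same pattern as in the Find box, written as a Python raw string.
PATTERN = re.compile(r"(?<=\]\n, )[\s\S]+(?=1\]\n,\[2,3\])")

# Made-up, shortened stand-in for one page.
sample = (
    "var initialData = [1111,47,'3 : 3']\n"
    ", [1111,'Station_01',7.05],'49212',[5,1]\n"
    ",[2,3]\n"
    "</script>"
)

match = PATTERN.search(sample)
print(match.group(0))
# prints: [1111,'Station_01',7.05],'49212',[5,
```

Note that, as written, the look-ahead begins at the `1]`, so the match stops just before those two characters. If the closing `1]` is needed, the look-ahead could be shortened to `(?=\n,\[2,3\])`, assuming that shorter string is still unique in each file.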

#13

Okay, read a quickie tutorial on regex. So I think I’ve got the ending string figured out.

So if I just search and click the .* icon that should let me do it by regular expression, correct?

Also one last little hiccup. It is selecting all the text properly, but when I click in the pane with the code, it gets deselected and goes back to that strange state where everything is just outlined (and no longer selected). Not sure what this behavior is even called. Is there a way to navigate between the search field and the main window without clicking? Or how do I just ‘re-select’ when all that text still has the fence thing around it?

#14

You can probably just do F3 and it will re-select it. Then hit Esc to close the Find panel and your text should remain selected.

#15

Ack… didn’t realize the thread had gone to a second page when I posted.

estmatic thank you so much for all your help.

To facelessuser, yeah, I’m not a programmer and this situation is not something I’m usually dealing with (in fact this is a first). Although I often have to clean up data, it’s never come to me in a form where it looked like someone didn’t even scrape it, but just copy-pasted the raw source code, tags, script and all.

I am usually very hesitant to just go post on a forum and say, “Hey, I know nothing about this, can someone do it for me?” But I actually spent the better part of two days first just trying to figure out what tools to use (looked at AppleScript, but that didn’t seem very elegant for any number of reasons; then at TextMate, but it wasn’t immediately obvious how it would work; wasted even more time with TextSoap, TextSpesso, and even Automator). So when I finally got to Sublime I figured it wouldn’t be too difficult, but it’s not a very intuitive app when you first start using it. It’s extremely flexible and seems pretty powerful, but I’m just not in a situation where I often have to go in and change code/script to create a keyboard shortcut and things like that.

Look I am enormously grateful for the help and ultimately I did learn enough about regex to get the back half done (even before I saw the last two replies… and again, thank you estmatic) but ultimately this is not something I anticipate having to do repeatedly. I am very much a ‘learn to fish’ person (and enjoy learning things for their own sake (even code sometimes as I’m self taught on D3)), but at some point once you’ve lost the better part of two days of work for something that (you think) should be simple, just asking for help seemed to be the best way to avoid losing another day.

#16

No worries, but I will bet money everything you learned today will surface one day in the future :smile:.

And I do completely understand the asking. As a programmer, the learning curve is a bit smaller for me when I get into these kinds of problems, just as I am sure statistical analysis problems come much easier to you. I mentioned learning regex because I find it incredibly useful and wish I had learned it a lot earlier than I did. There are a number of programmers who aren’t very good at regex, but once you do get good at it, you find tons of uses for it.

Anyways, glad you are finding a solution.

#17

So I think I’ve got a workflow that will make this doable in a way that isn’t soul crushing.

Thanks again for the help.

(Although the F3 didn’t re-select, and that giant fence-like boundary around the text is baffling. If anyone can shed some light on that, I’d still be interested to understand what’s happening there.)

#18

I am certain you are correct. I’m also thinking it will be far enough in the future that I will have forgotten the bare bones of regex that I learned today, but at least I know now that a refresher of the basics is less than an hour of time.

#19

Ha. Didn’t even make it a day.

Once I got the first page cleaned up there were so many extraneous things buried in inconvenient places that it was obvious I’d need to take a few more passes. So yeah, I’ll be learning and using more regex. I was wrong. My bad. I’ll wear the Stone of Shame.

#20

[quote=“ben_willard”]Ha. Didn’t even make it a day.

Once I got the first page cleaned up there were so many extraneous things buried in inconvenient places that it was obvious I’d need to take a few more passes. So yeah, I’ll be learning and using more regex. I was wrong. My bad. I’ll wear the Stone of Shame.[/quote]

I knew it :laughing:. They should really teach regex in school. I feel in certain fields, it is not a question of if, but when you are going to use it. Anyone who deals with raw text data should learn it. It just makes scraping data so much easier.
